Natural Language Processing for Tamil TTS

Conference Paper · October 2007


DOI: 10.13140/RG.2.1.4233.0726

A. G. Ramakrishnan#, Lakshmish N Kaushik*, Laxmi Narayana. M$
Department of Electrical Engineering, Indian Institute of Science, Bangalore 560012, INDIA


Abstract
This paper describes the development of the Natural Language Processing module for the Tamil TTS Synthesis System. The input to a TTS system is not always pure text; it may contain acronyms, abbreviations and non-standard words, which first need to be converted to the corresponding Tamil graphemic form. A text normalization module is developed to accomplish this task. A G2P converter, which converts the normalized orthographic text input into its phonetic form, is necessary both for the input text and for the text corpus under consideration. The phonetic transcription helps in identifying and analysing basic units such as monophones, diphones and syllables, and also in segmenting the speech corpus. A character to phoneme mapping interface is developed to map the Tamil graphemic text to the corresponding phonetic representation in Roman script. A rule base is created which contains the inter and intra word rules for changing the default character phone mapping wherever necessary. A proper noun lexicon and a foreign word lexicon are also incorporated for dealing with cases where G2P fails. The NLP module is designed to be used for Tamil TTS synthesis on both the Windows platform and the Festival (Linux) environment.

1. Introduction

Over 65 million people worldwide speak Tamil, the official language of the south Indian state of Tamil Nadu, and also of Singapore, Sri Lanka and Mauritius. In addition to the above countries, it is spoken in Bahrain, Malaysia, Qatar, Thailand, the United Arab Emirates and the United Kingdom. Tamil is a syllabic language which contains 12 vowels and 18 consonants. There are five other phones introduced for representing some consonants of Sanskrit. The language has well defined rules, which introduce seven other phones depending on the relative positions of consonants with respect to the vowels or the other consonants. Hence there are 42 phones in the language.

Text to Speech (TTS) synthesis is an automated encoding process, which converts a sequence of symbols (text) conveying linguistic information into an acoustic waveform (G L J Rama, 2002). The two major components of a TTS synthesizer are the Natural Language Processing (NLP) module, which produces a phonetic transcription of the given text, and the Digital Signal Processing (DSP) module, which transforms this phonetic transcription into speech (Thierry Dutoit, 1997). One of the characteristics on which a TTS system is evaluated is its ability to synthesize the input text accurately, i.e., accuracy takes priority over naturalness and expression. The Natural Language Processing module is responsible for determining the phonetic transcription of the incoming text. This involves normalizing the input text and mapping the graphemic representation to a corresponding phonetic representation. Since the orthographic representation and pronunciation do not match in some cases, the default mapping needs to be changed wherever necessary. Section 2 describes the need for text normalization and how non-standard words like acronyms and abbreviations in the input text are dealt with in NLP. Section 3 describes the process of converting a pure word sequence into its phonetic equivalent, the inter and intra word rules which are used for G2P conversion, and the creation of the foreign word lexicon. The results and conclusion are presented in Section 4.

2. Text Normalization

The text input to the TTS system may not be pure Tamil text. It may contain some non-standard words like acronyms, abbreviations, proper names derived from other languages or clutters, phone numbers, decimal numbers, fractions, ordinary numbers, sequences of numbers, money, dates, measures, titles, times and symbols. The Natural Language Processing module of an advanced TTS should be able to handle such non-standard words also. Standard words are those whose pronunciation can be obtained from the G2P rules. A G2P converter maps a word to a sequence of phones. All the non-standard words must be expanded into the corresponding Tamil graphemic form before being sent to the G2P module for phonetic expansion. This module should also decide how a non-standard word is to be pronounced. For example, a phone number should not be read like an ordinary number. Each digit in the phone number must be treated as a single number and must be read in isolation.

The corresponding Tamil graphemic representations of possible non-standard words, English words and Tamil short forms are written in 'iLEAP' format. iLEAP is a software package in which one can type in many Indian languages. The ISCII (Indian Script Code for Information Interchange) file is exported as an ASCII file (text file) and this file is used by the text normalization module. The format of the text normalization file is shown in Figure 2.

The input text file is searched for abbreviations or acronyms (can be in Tamil or English). They are replaced
by the corresponding expansion (graphemic form) in Tamil in the output (ISCII) file. This is illustrated in Figure 1.

Ex: aug is replaced as ஆகஸ்ட் (normalization)
august is also replaced as ஆகஸ்ட் (Tamil transcription)

Further, there are a number of words used regularly in Tamil which are originally Sanskrit or English words. The G2P fails to give the accurate phonetic transcription in the case of such words. Hence, we have created a lexicon of foreign words.

Number expansion is a 'special' case in normalization, because a decision needs to be taken whether the encountered number is an ordinary number, a phone number, a date, a time, a currency amount, etc. If it is currency, the module decides whether it is rupee, dollar, pound or yen. The input number is considered as a string. An ordinary number is expanded according to the length of the string. A module is written to expand a 3 digit number. The control chooses different paths according to the length of the string. If the number is 4 or 5 digits long, then it must be in thousands or ten thousands. In this case, this module will be called twice, first to convert the number of thousands to words (1 or 2 digits) and next to convert the remaining 3 digit number to words. For example, if the number is 12345, in the first call the number 12 will be processed and in the next call 345 is processed. If it is a 6 or 7 digit number, it must be in lakhs or 10 lakhs, and if the number has 8 or 9 digits, then it must be in crores or 10 crores; then this module is called thrice or four times, respectively, and so on. If the length of the number is less than or equal to 3, then the module is called only once.

If the number string is not an ordinary number, a parameter (a number corresponding to the decision taken) is set according to the type of the number string. If the number string is a decimal number (Ex: 23.8756), the number before the dot (.) is treated as one number and the digits after the dot are spoken in isolation. If the number string is a date, the delimiters can be '/' or '-' (Ex: 25-10-1999 or 25/10/1999). All three values (date, month and year) are extracted from the input string and processed separately. Similarly, the different types of number strings like currency, range of numbers, arithmetic, phone numbers and time are identified by the delimiters present and expanded accordingly.

... a.d 1999 ஆம் ஆண்டு aug மாதம் .....
→ … கி.பி. ஆயிரத்து தொள்ளாயிரத்து தொண்ணூற்று ஒன்பது ஆம் ஆண்டு ஆகஸ்ட் மாதம் ......
Figure 1: Example of text normalization using a look-up table

ph.no    தொலைபேசி எண்
jan      ஜனவரி
rs       ரூபாய்கள்
மி.மீ    மில்லி மீட்டர்
Figure 2: Format of the normalization file

3. Grapheme to Phoneme Conversion

In a TTS system, the G2P module converts the normalized orthographic text input into the underlying linguistic and phonetic representation (Kalika Bali, 2004). G2P conversion therefore is the most basic step in a TTS system (Anumanchipalli, 2005). The text normalization module gives a word sequence as input to the G2P module. The grapheme to phoneme conversion of the word sequence can be done using letter to sound rules. The rules are based on a pronunciation dictionary, in which a mapping of the spelling of a word into a sequence of phones can be found. For example, consider the English word "speech". The pronunciation dictionary converts this word to the phone sequence S P IY CH (courtesy: CMU Dictionary). Traditional orthography in some languages, particularly French and English, often does not coincide with pronunciation. However, in other languages such as Spanish and Italian, there is a consistent relationship between orthography and pronunciation (Link).

If there is no pronunciation dictionary, a simple set of rules to convert the graphemic form of a word into the corresponding phonemic form is used. Using such rules is more relevant for Indian languages. In many cases, there is a direct correspondence between what is written and what is spoken. For example, consider the Tamil word "asiriyar": the corresponding phonetic transcription is /A/ /s/ /i/ /r/ /i/ /y/ /a/ /r/, which is very similar to the word.

3.1. Tamil Character to Phone Mapping

The grapheme to phoneme (G2P) conversion module receives the sequence of Tamil words from the text normalization module and converts it to a phonetic transcription represented in Roman script. This is obtained by using a character to phone mapping, which gives the corresponding phonemic representation of Tamil graphemes in Roman script. The Roman character (or sometimes a combination of letters, in the case of diphthongs like /ae/ or /au/ or geminates like /kk/ and /tt/) which represents a Tamil phonemic unit may not 'sound' exactly like the Tamil phoneme and is used only as an unambiguous representation. The .wav (speech) files in the inventory are labeled according to this Roman representation. The DSP module in the TTS engine picks up the corresponding speech units (which are labeled according to the mapping), given by the phonemic representation, for concatenation.

Words are converted one by one. Two kinds of rules, inter-word rules and intra-word rules, are applied to the input text during conversion. If the last character in a word is halanth, and the last but one character is /k/ or /ch/ or /th/ or /p/ and the next word starts with /k/ or /ch/
or /th/ or /p/, respectively, the two words are concatenated. This inter-word rule is very relevant here because, while speaking such words, a speaker does not pronounce the phoneme /k/ or /ch/ or /th/ or /p/ two times. Such junctions of words will combine into a single word in which those two phonemes at the junction manifest as a geminate; for example, /p/ /p/ becomes /pp/. The double letters are labeled with a suffix 'l' to the basic phoneme. This is illustrated in Figure 3(a) and 3(b). It was found that the duration of a double letter occurring in the middle of a word and that of one formed at the junction of two words are comparable.

மிகப் பழமையானது → மிகப்பழமையானது
மிகச் சிறந்த → மிகச்சிறந்த
மட்டக் களப்பு → மட்டக்களப்பு
விழித் திரையில் → விழித்திரையில்
Figure 3: (a) Examples of inter-word rules (b) /p/ /p/ becoming /pp/ (pl)

Tamil, like most Indian languages, uses a syllabic script that is largely phonetic in nature. Thus, in most cases, there is a one-to-one mapping between graphemes and the corresponding phones. The architecture of the Natural Language Processing module is shown in Figure 4. Language specific information is fed into the system in the form of mapping and rules. The default character to phone mapping is defined in the mapping file. The format of the mapping is shown below and explained subsequently.

Format: Character Type Class Phoneme
Example: அ V VOW a ¤ V VOW a

Character: The orthographic representation of the character.
Type: Three types of characters are identified: C (Consonant), V (Vowel) and H (Halanth).
Class: The class to which the character belongs. These class labels can be effectively used to write a rule representing a broad set of characters.
Phoneme: The default phonetic representation of the character.

Type gives a broad classification of the characters, while Class mainly classifies the C type characters (consonants) into different clusters like KA, CA, TA, tA, PA and YA. Some examples of the default mapping are shown in Figure 5.

3.2. Tamil G2P Rule Base

The Rule Base is a set of rules that modify the default mapping of the characters based on the context in which a particular phoneme occurs. Specific contexts are matched using rules. The system triggers the rule that best fits the current context. The rule format is given below.

Format: α1 α2 … αm { β1 β2 … βn }
Example: VOW KA VOW { K:1:X R:2:g K:3:X }

αi: Class label of the i-th character as defined in the character phone mapping. Together these αi represent the context that is being matched.
βj: The j-th action specification node. Each such node has the form Action_Type:Pos:Phoneme_Str
Action_Type: This field specifies the type of action performed at this node. Possible values are K (Keep), R (Replace), I (Insert) and A (Append).
Pos: The index of the character being covered by the context of the rule (1 ≤ Pos ≤ m).
Phoneme_Str: The phoneme string output by this action node.

The example rule given above says that if the grapheme /k/ (/k/ belongs to class KA) appears between two vowels (VOW KA VOW), keep the first character (vowel) as it is (K:1:X), replace the second character (/k/) with /g/ (R:2:g) and keep the third character (vowel) as it is (K:3:X).
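
The rule format of Section 3.2 can be made concrete with a small Python sketch. It is not the authors' C implementation: `parse_rule`, `apply_rule` and the `Action` record are hypothetical names introduced here, class labels are matched by exact equality, and the handling of A (append the phoneme string to the default phoneme at Pos) is inferred from the paper's T T -> Tl example. I (Insert) is left unsupported because the text does not define its behaviour.

```python
from dataclasses import dataclass


@dataclass
class Action:
    kind: str     # K (keep), R (replace), I (insert) or A (append)
    pos: int      # 1-based index into the matched context
    phoneme: str  # phoneme string; "X" is a placeholder used with K


def parse_rule(text):
    """Split 'VOW KA VOW { K:1:X R:2:g K:3:X }' into (context, actions)."""
    context_part, _, action_part = text.partition("{")
    context = context_part.split()
    actions = []
    for node in action_part.replace("}", " ").split():
        kind, pos, phoneme = node.split(":")
        actions.append(Action(kind, int(pos), phoneme))
    return context, actions


def apply_rule(context, actions, window):
    """Apply one rule to a window of (class_label, default_phoneme) pairs.

    Returns the phonemes emitted by the action nodes, or None when the
    class labels of the window do not match the rule context.
    """
    if [label for label, _ in window] != context:
        return None
    output = []
    for action in actions:
        default = window[action.pos - 1][1]
        if action.kind == "K":
            output.append(default)                   # keep the default phoneme
        elif action.kind == "R":
            output.append(action.phoneme)            # replace it
        elif action.kind == "A":
            output.append(default + action.phoneme)  # assumed: suffix the default
        else:
            raise ValueError(f"unsupported action type: {action.kind}")
    return output


context, actions = parse_rule("VOW KA VOW { K:1:X R:2:g K:3:X }")
print(apply_rule(context, actions, [("VOW", "a"), ("KA", "k"), ("VOW", "a")]))
# prints ['a', 'g', 'a']
```

Keeping the parsed rule as plain data means the same matcher can serve every rule in the rule base; only the context and the action list change.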
[Block diagram: Input Tamil text → Text Normalizer (with Look Up Table) → Normalized Tamil text → Grapheme to Phoneme Converter (with Character Phone Mapping, G2P Rule Base and Foreign word Lexicon) → Transcribed text]

Figure 4: Natural Language Processing (NLP) module


அ a    க k
ஒ o    ஆ A
த t    ஓ O
ட T    எ e
Figure 5: Mapping examples

If the same consonant occurs twice consecutively, the geminate is represented by replacing the second grapheme by 'l'.
Ex: TA HAL TA { A:1:l } (T T -> Tl)
The reason for using this kind of representation for double letters is that the speech files in the inventory are labeled accordingly; if 'T T' is kept as it is, the DSP module selects two 'T' segments instead of a single 'Tl' segment. Also, linguistically, TT is a single phone (a geminate), not two.

Only the letters /k/, /ch/, /T/, /th/, /p/ exist in Tamil script. But these five graphemes manifest as /g/, /j/, /D/, /dh/, /b/, respectively, if they occur between two vowels or are preceded by nasals.
Ex: NAS1 HAL KA { K:1:X K:2:X R:3:g }
VOW PA VOW { K:1:X R:2:b K:3:X }
However, there is an exception for /ch/. If it occurs between two vowels, it becomes /s/. Also, /ch/ becomes /s/ when it comes at the beginning of a word. Some more rules are listed in the Appendix. Figure 6 gives a sample input Tamil sentence, the normalized text and the phonetised text.

3.3. Foreign Word Lexicon

The process of corpora phonetization or the development of phonetic lexicons for the western languages is traditionally done by linguists. These lexicons are subject to constant refinement and modification. But the phonetic nature of Indian scripts reduces the effort to building mere mapping tables and rules for the phonetic representation. These rules and the mapping tables together comprise the phonetizers or the grapheme to phoneme converters (Anumanchipalli, 2005).

The G2P rule base cannot be generalized to handle all the words in the input text, especially proper nouns derived from other languages like Sanskrit and other clutters. For example, the word 'Buddha' is written in Tamil as 'புத்தா' (putla); the grapheme /p/ at the beginning of the word should be pronounced as /b/ and the /tl/ should be pronounced as /dl/. But there are no such rules in the rule base, since it is basically a Sanskrit word used as it is in Tamil. If a rule is introduced for this purpose, it will affect other words. For example, the Tamil word 'புத்தகம்' (putlakam) is pronounced as 'putlagam' only. The initial /p/ does not change to /b/ and the geminate /tl/ does not change to /dl/.

So, an exception lexicon is created, which contains the phonetic transcription of such words and proper nouns. The lexicon dictates the phonetic composition, or the pronunciation, of each entry in the list.

The G2P first looks for each input word in the lexicon file. If the word is present in the lexicon file, its phonetic transcription is taken from the lexicon itself. Otherwise, the G2P applies the mapping and the rules to produce the phonetic transcription.

The phonetic transcription generated by the G2P converter can be used for segmentation of the speech corpus. The phonetic transcription can be aligned with the speech waveform and the phone boundaries can be adjusted manually or by automatic speech segmentation algorithms.

4. Results and Conclusion

The Natural Language Processing module for Tamil Text to Speech Synthesis has been developed. Rules have been designed for grapheme to phoneme conversion, which cover most of the contexts in which the default mapping needs to be changed. A foreign word lexicon is created to handle cases where the general G2P rules do not give the exact phonetic transcription. The module has been used in a Tamil TTS developed around Festival. The C code developed for the NLP module is designed to be used for Tamil text to speech synthesis on both Windows and Linux platforms.

(a)
திரிவேணியின் பிறந்த தேதி september 1, 1928. பல pages கொண்ட 21 நாவல்கள், 41 சிறுகதைகள் அடங்கிய 3 தொகுப்புகள், கன்னட இலக்கியத்துக்கு அவருடைய பங்களிப்புகள்.
(b)
திரிவேணியின் பிறந்த தேதி ஸெப்டெம்பர் ஒன்று / ஆயிரத்து தொள்ளாயிரத்து இருபத்து எட்டு பல பக்கங்கள் கொண்ட இருபத்து ஒன்று நாவல்கள், நாற்பத்து ஒன்று சிறுகதைகள் அடங்கிய மூன்று தொகுப்புகள், கன்னட இலக்கியத்துக்கு அவருடைய பங்களிப்புகள்.
(c)
> t i r i w E N i y i n # p i R a n d a # t E d i # s e p T e m b a r # o n R u # # A y i r a tl u # t o Ll A y i r a tl u # i r u b a tl u # e Tl u # # p a l a # p a kl a ng g a L # k o N D a # i r u b a tl u # o n R u # n A w a l g a L $ # n A R p a tl u # o n R u # s i R u g a d ae g a L # a D a ng g i y a # m U n R u # t o g u pl u g a L $ # k a nl a D a # i l a kl i y a tl u kl u # a w a r u D ae y a # p a ng g a L i pl u g a L <
Figure 6: (a) Input Tamil text in ISCII format (b) Text after normalization (c) Phonetic transcription (G2P converter output)

5. References

Anumanchipalli Gopalakrishna, Rahul Chitturi, Sachin Joshi, Rohit Kumar, Satinder Singh, R.N.V Sitaram and S.P. Kishore, 2005. Development of Indian Language Speech Databases for Large Vocabulary Speech Recognition Systems, Proceedings of International Conference on Speech and Computer (SPECOM), Patras, Greece, Oct.
Kalika Bali, Partha Pratim Talukdar, N. Sridhar Krishna and A.G. Ramakrishnan, 2004. Tools for the Development of a Hindi Speech Synthesis System, 5th ISCA Speech Synthesis Workshop, Pittsburgh, pp. 109-114.
Link (http://en.wikipedia.org/wiki/Phonetic_transcription)
Thierry Dutoit, 1997. High-quality Text-To-Speech synthesis: an overview, Journal of Electrical and Electronics Engineering, Australia: Special Issue on Speech Recognition and Synthesis, vol. 17 no 1, pp. 25-37.
G. L. Jayavardhana Rama, A. G. Ramakrishnan, R. Muralishankar and P. Prathibha, 2002. A Complete Text-to-Speech Synthesis System in Tamil, Proc. IEEE 2002 Workshop on Speech Synthesis, Sep. 11-13, Santa Monica, CA, USA.

6. Appendix

6.1. Character to Phone Mapping


Ç V IG X è C KA k î C tA t û C NL1 S
Ü V VOW a è C KA k ï C NAS4 n û C NL1 S
Ý V VOW A é C NAS1 ng ù C NAS4 n ú C NL2 s
Þ V VOW i ê C CA s ð C PA p ý C NL3 h
ß V VOW I ê C CA s ð C PA p £ V VOW A
À V VOW u ü C CA s ð C PA b ¤ V VOW i
Á V VOW U ü C CA s ð C PA b ¦ V VOW I
 V VOW e ë C NAS2 ny ñ C NAS5 m § V VOW u
à V VOW E ì C TA T ò C YA y ¨ V VOW U
Ä V VOW ae ì C TA T ó C NL4 r ª V VOW e
Å V VOW o ì C TA T ø C RA R « V VOW E
Æ V VOW O ì C TA T ô C La l ¬ V VOW ae
Å÷ V VOW au í C NAS3 N ÷ C La L ª£ V VOW o
È C KA k î C tA t ö C NL zh «£ V VOW O
È C KA k î C tA t õ C WA w ª÷ V VOW au
î C tA t
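
A mapping file in the Character / Type / Class / Phoneme format could be loaded as in the following Python sketch. The function names `load_mapping` and `default_phones` are hypothetical, and Unicode Tamil letters stand in here for the ISCII/iLEAP character codes that the original system reads; only a handful of entries are shown.

```python
# A few entries in the "Character Type Class Phoneme" format.
# Assumption: Unicode letters replace the original ISCII/iLEAP codes.
MAPPING_TEXT = """\
அ V VOW a
ஆ V VOW A
க C KA k
ட C TA T
த C tA t
ப C PA p
"""


def load_mapping(text):
    """Return {character: (type, class_label, default_phoneme)}."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        char, ctype, label, phoneme = fields
        table[char] = (ctype, label, phoneme)
    return table


def default_phones(chars, table):
    """Map a sequence of characters to (class_label, default_phoneme) pairs."""
    return [(table[c][1], table[c][2]) for c in chars]


table = load_mapping(MAPPING_TEXT)
print(default_phones(["ஆ", "க"], table))
# prints [('VOW', 'A'), ('KA', 'k')]
```

The (class label, default phoneme) pairs are the natural input for context rules, which match on the class labels and rewrite the phonemes.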

6.2. G2P Rule base


IG PA { R:2:F } YA HAL KA { K:1:X K:2:X R:3:g } La HAL tA { K:1:X K:2:X R:3:d }
ta VOW ka { K:1:X K:2:X R:3:h } NL4 HAL KA { K:1:X K:2:X R:3:h } YA HAL PA { K:1:X K:2:X R:3:b }
KA HAL KA { A:1:l } la HAL KA { K:1:X K:2:X R:3:g } NL4 HAL PA { K:1:X K:2:X R:3:b }
CA HAL CA { R:1:cl } WA HAL KA { K:1:X K:2:X R:3:g } la HAL PA { K:1:X K:2:X R:3:b }
TA HAL TA { A:1:l } NL HAL KA { K:1:X K:2:X R:3:g } WA HAL PA { K:1:X K:2:X R:3:b }
ta HAL ta { A:1:l } La HAL KA { K:1:X K:2:X R:3:g } NL HAL PA { K:1:X K:2:X R:3:b }
PA HAL PA { A:1:l } YA HAL CA { K:1:X K:2:X R:3:j } La HAL PA { K:1:X K:2:X R:3:b }
YA HAL YA { A:1:l } NL4 HAL CA { K:1:X K:2:X R:3:j } VOW KA VOW { K:1:X R:2:g K:3:X }
la HAL la { A:1:l } la HAL CA { K:1:X K:2:X R:3:j } VOW TA VOW { K:1:X R:2:D K:3:X }
WA HAL WA { A:1:l } WA HAL CA { K:1:X K:2:X R:3:j } VOW ta VOW { K:1:X R:2:d K:3:X }
La HAL La { A:1:l } NL HAL CA { K:1:X K:2:X R:3:j } VOW PA VOW { K:1:X R:2:b K:3:X }
NL HAL NL { A:1:l } La HAL CA { K:1:X K:2:X R:3:j } NAS KA { K:1:X R:2:g }
NL1 HAL NL1 { A:1:l } YA HAL TA { K:1:X K:2:X R:3:D } NAS ta { K:1:X R:2:d }
NL2 HAL NL2 { A:1:l } NL4 HAL TA { K:1:X K:2:X R:3:D } NAS TA { K:1:X R:2:D }
NL3 HAL NL3 { A:1:l } la HAL TA { K:1:X K:2:X R:3:D } NAS PA { K:1:X R:2:b }
NL4 HAL NL4 { A:1:l } WA HAL TA { K:1:X K:2:X R:3:D } VOW CA VOW { K:1:X R:2:s K:3:X }
NA HAL NA { A:1:l } NL HAL TA { K:1:X K:2:X R:3:D } NAS CA { K:1:X R:2:j }
NAS HAL NAS { A:1:l } La HAL TA { K:1:X K:2:X R:3:D } RA HAL RA { R:1:TR }
NAS HAL KA { K:1:X K:2:X R:3:g } YA HAL tA { K:1:X K:2:X R:3:d } ta ka { K:1:X R:2:h }
NAS HAL CA { K:1:X K:2:X R:3:j } NL4 HAL tA { K:1:X K:2:X R:3:d } HSH { }
NAS HAL TA { K:1:X K:2:X R:3:D }       la HAL tA { K:1:X K:2:X R:3:d }        HAL HAL { }
NAS HAL tA { K:1:X K:2:X R:3:d } WA HAL tA { K:1:X K:2:X R:3:d }
NAS HAL PA { K:1:X K:2:X R:3:b } NL HAL tA { K:1:X K:2:X R:3:d }
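
How a few of these rules interact over a whole word can be sketched in Python as below. This is an illustration under stated assumptions, not the authors' engine: the rules are written directly as Python data (`RULES` and `phonetize` are hypothetical names), every rule is tried at every position rather than choosing a single "best fit", a context position with no action node is assumed to emit nothing (as in T T -> Tl), and the input sequence is built by hand with explicit vowels. The test word is தொகுப்புகள், whose transcription in Figure 6(c) is t o g u pl u g a L.

```python
# Hypothetical in-memory form of four rules from the rule base:
# (context classes, {position: (action type, phoneme string)})
RULES = [
    (("VOW", "KA", "VOW"), {1: ("K", "X"), 2: ("R", "g"), 3: ("K", "X")}),
    (("VOW", "PA", "VOW"), {1: ("K", "X"), 2: ("R", "b"), 3: ("K", "X")}),
    (("KA", "HAL", "KA"), {1: ("A", "l")}),
    (("PA", "HAL", "PA"), {1: ("A", "l")}),
]


def phonetize(units):
    """units: list of (class_label, default_phoneme); returns phoneme list."""
    classes = [label for label, _ in units]
    out = [phoneme for _, phoneme in units]  # start from the default mapping
    for i in range(len(units)):
        for context, actions in RULES:
            n = len(context)
            if tuple(classes[i:i + n]) != context:
                continue
            for offset in range(n):
                kind, text = actions.get(offset + 1, (None, ""))
                default = units[i + offset][1]
                if kind == "K":
                    continue                          # keep what is there
                elif kind == "R":
                    out[i + offset] = text            # replace
                elif kind == "A":
                    out[i + offset] = default + text  # suffix, e.g. p -> pl
                else:
                    out[i + offset] = ""              # assumed: no node, no output
    return [p for p in out if p]


# Hand-built (class, default phoneme) sequence for the test word.
word = [("tA", "t"), ("VOW", "o"), ("KA", "k"), ("VOW", "u"),
        ("PA", "p"), ("HAL", ""), ("PA", "p"), ("VOW", "u"),
        ("KA", "k"), ("VOW", "a"), ("La", "L"), ("HAL", "")]
print(" ".join(phonetize(word)))
# prints t o g u pl u g a L
```

Applying rules in place, without consuming the matched window, lets a vowel serve as the right context of one rule and the left context of the next; a production rule engine would also need a policy for conflicting rules, which the paper describes only as triggering the rule that best fits the context.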
