Paper TTS+Conversion
Our daily lives are increasingly shaped by the use of digital signal processing in speech processing. Many applications, such as automation, audio recording, and audio-based help systems, can benefit from text-to-speech (TTS) conversion. TTS can be built for many different languages, including those that are not widely spoken. A TTS system generates a spoken equivalent of the text it receives as input. Although speech generation itself is complex, introducing naturalness into the speaker's expression is the major challenge in TTS. This paper proposes an efficient, highly accurate TTS conversion system for the Tamil language. A deep learning technique called Deep Quality Speech Recognition (DQSR) is developed in this research study for Tamil-language TTS. A dedicated approach is needed because methods used for other languages such as English do not carry over to Tamil, whose pronunciation depends entirely on language-specific constructs. When compared to the traditional system, the proposed solution improves the framework's precision by 5%.
Keywords: Audio Recording; Deep Quality Speech Recognition; Digital Signal Processing; Tamil Language; Text-to-Speech Conversion.
Introduction
Voice conversion (VC) technology transforms a source speaker's voice into a desired target speaker's voice while maintaining the linguistic content [1]. VC and TTS are two independent systems, but they share a common goal: creating speech with a target voice in a variety of situations. Whereas a TTS system [2] is a speech synthesis system driven by linguistic instructions derived from written text, a VC system derives those instructions from a reference voice. This similarity in purpose but difference in operating context makes VC and TTS complement each other, each playing its own role in a spoken dialogue system. TTS can generate a vast amount of speech automatically and affordably, since text input is straightforward to produce and modify; however, there are cases in which the intended linguistic instruction cannot be expressed in writing. Voice as a reference input is more time-consuming and expensive to produce, but the scheme can easily be extended to a language that has never been seen before.
Depending on the nature of the data available, VC systems are usually classified as parallel or non-parallel systems. In a parallel system, the corpus consists of two speakers uttering the same sentences. Dynamic time warping (DTW) is used to align the parallel speech utterances of the source and target speakers to form the training set, from which a mapping function converting the acoustic characteristics of one speaker to another can be learned [3]. A variety of approaches have been presented for parallel VC systems [4, 5]. One advantage of parallel VC is that only the source and target speakers' data is used. However, if we want to create a virtual assistant with a certain speaker's voice, parallel speech acquisition requires far more planning and preparation than non-parallel speech. For this reason, numerous ways have been proposed to construct non-parallel corpus systems for VC [6, 7]. In the usual Gaussian mixture model (GMM) approach, a non-parallel VC can be adapted from a parallel VC model using the maximum a posteriori (MAP) method [8] or obtained by interpolating between multiple parallel models [9].
The rest of the paper is organized as follows. Section 2 reviews existing frameworks with their advantages and limitations, Section 3 explains our proposed method, Section 4 describes the experimental setup with subjective results, and Section 5 concludes our findings.
2. Literature Review
The two most common approaches to building TTS systems in the literature are rule-driven and data-driven. In the rule-based method, linguistic rules are used to generate the pronunciation automatically. Rule-based approaches that produce good results are discussed in [20] for Tamil, in [21] for English, and in [22] for Urdu. Decision trees [23] and Bayesian networks [24] were the two initial approaches explored for grapheme-to-phoneme (G2P) conversion. However, the resulting pronunciation was inferior to that of a phoneme-based decision tree because syllable boundaries were not correctly detected [25]. Researchers next looked at Pronunciation by Analogy [21] and Pronunciation by Latent Analogy [26], which are applicable only to uncommon context terms.
An article on Thirukkural for the Tamil language was presented by G. J. Jayavardhana Rama et al. [31]. No prosodic features are used; instead, the authors tried to reduce the amount of stored speech while increasing its quality, and a pitch adjustment method is utilized to improve naturalness. R. Muralishankar and A. G. Ramakrishnan [32] designed synthesized speech using prosodic elements, applying linear prediction analysis to several emotions to improve the speech quality.
K. G. Aparna et al. [33] created a machine that reads Tamil books aloud. To perform concatenation synthesis, scanned gray-scale images are converted into binary images, and the text is then split into various clusters. A complete TTS system in Tamil was presented by G. J. Jayavardhana Rama et al. [34]. The system is divided into two parts: rules are applied to the text and the corresponding units are concatenated to form the output, while prosody and pitch marking are implemented in the offline phase. Smoothing is achieved using waveform interpolation.
A web-enabled TTS implemented in Java was proposed by Prathibha and Ramakrishnan [35], with the syllable as the basic unit for concatenation. Sreekanth Majji and Ramakrishnan [36] offered a Festival-based TTS in which sentences can be written in ILEAP; in this strategy, units are selected from clusters of units and the selection is implemented with the Viterbi algorithm. J. Sangeetha et al. [37] proposed a TTS for the Tamil language in which the speech is synthesized in two ways: word-level synthesis offers a high degree of naturalness, and when an input word is not included in the speech corpus, syllable-level concatenation is used to synthesize it.
3. Proposed Methodology
In this section, the dataset description and the explanation of DQSR are presented.
To evaluate the proposed system, a new dataset was created. The recordings were made with a high-quality microphone using Audacity in a closed room, and the data set consists of two kinds of data: words and sentences. A backup of the recorded speech files is maintained. The sampling rate used for recording was 16 kHz. The dataset consists of 138 words and 60 sentences, and for evaluation purposes all sentences and words were manually labelled. Some sample words and sentences from the dataset are shown in Table 1.
Table 1. Sample words and sentences from the dataset
Words            Sentences
அம்மா            எனக்கு உதவி செய்வீர்களா
பூனை             என் பெயர் மலர்
யானை             என்ன செய்தி
ஆனை              காலை வணக்கம்
ஊசி              சீக்கிரம் கொண்டு வா
உதவி             நல்லா இருக்கிறேன் நீங்க
வணக்கம்           நீங்க என்ன வாசிக்கிறீங்க
நன்றி             நீங்க ரொம்ப நல்லவர்
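As a simple illustration of how such a corpus might be prepared for training, the sketch below loads each recording and resamples it to 16 kHz. The file layout, transcript format, and function name are assumptions made for illustration; they are not part of the original work.

import os
import librosa  # audio loading and resampling

def load_corpus(wav_dir, transcript_file, target_sr=16000):
    """Load (audio, text) pairs; resample audio to 16 kHz."""
    # transcript_file: one "<wav_id>\t<tamil text>" entry per line (assumed layout)
    with open(transcript_file, encoding="utf-8") as f:
        entries = [line.strip().split("\t") for line in f if line.strip()]
    corpus = []
    for wav_id, text in entries:
        path = os.path.join(wav_dir, wav_id + ".wav")
        audio, sr = librosa.load(path, sr=target_sr)  # resamples if needed
        corpus.append((audio, text))
    return corpus

# Example usage (hypothetical paths): corpus = load_corpus("recordings/", "labels.tsv")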
The DQSR architecture is described in this section. Rather than using the encoder-attention-decoder architecture employed by the majority of sequence-to-sequence autoregressive and non-autoregressive generation algorithms, we design a novel feed-forward structure. Figure 1 shows the overall model architecture of DQSR. The following subsections describe its components in further detail.
Self-attention, as in the Transformer, and 1D convolution form the basis of the DQSR architecture. As seen in Figure 1, we call this construction a Feed-Forward Transformer (FFT). For the phoneme-to-mel-spectrogram transformation, the model stacks multiple FFT blocks, with N blocks on the phoneme side and N blocks on the mel-spectrogram side. The self-attention network uses multi-head attention to extract cross-positional information. Instead of the 2-layer dense network in the Transformer, we utilize a 2-layer 1D convolutional network with ReLU activation; this choice is motivated by the fact that adjacent hidden states are more closely related in character/phoneme and mel-spectrogram sequences. The effectiveness of the 1D convolutional network is evaluated in the experimental part. Each of these sub-layers is followed by residual connections, layer normalization, and dropout.
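To make the FFT block concrete, the following PyTorch sketch shows one possible realization of the structure described above (multi-head self-attention and a 2-layer 1D convolution, each followed by dropout, a residual connection, and layer normalization). The class and argument names are illustrative assumptions, not the authors' code.

import torch
import torch.nn as nn

class FFTBlock(nn.Module):
    """One Feed-Forward Transformer block: self-attention + 2-layer 1D conv."""
    def __init__(self, d_model=384, n_heads=2, d_inner=1536, kernel_size=3, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        # 2-layer 1D convolutional network with ReLU, replacing the dense FFN
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, d_inner, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Conv1d(d_inner, d_model, kernel_size, padding=kernel_size // 2),
        )
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x)       # multi-head self-attention
        x = self.norm1(x + self.dropout(attn_out))
        conv_out = self.conv(x.transpose(1, 2)).transpose(1, 2)  # conv over the time axis
        x = self.norm2(x + self.dropout(conv_out))
        return x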
In addition to controlling the voice speed and part of the prosody, the length regulator (Figure 1) is employed to solve the length mismatch between the phoneme and mel-spectrogram sequences in the FFT. The phoneme duration is the length of the mel-spectrogram segment that corresponds to a phoneme (how the phoneme duration is predicted is described in the next subsection). The length regulator expands the hidden state of each phoneme according to its duration d, so that the total length of the hidden states equals the length of the mel-spectrogram. Let H_pho = [h1, h2, ..., hn] denote the hidden states of the phoneme sequence, where n is the length of the sequence, and let D = [d1, d2, ..., dn] be the phoneme duration sequence, where Σ_{i=1}^{n} d_i = m and m is the length of the mel-spectrogram sequence.
information). The length regulator LR is referred to as LR.
when 𝐻𝑚𝑒𝑙 , is an expanded sequence and is a hyperparameter that controls the voice speed.
Using Equation 1, the expanded sequence H mel becomes if 𝐷 = [2, 2, 3, 1], and H
[ℎ1 , ℎ1 , ℎ1 , ℎ2 , ℎ2 , ℎ2 , ℎ3 , ℎ3 , ℎ3 , ℎ4 ] if 𝛼 = 1. The duration sequences D=1.3 = [2.6,2.6,3.9,1.3] [3,3,4,1]
and 𝐷α=0.5 = [1, 1, 1.5, 0.5] ≈ [1, 1, 2, 1], correspondingly, and the expanded sequences become
571
Nat. Volatiles & Essent. Oils, 2021; 8(5): 568 – 580
[ℎ1 , ℎ2 , ℎ3 , ℎ4 ]We can also alter the prosody of the synthesized speech by adjusting the length of the
space characters in the phrase.
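A minimal sketch of the length regulator under these definitions, assuming the scaled durations are rounded to integers (the function name and tensor layout are illustrative):

import torch

def length_regulator(h_pho: torch.Tensor, durations: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Expand phoneme hidden states to mel-spectrogram length.

    h_pho:     (n, d_model) hidden states of the phoneme sequence
    durations: (n,) integer phoneme durations
    alpha:     voice-speed control (>1 slower, <1 faster)
    """
    scaled = torch.round(durations.float() * alpha).long()   # D_alpha, rounded to integers
    # Repeat each phoneme's hidden state d_i times along the time axis
    return torch.repeat_interleave(h_pho, scaled, dim=0)

# Example matching the text: D = [2, 2, 3, 1], alpha = 1 -> 8 expanded states
h = torch.randn(4, 384)
d = torch.tensor([2, 2, 3, 1])
print(length_regulator(h, d).shape)  # torch.Size([8, 384])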
Length regulation relies on the ability to predict the duration of each phoneme. The duration predictor consists of a 2-layer 1D convolutional network with ReLU activation, each layer followed by layer normalization and dropout, plus an extra linear layer that outputs a scalar, the predicted phoneme duration, as illustrated in Figure 2. This module is stacked on top of the FFT blocks on the phoneme side and is trained jointly with the DQSR model to predict the mel-spectrogram length of each phoneme with the mean square error (MSE) loss; predicting the duration in the logarithmic domain makes it more Gaussian and easier to train. Note that the learned duration predictor is only needed during TTS inference, because the phoneme durations extracted from an autoregressive teacher model can be used directly during TTS training (see the following discussion).
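A possible realization of the duration predictor described above is sketched below (layer names and the dropout rate are assumptions):

import torch
import torch.nn as nn

class DurationPredictor(nn.Module):
    """2-layer 1D conv with ReLU, layer norm, and dropout, ending in one scalar per phoneme."""
    def __init__(self, d_model=384, d_hidden=384, kernel_size=3, dropout=0.1):
        super().__init__()
        self.conv1 = nn.Conv1d(d_model, d_hidden, kernel_size, padding=kernel_size // 2)
        self.norm1 = nn.LayerNorm(d_hidden)
        self.conv2 = nn.Conv1d(d_hidden, d_hidden, kernel_size, padding=kernel_size // 2)
        self.norm2 = nn.LayerNorm(d_hidden)
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(d_hidden, 1)    # predicted (log) duration per phoneme

    def forward(self, x):                        # x: (batch, n_phonemes, d_model)
        y = torch.relu(self.conv1(x.transpose(1, 2))).transpose(1, 2)
        y = self.dropout(self.norm1(y))
        y = torch.relu(self.conv2(y.transpose(1, 2))).transpose(1, 2)
        y = self.dropout(self.norm2(y))
        return self.linear(y).squeeze(-1)        # (batch, n_phonemes)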
To extract the alignment, a focus rate is computed for each attention head of the autoregressive teacher model:
F = (1/S) Σ_{s=1}^{S} max_{1≤t≤T} a_{s,t}    (2)
where S and T are the spectrogram and phoneme lengths and a_{s,t} is the element in the s-th row and t-th column of the attention matrix. The focus rate of every head is computed, and the head with the highest F is chosen as the attention alignment.
Finally, the phoneme duration sequence D = [d1, d2, ..., dn] is obtained with the duration extractor d_i = Σ_{s=1}^{S} [argmax_t a_{s,t} = i]. That is, the duration of a phoneme is the number of mel-spectrogram frames attended to it by the attention head selected in the previous step.
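The focus-rate selection (Equation 2) and the duration extractor can be sketched as follows, assuming the teacher's decoder-to-encoder attention matrices are available as a tensor (a simplified illustration, not the authors' implementation):

import torch

def extract_durations(attn_heads: torch.Tensor) -> torch.Tensor:
    """attn_heads: (n_heads, S, T) decoder-to-encoder attention matrices.

    Returns integer durations (T,) from the head with the highest focus rate.
    """
    # Focus rate F = (1/S) * sum_s max_t a_{s,t}, computed per head (Equation 2)
    focus = attn_heads.max(dim=2).values.mean(dim=1)   # (n_heads,)
    best = attn_heads[focus.argmax()]                  # (S, T) most diagonal head
    # d_i = number of spectrogram frames whose argmax attention falls on phoneme i
    assigned = best.argmax(dim=1)                      # (S,) phoneme index per frame
    T = best.shape[1]
    return torch.bincount(assigned, minlength=T)       # (T,) phoneme durations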
DQSR model
Six FFT blocks are used on each of the phoneme and mel-spectrogram sides of our DQSR model, twelve in total. Including punctuation, there are 51 phonemes in the vocabulary. The phoneme embedding size, the self-attention hidden size, and the 1D convolution size are all set to 384, and the number of attention heads is set to 2. The kernel sizes of both 1D convolutions in the 2-layer convolutional network are set to 3, with input/output sizes of 384/1536 for the first layer and 1536/384 for the second layer. A linear output layer converts the 384-dimensional hidden states into the 80-dimensional mel-spectrogram. In the duration predictor, the 1D convolution kernel size is set to 3, with input and output sizes of 384/384.
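For reference, these hyperparameters can be gathered into a small configuration object; the field names below are illustrative, while the values follow the description above:

from dataclasses import dataclass

@dataclass
class DQSRConfig:
    n_fft_blocks: int = 6          # per side (phoneme and mel-spectrogram), 12 in total
    vocab_size: int = 51           # phonemes, including punctuation
    d_model: int = 384             # embedding / self-attention hidden size
    n_heads: int = 2               # attention heads per FFT block
    conv_kernel: int = 3           # kernel size of the 1D convolutions
    conv_inner: int = 1536         # inner size of the 2-layer conv (384 -> 1536 -> 384)
    n_mels: int = 80               # mel-spectrogram dimensionality of the output layer
    duration_hidden: int = 384     # duration predictor conv input/output size

config = DQSRConfig()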
Autoregressive Transformer
The autoregressive Transformer TTS model serves two purposes in our research: phoneme durations are extracted from it to train the duration predictor, and it generates the mel-spectrograms used for sequence-level knowledge distillation. This model has six layers in both the encoder and the decoder, with a 1D convolution network in place of the position-wise FFN, and has a number of parameters similar to our DQSR model.
4. Experimental Setup and Results
We set up an experimental environment to carry out this research and assess the outcomes of the study. The workstation uses an Intel i7 4th-generation processor clocked at 2.3 GHz with 4 MB of cache, 16 GB of DDR3 RAM, and a 1 TB SATA hard drive rotating at 7K RPM. Microsoft Windows 10 Pro is used as the base OS for this project, and MATLAB version 2018a is used for the implementation.
The quality of speech can be assessed using two kinds of measures: subjective quality measures and objective quality measures. Subjective evaluation involves listeners comparing the original speech signal and the processed speech signal. Objective speech quality measurement involves a mathematical comparison of the original and processed speech signals, in which the numerical distance between the two signals is calculated. The parameters considered for the assessment of speech quality are the Mean Square Error (MSE), the Mahalanobis distance (MND), the Similarity Index, Accuracy, and the Perceptual Evaluation of Speech Quality (PESQ) measure. MSE is defined as the average of the squared errors, i.e., the differences between the observed and predicted values. For a good-quality speech signal, the MSE value should be less than one. The MSE is calculated by the equation:
MSE = (1/r) Σ_{i=1}^{r} [X_i − X̂_i]^2    (3)
where r is the number of predictions generated from the sample, X_i is the observed value of the variable being predicted, and X̂_i is the enhanced speech signal. The Mahalanobis distance calculates how far apart two samples are, based on their mean vectors μ_A and μ_B and the covariance matrix Σ of all samples in the database. The Mahalanobis distance is given by the equation:
MD(μ_A, μ_B) = (μ_A − μ_B)^T Σ^{−1} (μ_A − μ_B)    (4)
The similarity index is calculated as the number of matching words divided by the total number of words and gap characters; the quotient is multiplied by 100 to give the similarity as a percentage. Accuracy is calculated from the differences between the synthesized speech and the pronounced speech. The Perceptual Evaluation of Speech Quality (PESQ) measure is computed by first level-equalizing the original and degraded signals; the signals are then filtered and processed to produce loudness representations, and the difference between the original and degraded signals is used to produce the quality rating. For speech signals, the PESQ value ranges from 1 to 5, with higher values indicating better quality. The PESQ is calculated by the equation:
PESQ = a_0 + a_1·D_i + a_2·A_i    (5)
where D_i represents the average disturbance value and A_i represents the average asymmetrical disturbance value.
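As an illustration, the MSE of Equation 3 and the Mahalanobis distance of Equation 4 can be computed as in the NumPy sketch below; the PESQ coefficients a_0, a_1, and a_2 are model-specific constants and are therefore not shown:

import numpy as np

def mse(x: np.ndarray, x_hat: np.ndarray) -> float:
    """Equation 3: mean squared error between the original and enhanced signals."""
    return float(np.mean((x - x_hat) ** 2))

def mahalanobis(mu_a: np.ndarray, mu_b: np.ndarray, cov: np.ndarray) -> float:
    """Equation 4: Mahalanobis distance between two mean feature vectors."""
    diff = mu_a - mu_b
    return float(diff.T @ np.linalg.inv(cov) @ diff)

# Example with toy signals: a small MSE (< 1) indicates the signals are close
x = np.array([0.2, 0.4, 0.1])
x_hat = np.array([0.25, 0.35, 0.15])
print(mse(x, x_hat))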
Tables 2 and 3 show the overall performance of DQSR for words and sentences in terms of the various metrics. The analysis was repeated more than three times to test the efficiency of the proposed DQSR.
From the analysis in Tables 2 and 3, it can be seen that the proposed DQSR achieves better accuracy for some words and sentences than for others. For instance, the word "elephant" has the highest accuracy (75.68%), whereas the word "needle" has lower accuracy (43.98%) than the other words. Likewise, the sentence "What are you reading?" achieves the best accuracy (66.53%) and the sentence "Good Morning" the lowest accuracy (44.98%) among the input sentences. The error rate of the word "Amma" is much lower than that of other words such as elephant, cat, and needle. The MND is highest for "elephant", the MND of "cat" and "needle" is nearly 5.70, and the MND of "Amma" is only 4.27. The PESQ for all the input words is nearly 1.10. The input sentences achieve low MSE values of roughly 0.06 to 0.09, except for the last input sentence. The MND of "What's news?" is only 2.96, whereas the other input sentences achieve MND values of roughly 3.8 to 5.0. Likewise, the PESQ of all sentences lies between roughly 1.25 and 1.90. Therefore, the proposed DQSR achieves consistent performance even when the analysis is carried out multiple times. The next section compares the performance of the proposed DQSR with existing techniques such as the Hidden Markov Model (HMM) and GMM.
Tables 4 and 5 show the comparison of the proposed DQSR with the existing HMM and GMM models in terms of MSE and accuracy for single words and single sentences.
From Table 4, it can be seen that the proposed DQSR achieves better performance for every single input word. For instance, the proposed DQSR achieves 66.45% accuracy and an MSE of 0.13 for the word "cat", whereas the existing techniques achieve roughly 63% accuracy and an MSE of roughly 0.15 to 0.19 for the same word. The proposed DQSR achieves its highest accuracy (75.68%) and lowest MSE for the word "elephant". The existing HMM model achieves its lowest accuracy (39.13%) and highest MSE for the word "needle". In addition, experiments on single sentences were carried out to test the efficiency of the proposed DQSR. The results show that the proposed DQSR achieves the highest accuracy (66.53%) for the sentence "What are you reading?", whereas the existing techniques achieve roughly 55% to 61% accuracy for the same sentence. Likewise, for the sentence "Good Morning", the proposed DQSR achieves its lowest accuracy (44.98%), which is still higher than the roughly 40% achieved by the existing techniques. Therefore, our proposed DQSR achieves better performance than the HMM and GMM models.
The implemented work is compared with existing research [38] for various parameters, as given in Table 6.
The existing research in [38] was implemented for Punjabi text; however, in order to test the efficiency of the proposed DQSR, we applied that existing approach to our Tamil input sentences and words. The results, presented in Table 6, show that the accuracy of our proposed DQSR is 95.21%, whereas the existing techniques achieve 94% and 87% accuracy. In addition, when the input sentences are converted by the existing techniques and by the proposed DQSR, the proposed technique converts them correctly with a score of 98.73. The reason for this is that DQSR uses multiple stacked layers to convert the input text into speech signals. Therefore, our proposed DQSR achieves better performance than the existing techniques.
5. Conclusion
Existing rule-based and machine learning-based G2P (phoneme-centric) techniques for Tamil are still inadequate to handle all of the problems involved. Systems that create speech from text input are known as TTS systems. Although speech generation itself is complex, introducing naturalness into the speaker's expression is the major challenge in TTS. In this work, a good degree of intelligibility and naturalness of the synthesized speech has been achieved. The paper presents a TTS conversion system developed for the Tamil language; TTS conversion utilizing DQSR is implemented and validated, and the results show that the method is highly accurate. A comparison of standard machine learning approaches with the proposed system shows that the suggested system generates speech waveforms with high accuracy and without deterioration or mispronunciation. The proposed algorithm's accuracy is 5 percent higher than that of the conventional technique. In future work, the learning rate of the deep learning approach will be optimized using efficient meta-heuristic algorithms.
References
[1] Tomoki Toda, Alan W Black, and Keiichi Tokuda, “Voice conversion based on maximum-likelihood
estimation of spectral parameter trajectory,” IEEE Trans. Audio, Speech, Language Process., vol. 15,
no. 8, pp. 2222–2235, 2007.
[3] Masanobu Abe, Satoshi Nakamura, Kiyohiro Shikano, and Hisao Kuwabara, “Voice conversion
through vector quantization,” J. Acoust. Soc. Jpn. (E), vol. 11, no. 2, pp. 71–76, 1990.
[4] Ryoichi Takashima, Tetsuya Takiguchi, and Yasuo Ariki, “Exemplar-based voice conversion using
sparse representation in noisy environments,” IEICE Transactions on Fundamentals of Electronics,
Communications and Computer Sciences, vol. 96, no. 10, pp. 1946–1953, 2013.
[5] Ling-Hui Chen, Zhen-Hua Ling, Li-Juan Liu, and LiRong Dai, “Voice conversion using deep neural
networks with layer-wise generative training,” IEEE/ACM Transactions on Audio, Speech and Language
Processing (TASLP), vol. 22, no. 12, pp. 1859–1872, 2014.
[6] Daniel Erro, Asuncion Moreno, and Antonio Bonafonte, “INCA algorithm for training voice
conversion systems from nonparallel corpora,” IEEE Trans. Audio, Speech, Language Process., vol. 18,
no. 5, pp. 944–953, 2009.
[7] Yining Chen, Min Chu, Eric Chang, Jia Liu, and Runsheng Liu, “Voice conversion with smoothed gmm
and map adaptation,” in Proc. EUROSPEECH, 2003, pp. 2413–2416.
[8] Chung-Han Lee and Chung-Hsien Wu, “Map-based adaptation for speech conversion using
adaptation data selection and non-parallel training,” in Proc. INTERSPEECH, 2006, pp. 2254–2257.
[9] Tomoki Toda, Yamato Ohtani, and Kiyohiro Shikano, “Eigenvoice conversion based on gaussian
mixture model,” in Proc. INTERSPEECH, 2006, pp. 2446–2449.
[10] Lifa Sun, Kun Li, Hao Wang, Shiyin Kang, and Helen Meng, “Phonetic posteriorgrams for many-to-
one voice conversion without parallel data training,” in 2016 IEEE International Conference on
Multimedia and Expo (ICME). IEEE, 2016, pp. 1–6.
[11] Hiroyuki Miyoshi, Yuki Saito, Shinnosuke Takamichi, and Hiroshi Saruwatari, “Voice conversion
using sequence-to-sequence learning of context posterior probabilities,” Proc. INTERSPEECH, pp.
1268–1272, 2017.
[12] Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and Hsin-Min Wang, “Voice conversion
from non-parallel corpora using variational auto-encoder,” in Proc. APSIPA. IEEE, 2016, pp. 1–6.
[13] Takuhiro Kaneko and Hirokazu Kameoka, “Parallel-data-free voice conversion using cycle-consistent adversarial networks,” arXiv preprint arXiv:1711.11293, 2017.
[14] Fuming Fang, Junichi Yamagishi, Isao Echizen, and Jaime Lorenzo-Trueba, “High-quality
nonparallel voice conversion based on cycle-consistent adversarial network,” in 2018 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5279–
5283.
[15] Andros Tjandra, Berrak Sisman, Mingyang Zhang, Sakriani Sakti, Haizhou Li, and Satoshi Nakamura, “VQVAE unsupervised unit discovery and multi-scale code2spec inverter for ZeroSpeech Challenge 2019,” arXiv preprint arXiv:1905.11449, 2019.
[16] Cheng-chieh Yeh, Po-chun Hsu, Ju-chieh Chou, Hung-yi Lee, and Lin-shan Lee, “Rhythm-flexible
voice conversion without parallel data using cycle-gan over phoneme posteriorgram sequences,” in
Proc. SLT, 2018, pp. 274–281.
[17] Kou Tanaka, Hirokazu Kameoka, Takuhiro Kaneko, and Nobukatsu Hojo, “Atts2s-vc: Sequence-to-
sequence voice conversion with attention and context preservation mechanisms,” in Proc. ICASSP,
2019, pp. 6805–6809.
[18] Jing-Xuan Zhang, Zhen-Hua Ling, and Li-Rong Dai, “Non-parallel sequence-to-sequence voice
conversion with disentangled linguistic and speaker representations,” arXiv preprint
arXiv:1906.10508, 2019.
[19] Hieu-Thi Luong and Junichi Yamagishi, “A unified speaker adaptation method for speech synthesis
using transcribed and untranscribed speech with backpropagation,” arXiv preprint arXiv:1906.07414,
2019.
[20] S. Yuvaraja, V. Keri, S. C. Pammi, K. Prahallad, and A. W. Black, “Building a Tamil voice using HMM
segmented labels,” in National Conference on Communication, International Institute of Information
Technology, Hyderabad, India, Language Technologies Institute, Carnegie Mellon University, USA,
2010.
[22]. S. Hussain, “Letter-to-sound conversion for Urdu Text-to-Speech system,” in Proceedings of the
Workshop on Computational Approaches to Arabic Script-Based Languages, Association for
Computational Linguistics, Geneva, Switzerland, 2004, pp. 74–9.
[23]. A. K. Kienappel and R. Kneser, “Designing very compact decision trees for grapheme-to-phoneme
transcription,” in Proceedings of the Interspeech, Aalborg, Denmark, 2001, pp. 1911–4.
[24]. C. Ma, M. A. Randolph, and J. Drish, “A support vector machines-based rejection technique for
speech recognition,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and
Signal Processing, Proceedings (ICASSP’01), IEEE, 2001, Vol. 1, pp. 381–4.
[25]. L. Jiang, H.-W. Hon, and X. Huang, “Improvements on a trainable letter-to-sound converter,” in
Proceedings of the Fifth European Conference on Speech Communication and Technology, Rhodes,
Greece
[27]. M. Bisani and H. Ney, “Joint-sequence models for grapheme-to-phoneme conversion,” 2008a,
Speech Commun., Vol. 50, no. 5, pp. 434–51, 2008.
[29]. K. Rao, F. Peng, H. Sak, and F. Beaufays, “Grapheme-to-phoneme conversion using long short-
term memory recurrent neural networks,” in Proceedings of the IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), IEEE, Brisbane, QLD, Australia, 2015, pp. 4225–9.
[30]. H. Sak, A. Senior, and F. Beaufays, “Long short-term memory based recurrent neural network
architectures for large vocabulary speech recognition,” in CoRR, Vol. abs/ 1402.1128. Available:
http://arxiv.org/abs/1402.1128, eprint. 1402.1128, 2014.
[31] Rama, G.J., Ramakrishnan, A.G., Venkatesh, M.V. and Muralishankar, R., 2001. Thirukkural: a text-
to-speech synthesis system. Proc. Tamil Internet, pp.92-97.
[32] Muralishankar, R., and A. G. Ramakrishnan. "Human Touch to Tamil Speech Synthesizer."
Tamilnet2001, Kuala Lumpur, Malaysia (2001): 103-109.
[33] Aparna, K.G., Rama, G.J. and Ramakrishnan, A.G., 2001. Machine Reading of Tamil Books.
Proceedings of ICBME.
[34] Rama, G.J., Ramakrishnan, A.G., Muralishankar, R. and Prathibha, R., 2002, September. A
complete text-to-speech synthesis system in Tamil. In Proceedings of 2002 IEEE Workshop on Speech
Synthesis, 2002. (pp. 191-194). IEEE.
[35] Prathibha, P. and Ramakrishnan, A.G., 2002. Web-enabled Speech Synthesizer for Tamil. In
Proceedings of Tamil Internet 2002 Conference, Cennai: Asian Printers (pp. 134-140).
[36] Sreekanth Majji and A. G. Ramakrishnan, “Festival based maiden TTS system for Tamil language,” 3rd Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics, October 5–7, 2007, Poznań, Poland.
[37] Sangeetha, J., Jothilakshmi, S., Sindhuja, S. and Ramalingam, V., 2013. Text to speech synthesis
system for tamil. Int J Emerging Tech Adv En, 3, pp.170-5.
[38] Rashid, M. and Singh, H., 2019. Text to speech conversion in Punjabi language using nourish
forwarding algorithm. International Journal of Information Technology, pp.1-10.