
Nat. Volatiles & Essent. Oils, 2021; 8(5): 568 – 580

An efficient Tamil Text to Speech Conversion Technique based on Deep Quality Speech Recognition

Femina Jalin. A¹, Jayakumari. J²

¹Department of Electronics and Communication Engineering, Noorul Islam University, Tamil Nadu, India.
²Mar Baselios College of Engineering and Technology, Kerala, India.
E-mail: [email protected]
Abstract

Our daily lives are impacted by the use of digital signal processing in speech processing. Many applications, such as
automation, audio recording, and audio-based help systems, can benefit from text-to-speech (TTS) conversion. TTS systems
can be built for many different languages, including those that are not widely spoken, and generate spoken equivalents from
text input. Although speech generation itself is rather complex, introducing naturalness into the speaker's expression is a
major challenge in TTS. This paper proposes an efficient TTS conversion method with high accuracy for the Tamil language.
A deep learning technique called Deep Quality Speech Recognition (DQSR) is developed in this research study for
Tamil-language TTS, because methods used for other languages such as English do not work for Tamil, whose pronunciation
adapts in ways that depend entirely on the language's constructs. When compared to a traditional system, the proposed
solution improves the framework's accuracy by 5%.

Keywords: Audio Recording; Deep Quality Speech Recognition; Digital Signal Processing; Tamil Language; Text to speech
conversion.

1. Introduction

While preserving the linguistic content, voice conversion (VC) technology transforms a source
speaker's voice into that of a desired target speaker [1]. VC and TTS are two independent systems, but they share a
common goal: creating speech with a target voice in a variety of situations. A TTS system, by
definition [2], is a speech synthesis system driven by linguistic instructions derived from written text,
whereas a VC system takes a reference speech utterance as its input. These similarities in purpose but
differences in operating context make VC and TTS complement each other, each with its own role in a
spoken dialogue system. TTS can generate a vast amount of speech automatically and affordably,
since text input is straightforward to create and modify, although a linguistic instruction cannot always
be written in a way that conveys the intended expression. Speech as a reference input is more time-consuming
and expensive to produce, but the scheme can easily be extended to a language that has never been seen
before.

Depending on the nature of the data available for development, VC systems are usually classified as
parallel or non-parallel systems. A parallel corpus consists of two
speakers uttering the same sentences. Dynamic time warping (DTW) is used to
align the parallel speech utterances of the source and target to form the training set, from which a
mapping function can be learned to convert the acoustic characteristics of one speaker into those of
another [3] (a small DTW sketch is given after this paragraph). A variety of approaches have been presented for parallel
VC systems [4, 5]. One advantage of parallel VC is that only the source and target speakers' data are used. However, if we want
to create a virtual assistant with a certain speaker's voice, parallel speech acquisition requires more planning
and preparation than non-parallel speech. For this reason, numerous methods have been
proposed to construct non-parallel VC systems [6, 7].


A non-parallel VC system can be adapted from a parallel one using the maximum a posteriori (MAP) method [8] or
by interpolating between multiple parallel models [9] in the conventional Gaussian mixture model (GMM)
approach.
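For readers unfamiliar with this alignment step, the following is a brief, textbook-style numpy sketch of dynamic time warping applied to two feature sequences of the same sentence. It is purely illustrative; the function name, the feature shapes, and the Euclidean frame distance are our own assumptions and it is not part of the proposed system.

```python
import numpy as np

def dtw_align(src: np.ndarray, tgt: np.ndarray):
    """Align two feature sequences (frames x dims) with dynamic time warping
    and return the frame-index pairs of the optimal path (textbook sketch)."""
    S, T = len(src), len(tgt)
    dist = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=-1)  # (S, T) frame distances
    cost = np.full((S + 1, T + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, S + 1):
        for j in range(1, T + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(cost[i - 1, j],      # skip a source frame
                                                  cost[i, j - 1],      # skip a target frame
                                                  cost[i - 1, j - 1])  # match both frames
    # backtrack from (S, T) to recover the aligned frame pairs
    path, i, j = [], S, T
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

The aligned source/target frame pairs returned by such a procedure form the training set from which the mapping function is learned.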

Non-parallel VC can also be trained using generative adversarial networks (GANs) or variational
autoencoders (VAEs), or by using an intermediate linguistic representation extracted from an automatic
speech recognition (ASR) model [10, 11]. VC systems, whether parallel or non-parallel, are able
to modify the voice but not the duration of the speech [12-15]. As speaking rate is a speaker attribute,
several recent studies have focused on converting the speaking rate together with the voice by utilizing
sequence-to-sequence models [16, 17, 18, 19]. A speaker-adaptive TTS model was recently
developed by Luong et al. [19], employing a backpropagation method to accomplish speaker adaptation
using untranscribed speech. Based on our findings, we believe that the proposed technology can
also be used to construct non-parallel VC. Here we present a framework for developing a
voice conversion system from untranscribed speech using DQSR. Our approach is also
tested for its ability to adapt and convert using speech utterances from an unknown language.

The rest of the paper is organized as follows. Section 2 describes existing frameworks with their
advantages and limitations, Section 3 explains our proposed method, Section 4 describes
the experimental setup with subjective results, and Section 5 concludes our findings.

2. Literature Review

The two most common approaches to building TTS systems in the literature are rule-driven and data-
driven. In the rule-based method, linguistic rules are used to generate automatic
pronunciation. Rule-based approaches that produce good results are discussed in [20] for Tamil, [21] for
English, and [22] for Urdu. Decision trees [23] and Bayesian networks [24] were the two initial
approaches explored for grapheme-to-phoneme (G2P) conversion. However, the resulting pronunciation was inferior to that of a phoneme-
based decision tree because syllable boundaries were not correctly detected [25]. Researchers
next looked at Pronunciation by Analogy [21] and Pronunciation by Latent Analogy [26], which are
only applicable to uncommon context terms.

A joint grapheme-phoneme n-gram method [27], often known as a joint-sequence model, is
one of the various approaches that have been examined. This joint-sequence model requires an enormous
data set as well as supplementary models [28], such as an Expectation-Maximization-based
alignment model and a translation model. The Long Short-Term Memory (LSTM) neural network model avoids
the need to develop such additional models while providing comparable results [29]. The use of neural
networks and recurrent neural networks [30] to solve TTS difficulties has also been advocated, and they do
generate good results. However, the LSTM and other neural networks have a drawback: the time it takes to
process the data. Although many models have been examined, a comprehensive TTS for Tamil is still needed.
This section discusses some of the work that has been done entirely for Tamil TTS, as well as some material
that has been adapted for Tamil.

A Thirukkural text-to-speech system for the Tamil language was presented by G. L. Jayavardhana Rama et al. [31].
It contains no prosodic features, and the authors tried to reduce the amount of stored speech while
increasing its quality. To improve naturalness, a pitch adjustment method is utilized. Synthesized
speech was designed by R. Muralishankar and A. G. Ramakrishnan [32] using prosodic elements;
several emotions are subjected to linear prediction analysis to improve the speech quality.


K. G. Aparna et al. [33] created a machine that reads Tamil books aloud. To
perform concatenation synthesis, grayscale images are scanned into binary images, after which
the text is split into various clusters. A complete TTS system in Tamil was presented by G. L. Jayavardhana
Rama et al. [34]. The system is divided into two parts: the rules are applied to the text and the
corresponding units are concatenated to form the output, while prosody and pitch marking are implemented as
part of the offline phase. Smoothing is achieved using waveform interpolation.

A web-enabled TTS using Java was proposed by Prathibha and Ramakrishnan [35], with the
syllable as the basic unit for concatenation. Sreekanth Majji and
Ramakrishnan [36] offered a Festival-based TTS in which sentences can be written in ILEAP; in this
strategy, the units are selected from clusters of units using an implementation of the Viterbi algorithm.
J. Sangeetha et al. [37] proposed a TTS for the Tamil language in which speech is synthesized in two ways:
word-level synthesis provides a high degree of naturalness, and when an input is not included in the
speech corpus, syllable-level concatenation is used to synthesize it.

3. Proposed Methodology

In this section, the dataset description and the explanation of DQSR are presented.

3.1. Dataset Description

To evaluate the results of the proposed system, a new dataset was created. The dataset was
recorded with the help of a microphone and consists of two kinds of data: words and sentences.
Audacity was used to record the words and sentences in a closed room with
a high-quality microphone, and a backup of the recorded speech files is kept. The sampling rate used for
recording was 16 kHz. The dataset consists of 138 words and 60 sentences; for
evaluation purposes, all sentences and words are manually labelled. Some sample words
and sentences from the dataset are shown in Table 1.

Table 1: Recorded sample words and sentences.

Words Sentences
அம் மா எனக்கு உதவி செய் வீர்களா
பூனன என் ெபயர் மலர்
யானன என்ன செய் தி
ஆனன காலல வணக்கம்
ஊசி சீக்கிரம் ெகாண்டு வா
உதவி நல் லா இருக்கிறேன் நீ ங் க
வணக்கம் நீ ங் க என்ன வாசிக்கிறீங் க
நன் றி நீ ங் க ெராம் ப நல் லவர்

3.2. Deep Quality Speech Recognition

The DQSR architecture is described in this section. Rather than using the encoder-attention-
decoder architecture employed by the majority of sequence-to-sequence autoregressive and non-
autoregressive generation algorithms, we design a unique feed-forward structure. Figure 1 shows
the overall model architecture of DQSR. In the following subsections, we go into further detail on the
components.

3.2.1 Feed-Forward Transformer

Self-attention from the Transformer and 1D convolution form the basis of the DQSR architecture. As
seen in Figure 1, we name this construction a Feed-Forward Transformer (FFT). For phoneme-to-mel-
spectrogram conversion, the FFT stacks multiple FFT blocks, with N blocks on the phoneme side and N
blocks on the mel-spectrogram side. To capture cross-positional information, the self-attention network
uses multi-head attention. Instead of the Transformer's 2-layer dense network, we utilize a 2-layer 1D
convolutional network with ReLU activation, motivated by the fact that adjacent hidden states are more
closely related in character/phoneme and mel-spectrogram sequences. The effectiveness of the 1D
convolutional network is evaluated in the experimental section. Both the self-attention network and the
1D convolutional network are followed by residual connections, layer normalization, and dropout.
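As a rough illustration of the FFT block just described, the following PyTorch-style sketch combines multi-head self-attention with a 2-layer 1D convolutional network, each followed by a residual connection, layer normalization, and dropout. It is a minimal sketch rather than the authors' implementation: positional encodings and masking are omitted, the class and argument names are our own, and the default sizes are the illustrative values listed in Section 3.3.

```python
import torch
import torch.nn as nn

class FFTBlock(nn.Module):
    """One feed-forward Transformer (FFT) block: multi-head self-attention plus a
    2-layer 1D convolutional network, each with residual connection, layer
    normalization and dropout (a sketch, not the authors' code)."""

    def __init__(self, d_model=384, n_heads=2, d_filter=1536, kernel=3, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        # 2-layer 1D convolution with ReLU replaces the usual dense feed-forward network
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, d_filter, kernel, padding=kernel // 2),
            nn.ReLU(),
            nn.Conv1d(d_filter, d_model, kernel, padding=kernel // 2),
        )
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        a, _ = self.attn(x, x, x)              # self-attention over all positions
        x = self.norm1(x + self.drop(a))       # residual + layer norm
        c = self.conv(x.transpose(1, 2)).transpose(1, 2)  # Conv1d expects (B, C, T)
        return self.norm2(x + self.drop(c))
```

Stacking N such blocks on the phoneme side and N on the mel-spectrogram side, with the length regulator in between, gives the overall structure shown in Figure 1.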

Figure 1: Overall Structure of DQSR

3.2.2 Length Regulator

As well as controlling the voice speed and part of the prosody, the length regulator (Figure 1)
is employed to solve the length mismatch between the phoneme and mel-spectrogram sequences
in the FFT. The phoneme duration is the number of mel-spectrogram frames that correspond to a
phoneme (how the phoneme duration is predicted is described in the next subsection). The length
regulator expands the hidden states of the phoneme sequence according to the phoneme durations, so
that the total length of the expanded hidden states equals the length of the mel-spectrogram. Let
𝐻𝑝ℎ𝑜 = [ℎ1, ℎ2, . . . , ℎ𝑛] denote the hidden states of the phoneme sequence, where n is the length of the
sequence, and let 𝐷 = [𝑑1, 𝑑2, . . . , 𝑑𝑛] be the phoneme duration sequence, where 𝑑1 + 𝑑2 + ⋯ + 𝑑𝑛 = 𝑚
and m is the length of the mel-spectrogram sequence. The length regulator LR is defined as

𝐻𝑚𝑒𝑙 = 𝐿𝑅(𝐻𝑝ℎ𝑜, 𝐷, 𝛼) (1)

where 𝐻𝑚𝑒𝑙 is the expanded sequence and 𝛼 is a hyperparameter that controls the voice speed. For
example, given 𝐻𝑝ℎ𝑜 = [ℎ1, ℎ2, ℎ3, ℎ4] and 𝐷 = [2, 2, 3, 1], the expanded sequence 𝐻𝑚𝑒𝑙 obtained from
Equation (1) is [ℎ1, ℎ1, ℎ2, ℎ2, ℎ3, ℎ3, ℎ3, ℎ4] when 𝛼 = 1 (normal speed). For 𝛼 = 1.3 (slower) and
𝛼 = 0.5 (faster), the duration sequences become 𝐷𝛼=1.3 = [2.6, 2.6, 3.9, 1.3] ≈ [3, 3, 4, 1] and
𝐷𝛼=0.5 = [1, 1, 1.5, 0.5] ≈ [1, 1, 2, 1], and the hidden states are expanded correspondingly. We can
also alter the prosody of the synthesized speech by adjusting the durations of the
space characters in the phrase.
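To make the expansion in Equation (1) concrete, here is a short numpy sketch (an illustration under the definitions above, not the authors' code) that repeats each phoneme's hidden state according to its duration and applies the speed factor α with round-half-up, matching the worked example in the text.

```python
import numpy as np

def length_regulator(h_pho: np.ndarray, durations, alpha: float = 1.0) -> np.ndarray:
    """Expand phoneme hidden states (n, d_model) to mel length according to the
    phoneme durations; alpha < 1 speeds speech up, alpha > 1 slows it down."""
    scaled = [int(d * alpha + 0.5) for d in durations]  # round half up, as in the examples above
    # repeat each phoneme's hidden state d_i times along the time axis
    return np.repeat(h_pho, scaled, axis=0)

# Example matching the text: D = [2, 2, 3, 1] and alpha = 1 gives
# [h1, h1, h2, h2, h3, h3, h3, h4] (length m = 8).
h = np.arange(4).reshape(4, 1)          # stand-in hidden states h1..h4
print(length_regulator(h, [2, 2, 3, 1], alpha=1.0).ravel())
```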

3.2.3 Duration Predictor

Figure 2: Duration Predictor

Length regulation relies heavily on the ability to predict the duration of each phoneme. The
duration predictor comprises a 2-layer 1D convolutional network with ReLU activation, each layer
followed by layer normalization and dropout, and an extra linear layer that outputs the predicted
phoneme duration, as illustrated in Figure 2. The duration predictor is stacked on top of the FFT blocks
on the phoneme side and is trained jointly with the DQSR model to predict, with a mean square error
(MSE) loss, the length of the mel-spectrogram for each phoneme. The durations are predicted in the
logarithmic domain, which makes them more Gaussian and easier to train. Note that the learned
duration predictor is only needed during TTS inference, since the phoneme durations extracted from an
autoregressive teacher model can be used directly during TTS training (see the following discussion).

The ground-truth phoneme duration is extracted from an autoregressive teacher TTS
model, as shown in Figure 2, in order to train the duration predictor. The detailed steps are as follows:

❖ An autoregressive encoder-attention-decoder Transformer TTS model is trained first.

❖ The attention alignments of each training sequence pair are extracted from the trained teacher
model. Because of the multi-head self-attention, there are several attention alignments, and not
all attention heads exhibit the diagonal property. How close an attention head is to the diagonal
is measured by a focus rate F:

𝐹 = (1/𝑆) ∑𝑆𝑠=1 max1≤𝑡≤𝑇 𝑎𝑠,𝑡 (2)

where S and T are the lengths of the spectrogram and the phoneme sequence, and 𝑎𝑠,𝑡 denotes the
element in the s-th row and t-th column of the attention matrix. The focus rate of every head is computed,
and the head with the highest focus rate F is chosen as the attention alignment.

Finally, the phoneme duration sequence 𝐷 = [𝑑1, 𝑑2, . . . , 𝑑𝑛] is obtained with the duration
extractor 𝑑𝑖 = ∑𝑆𝑠=1 [arg max𝑡 𝑎𝑠,𝑡 = 𝑖]. That is, a phoneme's duration is the number of
mel-spectrogram frames attended to it by the attention head selected in the previous step.
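The two steps above (choosing the most diagonal attention head by focus rate and reading durations off it) can be sketched in a few lines of numpy; the attention tensor layout (heads × spectrogram frames × phonemes) and the function names are assumptions for illustration only.

```python
import numpy as np

def focus_rate(attn: np.ndarray) -> float:
    """Equation (2): F = (1/S) * sum_s max_t a[s, t] for one S x T attention matrix."""
    return attn.max(axis=1).mean()

def extract_durations(attn_heads: np.ndarray) -> np.ndarray:
    """attn_heads: (n_heads, S, T) teacher attention matrices.
    Returns integer durations d_i = number of frames whose argmax points at phoneme i."""
    # pick the head whose alignment is closest to diagonal (highest focus rate)
    best = max(attn_heads, key=focus_rate)
    frame_to_phoneme = best.argmax(axis=1)             # length S, values in [0, T)
    T = best.shape[1]
    return np.bincount(frame_to_phoneme, minlength=T)  # d_i, with sum(d) == S
```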


3.3. Model Configuration

DQSR model

Six FFT blocks are used on both the phoneme and mel-spectrogram sides of our DQSR model,
twelve in total. Including punctuation, there are 51 phonemes in the vocabulary. The phoneme
embedding size, the self-attention hidden size, and the 1D convolution output size are all set to 384,
and the number of attention heads is set to 2. The kernel sizes of the two 1D convolutions in the
2-layer convolutional network are set to 3, with input/output sizes of 384/1536 for the first layer and
1536/384 for the second layer. A linear output layer converts the 384-dimensional hidden states into
the 80-dimensional mel-spectrogram. In the duration predictor, the 1D convolution kernel sizes are
set to 3, with input and output sizes of 384/384.
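For convenience, the hyperparameters listed above can be gathered into a single configuration object; the sketch below simply restates the values from this subsection, and all field names are our own.

```python
from dataclasses import dataclass

@dataclass
class DQSRConfig:
    """Hyperparameters listed in Section 3.3 (field names are illustrative)."""
    n_fft_blocks_phoneme: int = 6
    n_fft_blocks_mel: int = 6
    vocab_size: int = 51          # phonemes, including punctuation
    d_model: int = 384            # embedding / self-attention hidden size
    n_heads: int = 2
    conv_kernel: int = 3
    conv_filter: int = 1536       # 384 -> 1536 -> 384 in the 2-layer conv network
    n_mels: int = 80              # linear output layer maps 384 -> 80
    dur_conv_kernel: int = 3      # duration predictor: 384 -> 384, kernel 3
```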

Autoregressive Transformer

The autoregressive Transformer TTS model serves two purposes in our research: the phoneme
durations extracted from it are used to train the duration predictor, and the mel-spectrograms it
generates are used for sequence-level knowledge distillation. The model has six encoder layers and six
decoder layers, and a 1D convolutional network is used instead of the position-wise feed-forward
network. It has a similar number of parameters to our DQSR model, though it is less complex.

4. Results and Discussion

We set up an experimental environment to carry out this research and assess the outcomes
of the study. The workstation has an Intel i7 4th-generation processor clocked at 2.3 GHz with 4 MB of
cache, 16 GB of DDR3 RAM, and a 1 TB SATA hard drive rotating at 7K RPM. Microsoft Windows 10 Pro
is used as the base operating system, and the implementation uses MATLAB version 2018a.

4.1. Evaluation metrics

The quality of speech can be assessed using two kinds of measures: subjective
quality measures and objective quality measures. Subjective evaluation involves listeners comparing
the original speech signal with the processed speech signal. Objective speech quality assessment
involves the mathematical comparison of the original and processed speech signals: the numerical
distance between the original and processed signals is calculated. The parameters considered for the
assessment of speech quality are the Mean Square Error (MSE), the Mahalanobis distance (MND), the
Similarity Index (SI), accuracy, and the Perceptual Evaluation of Speech Quality (PESQ) measure. The
MSE is defined as the average of the squared differences between the observed value and the
predicted value. For a good-quality speech signal, the MSE should be less than one. The MSE
is calculated by the equation:

𝑀𝑆𝐸 = (1/𝑟) ∑𝑟𝑖=1 [𝑋𝑖 − 𝑋̂𝑖]² (3)

where r is the number of predictions generated from the sample, 𝑋𝑖 is the observed value of the variable
being predicted, and 𝑋̂𝑖 is the corresponding value of the enhanced speech signal. The Mahalanobis
distance measures how far apart two samples are, based on their mean vectors 𝜇𝐴 and 𝜇𝐵 and the
covariance matrix Σ of all samples in the database. The Mahalanobis distance is given by the equation

𝑀𝐷(𝜇𝐴, 𝜇𝐵) = (𝜇𝐴 − 𝜇𝐵)ᵀ Σ⁻¹ (𝜇𝐴 − 𝜇𝐵) (4)

The similarity index is calculated as the number of matching words divided by the total number of words
and gap characters; the quotient is then multiplied by 100 to give the similarity as a percentage. Accuracy is
calculated from the differences between the synthesized speech and the pronounced speech. The
Perceptual Evaluation of Speech Quality (PESQ) measure is computed by first equalizing the original
and the degraded signals; the signals are then filtered and processed to produce loudness representations,
and the difference between the original and the degraded signals is used to produce the quality rating.
For a good-quality speech signal, the PESQ value ranges from 1 to 5. The PESQ is calculated by the
equation

𝑃𝐸𝑆𝑄 = 𝑎0 + 𝑎1𝐷𝑖 + 𝑎2𝐴𝑖 (5)

where 𝐷𝑖 represents the average disturbance value and 𝐴𝑖 the average asymmetrical
disturbance value.
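Three of these measures follow directly from the definitions above and can be sketched in a few lines of numpy (PESQ is normally computed with a dedicated implementation of the ITU-T P.862 algorithm and is not reproduced here); the function and variable names are our own.

```python
import numpy as np

def mse(x: np.ndarray, x_hat: np.ndarray) -> float:
    """Equation (3): mean of squared differences between observed and predicted signals."""
    return float(np.mean((x - x_hat) ** 2))

def mahalanobis_sq(mu_a: np.ndarray, mu_b: np.ndarray, cov: np.ndarray) -> float:
    """Equation (4): (mu_A - mu_B)^T Sigma^{-1} (mu_A - mu_B)."""
    diff = mu_a - mu_b
    return float(diff @ np.linalg.inv(cov) @ diff)

def similarity_index(matches: int, total_words: int, gaps: int) -> float:
    """Matching words divided by all words plus gap characters, as a percentage."""
    return 100.0 * matches / (total_words + gaps)
```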

4.2. Performance Analysis of DQSR for Words and Sentences

Tables 2 and 3 show the overall performance of DQSR for words and sentences in
terms of various metrics. The analysis was repeated more than three times to test the efficiency
of the proposed DQSR.

Table 2: Overall Performance of DQSR for Words

Word MSE MND SI PESQ Accuracy


அம் மா 0.038412 4.2769 21434574.5 1.0659 70.526
யானன 0.145327 7.0327 16638768.5 1.0987 75.689
பூனன 0.138834 5.6534 21854987.9 1.0846 66.452
ஆனன 0.12421 4.6776 275001574 1.0467 48.017
ஊசி 0.089234 5.3894 32093841 1.096 43.982

Table 3: Overall Performance of DQSR for Sentences

SENTENCES | MSE | MND | PESQ | SI | Accuracy
எனக்கு உதவி செய் வீர்களா | 0.067054 | 4.9873 | 1.0705 | 4844876.319 | 50.3294
என் ெபயர் மலர் | 0.06529 | 4.9877 | 1.2541 | 3357986.881 | 56.8291
என்ன செய் தி | 0.06134 | 2.9694 | 1.8116 | 4414118.937 | 53.2901
காலல வணக்கம் | 0.076432 | 3.8432 | 1.1088 | 4438267.284 | 44.9821
சீக்கிரம் ெகாண்டு வா | 0.07197 | 5.7878 | 1.9675 | 379246.963 | 55.29712
நல் லா இருக்கிறேன் நீ ங் க | 0.074567 | 5.7978 | 1.8972 | 3917522.825 | 63.8267
நீ ங் க என்ன வாசிக்கிறீங் க | 0.08346 | 4.7653 | 1.0974 | 4387986.175 | 66.5321
நீ ங் க ெராம் ப நல் லவர் | 0.09087 | 4.5987 | 1.2098 | 3643639.269 | 58.4731
ெராம் ப நன் றி | 0.8635 | 4.6093 | 1.04087 | 4387089.175 | 65.9047

From the analysis in Tables 2 and 3, it is clear that the proposed DQSR achieved better
accuracy for some words and sentences. For instance, the word "elephant" has the highest accuracy
(75.68%), whereas the word "needle" has lower accuracy (43.98%) than the other words. Likewise, the
sentence "What are you reading?" achieved better accuracy (66.53%) and the sentence "Good
Morning" achieved lower accuracy (44.98%) than the other input sentences. The error rate of the word "Amma"
is very low compared with the other words such as elephant, cat, and needle. The MND is highest for
"elephant", the MND of "cat" and "needle" is nearly 5.70, and the MND of "Amma" is only 4.27. The PESQ
for all the input words is nearly 1.10. The input sentences achieved low MSE values of roughly 0.06 to
0.09, except the last input sentence. The MND of "What's the news?" is only 2.96, whereas the other
input sentences have MND values of roughly 3.8 to 5.0. Likewise, the PESQ of all sentences is nearly
1.25 to 1.90. Therefore, the proposed DQSR achieved good performance, even when the analysis was
carried out many times. The next section compares the performance of our proposed
DQSR with existing techniques such as the Hidden Markov Model (HMM) and GMM.

4.3. Comparative analysis of Proposed DQSR with other techniques

Tables 4 and 5 show the comparison of the proposed DQSR with the existing HMM and GMM models
in terms of MSE and accuracy for single words and single sentences.

Table 4: Performance of DQSR with existing techniques for Single Word

Word | HMM Accuracy | HMM MSE | GMM Accuracy | GMM MSE | DQSR Accuracy | DQSR MSE
அம் மா | 65.255 | 0.239315 | 57.171 | 0.239174 | 70.526 | 0.238412
யானன | 69.2962 | 0.22146 | 57.5472 | 0.241817 | 75.689 | 0.135327
பூனன | 62.7034 | 0.191033 | 63.6949 | 0.140054 | 66.452 | 0.158834
ஆனன | 46.951 | 0.6041 | 46.3837 | 0.639694 | 58.017 | 0.19421
ஊசி | 39.1327 | 0.79278 | 51.1743 | 0.529215 | 53.982 | 0.289234

Table 5: Performance of DQSR with existing techniques for Single Sentence

SENTENCES | HMM MSE | HMM Accuracy | GMM MSE | GMM Accuracy | DQSR MSE | DQSR Accuracy
எனக்கு உதவி செய் வீர்களா | 0.079542 | 48.9407 | 0.129969 | 47.4095 | 0.067054 | 50.3294
என் ெபயர் மலர் | 0.17549 | 52.6695 | 0.22827 | 46.769 | 0.06529 | 56.8291
என்ன செய் தி | 0.168783 | 49.1924 | 0.130133 | 53.0766 | 0.16134 | 53.2901
காலல வணக்கம் | 0.81253 | 40.0768 | 0.81259 | 40.3113 | 0.76432 | 44.9821
சீக்கிரம் ெகாண்டு வா | 0.277968 | 52.6962 | 0.22789 | 54.0986 | 0.17197 | 55.29712
நல் லா இருக்கிறேன் நீ ங் க | 0.077968 | 51.4609 | 0.034406 | 58.0892 | 0.074567 | 63.8267
நீ ங் க என்ன வாசிக்கிறீங் க | 0.389637 | 61.5792 | 0.22789 | 54.0986 | 0.08346 | 66.5321
நீ ங் க ெராம் ப நல் லவர் | 0.192307 | 51.7018 | 0.335622 | 45.6671 | 0.09087 | 58.4731
ெராம் ப நன் றி | 0.189637 | 61.5792 | 0.234406 | 58.0892 | 0.08635 | 65.9047

Table 4 clearly shows that the proposed DQSR achieved better performance for
every single input word. For instance, the proposed DQSR achieved 66.45% accuracy and an MSE of 0.13
for the word "cat", whereas the existing techniques achieved nearly 63% accuracy and MSE values of
nearly 0.15 to 0.19 for the same word. The proposed DQSR achieved the highest accuracy (75.68%) and
the lowest MSE for the word "elephant". The existing HMM model achieved the lowest accuracy (39.13%)
and the highest MSE for the word "needle". In addition, experiments on single sentences were carried out
to test the efficiency of the proposed DQSR. The results show that the proposed DQSR achieved the
highest accuracy (66.53%) for the sentence "What are you reading?", whereas the existing techniques
achieved nearly 55% to 61% accuracy for the same sentence. Likewise, for the sentence "Good
Morning", the proposed DQSR achieved its lowest accuracy (44.98%), whereas the existing techniques
achieved only about 40% accuracy for the same sentence. Therefore, our proposed DQSR achieved better
performance than the HMM and GMM models.

4.4. Comparative Analysis of Proposed DQSR with existing techniques

The implemented work is compared with existing research [38] for the various parameters given
in Table 6.

Table 6: Comparison of previous systems with implemented DQSR work

Parameters | Previous system 1 | Previous system 2 | Proposed system
Number of letters and words reserved for TTS | 110 | 110 | 110
Sentences changed correctly | 96 | 98 | 98.73
Accuracy | 87% | 94% | 95.21%


The existing research [38] was implemented on Punjabi text; in order to
test the efficiency of the proposed DQSR, we implemented the existing approach on our input Tamil
sentences and words. The results are presented in Table 6, which shows that the
accuracy of our proposed DQSR is 95.21%, whereas the existing techniques achieved 94% and 87%
accuracy. In addition, when sentences are converted by the existing and proposed techniques, the
proposed DQSR converts them correctly at a rate of 98.73%. The reason is that DQSR uses several layers
to convert the input text into speech signals. Therefore, our proposed DQSR achieved better
performance than the existing techniques.

5. Conclusion

Existing rule-based and machine learning-based G2P (phoneme-centric) techniques for Tamil
are still inadequate to handle all of the problems involved. Systems that create speech from text input
are known as TTS systems. Although speech generation itself is rather complex, introducing naturalness into
the speaker's expression is a major challenge in TTS. A good degree of intelligibility and naturalness of the
speech has been reached. This work presents a TTS conversion system for the Tamil language in which
TTS conversion utilizing DQSR is implemented and validated. The results show
that the method is highly accurate. A comparison of standard machine learning approaches with the
proposed system shows that the suggested system generates speech waveforms with great
accuracy, with no deterioration or mispronunciation. The accuracy of the
algorithm is 5 percent higher than that of a conventional technique. In future work, the learning rate of
the deep learning approach will be optimized using efficient meta-heuristic algorithms.

References

[1] Tomoki Toda, Alan W Black, and Keiichi Tokuda, “Voice conversion based on maximum-likelihood
estimation of spectral parameter trajectory,” IEEE Trans. Audio, Speech, Language Process., vol. 15,
no. 8, pp. 2222–2235, 2007.

[2] Paul Taylor, Text-to-speech synthesis, Cambridge university press, 2009.

[3] Masanobu Abe, Satoshi Nakamura, Kiyohiro Shikano, and Hisao Kuwabara, “Voice conversion
through vector quantization,” J. Acoust. Soc. Jpn. (E), vol. 11, no. 2, pp. 71–76, 1990.

[4] Ryoichi Takashima, Tetsuya Takiguchi, and Yasuo Ariki, “Exemplar-based voice conversion using
sparse representation in noisy environments,” IEICE Transactions on Fundamentals of Electronics,
Communications and Computer Sciences, vol. 96, no. 10, pp. 1946–1953, 2013.

[5] Ling-Hui Chen, Zhen-Hua Ling, Li-Juan Liu, and LiRong Dai, “Voice conversion using deep neural
networks with layer-wise generative training,” IEEE/ACM Transactions on Audio, Speech and Language
Processing (TASLP), vol. 22, no. 12, pp. 1859–1872, 2014.

[6] Daniel Erro, Asuncion Moreno, and Antonio Bonafonte, “INCA algorithm for training voice
conversion systems from nonparallel corpora,” IEEE Trans. Audio, Speech, Language Process., vol. 18,
no. 5, pp. 944–953, 2009.

[7] Yining Chen, Min Chu, Eric Chang, Jia Liu, and Runsheng Liu, “Voice conversion with smoothed gmm
and map adaptation,” in Proc. EUROSPEECH, 2003, pp. 2413–2416.


[8] Chung-Han Lee and Chung-Hsien Wu, “Map-based adaptation for speech conversion using
adaptation data selection and non-parallel training,” in Proc. INTERSPEECH, 2006, pp. 2254–2257.

[9] Tomoki Toda, Yamato Ohtani, and Kiyohiro Shikano, “Eigenvoice conversion based on gaussian
mixture model,” in Proc. INTERSPEECH, 2006, pp. 2446–2449.

[10] Lifa Sun, Kun Li, Hao Wang, Shiyin Kang, and Helen Meng, “Phonetic posteriorgrams for many-to-
one voice conversion without parallel data training,” in 2016 IEEE International Conference on
Multimedia and Expo (ICME). IEEE, 2016, pp. 1–6.

[11] Hiroyuki Miyoshi, Yuki Saito, Shinnosuke Takamichi, and Hiroshi Saruwatari, “Voice conversion
using sequence-to-sequence learning of context posterior probabilities,” Proc. INTERSPEECH, pp.
1268–1272, 2017.

[12] Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and Hsin-Min Wang, “Voice conversion
from non-parallel corpora using variational auto-encoder,” in Proc. APSIPA. IEEE, 2016, pp. 1–6.

[13] Takuhiro Kaneko and Hirokazu Kameoka, “Parallel-data-free voice conversion using cycle-
consistent adversarial networks,” arXiv preprint arXiv:1711.11293, 2017.

[14] Fuming Fang, Junichi Yamagishi, Isao Echizen, and Jaime Lorenzo-Trueba, “High-quality
nonparallel voice conversion based on cycle-consistent adversarial network,” in 2018 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5279–
5283.

[15] Andros Tjandra, Berrak Sisman, Mingyang Zhang, Sakriani Sakti, Haizhou Li, and Satoshi
Nakamura, “Vqvae unsupervised unit discovery and multi-scale code2spec inverter for zerospeech
challenge 2019,” arXiv preprint arXiv:1905.11449, 2019.

[16] Cheng-chieh Yeh, Po-chun Hsu, Ju-chieh Chou, Hungyi Lee, and Lin-shan Lee, “Rhythm-flexible
voice conversion without parallel data using cycle-gan over phoneme posteriorgram sequences,” in
Proc. SLT, 2018, pp. 274–281.

[17] Kou Tanaka, Hirokazu Kameoka, Takuhiro Kaneko, and Nobukatsu Hojo, “Atts2s-vc: Sequence-to-
sequence voice conversion with attention and context preservation mechanisms,” in Proc. ICASSP,
2019, pp. 6805–6809.

[18] Jing-Xuan Zhang, Zhen-Hua Ling, and Li-Rong Dai, “Non-parallel sequence-to-sequence voice
conversion with disentangled linguistic and speaker representations,” arXiv preprint
arXiv:1906.10508, 2019.

[19] Hieu-Thi Luong and Junichi Yamagishi, “A unified speaker adaptation method for speech synthesis
using transcribed and untranscribed speech with backpropagation,” arXiv preprint arXiv:1906.07414,
2019.

[20] S. Yuvaraja, V. Keri, S. C. Pammi, K. Prahallad, and A. W. Black, “Building a Tamil voice using HMM
segmented labels,” in National Conference on Communication, International Institute of Information
Technology, Hyderabad, India, Language Technologies Institute, Carnegie Mellon University, USA,
2010.


[21] R. I. Damper, Y. Marchand, M. J. Adamson, and K. Gustafson, “Comparative evaluation of letter-


to-sound conversion techniques for English text-to-speech synthesis,” in Proceedings of the third
ESCA/COCOSDA Workshop (ETRW) on Speech Synthesis, Jenolan Caves House, Blue Mountains, NSW,
Australia, 1998, pp. 53–58.

[22] S. Hussain, “Letter-to-sound conversion for Urdu Text-to-Speech system,” in Proceedings of the
Workshop on Computational Approaches to Arabic Script-Based Languages, Association for
Computational Linguistics, Geneva, Switzerland, 2004, pp. 74–9.

[23] A. K. Kienappel and R. Kneser, “Designing very compact decision trees for grapheme-to-phoneme
transcription,” in Proceedings of the Interspeech, Aalborg, Denmark, 2001, pp. 1911–4.

[24] C. Ma, M. A. Randolph, and J. Drish, “A support vector machines-based rejection technique for
speech recognition,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and
Signal Processing, Proceedings (ICASSP’01), IEEE, 2001, Vol. 1, pp. 381–4.

[25] L. Jiang, H.-W. Hon, and X. Huang, “Improvements on a trainable letter-to-sound converter,” in
Proceedings of the Fifth European Conference on Speech Communication and Technology, Rhodes,
Greece

[26] J. R. Bellegarda, “Unsupervised, language-independent grapheme-to-phoneme conversion by


latent analogy,” Speech Commun., Vol. 46, no. 2, pp. 140–52, 2005.

[27] M. Bisani and H. Ney, “Joint-sequence models for grapheme-to-phoneme conversion,” Speech
Commun., Vol. 50, no. 5, pp. 434–51, 2008.

[28] S. Jiampojamarn and G. Kondrak, “Letter-phoneme alignment: An exploration,” in Proceedings


of the 48th Annual Meeting of the Association for Computational Linguistics, Association for
Computational Linguistics, Uppsala, Sweden, 2010, pp. 780–8.

[29] K. Rao, F. Peng, H. Sak, and F. Beaufays, “Grapheme-to-phoneme conversion using long short-
term memory recurrent neural networks,” in Proceedings of the IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), IEEE, Brisbane, QLD, Australia, 2015, pp. 4225–9.

[30] H. Sak, A. Senior, and F. Beaufays, “Long short-term memory based recurrent neural network
architectures for large vocabulary speech recognition,” CoRR, vol. abs/1402.1128, 2014. Available:
http://arxiv.org/abs/1402.1128.

[31] Rama, G.J., Ramakrishnan, A.G., Venkatesh, M.V. and Muralishankar, R., 2001. Thirukkural: a text-
to-speech synthesis system. Proc. Tamil Internet, pp.92-97.

[32] Muralishankar, R., and A. G. Ramakrishnan. "Human Touch to Tamil Speech Synthesizer."
Tamilnet2001, Kualalumpur, Malaysia (2001): 103-109

[33] Aparna, K.G., Rama, G.J. and Ramakrishnan, A.G., 2001. Machine Reading of Tamil Books.
Proceedings of ICBME.

[34] Rama, G.J., Ramakrishnan, A.G., Muralishankar, R. and Prathibha, R., 2002, September. A
complete text-to-speech synthesis system in Tamil. In Proceedings of 2002 IEEE Workshop on Speech
Synthesis, 2002. (pp. 191-194). IEEE.


[35] Prathibha, P. and Ramakrishnan, A.G., 2002. Web-enabled Speech Synthesizer for Tamil. In
Proceedings of Tamil Internet 2002 Conference, Cennai: Asian Printers (pp. 134-140).

[36] Sreekanth Majji and A. G. Ramakrishnan, “Festival based maiden TTS system for Tamil Language,” 3rd
Language & Technology Conference: Human Language Technologies as a Challenge for Computer
Science and Linguistics October 5-7, 2007, Poznań, Poland.

[37] Sangeetha, J., Jothilakshmi, S., Sindhuja, S. and Ramalingam, V., 2013. Text to speech synthesis
system for tamil. Int J Emerging Tech Adv En, 3, pp.170-5.

[38] Rashid, M. and Singh, H., 2019. Text to speech conversion in Punjabi language using nourish
forwarding algorithm. International Journal of Information Technology, pp.1-10.

