Deep Learning Based Multilingual Speech Synthesis Using Multi Feature Fusion Methods
Praveena Nuthakki
Department of CSIT, Koneru Lakshmaiah Education Foundation, Vaddeswaram 522302, AP, India.
Madhavi Katamaneni
Department of IT, Velagapudi Ramakrishna Siddhartha Engineering College, Vijayawada, India.
Chandra Sekhar J. N.
Department of EEE, Sri Venkateswara University College of Engineering, Sri Venkateswara University 517502, Tirupati, A.P., India.
Kumari Gubbala
Associate Professor, Department of CSE (CS), CMR Engineering College, Hyderabad, Telangana, India. (ORCID: 0000-0002-6378-1065)
Bullarao Domathoti*
Department of CSE, Shree Institute of Technical Education, Jawaharlal Nehru Technological University, Ananthapuram, 517501, India.
Venkata Rao Maddumala
Department of CSE, Koneru Lakshmaiah Education Foundation, Vaddeswaram 522302, AP, India.
Kumar Raja Jetti
Department of CSE, Bapatla Engineering College, Bapatla, Guntur, Andhra Pradesh, India. (ORCID: 0009-0000-5169-3829)
*Corresponding Author: Bullarao Domathoti (Email: [email protected])
Traditional concatenative speech synthesis suffers from two major problems: poor intelligibility and unnatural-sounding output. Context-based CNN deep learning approaches alone are not robust enough for expressive speech synthesis. Our proposed approach can meet these requirements and handle the complexities of voice synthesis. The proposed model's minimal aperiodic distortion makes it a strong candidate for a communication recognition model. Although synthesized speech still has a number of audible flaws, our proposed method comes as close to human speech as possible. In addition, considerable work remains in incorporating sentiment analysis into text categorization using natural language processing, since the intensity of expressed emotion varies greatly from country to country. To improve their voice synthesis outputs, models need to incorporate more hidden layers and nodes into the modified mixture density network. For our proposed algorithm to perform at its best, a more robust network foundation and stronger optimization methods are needed. We hope that, after reading this article and working through the example data provided, both experienced researchers and newcomers will have a better grasp of the steps involved in building a deep learning approach. The model makes progress in overcoming fitting issues when less training data is available, although the DL-based method requires more space to store its input parameters.
Keywords: Natural Language Processing, Deep Learning, Machine Learning, Speech to Text.
1 INTRODUCTION
Fundamental frequency (F0) modelling is essential for voice synthesis to sound natural and intelligible. Mandarin speech synthesis is not degraded by the use of a mixed Tibetan training corpus [1]. Compared with the Tibetan monolingual framework, the mixed BERT-based cross-lingual speech synthesis framework requires only 60% of the training corpus to synthesize a comparable voice. The suggested approach can therefore be applied to voice synthesis for low-resource languages by exploiting the corpus of a high-resource language [2]. Synthesizing the voices of multiple speakers from a single model is known as multi-speaker speech synthesis. Various approaches employing deep neural networks (DNNs) have been proposed, but DNNs are vulnerable to overfitting when training data is insufficient [3]. Using deep Gaussian processes (DGPs), a layered structure of Bayesian kernel regressions that is resistant to overfitting, we provide a framework for multi-speaker voice synthesis.
Research on voice synthesis that conveys a speaker's personality and emotions is thriving thanks to recent advances in deep learning-based voice synthesis. However, for speakers with extremely low or very high vocal ranges, as well as speakers with dialects,
current systems do not adequately capture the full range of moods and characteristics [5]. Thanks to the remarkable progress in deep learning, modern neural text-to-speech (TTS) systems can create speech that is indistinguishable from natural speech [6]. However, rather than exhibiting rich prosodic variation, the produced utterances often retain only the average prosodic character of the database. To effectively convey meaning in pitch-stressed languages such as English, proper intonation and emphasis are essential [7]. Automatically recognizing emotions in a speaker's voice is a difficult and time-consuming task, and the complex range of human emotions makes categorization challenging. Extracting relevant speech features is a significant challenge because of the need for human intervention [8]. Deep learning methods that exploit high-level aspects of voice signals improve the accuracy of emotion identification. One of the primary challenges of employing target-driven models for articulatory-based speech synthesis is representation learning [9]. Computer-aided diagnostic systems that support clinical decision-making rely heavily on deep learning methodologies, but the scarcity of annotated medical data makes the development of such automated systems difficult [10, 11]. The purpose of the user interface is to read the user's mood accurately. The most pressing concern in speech emotion recognition research is how to efficiently extract relevant speech characteristics together with an adequate classification engine [12, 13]. Emotion recognition and analysis from speech signals also require well-curated speech databases. Numerous real-world applications, such as human-computer interaction, computer gaming, mobile services, and emotion assessment, have made use of data science in recent years. Speech emotion recognition (SER) is a relatively new and difficult area of study with a broad range of potential applications. Current SER research has relied on handcrafted features that produce the best results in controlled conditions but fall short when applied to complicated settings. More recently, SER has taken advantage of automated feature learning from voice data using deep learning methods. Although deep learning-based SER approaches have addressed the accuracy problems, they still have a long way to go before they can be considered fully comprehensive [14, 15].
The following problem statement was developed to address the aforementioned challenges and to provide a solution to the issue of multilingual voice recognition:
• The ability to identify and convert a multilingual sentence into a sentence in a single language.
• Creating a model that can interpret voice navigation questions spoken in many languages and provide a unified response in one of those languages.
Although there are 22 official languages in India, most are only spoken within a single state. As a result, resources for these
languages are limited, and high-quality voice data is hard to come by. Furthermore, no publicly accessible multilingual dataset
includes these languages and English.
2 RELATED WORK
In this study, we work to refine F0 modelling for Isarn speech synthesis. For this problem, we recommend an RNN-based F0 model. Supra-segmental aspects of the F0 contour are represented by F0 values sampled at the syllable level of continuous Isarn speech, together with their dynamic features. We investigate several deep RNN architectures and feature combinations to find those that provide the highest performance, and we evaluate the suggested technique against different RNN-based baselines to determine its efficacy. Objective and subjective evaluations show that the proposed model outperforms both the baseline RNN strategy that predicts F0 values at the frame level and the baseline RNN model that represents the F0 contours of words using a discrete cosine transform [16].
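As an illustration of this design, the sketch below shows one way a syllable-level F0 predictor could be organised in Python; the feature dimension, number of F0 samples per syllable, and layer sizes are assumptions for illustration, not the configuration used in [16].

```python
import torch
import torch.nn as nn

class SyllableF0Model(nn.Module):
    """Illustrative RNN that maps syllable-level linguistic features to
    sampled F0 values plus their dynamic (delta) features."""
    def __init__(self, feat_dim=128, f0_samples_per_syllable=10, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        # Predict static F0 samples and their deltas for every syllable.
        self.out = nn.Linear(2 * hidden, 2 * f0_samples_per_syllable)

    def forward(self, syllable_feats):      # (batch, n_syllables, feat_dim)
        h, _ = self.rnn(syllable_feats)     # (batch, n_syllables, 2 * hidden)
        return self.out(h)                  # (batch, n_syllables, 2 * samples)
```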
This study introduces a voice synthesis approach based on deep Gaussian processes (DGPs) and sequence-to-sequence (Seq2Seq) learning, with the goal of producing high-quality end-to-end speech. Since a DGP uses Bayesian training and kernel regression, it can create more lifelike synthetic speech than deep neural networks (DNNs). Such DGP models, however, require extensive text-processing knowledge because they are pipeline architectures of separate models, including acoustic and duration models. The suggested model uses Seq2Seq learning, which allows joint training of the spectral and temporal models. Parameters are learned with a Bayesian model, and Gaussian process regressions (GPRs) are used to represent the encoder and decoder layers. To adequately model character-level inputs in the encoder, a self-attention mechanism based on Gaussian processes is also proposed. In a subjective assessment, the proposed Seq2Seq-SA-DGP was shown to produce more natural-sounding speech than DNNs with self-attention and recurrent structures. In addition, Seq2Seq-SA-DGP is useful when a straightforward input is provided to a complete system, and it helps alleviate the over-smoothing issues of recurrent structures. The evaluations also demonstrate the efficacy of the self-attention structure within the DGP framework, with Seq2Seq-SA-DGP able to synthesize more natural discourse than Seq2Seq-SRU-DGP and SRU-DGP [17].
3 PROPOSED WORK
The mapping objective computes the squared Frobenius norm of the residual matrix.
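Concretely, as a sketch that assumes the residual arises from a linear map $W$ aligning source embeddings $X$ with target embeddings $Z$ (the usual VecMap-style supervised setting), this quantity can be written as

$$\lVert XW - Z \rVert_F^{2} \;=\; \sum_{i}\sum_{j}\bigl((XW)_{ij} - Z_{ij}\bigr)^{2}.$$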
In this work, we use VecMap in its supervised mode. A 5,000-word English-Persian lexicon is used for training and testing the mapping under this supervision.
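As a sketch of this supervised mapping step, the following shows a Procrustes-style solution of the Frobenius objective over a seed lexicon, in the spirit of VecMap's supervised mode; the file names and dimensions are illustrative assumptions, not the exact pipeline used here.

```python
# Hypothetical sketch of supervised cross-lingual embedding alignment.
import numpy as np

def align_embeddings(X, Z):
    """Solve min_W ||XW - Z||_F^2 with W constrained to be orthogonal
    (the Procrustes solution used by VecMap-style supervised mapping).

    X: (n, d) source-language vectors for the n lexicon pairs.
    Z: (n, d) target-language vectors for the same pairs.
    Returns the (d, d) orthogonal mapping W.
    """
    # Length-normalize rows, as is common before cross-lingual mapping.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    # The SVD of X^T Z gives the orthogonal W minimizing the residual.
    U, _, Vt = np.linalg.svd(X.T @ Z)
    return U @ Vt                       # W such that XW ≈ Z

# Usage with a 5,000-pair English-Persian seed lexicon (shapes illustrative):
# X = np.load("english_vectors.npy")   # (5000, 300)
# Z = np.load("persian_vectors.npy")   # (5000, 300)
# W = align_embeddings(X, Z)
# residual = np.linalg.norm(X @ W - Z) ** 2   # squared Frobenius norm
```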
(Figure: overall architecture of the proposed system. English, Japanese, and Mandarin TTS corpora feed data collection, data preprocessing, multi-speaker training, and testing, after which the speech is converted; the classification network ends in a global pooling layer followed by a softmax layer.)
3.2 CNN
As shown in Figure 2, each word in the system corresponds to a real-valued vector. The words are represented by the integers 0 through N-1, and these word indices are encoded as vectors of length N that can be fed directly into the corresponding network model. The textual sequences are ordered in time, and context words are predicted much as in a skip-gram model. The conditional probability of each word is represented as a vector; applying the softmax function to this vector, as shown in Equation 1, identifies the most pivotal target word:
$$\frac{1}{T}\sum_{t=1}^{T}\;\sum_{\substack{-m \le j \le m \\ j \neq 0}} \log P\!\left(w_{t+j}\mid w_{t}\right), \qquad P\!\left(w_{o}\mid w_{c}\right)=\frac{\exp\!\left(\mathbf{u}_{o}^{\top}\mathbf{v}_{c}\right)}{\sum_{i\in\mathcal{V}}\exp\!\left(\mathbf{u}_{i}^{\top}\mathbf{v}_{c}\right)} \qquad (1)$$
where $\mathcal{V}$ is the index set of the lexicon, $T$ is shorthand for the sequence's length, $m$ is the context window size, and the target word $w_o$ in context is found in the lexicon by its index. During the approximate training phase, the loss function is built from the path between nodes in a binary tree structure. Gradient computation, a part of the training process, has complexity that grows with the lexicon, so the sequential data is pre-processed using negative sampling. We developed a word-embedding model that can compare words such as "Hi" and "Hai" to determine their level of similarity. Instead of using a morphology function, a separate vector is used for word-to-vector conversion; in other words, the skip-gram model is applied. We set both the minimum and maximum length of the extracted subwords, while the dictionary size is not a fixed parameter. For this purpose, we used a byte-pair encoding strategy [1].
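A minimal sketch of this subword-aware skip-gram setup with negative sampling, using the gensim FastText implementation; the toy corpus, vector size, subword lengths, and other hyperparameters shown are illustrative assumptions rather than the values used in this work.

```python
# Illustrative subword skip-gram training with negative sampling (gensim).
from gensim.models import FastText

sentences = [
    ["hi", "how", "are", "you"],
    ["hai", "how", "are", "you"],   # spelling variant close to "hi"
]

model = FastText(
    sentences,
    vector_size=100,   # embedding dimensionality
    sg=1,              # skip-gram architecture
    negative=5,        # negative sampling instead of the full softmax
    min_n=3, max_n=6,  # minimum / maximum length of extracted subwords
    min_count=1,
)

# Subword sharing lets morphologically similar tokens receive similar vectors.
print(model.wv.similarity("hi", "hai"))
```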
The convolutional neural network (CNN) can be trained rapidly, depending on the quality of the input voice. The input layers accept input vectors that conform to the required probability distributions, and the parameters of the underlying hidden layers are then determined. Data is not passed on from a hidden layer until convergence is reached. Based on the length of the recovered subwords, the output layer interpolates the acoustic characteristics. Because these variables are trained within the same network, there is no mismatch between them. Class probabilities are obtained with the softmax function in Equation 2,

$$\operatorname{softmax}(Z)_i \;=\; \frac{e^{Z_i}}{\sum_{j=1}^{C} e^{Z_j}}, \qquad i = 1,\dots,C \qquad (2)$$

where $Z$ is the vector of input values, $C$ is the total number of categories in the dataset, and $e^{Z_i}$ and $e^{Z_j}$ are the usual exponentials of the input and output vectors. Equation 3 describes the confidence-averaged prediction of a class label, determined by taking the argmax over the anticipated labels of the target class:

$$\hat{y} \;=\; \arg\max_{i}\; \frac{1}{n}\sum_{k=1}^{n} p_k\!\left(y = i \mid x\right) \qquad (3)$$

where $p_k$ is the $k$-th base learner, $n$ is the number of selected base CNN learners in the proposed TTD, and $p_k(y = i \mid x)$ represents the likelihood that category value $i$ is present in a given data sample $x$.
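A minimal sketch of the confidence-averaged prediction in Equation 3, assuming each base CNN already outputs softmax probabilities; the function name and array shapes are illustrative.

```python
import numpy as np

def ensemble_predict(prob_list):
    """prob_list: list of (num_samples, C) softmax outputs, one per base CNN.
    Returns the class with the highest confidence-averaged probability (Eq. 3)."""
    avg = np.mean(np.stack(prob_list, axis=0), axis=0)  # average over the n base learners
    return np.argmax(avg, axis=1)                       # argmax over the C categories
```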
(Figure: speech synthesis pipeline. The input text passes through feature extraction and parameter generation, and waveform synthesis produces the output speech.)
The specifics of the ASR systems tested here are listed in Table 3.
The confusion matrix [19] uses the following fundamental terminology:
A true positive (TP) is a positive observation that matches a positive prediction.
A false negative (FN) is a positive observation that is predicted as negative.
A true negative (TN) is a negative observation that matches a negative prediction.
A false positive (FP) is a negative observation that is predicted as positive.
To analyse the test results, we calculated accuracy, precision, recall, and F1-score. These performance evaluation measures are given by Equations (4) to (7):

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (4)$$

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (5)$$

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (6)$$

$$F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (7)$$
Accuracy is the easiest of the assessment indices to interpret; however, a supplementary index is necessary when the data categories are not evenly distributed. The F1-score is a useful measure because it takes both precision and recall into account.
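For reference, the metrics in Equations (4) to (7) can be computed directly from the confusion-matrix counts, as in the short sketch below (names are illustrative).

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy  = (tp + tn) / (tp + tn + fp + fn)   # Eq. (4)
    precision = tp / (tp + fp)                    # Eq. (5)
    recall    = tp / (tp + fn)                    # Eq. (6)
    f1        = 2 * precision * recall / (precision + recall)  # Eq. (7)
    return accuracy, precision, recall, f1
```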
We use the CSS10 voice corpus, which provides a single speaker for each of ten languages. We choose German, Dutch, Chinese, and Japanese to build a task sequence, and we use the published train/validation splits, which provide 15.1 hours, 11.5 hours, 5.4 hours, and 14.0 hours of training audio, respectively. For each language, we use the last 20 examples from the validation set as test samples. We order the lifelong-learning sequence as follows: German (DE), Dutch (NL), Chinese (ZH), and Japanese (JA). Our buffer size for replay-based algorithms is 300 utterances, or around 0.6 h of audio. After training on each language, random samples are added to the buffer. To maintain linguistic parity in the buffer over the course of the training sequence, old samples are removed at random as new ones are added.
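A minimal sketch of one way such a language-balanced replay buffer could be maintained; the class name, capacity handling, and eviction details are illustrative assumptions rather than the exact procedure used here.

```python
import random

class ReplayBuffer:
    """Language-balanced replay buffer for lifelong learning (capacity in utterances)."""
    def __init__(self, capacity=300):
        self.capacity = capacity
        self.items = []                      # list of (language, utterance) pairs

    def add_language(self, language, utterances):
        """After training on a language, add random samples of it and evict old
        samples at random so every seen language keeps roughly equal share."""
        languages = {lang for lang, _ in self.items} | {language}
        per_lang = self.capacity // len(languages)
        # Randomly keep at most per_lang samples of each previously seen language.
        random.shuffle(self.items)
        kept, counts = [], {}
        for lang, utt in self.items:
            if counts.get(lang, 0) < per_lang:
                counts[lang] = counts.get(lang, 0) + 1
                kept.append((lang, utt))
        # Add random samples of the newly finished language.
        new = random.sample(utterances, min(per_lang, len(utterances)))
        self.items = kept + [(language, utt) for utt in new]
```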
Table 2. Compared with the baseline monolingual models using CNN as the classifier, the proposed cross-lingual sentiment analysis approach shows significant improvement.

Language   Embedding            Precision   Recall   F-measure   Precision   Recall   F-measure
English    Static embedding       60.29      58.36     59.31       67.05      66.21     66.63
English    Dynamic embedding      68.28      65.99     67.13       73.93      72.81     73.37
Japanese   Static embedding       70.32      68.54     69.42       72.69      71.71     72.18
Japanese   Dynamic embedding      79.73      78.18     78.96       81.48      80.01     80.75
Mandarin   Static embedding       73.29      71.64     72.46       80.12      78.33     79.22
Mandarin   Dynamic embedding      77.73      76.16     76.94       85.33      83.29     84.28
Table 3. Compared with the baseline multilingual model using BERT as the classifier, the proposed cross-lingual sentiment analysis model shows significant improvement.

Language   Embedding            Precision   Recall   F-measure   Precision   Recall   F-measure
English    Static embedding       63.75      62.13     62.93       69.98      68.51     69.24
Japanese   Static embedding       70.99      69.25     70.11       75.13      73.68     74.38
Mandarin   Static embedding       77.24      75.55     76.39       80.92      79.45     80.18
English    Static embedding       65.85      64.03     64.93       72.31      70.96     71.63
English    Dynamic embedding      71.77      70.44     71.11       78.09      76.74     77.41
Japanese   Static embedding       70.72      68.96     69.83       77.31      76.27     76.84
Japanese   Dynamic embedding      78.28      76.91     77.59       84.93      83.48     84.18
Mandarin   Static embedding       76.83      75.28     76.05       84.26      83.26     83.76
Mandarin   Dynamic embedding      85.03      83.58     84.31       89.69      88.13     88.91
Fig. 4. Comparison of accuracy against (a) number of samples and (b) feature size, for CNN, BERT, and CNN-BERT.
Fig. 5. Comparison of precision against (a) number of samples and (b) feature size.
Fig. 6. Comparison of recall against (a) number of samples and (b) feature size.
Fig. 7. Comparison of F-measure against (a) number of samples and (b) feature size, for CNN, BPNN, and the proposed model.
Fig. 8. Comparison of runtime against feature size.
Two distinct cross-lingual embeddings (BERT and VectorMap) and two classifiers (CNN and CNN-BERT) are employed in our experiments. Because the cross-lingual embeddings were trained without any information from the training data of the sentiment evaluation task, we conducted our tests in two modes: static, which uses the pre-trained vectors directly, and dynamic, which fine-tunes the embeddings on the training data. All experiments are reported in terms of accuracy, precision, recall, and F-measure.
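As a small illustration of the two modes, the sketch below shows how a static versus dynamic embedding layer could be set up in PyTorch; the function name and framework choice are illustrative assumptions, not the implementation used here.

```python
import torch
import torch.nn as nn

def make_embedding(pretrained_weights: torch.Tensor, mode: str = "static") -> nn.Embedding:
    """Build the embedding layer from pre-trained cross-lingual vectors.
    'static' keeps the vectors frozen; 'dynamic' fine-tunes them on the
    sentiment training data."""
    return nn.Embedding.from_pretrained(pretrained_weights, freeze=(mode == "static"))
```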
The following noteworthy insights emerge from a comparison of the outputs of the different training models and cross-lingual embeddings:
Our suggested model, trained on English data rather than Persian data, consistently outperforms state-of-the-art baseline algorithms in our experimental settings.
Although the proposed cross-lingual sentiment analysis performs much better than the monolingual models, a comparison of the English-only and bilingual training results shows that the model can be further improved by exploiting both English and Persian data in a bilingual training phase. Bilingual training allows the extensive English data set to be used while also learning the language-dependent characteristics present in the Persian dataset.
VectorMap consistently beats BERT, although the performance gaps are small. Because BilBOWA needs parallel data, the corpora available for training that algorithm are restricted, whereas VectorMap has no such requirement.
A comparison of static and dynamic embeddings shows that the model can be further improved by capturing semantic properties present in the subject matter of the training data.
Figure 8 shows that, when all classifiers are compared, the hybrid models perform best. By exploiting the strengths of both components, the BERT-CNN and CNN-BERT architectures capture sequential and local information from the text, leading to more precise predictions. Furthermore, the fact that BERT-CNN outperforms CNN-BERT demonstrates the importance of gathering sequential information early in the architecture; with the CNN-BERT model, some sequential information may be lost because the CNN is applied first. Combined, BERT-CNN and VectorMap provide an F-measure of 95.04, making them the most effective configuration.
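As an illustration of the hybrid idea discussed above, the following is a minimal sketch of how a BERT-CNN classifier could be wired together; the checkpoint name, filter sizes, and dimensions are assumptions for illustration, not the exact configuration used in this work.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class BertCnnClassifier(nn.Module):
    """BERT encoder first (sequential/contextual features), then 1-D convolutions
    over the token sequence to capture local n-gram patterns."""
    def __init__(self, num_classes, checkpoint="bert-base-multilingual-cased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(checkpoint)
        hidden = self.bert.config.hidden_size
        self.convs = nn.ModuleList(
            [nn.Conv1d(hidden, 128, kernel_size=k) for k in (3, 4, 5)]
        )
        self.classifier = nn.Linear(128 * 3, num_classes)

    def forward(self, input_ids, attention_mask):
        tokens = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        x = tokens.transpose(1, 2)        # (batch, hidden, seq_len) for Conv1d
        pooled = [torch.relu(c(x)).max(dim=2).values for c in self.convs]  # global max pooling
        return self.classifier(torch.cat(pooled, dim=1))
```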
In contrast to concatenative speech synthesis, the proposed method is capable of creating synthetic speech that is not only highly intelligible but also sounds very much like real speech. One limitation of the existing TTS-based speech synthesis model is its use of context decision trees to tie speech parameters; as a direct consequence, the artificially produced speech lacks the degree of realism required for expressive speech synthesis. In the suggested method, the clustering phase of the context CNN is replaced by deep learning-based speech synthesis models. These models employ full context information and distributed representations in place of the clustering process, and they map the context characteristics to high-dimensional acoustic data through several hidden layers, so that the quality of the synthesized speech is superior to that of earlier approaches. The enormous representational capacity of DL-based models, on the other hand, has introduced a number of new difficulties. The models need more hidden layers and nodes in order to make more accurate predictions, which in turn increases the total number of network parameters as well as the amount of computation and memory required.
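The sketch below illustrates this kind of deep mapping from linguistic context features to high-dimensional acoustic features; the layer count and sizes are assumptions for illustration only.

```python
import torch.nn as nn

def acoustic_model(context_dim, acoustic_dim, hidden=1024, layers=6):
    """Feed-forward mapping from linguistic context features to acoustic
    features through several hidden layers (sizes are illustrative)."""
    blocks, in_dim = [], context_dim
    for _ in range(layers):
        blocks += [nn.Linear(in_dim, hidden), nn.ReLU()]
        in_dim = hidden
    blocks.append(nn.Linear(in_dim, acoustic_dim))  # e.g. spectral, F0, aperiodicity features
    return nn.Sequential(*blocks)
```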
5 CONCLUSION
The poor intelligibility and unnatural character of traditional concatenative speech synthesis technologies are two major problems. Context-based CNN deep learning approaches alone are not robust enough for expressive speech synthesis. Our proposed model can satisfy those voice synthesis needs and address the associated difficulties. The proposed model's minimal aperiodic distortion makes it a strong candidate for a communication recognition model. Although synthesized speech still has a number of audible flaws, our proposed method comes as close to human speech as possible. In addition, considerable work remains in incorporating sentiment analysis into text categorization using natural language processing, since the intensity of expressed emotion varies greatly from country to country. To improve their voice synthesis outputs, models need to incorporate more hidden layers and nodes into the modified mixture density network. To evaluate the model as thoroughly as possible, our proposed method will require a more robust network structure and stronger optimization approaches. We hope that, after reading this article and working through the example data provided, both experienced researchers and newcomers will have a better grasp of the steps involved in building a deep learning approach. The model makes progress in overcoming fitting issues when less training data is available, although the DL-based method requires more space to store its input parameters.
Authors' Contribution: All authors contributed equally to each part of this work.
ETHICAL APPROVAL:
This article does not contain any studies with human participants or animals performed by the authors.
CONFLICTS OF INTEREST
The authors declare that there are no conflicts of interest regarding the publication of this paper.
FUNDING STATEMENT
The authors declare that no funding was received for this research and publication.
ACKNOWLEDGMENT
The authors thank those who provided characterization support to complete this research work.
REFERENCES
[1]. Bollepalli, B., Juvela, L., & Alku, P. (2019). Lombard Speech Synthesis Using Transfer Learning in a Tacotron Text-to-Speech System.
Interspeech.
[2]. Mishev, K., Karovska Ristovska, A., Trajanov, D., Eftimov, T., & Simjanoska, M. (2020). MAKEDONKA: Applied Deep Learning Model for
Text-to-Speech Synthesis in Macedonian Language. Applied Sciences.
[3]. Nishimura, Y., Saito, Y., Takamichi, S., Tachibana, K., & Saruwatari, H. (2022). Acoustic Modeling for End-to-End Empathetic Dialogue Speech
Synthesis Using Linguistic and Prosodic Contexts of Dialogue History. Interspeech.
[4]. Barhoush, M., Hallawa, A., Peine, A., Martin, L., & Schmeink, A. (2023). Localization-Driven Speech Enhancement in Noisy Multi-Speaker
Hospital Environments Using Deep Learning and Meta Learning. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31, 670-
683.
[5]. Ning, Y., He, S., Wu, Z., Xing, C., & Zhang, L. (2019). A Review of Deep Learning Based Speech Synthesis. Applied Sciences.
[6]. Gudmalwar, A.P., Basel, B., Dutta, A., & Rao, C.V. (2022). The Magnitude and Phase based Speech Representation Learning using Autoencoder
for Classifying Speech Emotions using Deep Canonical Correlation Analysis. Interspeech.
[7]. Tu, T., Chen, Y., Liu, A.H., & Lee, H. (2020). Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech
Representation. Interspeech.
[8]. Wu, P., Watanabe, S., Goldstein, L.M., Black, A.W., & Anumanchipalli, G.K. (2022). Deep Speech Synthesis from Articulatory Representations.
Interspeech.
[9]. Lee, M., Lee, J., & Chang, J. (2022). Non-Autoregressive Fully Parallel Deep Convolutional Neural Speech Synthesis. IEEE/ACM Transactions
on Audio, Speech, and Language Processing, 30, 1150-1159.
[10]. Khalil, R.A., Jones, E., Babar, M.I., Jan, T., Zafar, M.H., & Alhussain, T. (2019). Speech Emotion Recognition Using Deep Learning Techniques:
A Review. IEEE Access, 7, 117327-117345.