New Media and Mass Communication www.iiste.org
ISSN 2224-3267 (Paper) ISSN 2224-3275 (Online)
Vol.101, 2022
Oromigna
Raja.K (PhD)
Ambo University, Hachalu Hundessa Campus Institute of Technology Department of Computer Science
Abstract
Text-to-speech synthesis (TTS) generates speech waveforms from input text. TTS is very important in aiding impaired people and in the teaching and learning process. However, implementing TTS for the Afaan Oromoo language involves many challenges, such as text processing, text-to-phoneme mapping, and acoustic modeling. The Afaan Oromoo language therefore greatly needs text-to-speech synthesis to support the development of the language. The role of Natural Language Processing here is to take input text paired with speech and generate the desired output speech waveforms from a prepared text corpus. The normalized text was used for linguistic feature extraction with the Festival toolkit for Afaan Oromoo TTS. Text labelling was done using the Festival toolkit, and utterance structures were generated from the texts with Scheme-file parameters. The Festival toolkit front end normalizes the texts and extracts linguistic features, producing labelled phoneme alignments that are matched with the speech corpus for training and testing. Forced alignment was done with the HTK toolkit in the prepared environment, checking the data and extracting features with timestamps at the state level for acoustic feature extraction. This study focuses on a deep learning TTS approach based on a BLSTM-RNN for the Afaan Oromoo language. The RNN model learns a duration model and an acoustic model from a given input feature sequence. The implementation uses a BLSTM-based RNN with the PyTorch library in Jupyter Notebook, creating a duration model and generating speech samples from the trained acoustic model. We prepared a corpus of 1000 texts with their matching transcriptions from an Afaan Oromoo speech corpus recorded by a single (speaker-dependent) female speaker, with 700 sentences for training and 300 sentences for testing from the dataset domains. Two evaluation techniques were used in this study. First, the Mean Opinion Score (MOS) technique was used to evaluate the intelligibility and naturalness of the TTS output. Second, Mel Cepstral Distortion (MCD), which is widely used for objective evaluation of TTS models, was applied. The performance of the model was measured, and the quality of the synthesized speech assessed in terms of intelligibility and naturalness is 3.77 and 3.76, respectively. Using the objective evaluation on the 16 kHz speech corpus, the average MCD of the speech generated by the BLSTM-based RNN is 3.89, and that of the Merlin-generated waveforms is 3.71.
Keywords: Text To Speech Synthesis, Mel Cepstral Distortion (MCD), Mean Opinion Score (MOS),
Bidirectional Long Short-Term Memory Recurrent Neural Network (BLSTM-RNN)
DOI: 10.7176/NMMC/101-02
Publication date: April 30th 2022
1. Introduction
Text-to-speech (TTS) means generating audio from input text so that the sound can be heard by a human, and it is used in communication. Natural language processing (NLP) is a field that employs computational techniques for learning, understanding, and producing human language, at the intersection of computer science, artificial intelligence, and computational linguistics. It is used both for generating human-readable information from computer systems and for converting human language into more formal structures that a computer can understand. A Text-To-Speech (TTS) synthesizer is a computer-based system that is able to read aloud any text introduced to the computer. Among the 83 languages registered in the country, Afaan Oromoo is a Cushitic language with the greatest number of speakers in Ethiopia. Moreover, Afaan Oromoo has about 60 million speakers as a mother tongue or a second language. In natural language processing, speech is formed from phonemes that are combined to form words. NLP studies human language: people learn natural sounds and speak to communicate throughout their lives, and they also seek easy and efficient ways to communicate with machines. NLP therefore accepts input texts paired with a speech corpus and is able to generate speech output after a text analysis step. This text analysis includes text normalization, sentence segmentation, tokenization, and conversion of non-standard words such as abbreviations into full words (Trilla, 2009).
The concatenative synthesis method for developing a text-to-speech system is based on signal processing of natural speech databases, with speech signal processing rendering the speech as a waveform. In this approach, appropriate speech units are concatenated to construct the required speech. The segmental database is built to capture the main phoneme features of a language, extracted from the recorded audio used for concatenative synthesis. One way of organizing the phonemes is to build diphone units, representing the phoneme-to-phoneme junctures. Non-uniform units such as diphones, syllables, and words are also used in these concatenative methods for speech synthesis.
The aim is to generate a mapping between the textual diphones and their equivalent speech units. Each diphone is represented by two characters, consequently producing the speech unit of that diphone. For syllable units, all written words except abbreviations and acronyms were considered, using language-dependent, rule-based constraints on consonants and vowels.
Syllables are used as small-database selection units in TTS systems. Word-based systems that simply concatenate isolated words or parts of sentences are only applicable when a limited vocabulary is required, typically a few words, and when the sentences to be pronounced follow a very restricted structure. The synthesizer concatenates the speech segments and performs some signal processing to smooth unit transitions and to match predefined prosodic schemes. Direct pitch-synchronous waveform processing is one of the simplest and most popular synthesis algorithms.
The multi-pulse excitation linear predictive coding (LPC) system produces synthetic speech that sounds more natural than the classical linear predictive coder. In the multi-pulse excitation LPC system, the excitation signal is modeled with a few pulses per frame of speech, regardless of whether the frame is voiced or unvoiced. The quality of the synthesized speech improves with the number of pulses used per frame. Pulses are computed by minimizing the weighted squared error between the original speech and the synthetic speech. The Digital Signal Processing (DSP) module turns the NLP representation into an output signal (Tilahun, 1993). The advantage of the DSP module is control over the duration and the frequency (aperiodicity) of the vocal folds, so that the output signal matches the input requirements of speech signal processing.
Generating (converting) text to speech encompasses both natural language processing and digital signal processing (Morka.M, 2003). The role of Natural Language Processing (NLP) is to produce a phonetic transcription of the text to be read, together with the required intonation and rhythm (Alula, 2010). Text analysis is responsible for turning the input text into pronounceable text. To achieve this, it organizes the input text into lists of words and proposes all possible part-of-speech categories for each word taken individually, and then considers the words in their recorded and written context. Phonetic analysis aims to find the phonetic transcription of the incoming text. This work can be organized in different ways: dictionary-based and rule-based (context-based) strategies.
Speech is greatly affected by accents, which occur at stressed syllables and form characteristic pitch tones on words. The transition periods between syllables depend on the place of articulation and the nature of the boundary sound units. This component is responsible for generating the acoustic sequence required to synthesize the input text by finding the pronunciation of the individual words in the input text. The style of pronunciation is influenced by the speaker's gender, physical state, and emotional state. Prosodic features depend on many aspects, such as the speaker's characteristics (gender), emotions, and the meaning of the sentence (Samuel, 2007).
Speech is the most efficient and natural way for people to communicate with each other; it is a commonly agreed and understood form of communication between human beings. When a human reads text following the phonological rules of a native (mother) language, the listener hears the individual words and sounds. Not all speech can be converted into standard written words or texts, but speech can be written down using letter-to-sound correspondences to form the words. However, this does not hold if the person hearing the speech is not familiar with the language.
There are several methods for converting text to speech. The development of society and of economic systems since prehistoric times has been paralleled by a growth in humanity's dependence on technology. Speech-enabled interfaces are desirable because they promise hands-free, natural, and ubiquitous access to the interacting device (Solomon, 2005). However, Afaan Oromoo is one of the least supported and least researched languages in the world, although some remarkable work has contributed to text-to-speech synthesis for the Afaan Oromoo language (Solomon, 2005).
Work on an Amharic text-to-speech synthesis system stated that, once phonetic analysis is done, the final block of NLP is prosody generation, which is responsible for finding the correct intonation, stress, and duration from written text as prosodic features. Prosodic features are features that appear when sounds are put together in connected speech (Alula, 2010). Successful communication depends as much on intonation, stress, and rhythm as on the correct pronunciation of sounds; intonation, stress, and rhythm are prosodic features. Rule-based methods use manually produced rules extracted from utterance structures.
An Afaan Oromoo speech synthesis system was previously developed using a hidden Markov model method (Wosho, 2020). The HMM model relies on neighbour rules and can only process limited datasets. The researcher did not state or examine the acoustic and linguistic features used in the statistical parametric text-to-speech synthesis based on the Hidden Markov Model (HMM).
In NLP, several methods have been used for phone duration modelling. Linear regression models are based on the assumption that the features affecting segmental duration are linearly independent; linear regression models trained on small amounts of data cannot capture the dependencies among the extracted features. On the other hand, decision tree models, and in particular classification and regression tree models, which are based on binary splitting of the feature space, can represent the dependencies among the features. One phone duration model predicts segment duration as a sum of factors and their product terms that affect the duration (Yang, 2014). Such models can effectively extract the hidden internal structure of the data and use more powerful modelling capabilities to characterize it.
Deep learning is a branch of machine learning that trains models on large datasets using multiple layers, in contrast to single-layer feedforward neural networks. Deep learning, which can process large training datasets in a short time, has become an important method for text-to-speech systems. The HMM-based speech synthesis method maps linguistic features to probability densities of speech parameters with various decision trees. Deep learning based methods instead map linguistic features directly to acoustic features with a BLSTM-RNN, which has proven extraordinarily efficient at learning the inherent features of data. It is important for readers to understand the development of these deep learning methods. Deep learning based models have made significant progress in areas such as handwriting recognition, machine translation (Sutskever, Vinyals, & Le, 2014), speech recognition (Graves A. Mohamed, 2013), and speech synthesis (Zen & Alan, 2009).
Recurrent Neural Networks (RNNs) are a family of deep learning models well suited to pattern classification tasks whose inputs and outputs are sequences, for example speech recognition, speech synthesis, named-entity recognition, language modelling, and machine translation (M. S. Al-Radhi, 2017). The RNN method models speech-like sequential data by representing associations among neighbouring frames when training the duration model and the acoustic model. It can also exploit all the available input features to forecast the output features at each frame. In particular, the RNN model differs from the DNN in that it operates not only on the inputs but also on internal network states that are updated as a function of the entire input history. Training an RNN incorporates backpropagation through time. The RNN connections are able to map the utterance and learn the input datasets to train the acoustic sequence, whose purpose is to produce waveforms representing the speech in signal processing and to generate the desired predicted outputs (M. S. Al-Radhi, 2017).
Long short-term memory (LSTM) networks are a class of recurrent networks composed of units with a particular structure that copes better with the vanishing gradient problem during training of recurrent neural networks and maintains potential long-distance dependencies (M. S. Al-Radhi, 2017). This work focuses on linguistics adapted to technology: text-to-speech communicates information audibly to the user through digital audio, for developing a speech synthesizer in natural language processing. Evaluating performance in terms of intelligibility and naturalness motivated us to investigate a text-to-speech synthesizer for the Afaan Oromoo language. In this research, therefore, the training datasets are used to extract the linguistic features of the Afaan Oromoo language.
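As a minimal, hedged illustration of this bidirectional recurrence, the short PyTorch sketch below (layer sizes are arbitrary and this is not the network used in this study) shows how a bidirectional LSTM layer gives every frame access to both past and future context.

# Minimal PyTorch illustration of bidirectional recurrence (sizes are arbitrary, for illustration only).
import torch
import torch.nn as nn

blstm = nn.LSTM(input_size=8, hidden_size=4, batch_first=True, bidirectional=True)
x = torch.randn(1, 10, 8)        # one sequence of 10 frames with 8 features each
h, _ = blstm(x)
# Each frame's output concatenates a forward pass (past context) and a backward pass (future context),
# so the feature size per frame is 2 * hidden_size.
print(h.shape)                   # torch.Size([1, 10, 8])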
2. Related Work
In this section, we present a review of a number of speech synthesizers developed with different approaches. The deep learning approach is one of the advanced natural language processing approaches for text-to-speech synthesis; it extracts the hidden internal structure of data and uses more powerful modelling capabilities to characterize the data. Concerning the Afaan Oromoo language, the previous work reviewed shows that text-to-speech synthesis for Afaan Oromoo has not yet been methodically explored using a deep learning approach. In this work the deep learning approach with a BLSTM-based RNN is used for Afaan Oromoo text-to-speech synthesis, and the model is used to synthesize from fully context-labelled (speech, text) pairs, covering phone mapping, duration (linguistic) modelling, acoustic modelling, and speech generation for the Afaan Oromoo language, as shown in Table 1 below.
3. Research Methodology
This section presents the methodology and architecture of the Afaan Oromoo text-to-speech synthesis system. It explains the whole design, with a representation and description of the components. The design architecture makes it easy to understand the deep learning approach to text-to-speech synthesis for Afaan Oromoo. In the training phase, a text corpus passes through the text analysis process (tokenization, normalization, and linguistic feature extraction). The extracted features are used as input for the duration model, while acoustic features are extracted from the speech corpus. The acoustic model takes as input the linguistic features together with the duration information generated by the duration model. The output of the acoustic model is used as input for the vocoder to generate the final speech. For speech generation, the duration model and the acoustic features are extracted, and the extracted features are then used as input for duration model and acoustic model training. Finally, the synthesized speech is evaluated. The architecture of text-to-speech using a BLSTM-based RNN for the Afaan Oromoo language is illustrated in Figure 1.
The text analysis step is used to extract linguistic features, and the extracted features are used as input for the duration model. The duration of each phone is first predicted with the duration model, using the pre-trained model. The duration model is then used to predict the timestamps of each phoneme, mapping the duration of each phone to be synthesized. Abbreviations and acronyms are not pronounced as they are written. The first task for the raw input is therefore to replace abbreviations and acronyms with their expanded forms according to the rules of the Afaan Oromoo language. A Python script performs this normalization for the common Afaan Oromoo abbreviations whose expansions are well known. For text corpus preparation for linguistic feature extraction, each wave (.wav) file requires a text (.txt) file with exactly the same name that contains exactly the text that was recorded in the speech corpus. Text-to-speech (TTS) synthesis is the conversion of text into speech; a TTS system starts with text analysis, where the input text is transcribed into a linguistic representation. In this TTS stage, the input sentence is segmented into tokens.
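The sketch below is a hedged example of this kind of abbreviation expansion in Python; the dictionary entries and the sample sentence are illustrative only and are not the full mapping used in the study.

# Illustrative normalization of Afaan Oromoo abbreviations (entries and sentence are examples,
# not the study's complete abbreviation table).
import re

ABBREVIATIONS = {
    "fkn": "fakkeenyaaf",        # "for example"  (illustrative entry)
    "kkf": "kan kana fakkaatan", # "and so on"    (illustrative entry)
}

def expand_abbreviations(text):
    def repl(match):
        word = match.group(0)
        # Replace whole-word abbreviations only; everything else is kept unchanged.
        return ABBREVIATIONS.get(word.lower(), word)
    return re.sub(r"\b\w+\b", repl, text)

print(expand_abbreviations("Barreeffama kana fkn ilaalaa."))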
Utterance structures are central when synthesizing speech with the Festival speech synthesis system, which transcribes text into phonetics. They hold the relevant properties of the utterance to be synthesized, in the sense that they are specifications of the desired utterance. Instead of creating an utterance structure by synthesizing some text, we can create an utterance structure from a speaker's natural utterance, taking all properties from the speaker's utterance rather than predicting them from text. We then have an utterance that looks like a synthesized utterance but is guaranteed to be a valid natural utterance with correct phonetic properties.
An utterance structure consists of many items, such as single words, syllables, phones, and phrases, and the items are connected through several relations. Exactly which relations are present depends on the synthesis method, among other things, but some are always present, such as the word relation connecting word items, the syllable relation connecting syllables, the segment relation connecting phones, and the syllable-structure relation connecting the items.
The Festival system requires generated utterance structures, and a script is provided to create utterance structures from the text corpus, linking the hierarchical relations by their time information (the script looks at the end times of items in lower relations to decide which items in a higher relation must be their parent). However, these label files also need punctuation marks. Punctuation is represented by Festival as a feature at the token level within utterances. The aligner does not differentiate between words and tokens; the word label files created by the aligner are actually tokens in the Festival sense. So, in order to have punctuation at the token level in the Festival system, that information must be in the word label files. We were able to make use of these features, and the scripts for converting to utterances create one utterance per word label file. Unfortunately, librivox-style recordings are usually quite long, and it is not possible to cut them into reasonably small pieces for the Festival system before running the aligner, because the text (.txt) files would then have to be segmented by hand to match. For the speech corpus files, all the short wave files are moved into the wav directory within /home/soro/merlin/Text2Speech/AO_Speech_Syntesis/soro, together with the phonemic files and the word files, so that everything is in the same place. The raw audio prepared for training goes through the make-feature acoustic extraction step, producing mel-generalized cepstral coefficients (mgc), log fundamental frequency (lf0), and band aperiodicities (bap).
The generated files are composed from the extracted acoustic features, and the training file lists are generated by scripts. The full-context training labels are extracted from the text corpus for utterance building. The files are generated using the HTK tools for full-context labels, such as the monophone mapping and the full-context labels for the text-speech pairs; the list of generated files is placed under the master label files and model list files folders. The acoustic features extracted from the audio waveform by the make-feature step include the log fundamental frequency (lf0), representing the pitch, and the mel-generalized cepstral features, which describe the spectral parameters of the speech.
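A hedged sketch of this kind of acoustic feature extraction is shown below using pyworld and pysptk; the file path, analysis order, and warping coefficient are assumptions for a 16 kHz corpus rather than the study's exact settings.

# Hedged sketch: extracting mgc / lf0 / bap with pyworld and pysptk (paths and settings are assumptions).
import numpy as np
import pyworld
import pysptk
from scipy.io import wavfile

fs, x = wavfile.read("speech_corpus/utt0001.wav")    # hypothetical 16 kHz mono recording
x = x.astype(np.float64)

frame_period = 5.0                                    # ms, matching the 5 ms frame period used above
f0, timeaxis = pyworld.dio(x, fs, frame_period=frame_period)
f0 = pyworld.stonemask(x, f0, timeaxis, fs)           # refined F0 trajectory
sp = pyworld.cheaptrick(x, f0, timeaxis, fs)          # spectral envelope
ap = pyworld.d4c(x, f0, timeaxis, fs)                 # aperiodicity

mgc = pysptk.sp2mc(sp, 59, 0.58)                      # mel-generalized cepstral coefficients (order, alpha assumed)
bap = pyworld.code_aperiodicity(ap, fs)               # band aperiodicities
lf0 = np.log(np.maximum(f0, 1e-10))                   # log F0, with a floor for unvoiced frames

print(mgc.shape, bap.shape, lf0.shape)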
3.2.1 Text pre-processing
In text pre-processing, digits and numerals must be expanded into full words. Another task is to find the correct pronunciation for different contexts in the text. Speech synthesis text analysis turns the raw input text into pronounceable words. The punctuation marks contained in the texts, such as !"#$%&()*+,-./:;?@[\]^_{|}~, together with tab and newline characters, are cleaned using the string-processing functionality available in deep learning pipelines.
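The following minimal Python sketch shows one way this punctuation cleaning can be done; the sample sentence is illustrative.

# Minimal sketch of stripping punctuation and collapsing whitespace before phonetization.
import string

def clean_text(text):
    # Remove the punctuation characters listed above and collapse spaces, tabs, and newlines.
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.translate(table).split())

print(clean_text('Akkam, jirta? "Nagaa dha!"\n'))   # -> 'Akkam jirta Nagaa dha'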
We construct the datasets of linguistic, duration, and acoustic features in advance, because computing features from the label files on demand hurts performance; in particular, the duration model features are extracted with Python scripts and bash scripts (run as ./filename.sh) in the following steps:
Step 1. Prepare the corpus so that each text has a matching speech file with the same file name, keeping texts and audio in separate folders. This pre-processing of text-speech pairs makes them easy for the machine learning pipeline to consume during linguistic (text) analysis; a pairing sketch is given after this list.
Step 2. From the saved text-speech pairs, automatically create the label phoneme alignments and state alignments using the HTK tools and the Festival front-end tools.
Step 3. Create the duration model and the acoustic model.
Step 4. Train the duration model and train the acoustic model.
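The sketch below illustrates Step 1 only: pairing each text file with the same-named wave file. The directory names are assumptions for illustration.

# Hedged sketch of Step 1: pairing .txt files with same-named .wav files (directory names assumed).
from pathlib import Path

text_dir = Path("corpus/txt")
wav_dir = Path("corpus/wav")

pairs = []
for txt_path in sorted(text_dir.glob("*.txt")):
    wav_path = wav_dir / (txt_path.stem + ".wav")
    if wav_path.exists():                      # keep only utterances that have both text and audio
        pairs.append((txt_path, wav_path))
    else:
        print(f"missing audio for {txt_path.name}")

print(f"{len(pairs)} text/speech pairs prepared")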
3.4.1 Spectrogram
The first spectrograms generated for the analysis of the synthesized speech were produced with the python speech parameter toolkit (pysptk), with librosa used for feature representation. The pysptk workflow includes windowing, mel-generalized cepstrum analysis, spectral envelope estimation and visualization, and F0 estimation: the spectral parameters are estimated and the spectral envelope estimate is visualized. The first option for representing audio features is the spectrogram, a two-dimensional tensor in which one dimension indexes the time frames and the other indexes the frequency bins.
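The short sketch below shows one hedged way to compute and display such a spectrogram with librosa; the file name, FFT size, and hop length are assumptions, not the study's exact parameters.

# Hedged sketch: log-magnitude spectrogram of a synthesized sample (file name and STFT settings assumed).
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("generated/utt0001.wav", sr=16000)
S = librosa.stft(y, n_fft=1024, hop_length=80)            # 5 ms hop at 16 kHz
S_db = librosa.amplitude_to_db(np.abs(S), ref=np.max)     # log-magnitude spectrogram

plt.figure(figsize=(10, 4))
librosa.display.specshow(S_db, sr=sr, hop_length=80, x_axis="time", y_axis="hz")
plt.colorbar(format="%+2.0f dB")
plt.title("Spectrogram of a synthesized Afaan Oromoo utterance")
plt.tight_layout()
plt.show()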
Perceptual features describe how a sound is heard by listeners, such as its pitch and timbre. In contrast, physical features are properties that capture mathematical, statistical, and physical properties of audio signals. The most commonly used acoustic features in speech synthesis are the Mel Frequency Cepstral Coefficients. The acoustic features are often used as a low-level audio representation to bridge the synthesizer and the vocoder in the back end of a TTS system. In this study, we used Mel Frequency Cepstral Coefficients and linear-scale frequency spectrograms as an intermediate acoustic feature representation. To be specific, the output of the feature predictor and the conditional input of the vocoder are sequences of mel and linear frequency spectrograms. The frame period is set to five milliseconds, and zero frames of the spectrogram with small power are trimmed to give a good visualization using pyworld (Python WORLD).
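Below is a hedged librosa sketch of this mel-scale representation; the number of mel bands, FFT size, and hop length are assumptions rather than the study's exact configuration.

# Hedged sketch: mel spectrogram and MFCCs as intermediate acoustic features (settings assumed).
import numpy as np
import librosa

y, sr = librosa.load("speech_corpus/utt0001.wav", sr=16000)

# 80-band mel spectrogram with a 5 ms hop as a low-level representation on the vocoder side.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=80, n_mels=80)
log_mel = np.log(np.maximum(mel, 1e-10))

# 13 mel-frequency cepstral coefficients per frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=1024, hop_length=80)

print(log_mel.shape, mfcc.shape)   # (80, frames), (13, frames)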
Figure 9 shows the training and test losses on the datasets as the number of epochs (iterations) increases; the curves indicate good training behaviour.
Table 2: Duration Model using Phone Align and State Align

Labels        Learning Rate   Validation RMSE         Correlation (CORR)   Test RMSE
Phone align   0.002           6.777 frames/phoneme    0.633                7.665 frames/phoneme
State align   0.002           6.826 frames/phoneme    0.624                7.840 frames/phoneme
Table 2 summarizes the duration models trained on label phone alignments and on state-level alignments of the text-speech pairs, evaluated with RMSE and correlation on the training, validation, and test data at a learning rate of 0.002. Across all data sources, the phone-aligned labels, with a correlation of 63.3%, were preferred.
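For clarity, the sketch below shows how the RMSE (in frames per phoneme) and the correlation reported in Table 2 can be computed from predicted and reference phone durations; the duration arrays here are purely illustrative.

# Minimal sketch of the duration-model metrics in Table 2 (arrays are illustrative).
import numpy as np

ref_frames = np.array([12.0, 7.0, 9.0, 15.0, 6.0])    # reference duration of each phone, in frames
pred_frames = np.array([10.5, 8.0, 9.5, 13.0, 7.0])   # durations predicted by the model

rmse = np.sqrt(np.mean((pred_frames - ref_frames) ** 2))
corr = np.corrcoef(pred_frames, ref_frames)[0, 1]

print(f"RMSE: {rmse:.3f} frames/phoneme, CORR: {corr:.3f}")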
3.5.2 Training Acoustic Model
Training the acoustic model went reasonably well, with the training and validation losses decreasing over time. However, the validation loss eventually stopped decreasing and started to increase, which means that the network had started to overfit the training set and that regularization techniques such as dropout are called for; predictions were obtained from the acoustic model once training on the text-speech pairs finished at epoch 25.
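One generic remedy for this behaviour, alongside dropout, is early stopping on the validation loss; the sketch below is a hedged, self-contained illustration with simulated loss values and is not the training script used in this work.

# Generic early-stopping sketch: stop once validation loss has not improved for `patience` epochs.
# The losses are simulated only to make the example runnable.
import random

def run_epoch(epoch):
    train_loss = 1.0 / (epoch + 1)                            # simulated: keeps falling
    val_loss = 0.5 + abs(epoch - 12) * 0.02 + random.uniform(0, 0.01)  # simulated: bottoms out, then rises
    return train_loss, val_loss

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(1, 26):
    train_loss, val_loss = run_epoch(epoch)
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0                    # improvement: reset the patience counter
    else:
        bad_epochs += 1
    if bad_epochs >= patience:
        print(f"stopping early at epoch {epoch} (best validation loss {best_val:.3f})")
        break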
The character set includes the single consonants and the seven double phonemes (qubee dachaa), so that the total number of phoneme entries for the Afaan Oromoo language is 33. The characters (phones) are mapped to indices in numerical (binary) form, following the character mapping in the files.
A limitation of standard RNN models is that they can only make use of previous context. In acoustic modelling, where whole utterances are transcribed at once, future context is also useful for predicting features. The BLSTM-based RNN provides this by making the recurrent model bidirectional, specifying the bidirectional option in the sample models. The deep learning model was trained as a BLSTM-based RNN, from the specified input through to the softmax output. When training finished, the trained duration model and acoustic model were saved to their paths in binary format.
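The sketch below is a hedged outline of such a BLSTM-based regressor and of saving the trained duration and acoustic models in binary form; the input size (416) and hidden size (256) follow the figures quoted in this paper, while the acoustic output size and the checkpoint paths are assumptions for illustration.

# Hedged sketch of a BLSTM-based regressor and of saving the trained models in binary format.
from pathlib import Path
import torch
import torch.nn as nn

class BLSTMRegressor(nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.blstm = nn.LSTM(in_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, out_dim)     # forward + backward states per frame

    def forward(self, x):              # x: (batch, frames, in_dim)
        h, _ = self.blstm(x)
        return self.out(h)

duration_model = BLSTMRegressor(in_dim=416, hidden_dim=256, out_dim=5)    # sizes quoted in the text
acoustic_model = BLSTMRegressor(in_dim=416, hidden_dim=256, out_dim=187)  # output size assumed for illustration

# After training, the parameters are written to disk in PyTorch's binary format (paths assumed).
Path("checkpoints").mkdir(exist_ok=True)
torch.save(duration_model.state_dict(), "checkpoints/duration_model.pth")
torch.save(acoustic_model.state_dict(), "checkpoints/acoustic_model.pth")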
5.2 Conclusion
Text-to-speech synthesis (TTS) means generating audio from input text. The Afaan Oromoo language greatly needs text-to-speech synthesis to support the development of the language and to transmit information. In this work, a first attempt at speech synthesis for the Afaan Oromoo language using a deep learning approach based on a BLSTM-RNN model is investigated. The model file name is long, but it encodes some very important information: the activation function used in the hidden layers is TANH, the hidden layer size is 1024, and the number of hidden layers is five. The dimensionality of the hidden layers is 1024, the number of input nodes (the dimensionality of the label phonemes in the model) is 416, and the number of output nodes (the dimensionality of the acoustic features to predict) is five. The dataset split used seven hundred training files, three validation files, and three test files. The buffer size of each block of data is 200000, the model file name is feed forward 6tanh, and the learning rate used was 0.002. From this configuration, training is set up automatically in Python, and the extracted features are prepared using Python scripts.
The purpose of the Python feature-preparation scripts is to extract the duration model and acoustic model features and convert them into binary format. The dimensions of the vectors created from the generated parameters are used to visualize utterance lengths for the linguistic and acoustic features in training and testing.
From the duration and acoustic plots of the utterances, the total number of utterances and frames is obtained. From these utterance plots we prepared the normalization statistics: X max, X min, and Y mean, together with the Y variance and Y scale, for both the duration and acoustic training utterance lengths (utt_length). We used PyTorch datasets to generate the acoustic model for the waveforms, using a recurrent neural network with the bidirectional option set to true in the LSTM with dimensions (416, 256); with this, the duration model and the acoustic model were generated.
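The following sketch illustrates, under assumed array shapes, the normalization statistics mentioned here: min-max scaling of the linguistic input features (X) and mean/scale normalization of the acoustic output features (Y).

# Hedged sketch of feature normalization (array shapes are illustrative, not the study's exact data).
import numpy as np

X = np.random.rand(1000, 416).astype(np.float32)   # frame-level linguistic features (illustrative)
Y = np.random.rand(1000, 187).astype(np.float32)   # frame-level acoustic features (illustrative)

X_min, X_max = X.min(axis=0), X.max(axis=0)
X_norm = (X - X_min) / np.maximum(X_max - X_min, 1e-8)      # min-max scaling to [0, 1]

Y_mean, Y_scale = Y.mean(axis=0), np.maximum(Y.std(axis=0), 1e-8)
Y_norm = (Y - Y_mean) / Y_scale                              # zero mean, unit variance

# The statistics (X_min, X_max, Y_mean, Y_scale) are stored so synthesis can denormalize predictions.
print(X_norm.shape, Y_norm.shape)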
Finally, sample speech was generated from the label state alignments and the label phone alignments, and the user can download the generated speech from the Jupyter Notebook Python scripts. The system can be used in message readers, teaching assistants, and tools to aid communication and learning for handicapped and impaired people. Developing the Afaan Oromoo speech synthesizer involved collecting text, preprocessing the text, preparing phonetically balanced sentences, recording the sentences, preparing an annotated speech database, and designing a prototype. For training, the text and speech corpora are first prepared manually for processing. Mel-cepstral coefficient parameters are obtained from the speech data using mel-cepstral analysis tools such as pyworld and pysptk. Then the HTK toolkit is used for automatic alignment, and the Festival toolkit front end is used as the context labeller for state-level alignment and label phone alignment. The text corpus and speech parameters are aligned to generate the linguistic (utterance) features. However, not every feature of the Afaan Oromoo language was considered, because that requires a great deal of time and deep linguistic work in creating the Afaan Oromoo phoneme descriptions. The Mean Opinion Score evaluation technique was used as a subjective test of the performance of the synthesized speech, with native-speaker evaluators listening to the recorded audio in their test. Thirteen sentences were used for the subjective test; the results are 3.76 and 3.77 out of 5 in terms of naturalness and intelligibility, respectively.
References
Balyan, A., & S. S. (2013). Speech synthesis: A review. International Journal of Engineering Research & Technology (IJERT), pp. 57-75.
Alem, F., & K. N. (2007). Text to speech for Bangla language using Festival. BRSC University, Bangladesh.
Allen, J., & H. M. (1987). From Text to Speech: The MITalk System. Cambridge University Press.
Alula. (2010). A generalized approach for Amharic text-to-speech system. Addis Ababa.
Barnwell, A. V. (1995, July). A mixed excitation LPC vocoder model for low bit rate speech coding. IEEE Transactions on Speech and Audio Processing, 3, 242-250.
Bluche, T. N. (2013). Tandem HMM with convolutional neural network for handwritten word recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2013), Vancouver, BC, Canada.
Cassia Valentin. (2013). Intelligibility enhancement of synthetic speech in noise. Ph.D. dissertation, University of Edinburgh.
Charpentier, M. (1990). Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Communication.
Christian Kratzenstein. (1779). The Danish scientist working at the Russian Academy of Sciences built the first talking machine. Russian Academy of Sciences.
Dutoit, T. (1997). A Short Introduction to Text-to-Speech. Dordrecht, Boston, London: Kluwer Academic Publishers.
Figueiredo, A. I. (2006). Automatically estimating the input parameters of formant-based speech synthesizers.
Flanagan, J. (1965). Speech Analysis, Synthesis, and Perception. Berlin: Springer.
Goubanova, O. T. (2000). Using Bayesian belief networks for model duration in text-to-speech systems. In Proceedings of ICSLP 2000, Beijing, China.
Graves, A., & Mohamed, A. G. (2013). Speech recognition with deep recurrent neural networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada.
Graves, A. (2012). Supervised Sequence Labelling with Recurrent Neural Networks. Berlin, Germany: Springer.
Graves, A., Jaitly, N., & Mohamed, A. (2013, December 8-12). Hybrid speech recognition with deep bidirectional LSTM. In Proceedings of the 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, Olomouc, Czech Republic.
Holmes, J. H. (2003). Speech Synthesis and Recognition. Taylor and Francis, New Fetter Lane, London EC4P 4EE.
Witten, I. H., & E. F. (2012). Practical Machine Learning Tools and Techniques of Data Mining.
Javidan, R. (2010). Concatenative synthesis of Persian language based on word, diphone and triphone databases. Persian.
Klatt, D. (1976). Linguistic uses of segmental duration in English: Acoustic and perceptual evidence. Journal of the Acoustical Society of America.
synthesis. In Proceedings of the Sixth European Conference on Speech Communication and Technology (EUROSPEECH'99), Budapest, Hungary.
Zen, H. S. (2013). Deep learning in speech synthesis. Vancouver, BC, Canada.
Zen, H. T., & Alan, W. (2009). Statistical parametric speech synthesis. Speech Communication.