Neural Text-To-Speech Synthesis Literature Review

Ahmed Tammaa
Artificial Intelligence Major
The British University in Egypt
Cairo, Egypt
[email protected]

Ahmed Ramy
Artificial Intelligence Major
The British University in Egypt
Cairo, Egypt
[email protected]

Abstract—This paper presents a literature review of neural TTS synthesis by comparing different vocoders. Our results show that High Fidelity GAN (HiFi-GAN) currently outperforms its main competitors, which are MelGAN among the non-autoregressive vocoders, the flow-based WaveGlow, and the autoregressive WaveNet, as HiFi-GAN reaches a Mean Opinion Score (MOS) of 4.36 against a ground-truth score of 4.45. The closest-performing network to HiFi-GAN is WaveNet, with an MOS of 4.02.

Keywords—text-to-speech, accessibility, Tacotron2, Generative Adversarial Network, GAN, Word Embedding, CBOW, Skip-gram, TTS, Natural Language Processing, NLP, Gensim, Deep Learning

I. INTRODUCTION
There are 32.4 million blind people worldwide and another 191 million people who are visually impaired due to cataracts alone [1]. Elderly people also struggle when reading screens. Visual impairment and blindness are among the most challenging accessibility domains for software developers. Text-to-Speech (TTS) is one of the solutions that can help make computers accessible to everyone. However, synthesized TTS often sounds unnatural and inconvenient to users, and in some languages, such as Arabic, the pronunciation is often incorrect. Generative Adversarial Networks can naturalize and improve the output, notably through the High Fidelity GAN (HiFi-GAN) [2]. This paper provides a literature review of text-to-speech synthesis, a comparative study of the Skip-Gram and Continuous Bag of Words (CBOW) word embeddings, and a discussion of Nvidia's Tacotron2 and HiFi-GAN for TTS tasks.

II. LITERATURE REVIEW

A. Vocoders
A neural TTS model does not produce synthesized speech directly. Instead, it outputs acoustic features. For instance, Nvidia's Tacotron2 outputs a mel spectrogram that needs to be converted into a raw waveform before it can be heard; this conversion is the job of the vocoder. One of the popular vocoders is WaveNet, an autoregressive model [3]. Autoregressive models predict future samples based on past samples. The WaveNet architecture consists of dilated convolutions and gated activation units (GLU). A dilated convolution enlarges the receptive field of the kernel by inserting holes between neighboring elements. The gated activation works by splitting the layer output into two halves: the first half flows through normally with its weights and biases, while the second half goes through a sigmoid activation function. The final output is the element-wise multiplication of the two halves, as shown in formula one. Lastly, WaveNet is built from residual blocks, which allow the output of layer i to reach the deeper layer i + m; together with skip channels, this mitigates the vanishing-gradient problem of deep convolutional networks. The architecture is shown in Figure 1.

GLU(x_i·w_i + b_i) = (x_i·w_i + b_i) ⊗ σ(x_i·w_i + b_i)    (1)

Figure 1 WaveNet Architecture
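To make the dilated convolution and the gating of formula one concrete, the sketch below implements one WaveNet-style residual block in PyTorch. It is only an illustration of the mechanism described above, not the authors' or DeepMind's implementation; the channel count, kernel size, and the separate residual and skip projections are assumptions.

```python
import torch
import torch.nn as nn

class GatedResidualBlock(nn.Module):
    """Illustrative WaveNet-style block: dilated convolution, gated activation,
    and residual plus skip outputs."""

    def __init__(self, channels: int, dilation: int, kernel_size: int = 3):
        super().__init__()
        pad = (kernel_size - 1) * dilation // 2
        # A dilated convolution inserts (dilation - 1) holes between kernel taps,
        # enlarging the receptive field without adding parameters.
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size,
                              dilation=dilation, padding=pad)
        self.res = nn.Conv1d(channels, channels, 1)   # projection for the residual path
        self.skip = nn.Conv1d(channels, channels, 1)  # projection for the skip channels

    def forward(self, x):
        # Split the doubled output into two halves (formula 1): one half passes
        # through unchanged, the other is squashed by a sigmoid "gate", and the
        # two are multiplied element-wise. (WaveNet itself additionally applies
        # a tanh to the first half.)
        first, gate = torch.chunk(self.conv(x), 2, dim=1)
        gated = first * torch.sigmoid(gate)
        return x + self.res(gated), self.skip(gated)  # residual output, skip output

# Smoke test on a random 16-channel signal of 1,000 time steps.
block = GatedResidualBlock(channels=16, dilation=4)
out, skip = block(torch.randn(1, 16, 1000))
print(out.shape, skip.shape)  # both torch.Size([1, 16, 1000])
```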
On the contrary, there are also non-autoregressive models, such as HiFi-GAN, which is discussed in the next subsection.

B. HiFi-GAN
HiFi-GAN, or High Fidelity GAN, is a generative adversarial network that is able to generate high-quality speech with high computational efficiency. This model achieved a higher Mean Opinion Score (MOS) than other speech synthesis models such as WaveNet, WaveGlow, and MelGAN. The architecture of the model consists of one generator and two discriminators, which are trained in parallel with two losses to improve the model performance and the generated speech [2].

Figure 2 HiFi-GAN architecture

• Generator
The input of the generator is a mel spectrogram. As shown in Figure 2, the generator uses transposed convolutions (ConvTranspose) to upsample the mel spectrogram until its length matches the length of the raw waveform.
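The sketch below shows how a stack of ConvTranspose1d layers can perform that upsampling: each layer multiplies the time axis by its stride, so strides of 8, 8, 2, and 2 turn one spectrogram frame into 256 waveform samples (a common hop length). The channel sizes, kernel sizes, and activations are assumptions for illustration; the real HiFi-GAN generator additionally interleaves multi-receptive-field residual blocks between the upsampling layers.

```python
import torch
import torch.nn as nn

class UpsamplerSketch(nn.Module):
    """Mel spectrogram (batch, 80, frames) -> waveform (batch, 1, frames * 256)."""

    def __init__(self, mel_channels=80, base_channels=128, rates=(8, 8, 2, 2)):
        super().__init__()
        layers, ch = [nn.Conv1d(mel_channels, base_channels, 7, padding=3)], base_channels
        for r in rates:
            # Each transposed convolution stretches the time axis by a factor of r.
            layers += [nn.LeakyReLU(0.1),
                       nn.ConvTranspose1d(ch, ch // 2, kernel_size=2 * r,
                                          stride=r, padding=r // 2)]
            ch //= 2
        layers += [nn.LeakyReLU(0.1), nn.Conv1d(ch, 1, 7, padding=3), nn.Tanh()]
        self.net = nn.Sequential(*layers)

    def forward(self, mel):
        return self.net(mel)

# 100 mel frames are upsampled by 8 * 8 * 2 * 2 = 256 into 25,600 samples.
wav = UpsamplerSketch()(torch.randn(1, 80, 100))
print(wav.shape)  # torch.Size([1, 1, 25600])
```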



• Discriminator
Since audio clips consist of sinusoidal signals with various periods, these periods should be identified in order to generate realistic speech. To achieve this, the HiFi-GAN paper proposes the multi-period discriminator (MPD), which consists of small sub-discriminators, each of which looks at one periodic part of the raw waveform [2]. The MPD model was compared with the multi-scale discriminator (MSD) that was introduced in MelGAN.
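Each MPD sub-discriminator only sees samples that are a fixed period apart. A minimal sketch of that idea, assuming a single period p and reflection padding (in practice the sub-discriminators then run strided 2D convolutions over the reshaped signal):

```python
import torch
import torch.nn.functional as F

def to_period_grid(wav: torch.Tensor, period: int) -> torch.Tensor:
    """Reshape a waveform (batch, 1, T) into a grid (batch, 1, T // period, period)
    so that one axis of a 2D convolution steps through samples exactly `period`
    positions apart in the original signal."""
    b, c, t = wav.shape
    if t % period != 0:
        # Pad on the right so the length becomes a multiple of the period.
        pad = period - (t % period)
        wav = F.pad(wav, (0, pad), mode="reflect")
        t += pad
    return wav.view(b, c, t // period, period)

# A one-second clip at 22,050 Hz viewed with period 3.
grid = to_period_grid(torch.randn(1, 1, 22050), period=3)
print(grid.shape)  # torch.Size([1, 1, 7350, 3])
```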

• Training Loss
The generator and the discriminator each have their own loss. The generator uses its loss to improve the generated audio sample based on the feedback of the discriminator. The discriminator aims to maximize its ability to classify the output of the generator as fake, while the generator aims to minimize the adversarial loss in order to deceive the discriminator into classifying its output as real. The loss functions of the discriminator and the generator are shown in formulas two and three, where x denotes the audio clip and s denotes the input mel spectrogram of that audio clip.

ℒ_Adv(D; G) = 𝔼_(x,s)[(D(x) − 1)² + (D(G(s)))²]    (2)

ℒ_Adv(G; D) = 𝔼_s[(D(G(s)) − 1)²]    (3)
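In code, the least-squares adversarial objectives of formulas two and three reduce to a few lines; the sketch below approximates the expectations with batch means, and the function and variable names are hypothetical.

```python
import torch

def discriminator_adv_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    # Formula (2): push D(x) toward 1 for real clips and D(G(s)) toward 0 for generated ones.
    return torch.mean((d_real - 1.0) ** 2) + torch.mean(d_fake ** 2)

def generator_adv_loss(d_fake: torch.Tensor) -> torch.Tensor:
    # Formula (3): the generator tries to make D(G(s)) look real, i.e. close to 1.
    return torch.mean((d_fake - 1.0) ** 2)

# Dummy discriminator outputs for a batch of four clips.
d_real, d_fake = torch.rand(4), torch.rand(4)
print(discriminator_adv_loss(d_real, d_fake).item(), generator_adv_loss(d_fake).item())
```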
• Mel Spectrogram Loss
The mel spectrogram loss helps the generator produce a realistic waveform that corresponds to the input. The loss is the distance between the mel spectrogram of the generated waveform and that of the ground-truth waveform. The mel spectrogram loss is shown below, where ϕ is a function that converts a waveform into its mel spectrogram.

ℒ_Mel(G) = 𝔼_(x,s)[‖ϕ(x) − ϕ(G(s))‖₁]

• Feature Matching Loss
The feature matching loss is used as an additional loss for the generator. It is the distance between the ground-truth audio and a generated sample in each feature space of the discriminator. The feature matching loss is shown in formula four, where T denotes the number of layers in the discriminator, D_i denotes the features of the i-th layer, and N_i denotes the number of features in the i-th layer. The final losses are given in formula five.

ℒ_FM(G; D) = 𝔼_(x,s)[ Σ_{i=1..T} (1/N_i) ‖D_i(x) − D_i(G(s))‖₁ ]    (4)

ℒ_G = ℒ_Adv(G; D) + λ_fm ℒ_FM(G; D) + λ_mel ℒ_Mel(G)
ℒ_D = ℒ_Adv(D; G)    (5)

Since the model has sub-discriminators in both the MPD and the MSD, formulas four and five are rewritten with respect to the sub-discriminators, where D_k denotes the k-th sub-discriminator.

ℒ_G = Σ_k [ ℒ_Adv(G; D_k) + λ_fm ℒ_FM(G; D_k) + λ_mel ℒ_Mel(G) ]    (6)

ℒ_D = Σ_k ℒ_Adv(D_k; G)    (7)
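A compact sketch of how formulas four through six might be assembled for one training step. The per-layer feature lists, the mel transform, and the weights λ_fm = 2 and λ_mel = 45 (the values reported in [2]) are supplied from outside; everything else here is an illustrative assumption rather than the paper's implementation.

```python
import torch

def feature_matching_loss(feats_real, feats_fake):
    # Formula (4): mean L1 distance between the discriminator features of the
    # real and the generated audio, accumulated over every layer.
    loss = torch.tensor(0.0)
    for fr, ff in zip(feats_real, feats_fake):
        loss = loss + torch.mean(torch.abs(fr - ff))
    return loss

def generator_total_loss(adv_losses, fm_losses, mel_real, mel_fake,
                         lambda_fm=2.0, lambda_mel=45.0):
    # Formula (6): adversarial and feature-matching terms summed over the
    # sub-discriminators, plus the weighted mel-spectrogram L1 term.
    mel_loss = torch.mean(torch.abs(mel_real - mel_fake))
    return sum(adv_losses) + lambda_fm * sum(fm_losses) + lambda_mel * mel_loss
```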
Results of the HiFi-GAN
The HiFi-GAN model was trained on two datasets, the LJSpeech dataset and the VCTK multi-speaker dataset. The LJSpeech dataset consists of 13,100 audio clips from a single speaker. The Mean Opinion Score (MOS) was used for evaluating the performance; fifty audio clips were selected randomly from the LJSpeech dataset. As shown in Table 1, three variants of HiFi-GAN were evaluated, each trained with different hyperparameters. The three variants of the HiFi-GAN model outperformed all other models.

Model           MOS            Speed on CPU (kHz)   Speed on GPU (kHz)    # Param (M)
Ground Truth    4.45 (±0.06)   -                    -                     -
WaveNet (MoL)   4.02 (±0.08)   -                    0.07 (×0.003)         24.73
WaveGlow        3.81 (±0.08)   4.72 (×0.21)         501 (×22.75)          87.73
MelGAN          3.79 (±0.09)   145.52               14,238 (×645.73)      4.26
HiFi-GAN V1     4.36 (±0.07)   31.74 (×1.43)        3,701 (×167.86)       13.92
HiFi-GAN V2     4.23 (±0.07)   214.97 (×9.74)       16,863 (×764.80)      0.92
HiFi-GAN V3     4.05 (±0.08)   296.38 (×13.44)      26,169 (×1,186.80)    1.46
Table 1. Comparison of MOS and synthesis speed between models

C. Tacotron2
This paper introduces a neural network architecture called
Tacotron2, developed by Nvidia. The model is used to
synthesize speech from text by generating a mel spectrogram from the input text with an encoder-decoder architecture; this is done by mapping character embeddings to mel-scale spectrograms. Then, as shown in Figure 3, a WaveNet model uses the mel spectrogram to synthesize time-domain waveforms; in other words, it transforms the mel spectrogram into speech audio. Tacotron2 achieved a Mean Opinion Score (MOS) of 4.53, compared to 4.58 for professionally recorded speech. The aim of Tacotron2 is to synthesize high-quality speech that cannot be easily distinguished from human speech [4].

Figure 3 TacoTron2 Architecture

As shown in Figure 3, a learned 512-dimensional character embedding is used to represent the input text; it is passed to 3 convolution layers, where each layer contains 512 filters. In addition, batch normalization is used, followed by ReLU activations. After the 3 convolution layers, the output is passed through a bidirectional LSTM (Long Short-Term Memory) layer containing 512 units (256 in each direction) to generate the encoded features. A bidirectional LSTM can be described as duplicating the recurrent layer in reverse, so the input can be read in both the forward and the backward directions, which increases the amount of information available to the network.

The decoder is an autoregressive recurrent neural network. First, the prediction from the previous time step is passed through 2 fully connected layers with 256 hidden ReLU units, defined as the pre-net. The output of the pre-net and the attention context are concatenated and passed to 2 LSTM layers. Finally, the concatenation of the output of the 2 LSTM layers and the attention context is passed through a linear transform to predict the mel spectrogram [4].
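The encoder path just described (character embedding, three convolutional layers with batch normalization and ReLU, and one bidirectional LSTM) can be condensed into the following PyTorch sketch. The vocabulary size, kernel size, and batch-first tensor layout are assumptions; this mirrors the description above rather than Nvidia's released implementation.

```python
import torch
import torch.nn as nn

class Tacotron2EncoderSketch(nn.Module):
    def __init__(self, vocab_size=100, emb_dim=512, num_convs=3, kernel_size=5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)   # learned 512-dim character embedding
        convs = []
        for _ in range(num_convs):                            # 3 conv layers, 512 filters each,
            convs += [nn.Conv1d(emb_dim, emb_dim, kernel_size, padding=kernel_size // 2),
                      nn.BatchNorm1d(emb_dim),                # batch norm followed by ReLU
                      nn.ReLU()]
        self.convs = nn.Sequential(*convs)
        # Bidirectional LSTM: 256 units per direction -> 512-dim encoded features.
        self.lstm = nn.LSTM(emb_dim, emb_dim // 2, batch_first=True, bidirectional=True)

    def forward(self, char_ids):                  # char_ids: (batch, text_len)
        x = self.embedding(char_ids)              # (batch, text_len, 512)
        x = self.convs(x.transpose(1, 2))         # Conv1d expects (batch, channels, time)
        encoded, _ = self.lstm(x.transpose(1, 2))
        return encoded                            # (batch, text_len, 512)

enc = Tacotron2EncoderSketch()
print(enc(torch.randint(0, 100, (2, 40))).shape)  # torch.Size([2, 40, 512])
```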
III. WORD EMBEDDING
Word embedding solves the problem of creating an efficient, learnable representation of the relationships between words. It converts plain text into n-dimensional vectors. This step makes the words numerical, so mathematical relationships such as similarity can be defined between them and measured by a distance. One commonly used distance is the Euclidean distance, shown in formula eight.

d(x, y) = √( Σ_i (x_i − y_i)² )    (8)

However, the Euclidean distance is very sensitive to magnitude. For example, two documents can be similar, but if their sizes differ greatly, that dimension will add a considerable cost to the distance. Hence, cosine similarity is a good solution, since it calculates the cosine of the angle between the two vectors; the angle gives a more accurate representation of the similarity. The cosine similarity is defined in formula nine.

Similarity(A, B) = (A · B) / (‖A‖ ‖B‖)    (9)
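A small NumPy sketch of formulas eight and nine; the two example vectors are arbitrary and chosen so that the contrast between the two measures is visible.

```python
import numpy as np

def euclidean_distance(x: np.ndarray, y: np.ndarray) -> float:
    # Formula (8): square root of the summed squared coordinate differences.
    return float(np.sqrt(np.sum((x - y) ** 2)))

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Formula (9): dot product normalized by the vector magnitudes.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
print(euclidean_distance(a, b))  # 3.74... : grows with the difference in magnitude
print(cosine_similarity(a, b))   # 1.0     : same direction, maximal similarity
```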
A. Continuous Bag of Words
CBOW is a model that predicts a word based on a given context. It predicts the probability of the center word of the input window from the surrounding words. The architecture is simple, consisting of only three layers (only one of them hidden). Thus, it learns fast, as there are fewer trainable parameters compared with skip-gram. CBOW is used whenever the context is available and a word is missing [5]. The architecture of CBOW is visualized in Figure 5.

Figure 5 CBOW Architecture

B. Skip-Gram
Like CBOW, skip-gram is a word-to-vector model. The difference is that it inverts the operation: it predicts a context given a word [6]. Intuitively, its architecture is the same as CBOW but mirrored, as shown in Figure 4. In summary, CBOW is faster and extracts a single word from a given context, while skip-gram is slower and extracts a context from a single given word.

Figure 4 Skip-gram and CBOW architecture side-by-side
IV. IMPLEMENTATION DETAILS
In phase 1, only the word embedding is implemented. The implementation is done in Python 3 supplied with utility libraries. The implementation is available online at this link: Ahmed181532 and Ahmed181532.ipynb - Colaboratory (google.com).

A. Data Cleaning and Preprocessing
The data cleaning consists of converting all letters to lower case and removing the stop words and punctuation from the string using Gensim, NLTK, and the string library. After that, the words are stemmed, i.e., reduced to their roots. Finally, the data is cleaned and tokenized and is ready to be fed into the model.
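A sketch of this preprocessing pipeline using NLTK and Python's string module. The stemmer choice (Porter) and the exact order of steps are assumptions; the authors' notebook may differ.

```python
import string

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# nltk.download("punkt") and nltk.download("stopwords") are needed once beforehand.
STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(text: str) -> list:
    text = text.lower()                                               # lower-case everything
    text = text.translate(str.maketrans("", "", string.punctuation))  # strip punctuation
    tokens = word_tokenize(text)                                      # tokenize
    return [STEMMER.stem(t) for t in tokens if t not in STOP_WORDS]   # drop stop words, stem

print(preprocess("Printing, in the only sense with which we are at present concerned."))
# ['print', 'sens', 'present', 'concern']
```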
B. Gensim Library
The Gensim library offers very helpful tools that facilitate word embedding. We used this library to train both CBOW and Skip-Gram; the trained vectors are exposed through its "wv" (word vectors) interface. The function used for training is "Word2Vec", and the main parameter it takes is the data. The other hyperparameters are the window size, the min count (words with a lower frequency are ignored), the vector size (the size of the embedding vectors), the number of epochs, and the number of workers for parallel processing. Finally, the last parameter is "sg"; by default it is 0, i.e., a CBOW model is trained, otherwise a skip-gram model is trained. Besides, we used most_similar, a function that returns words similar to the input, and the cosine similarity function, which gives a similarity index between two given words [7].
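A minimal Gensim sketch of the training call and the query functions mentioned above, assuming Gensim 4.x (where the embedding size is named vector_size) and a toy corpus of already cleaned, tokenized sentences:

```python
from gensim.models import Word2Vec

# Toy corpus; in the real pipeline each entry is a preprocessed transcript line.
sentences = [
    ["print", "book", "letter", "type"],
    ["speech", "voic", "sound", "audio"],
    ["book", "letter", "print", "page"],
]

model = Word2Vec(
    sentences,
    vector_size=100,   # dimensionality of the embedding vectors
    window=5,          # context window size
    min_count=1,       # ignore words rarer than this
    workers=4,         # parallel worker threads
    epochs=50,
    sg=0,              # 0 trains CBOW (the default); 1 would train skip-gram
)

print(model.wv.most_similar("print", topn=3))  # words most similar to "print"
print(model.wv.similarity("book", "letter"))   # cosine similarity between two words
```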
V. DATASET
The dataset used for the word embedding is the LJSpeech 1.1 dataset, which contains audio files with their respective transcripts. It has a total of 13,100 short audio clips spoken by the same person, with a combined duration of about 24 hours. Each short passage was extracted from non-fiction books. Our word embedding results on this dataset were unsatisfactory: the corpus is small and the available words are largely unrelated to each other. Thus, neither model, Skip-gram nor CBOW, produced reasonable results in our analogy analysis. We conclude that we need a larger dataset with a bigger corpus to obtain reasonable analogies and word relationships.

VI. CONCLUSION
In conclusion, our literature review shows an advantage for the non-autoregressive vocoders over the autoregressive vocoders, based on comparing the MelGAN and HiFi-GAN architectures with WaveNet and WaveGlow. We conclude that HiFi-GAN outmatches its competitors in the MOS metric, which is based on people's opinions and is therefore a good indicator of naturalness. Tacotron2 is used as a pretrained model that outputs acoustic features, specifically a mel spectrogram, that can be supplied to the vocoder; it can therefore be considered a stepping stone for testing different vocoders. By default, Tacotron2 is connected to a WaveNet vocoder, which, as shown in the literature review section, is outperformed by HiFi-GAN. For further research, we advise changing the Tacotron2 pipeline so that a HiFi-GAN vocoder is used instead of the WaveNet vocoder.

REFERENCES
[1] M. Khairallah et al., "Number of people blind or visually impaired by cataract worldwide and in world regions, 1990 to 2010," Investigative Ophthalmology & Visual Science, vol. 56, no. 11, pp. 6762-6769, 2015.
[2] J. Kong, J. Kim, and J. Bae, "HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis," Advances in Neural Information Processing Systems, vol. 33, pp. 17022-17033, 2020.
[3] A. van den Oord et al., "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.
[4] J. Shen et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018. doi: 10.1109/ICASSP.2018.8461368.
[5] B. Liu, "Text sentiment analysis based on CBOW model and deep learning in big data environment," Journal of Ambient Intelligence and Humanized Computing, vol. 11, no. 2, pp. 451-458, 2020.
[6] C. McCormick, "Word2Vec tutorial - The skip-gram model," Apr. 2016. [Online]. Available: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model
[7] R. Řehůřek and P. Sojka, "Gensim - statistical semantics in Python," retrieved from gensim.org, 2011.
