NLPReport Phase 1
Abstract—This paper covers a literature review of neural TTS synthesis by comparing different vocoders. Our results show that High Fidelity GAN (HiFi-GAN) currently outperforms its main competitors, which are MelGAN among the non-autoregressive vocoders and the autoregressive WaveNet and WaveGlow, as HiFi-GAN achieves a Mean Opinion Score (MOS) of 4.36 against a ground-truth MOS of 4.45. The closest-performing network to HiFi-GAN is WaveNet, which has a MOS of 4.02.

Keywords—text-to-speech, accessibility, Tacotron2, Generative Adversarial Network, GAN, Word Embedding, CBOW, Skip-gram, TTS, Natural Language Processing, NLP, Gensim, Deep Learning
I. INTRODUCTION
There are 32.4 million blind people worldwide and another 191 million people visually impaired due to cataracts alone [1]. Elderly people struggle when reading screens, and visual impairment and blindness are considered among the most challenging accessibility domains for computer developers. Text-to-Speech (TTS) is one of the solutions that can help create better computer accessibility for everyone. However, TTS output often sounds unnatural and inconvenient to users, and in some languages, such as Arabic, the pronunciation is wrong. Hence, Generative Adversarial Networks can naturalize and improve the output, through High Fidelity GAN (HiFi-GAN) [2]. This paper provides a literature review of text-to-speech synthesis, a comparative study of the Skip-Gram and Continuous Bag of Words (CBOW) word embeddings, and a discussion of Nvidia's Tacotron2 and HiFi-GAN for TTS tasks.
II. LITERATURE REVIEW

A. Vocoders
Neural TTS does not produce synthesized speech directly. Instead, it outputs acoustic features. For instance, Nvidia's Tacotron2 outputs a mel spectrogram that needs to be converted to a raw waveform to be heard. One of the popular vocoders is WaveNet, an autoregressive model [3]. Autoregressive models are models that predict the future based on past data. The WaveNet architecture consists of dilated convolutions and Gated Activation Units (GLU). A dilated convolution is a convolution that enlarges the kernel's receptive field by inserting holes between neighboring elements. The GLU activation works by doubling the input layer and splitting it into two halves: the first half flows through normally with its weights and biases, while the second half goes through a sigmoid activation function. The final output is the element-wise multiplication of the two halves, as shown in formula (1). Lastly, WaveNet's building style is based on residual blocks, which let the output of layer $i$ skip ahead to the deeper layer $i + m$; together with skip channels, this fixes the vanishing-gradient problem of deep convolutional stacks, as shown in Figure 1.

$\mathrm{GLU}(x_i w_i + b_i) = (x_i w_i + b_i) \otimes \sigma(x_i w_i + b_i)$ (1)

Figure 1 WaveNet Architecture
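To make the gating in formula (1) concrete, the following minimal NumPy sketch applies a GLU to one toy vector; the shapes and names are illustrative and are not taken from any cited implementation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glu(x, w, b):
    # Project the input, then split the doubled output into two halves:
    # one half passes through linearly, the other becomes a sigmoid gate.
    h = x @ w + b
    linear, gate = np.split(h, 2)
    # Element-wise multiplication of the two halves, as in formula (1).
    return linear * sigmoid(gate)

rng = np.random.default_rng(0)
x = rng.standard_normal(8)       # toy input vector
w = rng.standard_normal((8, 8))  # projection to a doubled width of 2 * 4
b = np.zeros(8)
print(glu(x, w, b).shape)        # (4,)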
On the contrary, there are other, non-autoregressive models, such as HiFi-GAN, which is discussed in the next sub-section.

B. HiFi-GAN
HiFi-GAN, or High Fidelity GAN, is a Generative Adversarial Network model able to generate high-quality speech with high computational efficiency. This model achieved a higher Mean Opinion Score (MOS) than other speech synthesis models such as WaveNet, WaveGlow, and MelGAN. The architecture of the model consists of one generator and two discriminators, which are trained in parallel with two losses to improve the model performance and the generated speech [2].

• Generator
The input of the generator is a mel spectrogram. As shown in Figure 2, the generator uses transposed convolutions (ConvTranspose) to upsample the mel spectrogram until its length matches that of the raw waveform, as sketched below.

Figure 2 HiFi-GAN architecture
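As an illustration of that upsampling idea, this short PyTorch sketch stretches a batch of mel frames with a single transposed convolution; the channel counts, kernel size, and stride here are illustrative choices, not HiFi-GAN's actual configuration.

import torch
import torch.nn as nn

# One upsampling stage: each mel frame is stretched into 8 time steps.
# Stacking such stages until the product of the strides equals the hop
# size of the spectrogram would match the raw waveform length.
upsample = nn.ConvTranspose1d(
    in_channels=80,   # mel bands
    out_channels=40,  # feature channels after this stage (illustrative)
    kernel_size=16,
    stride=8,
    padding=4,        # (kernel_size - stride) // 2 keeps frames aligned
)

mel = torch.randn(1, 80, 100)  # (batch, mel bands, frames)
out = upsample(mel)
print(out.shape)               # torch.Size([1, 40, 800]): 8x longer in time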
• Training Loss
Both the generator and the discriminator have two different losses. The generator uses its loss to improve the generated audio sample based on the feedback of the discriminator. The aim of the discriminator is to maximize the loss so as to classify the output of the generator as fake, while the generator aims to minimize the loss, deceiving the discriminator into classifying the output as real.
The loss function of the discriminator is shown in formula (2) and that of the generator in formula (3), where $x$ denotes the audio clip and $s$ denotes the input (the mel spectrogram of the audio clip):

$\mathcal{L}_{\mathrm{Adv}}(D; G) = \mathbb{E}_{(x,s)}\left[(D(x) - 1)^2 + (D(G(s)))^2\right]$ (2)

$\mathcal{L}_{\mathrm{Adv}}(G; D) = \mathbb{E}_{s}\left[(D(G(s)) - 1)^2\right]$ (3)
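The following small PyTorch sketch evaluates these two least-squares losses on toy discriminator scores; it assumes, for simplicity, that the discriminator emits one score per clip, which flattens the multi-discriminator setup described above.

import torch

def discriminator_adv_loss(d_real, d_fake):
    # Formula (2): push D(x) toward 1 for real audio, D(G(s)) toward 0.
    return ((d_real - 1) ** 2).mean() + (d_fake ** 2).mean()

def generator_adv_loss(d_fake):
    # Formula (3): push D(G(s)) toward 1 so the sample passes as real.
    return ((d_fake - 1) ** 2).mean()

d_real = torch.rand(4)  # toy scores D(x) for a batch of real clips
d_fake = torch.rand(4)  # toy scores D(G(s)) for generated clips
print(discriminator_adv_loss(d_real, d_fake).item())
print(generator_adv_loss(d_fake).item())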
• Feature matching loss
The feature matching loss is used as an additional loss for the generator. It is the distance between the ground-truth audio and a generated sample in each feature space. The feature matching loss is shown in formula (4), where $T$ denotes the number of layers in the discriminator, $D_i$ denotes the features, and $N_i$ denotes the number of features in the $i$-th layer:

$\mathcal{L}_{\mathrm{FM}}(G; D) = \mathbb{E}_{(x,s)}\left[\sum_{i=1}^{T} \frac{1}{N_i} \lVert D_i(x) - D_i(G(s)) \rVert_1\right]$ (4)

The final loss is in formula five.
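A minimal sketch of formula (4) follows, assuming the discriminator exposes its intermediate activations as a list of per-layer tensors; the layer sizes are toy values.

import torch

def feature_matching_loss(real_feats, fake_feats):
    # Formula (4): per-layer mean L1 distance between the discriminator
    # features of the ground-truth audio and of the generated sample.
    loss = 0.0
    for d_real, d_fake in zip(real_feats, fake_feats):
        n_i = d_real.numel()  # N_i, the number of features in layer i
        loss = loss + torch.abs(d_real - d_fake).sum() / n_i
    return loss

real = [torch.randn(16), torch.randn(8), torch.randn(4)]  # T = 3 layers
fake = [torch.randn(16), torch.randn(8), torch.randn(4)]
print(feature_matching_loss(real, fake).item())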
B. Skip-Gram
Like CBOW, the skip-gram is a word-to-vector model. The difference is that it inverts the operation: it predicts a context given a word [6]. Intuitively, its architecture is the same as CBOW's but mirrored, as shown in Figure 4. In conclusion, CBOW is faster and extracts only one word given a context, while the skip-gram is slower and extracts a context given only one word, as illustrated below.
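To make the mirroring concrete, the toy sketch below enumerates the training pairs each model would form from the same window; the sentence and variable names are illustrative, not from the cited tutorial.

# For the same window, CBOW forms (context -> word) pairs, while
# skip-gram forms the mirrored (word -> context) pairs.
sentence = ["the", "quick", "brown", "fox", "jumps"]
window = 1

for i, word in enumerate(sentence):
    lo = max(0, i - window)
    hi = min(len(sentence), i + window + 1)
    context = [sentence[j] for j in range(lo, hi) if j != i]
    print("CBOW      :", context, "->", word)  # many inputs, one target
    for c in context:
        print("skip-gram :", word, "->", c)    # one input, many targets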
IV. IMPLEMENTATION DETAILS
In phase 1, only the word embedding is implemented. The implementation is done in Python 3 together with utility libraries, and it is available online at this link: Ahmed181532 and Ahmed181532.ipynb - Colaboratory (google.com)
A. Data Cleaning and Preprocessing
The data cleaning converts all letters to lower case and removes the stop words and punctuation from the string using Gensim, NLTK, and the string library. After that, the words are stemmed, i.e., reduced back to their roots. Finally, the data is cleaned and tokenized; now it is ready to be fed into the model.
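A minimal sketch of this pipeline follows; the choice of the Porter stemmer and the sample sentence are ours, since the text only names the Gensim, NLTK, and string libraries.

import string
from gensim.parsing.preprocessing import remove_stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize  # needs NLTK's "punkt" data downloaded once

def clean(text):
    text = text.lower()                                                # lower-case
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    text = remove_stopwords(text)                                     # Gensim stop-word list
    stemmer = PorterStemmer()
    return [stemmer.stem(tok) for tok in word_tokenize(text)]         # tokenize and stem

print(clean("The cats are running quickly over the fences."))
# e.g. ['cat', 'run', 'quickli', 'fenc']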
B. Gensim Library
The Gensim library offers very helpful tools that facilitate word embedding, and we used it to train both CBOW and Skip-Gram. The class used for training is "Word2Vec"; its first argument is the training data, and the trained vectors are afterwards accessed through the model's "wv" (word vectors) attribute. The other hyperparameters are the window size; the min count, i.e., words with a frequency lower than the given value are ignored; the vector size, which is the size of the embedding vectors; the number of epochs; and the number of workers for parallel processing. Finally, the last parameter is "sg": by default it is 0, i.e., it trains a CBOW model; otherwise, it trains a skip-gram model. A short illustrative call is sketched below.
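The sketch uses toy sentences and hyperparameter values of our own choosing rather than the ones from the notebook.

from gensim.models import Word2Vec

sentences = [["cat", "sat", "mat"], ["dog", "sat", "log"]]  # toy corpus

cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=1,
                workers=4, epochs=10, sg=0)  # sg=0 (default): CBOW
skip = Word2Vec(sentences, vector_size=100, window=5, min_count=1,
                workers=4, epochs=10, sg=1)  # sg=1: skip-gram

# The trained vectors live on the "wv" (word vectors) attribute.
print(cbow.wv["cat"].shape)                # (100,)
print(skip.wv.most_similar("cat", topn=2))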
VI. CONCLUSION
In conclusion, our literature review shows an advantage for the non-autoregressive vocoders over the autoregressive ones, based on comparing MelGAN and HiFi-GAN with WaveNet and WaveGlow. We concluded that HiFi-GAN outmatches its competitors in the MOS metric, which, being based on people's opinions, is a good indicator of naturalness. Tacotron2 is used as a pretrained model that outputs acoustic features, specifically a mel spectrogram, that can be supplied to the vocoder, which makes it a stepping stone for testing different vocoders. By default, Tacotron2 is connected to a WaveNet vocoder, which the literature review section shows is outperformed by HiFi-GAN. For further research, we advise changing the architecture of Tacotron2 so that it uses a HiFi-GAN vocoder instead of a WaveNet one.

REFERENCES
[1] Khairallah, Moncef, et al. "Number of people blind or visually impaired by cataract worldwide and in world regions, 1990 to 2010." Investigative Ophthalmology & Visual Science 56.11 (2015): 6762-6769.
[2] Kong, Jungil, Jaehyeon Kim, and Jaekyoung Bae. "HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis." Advances in Neural Information Processing Systems 33 (2020): 17022-17033.
[3] Oord, Aaron van den, et al. "WaveNet: A generative model for raw audio." arXiv preprint arXiv:1609.03499 (2016).
[4] Shen, Jonathan, et al. "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions." ICASSP 2018 - IEEE International Conference on Acoustics, Speech and Signal Processing (2018). https://doi.org/10.1109/ICASSP.2018.8461368
[5] Liu, Bing. "Text sentiment analysis based on CBOW model and deep learning in big data environment." Journal of Ambient Intelligence and Humanized Computing 11.2 (2020): 451-458.
[6] McCormick, Chris. "Word2Vec tutorial - The skip-gram model." Apr. 2016. [Online]. Available: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model
[7] Řehůřek, Radim, and Petr Sojka. "Gensim—Statistical semantics in Python." Retrieved from gensim.org (2011).