Wijethunga 2020
Abstract—Recent advancements in deep learning and related technologies have led to improvements in various areas such as computer vision, bio-informatics, and speech recognition. This research focuses on a problem involving synthetic speech and speaker diarization. Developments in audio have produced deep learning models capable of replicating natural-sounding voices, also known as text-to-speech (TTS) systems. This technology can be manipulated for malicious purposes such as deepfakes, impersonation, or spoofing attacks. We propose a system capable of distinguishing between real and synthetic speech in group conversations. We built Deep Neural Network models and integrated them into a single solution using different datasets, including but not limited to UrbanSound8K (5.6 GB), Conversational (12.2 GB), AMI-Corpus (5 GB), and FakeOrReal (4 GB). Our proposed approach consists of four main components. The speech-denoising component cleans and preprocesses the audio using Multilayer Perceptron and Convolutional Neural Network architectures, with 93% and 94% accuracy respectively. Speaker diarization was implemented using two different approaches: Natural Language Processing on converted text, with 93% accuracy, and a Recurrent Neural Network model for speaker labeling, with 80% accuracy and a Diarization Error Rate of 0.52. The final component distinguishes between real and fake audio using a CNN architecture with 94% accuracy. With these findings, this research contributes to the domain of speech analysis.

Index Terms—Deep Neural Networks, Natural Language Processing, Speaker Diarization, Deepfake, Deep learning

I. INTRODUCTION

Machine-generated voices are common in our day-to-day lives. With the automation of technology, people increasingly prefer to control daily tasks by voice. Machine-generated, or synthetic, voice is widely used in virtual assistants such as Google Assistant, Alexa, Siri, and Bixby. Despite the advantages of such technologies, people such as celebrities, politicians, and other famous persons are victimized. The major concern is an advanced technology called Deepfake, which mostly uses Generative Adversarial Networks (GANs) [1] to generate synthetic audio that impersonates real people.

Studies have documented worst cases in which Deepfake audio was used as a piece of crucial evidence to let criminals go free. Introducing Deepfake audio into a group conversation is the next level of this crime. In a group conversation, multiple speakers' voices are mixed together. Previous research has clearly shown how a Deepfake image or Deepfake video can be identified; the challenge now is to detect Deepfake audio within a conversation.

Deepfake audio detection starts with audio signal processing [2]. An audio signal is a representation of sound, and through signal processing mechanisms its spectrogram and wavelength are computed. Following previous research, the audio signal should be processed in such a way that all speakers are diarized before Deepfake detection. Preprocessing of the audio aims to reduce noise and other unwanted factors in the signal.

Natural Language Processing deals with the text retrieved from the audio. Both classification and clustering mechanisms are used to reach an acceptable diarization accuracy. According to previous studies, the outcome of NLP-based diarization can be integrated with Machine Learning [3] to cluster the speakers with improved accuracy.

The diarized content can then be used in the detection phase. In terms of synthetic speech detection, the research problem concerns the drawbacks of synthetic speech in more detail. For example, a TTS system can be trained on a targeted individual's voice; once trained, it could be used for impersonation attacks. Therefore, to detect these malicious utterances when authenticity is required, there must be a mechanism to tell the difference.
The aim is to build a DNN model which can extract the dynamic acoustic features [4] of the voice and determine whether a given audio clip is artificially generated or real human speech. CNN and RNN architectures will be used in the development of the model. CNNs have proven to be highly effective in computer vision and image processing, and therefore the spectrum will be analyzed with a CNN. An RNN is a generalization of a feedforward neural network with internal memory, which the RNN uses to process sequences of inputs when determining a new state.

II. RELATED WORK

Most researchers focused their noise reduction algorithms on removing one specific type of noise from the audio [5]. They have proposed adaptive filtering for echo cancellation, and, based on the denoising functions, the wavelet transform has been used to remove noise [2].

Spectral Subtraction also plays a major role in noise removal systems. Spectral Subtraction is a method to remove background noise from noisy speech signals in the frequency domain. This approach consists of calculating the noise spectrum using the Fast Fourier Transform (FFT) and subtracting the average noise level from the noisy speech spectrum [6].
Apart from those methods, researchers have used different filters for the removal of unwanted noise, such as the Gaussian filter, Butterworth filter, comb filter, Chebyshev I and II filters, and elliptic filter [7]. In one study, the Butterworth, Chebyshev I, and elliptic filters were used to remove noise from ECG signals in MATLAB, and the Butterworth filter was suggested as the best of the three [8].

Natural Language Processing is one of the trending technologies that has been applied in many areas to revolutionize the modern world. NLP is proposed here to diarize the different speakers in a group conversation: by analyzing the raw text obtained from the audio for important relationships and factors, the speakers can be diarized.

Previous studies were analyzed to understand the flexibility of the different approaches in NLP. From this, it is understood that there are two NLP approaches to diarizing speakers: text classification, which requires pre-labeled data and is therefore supervised learning, and text clustering, which deals with unlabeled data and is known as unsupervised learning. Since text obtained from audio is mostly unlabeled, the clustering approach gets higher priority; along with text clustering, classification can be used for better accuracy. Clustering algorithms analyze the unlabeled data and group similar data points together [6].

1) Convolutional Neural Networks (CNN): First published in the '80s, the architecture was later used by researchers to learn positional relations between pixels, enabling neural networks to identify shapes and patterns. The idea behind a convolutional layer is to break an image down into miniature squares and apply a convolution operator that maps each square through learnable filters. Stacking convolutional layers with fully connected layers results in a powerful architecture that can identify patterns ranging from simple shapes in the lowest layers to complex objects in the highest layers [9].

2) Recurrent Neural Networks (RNN): The concept of an RNN is that the output of a layer is fed back as input to the same layer. This mechanism gives the architecture a "memory" and allows it to comprehend correlations in sequences: its present decisions are influenced by what it learned from past steps [12], [13]. Long Short-Term Memory (LSTM) is a modified version of the RNN that uses memory cells instead of classic neurons; each cell is composed of several components and gates that provide long-term memory, and these states can be changed so that at a given time the memory of a state can be strengthened or weakened [9], [10], [12].
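To make the recurrence concrete, here is a minimal NumPy sketch of a plain recurrent step (a generic illustration, not the model trained later in this paper): the hidden state computed at one time step is fed back in at the next step, which is the "memory" described above.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a plain (Elman-style) RNN over a sequence of input vectors."""
    h = np.zeros(W_hh.shape[0])          # initial hidden state ("memory")
    states = []
    for x_t in inputs:                   # one step per element of the sequence
        # The new state depends on the current input AND the previous state.
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)

# Toy usage: a sequence of five 3-dimensional inputs, hidden size 4.
rng = np.random.default_rng(0)
seq = rng.normal(size=(5, 3))
out = rnn_forward(seq, rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), np.zeros(4))
print(out.shape)  # (5, 4): one hidden state per time step
```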
I-Vector Framework: i-vector subspace modeling is a recent state-of-the-art technique and, according to a recent study, one of the most effective features for speaker diarization. Speaker or session variability is the variability exhibited by a given speaker from one recording session to another. This type of variability is usually attributed to channel effects, although this is not strictly accurate, since intra-speaker variation and phonetic variation are also involved [11]. In this approach, a speech segment is represented by a low-dimensional "identity vector" (i-vector for short) extracted by Factor Analysis [12].

III. METHODOLOGY

A. Speech Denoising using DNN

Speech denoising aims to remove noise from speech signals while enhancing the quality and intelligibility of the speech. The system does not know in advance which background noises are present in the audio, so a background-noise dataset (UrbanSound8K) was used to identify them with different deep learning techniques. When an audio sample (in .wav format) containing background noise is given as input, the algorithm determines whether it contains one of the target sounds, with a corresponding likelihood score; if none of the target sounds are detected, an "unknown" score is returned. To achieve this, different neural network architectures such as Multi-Layer Perceptrons (MLPs) and Convolutional Neural Networks (CNNs) were used. Adaptive filters were then used to remove the predicted noise from the original audio, producing clean audio from which the speakers can be diarized more easily.
a) Dataset: UrbanSound8K is a background-noise dataset which contains 8732 labeled sound excerpts of urban sounds from 10 classes: air conditioner, car horn, children playing, dog bark, drilling, engine idling, gun shot, jackhammer, siren, and street music. All excerpts are taken from field recordings uploaded to freesound.org [15]. In addition to the sound excerpts, a CSV file containing metadata about each excerpt is also provided.

b) Preprocessing Stage: Preprocessing of the dataset is performed using sample-rate conversion and merging of the audio channels. Mel-Frequency Cepstral Coefficients (MFCCs) were then extracted on a per-frame basis from the audio samples. MFCCs were chosen to analyze the frequency and time characteristics because they summarize the frequency distribution across the window size. The dataset is split with a train-test split that allocates 80% to the training set and 20% to the testing set [16].
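A minimal sketch of this preprocessing, assuming librosa for loading and MFCC extraction and scikit-learn for the 80/20 split (the paper does not name its libraries):

```python
import numpy as np
import librosa
from sklearn.model_selection import train_test_split

def extract_mfcc(path, n_mfcc=40):
    """Load a clip, merge channels to mono, resample, and summarise its MFCCs."""
    y, sr = librosa.load(path, sr=22050, mono=True)          # sample-rate conversion + channel merge
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # per-frame coefficients
    return mfcc.mean(axis=1)                                  # summarise frames into one vector

# Build feature/label arrays for the 10 UrbanSound8K classes, then split 80/20:
# X = np.array([extract_mfcc(p) for p in wav_paths])
# y = np.array(class_ids)
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```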
c) Train and Test Model: The next step was to train a Deep Neural Network on the dataset to predict the noise classes. A simple architecture such as an MLP was used before experimenting with a more complex architecture such as a CNN [16]. CNNs build upon the architecture of MLPs with several important changes. First, they arrange the layers in three dimensions: width, height, and depth. Second, the nodes in one layer do not necessarily connect to all nodes in the following layer, but often only to a sub-region of it [16]. This allows a CNN to perform two important stages, namely feature extraction and classification.
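A hedged Keras sketch of such a classifier is shown below. The layer sizes and the input shape (the per-frame MFCC matrix treated as a single-channel image) are assumptions; the paper reports only the architecture families and their accuracies (93% for the MLP, 94% for the CNN).

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_noise_classifier(input_shape=(40, 174, 1), num_classes=10):
    """Small CNN mapping an MFCC 'image' to one of the 10 UrbanSound8K classes."""
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(16, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Dropout(0.3),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),  # per-class likelihood scores
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_noise_classifier()
model.summary()
# model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=30)
```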
Fig. 1. System Diagram for Audio Classification

Adaptive filters are used to remove the predicted background noise from the source and to generate clean audio without damaging the features of the audio. The initial stage of the adaptive filter requires two inputs and creates the filter according to the given parameters. After that, the data-filtering process is started to remove the background noise and smooth the audio.
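The paper does not state which adaptive algorithm is used. As one common choice, a least-mean-squares (LMS) filter taking the two inputs mentioned above (the noisy recording and a noise reference) could look like this:

```python
import numpy as np

def lms_denoise(noisy, noise_ref, n_taps=32, mu=0.01):
    """Adaptive LMS filter: predict the noise from a reference and subtract it."""
    w = np.zeros(n_taps)                     # adaptive filter weights
    clean = np.zeros(len(noisy))
    for n in range(n_taps, len(noisy)):
        x = noise_ref[n - n_taps:n][::-1]    # most recent reference samples
        y = w @ x                            # estimated noise component
        e = noisy[n] - y                     # error signal = cleaned sample
        w += 2 * mu * e * x                  # LMS weight update
        clean[n] = e
    return clean
```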
B. Speaker Diarization with NLP

1) Audio Dataset to Text Dataset Conversion: The process of converting spoken words into written text is called speech-to-text (STT) conversion, and it is used to get a wider understanding of the speech [17]. STT follows the same principles and steps as speech recognition, with different combinations of techniques at each step [17].

The ability to identify the words and phrases in spoken language and convert them to human-readable text is called speech recognition, and it can be done in Python using the SpeechRecognition library [11]. The main steps are:
• Picking a Python speech recognition package – Google Speech Recognition
• Installing SpeechRecognition in Python
• Working with audio files – loading the audio file and converting the speech into text using Google Speech Recognition, as sketched below
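A minimal sketch of that conversion with the SpeechRecognition package (the Google Web Speech API backend used here needs an internet connection; the file name is a placeholder):

```python
import speech_recognition as sr

def transcribe(wav_path, language="en-US"):
    """Convert a .wav file to text using the free Google Web Speech API."""
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)      # read the entire audio file
    try:
        return recognizer.recognize_google(audio, language=language)
    except sr.UnknownValueError:
        return ""                              # speech was unintelligible
    except sr.RequestError as err:
        raise RuntimeError(f"Speech API unavailable: {err}")

# print(transcribe("conversation.wav"))  # placeholder file name
```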
Diarization of the speakers in a conversation becomes more challenging as more speakers take part. A casual conversation often differs from a formal, structured one, and a raw transcript of a group conversation contains different kinds of noise; if this noise is handled properly, it takes the diarization job one step ahead. Natural Language Processing plays an important role in dealing with such a scenario.

a) Dataset: A conversational dataset from OpenSubtitles [19] is used in this research. The raw data of this dataset is preprocessed and normalized according to the criteria of this research. The final dataset is split into 4415 segments of 100,000 lines each and placed in Google Cloud Storage.

b) Train and Test Model: The large conversational dataset is split into training and test sets using an Apache Beam pipeline. Considering the dataset size, the TensorFlow framework is used with the Python programming language. The training model includes the features extracted from the dataset and the correlations among the speeches. The system can detect the speakers based on behavioral analysis, common word usage, and the names provided in the transcripts. The testing set is presented to the system to measure how accurately it works. To build the models, the Google Dataflow engine is needed, and a suitable Apache Beam pipeline is built to initiate it.

The dataset is stored in and retrieved from Google Cloud Storage and processed in the Google Dataflow environment; the models are then built and written back to Cloud Storage. This setup is convenient to use since there is less risk in the cloud.

A transcript is taken as input, and based on the trained model the system finds the relationships and features in the transcript. Once the context is understood by the system, it clusters the common speeches; the K-Means clustering algorithm is used to cluster the speeches from the transcript. The whole idea of clustering the speeches is based on Natural Language Processing techniques in which Machine Learning is significantly used: Machine Learning is applied to diarize the speakers based on the factors mentioned above.
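As a rough illustration of this clustering step (the paper does not publish its features or pipeline code), transcript lines could be vectorized with TF-IDF and grouped with K-Means using scikit-learn:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_transcript(lines, n_speakers=2):
    """Group transcript lines into n_speakers clusters based on TF-IDF similarity."""
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(lines)                    # one TF-IDF vector per line
    km = KMeans(n_clusters=n_speakers, n_init=10, random_state=0)
    return km.fit_predict(X)                               # cluster id acts as a speaker label

lines = [
    "Good morning everyone, let's start the meeting.",
    "Morning! I have the sales numbers ready.",
    "Let's start with the agenda for today then.",
    "The sales numbers look better than last quarter.",
]
print(cluster_transcript(lines, n_speakers=2))  # e.g. [0 1 0 1]
```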
Fig. 2. System Diagram for Speaker Diarization with NLP

C. Speaker Diarization with DNN

Speaker diarization can be defined as the identification of the number of different speakers in a conversation. This task can be achieved using different technologies, but the accuracy levels will not be the same. Here the task is accomplished by training a Deep Neural Network model, which is then used to predict the number of speakers in each audio sample.
a) Dataset and Model Training: The dataset used to train the model was freely available from the International Computer Science Institute (the ICSI Meeting Corpus), the AMI Corpus dataset. This dataset contains meeting recordings of different speakers, 5 GB in total, with all files in .wav format. With this dataset the model was trained, evaluated, and tested [20]; after training, the model can be used for predictions. Generally, the speaker diarization task is achieved through several sub-processes:
• Speech segmentation – the input audio is segmented into short sections assumed to come from the same speaker
• Audio embedding extraction – i-vector features are extracted from the audio clips
• Homogeneous segmentation – segments belonging to the same speaker are identified and aggregated
• Clustering – the final number of speakers is determined, and the homogeneous segments are clustered accordingly [21]-[23]

Fig. 3. System Diagram for Speaker Diarization with DNN

The model was trained on the Google Colab platform, which provides a free Linux environment supporting all TensorFlow and Python commands, lets users install additional packages and libraries, and offers a free GPU for deep neural network training. However, the environment is only saved per session, so every command would need to be recorded and re-executed if the session ends suddenly. To overcome this problem, Google Drive is mounted, and the processed data and models are transferred to Google Drive.
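In a Colab notebook, the mounting step described above is typically just the following (the save path in the comment is only an illustration):

```python
# Run inside Google Colab: persist processed data and models across sessions.
from google.colab import drive

drive.mount("/content/drive")
# e.g. model.save("/content/drive/MyDrive/diarization/speaker_count_model.h5")
```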
The model which produced the best accuracy results contained the following layers: four convolutional layers for image processing, a max-pooling function, and dropout. The audio feature extraction for i-vectors was done with a basic flatten layer. Then three dense layers and a final layer predict the probability of the number of speakers in the given audio. The model was trained for 50 epochs, with 100 steps per epoch. This structure of the model was the most accurate.
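A hedged Keras reconstruction of that structure is sketched below. The filter counts, dense widths, input shape, and the maximum speaker count are assumptions, the i-vector input is not modelled as a separate branch, and only the layer types and training schedule (50 epochs, 100 steps per epoch) come from the paper.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_speaker_count_model(input_shape=(64, 64, 1), max_speakers=6):
    """4 conv layers + max pooling + dropout, flatten, 3 dense layers, speaker-count output."""
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(16, 3, activation="relu"),
        layers.Conv2D(32, 3, activation="relu"),
        layers.Conv2D(64, 3, activation="relu"),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Dropout(0.3),
        layers.Flatten(),                                   # flattened features feed the dense stack
        layers.Dense(256, activation="relu"),
        layers.Dense(128, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(max_speakers, activation="softmax"),   # probability per speaker count
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_speaker_count_model()
# model.fit(train_dataset, epochs=50, steps_per_epoch=100)
```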
D. Synthetic Speech Detection with DNN

The approach was initiated by analyzing several research papers and publications to understand which architecture is best suited for the synthetic speech detection domain. The following subsections describe the relevant methodology.

a) Dataset: The dataset was the FakeOrReal dataset, freely available from a group of researchers (the APTLY Lab), with a size of 4 GB. It came labelled and preprocessed for direct use in the training phase. The dataset contains utterances labelled as real and fake (synthetic) in the form of .wav files. The audio samples labelled 'real' were collected from online sources such as YouTube and other streaming platforms; the samples labelled 'fake' were obtained from the output of some of the latest state-of-the-art text-to-speech systems.

b) Training and Validation: To train the model, its architecture had to be decided, and a thorough study was done before settling on one. The architecture was built upon Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). As described in Section II, stacking convolutional layers with fully connected layers yields an architecture that can identify patterns ranging from simple shapes to complex objects [9], while an RNN feeds the output of a layer back as its input, giving the architecture a "memory" that captures correlations in sequences [13], [12]. The idea of using both architectures in a single system is that CNNs are good at feature extraction and RNNs are good at identifying long-term dependencies in the time domain; therefore, combining them could produce more accurate results. This model was also trained on the Google Colab platform.

The structure of the model was designed with four convolutional layers, followed by max pooling and dropout. A flatten layer is then used to extract the different features of the audio.
These are followed by four dense layers and finally an activation layer with two classes for the fake and real predictions. The best-performing model was trained for 50 epochs with 100 steps per epoch.
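A hedged Keras sketch of this detection model follows. The layer widths and input shape are assumptions, and because the paper describes the architecture as combining CNNs and RNNs without saying where the recurrent part sits, only the explicitly listed layers (four convolutions, max pooling, dropout, flatten, four dense layers, a two-class output) are shown:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_fake_or_real_model(input_shape=(64, 64, 1)):
    """CNN-based real/fake speech classifier over spectrogram-like inputs."""
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(16, 3, activation="relu"),
        layers.Conv2D(32, 3, activation="relu"),
        layers.Conv2D(64, 3, activation="relu"),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Dropout(0.3),
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.Dense(256, activation="relu"),
        layers.Dense(128, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(2, activation="softmax"),   # 'fake' vs 'real'
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# model = build_fake_or_real_model()
# model.fit(train_dataset, epochs=50, steps_per_epoch=100, validation_data=val_dataset)
```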
The term frequency (tf) is calculated for a word as the ratio of the number of times the word occurs in a document to the total number of words in that document. The inverse document frequency (idf) is calculated by dividing the total number of documents by the number of documents containing the term and taking the logarithm of that quotient.

tf-idf(t, d) = tf(t, d) × idf(t)    (2)
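A small worked sketch that follows these definitions literally (note that libraries such as scikit-learn use smoothed idf variants):

```python
import math

docs = [
    "the meeting starts at nine".split(),
    "the sales meeting was long".split(),
    "nine new clients joined".split(),
]

def tf(term, doc):
    """Term frequency: occurrences of the term divided by the document length."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Inverse document frequency: log(total docs / docs containing the term)."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)        # equation (2)

print(round(tf_idf("meeting", docs[0], docs), 3))   # appears in 2 of 3 docs
print(round(tf_idf("clients", docs[2], docs), 3))   # rarer term scores higher
```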
D. Synthetic Speech Detection with DNN

The implementation was done on the Google Colab platform, a free cloud service that provides a dedicated runtime environment and a free GPU for machine learning and deep learning. The results for the proposed model were measured by accuracy and the loss function, during and after training: accuracy measures how closely the model's predictions match the true data, and the loss value indicates how poorly or well the model behaves after each iteration of optimization. The model achieved an accuracy of 94% and a loss value of 0.691.

We used three different approaches for extracting features from the audio: the Short-Time Fourier Transform (STFT), a type of Fourier transform that calculates the frequency content of local sections of the signal; the Mel-Spectrogram, which is similar to the STFT but uses the non-linear mel frequency scale; and Mel-Frequency Cepstral Coefficients (MFCCs), derived from the cepstral representation of an audio clip, whose coefficients jointly make up the MFC. Each was used as input to the proposed deep learning architecture. The results achieved for each feature, compared to the previous study [9], are as follows:

TABLE II
MODEL COMPARISON

Algorithm              STFT     Mel      MFCC
Proposed Model         47.55%   41.14%   88.80%
Previous Studies [9]   50.10%   41.57%   90.48%
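A minimal sketch of how these three representations might be computed (librosa is assumed as the extraction library; the paper does not name one):

```python
import numpy as np
import librosa

def extract_features(path, n_mels=128, n_mfcc=40):
    """Compute the three spectral representations compared in Table II."""
    y, sr = librosa.load(path, sr=16000, mono=True)

    stft = np.abs(librosa.stft(y, n_fft=512, hop_length=256))        # STFT magnitude
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)  # mel-scaled spectrogram
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)           # cepstral coefficients

    return stft, mel, mfcc

# Each array is a (frequency-bins x frames) matrix that can be fed to the CNN
# after resizing/normalising, e.g.: stft, mel, mfcc = extract_features("clip.wav")
```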
However, the outcome did not meet expectations. To overcome this problem, a new model was trained using a pre-trained model, with the last layer added on top of it. This approach is known as "transfer learning"; the advantage is that pre-trained models have been trained on thousands of samples and tend to give better accuracy. The pre-trained model chosen was VGG19; the specialty of this model is that it was trained with audio data, that is, with images of spectrograms of the audio dataset. Of the two trained models, the one trained with transfer learning had the better accuracy results.
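The paper states that its VGG19 base was trained on spectrogram images; since those weights are not provided, the hedged sketch below uses the standard ImageNet weights simply to show the mechanics of freezing a pre-trained base and adding a new last layer:

```python
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.applications import VGG19

def build_transfer_model(input_shape=(224, 224, 3)):
    """Frozen VGG19 base with a new top layer for the real/fake decision."""
    base = VGG19(weights="imagenet", include_top=False, input_shape=input_shape)
    base.trainable = False                          # keep the pre-trained features

    model = keras.Sequential([
        base,
        layers.GlobalAveragePooling2D(),
        layers.Dense(2, activation="softmax"),      # newly added last layer
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Spectrogram images (resized to 224x224, 3 channels) are fed in as training data:
# model = build_transfer_model()
# model.fit(spectrogram_train_ds, validation_data=spectrogram_val_ds, epochs=20)
```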
V. CONCLUSION

Within this research work, an innovative approach was taken to train and validate the models and achieve some of the best accuracy results. To accomplish this, we used some of the best available datasets prepared by researchers and freely available for public use. This paper has covered audio signal processing techniques, speaker diarization approaches, and synthetic speech detection mechanisms. Though the signal denoising tasks were accomplished with satisfactory results, the methods can be improved by implementing better filtering techniques and improving the datasets. Audio-to-text generation could be made more accurate, since it is a vital input to the Natural Language Processing based algorithms. Audio-based speaker diarization plays an important role in this research, and reducing false positives can help to achieve a better outcome. The fine-tuned solution will be suitable for security-critical organizations as well as government organizations seeking to improve their security measures. The next steps could be to implement fully automated features such as automatic audio processing, filtering, diarization, and detection with improved models, which would add a great advantage to this solution.

REFERENCES

[1] F. Ahmad, "The Role of Deepfake Audio in the Growing AI Voice Market," Medium, 2019. [Online]. Available: https://medium.com/@mfaizan.ahmad/the-role-of-deepfake-audio-in-the-growing-ai-voice-market-9246365442f5
[2] P. Rao, "Audio signal processing," Stud. Comput. Intell., vol. 83, pp. 169–189, 2008.
[3] N. Melethadathil, P. Chellaiah, B. Nair, and S. Diwakar, "Classification and clustering for neuroinformatics: Assessing the efficacy on reverse-mapped NeuroNLP data using standard ML techniques," 2015 Int. Conf. Adv. Comput. Commun. Informatics (ICACCI 2015), pp. 1065–1070, 2015.
[4] H. Yu, Z. H. Tan, Z. Ma, R. Martin, and J. Guo, "Spoofing Detection in Automatic Speaker Verification Systems Using DNN Classifiers and Dynamic Acoustic Features," IEEE Trans. Neural Networks Learn. Syst., vol. 29, no. 10, pp. 4633–4644, 2018.
[5] J. Vijayakumar, "A Systematic Algorithm for Denoising Audio Signal Using Savitzky-Golay Method," pp. 676–679, Apr. 2018.
[6] C. Cole, M. Karam, and H. Aglan, "Increasing additive noise removal in speech processing using spectral subtraction," Proc. Int. Conf. Inf. Technol. New Gener. (ITNG 2008), pp. 1146–1147, 2008.
[7] H. Magsi, A. H. Sodhro, F. A. Chachar, and S. A. K. Abro, "Analysis of signal noise reduction by using filters," 2018 Int. Conf. Comput. Math. Eng. Technol. (iCoMET 2018), pp. 1–6, 2018.
[8] P. Podder, M. Mehedi Hasan, M. Rafiqul Islam, and M. Sayeed, "Design and Implementation of Butterworth, Chebyshev-I and Elliptic Filter for Speech Signal Analysis," Int. J. Comput. Appl., vol. 98, no. 7, pp. 12–18, 2014.
[9] R. A. M. Reimao, "Synthetic Speech Detection Using Deep Neural Networks," 2019.
[10] A. Deshpande, "A Beginner's Guide To Understanding Convolutional Neural Networks," 2016. [Online]. Available: https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks/
[11] A. Rockikz, "How to Convert Speech to Text in Python," 2020. [Online]. Available: https://www.thepythoncode.com/article/using-speech-recognition-to-convert-speech-to-text-python
[12] W. Feng, N. Guan, Y. Li, X. Zhang, and Z. Luo, "Audiovisual speech recognition with multimodal recurrent neural networks," Proc. Int. Jt. Conf. Neural Networks (IJCNN), pp. 681–688, 2017.
[13] M. Venkatachalam, "Recurrent Neural Networks," 2019. [Online]. Available: https://towardsdatascience.com/recurrent-neural-networks-d4642c9bc7ce
[14] J. Salamon, C. Jacoby, and J. P. Bello, "UrbanSound8K Dataset," 2014. [Online]. Available: https://urbansounddataset.weebly.com/urbansound8k.html
[15] "Freesound." [Online]. Available: https://freesound.org/
[16] M. Smales, "Classifying Urban Sounds using Deep Learning," 2018.
[17] A. Trivedi, N. Pant, P. Shah, S. Sonik, and S. Agrawal, "Speech to text and text to speech recognition systems - A review," IOSR J. Comput. Eng., vol. 20, no. 2, pp. 36–43, 2018.
[18] A. Rockikz, "How to Convert Speech to Text in Python," 2020. [Online]. Available: https://www.thepythoncode.com/article/using-speech-recognition-to-convert-speech-to-text-python
[19] Jörg Tiedemann, "Conversational Dataset."
[20] "ICSI Meeting Corpus." [Online]. Available: http://groups.inf.ed.ac.uk/ami/icsi/download/
[21] Q. Wang, C. Downey, L. Wan, P. A. Mansfield, and I. L. Moreno, "Speaker diarization with LSTM," Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), pp. 5239–5243, 2018.
[22] C. Wooters and M. Huijbregts, "The ICSI RT07s speaker diarization system," Lect. Notes Comput. Sci., vol. 4625 LNCS, pp. 509–519, 2008.
[23] S. E. Tranter and D. A. Reynolds, "An overview of automatic speaker diarization systems," IEEE Trans. Audio, Speech Lang. Process., vol. 14, no. 5, pp. 1557–1565, 2006.