Wijethunga 2020
Abstract—Recent advancements in deep learning and related technologies have led to improvements in various areas such as computer vision, bio-informatics, and speech recognition. This research focuses on a problem involving synthetic speech and speaker diarization. Developments in audio have produced deep learning models capable of replicating natural-sounding voices, also known as text-to-speech (TTS) systems. This technology can be manipulated for malicious purposes such as deepfakes, impersonation, or spoofing attacks. We propose a system capable of distinguishing between real and synthetic speech in group conversations. We built Deep Neural Network models and integrated them into a single solution using different datasets, including but not limited to UrbanSound8K (5.6 GB), Conversational (12.2 GB), AMI-Corpus (5 GB), and FakeOrReal (4 GB). Our proposed approach consists of four main components. The speech-denoising component cleans and preprocesses the audio using Multilayer Perceptron and Convolutional Neural Network architectures, with 93% and 94% accuracy respectively. Speaker diarization was implemented using two different approaches: Natural Language Processing on converted text, with 93% accuracy, and a Recurrent Neural Network model for speaker labeling, with 80% accuracy and a Diarization Error Rate of 0.52. The final component distinguishes between real and fake audio using a CNN architecture with 94% accuracy. With these findings, this research contributes to the domain of speech analysis.

Index Terms—Deep Neural Networks, Natural Language Processing, Speaker Diarization, Deepfake, Deep learning

I. INTRODUCTION

Machine-generated voices are common in our day-to-day lives. With the automation of technology, people increasingly prefer to control daily tasks by voice. Machine-generated, or synthetic, voice is widely used in virtual assistants such as Google Assistant, Alexa, Siri, and Bixby. Despite the advantages of such technologies, people such as celebrities, politicians, and other famous persons are victimized. The major concern is an advanced technology called Deepfake, which mostly uses Generative Adversarial Networks (GANs) [1] to generate synthetic audio that impersonates real people.

Studies have documented worst cases in which Deepfake audio was used as a piece of crucial evidence to let criminals go free. Introducing Deepfake audio into a group conversation is the next level of this crime. In a group conversation, multiple speakers' voices are mixed together. Previous research has clearly shown how a Deepfake image or Deepfake video can be identified; the challenge now is to detect Deepfake audio within a conversation.

Deepfake audio detection starts with audio signal processing [2]. An audio signal is a representation of sound, and through signal processing mechanisms its spectrogram and wavelength are computed. Following previous research, the audio signal should be processed in such a way that all speakers are diarized before Deepfake detection. Preprocessing of the audio aims to reduce noise and other unwanted factors in the signal.

Natural Language Processing deals with the text retrieved from the audio. Both classification and clustering mechanisms are used to reach an acceptable diarization accuracy. According to previous studies, the outcome of NLP-based diarization can be integrated with Machine Learning [3] to cluster the speakers with improved accuracy.

The diarized content can then be used in the detection phase. In terms of synthetic speech detection, the research problem concerns the drawbacks of synthetic speech in more detail. For example, a TTS system can be trained on a targeted individual's voice; once trained, it could be used for impersonation attacks. Therefore, to detect these malicious utterances when authenticity is required, there must be a mechanism to tell the difference.
The aim is to build a DNN model which can extract the dynamic acoustic features [4] of the voice and determine whether a given audio clip is artificially generated or real human speech. CNN and RNN architectures will be used in the development of the model. CNNs have proven to be highly effective in computer vision and image processing, and therefore the spectrum will be analyzed with a CNN. An RNN is a generalization of a feedforward neural network with internal memory, which the RNN uses to process sequences of inputs when determining a new state.

II. RELATED WORK

Most researchers focused their noise reduction algorithms on removing one specific type of noise from the audio [5]. They have proposed adaptive filtering for echo cancellation, and, based on the denoising functions, the wavelet transform has been used to remove noise [2].

Spectral Subtraction also plays a major role in noise removal systems. Spectral Subtraction is a method to remove background noise from noisy speech signals in the frequency domain. This approach consists of calculating the noise spectrum using the Fast Fourier Transform (FFT) and subtracting the average noise level from the noisy speech spectrum [6].
Apart from those methods, researchers have used different filters for the removal of unwanted noise, such as the Gaussian filter, Butterworth filter, comb filter, Chebyshev I and II filters, and elliptic filter [7]. In one study, the Butterworth, Chebyshev I, and elliptic filters were used to remove noise from ECG signals in MATLAB, and the Butterworth filter was suggested as the best of the three [8].

Natural Language Processing is one of the trending technologies that has been applied in many areas to revolutionize the modern world. NLP is proposed here to diarize the different speakers in a group conversation: by analyzing the raw text obtained from the audio for important relationships and factors, the speakers can be diarized.

Previous studies were analyzed to understand the flexibility of the different approaches in NLP. From this, it is understood that there are two NLP approaches to diarizing speakers: text classification, which requires pre-labeled data and is therefore supervised learning, and text clustering, which deals with unlabeled data and is known as unsupervised learning. Since text obtained from audio is mostly unlabeled, the clustering approach gets higher priority; along with text clustering, classification can be used for better accuracy. Clustering algorithms analyze the unlabeled data and group similar data points together [6].

1) Convolutional Neural Networks (CNN): First published in the '80s, the architecture was later used by researchers to learn positional relations between pixels, enabling neural networks to identify shapes and patterns. The idea behind a convolutional layer is to break an image down into miniature squares and apply a convolution operator that maps each square through learnable filters. Stacking convolutional layers with fully connected layers results in a powerful architecture that can identify patterns ranging from simple shapes in the lowest layers to complex objects in the highest layers [9].

2) Recurrent Neural Networks (RNN): The concept of an RNN is that the output of a layer is fed back as input to the same layer. This mechanism gives the architecture a "memory" and allows it to comprehend correlations in sequences: its present decisions are influenced by what it learned from past steps [12], [13]. Long Short-Term Memory (LSTM) is a modified version of the RNN that uses memory cells instead of classic neurons; each cell is composed of several components and gates that provide long-term memory, and these states can be changed so that at a given time the memory of a state can be strengthened or weakened [9], [10], [12].
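To make the recurrence concrete, here is a minimal NumPy sketch of a plain recurrent step (a generic illustration, not the model trained later in this paper): the hidden state computed at one time step is fed back in at the next step, which is the "memory" described above.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a plain (Elman-style) RNN over a sequence of input vectors."""
    h = np.zeros(W_hh.shape[0])          # initial hidden state ("memory")
    states = []
    for x_t in inputs:                   # one step per element of the sequence
        # The new state depends on the current input AND the previous state.
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)

# Toy usage: a sequence of five 3-dimensional inputs, hidden size 4.
rng = np.random.default_rng(0)
seq = rng.normal(size=(5, 3))
out = rnn_forward(seq, rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), np.zeros(4))
print(out.shape)  # (5, 4): one hidden state per time step
```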
I-Vector Framework: i-vector subspace modeling is a recent state-of-the-art technique and, according to a recent study, one of the most effective features for speaker diarization. Speaker or session variability is the variability exhibited by a given speaker from one recording session to another. This type of variability is usually attributed to channel effects, although this is not strictly accurate, since intra-speaker variation and phonetic variation are also involved [11]. In this approach, a speech segment is represented by a low-dimensional "identity vector" (i-vector for short) extracted by Factor Analysis [12].

III. METHODOLOGY

A. Speech Denoising using DNN

Speech denoising aims to remove noise from speech signals while enhancing the quality and intelligibility of the speech. The system does not know in advance which background noises are present in the audio, so a background-noise dataset (UrbanSound8K) was used to identify them with different deep learning techniques. When an audio sample (in .wav format) containing background noise is given as input, the algorithm determines whether it contains one of the target sounds, with a corresponding likelihood score; if none of the target sounds are detected, an "unknown" score is returned. To achieve this, different neural network architectures such as Multi-Layer Perceptrons (MLPs) and Convolutional Neural Networks (CNNs) were used. Adaptive filters were then used to remove the predicted noise from the original audio, producing clean audio from which the speakers can be diarized more easily.
a) Dataset: UrbanSound8K is a background-noise dataset which contains 8732 labeled sound excerpts of urban sounds from 10 classes: air conditioner, car horn, children playing, dog bark, drilling, engine idling, gun shot, jackhammer, siren, and street music. All excerpts are taken from field recordings uploaded to freesound.org [15]. In addition to the sound excerpts, a CSV file containing metadata about each excerpt is also provided.

b) Preprocessing Stage: Preprocessing of the dataset is performed using sample-rate conversion and merging of the audio channels. Mel-Frequency Cepstral Coefficients (MFCCs) were then extracted on a per-frame basis from the audio samples. MFCCs were chosen to analyze the frequency and time characteristics because they summarize the frequency distribution across the window size. The dataset is split with a train-test split that allocates 80% to the training set and 20% to the testing set [16].
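A minimal sketch of this preprocessing, assuming librosa for loading and MFCC extraction and scikit-learn for the 80/20 split (the paper does not name its libraries):

```python
import numpy as np
import librosa
from sklearn.model_selection import train_test_split

def extract_mfcc(path, n_mfcc=40):
    """Load a clip, merge channels to mono, resample, and summarise its MFCCs."""
    y, sr = librosa.load(path, sr=22050, mono=True)          # sample-rate conversion + channel merge
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # per-frame coefficients
    return mfcc.mean(axis=1)                                  # summarise frames into one vector

# Build feature/label arrays for the 10 UrbanSound8K classes, then split 80/20:
# X = np.array([extract_mfcc(p) for p in wav_paths])
# y = np.array(class_ids)
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```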
c) Train and Test Model: The next step was to train a Deep Neural Network on the dataset to predict the noise classes. A simple architecture such as an MLP was used before experimenting with a more complex architecture such as a CNN [16]. CNNs build upon the architecture of MLPs with several important changes. First, they arrange the layers in three dimensions: width, height, and depth. Second, the nodes in one layer do not necessarily connect to all nodes in the following layer, but often only to a sub-region of it [16]. This allows a CNN to perform two important stages, namely feature extraction and classification.
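A hedged Keras sketch of such a classifier is shown below. The layer sizes and the input shape (the per-frame MFCC matrix treated as a single-channel image) are assumptions; the paper reports only the architecture families and their accuracies (93% for the MLP, 94% for the CNN).

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_noise_classifier(input_shape=(40, 174, 1), num_classes=10):
    """Small CNN mapping an MFCC 'image' to one of the 10 UrbanSound8K classes."""
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(16, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Dropout(0.3),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),  # per-class likelihood scores
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_noise_classifier()
model.summary()
# model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=30)
```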
Fig. 1. System Diagram for Audio Classification

Adaptive filters are used to remove the predicted background noise from the source and to generate clean audio without damaging the features of the audio. The initial stage of the adaptive filter requires two inputs and creates the filter according to the given parameters. After that, the data-filtering process is started to remove the background noise and smooth the audio.
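The paper does not state which adaptive algorithm is used. As one common choice, a least-mean-squares (LMS) filter taking the two inputs mentioned above (the noisy recording and a noise reference) could look like this:

```python
import numpy as np

def lms_denoise(noisy, noise_ref, n_taps=32, mu=0.01):
    """Adaptive LMS filter: predict the noise from a reference and subtract it."""
    w = np.zeros(n_taps)                     # adaptive filter weights
    clean = np.zeros(len(noisy))
    for n in range(n_taps, len(noisy)):
        x = noise_ref[n - n_taps:n][::-1]    # most recent reference samples
        y = w @ x                            # estimated noise component
        e = noisy[n] - y                     # error signal = cleaned sample
        w += 2 * mu * e * x                  # LMS weight update
        clean[n] = e
    return clean
```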
B. Speaker Diarization with NLP

1) Audio Dataset to Text Dataset Conversion: The process of converting spoken words into written text is called speech-to-text (STT) conversion, and it is used to get a wider understanding of the speech [17]. STT follows the same principles and steps as speech recognition, with different combinations of techniques at each step [17].

The ability to identify the words and phrases in spoken language and convert them to human-readable text is called speech recognition, and it can be done in Python using the SpeechRecognition library [11]. The main steps are:
• Picking a Python speech recognition package – Google Speech Recognition
• Installing SpeechRecognition in Python
• Working with audio files – loading the audio file and converting the speech into text using Google Speech Recognition, as sketched below
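A minimal sketch of that conversion with the SpeechRecognition package (the Google Web Speech API backend used here needs an internet connection; the file name is a placeholder):

```python
import speech_recognition as sr

def transcribe(wav_path, language="en-US"):
    """Convert a .wav file to text using the free Google Web Speech API."""
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)      # read the entire audio file
    try:
        return recognizer.recognize_google(audio, language=language)
    except sr.UnknownValueError:
        return ""                              # speech was unintelligible
    except sr.RequestError as err:
        raise RuntimeError(f"Speech API unavailable: {err}")

# print(transcribe("conversation.wav"))  # placeholder file name
```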
Diarization of the speakers in a conversation becomes more challenging as more speakers take part. A casual conversation often differs from a formal, structured one, and a raw transcript of a group conversation contains different kinds of noise; if this noise is handled properly, it takes the diarization job one step ahead. Natural Language Processing plays an important role in dealing with such a scenario.

a) Dataset: A conversational dataset from OpenSubtitles [19] is used in this research. The raw data of this dataset is preprocessed and normalized according to the criteria of this research. The final dataset is split into 4415 segments of 100,000 lines each and placed in Google Cloud Storage.

b) Train and Test Model: The large conversational dataset is split into training and test sets using an Apache Beam pipeline. Considering the dataset size, the TensorFlow framework is used with the Python programming language. The training model includes the features extracted from the dataset and the correlations among the speeches. The system can detect the speakers based on behavioral analysis, common word usage, and the names provided in the transcripts. The testing set is presented to the system to measure how accurately it works. To build the models, the Google Dataflow engine is needed, and a suitable Apache Beam pipeline is built to initiate it.

The dataset is stored in and retrieved from Google Cloud Storage and processed in the Google Dataflow environment; the models are then built and written back to Cloud Storage. This setup is convenient to use since there is less risk in the cloud.

A transcript is taken as input, and based on the trained model the system finds the relationships and features in the transcript. Once the context is understood by the system, it clusters the common speeches; the K-Means clustering algorithm is used to cluster the speeches from the transcript. The whole idea of clustering the speeches is based on Natural Language Processing techniques in which Machine Learning is significantly used: Machine Learning is applied to diarize the speakers based on the factors mentioned above.
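As a rough illustration of this clustering step (the paper does not publish its features or pipeline code), transcript lines could be vectorized with TF-IDF and grouped with K-Means using scikit-learn:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_transcript(lines, n_speakers=2):
    """Group transcript lines into n_speakers clusters based on TF-IDF similarity."""
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(lines)                    # one TF-IDF vector per line
    km = KMeans(n_clusters=n_speakers, n_init=10, random_state=0)
    return km.fit_predict(X)                               # cluster id acts as a speaker label

lines = [
    "Good morning everyone, let's start the meeting.",
    "Morning! I have the sales numbers ready.",
    "Let's start with the agenda for today then.",
    "The sales numbers look better than last quarter.",
]
print(cluster_transcript(lines, n_speakers=2))  # e.g. [0 1 0 1]
```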
Fig. 2. System Diagram for Speaker Diarization with NLP

C. Speaker Diarization with DNN

Speaker diarization can be defined as the identification of the number of different speakers in a conversation. This task can be achieved using different technologies, but the accuracy levels will not be the same. Here the task is accomplished by training a Deep Neural Network model, which is then used to predict the number of speakers in each audio sample.
a) Dataset and Model Training: The dataset used to train the model was freely available from the International Computer Science Institute (the ICSI Meeting Corpus), the AMI Corpus dataset. This dataset contains meeting recordings of different speakers, 5 GB in total, with all files in .wav format. With this dataset the model was trained, evaluated, and tested [20]; after training, the model can be used for predictions. Generally, the speaker diarization task is achieved through several sub-processes:
• Speech segmentation – the input audio is segmented into short sections assumed to come from the same speaker
• Audio embedding extraction – i-vector features are extracted from the audio clips
• Homogeneous segmentation – segments belonging to the same speaker are identified and aggregated
• Clustering – the final number of speakers is determined, and the homogeneous segments are clustered accordingly [21]-[23]

Fig. 3. System Diagram for Speaker Diarization with DNN

The model was trained on the Google Colab platform, which provides a free Linux environment supporting all TensorFlow and Python commands, lets users install additional packages and libraries, and offers a free GPU for deep neural network training. However, the environment is only saved per session, so every command would need to be recorded and re-executed if the session ends suddenly. To overcome this problem, Google Drive is mounted, and the processed data and models are transferred to Google Drive.
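In a Colab notebook, the mounting step described above is typically just the following (the save path in the comment is only an illustration):

```python
# Run inside Google Colab: persist processed data and models across sessions.
from google.colab import drive

drive.mount("/content/drive")
# e.g. model.save("/content/drive/MyDrive/diarization/speaker_count_model.h5")
```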
The model which produced the best accuracy results contained the following layers: four convolutional layers for image processing, a max-pooling function, and dropout. The audio feature extraction for i-vectors was done with a basic flatten layer. Then three dense layers and a final layer predict the probability of the number of speakers in the given audio. The model was trained for 50 epochs, with 100 steps per epoch. This structure of the model was the most accurate.
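A hedged Keras reconstruction of that structure is sketched below. The filter counts, dense widths, input shape, and the maximum speaker count are assumptions, the i-vector input is not modelled as a separate branch, and only the layer types and training schedule (50 epochs, 100 steps per epoch) come from the paper.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_speaker_count_model(input_shape=(64, 64, 1), max_speakers=6):
    """4 conv layers + max pooling + dropout, flatten, 3 dense layers, speaker-count output."""
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(16, 3, activation="relu"),
        layers.Conv2D(32, 3, activation="relu"),
        layers.Conv2D(64, 3, activation="relu"),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Dropout(0.3),
        layers.Flatten(),                                   # flattened features feed the dense stack
        layers.Dense(256, activation="relu"),
        layers.Dense(128, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(max_speakers, activation="softmax"),   # probability per speaker count
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_speaker_count_model()
# model.fit(train_dataset, epochs=50, steps_per_epoch=100)
```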
D. Synthetic Speech Detection with DNN

The approach was initiated by analyzing several research papers and publications to understand which architecture is best suited for the synthetic speech detection domain. The following subsections describe the relevant methodology.

a) Dataset: The dataset was the FakeOrReal dataset, freely available from a group of researchers (the APTLY Lab), with a size of 4 GB. It came labelled and preprocessed for direct use in the training phase. The dataset contains utterances labelled as real and fake (synthetic) in the form of .wav files. The audio samples labelled 'real' were collected from online sources such as YouTube and other streaming platforms; the samples labelled 'fake' were obtained from the output of some of the latest state-of-the-art text-to-speech systems.

b) Training and Validation: To train the model, its architecture had to be decided, and a thorough study was done before settling on one. The architecture was built upon Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). As described in Section II, stacking convolutional layers with fully connected layers yields an architecture that can identify patterns ranging from simple shapes to complex objects [9], while an RNN feeds the output of a layer back as its input, giving the architecture a "memory" that captures correlations in sequences [13], [12]. The idea of using both architectures in a single system is that CNNs are good at feature extraction and RNNs are good at identifying long-term dependencies in the time domain; therefore, combining them could produce more accurate results. This model was also trained on the Google Colab platform.

The structure of the model was designed with four convolutional layers, followed by max pooling and dropout. A flatten layer is then used to extract the different features of the audio.
These are followed by four dense layers and finally an activation layer with two classes for the fake and real predictions. The best-performing model was trained for 50 epochs with 100 steps per epoch.
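A hedged Keras sketch of this detection model follows. The layer widths and input shape are assumptions, and because the paper describes the architecture as combining CNNs and RNNs without saying where the recurrent part sits, only the explicitly listed layers (four convolutions, max pooling, dropout, flatten, four dense layers, a two-class output) are shown:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_fake_or_real_model(input_shape=(64, 64, 1)):
    """CNN-based real/fake speech classifier over spectrogram-like inputs."""
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(16, 3, activation="relu"),
        layers.Conv2D(32, 3, activation="relu"),
        layers.Conv2D(64, 3, activation="relu"),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Dropout(0.3),
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.Dense(256, activation="relu"),
        layers.Dense(128, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(2, activation="softmax"),   # 'fake' vs 'real'
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# model = build_fake_or_real_model()
# model.fit(train_dataset, epochs=50, steps_per_epoch=100, validation_data=val_dataset)
```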
The term frequency (tf) is calculated for a word as the ratio of the number of times the word occurs in a document to the total number of words in that document. The inverse document frequency (idf) is calculated by dividing the total number of documents by the number of documents containing the term and taking the logarithm of that quotient.

tf-idf(t, d) = tf(t, d) × idf(t)    (2)
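A small worked sketch that follows these definitions literally (note that libraries such as scikit-learn use smoothed idf variants):

```python
import math

docs = [
    "the meeting starts at nine".split(),
    "the sales meeting was long".split(),
    "nine new clients joined".split(),
]

def tf(term, doc):
    """Term frequency: occurrences of the term divided by the document length."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Inverse document frequency: log(total docs / docs containing the term)."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)        # equation (2)

print(round(tf_idf("meeting", docs[0], docs), 3))   # appears in 2 of 3 docs
print(round(tf_idf("clients", docs[2], docs), 3))   # rarer term scores higher
```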
D. Synthetic Speech Detection with DNN

The implementation was done on the Google Colab platform, a free cloud service that provides a dedicated runtime environment and a free GPU for machine learning and deep learning. The results for the proposed model were measured by accuracy and the loss function, during and after training: accuracy measures how closely the model's predictions match the true data, and the loss value indicates how poorly or well the model behaves after each iteration of optimization. The model achieved an accuracy of 94% and a loss value of 0.691.

We used three different approaches for extracting features from the audio: the Short-Time Fourier Transform (STFT), a type of Fourier transform that calculates the frequency content of local sections of the signal; the Mel-Spectrogram, which is similar to the STFT but uses the non-linear mel frequency scale; and Mel-Frequency Cepstral Coefficients (MFCCs), derived from the cepstral representation of an audio clip, whose coefficients jointly make up the MFC. Each was used as input to the proposed deep learning architecture. The results achieved for each feature, compared to the previous study [9], are as follows:

TABLE II
MODEL COMPARISON

Algorithm              STFT     Mel      MFCC
Proposed Model         47.55%   41.14%   88.80%
Previous Studies [9]   50.10%   41.57%   90.48%
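A minimal sketch of how these three representations might be computed (librosa is assumed as the extraction library; the paper does not name one):

```python
import numpy as np
import librosa

def extract_features(path, n_mels=128, n_mfcc=40):
    """Compute the three spectral representations compared in Table II."""
    y, sr = librosa.load(path, sr=16000, mono=True)

    stft = np.abs(librosa.stft(y, n_fft=512, hop_length=256))        # STFT magnitude
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)  # mel-scaled spectrogram
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)           # cepstral coefficients

    return stft, mel, mfcc

# Each array is a (frequency-bins x frames) matrix that can be fed to the CNN
# after resizing/normalising, e.g.: stft, mel, mfcc = extract_features("clip.wav")
```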
However, the outcome did not meet expectations. To overcome this problem, a new model was trained using a pre-trained model, with the last layer added on top of it. This approach is known as "transfer learning"; the advantage is that pre-trained models have been trained on thousands of samples and tend to give better accuracy. The pre-trained model chosen was VGG19; the specialty of this model is that it was trained with audio data, that is, with images of spectrograms of the audio dataset. Of the two trained models, the one trained with transfer learning had the better accuracy results.
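The paper states that its VGG19 base was trained on spectrogram images; since those weights are not provided, the hedged sketch below uses the standard ImageNet weights simply to show the mechanics of freezing a pre-trained base and adding a new last layer:

```python
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.applications import VGG19

def build_transfer_model(input_shape=(224, 224, 3)):
    """Frozen VGG19 base with a new top layer for the real/fake decision."""
    base = VGG19(weights="imagenet", include_top=False, input_shape=input_shape)
    base.trainable = False                          # keep the pre-trained features

    model = keras.Sequential([
        base,
        layers.GlobalAveragePooling2D(),
        layers.Dense(2, activation="softmax"),      # newly added last layer
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Spectrogram images (resized to 224x224, 3 channels) are fed in as training data:
# model = build_transfer_model()
# model.fit(spectrogram_train_ds, validation_data=spectrogram_val_ds, epochs=20)
```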
V. CONCLUSION

Within this research work, an innovative approach was taken to train and validate the models and achieve some of the best accuracy results. To accomplish this, we used some of the best available datasets prepared by researchers and freely available for public use. This paper has covered audio signal processing techniques, speaker diarization approaches, and synthetic speech detection mechanisms. Though the signal denoising tasks were accomplished with satisfactory results, the methods can be improved by implementing better filtering techniques and improving the datasets. Audio-to-text generation could be made more accurate, since it is a vital input to the Natural Language Processing based algorithms. Audio-based speaker diarization plays an important role in this research, and reducing false positives can help to achieve a better outcome. The fine-tuned solution will be suitable for security-critical organizations as well as government organizations seeking to improve their security measures. The next steps could be to implement fully automated features such as automatic audio processing, filtering, diarization, and detection with improved models, which would add a great advantage to this solution.

REFERENCES

[1] F. Ahmad, "The Role of Deepfake Audio in the Growing AI Voice Market," Medium, 2019. [Online]. Available: https://medium.com/@mfaizan.ahmad/the-role-of-deepfake-audio-in-the-growing-ai-voice-market-9246365442f5
[2] P. Rao, "Audio signal processing," Stud. Comput. Intell., vol. 83, pp. 169–189, 2008.
[3] N. Melethadathil, P. Chellaiah, B. Nair, and S. Diwakar, "Classification and clustering for neuroinformatics: Assessing the efficacy on reverse-mapped NeuroNLP data using standard ML techniques," 2015 Int. Conf. Adv. Comput. Commun. Informatics (ICACCI 2015), pp. 1065–1070, 2015.
[4] H. Yu, Z. H. Tan, Z. Ma, R. Martin, and J. Guo, "Spoofing Detection in Automatic Speaker Verification Systems Using DNN Classifiers and Dynamic Acoustic Features," IEEE Trans. Neural Networks Learn. Syst., vol. 29, no. 10, pp. 4633–4644, 2018.
[5] J. Vijayakumar, "A Systematic Algorithm for Denoising Audio Signal Using Savitzky-Golay Method," pp. 676–679, Apr. 2018.
[6] C. Cole, M. Karam, and H. Aglan, "Increasing additive noise removal in speech processing using spectral subtraction," Proc. Int. Conf. Inf. Technol. New Gener. (ITNG 2008), pp. 1146–1147, 2008.
[7] H. Magsi, A. H. Sodhro, F. A. Chachar, and S. A. K. Abro, "Analysis of signal noise reduction by using filters," 2018 Int. Conf. Comput. Math. Eng. Technol. (iCoMET 2018), pp. 1–6, 2018.
[8] P. Podder, M. Mehedi Hasan, M. Rafiqul Islam, and M. Sayeed, "Design and Implementation of Butterworth, Chebyshev-I and Elliptic Filter for Speech Signal Analysis," Int. J. Comput. Appl., vol. 98, no. 7, pp. 12–18, 2014.
[9] R. A. M. Reimao, "Synthetic Speech Detection Using Deep Neural Networks," 2019.
[10] A. Deshpande, "A Beginner's Guide To Understanding Convolutional Neural Networks," 2016. [Online]. Available: https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks/
[11] A. Rockikz, "How to Convert Speech to Text in Python," 2020. [Online]. Available: https://www.thepythoncode.com/article/using-speech-recognition-to-convert-speech-to-text-python
[12] W. Feng, N. Guan, Y. Li, X. Zhang, and Z. Luo, "Audiovisual speech recognition with multimodal recurrent neural networks," Proc. Int. Jt. Conf. Neural Networks (IJCNN), pp. 681–688, 2017.
[13] M. Venkatachalam, "Recurrent Neural Networks," 2019. [Online]. Available: https://towardsdatascience.com/recurrent-neural-networks-d4642c9bc7ce
[14] J. Salamon, C. Jacoby, and J. P. Bello, "UrbanSound8K Dataset," 2014. [Online]. Available: https://urbansounddataset.weebly.com/urbansound8k.html
[15] "Freesound." [Online]. Available: https://freesound.org/
[16] M. Smales, "Classifying Urban Sounds using Deep Learning," 2018.
[17] A. Trivedi, N. Pant, P. Shah, S. Sonik, and S. Agrawal, "Speech to text and text to speech recognition systems - A review," IOSR J. Comput. Eng., vol. 20, no. 2, pp. 36–43, 2018.
[18] A. Rockikz, "How to Convert Speech to Text in Python," 2020. [Online]. Available: https://www.thepythoncode.com/article/using-speech-recognition-to-convert-speech-to-text-python
[19] Jörg Tiedemann, "Conversational Dataset."
[20] "ICSI Meeting Corpus." [Online]. Available: http://groups.inf.ed.ac.uk/ami/icsi/download/
[21] Q. Wang, C. Downey, L. Wan, P. A. Mansfield, and I. L. Moreno, "Speaker diarization with LSTM," Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), pp. 5239–5243, 2018.
[22] C. Wooters and M. Huijbregts, "The ICSI RT07s speaker diarization system," Lect. Notes Comput. Sci., vol. 4625 LNCS, pp. 509–519, 2008.
[23] S. E. Tranter and D. A. Reynolds, "An overview of automatic speaker diarization systems," IEEE Trans. Audio, Speech Lang. Process., vol. 14, no. 5, pp. 1557–1565, 2006.