Deep4SNet: Deep Learning For Fake Speech Classification

Keywords: Fake voice; Convolutional neural network; Imitation; Deep learning; Deep voice; Classification

Abstract: Fake speech consists of voice recordings created either by artificial intelligence or by signal processing techniques. Among the methods for generating false voice recordings are Deep Voice and Imitation. In Deep Voice, the recordings sound slightly synthesized, whereas in Imitation, they sound natural. On the other hand, the task of detecting fake content is not trivial considering the large number of voice recordings that are transmitted over the Internet. In order to detect fake voice recordings obtained by Deep Voice and Imitation, we propose a solution based on a Convolutional Neural Network (CNN), using image augmentation and dropout. The proposed architecture was trained with 2092 histograms of both original and fake voice recordings and cross-validated with 864 histograms. 476 new histograms were used for external validation, and Precision (P) and Recall (R) were calculated. Detection of fake audios reached P = 0.997, R = 0.997 for Imitation-based recordings, and P = 0.985, R = 0.944 for Deep Voice-based recordings. The global accuracy was 0.985. According to the results, the proposed system is successful in detecting fake voice content.
… the dot product between two utterances by two dot products between two i-vectors. With the purpose of reducing the redundancy of the high-dimension feature vectors used in SVM-based speaker verification systems, ACO is applied as a feature selection approach, obtaining a 64% feature dimension reduction with an EER of 1.7% (Rashno et al., 2015).

On the other hand, the authors of Loughran et al. (2017) addressed the problem of unbalanced data (i.e. the number of examples of one class is significantly higher than that of the other) by applying GA in the adjustment of the cost function. For DL-based systems, proposals include, for example, deep features of the same speaker to form an objective vector (Liu et al., 2015), a nonlinear metric learning method to discriminate whether two utterances belong to the same person (Feng et al., 2017), spectrograms of utterances as inputs for a Convolutional Neural Network (CNN)-based model (Bunrit et al., 2019), or a deep Siamese network to learn pairs of equal/different speakers from audio recordings. According to the reported results, DL-based speaker verification systems can work with EER values of 0.1% and an accuracy of 95%.

Moreover, the problem is further complicated by the fact that erroneous evidence can be created not only by faking another person's voice, but also by using methods or models to create a fake voice from a real voice. Note that in this document we use the term false when the recording is obtained by voice impersonation, and the term fake when the original voice recording is transformed by machine learning or signal processing techniques. For example, Deep Voice is a text-to-speech algorithm based on deep neural networks that can clone anyone's voice (Arık et al., 2017); when the method was introduced in 2017, it required the original voice recordings to last a few minutes to create the cloned voice, whereas it currently needs only a few seconds to create the fake recording. Although this technology has improved across the different versions of the algorithm (Ping et al., 2018), it still faces a challenge related to the naturalness of the cloned voice. With few samples (fewer than 100), the cloned voice has some artifacts or synthetic sounds that may reveal that it is fake. Using public examples of cloned voices (e.g. https://audiodemos.github.io/ or https://r9y9.github.io/deepvoice3_pytorch/), it is possible to find a pattern to distinguish between original and fake recordings. Specifically, the spectrogram of cloned signals has been found to have a lower power/frequency ratio for normalized frequencies close to 0.5 than in the case of original voice signals (Fig. 1).

Fig. 1. Spectrogram examples for one authentic voice signal and three signals generated using Deep Voice.

Similarly, VoCo is a text-to-speech system capable of producing fake voices (Jin, Mysore, Diverdi, Lu, & Finkelstein, 2017). In this case, as reported at Adobe's MAX 2016 Conference, the system needs 20 min of original voice recordings to train the ML-based system. To the best of our knowledge, a commercial VoCo product does not yet exist. Another approach, Lyrebird, a division of AI-based Descript, is working on a private Beta of the Voice Double algorithm, which creates a fake voice that sounds like you, but with different plain-text content (visit https://www.descript.com/lyrebird).

Unlike Deep Voice, VoCo and Voice Double, the method proposed by Ballesteros and Moreno in 2012 to create fake voice recordings uses signal processing techniques instead of training a machine learning model. It is a solution bio-inspired by the behavior of the chameleon, which is able to adapt (i.e. imitate) its "color" to the surrounding environment (Ballesteros L & Moreno A, 2012a, 2012b). A voice recording can then mimic the accent, rhythm, tone, language, gender, and plain text of another voice recording through a mapping process. The number of fake recordings that can be obtained from a voice recording is extremely high and depends on its duration; the longer the voice recording, the more fake voices can be obtained. The main characteristic to highlight of the fake voices created with this method is that they sound and look like an original voice recording, and so do their spectrograms (Figs. 2 and 3).
Fig. 2. Examples of time-domain signals generated using the Imitation-based approach.

Fig. 4. General diagram for Alice's communication with Bob using fake voices.

Therefore, if Eva hears the fake signal, its content may differ completely from Alice's original message in terms of accent, rhythm, tone, language and gender. In order to illustrate the performance of the Imitation-based method, some pairs of original/fake recordings are available at https://doi.org/10.17632/ytkv9w92t6.1, along with their corresponding keys and an algorithm to reverse the Imitation process. The pair of recordings (speaker5_1.wav and fake4_1.wav) are substantially the same, but one of them is original and the other is not. The fake recording comes from a Spanish-speaking male, even though the fake audio sounds like a Spanish-speaking female.

Because of the high similarity between original and fake signals obtained with the Imitation method, detecting its fake content is not a trivial task. It is extremely easy to deceive a listener (or legal authority) about the originality of its content, and a fake recording could be used as evidence within a legal process. To address this challenge, this paper makes the following contributions:

• We propose a solution based on Deep Learning, called Deep4SNet, to classify original and fake voice recordings obtained by Imitation. Deep4SNet is a text-independent classifier, which allows it to be used for a wide range of voice recordings.
• An analysis of a set of features is performed to identify whether the relationship between each feature and the label is strong or not, as a stage prior to CNN training.
• Our proposed solution also classifies fake voices obtained using other methods (e.g. Deep Voice). With the trained model and parameters provided at https://github.com/yohannarodriguez/Deep4SNet.git, anyone can use the proposed model as a tool to distinguish between original and fake voice recordings.

3. Hand-crafted vs. Automatic feature extraction

According to the IBM Foundational Methodology for Data Science (Rollins, 2015), there are four steps related to data in any data analytics lifecycle: data requirements, data collection, data understanding and data preparation. For simplicity, in this text we will refer to these four blocks as a single big block named "data creation".

In our research we propose two approaches to the data creation step: hand-crafted feature extraction and data transformation, the former for machine learning (ML)-based models and the latter for deep learning (DL)-based models. It should be noted that the data creation step for DL-based models does not include the feature extraction process, as this is done automatically within the model. In addition, we carry out the data transformation process within the data creation block because we intend to use a 2D-CNN architecture that is suitable for image-type inputs, rather than 1D-array-type inputs. There are works in the literature that have transformed audios into images, e.g. spectrograms, as inputs of 2D-CNNs for audio classification tasks such as speech emotion recognition (Satt, Rozenberg, & Hoory, 2017; Yenigalla et al., 2018) or speaker recognition (Zeng, Mao, Peng, & Yi, 2019). Fig. 5 shows the difference between the two approaches.

In Section 4, the data creation block for the two approaches, for three possible datasets and three different hypotheses, will be explained.
4. Data creation and hypothesis validation

For ML-based models we propose two candidate datasets, derived from two hypotheses. In addition, for DL-based models we propose one data transformation, related to the third hypothesis. However, each proposal has the characteristic of being text-independent, unlike others such as the spectrogram, in which the behavior changes not only with the speaker but also with the plain-text content. The description of each dataset and its hypothesis validation are presented in the following subsections.

4.1. Data creation

This section presents three proposals for data creation, where each proposal is based on a hypothesis.

4.1.1. Statistics-based features and Hypothesis 1

In the first approach, the statistics of the voice signal were considered, specifically the mean, the standard deviation, the minimum of the normalized voice signal, and the maximum of the normalized voice signal. This implies that the voice signal is scaled by the maximum between the magnitude of its peak and the magnitude of its valley.

These features are selected because they are text-independent. Typically, natural speech signals have an average around 0.0 and a standard deviation of 0.2 when the signal ranges between [−1, 1]. Therefore, if fake signals differ in terms of their statistics from the above values, a machine learning model could classify the audio recordings as original and fake.
In terms of statistics, the following hypothesis is proposed:

Hypothesis 1. Natural speech signals have very similar statistics to each other, as do fake signals. Additionally, the statistic values between natural and fake signals are dissimilar.

Complementing the previous hypothesis, if the correlation coefficient between each characteristic (i.e., each statistical value) and the classifier label (1 for original, 0 for fake) is greater than or equal to 0.5 (in the range of 0 to 1), then Hypothesis 1 is true. Otherwise, it is false.
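As an illustration of this first data-creation approach, the following minimal sketch (not part of the published code; the file path is illustrative) computes the four statistics of a normalized recording with NumPy and SciPy:

```python
import numpy as np
from scipy.io import wavfile

def statistics_features(path):
    """Return the four Hypothesis-1 statistics of a recording."""
    _, x = wavfile.read(path)            # 16-bit PCM samples
    x = x.astype(np.float64)
    x = x / np.max(np.abs(x))            # scale by the largest |peak| or |valley|
    return np.array([x.mean(), x.std(), x.min(), x.max()])

# Example (illustrative file name):
# feats = statistics_features("speaker5_1.wav")
```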
4.1.2. Entropy-based features and Hypothesis 2

It is well known that entropy can be used as a measure of data uncertainty (Robinson, 2008). The higher the level of uncertainty, the greater the value of entropy. This value depends on the distribution of the data, but not on the plain text of the message; hence, entropy may help to identify fake content. For that reason, we propose the following hypothesis:

Hypothesis 2. Natural speech signals have very similar entropy values to each other, as do fake signals. Additionally, the entropy values between natural and fake signals are dissimilar.

In this case, if the correlation coefficient between each feature (i.e. the entropy of a segment of the signal) and the classifier label (1 for original, 0 for fake) is at least greater in magnitude than 0.5 (in the range 0 to 1), then Hypothesis 2 is true. Otherwise, it is false.
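A minimal sketch of the entropy-based features is shown below. The Shannon-entropy estimator based on an amplitude histogram is an assumption, since the text does not specify which estimator was used:

```python
import numpy as np

def shannon_entropy(x, bins=256):
    """Estimate Shannon entropy (bits) from the amplitude distribution of x."""
    counts, _ = np.histogram(x, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def entropy_features(x, fs, seconds=10):
    """One entropy value per second of signal plus one for the whole recording."""
    feats = [shannon_entropy(x[i * fs:(i + 1) * fs]) for i in range(seconds)]
    feats.append(shannon_entropy(x))
    return np.array(feats)    # (seconds + 1) values, i.e. 11 for a 10-s recording
```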
4.1.3. Transforming voice recordings into histograms, and Hypothesis 3

The previous hypotheses have a disadvantage in terms of the limited number of features extracted from the speech signal. In the first hypothesis there are only 4 features (one for each selected statistic) and in the second hypothesis there are (n + 1) features (i.e. the number of signal segments used for comparison, plus the whole signal). Therefore, in the third hypothesis, we propose a new approach in which the features are not hand-crafted, but extracted by a DL-based model, and the input of the architecture is an image instead of a 1D array. The classification problem is thus treated as a computer vision problem that can be addressed by using Convolutional Neural Networks (CNNs), which have demonstrated over the last decade superior performance in classification tasks over ML-based shallow models (Shin & Balasingham, 2017; Rodriguez-Ortega, Ballesteros, & Renza, 2021).

For this hypothesis we created a specific dataset (Ballesteros, Rodriguez, & Renza, 2020) which contains histograms of original speech signals and of fake signals created by Imitation and Deep Voice. These images may carry more extensive information than the hand-crafted features based on entropy and statistics, e.g. the width and slope of the curve in the lower, middle and upper parts of the histogram, which may be useful for the current classification task.

Then, the third proposed hypothesis is as follows:

Hypothesis 3. Natural speech signals have very similar histogram shapes to each other, as do fake signals. Additionally, the histogram shapes between natural and fake signals are dissimilar.

Unlike Hypothesis 1 and Hypothesis 2, the correlation coefficient is not used to determine whether Hypothesis 3 is true; instead, validation is carried out by visual inspection of the histograms of each label, since in this case the histograms do not correspond to the features of the model.

The advantage of using histograms instead of other types of transformation, such as spectrograms, is that the model is not dependent on the plaintext of the message, i.e. it can classify original and fake voice recordings for any plaintext message.
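As an illustration of this data transformation, the following sketch converts a recording into a histogram image suitable for a 2D-CNN. The figure size, number of bins and output format are assumptions, not the exact settings used for the published dataset:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                    # render without a display
import matplotlib.pyplot as plt
from scipy.io import wavfile

def voice_to_histogram(wav_path, png_path, bins=100):
    """Save the amplitude histogram of a recording as an image for the 2D-CNN."""
    _, x = wavfile.read(wav_path)
    plt.figure(figsize=(2, 2))
    plt.hist(x, bins=bins)
    plt.axis("off")                      # the CNN only needs the shape of the curve
    plt.savefig(png_path, dpi=100)
    plt.close()
```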
4.2. Validation of the hypotheses

The next step is to select the most appropriate dataset to feed the ML- or DL-based model. Therefore, every hypothesis is validated, as presented below.

4.2.1. Validation of Hypothesis 1

For this hypothesis validation, 100 original and 100 fake voice recordings were used. Each recording has a duration of 10 s and a sampling rate of 44100 Hz with 16-bit quantization. The statistics calculated for each signal were: mean, standard deviation (desv), normalized minimum (min) and normalized maximum (max).

The correlation coefficient between two series (e.g. mean vs. label, or desv vs. label) allows us to identify whether they are correlated or not, i.e. whether one of them (the label) depends on the other (the feature). This coefficient was obtained for each feature/label pair through a correlation matrix, in which the last column is the one of interest (see Fig. 6).

Fig. 6. Validation results of Hypothesis 1. Correlation matrix for statistic vs. label. The lighter the color, the higher the correlation.

According to Fig. 6, all correlation coefficients between the input features and the label are very close to 0 and far from 1, so Hypothesis 1 is not true. Therefore, these features will not allow us to distinguish between the original and the fake voice recording, regardless of which machine learning method may be used.
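This validation can be illustrated with the following sketch, which builds the feature/label correlation matrix with pandas; statistics_features() refers to the sketch in Section 4.1.1, and the path lists are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# 'original_paths' and 'fake_paths' are illustrative lists of the 100 + 100 recordings.
feats = np.vstack([statistics_features(p) for p in original_paths + fake_paths])
labels = np.concatenate([np.ones(len(original_paths)), np.zeros(len(fake_paths))])

df = pd.DataFrame(feats, columns=["mean", "desv", "min", "max"])
df["label"] = labels
corr = df.corr()              # correlation matrix, as visualized in Fig. 6
print(corr["label"])          # column of interest: feature-vs-label coefficients
# Hypothesis 1 would hold only if these coefficients reached at least 0.5 in magnitude.
```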
4.2.2. Validation of Hypothesis 2

For a more complete evaluation of this hypothesis, the addition of noise to the signal was considered, since, owing to the stochastic behavior of the noise, the entropy of the noisy signal can change. The objective is to evaluate the influence of noise on the correlation values between each feature and the label.

The test protocol to validate Hypothesis 2 is as follows:

• Voice recordings: 360 original voice signals and 360 target signals. Each recording has a duration of 10 s and a sampling rate of 44100 Hz with 16-bit quantization.
• Features: 11 entropy values were used, one value for every second of signal duration and one for the entire voice recording. These values are calculated for both the original and fake voice recordings.
• Dataset: four datasets are created from the original and target recordings (Table 1 shows how every dataset is composed); the difference between them lies in whether noise is added to the recordings or not. Each voice recording imitates a single target voice, and therefore there are 360 fake recordings in each dataset.

Table 1
Content of each dataset (voice and target) according to the presence of noise.

             Voice   Noisy voice   Target   Noisy target
Dataset 1      ×                     ×
Dataset 2      ×                                  ×
Dataset 3               ×            ×
Dataset 4               ×                         ×

In a similar way to the analysis of the statistics-based features, the correlation matrix is calculated for every dataset (Fig. 7). In the current case, the column of interest is the first one.

Fig. 7. Validation examples of Hypothesis 2. Correlation matrix for entropy vs. label. The lighter the color, the higher the correlation.

Considering that not all the correlation coefficients between every feature and the label are higher than 0.5, Hypothesis 2 is not true.
4.2.3. Validation of Hypothesis 3

Unlike the validation of the previous hypotheses, the correlation matrix between the features and the label is not used to determine whether Hypothesis 3 is true or not. The reason is that the dataset related to Hypothesis 3 does not correspond to features, but to a data transformation of the voice recordings, using histograms. This image set is intended to be used in a DL-based model, where feature extraction is part of the model.

Then, the validation of Hypothesis 3 is performed by visual inspection of the histograms of each class (Fig. 8).

Fig. 8. Validation results of Hypothesis 3. Histograms obtained from speech signals (natural and fake). Available at Ballesteros et al. (2020).

5. Proposed method

Due to the difficulty of identifying fake voice recordings created with the Imitation-based method, because they sound and look very much like the original voice signals (i.e., without artifacts or synthetic effects), the problem becomes a computer vision task in which the input is the histogram image rather than some hand-crafted features. The advantage of using CNNs in classification tasks is that they have demonstrated excellent performance for this type of image analysis.

Fig. 9 shows the proposed solution: it covers the training stage and the validation stage. In the training stage, the histograms of the training voices are separated into original and fake. They are transformed by image resizing, image scaling and horizontal flipping. Then, the CNN is trained with the new original/fake images. In the validation stage, the images are transformed by resizing and scaling, but not by horizontal flipping. They are then fed to the trained model, which classifies the histograms into original (class = 1) and fake (class = 0), and the results are compared with the true values.

In order to avoid overfitting, we apply the following strategies: first, to use horizontal flipping as image augmentation in the pre-processing module; second, to add dropout to the CNN architecture. With the horizontal flip, the CNN is trained with a wider variety of histograms, which differ in the displacement of the central point of the graph (i.e. histograms that are more positive or more negative). With dropout as a regularization technique, some of the neurons are randomly ignored, so the neurons in the next layer learn without overfitting.

The structure of the proposed solution is explained as follows.
5.1. Pre-processing

In the training stage, we use the ImageDataGenerator module of Keras to perform three tasks: image resizing, image scaling and horizontal flipping. The images are adjusted to 150 × 150 × 3 pixels and normalized to the range [0, 1]. The selected image augmentation is a horizontal flip, which corresponds to a mirror effect across the y-axis. It takes advantage of the imperfect symmetry between the left and right sides of the histogram. In the validation stage, horizontal flipping is not considered in the pre-processing step.
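A minimal sketch of this pre-processing with the Keras ImageDataGenerator is shown below; the directory layout and batch size are illustrative:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Training generator: rescaling to [0, 1] plus horizontal flip as augmentation
train_gen = ImageDataGenerator(rescale=1.0 / 255, horizontal_flip=True)
train_flow = train_gen.flow_from_directory(
    "histograms/train",                  # illustrative path with 'original' and 'fake' subfolders
    target_size=(150, 150),              # resize to 150 x 150 x 3
    batch_size=32,
    class_mode="binary")

# Validation generator: resize and rescale only, no flipping
val_gen = ImageDataGenerator(rescale=1.0 / 255)
val_flow = val_gen.flow_from_directory(
    "histograms/val", target_size=(150, 150), batch_size=32, class_mode="binary")
```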
5.2. Network architecture

In our custom architecture (Fig. 10), the number of convolutional and pooling layers is significantly lower than in other computer vision networks because, unlike the typical classification task, we do not need features from deeper layers to identify differently shaped objects, but rather features from shallow layers. Several similar works aimed at fake-content recognition have used 2D-CNNs with a low number of convolutional layers, with satisfactory results (Zhuo, Tan, Zeng, & Lit, 2018; Rodriguez-Ortega et al., 2021; Goel, Kaur, & Bala, 2021).

In Fig. 10, f represents the size of the filter, s the size of the stride and p the size of the padding; CONV and POOL are convolutional and pooling operations. There are 3 CONV + POOL layers followed by a flatten layer, a hidden layer and the output layer. The architecture works with dropout in the hidden layer to avoid overfitting. The output size of a convolutional block is ⌊(n + 2p − f)/s + 1⌋ × ⌊(m + 2p − f)/s + 1⌋ × nf, where n corresponds to the number of rows, m to the number of columns of the input image, and nf is the number of filters. For CONV1, n = m = 150 and nf = 32; then, the output is ⌊(150 + 0 − 3)/1 + 1⌋ × ⌊(150 + 0 − 3)/1 + 1⌋ × 32 = 148 × 148 × 32.

The number of trainable parameters in each convolutional layer is equal to the number of weights of the filters, including the biases. For the first convolutional layer, there are 32 RGB filters, each with 3 × 3 weights per channel and one bias per filter, for a total of 896 weights (i.e. 32 × (3 × 3 × 3 + 1)). Table 2 shows the summary of trainable parameters. It is emphasized that the pooling operation has no trainable parameters.

Table 2
Summary of trainable parameters.

Layer    Trainable parameters
Conv1    896
Pool1    0
Conv2    9,248
Pool2    0
Conv3    18,496
Pool3    0
FC1      0
FC2      1,183,808
FC3      65
Total    1,212,513

The selected loss function is binary cross-entropy, L(y, ŷ), which is related to the dissimilarity, in terms of entropy, between two data sequences; in our case, the entropy of the known labels (yi) and the entropy of the predicted labels (ŷi). This kind of loss function is very useful in binary classification problems. Mathematically, it is calculated as shown in Eq. (1).

L(y, ŷ) = −(1/N) Σ_{i=0}^{N} [ yi · log(ŷi) + (1 − yi) · log(1 − ŷi) ]   (1)

For the optimizer, the selected method is RMSprop, which consists of dynamically scaling the learning rate by dividing it by the root of the squared (average) gradient of the mini-batch (Taqi, Awad, Al-Azzo, & Milanova, 2018). The activation function for the convolutional and hidden layers is ReLU (Rectified Linear Unit), which offers a good compromise between performance and computational cost. The goal of ReLU is to discard negative values and allow positive values to pass, according to Eq. (2).

f(x) = max(0, x)   (2)

Finally, the activation function of the last neuron is the sigmoid, given by Eq. (3). For binary classification problems, this type of activation is widely recommended by the scientific community.

f(x) = 1 / (1 + e^(−x))   (3)
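The following sketch puts together the layer structure of Fig. 10 and the training configuration of Eqs. (1)-(3). The filter counts (32, 32 and 64) and the 64-unit hidden layer are inferred from the parameter counts in Table 2 rather than stated explicitly in the text, so this should be read as a reconstruction, not as the published implementation:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(150, 150, 3)),  # CONV1: 896 params
    layers.MaxPooling2D((2, 2)),                                              # POOL1
    layers.Conv2D(32, (3, 3), activation="relu"),                             # CONV2: 9,248 params
    layers.MaxPooling2D((2, 2)),                                              # POOL2
    layers.Conv2D(64, (3, 3), activation="relu"),                             # CONV3: 18,496 params
    layers.MaxPooling2D((2, 2)),                                              # POOL3
    layers.Flatten(),                                                         # FC1: 0 params
    layers.Dense(64, activation="relu"),                                      # FC2: 1,183,808 params
    layers.Dropout(0.2),                                                      # dropout in the hidden layer
    layers.Dense(1, activation="sigmoid"),                                    # FC3: 65 params, Eq. (3)
])

# Binary cross-entropy (Eq. 1) with the RMSprop optimizer
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_flow, validation_data=val_flow, epochs=...)  # epoch count not reported here
```

With these choices, model.summary() reproduces the totals of Table 2 (1,212,513 trainable parameters).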
6. Experimental results and analysis

The proposed model was developed using Python 3.0, TensorFlow and Keras running on a GPU. The experiments are designed to evaluate the performance of the proposed solution with fake voices obtained from the Imitation-based method, as well as from Deep Voice. This section encompasses the experimental setup, the evaluation metrics, the strategies to avoid overfitting, and the final results.

6.1. Experimental setup

Training and validation dataset: the first step in creating the experimental dataset is to obtain original and fake recordings using the Imitation-based method and the Deep Voice algorithm. In the first case, 360 original voice recordings from 44 speakers and four languages were used. Some of these recordings are available at https://doi.org/10.17632/ytkv9w92t6.1. White noise with an SNR of 20 dB was added to these recordings, resulting in 360 noisy recordings. From the 720 original voice recordings, 720 fake recordings were calculated, one for each original voice recording. From the original and fake recordings, their 2880 histograms were obtained, 1440 from the original voice recordings and the others from the fake recordings. In the case of Deep Voice, recordings of the Voice Cloning Experiment I published at https://audiodemos.github.io/ were selected to train the CNN model. A 16-bit re-quantization was applied in order to work with the same quantization as the recordings in the training step. In total, there are 76 histograms …
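The addition of white noise at a 20 dB SNR can be sketched as follows; a Gaussian noise generator is assumed, since the text only states that the noise is white:

```python
import numpy as np

def add_white_noise(x, snr_db=20.0, rng=None):
    """Add white Gaussian noise to signal x at the requested SNR (in dB)."""
    rng = np.random.default_rng() if rng is None else rng
    x = x.astype(np.float64)
    p_signal = np.mean(x ** 2)
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(p_noise), size=x.shape)
    return x + noise
```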
The evaluation metrics are accuracy, precision and recall, defined in Eqs. (4)-(6).

Accuracy = (TP + TN) / (TP + TN + FP + FN)   (4)

Precision = TP / (TP + FP)   (5)

Recall = TP / (TP + FN)   (6)

The terms TP, TN, FP and FN are explained in Table 3. The accuracy corresponds to the correctly classified recordings divided by the total number of recordings; the precision is the ratio of the recordings correctly classified as original to all the recordings classified as original; and the recall corresponds to the recordings correctly classified as original divided by the total number of original recordings.

Table 3
Description of terms of the evaluation metrics.

Term                  Description
TP (True Positive)    Original speech classified as original
TN (True Negative)    Fake speech classified as fake
FP (False Positive)   Fake speech classified as original
FN (False Negative)   Original speech classified as fake
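Eqs. (4)-(6) translate directly into the following helper, shown here only as a minimal sketch:

```python
def evaluation_metrics(tp, tn, fp, fn):
    """Accuracy, precision and recall as defined in Eqs. (4)-(6)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# Usage with the Table 3 conventions (original speech is the positive class):
# acc, prec, rec = evaluation_metrics(tp=..., tn=..., fp=..., fn=...)
```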
… horizontal flip (image augmentation) in the pre-processing step is included. Their results are shown in Fig. 13. Comparing the results in Fig. 12 with those in Fig. 13, it is clear that model performance is better when both horizontal flip and dropout are included. Bearing in mind that the elbow effect in the loss graphs disappears for a dropout equal to 0.2, this value is the one selected for the model, according to Section 5.2.

Fig. 12. Dropout effect. Top to bottom: accuracy, loss, precision, and recall; left to right: without dropout, dropout = 0.2, dropout = 0.3.

Fig. 13. Image augmentation and dropout effect. Top to bottom: accuracy, loss, precision, and recall; left to right: dropout = 0.2, dropout = 0.3.
6.4. External test

The trained model, the trainable parameters, and the experiment for the external test are posted at https://github.com/yohannarodriguez/Deep4SNet.git. The dataset with the histograms is available at https://doi.org/10.17632/k47yd3m28w.1.

For the method based on Imitation, we use 400 new recordings (not previously used for training or testing), corresponding to 20 original recordings and 380 fake recordings (Fig. 11). The original recordings were obtained from The LJ Speech Dataset (available at https://keithito.com/LJ-Speech-Dataset/). In the case of Deep Voice, 76 recordings were selected from the Voice Cloning Experiment II (available at https://audiodemos.github.io/), corresponding to 4 original voice recordings and 72 fake voice recordings. Table 4 shows the results; the positive class corresponds to the label original and the negative class to the label fake.

Table 4
Evaluation metrics.

                    Precision   Recall   Global Accuracy
Fake (Imitation)    0.997       0.997
Fake (Deep Voice)   0.985       0.944    0.985
Original            0.814       0.916

According to the results shown in Table 4, recall is better for fake voice recordings obtained by Imitation and Deep Voice than for original voice recordings. This means that a fake voice recording is less likely to be incorrectly labeled than an original voice recording. Similarly, precision is better for fake voice recordings than for original voice recordings. So, if a recording is labeled as fake, there is great confidence in the truth of the label. In general, histograms are correctly classified 98.5% of the time.
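As a usage sketch of the external test, a histogram image can be classified with the published model as follows. The file names are illustrative, since the layout of the repository is not described in the text:

```python
import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing import image

# File names are illustrative; the actual artifacts are in the Deep4SNet repository.
model = load_model("deep4snet_model.h5")

img = image.load_img("histogram_test.png", target_size=(150, 150))
x = image.img_to_array(img) / 255.0                 # same scaling as in training
x = np.expand_dims(x, axis=0)

prob = float(model.predict(x)[0][0])                # sigmoid output
print("original" if prob >= 0.5 else "fake", prob)  # class 1 = original, class 0 = fake
```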
7. Conclusion and future work

We have proposed a solution based on convolutional neural networks to classify original and fake speech recordings obtained by Imitation and Deep Voice. Three hypotheses of datasets to predict whether a voice recording is original or fake have been studied in this work. The first hypothesis aims to use statistical values as features to train machine learning models, and according to the results, the label does not depend on these features, as the correlation coefficient values are very close to zero. The second approach aims to use entropy-based features, but most of the correlation coefficient values between every feature/label pair are less than 0.5, and therefore this hypothesis is false, too. The third hypothesis uses histograms of voice recordings, as they carry more extensive information than entropy itself. In a preliminary review of the histograms, the behavior of the original voice recordings is different from that of the fake voice recordings. This hypothesis is selected in the final solution. In this case, the classification task is treated as a computer vision problem and therefore the classifier is based on a custom CNN. In order to avoid overfitting, two strategies were applied: image augmentation and dropout. According to the experimental tests, the horizontal flip and a dropout equal to 0.2 are good hyperparameter values for the current problem. The model has precision and recall of 0.997 when it is used to classify fake voice recordings created with the Imitation method. Also, when the model is used to identify voice recordings created with Deep Voice, the precision is 0.985 and the recall is 0.944. The results showed that the proposed solution is successful in identifying fake voice recordings obtained by Imitation and Deep Voice.

CRediT authorship contribution statement

Dora M. Ballesteros: Conceptualization, Methodology, Investigation, Writing - original draft, Funding acquisition. Yohanna Rodriguez-Ortega: Software, Data curation, Validation, Writing - review & editing. Diego Renza: Conceptualization, Formal analysis, Writing - review & editing, Funding acquisition. Gonzalo Arce: Supervision, Writing - review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

References
Arık, S. Ö., Chrzanowski, M., Coates, A., Diamos, G., Gibiansky, A., Kang, Y., Li, X., Miller, J., Ng, A., Raiman, J., et al. (2017). Deep voice: Real-time neural text-to-speech. In International Conference on Machine Learning (pp. 195–204). PMLR.

Ballesteros, D. M., Rodriguez, Y., & Renza, D. (2020). A dataset of histograms of original and fake voice recordings (H-Voice). Data in Brief, 29, Article 105331.

Ballesteros L, D. M., & Moreno A, J. M. (2012a). Highly transparent steganography model of speech signals using efficient wavelet masking. Expert Systems with Applications, 39, 9141–9149.

Ballesteros L, D. M., & Moreno A, J. M. (2012b). On the ability of adaptation of speech signals and data hiding. Expert Systems with Applications, 39, 12574–12579.

Bunrit, S., Inkian, T., Kerdprasop, N., & Kerdprasop, K. (2019). Text-independent speaker identification using deep learning model of convolution neural network. International Journal of Machine Learning and Computing, 9, 143–148.

Chao, Y.-H. (2014). Using LR-based discriminant kernel methods with applications to speaker verification. Speech Communication, 57, 76–86.

Chao, Y.-H., Tsai, W.-H., Wang, H.-M., & Chang, R.-C. (2008). Using kernel discriminant analysis to improve the characterization of the alternative hypothesis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 16, 1675–1684.

Feng, Y., Xiong, Q., & Shi, W. (2017). Deep nonlinear metric learning for speaker verification in the i-vector space. IEICE Transactions on Information and Systems, 100, 215–219.

Goel, N., Kaur, S., & Bala, R. (2021). Dual branch convolutional neural network for copy move forgery detection. IET Image Processing, 15, 656–665.

Jati, A., & Georgiou, P. (2019). Neural predictive coding using convolutional neural networks toward unsupervised learning of speaker characteristics. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27, 1577–1589.

Jin, Z., Mysore, G. J., Diverdi, S., Lu, J., & Finkelstein, A. (2017). VoCo: Text-based insertion and replacement in audio narration. ACM Transactions on Graphics (TOG), 36, 1–13.

Liu, Y., Qian, Y., Chen, N., Fu, T., Zhang, Y., & Yu, K. (2015). Deep feature for text-dependent speaker verification. Speech Communication, 73, 1–13.

Loughran, R., Agapitos, A., Kattan, A., Brabazon, A., & O'Neill, M. (2017). Feature selection for speaker verification using genetic programming. Evolutionary Intelligence, 10, 1–21.

Ping, W., Peng, K., Gibiansky, A., Arik, S. O., Kannan, A., Narang, S., Raiman, J., & Miller, J. (2018). Deep Voice 3: 2000-speaker neural text-to-speech. In Proc. ICLR (pp. 214–217).

Rashno, A., Ahadi, S. M., & Kelarestaghi, M. (2015). Text-independent speaker verification with ant colony optimization feature selection and support vector machine. In 2015 2nd International Conference on Pattern Recognition and Image Analysis (IPRIA) (pp. 1–5). IEEE.

Reynolds, D. A. (1995). Speaker identification and verification using Gaussian mixture speaker models. Speech Communication, 17, 91–108.

Reynolds, D. A., Quatieri, T. F., & Dunn, R. B. (2000). Speaker verification using adapted Gaussian mixture models. Digital Signal Processing, 10, 19–41.

Robinson, D. W. (2008). Entropy and uncertainty. Entropy, 10, 493–506.

Rodriguez-Ortega, Y., Ballesteros, D. M., & Renza, D. (2021). Copy-move forgery detection (CMFD) using deep learning for image and video forensics. Journal of Imaging, 7, 59.

Rollins, J. (2015). Foundational methodology for data science. Whitepaper: Domino Data Lab Inc.

Satt, A., Rozenberg, S., & Hoory, R. (2017). Efficient emotion recognition from speech using deep learning on spectrograms. In Interspeech (pp. 1089–1093).

Shin, Y., & Balasingham, I. (2017). Comparison of hand-craft feature based SVM and CNN based deep learning framework for automatic polyp classification. In 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) (pp. 3277–3280). IEEE.

Taqi, A. M., Awad, A., Al-Azzo, F., & Milanova, M. (2018). The impact of multi-optimizers and data augmentation on TensorFlow convolutional neural network performance. In 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR) (pp. 140–145). IEEE.

Yaman, S., & Pelecanos, J. (2013). Using polynomial kernel support vector machines for speaker verification. IEEE Signal Processing Letters, 20, 901–904.

Yenigalla, P., Kumar, A., Tripathi, S., Singh, C., Kar, S., & Vepa, J. (2018). Speech emotion recognition using spectrogram & phoneme embedding. In Interspeech (pp. 3688–3692).

Zakariah, M., Khan, M. K., & Malik, H. (2018). Digital multimedia audio forensics: past, present and future. Multimedia Tools and Applications, 77, 1009–1040.

Zeng, Y., Mao, H., Peng, D., & Yi, Z. (2019). Spectrogram based multi-task audio classification. Multimedia Tools and Applications, 78, 3705–3722.

Zhao, J., Dong, Y., Zhao, X., Yang, H., Lu, L., & Wang, H. (2008). Advances in SVM-based system using GMM super vectors for text-independent speaker verification. Tsinghua Science and Technology, 13, 522–527.

Zhuo, L., Tan, S., Zeng, J., & Lit, B. (2018). Fake colorized image detection with channel-wise convolution based deep-learning framework. In 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) (pp. 733–736). IEEE.