

AI-Synthesized Voice Detection Using Neural Vocoder Artifacts

Chengzhe Sun, Shan Jia, Shuwei Hou, Siwei Lyu


University at Buffalo, State University of New York, NY, USA
{csun22, shanjia, shuweiho, siweilyu}@buffalo.edu

Abstract

Advancements in AI-synthesized human voices have created a growing threat of impersonation and disinformation, making it crucial to develop methods to detect synthetic human voices. This study proposes a new approach to identifying synthetic human voices by detecting artifacts of vocoders in audio signals. Most DeepFake audio synthesis models use a neural vocoder, a neural network that generates waveforms from temporal-frequency representations like mel-spectrograms. By identifying neural vocoder processing in audio, we can determine if a sample is synthesized. To detect synthetic human voices, we introduce a multi-task learning framework for a binary-class RawNet2 model that shares the feature extractor with a vocoder identification module. By treating vocoder identification as a pretext task, we constrain the feature extractor to focus on vocoder artifacts and provide discriminative features for the final binary classifier. Our experiments show that the improved RawNet2 model based on vocoder identification achieves high classification performance on the binary task overall. Codes and data can be found at https://github.com/csun22/Synthetic-Voice-Detection-Vocoder-Artifacts.

1. Introduction

In recent years, the rapid development of AI technologies, particularly deep learning, has resulted in a surge of synthetic media, commonly referred to as "DeepFakes." These media are highly realistic and can be challenging to distinguish from genuine content, making them a significant concern. While AI-synthesized still images and videos have received much attention, synthetic human voices have also undergone significant advances, achieving unprecedented quality and efficiency. These voices have the potential to revolutionize voice-based user interfaces for intelligent home assistants and wearable devices and can even help individuals who have lost their ability to speak due to conditions like strokes or Amyotrophic Lateral Sclerosis (ALS).

However, the increasing realism and availability of synthetic human voices also pose significant risks. Scammers have used AI-synthesized voices to impersonate individuals and deceive others into transferring money or providing sensitive information. In one instance, a scammer used an AI-synthesized voice to impersonate a UK company's CEO and tricked an employee into transferring a large sum of money to the scammer's account [1]. Moreover, trolls on the internet have used free AI voice cloning tools to imitate the voices of celebrities and create content ranging from memes to virulent hate speech [2].

While methods to detect AI-synthesized images and videos have been extensively studied, methods to detect synthetic human voices have received less attention and remain underdeveloped. This is because audio signals have different characteristics that make it difficult to apply image-based detection methods. Early detection methods often analyze statistical features unique to audio signals. For example, [3] compares higher-order statistics in the bi-spectral domain that capture local phase inconsistencies in synthetic voices. Recent studies [4–6] tend to use well-designed models for automatic and comprehensive feature learning to detect synthesized audio.

This work proposes a new approach to detecting synthetic human voices based on artifacts introduced by the neural vocoders used in the generation process. A neural vocoder is a specialized neural network that synthesizes audio waveforms from temporal-frequency representations like mel-spectrograms. Since neural vocoders are the final step in most AI-based audio synthesis models, it is unlikely that real audio signals will have been processed with neural vocoders. Thus, the vocoder artifacts can provide cues to identify synthetic human voices.

We aim to highlight the distinct signal artifacts left by neural vocoders in synthetic audio signals. The foremost objective of this study is to explore the artifacts of vocoders. To do this, we constructed a dataset called LibriSeVoc, which controls for other factors and only probes for the vocoder signature. The dataset contains evenly distributed data for the various vocoders. We used six different neural vocoders to create the LibriSeVoc dataset to reflect the diversity in architecture and mechanisms of neural vocoders.

Figure 1. The artifacts introduced by neural vocoders to a voice signal. Panels (left to right): Original, WaveNet, WaveRNN, MelGAN, Parallel WaveGAN, WaveGrad, DiffWave. We show the mel-spectrogram of the original signal (top left) and of the self-vocoded voice signals (remaining top panels). Their differences, corresponding to the artifacts introduced by each vocoder, are shown at the bottom.
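For readers who want to reproduce this kind of visualization, the following is a minimal sketch (not the authors' released code) that compares the mel-spectrogram of an original clip with that of a self-vocoded copy using librosa; the file names are placeholders.

```python
# Sketch: visualize vocoder artifacts as mel-spectrogram differences (cf. Figure 1).
import librosa
import numpy as np

def mel_db(path, sr=24000, n_mels=80):
    """Load a waveform and return its mel-spectrogram in dB."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)

original = mel_db("original.wav")              # placeholder paths
vocoded = mel_db("self_vocoded_wavenet.wav")

# Trim to a common length and take the absolute difference,
# which highlights the vocoder artifacts shown in the bottom row of Figure 1.
t = min(original.shape[1], vocoded.shape[1])
artifact_map = np.abs(original[:, :t] - vocoded[:, :t])
print(artifact_map.mean(), artifact_map.max())
```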

"Self-vocoding" samples were sourced from the same original audio signals to highlight the artifacts introduced by the vocoders. Figure 1 shows the differences in the mel-spectrogram of one original voice and its self-vocoded voice signals. Visible artifacts introduced by different neural vocoder models can be observed, which serve as the basis of our detection algorithm. While these artifacts may be subtle to visualize, this work demonstrates that they can be captured by a trained classifier.

To take advantage of the vocoder artifacts in detecting synthetic human voices, we developed a multi-task learning strategy. We used a binary classifier that shares the front-end feature extractor (e.g., RawNet2 [4]) with the vocoder identification module. This accommodates the insufficient number of existing real and synthetic human voice samples by including the self-vocoding samples in LibriSeVoc as additional training data. We treated vocoder identification as a pretext task to constrain the front-end feature extraction module to focus on vocoder-level artifacts and build highly discriminative features for the final binary classifier. Our experiments showed that our RawNet2 model achieved outstanding classification performance on our LibriSeVoc dataset and two public DeepFake audio datasets. We also evaluated our method under different post-processing scenarios and demonstrated good detection robustness to resampling and background noise.

The main contributions of our work are as follows:

• We propose to focus on neural vocoder artifacts as specific and interpretable features for detecting AI-synthesized audio;

• We designed a novel multi-task learning approach that combines a binary classification task with a vocoder identification module. This approach constrains the feature extractor to learn discriminative vocoder artifacts for detecting synthetic human voices;

• We provide LibriSeVoc as a dataset with self-vocoding samples created using six state-of-the-art vocoders to highlight and exploit the vocoder artifacts;

• Our proposed method was experimentally evaluated on three datasets and demonstrated its effectiveness.

Overall, our work provides a new and promising approach to detecting synthetic human voices by focusing on neural vocoder artifacts. Our multi-task learning strategy, together with the LibriSeVoc dataset, could serve as a valuable resource for future research in this area.

2. Related Works

In this section, we provide a literature review that is relevant to our research, including voice synthesis methods, state-of-the-art neural vocoder models, and existing AI-synthesized voice detection methods.

2.1. Human Voice Synthesis

The synthesis of human voice is a significant challenge in the field of artificial intelligence, with various practical applications such as voice-driven smart assistants and accessible user interfaces. Human voice synthesis can be classified into two general categories: text-to-speech (TTS) and voice conversion (VC). In this work, we focus on recent TTS and VC methods that use deep neural network models.

TTS systems transform input text into audio using the target voice and typically consist of three components: a text analysis module that converts the input text into linguistic features, an acoustic model that generates acoustic features in the form of a mel-spectrogram from the linguistic features, and a vocoder. Recent TTS models based on deep neural networks include WaveNet [7], Tacotron [8], Tacotron 2 [9], ClariNet [10], and FastSpeech 2s [11].

In contrast, VC models take a sample of one subject's voice as input and create output audio of another subject's voice for the same utterance. Recent VC models (e.g., [12–14]) usually work within the mel-spectrum domain and use deep neural network models to map between the mel-spectrograms of the input and output voice signals. These models use neural style transfer methods such as variational auto-encoder (VAE) or generative adversarial network (GAN) models to capture the utterance elements in the input voice and then combine them with the style of the output voice. The resulting mel-spectrogram is then reconstructed to an audio waveform using a neural vocoder. Both TTS and VC models employ deep neural network models trained on large-scale human voice corpora.

2.2. Neural Vocoders

Vocoders are crucial components in both TTS and VC models, as they synthesize output audio waveforms from mel-spectrograms. However, the transformation from audio waveforms to mel-spectrograms loses information due to binning and filtering, making it difficult to recover the audio waveform from a mel-spectrogram. In recent years, deep neural network-based vocoders have been developed, significantly improving training efficiency and synthesis quality. There are three main categories of existing neural vocoders: autoregressive models, diffusion models, and GAN-based models.

Autoregressive models are probabilistic models that predict the distribution of each audio waveform sample based on all previous samples. However, since this process involves linear sample-by-sample generation, autoregressive models are slower than other methods. WaveNet [7], the first autoregressive neural vocoder, can also serve as a TTS or VC model depending on the input. WaveRNN [15] is another autoregressive vocoder that uses a single-layer recurrent neural network to efficiently predict 16-bit raw audio samples from mel-spectrogram slices.

Diffusion models are probabilistic generative models that run diffusion and reverse processes. The diffusion process is characterized by a Markov chain that gradually adds Gaussian noise to an original signal until only noise remains. The reverse process is a de-noising stage that removes the added Gaussian noise and converts a sample back to the original signal. WaveGrad [16] and DiffWave [17] are two notable examples of diffusion-based vocoder models. While diffusion vocoders generate samples faster than autoregressive models, their reconstruction quality is inferior, and the generated samples may contain higher levels of noise and artifacts.

GAN-based models follow the generative adversarial network (GAN) architecture [18], which employs a deep neural network generator to model the waveform signal in the time domain and a discriminator to estimate the quality of the generated speech. MelGAN [19] and Parallel WaveGAN [20] are the two most commonly used GAN-based neural vocoders. Recent works have shown that GAN-based vocoders outperform autoregressive and diffusion models in both generation speed and generation quality.

2.3. AI-synthetic Human Voice Detection

In recent years, detecting synthetic human voices has become crucial due to their potential misuse. While extensive research has focused on audio authentication for speech synthesis and replay attack detection [21, 22], detecting AI-generated audio with high realism and varying models is a developing field. One of the earliest methods for detecting AI-synthetic audio is bi-spectral analysis [3]. This method captures subtle inconsistencies in the local phases of synthetic human voices. Real human voice signals have random local phases due to audio waves transmitting and bouncing around in the physical environment, while synthetic human voices do not have these characteristics. Although these local phase inconsistencies cannot be detected by the human auditory system, they can be identified through bi-spectral analysis. Another method, known as DeepSonar [23], uses network responses of audio signals as the feature to detect synthetic audio. The ASVspoof Challenge 2021 evaluates additional state-of-the-art synthetic voice detection methods. The Gaussian mixture models CQCC-GMM [24] and LFCC-GMM [24], a light convolutional neural network model LFCC-LCNN [24], and RawNet2 [4] have achieved the most reliable performance as primary baseline algorithms.

Recent studies have focused on improving the generalization capacity of fake audio detection. Various well-designed models have been proposed for DeepFake audio detection, such as the spectro-temporal graph attention network [4], unsupervised pretraining models [5], a biometric characteristics verification model [25], and a self-distillation framework [6]. However, Müller et al. [26] evaluated twelve architectures on their dataset of 37.9 hours of audio recordings and found that related work performs poorly on real-world data, with some models even degenerating to random guessing. Thus, there is a high demand for developing efficient and effective models for AI-synthesized audio detection.

3. Method

We aim to detect synthetic human voices by identifying vocoder artifacts present in the audio signals. Since real human voice signals typically do not carry vocoder artifacts, except for our self-vocoding signals that are specifically designed to have them, identifying the presence of vocoder artifacts is a key feature in detecting synthetic human voices.

To achieve this, let x be the waveform of a human voice signal with a label y ∈ {0, 1}, where 0 corresponds to a real human voice and 1 corresponds to a synthetic human voice. Our goal is to build a classifier ŷ = F_θ(x) that predicts the label of an input x. We utilize the recent RawNet2 model [4] as the backbone for our classifier, as it was designed to operate directly on raw waveforms. This reduces the risk of losing information related to neural vocoder artifacts compared with using pre-processed features such as mel-spectrograms or linear frequency cepstral coefficients (LFCCs).

The binary detection model can be constructed as a cascade of neural networks

    F_\theta(\mathbf{x}) = B_{\theta_B}(R_{\theta_R}(\mathbf{x}))    (1)

where R_{θ_R}(x) is the front-end RawNet2 model for feature extraction with its own set of parameters θ_R, B_{θ_B} is a back-end binary classifier and θ_B are its specific parameters, with θ = (θ_R, θ_B).
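As an illustration of Eq. (1), the following is a minimal PyTorch sketch of the cascade F_θ(x) = B_{θ_B}(R_{θ_R}(x)). The small convolutional front-end is only a hypothetical stand-in for RawNet2, not the architecture used in the paper.

```python
# Sketch of the cascade in Eq. (1): front-end R followed by binary head B.
import torch
import torch.nn as nn

class FrontEnd(nn.Module):            # R_{theta_R}: raw waveform -> embedding
    def __init__(self, emb_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=251, stride=4), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.proj = nn.Linear(64, emb_dim)

    def forward(self, wave):          # wave: (batch, samples)
        h = self.conv(wave.unsqueeze(1)).squeeze(-1)
        return self.proj(h)

class BinaryHead(nn.Module):          # B_{theta_B}: embedding -> real/fake logits
    def __init__(self, emb_dim=128):
        super().__init__()
        self.fc = nn.Linear(emb_dim, 2)

    def forward(self, feat):
        return self.fc(feat)

R, B = FrontEnd(), BinaryHead()
logits = B(R(torch.randn(4, 24000)))  # F_theta(x) = B(R(x)), as in Eq. (1)
```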

Figure 2. Framework of the proposed synthesized voice detection method. Audio clips pass through a fixed sinc filter, residual blocks, and a GRU, followed by fully connected layers that feed two heads: a vocoder identification module trained with a multi-class loss (Lm) and a binary real/fake classifier trained with a binary loss (Lb). Rather than simply learning discriminative features for binary classification, we incorporate a vocoder identification module with a multi-class classification loss to direct the feature extraction network toward prioritizing vocoder artifacts.

We can train this classifier directly, as in previous work [27], by solving

    \min_\theta \sum_{(\mathbf{x},y) \in T} L_\text{b}(y, F_\theta(\mathbf{x}))    (2)

where L_b(y, ŷ) can be any loss function for binary classification, for instance, the cross-entropy loss. The variable T refers to the training dataset that contains labeled real and synthetic human voice samples. However, this method assumes that a large number of synthetic human voice samples is available, which is increasingly difficult to achieve due to the rapid advancement of synthesis technology. Additionally, this approach does not take into account the unique statistical properties of neural vocoders, which can be an essential indicator for synthetic audio signals.

To address the aforementioned problem, we propose a multi-task learning approach that combines binary classification with a vocoder identification task. This approach is designed to emphasize the importance of identifying vocoder-level artifacts in synthetic audio signals. Specifically, we augment our detection model with a vocoder identifier M_{θ_M}, which categorizes a synthetic voice into one of the C possible neural vocoder models (c ∈ [0, C], where C ≥ 2). Our goal is to ensure that the feature extractor is trained to capture the distinct statistical characteristics of vocoders, making it more sensitive to these features. This approach is similar to self-supervised representation learning [28]. To this end, we form a new classification objective, as

    \min_{\theta_B,\theta_R} \lambda \sum_{(\mathbf{x},y) \in T} L_\text{b}(y, B_{\theta_B}(R_{\theta_R}(\mathbf{x}))) + \min_{\theta_M,\theta_R} (1-\lambda) \sum_{(\mathbf{x},c) \in T'} L_\text{m}(c, M_{\theta_M}(R_{\theta_R}(\mathbf{x})))    (3)

In this equation, L_m is a multi-class loss function, and we use the softmax loss in our experiments. T′ is a dataset containing synthetic human voices created with different neural vocoders as the corresponding labels. This dataset is much easier to create by performing "self-vocoding", i.e., creating synthetic human voices by running real samples through the mel-spectrogram transform and its inverse, the latter performed with neural vocoders. We created such a dataset, LibriSeVoc, which will be described in detail in Section 4.1. λ is an adjustable hyper-parameter that controls the trade-off between the two loss terms.

The whole framework of our detection model is shown in Figure 2. Note that the two classification modules in the new learning objective function serve different roles. The first term pertains to binary classification and aims to distinguish between authentic and fake audio. Meanwhile, the second term focuses on vocoder identification, serving as a pretext task to direct the feature extractor's attention toward vocoder-related artifacts. The two tasks share the feature extraction component so that the distinct features of the vocoders can be captured and transferred to the binary classification task.
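The following is a minimal PyTorch sketch of the combined objective in Eq. (3). The vocoder-identification head and the loss helper are hypothetical names introduced here for illustration; R and B are assumed to be front-end and binary-head modules like those sketched after Eq. (1).

```python
# Sketch of the multi-task objective in Eq. (3).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VocoderHead(nn.Module):         # M_{theta_M}: embedding -> vocoder-class logits
    def __init__(self, emb_dim=128, num_classes=7):   # e.g., original + 6 vocoders
        super().__init__()
        self.fc = nn.Linear(emb_dim, num_classes)

    def forward(self, feat):
        return self.fc(feat)

def multitask_loss(R, B, M, wave_t, y_bin, wave_tp, y_voc, lam=0.5):
    """lam * L_b over (x, y) in T  +  (1 - lam) * L_m over (x, c) in T'."""
    loss_b = F.cross_entropy(B(R(wave_t)), y_bin)    # binary real/fake term
    loss_m = F.cross_entropy(M(R(wave_tp)), y_voc)   # vocoder identification term
    return lam * loss_b + (1.0 - lam) * loss_m
```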
4. Experiments

In this section, we present a series of experiments to evaluate the effectiveness of our proposed synthetic voice detection method. We start by introducing our LibriSeVoc dataset, as well as two publicly available datasets that we use for evaluation. Then, we compare our method with state-of-the-art models on all three datasets, using both intra- and cross-dataset testing scenarios. Finally, we examine the robustness of our detection models to common post-processing operations.

4.1. Datasets

Three DeepFake audio datasets are considered in the experiments, namely our LibriSeVoc and two public datasets, WaveFake [29] and ASVspoof 2019 [24].

Table 1. Data details of voice samples from real and each vocoder category in the LibriSeVoc dataset.

Model Train-hour Train-sample Dev-hour Dev-sample Test-hour Test-sample Total-hour Total-sample


Real 20.95 7,920 6.97 2,640 7.00 2,641 34.92 13,201
WaveNet (A01) 20.87 7,920 6.97 2,640 6.91 2,641 34.77 13,201
WaveRNN (A02) 20.95 7,920 7.00 2,640 6.94 2,641 34.90 13,201
WaveGrad (D01) 20.98 7,920 7.01 2,640 6.95 2,641 34.95 13,201
DiffWave (D02) 20.98 7,920 7.01 2,640 6.95 2,641 34.94 13,201
MelGAN (G01) 20.76 7,920 6.94 2,640 6.88 2,641 34.59 13,201
Parallel WaveGAN (G02) 20.76 7,920 6.94 2,640 6.88 2,641 34.59 13,201
Total 146.25 55,440 48.84 18,480 48.51 18,487 243.66 92,407

LibriSeVoc Dataset. We have created a new open-source dataset, named LibriSeVoc, for the task of vocoder artifact detection. The statistical features of neural vocoders have not been extensively studied before, and the availability of large-scale datasets for the task of vocoder identification, particularly those that include multiple types of vocoders, is limited. We derived the LibriSeVoc dataset from the widely used LibriTTS speech corpus [30], which is often utilized in text-to-speech research [31–33]. The LibriTTS corpus is based on the LibriSpeech dataset [34], which contains samples extracted from audiobooks available on LibriVox [35].

We consider six state-of-the-art neural vocoders to generate speech samples in the LibriSeVoc dataset, namely, WaveNet and WaveRNN from the autoregressive vocoders, MelGAN and Parallel WaveGAN from the GAN-based vocoders, and WaveGrad and DiffWave from the diffusion-based vocoders. Specifically, we have 34.92 hours of real audio samples and have generated self-vocoded audio using the six vocoders for each sample. A total of 208.74 hours of synthesized samples are created in the dataset.

Specifically, each vocoder synthesizes waveform samples from a given mel-spectrogram extracted from an original sample; we refer to this process as "self-vocoding." By providing each vocoder with the same mel-spectrogram, we ensure that any unique artifacts present in the synthesized samples are attributable to the specific vocoder used to reconstruct the audio signal. The statistical information of the LibriSeVoc dataset is summarized below.

• The dataset contains 13,201 real audio samples and 79,206 fake audio samples from six vocoders (each with 13,201 audio samples).

• A total of 92,407 audio samples are included in the dataset, with audio lengths from 5 seconds to 34 seconds at 24 kHz. Figure 3 shows the detailed audio length distribution in the dataset.

• We further split the whole dataset into three non-overlapping sets for training (55,440 samples), development (18,480 samples), and testing (18,487 samples) at a ratio of 6:2:2. More details can be found in Table 1.

Figure 3. Histogram of audio sample lengths in LibriSeVoc.

WaveFake Dataset. This dataset [29] collects DeepFake audios from six vocoder architectures, including MelGAN, FullBand-MelGAN, MultiBand-MelGAN, HiFi-GAN, Parallel WaveGAN, and WaveGlow. It consists of approximately 196 hours of generated audio files derived from the LJSPEECH [37] dataset. Note that five of the six vocoder models in WaveFake are GAN-based vocoders. In contrast, our LibriSeVoc dataset aims to consider a high diversity of vocoders for artifact extraction and covers three categories of widely used vocoder structures, including autoregressive models, diffusion models, and GAN-based models.

ASVspoof 2019 Dataset. This dataset [36] is derived from the VCTK base corpus [38], which includes speech data captured from 107 speakers. It contains three major forms of spoofing attacks, namely synthetic, converted, and replayed speech. We labeled the samples from different vocoders as different classes in the training set for the multi-class loss calculation. More details about the dataset can be found in Table 2.

4.2. Implementation Details

We utilize the RawNet2 [4] model as the backbone for feature learning. Adaptive Moment Estimation (Adam) [39] is used as our optimizer with a learning rate of 0.0001 and a batch size of 32. The loss weight λ is set to 0.5 in the experiments.
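As a rough illustration of how these settings fit together, the sketch below wires the reported hyper-parameters (Adam, learning rate 0.0001, batch size 32, λ = 0.5) into a single training step over generic front-end, binary, and vocoder-identification modules. It is an assumption-laden sketch, not the released implementation; in particular, it reuses one batch for both loss terms, whereas Eq. (3) allows T and T′ to be sampled separately.

```python
# Sketch of one training step with the hyper-parameters reported in Sec. 4.2.
import torch
import torch.nn.functional as F

def build_optimizer(R, B, M, lr=1e-4):
    # Adam over the shared front-end and both task-specific heads.
    params = list(R.parameters()) + list(B.parameters()) + list(M.parameters())
    return torch.optim.Adam(params, lr=lr)

def train_step(R, B, M, optimizer, batch, lam=0.5):
    # batch: dict of waveforms (batch size 32) with binary and vocoder-class labels.
    optimizer.zero_grad()
    feat = R(batch["wave"])                      # shared feature extraction
    loss = lam * F.cross_entropy(B(feat), batch["is_fake"]) + \
           (1.0 - lam) * F.cross_entropy(M(feat), batch["vocoder_id"])
    loss.backward()
    optimizer.step()
    return loss.item()
```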
Table 2. Details of evaluation datasets.

Dataset #Vocoder type Frequency Training size Dev size Testing size
LibriSeVoc 6 24kHz 55,440 18,480 18,487
WaveFake Dataset [29] 6 16kHz 64,000 16,000 24,800
ASVspoof 2019 [36] 6 16kHz 25,380 24,844 71,237

To report the detection performance, we calculate the Equal Error Rate (EER) following previous studies [24, 29, 40].

Table 3. Detection EER (%) of intra-dataset testing on three datasets.

Methods LibriSeVoc WaveFake ASVspoof
LFCC-LCNN [36] 0.14 0.19 11.60
RawNet2 [4] 0.17 0.32 6.10
WavLM [41] 0.45 2.92 6.94
Wav2Vec2-XLS-R [42] 1.54 2.33 13.48
Ours 0.13 0.19 4.54
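For reference, the following is a minimal sketch of the EER computation used in these tables, based on the standard ROC formulation with scikit-learn; it is a generic implementation, not code from the paper.

```python
# Sketch: Equal Error Rate (EER) from detection scores.
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """labels: 1 for fake, 0 for real; scores: higher means 'more likely fake'."""
    fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))      # operating point where FPR ~= FNR
    return (fpr[idx] + fnr[idx]) / 2

# toy example with four scored utterances
print(equal_error_rate(np.array([0, 0, 1, 1]), np.array([0.1, 0.4, 0.35, 0.8])))
```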
4.3. Baseline methods

LFCC-LCNN [36]. The LFCC-LCNN method combines LFCC feature extraction with an LCNN classifier (DNN). LFCC-LCNN has been widely used since ASVspoof 2019 and achieved the second-best performance in the Speech Deepfake track of the ASVspoof 2021 challenge.

RawNet2 [4]. The RawNet2 model is based on DNN speaker embedding extraction with the raw waveform as input. This powerful model uses a technique named feature map scaling, which scales feature maps similarly to squeeze-excitation. It performed the best in the ASVspoof 2021 Speech Deepfake track.

WavLM [41]. The WavLM model from Microsoft is a self-supervised pre-trained multilingual model that can be used for a variety of downstream speech tasks. It is specifically designed to maintain speech content modeling based on masked speech prediction while also improving the potential for non-ASR tasks through speech denoising. Additionally, WavLM uses a gated relative position bias in its Transformer structure to better capture the sequence ordering of input speech.

Wav2Vec2-XLS-R [42]. The Wav2Vec2-XLS-R model is a large-scale multilingual pre-trained model for speech-related tasks developed by Meta AI. It uses the wav2vec 2.0 objective for speech representation learning in 128 languages. This model can be fine-tuned for downstream tasks such as automatic speech recognition, translation, or speech classification.

4.4. Intra-dataset Evaluation

Synthetic Human Voice Detection. We first report the performance on the main task, i.e., the classification of real and synthetic human voices on the LibriSeVoc and WaveFake datasets. The results in the first three columns of Table 3 show that our multi-task learning-based RawNet2 model with vocoder identification as a pretext task achieves the lowest EER of 0.13% on LibriSeVoc and 0.19% on WaveFake, clearly outperforming the other baselines re-trained on each dataset.

Evaluation on ASVspoof 2019. We then evaluate our method on the ASVspoof 2019 dataset with spoofed and fake audios. The results in the last column of Table 3 show that our model performed the best in detecting various audio spoofing attacks.

4.5. Cross-dataset Evaluation

In cross-dataset testing, we evaluated the performance of detection models pre-trained on our LibriSeVoc dataset on the WaveFake dataset, which contains voice samples not included in the training set. Table 4 displays the comparison results with the EER for each vocoder of the WaveFake dataset. It is evident that all the methods show a noticeable degradation in performance during cross-dataset testing, indicating poor generalization ability. Our proposed method achieved the lowest EERs in detecting unseen voice samples generated with the same vocoders as those in our LibriSeVoc dataset, specifically the MelGAN and Parallel WaveGAN models. However, it is not surprising that the model exhibited poorer performance on the other, unseen vocoders. This highlights a limitation of our method, particularly when compared to the original RawNet2, as the vocoder artifacts are not generalized and are constrained by the training vocoder categories.

4.6. Robustness Evaluation

We evaluate the robustness of our detection method against common post-processing operations by constructing an augmented, degraded dataset from the LibriSeVoc test set. First, we resample the input speech to intermediate sampling rates (8 kHz, 16 kHz, 22.05 kHz, 32 kHz, and 44.1 kHz) and then resample back to the original sampling rate (24 kHz). Additionally, we introduce background noise by adding a single pre-recorded crowd noise sample at three SNR values (8 dB, 10 dB, and 20 dB). We randomly choose between the original, resampled, or noisy speech segments, with probabilities of 40%, 40%, and 20%, respectively. We use this augmented dataset to evaluate the robustness of our detection method against these post-processing operations.
Table 4. Detection EER (%) of cross-dataset testing on the WaveFake dataset. MelGAN and Parallel WaveGAN are seen vocoders (also used in LibriSeVoc); WaveGlow, MultiBand MelGAN, FullBand MelGAN, and HiFiGAN are unseen.

Methods Overall MelGAN Parallel WaveGAN WaveGlow MultiBand MelGAN FullBand MelGAN HiFiGAN
LFCC-LCNN [36] 77.47 30.06 90.98 98.40 97.17 99.45 98.02
RawNet2 [4] 25.98 3.39 44.31 0.58 23.34 43.82 33.38
WavLM [41] 50 50 50 50 50 50 50
Wav2Vec2-XLS-R [42] 50 50 50 50 50 50 50
Ours 26.95 2.16 35.53 4.60 31.38 45.35 35.80

Figure 4. Confusion matrices evaluated on LibriSeVoc. Top: original testing set. Bottom: post-processed testing set. (OGA: ground truth, A01: WaveNet, A02: WaveRNN, D01: WaveGrad, D02: DiffWave, G01: MelGAN, G02: Parallel WaveGAN.)

The confusion matrices in Figure 4 further compare the detection and vocoder identification performance on the original LibriSeVoc test set (with a detection EER of 0.13%) and on the post-processed dataset (with an EER of 2.73%). This shows that our method can extract discriminative vocoder-level features for vocoder identification and is also robust to common data post-processing operations.

5. Conclusion

In this work, we propose a novel approach for detecting synthetic human voices by identifying the traces of neural vocoders in audio signals. To leverage the vocoder artifacts for synthetic human voice detection, we introduce a binary-class RawNet2 model that shares the front-end feature extractor with the one for vocoder identification. We employ a multi-task learning strategy where vocoder identification serves as a pretext task to constrain the front-end feature extraction module for building the final binary classifier. Our experiments demonstrate that our method achieves high classification performance overall.

While our proposed approach has shown promising results, there is still room for improvement and several areas for future work. Firstly, we plan to expand the LibriSeVoc dataset to include a wider variety of real audio signals and environments. This will help to increase the diversity of the dataset and improve the generalization capability of our model. Secondly, although our method is effective in identifying synthetic audio through vocoder artifacts, we acknowledge that it is an indirect approach. In future work, we aim to explore methods that can directly differentiate real and synthetic audio by combining cues from vocoders and other signal features of audio DeepFakes. Finally, we will explore the use of our approach for detecting and preventing the misuse of synthetic audio in various applications, such as voice cloning or DeepFake videos.

Social Impact Statement

The quality and efficiency of synthetic human voices generated by AI models have reached unparalleled heights. However, with the increasing realism and accessibility of these voices, there are significant risks involved. Fraudsters have exploited AI-generated voices to impersonate individuals and trick others into sharing sensitive information or transferring money. In this work, we propose a method that detects vocoder artifacts, which can expose AI-synthesized voices and mitigate the risks associated with them.

To ensure transparency, the code and data related to this work are made available as open source at https://github.com/csun22/Synthetic-Voice-Detection-Vocoder-Artifacts.

Acknowledgment. This work is supported by the Center for Identification Technology Research (CITeR) and the National Science Foundation under Grant No. 1822190.

References

[1] Forbes, "A Voice Deepfake Was Used To Scam A CEO Out Of $243,000," https://www.cnn.com/2020/02/20/tech/fake-faces-deepfake/index.html, November 2019.
[2] "4chan users embrace AI voice clone tool to generate celebrity hatespeech," https://www.theverge.com/2023/1/31/23579289/ai-voice-clone-deepfake-abuse-4chan-elevenlabs, January 2023.
[3] E. A. AlBadawy, S. Lyu, and H. Farid, "Detecting AI-synthesized speech using bispectral analysis," in CVPR Workshops, 2019, pp. 104–109.
[4] H. Tak, J.-w. Jung, J. Patino, M. Kamble, M. Todisco, and N. Evans, "End-to-end spectro-temporal graph attention networks for speaker verification anti-spoofing and speech deepfake detection," arXiv preprint arXiv:2107.12710, 2021.
[5] Z. Lv, S. Zhang, K. Tang, and P. Hu, "Fake audio detection based on unsupervised pretraining models," in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 9231–9235.
[6] J. Xue, C. Fan, J. Yi, C. Wang, Z. Wen, D. Zhang, and Z. Lv, "Learning from yourself: A self-distillation method for fake speech detection," arXiv preprint arXiv:2303.01211, 2023.
[7] A. van den Oord, S. Dieleman, H. Zen et al., "WaveNet: A generative model for raw audio," arXiv, 2016. [Online]. Available: https://arxiv.org/abs/1609.03499
[8] Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. V. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, "Tacotron: A fully end-to-end text-to-speech synthesis model," CoRR, vol. abs/1703.10135, 2017. [Online]. Available: http://arxiv.org/abs/1703.10135
[9] Z. Wang, Y. Liu, and L. Shan, "CE-Tacotron2: End-to-end emotional speech synthesis," in 2021 60th Annual Conference of the Society of Instrument and Control Engineers of Japan (SICE), 2021, pp. 48–52.
[10] W. Ping, K. Peng, and J. Chen, "ClariNet: Parallel wave generation in end-to-end text-to-speech," arXiv preprint arXiv:1807.07281, 2018.
[11] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, "FastSpeech 2: Fast and high-quality end-to-end text to speech," arXiv preprint arXiv:2006.04558, 2020.
[12] E. A. AlBadawy and S. Lyu, "Voice conversion using speech-to-speech neuro-style transfer," in Proc. Interspeech, 2020.
[13] L.-H. Chen, Z.-H. Ling, L.-J. Liu, and L.-R. Dai, "Voice conversion using deep neural networks with layer-wise generative training," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 1859–1872, 2014.
[14] S. H. Mohammadi and A. Kain, "Voice conversion using deep neural networks with speaker-independent pre-training," in 2014 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2014, pp. 19–23.
[15] N. Kalchbrenner, E. Elsen, K. Simonyan et al., "Efficient neural audio synthesis," in ICML, 2018.
[16] N. Chen, Y. Zhang, H. Zen et al., "WaveGrad: Estimating gradients for waveform generation," in ICLR, 2020.
[17] Z. Kong, W. Ping, J. Huang et al., "DiffWave: A versatile diffusion model for audio synthesis," in ICLR, 2020.
[18] I. Goodfellow, J. Pouget-Abadie, M. Mirza et al., "Generative adversarial nets," NeurIPS, vol. 27, 2014.
[19] K. Kumar, R. Kumar, T. de Boissiere et al., "MelGAN: Generative adversarial networks for conditional waveform synthesis," arXiv, 2019.
[20] R. Yamamoto, E. Song, and J.-M. Kim, "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram," in ICASSP, 2020.
[21] Z. Wu, N. Evans, T. Kinnunen, J. Yamagishi, F. Alegre, and H. Li, "Spoofing and countermeasures for speaker verification: A survey," Speech Communication, vol. 66, pp. 130–153, 2015.
[22] H. A. Patil and M. R. Kamble, "A survey on replay attack detection for automatic speaker verification (ASV) system," in 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2018, pp. 1047–1053.
[23] R. Wang, F. Juefei-Xu, Y. Huang, Q. Guo, X. Xie, L. Ma, and Y. Liu, "DeepSonar: Towards effective and robust detection of AI-synthesized fake voices," in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 1207–1216.

[24] M. Todisco, X. Wang, V. Vestman, M. Sahidullah, H. Delgado, A. Nautsch, J. Yamagishi, N. Evans, T. Kinnunen, and K. A. Lee, "ASVspoof 2019: Future horizons in spoofed and fake audio detection," arXiv preprint arXiv:1904.05441, 2019.
[25] A. Pianese, D. Cozzolino, G. Poggi, and L. Verdoliva, "Deepfake audio detection by speaker verification," in 2022 IEEE International Workshop on Information Forensics and Security (WIFS). IEEE, 2022, pp. 1–6.
[26] N. M. Müller, P. Czempin, F. Dieckmann, A. Froghyar, and K. Böttinger, "Does audio deepfake detection generalize?" arXiv preprint arXiv:2203.16263, 2022.
[27] G. Yang, S. Yang, K. Liu et al., "Multi-band MelGAN: Faster waveform generation for high-quality text-to-speech," in SLT Workshop, 2021, pp. 492–498.
[28] L. Ericsson, H. Gouk, C. C. Loy, and T. M. Hospedales, "Self-supervised representation learning: Introduction, advances and challenges," IEEE Signal Processing Magazine, vol. 32, 2022.
[29] J. Frank and L. Schönherr, "WaveFake: A data set to facilitate audio deepfake detection," in Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS 2021), 2021.
[30] H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, "LibriTTS: A corpus derived from LibriSpeech for text-to-speech," arXiv preprint arXiv:1904.02882, 2019.
[31] J. Kim, S. Kim, J. Kong, and S. Yoon, "Glow-TTS: A generative flow for text-to-speech via monotonic alignment search," Advances in Neural Information Processing Systems, vol. 33, pp. 8067–8077, 2020.
[32] R. Valle, K. Shih, R. Prenger, and B. Catanzaro, "Flowtron: An autoregressive flow-based generative network for text-to-speech synthesis," arXiv preprint arXiv:2005.05957, 2020.
[33] M. Chen, X. Tan, Y. Ren, J. Xu, H. Sun, S. Zhao, T. Qin, and T.-Y. Liu, "MultiSpeech: Multi-speaker text to speech with transformer," arXiv preprint arXiv:2006.04664, 2020.
[34] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "LibriSpeech: An ASR corpus based on public domain audio books," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.
[35] "LibriVox – free public domain audiobooks," https://librivox.org/.
[36] G. Lavrentyeva, S. Novoselov, A. Tseren, M. Volkova, A. Gorlanov, and A. Kozlov, "STC antispoofing systems for the ASVspoof2019 challenge," arXiv preprint arXiv:1904.05576, 2019.
[37] K. Ito and L. Johnson, "The LJ Speech dataset," https://keithito.com/LJ-Speech-Dataset/, 2017.
[38] C. Veaux, J. Yamagishi, K. MacDonald et al., "Superseded-CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit," 2016.
[39] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[40] J. Yamagishi, X. Wang, M. Todisco, M. Sahidullah, J. Patino, A. Nautsch, X. Liu, K. A. Lee, T. Kinnunen, N. Evans et al., "ASVspoof 2021: Accelerating progress in spoofed and deepfake speech detection," arXiv preprint arXiv:2109.00537, 2021.
[41] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, J. Wu, L. Zhou, S. Ren, Y. Qian, Y. Qian, M. Zeng, X. Yu, and F. Wei, "WavLM: Large-scale self-supervised pre-training for full stack speech processing," IEEE Journal of Selected Topics in Signal Processing, vol. 16, pp. 1–14, 2022.
[42] T. Arun Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. Platen, Y. Saraf, J. Pino, A. Baevski, A. Conneau, and M. Auli, "XLS-R: Self-supervised cross-lingual speech representation learning at scale," 2021.

