
SYNTHESIZED VOICE DETECTION

Tran Vu Manh Duc, Do Tuan Nam, Trinh Ha Phuong, Nguyen Thanh Lam, Pham Chi Bang

Abstract

The rapid advancement of artificial intelligence has facilitated the generation of highly realistic synthetic
voices, which present substantial risks to cybersecurity, communication integrity, and public trust. This
study introduces a robust detection system for synthetic audio (deepfake voices) by integrating
Convolutional Neural Networks (CNNs) and Bidirectional Long Short-Term Memory (BiLSTM) networks.
The system utilizes spectrogram-based representations of audio signals to effectively extract both spatial
and temporal features, thereby enabling a comprehensive assessment of voice authenticity. A carefully
curated dataset, LibriSeVoc, comprising genuine and DiffWave-generated synthetic voices, was employed
to train the model. Advanced preprocessing techniques, such as noise reduction, bandpass filtering, and
Mel spectrogram transformation, were employed to optimize input data quality. The model achieved a high
accuracy of 97% on the evaluation dataset, demonstrating strong generalizability across diverse voice
samples. Despite its computational complexity, the proposed system serves as a crucial tool for mitigating the
risks associated with synthetic audio misuse in practical applications. All our work is available at:
ronanhansel/deepfake-audio-detector

1 INTRODUCTION

Real-Life Problem

The rise of artificial intelligence has enabled the creation of highly realistic
synthetic audio, known as deepfake voices. While these technologies have legitimate uses
in entertainment and virtual assistants, they also pose serious threats, such as
deception, impersonation, and fraud. These capabilities raise concerns in sectors
like finance, cybersecurity, and public safety. The ability to create convincing
synthetic voices undermines audio reliability, emphasizing the need for effective
detection methods.

Aims of the Project

This project aims to develop a system to detect deepfake voices, distinguishing
between genuine and synthesized audio with high accuracy. Using deep learning
techniques, specifically Convolutional Neural Networks (CNNs), Bidirectional Long
Short-Term Memory (BiLSTM) networks, and fully connected (FC) layers, the project
aims to enhance communication security and reduce the risks of synthetic audio misuse.
2 RELATED WORK

Spectrogram

Spectrograms are widely used as visual representations of audio signals, capturing
critical time-frequency information for analysis. In deepfake voice detection,
spectrograms serve as input data for neural networks, transforming audio signals
into a format suitable for deep learning models.

Convolutional Neural Networks (CNN)

Convolutional Neural Networks (CNNs) are effective in analyzing spectrograms for
audio classification. By detecting patterns in visual data, CNNs are well suited for
distinguishing between real and synthetic voice signals. In related studies, CNNs
have processed spectrogram inputs to extract features that identify subtle
differences between genuine and synthesized audio, improving detection
performance.

Bidirectional Long Short-Term Memory Networks (BiLSTM)

Bidirectional LSTMs (BiLSTMs) extend standard LSTMs by processing sequential
data in both forward and backward directions, allowing for a more complete
understanding of temporal dependencies. This dual perspective enables BiLSTMs
to capture both past and future context, making them particularly effective for
tasks that involve sequential patterns, such as audio signal processing.

3 SYNTHETIC VOICE DETECTOR

Data Collection

The dataset comprises the ground truth (GT) and DiffWave folders of LibriSeVoc,
consisting of approximately 25,000 audio files categorized as genuine or synthetic.
LibriSeVoc is derived from LibriSpeech, a corpus with established credibility and
utility in academic and practical contexts, and is designed to balance male and
female voices, ensuring fair gender representation. This balance is critical for
training unbiased models. Additionally, the dataset includes diverse accents,
enhancing the model's robustness to different speech patterns. The GT folder
contains real voice recordings, while the DiffWave folder includes synthetic audio
generated using DiffWave technology. DiffWave leverages diffusion-based generative
techniques that produce high-quality, natural-sounding speech, making it an ideal
benchmark for testing the robustness of detection models.
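The exact directory layout is not described beyond the two folder names, so the following is only a minimal sketch of how the files might be indexed and labelled (0 for genuine, 1 for synthetic); the paths, file extension, and use of pandas are assumptions.

```python
# Hypothetical indexing of the two LibriSeVoc folders; folder names, paths
# and the .wav extension are assumptions, not taken from the report.
from pathlib import Path

import pandas as pd

DATA_ROOT = Path("LibriSeVoc")  # assumed dataset location

def index_dataset(root: Path) -> pd.DataFrame:
    """Collect audio paths from the GT and diffwave folders with labels."""
    rows = []
    for folder, label in [("gt", 0), ("diffwave", 1)]:
        for wav in sorted((root / folder).glob("*.wav")):
            rows.append({"path": str(wav), "label": label})
    return pd.DataFrame(rows)

files = index_dataset(DATA_ROOT)
print(files["label"].value_counts())  # quick check of class balance
```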

Preprocessing

The preprocessing pipeline involved several key steps to optimize the dataset for
training. These included bandpass filtering, noise reduction, dynamic range
adjustment, and segment extraction to prepare the audio for feature extraction.

Bandpass Filtering: A bandpass filter was applied to the audio signals to retain
frequencies between 250 Hz and 4000 Hz, keeping the most critical speech
components while filtering out irrelevant background noise. By focusing on this
frequency range, we targeted the key elements of human speech, ensuring that the
resulting features accurately represent the spoken content.
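A minimal sketch of this filtering step is shown below, using a Butterworth bandpass filter from SciPy; the filter order is an assumption, as the report does not specify it.

```python
# Sketch of the 250-4000 Hz bandpass step; the filter order is an assumption.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass(audio: np.ndarray, sr: int, low: float = 250.0,
             high: float = 4000.0, order: int = 5) -> np.ndarray:
    """Keep the 250-4000 Hz band that carries most speech content."""
    sos = butter(order, [low, high], btype="bandpass", fs=sr, output="sos")
    return sosfiltfilt(sos, audio)
```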

Noise Reduction: To further enhance the quality of the audio, a custom noise
reduction process was applied to reduce low-amplitude noise. This process
involved analyzing the decibel level of the audio signals and selectively reducing the
amplitude of segments that fell below a specified threshold. This dynamic
adjustment ensured that noise was minimized without compromising the integrity
of the primary speech signal.
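The sketch below illustrates one way such a decibel-threshold reduction could be implemented; the frame length, threshold, and attenuation factor are assumptions rather than the values used in our pipeline.

```python
# Illustrative decibel-threshold noise reduction: quiet frames are attenuated
# rather than removed. All numeric parameters here are assumptions.
import numpy as np

def reduce_low_level_noise(audio: np.ndarray, frame_len: int = 2048,
                           threshold_db: float = -40.0,
                           attenuation: float = 0.1) -> np.ndarray:
    out = audio.copy()
    for start in range(0, len(out), frame_len):
        frame = out[start:start + frame_len]
        rms = np.sqrt(np.mean(frame ** 2) + 1e-12)
        level_db = 20.0 * np.log10(rms + 1e-12)
        if level_db < threshold_db:
            out[start:start + frame_len] = frame * attenuation
    return out
```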

Segmentation: The audio signals were segmented into smaller chunks to facilitate
more granular analysis. Each audio file was divided into 30 segments of 1 second
each, with overlapping segments to ensure coverage of the entire recording. This
segmentation strategy allowed the model to learn from shorter, more manageable
units of data, which is particularly beneficial for capturing temporal dynamics
within speech. The segments were then converted into Mel spectrograms, a type of
time-frequency representation that provides a detailed view of the energy
distribution across frequency bands. The Mel spectrograms were further
converted to a decibel scale to highlight variations in the energy levels, making it
easier for the model to identify distinguishing features.
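A sketch of the overlapping segmentation is given below; the report does not state the exact overlap, so this sketch simply spaces the window start points evenly so that 30 one-second windows span the whole recording.

```python
# Sketch of the segmentation step: 30 overlapping one-second windows whose
# start points are spread evenly over the recording (the exact overlap used
# in the original pipeline is an assumption).
import numpy as np

def segment_audio(audio: np.ndarray, sr: int, n_segments: int = 30,
                  segment_seconds: float = 1.0) -> np.ndarray:
    seg_len = int(segment_seconds * sr)
    if len(audio) < seg_len:
        audio = np.pad(audio, (0, seg_len - len(audio)))
    starts = np.linspace(0, len(audio) - seg_len, n_segments).astype(int)
    return np.stack([audio[s:s + seg_len] for s in starts])
```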

Spectrogram Generation: For each segment, a Mel spectrogram was computed
using a Fast Fourier Transform (FFT) with 2048 frequency bins and a hop length
of 512. This representation captures the time-frequency structure of the audio,
which is crucial for identifying the distinctive characteristics of genuine and
synthetic speech. Mel spectrograms were used because they provide a perceptually
relevant representation of audio, focusing on features that are more closely aligned
with human auditory perception. The spectrograms were subsequently normalized
to ensure consistency across the dataset, improving the model's ability to learn
meaningful patterns.
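A sketch of this step with librosa is shown below; the FFT size and hop length follow the values quoted above, while the number of Mel bands and the min-max normalization are assumptions.

```python
# Mel spectrogram with n_fft=2048 and hop_length=512 as stated above;
# n_mels and the normalization scheme are assumptions.
import numpy as np
import librosa

def to_mel_db(segment: np.ndarray, sr: int, n_mels: int = 128) -> np.ndarray:
    mel = librosa.feature.melspectrogram(y=segment, sr=sr, n_fft=2048,
                                         hop_length=512, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    # Min-max normalize to [0, 1] for consistency across the dataset.
    return (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8)
```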

These preprocessing steps ensured that the input data was highly representative of
the variations present in human and synthetic speech, thereby facilitating the
effective training of deep learning models. By combining filtering, noise reduction,
segmentation, and feature extraction, the preprocessing pipeline was designed to
provide high-quality inputs for the model, enhancing its ability to learn
discriminative features for deepfake detection.

Model Selection

CNN: The model architecture employed for this project combines Convolutional
Neural Networks (CNN) and Bidirectional Long Short-Term Memory (BiLSTM)
networks to extract both spatial and temporal features from audio signals. The
initial phase utilizes a CNN-based feature extractor to process spectrograms. This
feature extractor comprises multiple convolutional layers with an increasing number
of filters (32, 64, 128, 256) and a kernel size of (5, 5). Each convolutional layer is
followed by Batch Normalization to stabilize the learning process and Leaky ReLU
activation to introduce non-linearity while avoiding issues related to vanishing
gradients. MaxPooling layers are used to progressively reduce the spatial
dimensions of the feature maps, thereby retaining the most salient features while
reducing computational complexity.
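A minimal Keras sketch of this feature extractor is shown below; the input spectrogram shape is an assumption, chosen to be consistent with the 256x8x2 feature map mentioned in Section 4.

```python
# CNN feature extractor: four Conv2D blocks (32/64/128/256 filters, 5x5
# kernels) with BatchNormalization, LeakyReLU and MaxPooling.
import tensorflow as tf
from tensorflow.keras import layers

def build_cnn_extractor(input_shape=(128, 32, 1)) -> tf.keras.Model:
    inputs = layers.Input(shape=input_shape)
    x = inputs
    for filters in (32, 64, 128, 256):
        x = layers.Conv2D(filters, (5, 5), padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU()(x)
        x = layers.MaxPooling2D(pool_size=(2, 2))(x)
    return tf.keras.Model(inputs, x, name="cnn_feature_extractor")
```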

BiLSTM: The CNN-derived features are then flattened and processed through a
series of Bidirectional LSTM layers. The use of Bidirectional LSTMs enables the
model to capture both forward and backward dependencies within temporal
sequences, which is crucial for effectively modeling the sequential nature of human
speech. Three Bidirectional LSTM layers, each with 128 units, are utilized to
analyze the temporal dynamics of the audio segments, capturing the relationships
between different time steps in the audio. Dropout layers are added after each
BiLSTM layer to mitigate overfitting and enhance the model's generalization
capabilities. Dropout is particularly important in this context, as the model is prone
to overfitting due to the complexity of the audio data and the need to capture both
spatial and temporal features.

FULLY CONNECTED: For the final classification, a series of fully connected
layers is employed, including dense layers with 512 and 64 units, utilizing ReLU
activation and L2 regularization to enforce weight constraints and improve model
generalizability. ReLU activation introduces non-linearity, allowing the model to
learn complex decision boundaries, while L2 regularization helps prevent
overfitting by penalizing large weight values. The output layer employs a sigmoid
activation function to predict the probability that the input audio is synthesized,
providing a binary classification output. The model is trained using the Adam
optimizer with an exponential decay learning rate schedule to balance convergence
speed and model stability. The exponential decay schedule gradually reduces the
learning rate during training, allowing the model to fine-tune its weights more
precisely as it approaches convergence.
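The sketch below wires the BiLSTM layers and the fully connected head onto the CNN extractor sketched earlier. The report does not fully specify how the CNN output is reshaped into a sequence, so this sketch treats the time axis of the feature map as the LSTM sequence dimension; the dropout rate, L2 factor, and decay-schedule parameters are likewise assumptions.

```python
# CNN + BiLSTM detector head; the reshaping, dropout rate, L2 factor and
# decay schedule parameters are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_detector(input_shape=(128, 32, 1)) -> tf.keras.Model:
    cnn = build_cnn_extractor(input_shape)  # from the sketch above
    x = cnn.output
    # (freq, time, channels) -> (time, freq * channels) sequence for the LSTMs.
    f, t, c = x.shape[1], x.shape[2], x.shape[3]
    x = layers.Permute((2, 1, 3))(x)
    x = layers.Reshape((t, f * c))(x)
    for _ in range(3):
        x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
        x = layers.Dropout(0.5)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(512, activation="relu",
                     kernel_regularizer=regularizers.l2(1e-4))(x)
    x = layers.Dense(64, activation="relu",
                     kernel_regularizer=regularizers.l2(1e-4))(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = tf.keras.Model(cnn.input, outputs, name="cnn_bilstm_detector")

    # Adam with an exponentially decaying learning rate, starting at 1e-4.
    lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=1e-4, decay_steps=1000, decay_rate=0.96)
    model.compile(optimizer=tf.keras.optimizers.Adam(lr_schedule),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model
```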

This combination of CNN and BiLSTM architectures leverages the strengths of
both approaches: CNNs excel at extracting spatial features from spectrograms,
while BiLSTMs are adept at capturing temporal dependencies, resulting in a robust
approach for deepfake voice detection. The integration of these architectures
allows the model to effectively analyze both the spectral content and temporal
evolution of the audio signal, providing a comprehensive solution to the deepfake
detection problem. By incorporating advanced regularization techniques,
bidirectional LSTMs, and a sophisticated learning rate schedule, the model is
designed to achieve high accuracy and generalizability, making it suitable for real-
world deployment in scenarios where the authenticity of audio communications is
critical.

Training

The dataset, consisting of 25,000 audio files from the GT and DiffWave folders, was
divided into two subsets: 20,000 files for training and 5,000 files for evaluation.
During training, 16,000 files were used as the training set and 4,000 files for
validation. Binary Cross Entropy served as the loss function. Upon observing
overfitting, the Dropout rate was increased to 0.5 to improve generalization. All
preprocessing steps were performed on an Apple M1 Max, while the model was
trained on an Nvidia A100 using Google Colab Runtime. The initial learning rate
was set to 0.0001 with an exponential decay schedule to mitigate overfitting, and a
batch size of 128 was used. The model converged before 30 epochs, achieving a
training accuracy of 1.0000 and a training loss of 0.0112. The validation accuracy
reached 0.9998 with a validation loss of 0.01.
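A sketch of this training setup is given below; the variables X (segment spectrograms) and y (labels), the stratified splits, and the fixed random seed are assumptions introduced for illustration.

```python
# 80/20 train/evaluation split, then 80/20 train/validation split, matching
# the 20,000/5,000 and 16,000/4,000 figures above; X and y are assumed to
# hold the preprocessed spectrograms and their labels.
from sklearn.model_selection import train_test_split

X_train_full, X_eval, y_train_full, y_eval = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.2,
    stratify=y_train_full, random_state=42)

model = build_detector(input_shape=X_train.shape[1:])
history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=30, batch_size=128)
```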
4 TECHNICAL CHALLENGES

Diversity of Length

Variability in audio length across recordings complicates model training. To
address this, audio was segmented into fixed-length segments, enabling the model
to learn from consistent units while generalizing across varying durations.

Noise

Background noise can interfere with feature extraction, making accurate
classification challenging. To mitigate this, noise reduction was applied during
preprocessing to minimize low-amplitude noise, complemented by bandpass
filtering to remove irrelevant frequencies.

Sequential Data

Speech is inherently sequential, with temporal relationships critical for
distinguishing genuine from synthetic audio. Bidirectional LSTMs were employed
to capture both forward and backward temporal dependencies, allowing the model
to analyze the entire temporal context and detect subtle anomalies indicative of
synthetic generation.

Underfitting and Overfitting

Our first model architecture used a Global Average Pooling layer after the
convolutional layers to produce the feature description vector. However, this
caused too much information loss: the model could not learn the training set
distribution, and accuracy did not improve over training. We instead flattened the
256x8x2 output tensor of the last convolutional layer, after which the model
converged well. During training, overfitting occurred; we used Dropout layers and
L2 regularization to prevent it.

5 RESULTS
Since we are looking for a model that can detect synthetic audio with high
accuracy, we focus primarily on the model's accuracy. We also consider other
popular metrics for the binary classification task, such as precision, recall, and the ROC curve.
LibriSeVoc

The model, trained on the Ground Truth (GT) and Diffwave folders of the
LibriSeVoc dataset, exhibited strong performance during evaluation. Specifically,
5,000 examples were reserved for evaluation after training, achieving an accuracy
of 97%.
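The evaluation and the additional metrics mentioned above can be computed as in the sketch below; the variable names follow the training sketch, and the 0.5 decision threshold is an assumption.

```python
# Accuracy, precision, recall, ROC AUC and the confusion matrix on the
# held-out evaluation split (0 = genuine, 1 = synthetic).
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, roc_auc_score)

probs = model.predict(X_eval).ravel()
preds = (probs >= 0.5).astype(int)  # assumed decision threshold

print("accuracy :", accuracy_score(y_eval, preds))
print("precision:", precision_score(y_eval, preds))
print("recall   :", recall_score(y_eval, preds))
print("roc auc  :", roc_auc_score(y_eval, probs))
print(confusion_matrix(y_eval, preds))  # exposes the per-class error rates
```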

During evaluation, where 0 represents genuine audio and 1 represents synthetic
audio, the model exhibited a higher error rate in detecting synthetic samples,
suggesting a bias towards misclassifying fake audio as genuine. This limitation
highlights the need for further refinement to enhance detection capabilities and
address this shortcoming.
6 DISCUSSION
Pros: The proposed deepfake voice detection system demonstrates robust
performance through the integration of spatial and temporal feature extraction
using CNNs and BiLSTMs. The CNN component effectively captures frequency
patterns from spectrograms, while the BiLSTM layers model the sequential
dependencies of audio signals, which are essential for distinguishing between
genuine and synthesized speech. The use of Bidirectional LSTMs enhances the
model's ability to capture both forward and backward temporal relationships,
making it particularly adept at detecting subtle inconsistencies that are
characteristic of deepfake audio. Additionally, the advanced preprocessing steps,
including bandpass filtering and noise reduction, improve input quality, thereby
enhancing overall model accuracy.

Cons: Despite its strengths, the model has certain limitations. The reliance on
fixed-length segmentation may reduce its adaptability to audio recordings with
highly variable lengths or unexpected content. Furthermore, the complexity of the
CNN-BiLSTM architecture requires substantial computational resources, which
could limit its scalability and applicability in real-time scenarios. The model's
performance may also degrade in highly noisy environments, where residual noise
persists despite preprocessing. Finally, while Bidirectional LSTMs are effective for
modeling temporal dependencies, they are computationally expensive, affecting the
feasibility of deploying the model on resource-constrained devices.

Future Research

Future research could focus on improving the model's scalability and efficiency,
particularly for real-time deepfake detection. Exploring more lightweight
architectures, such as attention-based models or Transformer networks, could
reduce computational overhead while maintaining accuracy. Additionally,
incorporating adaptive segmentation strategies would allow the model to handle
variable-length audio more effectively, enhancing its applicability across diverse use
cases. Further efforts to enhance noise robustness could include integrating
advanced denoising techniques, such as deep learning-based noise suppression.
Expanding the dataset to include more diverse and challenging deepfake samples,
especially those generated by state-of-the-art generative models, would also
contribute to enhancing the model's generalizability and robustness against
emerging deepfake technologies.
7 CONCLUSION
This study presented a deepfake voice detection system using a CNN-BiLSTM
hybrid architecture for effective spatial and temporal feature extraction. The model
achieved an impressive accuracy of 97%, demonstrating its efficacy in
distinguishing between genuine and synthetic audio.

However, the reliance on fixed-length segmentation and the computational
demands of the architecture present challenges for real-time deployment. Future
work should focus on improving efficiency, scalability, and robustness, particularly
in real-time scenarios, and on exploring more lightweight architectures,
contributing to the development of reliable safeguards against synthetic audio threats.
