Final Intro AIReport
Tran Vu Manh Duc, Do Tuan Nam, Trinh Ha Phuong, Nguyen Thanh Lam, Pham Chi Bang
Abstract
The rapid advancement of artificial intelligence has facilitated the generation of highly realistic synthetic
voices, which present substantial risks to cybersecurity, communication integrity, and public trust. This
study introduces a robust detection system for synthetic audio (deepfake voices) by integrating
Convolutional Neural Networks (CNNs) and Bidirectional Long Short-Term Memory (BiLSTM) networks.
The system utilizes spectrogram-based representations of audio signals to effectively extract both spatial
and temporal features, thereby enabling a comprehensive assessment of voice authenticity. A carefully
curated dataset, LibriSeVoc, comprising genuine and DiffWave-generated synthetic voices, was employed
to train the model. Advanced preprocessing techniques, such as noise reduction, bandpass filtering, and
Mel spectrogram transformation, were employed to optimize input data quality. The model achieved a high
accuracy of 97% on the evaluation dataset, demonstrating strong generalizability across diverse voice
samples. Despite its computational complexity, the proposed system offers a practical tool for mitigating the
risks associated with synthetic audio misuse. All our work is available at:
ronanhansel/deepfake-audio-detector
1 INTRODUCTION
The rise of artificial intelligence has enabled the creation of highly realistic
synthetic audio, known as deepfake voice. While these technologies have valid uses
in entertainment and virtual assistants, they also pose serious threats, such as
deception, impersonation, and fraud. These capabilities raise concerns in sectors
like finance, cybersecurity, and public safety. The ability to create convincing
synthetic voices undermines audio reliability, emphasizing the need for effective
detection methods.
Data Collection
The dataset comprises the ground truth (GT) and diffwave folders of LibriSeVoc,
totaling approximately 25,000 audio files categorized as genuine or synthetic.
LibriSeVoc is derived from LibriSpeech, a corpus widely used in both academic and
practical contexts, and is designed to balance male and female voices, ensuring
fair gender representation. This balance is critical for training unbiased models.
Additionally, the dataset includes diverse accents, enhancing the model's
robustness to different speech patterns. The GT folder contains real voice
recordings, while the diffwave folder includes synthetic audio generated using
DiffWave technology. DiffWave leverages diffusion-based generative techniques
that allow for high-quality, natural-sounding speech synthesis, making it an ideal
benchmark for testing the robustness of detection models.
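For illustration, the sketch below shows one way the two folders might be gathered and labeled; the local paths, folder names, and .wav extension are assumptions based on the description above, not details taken from the report.

```python
from pathlib import Path

# Assumed local layout of the two LibriSeVoc folders used in this work:
# genuine speech under "gt/" and DiffWave-generated speech under "diffwave/".
DATA_ROOT = Path("LibriSeVoc")

def collect_files(root: Path) -> list[tuple[Path, int]]:
    """Return (filepath, label) pairs, where 0 = genuine and 1 = synthetic."""
    samples = [(wav, 0) for wav in sorted((root / "gt").glob("*.wav"))]
    samples += [(wav, 1) for wav in sorted((root / "diffwave").glob("*.wav"))]
    return samples

if __name__ == "__main__":
    files = collect_files(DATA_ROOT)
    print(f"Collected {len(files)} labeled audio files")
```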
Preprocessing
The preprocessing pipeline involved several key steps to optimize the dataset for
training. These included bandpass filtering, noise reduction, dynamic range
adjustment, and segment extraction to prepare the audio for feature extraction.
Bandpass Filtering: A bandpass filter with a passband of 250 Hz to 4000 Hz was
applied to the audio signals to retain the most critical speech components while
filtering out irrelevant background noise. By focusing on this frequency range, we
targeted the key elements of human speech, ensuring that the resulting features
accurately represent the spoken content.
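As an illustration, here is a minimal SciPy sketch of such a filter; the Butterworth design and filter order are assumptions, since the report specifies only the 250-4000 Hz passband.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass(audio: np.ndarray, sr: int, low: float = 250.0,
             high: float = 4000.0, order: int = 5) -> np.ndarray:
    """Keep only the 250-4000 Hz band that carries most speech energy."""
    sos = butter(order, [low, high], btype="bandpass", fs=sr, output="sos")
    # Zero-phase filtering avoids shifting the speech waveform in time.
    return sosfiltfilt(sos, audio)
```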
Noise Reduction: To further enhance the quality of the audio, a custom noise
reduction process was applied to reduce low-amplitude noise. This process
involved analyzing the decibel level of the audio signals and selectively reducing the
amplitude of segments that fell below a specified threshold. This dynamic
adjustment ensured that noise was minimized without compromising the integrity
of the primary speech signal.
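The sketch below follows the decibel-threshold idea described above; the threshold, attenuation factor, and frame length are illustrative assumptions rather than the report's exact values.

```python
import numpy as np

def attenuate_quiet_segments(audio: np.ndarray, threshold_db: float = -40.0,
                             attenuation: float = 0.1,
                             frame_length: int = 2048) -> np.ndarray:
    """Scale down non-overlapping frames whose RMS level (in dBFS) falls
    below threshold_db, leaving louder speech frames untouched."""
    out = audio.astype(np.float32)
    for start in range(0, len(out), frame_length):
        frame = out[start:start + frame_length]
        rms = np.sqrt(np.mean(frame ** 2) + 1e-12)
        if 20.0 * np.log10(rms) < threshold_db:
            out[start:start + frame_length] = frame * attenuation
    return out
```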
Segmentation: The audio signals were segmented into smaller chunks to facilitate
more granular analysis. Each audio file was divided into 30 segments of 1 second
each, with overlapping segments to ensure coverage of the entire recording. This
segmentation strategy allowed the model to learn from shorter, more manageable
units of data, which is particularly beneficial for capturing temporal dynamics
within speech. The segments were then converted into Mel spectrograms, a type of
time-frequency representation that provides a detailed view of the energy
distribution across frequency bands. The Mel spectrograms were then converted to
the decibel scale to highlight variations in energy levels, making it easier for the
model to identify distinguishing features.
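The following sketch illustrates the segmentation and Mel-spectrogram step, assuming librosa, 128 Mel bands, and evenly spaced window starts; the report does not specify these parameters.

```python
import numpy as np
import librosa

def segment_to_melspecs(audio: np.ndarray, sr: int, n_segments: int = 30,
                        segment_seconds: float = 1.0,
                        n_mels: int = 128) -> np.ndarray:
    """Cut the clip into n_segments overlapping 1-second windows and return
    a stack of dB-scaled Mel spectrograms, one per window."""
    seg_len = int(segment_seconds * sr)
    if len(audio) < seg_len:
        audio = np.pad(audio, (0, seg_len - len(audio)))
    # Evenly spaced (and, for typical utterances, overlapping) start positions
    # that together cover the whole recording.
    starts = np.linspace(0, len(audio) - seg_len, n_segments).astype(int)
    specs = []
    for s in starts:
        window = audio[s:s + seg_len]
        mel = librosa.feature.melspectrogram(y=window, sr=sr, n_mels=n_mels)
        specs.append(librosa.power_to_db(mel, ref=np.max))
    return np.stack(specs)  # shape: (n_segments, n_mels, time_frames)
```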
These preprocessing steps ensured that the input data was highly representative of
the variations present in human and synthetic speech, thereby facilitating the
effective training of deep learning models. By combining filtering, noise reduction,
segmentation, and feature extraction, the preprocessing pipeline was designed to
provide high-quality inputs for the model, enhancing its ability to learn
discriminative features for deepfake detection.
Model Selection
CNN: The model architecture employed for this project combines Convolutional
Neural Networks (CNN) and Bidirectional Long Short-Term Memory (BiLSTM)
networks to extract both spatial and temporal features from audio signals. The
initial phase utilizes a CNN-based feature extractor to process spectrograms. This
feature extractor comprises multiple convolutional layers with an increasing number
of filters (32, 64, 128, and 256) and kernel sizes of (5, 5). Each convolutional layer is
followed by Batch Normalization to stabilize the learning process and Leaky ReLU
activation to introduce non-linearity while avoiding issues related to vanishing
gradients. MaxPooling layers are used to progressively reduce the spatial
dimensions of the feature maps, thereby retaining the most salient features while
reducing computational complexity.
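A minimal Keras sketch of this feature extractor is shown below. The input shape of (128, 32, 1), i.e. 128 Mel bands by 32 frames for a one-second segment at 16 kHz, is an assumption chosen so that the final feature maps match the 256x8x2 tensor mentioned later in the report.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_extractor(input_shape=(128, 32, 1)) -> tf.keras.Model:
    """Four Conv2D blocks with 32/64/128/256 filters and (5, 5) kernels,
    each followed by BatchNorm, LeakyReLU, and 2x2 MaxPooling."""
    inputs = layers.Input(shape=input_shape)
    x = inputs
    for filters in (32, 64, 128, 256):
        x = layers.Conv2D(filters, (5, 5), padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU()(x)
        x = layers.MaxPooling2D(pool_size=(2, 2))(x)
    # With the assumed input shape, the output here is (batch, 8, 2, 256).
    return models.Model(inputs, x, name="cnn_feature_extractor")
```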
BiLSTM: The CNN-derived features are then flattened and processed through a
series of Bidirectional LSTM layers. The use of Bidirectional LSTMs enables the
model to capture both forward and backward dependencies within temporal
sequences, which is crucial for effectively modeling the sequential nature of human
speech. Three Bidirectional LSTM layers, each with 128 units, are utilized to
analyze the temporal dynamics of the audio segments, capturing the relationships
between different time steps in the audio. Dropout layers are added after each
BiLSTM layer to mitigate overfitting and enhance the model's generalization
capabilities. Dropout is particularly important in this context, as the model is prone
to overfitting due to the complexity of the audio data and the need to capture both
spatial and temporal features.
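Below is a sketch of the recurrent head on top of the CNN features. Reshaping the feature maps into 8 time steps of 512 values is one possible reading of the flattening step, and the final sigmoid unit is an assumption about the output layer.

```python
from tensorflow.keras import layers, models

def build_cnn_bilstm(cnn, dropout_rate: float = 0.5):
    """Stack three Bidirectional LSTM layers (128 units each) with Dropout
    on top of the CNN feature maps, ending in a sigmoid genuine/synthetic score."""
    x = cnn.output                                  # (batch, 8, 2, 256)
    t, f, c = x.shape[1], x.shape[2], x.shape[3]
    x = layers.Reshape((t, f * c))(x)               # (batch, 8, 512): 8 time steps
    for i in range(3):
        last = (i == 2)
        x = layers.Bidirectional(layers.LSTM(128, return_sequences=not last))(x)
        x = layers.Dropout(dropout_rate)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    return models.Model(cnn.input, outputs, name="cnn_bilstm_detector")
```

Chaining the two builders, `build_cnn_bilstm(build_cnn_extractor())`, would yield the full detector used in the training sketch below.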
Training
The dataset, consisting of 25,000 audio files from the GT and Diffwave folders, was
divided into two subsets: 20,000 files for training and 5,000 files for evaluation.
During training, 16,000 files were used as the training set and 4,000 files for
validation. Binary Cross Entropy served as the loss function. Upon observing
overfitting, the Dropout rate was increased to 0.5 to improve generalization. All
preprocessing steps were performed on an Apple M1 Max, while the model was
trained on an Nvidia A100 using Google Colab Runtime. The initial learning rate
was set to 0.0001 with an exponential decay schedule to mitigate overfitting, and a
batch size of 128 was used. The model converged before 30 epochs, achieving a
training accuracy of 1.0000 and a training loss of 0.0112. The validation accuracy
reached 0.9998 with a validation loss of 0.01.
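A sketch of this training configuration follows; the report specifies the initial learning rate, loss, batch size, and epoch budget, while the choice of Adam and the exact decay-schedule parameters below are assumptions.

```python
import tensorflow as tf

def compile_and_train(model, train_ds, val_ds, epochs: int = 30):
    """Binary cross-entropy with an exponentially decaying learning rate,
    starting from 1e-4; train_ds and val_ds are tf.data pipelines already
    batched to 128 segments."""
    schedule = tf.keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=1e-4,
        decay_steps=1000,   # assumed: not specified in the report
        decay_rate=0.96)    # assumed
    model.compile(optimizer=tf.keras.optimizers.Adam(schedule),
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model.fit(train_ds, validation_data=val_ds, epochs=epochs)
```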
4 TECHNICAL CHALLENGES
The main technical challenges encountered were the diversity of audio lengths, noise
in the recordings, and the sequential nature of the data.
Our first model architecture used a Global Average Pooling layer to produce the
feature description vector after the convolutional layers. However, this caused too
much information loss: the model could not learn the training set distribution, and
accuracy did not improve during training. We instead flattened the 256x8x2 output
tensor of the last convolutional layer, after which the model converged well.
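The sketch below contrasts the two reductions on a dummy tensor shaped like the final feature maps: global average pooling keeps only one value per channel, whereas flattening preserves every activation.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Dummy batch shaped like the last convolutional output: (batch, 8, 2, 256).
features = tf.constant(np.random.rand(1, 8, 2, 256), dtype=tf.float32)

pooled = layers.GlobalAveragePooling2D()(features)  # (1, 256): one value per channel
flat = layers.Flatten()(features)                   # (1, 4096): all 8*2*256 activations
print(pooled.shape, flat.shape)
```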
During training, overfitting also occurred; we used Dropout layers and L2
regularization to mitigate it.
5 RESULTS
Since we seek a model that can detect synthetic audio with high accuracy, accuracy
is our primary metric. We also report other popular metrics for the binary
classification task, such as precision, recall, and the ROC curve.
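The sketch below shows how these metrics can be computed from the model's sigmoid outputs with scikit-learn; the 0.5 decision threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)

def report_metrics(y_true: np.ndarray, y_prob: np.ndarray,
                   threshold: float = 0.5) -> dict:
    """Binary-classification metrics from ground-truth labels and predicted
    probabilities of the 'synthetic' class; ROC AUC summarizes the ROC curve."""
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, y_prob),
    }
```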
LibriSeVoc
The model, trained on the Ground Truth (GT) and Diffwave folders of the
LibriSeVoc dataset, exhibited strong performance during evaluation. Specifically,
5,000 examples were reserved for evaluation after training, achieving an accuracy
of 97%.
Cons: Despite its strengths, the model has certain limitations. The reliance on
fixed-length segmentation may reduce its adaptability to audio recordings with
highly variable lengths or unexpected content. Furthermore, the complexity of the
CNN-BiLSTM architecture requires substantial computational resources, which
could limit its scalability and applicability in real-time scenarios. The model's
performance may also degrade in highly noisy environments, where residual noise
persists despite preprocessing. Finally, while Bidirectional LSTMs are effective for
modeling temporal dependencies, they are computationally expensive, affecting the
feasibility of deploying the model on resource-constrained devices.
6 FUTURE RESEARCH
Future research could focus on improving the model's scalability and efficiency,
particularly for real-time deepfake detection. Exploring more lightweight
architectures, such as attention-based models or Transformer networks, could
reduce computational overhead while maintaining accuracy. Additionally,
incorporating adaptive segmentation strategies would allow the model to handle
variable-length audio more effectively, enhancing its applicability across diverse use
cases. Further efforts to enhance noise robustness could include integrating
advanced denoising techniques, such as deep learning-based noise suppression.
Expanding the dataset to include more diverse and challenging deepfake samples,
especially those generated by state-of-the-art generative models, would also
contribute to enhancing the model's generalizability and robustness against
emerging deepfake technologies.
7 CONCLUSION
This study presented a deepfake voice detection system using a CNN-BiLSTM
hybrid architecture for effective spatial and temporal feature extraction. The model
achieved an impressive accuracy of 97%, demonstrating its efficacy in
distinguishing between genuine and synthetic audio.