CNN-BiLSTM

The document outlines the implementation of a CNN-BiLSTM model for detecting deep fake audio, emphasizing its importance for security and media integrity. It utilizes the ASVspoof 2019 dataset, which includes over 121,000 audio samples of both genuine and spoofed speech, and details the preprocessing steps and model architecture. The approach combines CNNs for feature extraction and BiLSTM for capturing temporal dependencies, ultimately classifying audio as real or fake.

CNN-BILSTM IMPLEMENTATION FOR DEEP FAKE AUDIO DETECTION

By Koukuntla Pranav
MT24MCS022
INTRODUCTION

Detecting deep fake audio using a CNN-BiLSTM model.
Deep fake audio detection is crucial for security, media integrity, and fraud prevention.
A hybrid CNN-BiLSTM model is used for feature extraction and sequence modeling.
DATASET

ASVspoof 2019 Dataset


Contains both genuine and spoofed (fake) speech samples.
Two classes: Real (genuine) and Fake (spoofed).
File format: WAV files.
Audio clips vary in duration.
Multiple subsets: training, development, and evaluation.
Total clips: over 121,000 audio samples across the training, development, and evaluation splits.
DATASET

Audio samples are up to 7 seconds long.
Spoofing techniques include TTS (text-to-speech), VC (voice conversion), and replay attacks.
Contains metadata such as speaker ID and attack type (see the protocol-parsing sketch after this list).
Training: ~25,380 clips
Development: ~24,844 clips
Evaluation: ~71,311 clips
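
As a rough illustration of how these labels might be read, the sketch below parses an ASVspoof 2019 protocol (trial list) file into (audio path, label) pairs. The column layout, the audio_dir argument, and the .wav extension are assumptions for illustration, not details taken from this work.

import os

def load_protocol(protocol_path, audio_dir):
    """Parse an ASVspoof 2019 protocol file into (wav_path, label) pairs.

    Each line is assumed to look like:
        SPEAKER_ID FILE_NAME SYSTEM_ID - KEY
    where KEY is 'bonafide' for genuine speech and 'spoof' for fake speech.
    """
    samples = []
    with open(protocol_path) as f:
        for line in f:
            parts = line.strip().split()
            if len(parts) < 5:
                continue  # skip malformed or empty lines
            file_name, key = parts[1], parts[4]
            label = 0 if key == "bonafide" else 1   # 0 = real, 1 = fake
            samples.append((os.path.join(audio_dir, file_name + ".wav"), label))
    return samples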
PREPROCESSING

Using Librosa to load audio files.
Standardizing all audio to a 16 kHz sample rate.
Applying noise reduction filters to clean the audio.
Mel-spectrogram extraction: converting raw waveforms into spectrogram images for CNN processing.
Scaling data between 0 and 1 for stable training (a sketch of these steps follows this list).
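
A minimal sketch of this pipeline, assuming librosa and NumPy are available; the mel-band count, fixed clip length, and the placement of the noise-reduction step are illustrative assumptions rather than the exact settings used here.

import librosa
import numpy as np

TARGET_SR = 16000        # standardized sample rate (16 kHz)
N_MELS = 64              # assumed number of mel bands
MAX_SECONDS = 7          # clips are at most about 7 s long

def audio_to_melspec(path):
    # Load and resample to 16 kHz mono with Librosa.
    y, sr = librosa.load(path, sr=TARGET_SR, mono=True)

    # Pad or truncate so every clip yields a spectrogram of the same shape.
    max_len = TARGET_SR * MAX_SECONDS
    y = np.pad(y, (0, max(0, max_len - len(y))))[:max_len]

    # A noise-reduction filter would be applied here (not shown).

    # Mel-spectrogram in dB, shape (n_mels, time frames).
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=N_MELS)
    mel_db = librosa.power_to_db(mel, ref=np.max)

    # Min-max scale to [0, 1] for stable training.
    return ((mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8)).astype(np.float32)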
PREPROCESSING

The mel-spectrogram mimics how humans perceive sound frequencies.
The mel-spectrogram is converted to a (Channels, Time, Frequency) tensor (see the reshaping sketch after this list).
Visualization: real vs. fake spectrograms show clear differences in frequency patterns.
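
Continuing the preprocessing sketch above, the 2D mel-spectrogram (frequency x time) can be rearranged into the (Channels, Time, Frequency) layout mentioned on this slide; the single "image" channel, the axis order, and the example file name are assumptions.

import torch

# mel_scaled: NumPy array of shape (n_mels, time) = (Frequency, Time).
mel_scaled = audio_to_melspec("example.wav")      # hypothetical input file
spec = torch.from_numpy(mel_scaled)

# Swap axes and add a channel dimension: (Channels, Time, Frequency).
spec = spec.transpose(0, 1).unsqueeze(0)          # shape (1, time, n_mels)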
CNN-BILSTM MODEL
ARCHITECTURE

Convolutional Neural Networks (CNNs) extract spatial features from spectrograms.
A Bidirectional LSTM (BiLSTM) captures temporal dependencies in speech.
Fully connected layers and a sigmoid activation perform binary classification (real vs. fake); a sketch follows this list.
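
A possible PyTorch sketch of this architecture; the number of convolutional blocks, channel counts, and the BiLSTM hidden size are illustrative assumptions, not the exact configuration reported in this work.

import torch
import torch.nn as nn

class CNNBiLSTM(nn.Module):
    def __init__(self, n_mels=64, lstm_hidden=128):
        super().__init__()
        # CNN front end: extracts local time-frequency (spatial) features.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # BiLSTM over the downsampled time axis captures temporal dependencies.
        self.bilstm = nn.LSTM(
            input_size=32 * (n_mels // 4),
            hidden_size=lstm_hidden,
            batch_first=True,
            bidirectional=True,
        )
        # Fully connected head with sigmoid for the real/fake probability.
        self.fc = nn.Sequential(nn.Linear(2 * lstm_hidden, 1), nn.Sigmoid())

    def forward(self, x):
        # x: (batch, 1, time, n_mels)
        feats = self.cnn(x)
        b, c, t, f = feats.shape
        feats = feats.permute(0, 2, 1, 3).reshape(b, t, c * f)  # (batch, time', features)
        out, _ = self.bilstm(feats)
        return self.fc(out[:, -1, :]).squeeze(1)    # probability that the clip is fake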
IMPLEMENTATION

Apply CNN layers for feature extraction.
Pass extracted features to BiLSTM layers.
Classify using fully connected layers with a sigmoid activation.
Use Binary Cross Entropy (BCE) loss for training (a training-loop sketch follows this list).
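
A condensed training-loop sketch following these steps; the DataLoader that yields (spectrogram, label) batches, the Adam optimizer, and the learning rate are assumptions, while the BCE loss and sigmoid output follow directly from the slide.

import torch
import torch.nn as nn

def train(model, train_loader, epochs=10, lr=1e-3, device="cpu"):
    model = model.to(device)
    criterion = nn.BCELoss()                        # binary cross entropy on sigmoid outputs
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    for epoch in range(epochs):
        model.train()
        total_loss = 0.0
        for specs, labels in train_loader:          # specs: (batch, 1, time, n_mels)
            specs = specs.to(device)
            labels = labels.float().to(device)      # 0 = real, 1 = fake

            optimizer.zero_grad()
            probs = model(specs)                    # (batch,) probabilities
            loss = criterion(probs, labels)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        print(f"epoch {epoch + 1}: BCE loss = {total_loss / len(train_loader):.4f}")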
RESULT
REFERENCES
Todisco, M., et al. (2019). ASVspoof 2019: Future Horizons in Spoofed and Fake Audio Detection. Interspeech 2019.
Wani, T. M., Qadri, S. A. A., Comminiello, D., & Amerini, I. (2024). Detecting Audio Deepfakes: Integrating CNN and BiLSTM with Multi-Feature Concatenation. Proceedings of the 2024 ACM Workshop on Information Hiding and Multimedia Security, 271–276.
THANK YOU
