The document outlines the implementation of a CNN-BiLSTM model for detecting deep fake audio, emphasizing its importance for security and media integrity. It utilizes the ASVspoof 2019 dataset, which includes over 121,000 audio samples of both genuine and spoofed speech, and details the preprocessing steps and model architecture. The approach combines CNNs for feature extraction and BiLSTM for capturing temporal dependencies, ultimately classifying audio as real or fake.
CNN-BILSTM IMPLEMENTATION FOR DEEP FAKE AUDIO DETECTION
By Koukuntla Pranav, MT24MCS022

INTRODUCTION
Detecting deep fake audio using a CNN-BiLSTM model. Deep fake audio detection is crucial for security, media integrity, and fraud prevention. The approach is a hybrid CNN-BiLSTM model: CNN layers for feature extraction and BiLSTM layers for sequence modeling.

DATASET
ASVspoof 2019 Dataset
- Contains both genuine and spoofed (fake) speech samples.
- Two classes: Real (genuine) and Fake (spoofed).
- File format: WAV; clip durations vary, up to 7 seconds.
- Three subsets: training (~25,380 clips), development (~24,844 clips), and evaluation (~71,311 clips), for over 121,000 audio samples in total.
- Spoofing techniques include TTS (text-to-speech), VC (voice conversion), and replay attacks.
- Metadata includes speaker ID and attack type.

PREPROCESSING
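Before feature extraction, each clip's label and attack type come from the dataset's protocol files. A minimal parsing sketch, assuming the five-column whitespace-separated format of the ASVspoof 2019 LA protocol files (speaker ID, utterance ID, two system/attack fields, and a bonafide/spoof key):

```python
def parse_protocol_line(line):
    """Parse one protocol entry into a labeled record.

    Assumed format: 'SPEAKER UTTERANCE - ATTACK_ID KEY', where '-' marks
    an empty field and KEY is 'bonafide' or 'spoof'.
    """
    speaker, utt, _, attack, key = line.split()
    return {
        "speaker": speaker,
        "utt": utt,
        "attack": None if attack == "-" else attack,  # e.g. A01..A19 for spoofed clips
        "label": 0 if key == "bonafide" else 1,       # 0 = real, 1 = fake
    }

rec = parse_protocol_line("LA_0079 LA_T_1138215 - A01 spoof")
print(rec)  # {'speaker': 'LA_0079', 'utt': 'LA_T_1138215', 'attack': 'A01', 'label': 1}
```

The numeric label (0/1) feeds directly into the binary classification target used later with BCE loss.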
- Using Librosa to load audio files.
- Standardizing all audio to a 16 kHz sample rate.
- Applying noise-reduction filters to clean the audio.
- Mel-spectrogram extraction: converting raw waveforms into spectrograms for CNN processing.
- Scaling data between 0 and 1 for stable training.

PREPROCESSING
The Mel-spectrogram mimics how humans perceive sound frequencies. The Mel-spectrogram is converted to a (Channels, Time, Frequency) tensor. Visualization: real vs. fake spectrograms show clear differences in frequency patterns.

CNN-BILSTM MODEL ARCHITECTURE
- Convolutional Neural Networks (CNNs) extract spatial features from spectrograms.
- Bidirectional LSTM (BiLSTM) captures temporal dependencies in speech.
- Fully connected layers with sigmoid activation perform binary classification (real vs. fake).

IMPLEMENTATION
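One way to realize this CNN-to-BiLSTM-to-sigmoid pipeline is the PyTorch sketch below. The layer sizes (16/32 conv channels, hidden size 128) are illustrative assumptions, not the configuration actually used:

```python
import torch
import torch.nn as nn

class CNNBiLSTM(nn.Module):
    """CNN feature extractor -> BiLSTM -> fully connected + sigmoid."""

    def __init__(self, n_mels=64, hidden=128):
        super().__init__()
        # CNN: extracts local time-frequency features from the spectrogram
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # BiLSTM: models temporal dependencies across the time axis
        self.lstm = nn.LSTM(input_size=32 * (n_mels // 4), hidden_size=hidden,
                            batch_first=True, bidirectional=True)
        # FC + sigmoid: binary real/fake score
        self.fc = nn.Linear(2 * hidden, 1)

    def forward(self, x):                     # x: (batch, 1, n_mels, time)
        f = self.cnn(x)                       # (batch, 32, n_mels/4, time/4)
        f = f.permute(0, 3, 1, 2).flatten(2)  # (batch, time/4, features)
        out, _ = self.lstm(f)
        return torch.sigmoid(self.fc(out[:, -1]))  # score from last timestep

model = CNNBiLSTM()
score = model(torch.randn(2, 1, 64, 200))  # two spectrograms, 200 time frames
print(score.shape)  # torch.Size([2, 1])
```

Training would pair these sigmoid outputs with `nn.BCELoss()` against the 0/1 labels, matching the BCE loss described in the implementation steps.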
- Apply CNN layers for feature extraction.
- Pass extracted features to BiLSTM layers.
- Classify using fully connected layers with a sigmoid activation.
- Use Binary Cross Entropy (BCE) loss for training.

RESULT

REFERENCES
Todisco, M., et al. (2019). ASVspoof 2019: Future Horizons in Spoofed and Fake Audio Detection. Interspeech 2019. Link
Wani, T. M., Qadri, S. A. A., Comminiello, D., & Amerini, I. (2024). Detecting Audio Deepfakes: Integrating CNN and BiLSTM with Multi-Feature Concatenation. Proceedings of the 2024 ACM Workshop on Information Hiding and Multimedia Security, 271–276. Link

THANK YOU