2024 Second International Conference on Emerging Trends in Information Technology and Engineering (ICETITE)

A Lightweight CNN for Efficient Deepfake Detection of Low-resolution Images in Frequency Domain
2024 Second International Conference on Emerging Trends in Information Technology and Engineering (ICETITE) | 979-8-3503-2820-2/24/$31.00 ©2024 IEEE | DOI: 10.1109/IC-ETITE58242.2024.10493406

Sabareshwar D
School of Computer Science and Engineering
Vellore Institute of Technology
Vellore, Tamilnadu
[email protected]

Raghul S
School of Computer Science and Engineering
Vellore Institute of Technology
Vellore, Tamilnadu
[email protected]

Shawn Ravi
School of Computer Science and Engineering
Vellore Institute of Technology
Vellore, Tamilnadu
[email protected]

Dr. Varalakshmi M
School of Computer Science and Engineering
Vellore Institute of Technology
Vellore, Tamilnadu
[email protected]

Peer Mohamed P U
CapitalOne
San Francisco, USA
[email protected]

Abstract— Deepfake detection in the frequency domain is extensively explored for two reasons: artifacts generated by the up-sampling layer of Generative Adversarial Networks are more prominent in the frequency domain, and some image processing tasks are simpler and more efficient in this domain. Traditional convolutional neural networks outperform machine learning models because they can also detect deepfakes in low-resolution images. However, these networks involve numerous trainable parameters, resulting in high computation cost, long training time and high memory requirements. This paper presents a lightweight Convolutional Neural Network (CNN) architecture that is optimized for efficient deepfake detection of low-resolution images in the frequency domain. The proposed model achieves almost the same accuracy as the traditional network, but with the parameter requirement reduced by 92%. This renders the model suitable for use in memory-constrained and power-constrained environments such as mobile devices and other embedded systems.

Keywords—Deepfake detection, frequency domain, pixel domain, lightweight CNN, depthwise separable convolution

I. INTRODUCTION
The emergence of Generative Adversarial Networks (GANs), which can generate realistic synthetic data, is considered a milestone in the advancement of machine learning and deep learning models, for their ability to surpass the creative thinking of humans. Interesting applications of GANs, such as generating new patterns for the fashion industry, composing music and generating images from text, account for their widespread popularity. Deepfakes are digitally manipulated visual or audio content in which the likeness of one person is replaced with that of another. Deepfake technology leverages high-end systems and complex machine learning algorithms to create convincing fake content that is indistinguishable from real content. However, the rapid growth of this technology has raised concerns regarding the potential for misuse and exploitation. The technology is being used to impersonate celebrities and spread fake news, and to commit financial fraud, election manipulation, identity theft, scams and hoaxes. The tremendous expansion of social media has amplified the adverse effects of deepfake-based cybercrimes.

The potential misuse of deepfakes poses a growing threat to democracy, privacy and social security. Researchers, domain experts and policy makers work together to develop effective solutions to combat deepfake threats. Deepfake detection is one such solution that aims to develop techniques and tools to identify and distinguish deepfakes from real videos or images. This is important because deepfakes can be used to spread false information, demean individuals, and mislead public opinion.

Detecting deepfakes requires a combination of technical expertise, knowledge of machine learning algorithms, and access to large datasets of both real and fake videos or images. Because a few image processing operations are simpler and more efficient when applied in the frequency domain, some studies focus on performing deepfake detection in the frequency domain. Fig. 1 depicts the difference between authentic images and fake images generated using deep neural networks, and Fig. 2 shows the representation of real and fake images in different spectrums. The power spectrum is a 2D array of real-valued numbers that represents the strength of the frequency components. The frequency domain representation of a signal carries information about the signal's amplitude and phase at each frequency.

Fig. 1 Differences between Real and Fake Images (a) Real (b) Fake
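To make the notion of a power spectrum concrete, the following minimal sketch computes it for a tiny grayscale image with a direct 2D discrete Fourier transform. It is only an illustration: the function name, the 4x4 test images and the use of a naive pure-Python DFT are assumptions made here, not part of the paper's pipeline.

```python
import cmath

def dft2_power_spectrum(image):
    """Return the 2D power spectrum |F(u, v)|^2 of a small grayscale image.

    `image` is a list of rows; a naive O(N^4) DFT is used for clarity only.
    """
    rows, cols = len(image), len(image[0])
    spectrum = []
    for u in range(rows):
        out_row = []
        for v in range(cols):
            total = 0j
            for x in range(rows):
                for y in range(cols):
                    angle = -2 * cmath.pi * (u * x / rows + v * y / cols)
                    total += image[x][y] * cmath.exp(1j * angle)
            # Power keeps the amplitude information and discards the phase.
            out_row.append(abs(total) ** 2)
        spectrum.append(out_row)
    return spectrum

# Hypothetical inputs: a flat image and a checkerboard (highest spatial frequency).
flat = [[1.0] * 4 for _ in range(4)]
checker = [[float((x + y) % 2) for y in range(4)] for x in range(4)]

flat_ps = dft2_power_spectrum(flat)
checker_ps = dft2_power_spectrum(checker)
```

For the flat image all energy sits in the zero-frequency entry `flat_ps[0][0]`, whereas the checkerboard also has a strong component at `checker_ps[2][2]`, the highest frequency representable on a 4x4 grid.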


Authorized licensed use limited to: K K Wagh Inst of Engg Education and Research. Downloaded on August 21,2024 at 16:22:36 UTC from IEEE Xplore. Restrictions apply.
Fig. 2 Real and Fake Images in Different Spectrums

Existing literature demonstrates that in the frequency domain, traditional machine learning models are sufficient to successfully distinguish between authentic and fake images at high resolution; however, they are not effective for low-resolution images. Hence, it is necessary to build a robust image classifier using powerful neural networks such as CNNs to accurately detect the artifacts left by deep generative models. Recent studies focus on training convolutional neural networks in the frequency domain to produce better results for low-resolution images as well. However, CNNs incur excessive parameters, resulting in high computation cost, long training time and high memory requirements. This paper proposes and implements a lightweight CNN that makes it possible to deploy CNNs on small devices. It requires fewer parameters without compromising accuracy.

II. RELATED WORKS
In recent years, deep learning has made remarkable progress in every domain ranging from environmental modelling [1][2][3] to computer vision and robotics. Furthermore, digital face photo and video modification is of particular interest since it makes use of the capabilities of GANs. Studies show that deepfakes are the outcome of Generative Adversarial Networks (GANs), which are two artificial neural networks that work together to produce compelling, realistic media such as photos and videos [4],[5]. A Generative Adversarial Network is capable of producing a new portrait that approximates the thousands of photos it has been trained on, without being an exact copy of any of them. This is achieved by utilizing two networks, the generator and the discriminator, that are trained on the same dataset of pictures, video recordings, or audio. The first network attempts to generate new samples that are good enough to fool the second network, which works to determine whether the new media it sees is real. Different algorithms are used for the generation and detection of deepfakes [6]-[8]. Deepfake creation methods face challenges such as identity translation, spatial inconsistencies, temporal inconsistencies and more. According to [9], some artifacts are left by the network during creation which may not be visible to the naked eye but can be detected by image forensics and machine learning. Recent studies exploit these inconsistencies to detect deepfakes successfully.

Furthermore, Tariq, S. et al. [10] and Ganiyusufoglu, I. et al. [11] were able to distinguish deepfake media from genuine media using behavioral aspects such as unusual head poses, mouth movement and unnatural blinking rate. As suggested in [12], video classification approaches can be used to take advantage of 3D input for the deepfake classification challenge. Except for the Region-based Convolutional Neural Network (RCN), all of the convolutional networks they tested were pre-trained using the Kinetics dataset, a large-scale video categorization dataset. The methods were tested using video snippets of varying lengths. The authors chose the Celeb-DF (v2) dataset for their work, which contains 590 real videos gathered from YouTube and 5639 corresponding high-quality synthesized videos. Celeb-DF was chosen since it has proven to be a complex and challenging dataset for existing deepfake detection systems due to the lack of discernible visual artifacts in synthesized videos. Lewis, J. K. et al. [13] go a step further by including spectral information along with spatial and temporal data and propose a novel multilevel CNN-LSTM (Long Short-Term Memory) pipeline to detect deepfakes. By utilizing both visual and audio artifacts, they were able to achieve high detection accuracy. As demonstrated in [14], a basic classifier following frequency domain analysis of the images provides good accuracy even with few labeled data and in unsupervised settings. Another study based on using the frequency spectrum to detect deepfakes has also shown 99% accuracy [15]. A study by Frank, J. et al. [16] experimentally shows that DCT-transformed images are fully linearly separable, in contrast to classification in the pixel domain, which requires neural networks. Kohli, A. and Gupta, A. [17] have shown good results by converting the faces extracted from videos to the frequency domain and then feeding them to a 3-layered CNN to detect deepfakes.

In another approach, face manipulation is divided into face replacement and face reenactment, and face manipulation detection is treated as a three-class classification problem [18]. The authors proposed the three-classification face manipulation detection technique (TFMD) based on attentional feature decomposition, as well as a novel face forgery feature representation. As suggested in [19], a biometric-based forensic technique can be efficiently used for detecting deep face-swap impersonations. Jung, T. et al. [20] suggested an approach called Deep Vision that analyses eye blinks that are continually repeated in a very short period of time; Deep Vision validates an anomaly based on the period, repeated number, and elapsed eye blink time. In seven out of eight different types of videos, their proposed methodology was able to correctly identify deepfakes (87.5% accuracy rate).

Deepfake techniques can only create face images with fixed dimensions and low quality, and these images must be further transformed and blurred to match the faces in the original video. This additive blur and the ROI transformations leave unique artifacts in the generated deepfake videos, which can be efficiently captured by spotting discrepancies between the Region of Interest (ROI) and the remainder of the image using the Haar wavelet transform, which helps deepfake detection [21].

III. METHODOLOGY
A. Dataset Description
Two different datasets are used: the Flickr-Faces-High Quality (FFHQ) dataset, consisting of 70000 images, and the Deepfake Detection dataset by Google, consisting of 10000 real and fake images. The images are divided in the ratio 80:20 for training and testing.

B. System Architecture
The proposed framework consists of two modules: Preprocessing and Deepfake detection.

a) Preprocessing
A robust pre-processing pipeline is employed to convert the videos into the frequency data needed by the CNN. The pipeline is further divided into two phases: face extraction and transformation to the frequency domain. Fig. 3 gives the schematic of the two phases of the preprocessing pipeline.

In the face extraction phase, the video is split into frames, and around 1/3 of frames per second are used to avoid redundant data. The frames are extracted using the ffmpeg tool, and then a face-recognition module, dlib, is used to detect faces.

In the second phase, the detected faces are converted into the frequency domain using the 2D Global Discrete Cosine Transform (2D-GDCT). The 2D DCT is applied to a 256x256 image by first dividing it into 8x8 blocks, and each block is transformed using the 2D DCT to produce 64 DCT coefficients. The DCT coefficients that represent the frequency content of the image are stored in a compact and efficient way, in the form of a matrix, with the low frequency components in the upper left corner and the high frequency components in the lower right corner. Normalization is then performed to rescale the data to the range [0,1].

Fig. 3 Pre-processing (Feature Extraction and Domain Transformation)

Global Discrete Cosine Transform (GDCT) calculation is as follows.

b) Deepfake Detection
Traditional machine learning models fail to accurately distinguish between real and fake images of low resolution. A CNN is a robust solution that works for both low and high resolution. The proposed architecture is a lightweight CNN optimized to work on small devices. It is implemented by replacing the normal convolutional layers with depthwise convolutions, and by carefully reducing the number of fully connected units without much loss in accuracy.

Depthwise Separable Convolution
Depthwise separable convolution is a type of convolutional neural network layer that consists of two consecutive layers: depthwise convolution and pointwise convolution.

In a traditional convolutional layer, the convolution is performed over the entire input volume, which includes all the input channels, using a filter that has a depth equal to the number of input channels. This can be computationally expensive, especially when dealing with large input volumes.

In a depthwise convolution, the convolution operation is performed separately for each input channel, using a filter that has a depth of one. This means that each filter is responsible for convolving only one channel of the input volume. This reduces the number of parameters required and makes the computation more efficient. After the depthwise convolution layer, a pointwise convolution layer is added to combine the output of the depthwise convolution layer. In a pointwise convolution, a 1x1 convolution is applied to the output of the depthwise convolution, to combine the filtered channels and produce the final output. This pointwise convolution helps to preserve the non-linearity of the model.

Depthwise separable convolution is obtained by combining the depthwise convolution and pointwise convolution layers. It is a more efficient alternative to traditional convolutional layers and has been used in many lightweight CNN architectures.

The proposed method uses a convolutional neural network to recognize deepfake faces in the target video. It is made up of six layers as illustrated in Fig. 4: a depthwise separable convolutional layer, an activation layer, a pooling layer, a dropout layer, a batch normalization layer and a fully connected layer. The feature extraction is handled by the convolutional, activation and pooling layers, while the fully connected layer maps the extracted features to the classification output. As shown in Fig. 5, majority voting is applied to the results of the convolutional network to classify the image as real or fake.
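The frequency-domain transformation described under Preprocessing (a 2D DCT applied to an 8x8 block, followed by rescaling to [0,1]) can be sketched as below. This is a minimal illustration under stated assumptions: it uses the standard orthonormal DCT-II, which may differ in scaling from the paper's GDCT formula; the min-max rescaling, the function names and the ramp-shaped test block are choices made here; and pure Python is used only to keep the example dependency-free (a library routine such as `scipy.fft.dctn` would normally be preferred).

```python
import math

def dct2(block):
    """Orthonormal 2D DCT-II of a square block given as a list of rows."""
    n = len(block)

    def scale(k):
        # Normalisation factor that makes the transform orthonormal.
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    coeffs = [[0.0] * n for _ in range(n)]
    for u in range(n):          # frequency index along the rows
        for v in range(n):      # frequency index along the columns
            total = 0.0
            for x in range(n):
                for y in range(n):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            coeffs[u][v] = scale(u) * scale(v) * total
    return coeffs

def rescale_01(matrix):
    """Min-max rescale all entries to [0, 1] (assumed normalisation)."""
    values = [c for row in matrix for c in row]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0     # avoid division by zero for constant input
    return [[(c - lo) / span for c in row] for row in matrix]

# Hypothetical 8x8 block: a horizontal intensity ramp 0..7 in every row.
block = [[float(y) for y in range(8)] for _ in range(8)]
coeffs = dct2(block)            # 64 coefficients, low frequencies at top-left
features = rescale_01(coeffs)   # values a CNN input layer could consume
```

The coefficient at `coeffs[0][0]` is the lowest-frequency (DC) term, which matches the description of low frequencies sitting in the upper left corner of the coefficient matrix.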

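The parameter saving of a depthwise separable convolution over a traditional convolution can be checked with a short calculation. The sketch below counts weights and biases for a single layer; the 3x3 kernel and the 64-to-128 channel configuration are hypothetical values chosen for illustration, not the layer sizes of the proposed network.

```python
def standard_conv_params(k, c_in, c_out, bias=True):
    """Each of the c_out filters spans all c_in channels with a k x k kernel."""
    return k * k * c_in * c_out + (c_out if bias else 0)

def depthwise_separable_params(k, c_in, c_out, bias=True):
    """One k x k filter per input channel, then a 1x1 pointwise convolution."""
    depthwise = k * k * c_in + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise

def reduction_percent(baseline, reduced):
    """Relative saving of `reduced` with respect to `baseline`, in percent."""
    return (baseline - reduced) / baseline * 100.0

# Hypothetical layer: 3x3 kernel, 64 input channels, 128 output channels.
std = standard_conv_params(3, 64, 128)          # 73856
sep = depthwise_separable_params(3, 64, 128)    # 8960
layer_saving = reduction_percent(std, sep)

# The same formula applied to the model totals reported in this paper.
model_saving = reduction_percent(12_502_801, 963_521)   # about 92.29
```

For this single hypothetical layer the saving is a little under 88%, which illustrates why swapping standard convolutions for depthwise separable ones shrinks a model so sharply.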
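The frame-wise detection followed by majority voting can be sketched as follows. The paper does not specify the decision threshold or how ties are handled, so the 0.5 threshold, the tie-breaking rule and the function name are assumptions made for this illustration; the scores stand for per-frame probabilities of the "fake" class produced by the CNN.

```python
from collections import Counter

def classify_video(frame_scores, threshold=0.5):
    """Label a video by majority vote over per-frame fake probabilities.

    Assumptions: a score >= threshold marks a frame as fake, and a tie
    between the two labels is resolved as "fake".
    """
    if not frame_scores:
        raise ValueError("at least one frame score is required")
    votes = Counter("fake" if s >= threshold else "real" for s in frame_scores)
    return "fake" if votes["fake"] >= votes["real"] else "real"

# Hypothetical per-frame CNN outputs for two videos.
video_a = classify_video([0.91, 0.84, 0.22, 0.77])   # three of four frames vote fake
video_b = classify_video([0.10, 0.35, 0.62])         # two of three frames vote real
```

Voting over many frames makes the final label less sensitive to a few misclassified frames than relying on any single frame.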
Fig. 4 CNN with Depthwise Separable Convolution Layers

Fig. 5 Detection stage [Framewise Detection and Majority Voting]

C. Experimental Setup
The proposed model is implemented in Python and executed on a Dell Precision T7820 workstation configured with an Intel Xeon processor, 96 GB RAM, and an Nvidia Quadro RTX5000 graphics card.
From the Keras library, Conv2D layers with 16, 32, 64, 128, and 256 filters are employed, each followed by a BatchNormalization layer, a MaxPooling layer and a DepthwiseConv2D layer. Finally, a GlobalAveragePooling2D layer and a Dense layer are added. A learning rate of 0.0005 and the Adam optimizer are used.

IV. RESULTS AND DISCUSSION
Traditional machine learning algorithms such as Logistic Regression, Support Vector Machine, Random Forest and Naïve Bayes are implemented on the high-resolution FFHQ dataset to establish the baseline performance. Table 1 presents the experimental results of these models in pixel domain and frequency domain. Fig. 6 and Fig. 7 give the pictorial representation of these results.
From the empirical results for high-resolution images, which agree with [19], it can be inferred that traditional machine learning models are not very viable in the pixel domain; nevertheless, they are able to differentiate high-resolution deepfakes in the frequency domain, as their spectrum is clearly separable.

Table 1: Accuracy of Machine Learning Models in Pixel and Frequency Domain

MODEL                 ACCURACY (Pixel Domain)   ACCURACY (Frequency Domain)
Random Forest         0.58                      0.98
SVM                   0.7                       0.88
Naïve Bayes           0.68                      0.96
Decision Tree         0.60                      0.97
Logistic Regression   0.75                      1

Fig. 6 Receiver Operating Characteristic-Area Under the Curve (ROC-AUC) Curve for Various Machine Learning Models in Pixel Domain

Fig. 7 ROC-AUC Curve for Various Machine Learning Models in Frequency Domain

A traditional CNN is implemented to detect the artifacts produced by GANs in low-resolution images. The model is trained for 15 epochs with 20,000 training samples and 2,000 validation samples, and the results of evaluation in both pixel and frequency domain are listed in Table 2. It can be observed that the model takes additional training time in the frequency domain, owing to the domain transformation overhead, but the transformation contributes a 4-5% increase in accuracy with only 20,000 training samples. Fig. 8 and Fig. 9 show the comparative results of the traditional CNN model's accuracy in pixel and frequency domain.

Table 2: Accuracy and Training Time for Traditional CNN in Pixel and Frequency Domain.

METRIC                 PIXEL DOMAIN   FREQ DOMAIN
Test Data Accuracy     85.85%         89.70%
ROC Score              0.8350         0.8860
Training Time (mins)   42             65

Fig. 8 Accuracy Curve for FFHQ Low-Resolution Dataset in Pixel Domain

Fig. 9 Accuracy Curve for FFHQ Low-Resolution Dataset in Frequency Domain

The proposed lightweight CNN results in 84.27% accuracy, which is only slightly less than the traditional CNN but with a huge reduction in the number of model parameters.
Let $P_T$ denote the number of trainable parameters for the traditional CNN.
Let $P_L$ denote the number of trainable parameters for the lightweight CNN.
Parameter reduction percentage is given by

$$\text{Parameter Reduction\%} = \frac{P_T - P_L}{P_T} \times 100 \qquad (2)$$

The parameter reduction percentage for the proposed lightweight model, with $P_L$ = 963,521 and $P_T$ = 12,502,801, is found to be 92.29%.

V. CONCLUSION AND FUTURE WORK
The fast-growing popularity of GANs and deepfake technology, with their attractive applications, has raised concerns regarding their potential misuse. It is indispensable to develop tools and techniques to detect deepfakes and combat the related cybercrimes. A lightweight CNN is designed and developed to perform deepfake detection for low-resolution images in the frequency domain. For comparison purposes, various machine learning models such as SVM, Naïve Bayes and Random Forest, and a traditional CNN, are also implemented. Results show that the machine learning algorithms are not suitable for working in the pixel domain. In the frequency domain too, their accuracy is not great for low-resolution images. On the other hand, the CNN works well for low-resolution images as well, but it incurs huge computation cost and time for training the excessive number of model parameters. The proposed lightweight CNN offers approximately the same accuracy as the traditional CNN but with a 92% reduction in the total number of parameters required.
For future work, there are two primary objectives. The first objective is to optimize the CNN model further to achieve higher accuracy for images and videos while reducing the number of trainable parameters. The second objective is to improve the model's generalizability to all existing methods of deepfake generation. This can be achieved by using a diverse dataset that encompasses a wide range of deepfake variations, including both known and emerging techniques.

By exposing the model to this comprehensive dataset, it can learn robust features and patterns that are indicative of deepfake content, enabling it to effectively detect deepfakes regardless of the method used to generate them. Iterative model evaluation and fine-tuning with diverse validation datasets should be performed. Additionally, exploring novel architectural designs, regularization techniques, and dataset augmentation approaches can further enhance the model's performance and generalizability.

REFERENCES
[1] Manna, T., & Anitha, A. (2023). Hybridization of rough set–wrapper method with regularized combinational LSTM for seasonal air quality index prediction. Neural Computing and Applications, 1-20.
[2] Manna, T., & Anitha, A. (2023). Precipitation prediction by integrating Rough Set on Fuzzy Approximation Space with Deep Learning techniques. Applied Soft Computing, 110253. https://doi.org/10.1016/j.asoc.2023.110253
[3] Manna, T., & Anitha, A. (2023). Deep Ensemble-Based Approach Using Randomized Low-Rank Approximation for Sustainable Groundwater Level Prediction. Applied Sciences, 13(5), 3210. https://doi.org/10.3390/app13053210
[4] Chadha, A., Kumar, V., Kashyap, S., & Gupta, M. (2021). Deepfake: an overview. In Proceedings of Second International Conference on Computing, Communications, and Cyber-Security: IC4S 2020 (pp. 557-566). Springer Singapore.
[5] Malik, A., Kuribayashi, M., Abdullahi, S. M., & Khan, A. N. (2022). DeepFake detection for human face images and videos: A survey. IEEE Access, 10, 18757-18775.
[6] Swathi, P., & Sk, S. (2021, September). Deepfake creation and detection: A survey. In 2021 Third International Conference on Inventive Research in Computing Applications (ICIRCA) (pp. 584-588). IEEE.
[7] Yu, P., Xia, Z., Fei, J., & Lu, Y. (2021). A survey on deepfake video detection. IET Biometrics, 10(6), 607-624.
[8] Mirsky, Y., & Lee, W. (2021). The creation and detection of deepfakes: A survey. ACM Computing Surveys (CSUR), 54(1), 1-41.
[9] Maksutov, A. A., Morozov, V. O., Lavrenov, A. A., & Smirnov, A. S. (2020, January). Methods of deepfake detection based on machine learning. In 2020 IEEE Conference of Russian Young Researchers in Electrical and Electronic Engineering (EIConRus) (pp. 408-411). IEEE.
[10] Tariq, S., Lee, S., & Woo, S. S. (2020). A convolutional LSTM based residual network for deepfake video detection. arXiv preprint arXiv:2009.07480.
[11] Ganiyusufoglu, I., Ngô, L. M., Savov, N., Karaoglu, S., & Gevers, T. (2020). Spatio-temporal features for generalized detection of deepfake videos. arXiv preprint arXiv:2010.11844.
[12] De Lima, O., Franklin, S., Basu, S., Karwoski, B., & George, A. (2020). Deepfake detection using spatiotemporal convolutional networks.
[13] Lewis, J. K., Toubal, I. E., Chen, H., Sandesera, V., Lomnitz, M., Hampel-Arias, Z., ... & Palaniappan, K. (2020, October). Deepfake video detection based on spatial, spectral, and temporal inconsistencies using multimodal deep learning. In 2020 IEEE Applied Imagery Pattern Recognition Workshop (AIPR) (pp. 1-9). IEEE.
[14] Durall Lopez, R., Keuper, M., Pfreundt, F. J., & Keuper, J. (2019). Unmasking deepfakes with simple features.
[15] Dzanic, T., Shah, K., & Witherden, F. (2020). Fourier spectrum discrepancies in deep network generated images. Advances in Neural Information Processing Systems, 33, 3022-3032.
[16] Frank, J., Eisenhofer, T., Schönherr, L., Fischer, A., Kolossa, D., & Holz, T. (2020, November). Leveraging frequency analysis for deep fake image recognition. In International Conference on Machine Learning (pp. 3247-3258). PMLR.
[17] Kohli, A., & Gupta, A. (2021). Detecting deepfake, faceswap and face2face facial forgeries using frequency CNN. Multimedia Tools and Applications, 80, 18461-18478.
[18] Cao, Y., Chen, J., Huang, L., Huang, T., & Ye, F. (2023). Three-classification face manipulation detection using attention-based feature decomposition. Computers & Security, 125, 103024.
[19] Agarwal, S., Farid, H., El-Gaaly, T., & Lim, S. N. (2020, December). Detecting deep-fake videos from appearance and behavior. In 2020 IEEE International Workshop on Information Forensics and Security (WIFS) (pp. 1-6). IEEE.
[20] Jung, T., Kim, S., & Kim, K. (2020). Deepvision: Deepfakes detection using human eye blinking pattern. IEEE Access, 8, 83144-83154.
[21] Younus, M. A., & Hasan, T. M. (2020, April). Effective and fast deepfake detection method based on haar wavelet transform. In 2020 International Conference on Computer Science and Software Engineering (CSASE) (pp. 186-190). IEEE.
