Abstract—The fast and continuous growth in the number and quality of deepfake videos calls for the development of reliable detection systems capable of automatically warning users on social media and on the Internet about the potential untruthfulness of such contents. While algorithms, software, and smartphone apps are getting better every day at generating manipulated videos and swapping faces, the accuracy of automated systems for face forgery detection in videos is still quite limited and generally biased toward the dataset used to design and train a specific detection system. In this paper we analyze how different training strategies and data augmentation techniques affect CNN-based deepfake detectors when training and testing on the same dataset or across different datasets.

WIFS'2020, December 6-11, 2020, New York, USA. 978-1-7281-9930-6/20/$31.00 ©2020 IEEE. This work was supported by the PREMIER project, funded by the Italian Ministry of Education, University, and Research within the PRIN 2017 program. Hardware support was generously provided by the NVIDIA Corporation.

I. INTRODUCTION

As the number of techniques and algorithms to generate deepfake videos and swap faces grows rapidly, the effort of the forensic community is steering ever more towards the development of reliable, robust, and automated deepfake detection methods. Techniques and pipelines for facial manipulation [1] and facial expression transfer between videos [2], [3] are rapidly improving [4], while the availability of source code (Deepfake [5], FaceSwap [6]) and even smartphone apps (Impressions [7], Doublicat [8]) makes face swapping available to a wider audience with either legitimate or harmful intents. Tampered video detection is not a novel task for the forensics community [9]–[11]. Codec history [12], [13], copy-move detection [14], [15], and frame duplication or deletion [16], [17] are just a few examples of the many contributions of the last decades. The main drawback of the earlier systems developed by the community is that the exploited traces are inherently subtle and vanish with compression or multiple editing operations [10]. The first generation of deepfake detection methods exploited several semantic traces, including eye blinking [18], face warping [19], head poses [20], or lighting inconsistencies [21]. Due to the improvement of new and more accurate generation techniques, methods based on semantic artifacts began to fail, leading to the proposal of data-driven solutions capable of providing localization information through multi-task learning [22], attention mechanisms [23], and ensembles of CNNs [24].

As detecting manipulated faces in videos becomes more important [25], [26], many deepfake detection systems proposed in the literature and in challenges are based on data-driven approaches, often backed by one or more CNNs trained on a specific dataset. However, the black-box model of data-driven CNN-based methods is notoriously prone to a drawback: over-fitting. Oftentimes, a bare train/validation/test split done within a single dataset, collected with a uniform methodology and by a single team, proves insufficient to avoid over-fitting on that very same dataset's conditions and scenarios. A recent example is shown in [27], where the winning model of the Facebook/Kaggle DeepFake Detection Challenge [28] scored an Average Precision of 82.56% on the public dataset used for the temporary leaderboard of the challenge, and then dropped to 65.18% on the sequestered dataset used for the final evaluation. Moreover, it is known that data dependency creates the risk of developing solutions unable to generalize to unseen methods or contexts.

While most detectors prove to be very effective on a test subset coming from the same data distribution they are trained on, what is the detection performance in a cross-dataset scenario? What happens when a CNN trained for deepfake detection on dataset A is tested on datasets B, C, and D? As it is difficult to gain direct insights about what happens inside a CNN black-box model, in this paper we offer a set of preliminary analyses of the cross-dataset performance of CNN-based deepfake detection approaches. Rather than focusing on developing a new technique optimized for a specific dataset, we train one of the most popular architectures used by competitors in the DeepFake Detection Challenge [28] and evaluate how different training approaches [24] and data augmentation techniques [29] affect the intra-dataset and cross-dataset detection performance. We base our experiments on publicly accessible datasets, i.e., FaceForensics++ [30], the DeepFake Detection Challenge Dataset [28], and CelebDF(v2) [31]. We focus on faces extracted from deepfake videos rather than just deepfake images, as video compression is usually stronger than image compression. We also perform some analysis taking into account a limited availability of training data. Far from being an exhaustive evaluation or overview of all the available techniques and datasets, we wish to share with the readers some insights to consider when developing a new deepfake detection system.
II. METHODOLOGY

In order to effectively compare the intra-dataset and cross-dataset detection performances, we first need to define a homogeneous training and testing methodology. The process of determining whether a face in a video is manipulated starts with a face detection and extraction phase. We rely on BlazeFace [32], a fast and GPU-enabled face detector, and we extract the face with the highest confidence from 32 frames for each video, uniformly sampled over time. This choice follows from [24], thus taking into account that time and computational power may be a limited resource. As the extracted faces have different scales and aspect ratios, we crop the faces with a fixed aspect ratio of 1:1 before resizing to a fixed size of 256 × 256 pixels. Once faces are extracted and uniform in size, we train an EfficientNetB4 [33] architecture as reference CNN, due to its popularity in the DeepFake Detection Challenge. The trained model is used to predict the likelihood of each face being fake. Results are reported at frame level as the Area Under the Curve (AUC) of a Receiver Operating Characteristic (ROC) curve.

TABLE I
ROC AUC FOR BASELINE INTRA- AND CROSS-DATASET DETECTION PERFORMANCE

Train\Test   CelebDF   DF      DFD     DFDC
CelebDF      0.998     0.615   0.708   0.665
DF           0.734     0.960   0.844   0.695
DFD          0.754     0.636   0.987   0.669
DFDC         0.755     0.722   0.891   0.922

TABLE II
ROC AUC FOR TRIPLET TRAINING INTRA- AND CROSS-DATASET DETECTION PERFORMANCE

Train\Test   CelebDF   DF      DFD     DFDC
CelebDF      0.995     0.557   0.554   0.619
DF           0.717     0.960   0.829   0.684
DFD          0.759     0.709   0.882   0.666
DFDC         0.773     0.714   0.886   0.907
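All values reported in this paper, including Tables I and II above, are frame-level ROC AUC scores. As a minimal illustration (not part of the original pipeline) of how such a score can be computed, assuming scikit-learn is available and per-frame fake-likelihood scores have already been collected:

    # Hedged example: frame-level ROC AUC from per-frame CNN outputs.
    from sklearn.metrics import roc_auc_score

    labels = [1, 1, 0, 0, 1, 0]                     # placeholder ground truth (1 = fake face)
    scores = [0.92, 0.71, 0.18, 0.40, 0.66, 0.07]   # placeholder fake-likelihood scores

    print(roc_auc_score(labels, scores))            # area under the ROC curve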
loss. Training and validation batches are always balanced, with
Among the several available datasets, we select the fol-
randomly selected equal amounts of real and fake faces. No
lowing four, due to their availability and ease of access and
data augmentation is performed at this stage. We train four
download:
CNN models on the training sets of the four datasets, then
• DF: FaceForensics [30], in its original version with 1000
test each model against the test set of each dataset.
real videos and 4000 fake videos generated with four
Results are reported in Table I, where the header column
different methods.
denotes the training dataset, while the header row reports the
• DFD: Actors-based videos added to FaceForensics [34],
test dataset. Reading the table by rows, we observe how on
with 363 real and 3068 fake videos.
CelebDF and DFD the intra-dataset detection is very accurate,
• DFDC: The DeepFake Detection Challenge [28], with
with an AUC above 0.98. This, however, is not reflected on
19154 real and 100000 fake videos.
cross-dataset performance, as the model trained on CelebDF
• CelebDF: The Celeb-DF(v2) dataset [31], with 890 real
and tested on DFD presents an AUC of just 0.708 (29% gap
and 5639 fake videos.
compared to intra-dataset AUC), while the model trained on
The four dataset are divided into disjoint train, validation, and DFD and tested on CelebDF reaches an AUC of 0.754 (23%
test sets at video level. In particular, for DF and DFD we gap). The model trained on DF has a slightly lower AUC
follow the 720/140/140 split proportion as suggested in [30]. when tested on the same dataset (0.960) with a 12% gap when
For DFDC we use the folders from 40 to 49 as test set and tested on DFD. DFDC is the dataset presenting the lowest
the folders from 35 to 39 as validation set. The remaining 40 intra-dataset AUC (0.922) being at the same time the one
folders are the training set. For CelebDF we use the test set that generalizes better, with 3%, 17%, and 20% gap to DFD,
provided by the dataset itself, and we randomly select 15% CelebDF, and DF, respectively. The baseline results are in
of the videos as validation set, with the remaining 85% for line with what expected from data-driven methods: the largest
training. For both DF and DFD we consider only the videos dataset (i.e., DFDC) seems to provide more variety during the
compressed with H.264 at CRF 23. training phase, thus better generalization on unseen data.
We run all our experiments with the PyTorch [35] frame-
work on a workstation equipped with two Intel Xeon E5-
IV. T RAINING STRATEGY
2687W-v4 and several NVIDIA Titan V.
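To make the extraction step concrete, the following is a minimal sketch (not the authors' code) of the preprocessing described above: 32 uniformly sampled frames per video, the highest-confidence face per frame, a square crop, and resizing to 256 × 256 pixels. The detect_faces callable is a placeholder standing in for BlazeFace [32]; only OpenCV and NumPy are assumed.

    import cv2
    import numpy as np

    def extract_faces(video_path, detect_faces, n_frames=32, out_size=256):
        """Sample n_frames uniformly; return square face crops resized to out_size."""
        cap = cv2.VideoCapture(video_path)
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        crops = []
        for idx in np.linspace(0, total - 1, n_frames, dtype=int):
            cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
            ok, frame = cap.read()
            if not ok:
                continue
            dets = detect_faces(frame)  # assumed to return (x, y, w, h, confidence) tuples
            if not dets:
                continue
            x, y, w, h, _ = max(dets, key=lambda d: d[-1])   # keep highest-confidence face
            side = max(w, h)                                  # enforce a 1:1 aspect ratio
            cx, cy = x + w // 2, y + h // 2
            x0, y0 = max(cx - side // 2, 0), max(cy - side // 2, 0)
            crop = frame[y0:y0 + side, x0:x0 + side]
            crops.append(cv2.resize(crop, (out_size, out_size)))
        cap.release()
        return crops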
III. BASELINE

As a baseline for the upcoming experiments, we first evaluate the deepfake detection performance of EfficientNetB4 trained as a classifier with the Binary Cross Entropy (BCE) loss. The network is initialized with a model pre-trained on ImageNet and trained with batches of 32 faces and the Adam optimizer, with an initial learning rate of 10^-4 multiplied by a factor 0.1 after 2000 batch iterations with no reduction in validation loss. The training ends when the learning rate falls below 10^-8. The final model is the one at the iteration that minimizes the validation loss. Training and validation batches are always balanced, with randomly selected equal amounts of real and fake faces. No data augmentation is performed at this stage. We train four CNN models on the training sets of the four datasets, then test each model against the test set of each dataset.

Results are reported in Table I, where the header column denotes the training dataset, while the header row reports the test dataset. Reading the table by rows, we observe that on CelebDF and DFD the intra-dataset detection is very accurate, with an AUC above 0.98. This, however, is not reflected in cross-dataset performance, as the model trained on CelebDF and tested on DFD presents an AUC of just 0.708 (a 29% gap compared to the intra-dataset AUC), while the model trained on DFD and tested on CelebDF reaches an AUC of 0.754 (23% gap). The model trained on DF has a slightly lower AUC when tested on the same dataset (0.960), with a 12% gap when tested on DFD. DFDC is the dataset presenting the lowest intra-dataset AUC (0.922), while being at the same time the one that generalizes best, with 3%, 17%, and 20% gaps to DFD, CelebDF, and DF, respectively. The baseline results are in line with what is expected from data-driven methods: the largest dataset (i.e., DFDC) seems to provide more variety during the training phase, and thus better generalization on unseen data.
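For reference, a minimal sketch of this baseline training setup, assuming PyTorch and the timm implementation of EfficientNetB4 (the original code may differ); data loading and the balanced sampling of real and fake faces are omitted.

    import torch
    import timm  # assumption: timm is used to instantiate EfficientNetB4

    model = timm.create_model("efficientnet_b4", pretrained=True, num_classes=1)
    criterion = torch.nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    # Learning rate multiplied by 0.1 when the validation loss stops improving;
    # here patience is counted in validation checks rather than batch iterations.
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1, patience=10)

    def train_step(faces, labels):
        """faces: (B, 3, 256, 256) tensor; labels: (B,) tensor with 1 = fake, 0 = real."""
        model.train()
        optimizer.zero_grad()
        loss = criterion(model(faces).squeeze(1), labels.float())
        loss.backward()
        optimizer.step()
        return loss.item()

    # After each validation pass: scheduler.step(val_loss); stop once
    # optimizer.param_groups[0]["lr"] < 1e-8 and keep the checkpoint with the
    # lowest validation loss as the final model.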
[Figure 1: grid of plots, one row per training dataset (CelebDF, DF, DFD, DFDC) and one column per test dataset (Test on CelebDF, DF, DFD, DFDC); x-axis: # videos in train set; y-axis: ROC AUC; legend: BCE vs. Triplet.]

Fig. 1. ROC AUC for BCE and triplet training in data-limited conditions. For each dataset two CNNs are trained selecting an increasing number of videos with BCE and triplet loss. Interestingly, we can see that the cross-dataset performances are generally higher on DFD. This might be related to the overall quality of the dataset: while DFD consists generally of high-resolution videos, the other datasets are more varied and also contain low-quality samples. Training for detection in such difficult settings might therefore be helpful in generalizing to different, yet higher-quality, datasets.
[Figure 2: four two-dimensional MDS scatter plots of REAL and FAKE frames. (a) Trained on CelebDF with BCE loss; (b) Trained on CelebDF with triplet loss; (c) Trained on DFDC with BCE loss; (d) Trained on DFDC with triplet loss.]

Fig. 2. MDS projection of 10 pairs of REAL/FAKE videos from the CelebDF test dataset. Each point represents a frame in a video; 32 frames are extracted from each video. Projections are produced starting from the features extracted by an EfficientNetB4 architecture trained on different datasets with binary cross entropy (BCE) or triplet loss.
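For readers interested in reproducing this kind of visualization, here is a minimal sketch (assuming scikit-learn) of projecting per-frame CNN feature vectors onto two dimensions; obtaining the feature matrix from the trained network is left as a placeholder.

    import numpy as np
    from sklearn.manifold import MDS

    def project_features(features, random_state=0):
        """features: (N, 1792) array of per-frame CNN feature vectors.
        Returns an (N, 2) array of points to scatter-plot, colored by REAL/FAKE."""
        return MDS(n_components=2, random_state=random_state).fit_transform(features)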
IV. TRAINING STRATEGY

The first analysis we perform is related to the training strategy adopted for the CNN. Instead of relying on the Binary Cross Entropy (BCE) loss, we train the CNN with a triplet loss [36], by running the CNN up to the last-minus-one layer (the features layer). Considering triplets as (anchor sample, positive sample, negative sample), we generate the training triplets as (fake face, fake face, real face) and (real face, real face, fake face) in equal numbers for each batch, so as to balance the batch itself. The training proceeds in a two-step fashion.

In the first step the CNN is trained with the triplet loss up to the features layer. In the second step, only the last layer (classifier) of the CNN is trained (fine-tuned) with binary cross entropy. In the EfficientNetB4 architecture, feature vectors have 1792 elements, while the classifier has 1793 weights (1792 multipliers and one bias coefficient). This means the classification layer accounts for less than 0.01% of the network coefficients. Triplet training is initialized with the model trained through BCE from the baseline, as this provides faster convergence and prevents the model from falling into a trivial solution (an all-zeros feature vector). The batch size is 10 triplets to fit into 12 GB of GPU memory; the initial learning rate is set to 10^-5 and is dropped by a factor 10 after 500 batch iterations with no improvement on the validation loss. The fine-tuning of the classifier is initialized with the triplet-trained model, with an initial learning rate of 10^-6 dropped by a factor 10 after 100 iterations with no validation loss improvement. Both the triplet training and the fine-tuning process are stopped when the learning rate falls below 10^-8. For both steps, the model at the iteration with the smallest validation loss is selected as the final one.

As for the baseline, we are interested in understanding both the intra- and cross-dataset detection performance, as reported in Table II. The results for DF, DFD, and CelebDF show almost the same intra-dataset detection AUC as with BCE training, with a loss in generalization capability more marked for the CelebDF dataset. For the model trained on DFDC, the intra-dataset detection AUC is similar to the BCE training, with slightly better cross-dataset AUC (a modest 2% increase with respect to the same combination in BCE training) only when testing on CelebDF.

A different perspective on the differences between BCE and triplet losses is offered in Figure 1, where EfficientNetB4 is trained in data-limited conditions by sub-sampling the training dataset. In this context, triplet loss proves beneficial in intra-dataset detection (DFDC, CelebDF, and DF) as well as in cross-dataset detection, and outperforms BCE.
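A minimal sketch of this two-step procedure, again assuming PyTorch and timm (not the authors' code): step one optimizes the feature extractor with a triplet margin loss on (anchor, positive, negative) face batches; step two freezes the backbone and fine-tunes only the final classifier with BCE. The margin value and other unstated details are assumptions.

    import torch
    import timm  # assumption: timm is used to instantiate EfficientNetB4

    # num_classes=0 makes timm return the pooled 1792-d feature vector; in the
    # paper the backbone is initialized from the BCE-trained baseline checkpoint.
    backbone = timm.create_model("efficientnet_b4", pretrained=True, num_classes=0)
    classifier = torch.nn.Linear(1792, 1)  # 1792 multipliers and one bias, as stated above

    # Step 1: triplet training of the feature extractor.
    triplet_loss = torch.nn.TripletMarginLoss(margin=1.0)  # margin value is an assumption
    opt_feat = torch.optim.Adam(backbone.parameters(), lr=1e-5)

    def triplet_step(anchor, positive, negative):
        """Triplets are (fake, fake, real) or (real, real, fake) face batches."""
        opt_feat.zero_grad()
        loss = triplet_loss(backbone(anchor), backbone(positive), backbone(negative))
        loss.backward()
        opt_feat.step()
        return loss.item()

    # Step 2: freeze the backbone and fine-tune only the classifier with BCE.
    for p in backbone.parameters():
        p.requires_grad = False
    opt_cls = torch.optim.Adam(classifier.parameters(), lr=1e-6)
    bce = torch.nn.BCEWithLogitsLoss()

    def finetune_step(faces, labels):
        opt_cls.zero_grad()
        loss = bce(classifier(backbone(faces)).squeeze(1), labels.float())
        loss.backward()
        opt_cls.step()
        return loss.item()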
Even though the triplet training procedure is not revolutionary in terms of AUC, we are interested in analyzing the differences in the representations learned at feature level with BCE and triplet loss, on the same dataset and across different datasets. To this end, Figure 2 shows the Multidimensional Scaling (MDS) projection onto two components of the features extracted with four differently trained EfficientNetB4 models. All four subplots project faces from the very same 10 pairs of real/fake videos randomly extracted from the CelebDF test set. Figure 2a uses features extracted with the CNN trained on the CelebDF dataset with BCE, while Figure 2b uses features extracted with the CNN trained on the same dataset with triplet loss instead. While in both cases the separation between real and fake frames is quite evident, in the triplet case the overlapping frames are fewer. This improvement could prove useful when aggregating the predictions from several frames at video level. Figures 2c and 2d are generated with features extracted by CNNs trained on DFDC with BCE and triplet loss, respectively. While the overlap between real and fake frames is certainly more evident than in Figures 2a and 2b, the triplet loss seems to offer a bit more separation between the two classes, despite the feature extractor being trained on a different dataset.

V. DATA AUGMENTATION

The second batch of experiments is devoted to understanding the effect of different data augmentation techniques. It is known that for deepfake images [29] some types of data augmentation prove beneficial in terms of robustness and cross-dataset generalization. Among the many possible data augmentation techniques, we focus on the subset that could represent the transformations a face undergoes in the wild. The following augmentations are considered:

• HF: Horizontal Flip
• BC: Brightness and Contrast changes
• HSV: Hue, Saturation and Value changes
• ISO: Addition of ISO noise
• GAUS: Addition of Gaussian noise
• DS: Downscaling with a factor between 0.7 and 0.9
• JPEG: JPEG compression with a random quality factor between 50 and 99

We test the aforementioned augmentations independently, training with BCE on the DFDC dataset. All the proposed experiments are performed with the Albumentations [37] framework. Results are reported in Figure 3, ordered left to right in decreasing order of AUC on the DFDC test set.

[Figure 3: ROC AUC per augmentation (HF, BC, NONE, JPEG, HSV, GAUS, ISO, DS), one curve per test dataset (CelebDF, DF, DFD, DFDC); y-axis: ROC AUC between 0.6 and 1.0.]

Fig. 3. ROC AUC of EfficientNetB4 trained with BCE on DFDC with different augmentation techniques. HF: Horizontal Flip. BC: Brightness and Contrast change. HSV: Hue, Saturation and Value changes. ISO: Addition of ISO noise. GAUS: Addition of Gaussian noise. DS: Down-scaling. JPEG: JPEG compression. MIX: baseline mix of all the other single augmentations. NONE: no augmentations. The horizontal lines are the AUC values when no augmentations are used.

Two interesting considerations can be drawn in light of these results.

First, augmentations do not seem to help much in increasing intra-dataset detection, maybe due to the cross-contamination between train, validation, and test sets in terms of video settings and scenarios. The only exception is the HF augmentation, which provides a boost of just 0.7% in AUC.

Second, some augmentations are beneficial (at times by a large margin) in terms of cross-dataset generalization. In particular, HF, BC, HSV, and JPEG provide an AUC increase on both CelebDF and DFD. DF does not seem to benefit much from augmentations, maybe due to the very different scenes depicted in DFDC compared to the ones in DF. While the former has actors at a distance, moving in the scene, often two actors, the latter has almost only a single actor, in the center of the scene, in a TV studio or during an interview with studio-level lights.
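For reference, here is a minimal sketch of how these augmentations, and the combined HF+BC+HSV+JPEG pipeline introduced next, can be expressed with Albumentations [37]. The application probabilities and any parameter not stated above are assumptions or library defaults.

    import albumentations as A

    # Single augmentations, each tested independently while training on DFDC.
    single_augs = {
        "HF":   A.HorizontalFlip(p=0.5),
        "BC":   A.RandomBrightnessContrast(p=0.5),
        "HSV":  A.HueSaturationValue(p=0.5),
        "ISO":  A.ISONoise(p=0.5),
        "GAUS": A.GaussNoise(p=0.5),
        "DS":   A.Downscale(scale_min=0.7, scale_max=0.9, p=0.5),
        "JPEG": A.ImageCompression(quality_lower=50, quality_upper=99, p=0.5),
    }

    # Combined pipeline built from the augmentations that helped cross-dataset AUC.
    combined = A.Compose([
        A.HorizontalFlip(p=0.5),
        A.RandomBrightnessContrast(p=0.5),
        A.HueSaturationValue(p=0.5),
        A.ImageCompression(quality_lower=50, quality_upper=99, p=0.5),
    ])

    # Usage on a face crop (H x W x 3 uint8 array): augmented = combined(image=face)["image"]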
In light of the results for single augmentations, we build a data augmentation pipeline based on HF, BC, HSV, and JPEG, and re-train the CNN with both BCE and triplet loss. Table III reports the results for BCE loss. The fusion of augmentations brings important improvements in terms of cross-dataset detection AUC, with up to +9% when training on DFD and testing on CelebDF, and when training on CelebDF and testing on DFD. The intra-dataset detection performance is instead mostly unaffected. With augmentations applied to the CNN trained with triplet loss, Table IV shows how the few beneficial effects of triplet loss observed when training on the full dataset are not visible anymore. In fact, triplet loss with data augmentations provides a lower AUC for almost all combinations compared to BCE loss with data augmentation.

VI. CONCLUSIONS

Two main conclusions can be drawn from the experiments presented in this paper. First, a carefully built and tested data-augmentation pipeline can prove useful in increasing the generalization of a CNN model for deepfake video detection across different datasets. Not all augmentations are beneficial though, and checking the usefulness of each type of augmentation could be an important step in the workflow of developing a detection pipeline. Second, triplet loss proves to be helpful in terms of both intra-dataset and cross-dataset detection performance under limited availability of training data. When large datasets are available, data augmentation on a BCE-trained CNN architecture proves to be the winning combination.

REFERENCES

[1] M. Zollhöfer, J. Thies, P. Garrido, D. Bradley, T. Beeler, P. Pérez, M. Stamminger, M. Nießner, and C. Theobalt, "State of the art on monocular 3D face reconstruction, tracking, and applications," Computer Graphics Forum, vol. 37, pp. 523–550, 2018.
[2] J. Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, and M. Nießner, "Face2Face: Real-time face capture and reenactment of RGB videos," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[3] J. Thies, M. Zollhöfer, and M. Nießner, "Deferred neural rendering: Image synthesis using neural textures," ACM Transactions on Graphics (TOG), vol. 38, pp. 1–12, 2019.
[4] L. Li, J. Bao, H. Yang, D. Chen, and F. Wen, "Advancing high fidelity identity swapping for forgery detection," in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[5] "Deepfakes GitHub," https://github.com/deepfakes/faceswap.
[6] "FaceSwap," https://github.com/MarekKowalski/FaceSwap/.
[7] "Impressions," https://impressions.app/.
[8] "Doublicat," https://doublicat.com/.
[9] A. Rocha, W. Scheirer, T. Boult, and S. Goldenstein, "Vision of the unseen: Current trends and challenges in digital image and video forensics," ACM Computing Surveys, vol. 43, pp. 1–42, 2011.
[10] S. Milani, M. Fontani, P. Bestagini, M. Barni, A. Piva, M. Tagliasacchi, and S. Tubaro, "An overview on video forensics," APSIPA Transactions on Signal and Information Processing, vol. 1, p. e2, 2012.
[11] M. C. Stamm, M. Wu, and K. J. R. Liu, "Information forensics: An overview of the first decade," IEEE Access, vol. 1, pp. 167–200, 2013.
[12] P. Bestagini, S. Milani, M. Tagliasacchi, and S. Tubaro, "Codec and GOP identification in double compressed videos," IEEE Transactions on Image Processing (TIP), vol. 25, pp. 2298–2310, 2016.
[13] D. Vázquez-Padín, M. Fontani, D. Shullani, F. Pérez-González, A. Piva, and M. Barni, "Video integrity verification and GOP size estimation via generalized variation of prediction footprint," IEEE Transactions on Information Forensics and Security (TIFS), vol. 15, pp. 1815–1830, 2020.
[14] P. Bestagini, S. Milani, M. Tagliasacchi, and S. Tubaro, "Local tampering detection in video sequences," in IEEE International Workshop on Multimedia Signal Processing (MMSP), 2013.
[15] L. D'Amiano, D. Cozzolino, G. Poggi, and L. Verdoliva, "A patchmatch-based dense-field algorithm for video copy-move detection and localization," IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), vol. 29, pp. 669–682, 2019.
[16] M. C. Stamm, W. S. Lin, and K. J. R. Liu, "Temporal forensics and anti-forensics for motion compensated video," IEEE Transactions on Information Forensics and Security (TIFS), vol. 7, pp. 1315–1329, 2012.
[17] A. Gironi, M. Fontani, T. Bianchi, A. Piva, and M. Barni, "A video forensic technique for detecting frame deletion and insertion," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.
[18] Y. Li, M. Chang, and S. Lyu, "In ictu oculi: Exposing AI created fake videos by detecting eye blinking," in IEEE International Workshop on Information Forensics and Security (WIFS), 2018.
[19] Y. Li and S. Lyu, "Exposing deepfake videos by detecting face warping artifacts," in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019.
[20] X. Yang, Y. Li, and S. Lyu, "Exposing deep fakes using inconsistent head poses," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
[21] F. Matern, C. Riess, and M. Stamminger, "Exploiting visual artifacts to expose deepfakes and face manipulations," in IEEE Winter Applications of Computer Vision Workshops (WACVW), 2019.
[22] H. H. Nguyen, F. Fang, J. Yamagishi, and I. Echizen, "Multi-task learning for detecting and segmenting manipulated facial images and videos," CoRR, vol. abs/1906.06876, 2019.
[23] H. Dang, F. Liu, J. Stehouwer, X. Liu, and A. Jain, "On the detection of digital face manipulation," CoRR, vol. abs/1910.01717, 2019.
[24] N. Bonettini, E. D. Cannas, S. Mandelli, L. Bondi, P. Bestagini, and S. Tubaro, "Video face manipulation detection through ensemble of CNNs," in International Conference on Pattern Recognition (ICPR), 2020.
[25] S. Agarwal, H. Farid, Y. Gu, M. He, K. Nagano, and H. Li, "Protecting world leaders against deep fakes," in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019.
[26] L. Verdoliva, "Media forensics and deepfakes: an overview," CoRR, vol. abs/2001.06564, 2020.
[27] "DeepFake Detection Challenge results," https://ai.facebook.com/blog/deepfake-detection-challenge-results-an-open-initiative-to-advance-ai.
[28] B. Dolhansky, J. Bitton, B. Pflaum, J. Lu, R. Howes, M. Wang, and C. Canton Ferrer, "The DeepFake Detection Challenge dataset," CoRR, vol. abs/2006.07397, 2020.
[29] S.-Y. Wang, O. Wang, R. Zhang, A. Owens, and A. A. Efros, "CNN-generated images are surprisingly easy to spot... for now," CoRR, vol. abs/1912.11035, 2019.
[30] A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner, "FaceForensics++: Learning to detect manipulated facial images," in International Conference on Computer Vision (ICCV), 2019.
[31] Y. Li, P. Sun, H. Qi, and S. Lyu, "Celeb-DF: A large-scale challenging dataset for DeepFake forensics," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[32] V. Bazarevsky, Y. Kartynnik, A. Vakunov, K. Raveendran, and M. Grundmann, "BlazeFace: Sub-millisecond neural face detection on mobile GPUs," CoRR, vol. abs/1907.05047, 2019.
[33] M. Tan and Q. V. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," in International Conference on Machine Learning (ICML), 2019.
[34] "FaceForensics++," https://ai.googleblog.com/2019/09/contributing-data-to-deepfake-detection.html.
[35] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, "PyTorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems (NIPS), 2019.
[36] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu, "Learning fine-grained image similarity with deep ranking," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[37] A. V. Buslaev, A. Parinov, E. Khvedchenya, V. I. Iglovikov, and A. A. Kalinin, "Albumentations: fast and flexible image augmentations," CoRR, vol. abs/1809.06839, 2018.