
University of Groningen

Source Camera Device Identification from Videos


Bennabhaktula, Guru Swaroop; Timmerman, Derrick; Alegre, Enrique; Azzopardi, George

Published in:
SN Computer Science

DOI:
10.1007/s42979-022-01202-0

IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from
it. Please check the document version below.

Document Version
Publisher's PDF, also known as Version of record

Publication date:
2022

Link to publication in University of Groningen/UMCG research database

Citation for published version (APA):


Bennabhaktula, G. S., Timmerman, D., Alegre, E., & Azzopardi, G. (2022). Source Camera Device Identification from Videos. SN Computer Science, 3(4), Article 316. https://doi.org/10.1007/s42979-022-01202-0

Copyright
Other than for strictly personal use, it is not permitted to download or to forward/distribute the text or part of it without the consent of the
author(s) and/or copyright holder(s), unless the work is under an open content license (like Creative Commons).

The publication may also be distributed here under the terms of Article 25fa of the Dutch Copyright Act, indicated by the “Taverne” license.
More information can be found on the University of Groningen website: https://www.rug.nl/library/open-access/self-archiving-pure/taverne-amendment.

Take-down policy
If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately
and investigate your claim.

Downloaded from the University of Groningen/UMCG research database (Pure): http://www.rug.nl/research/portal. For technical reasons the number of authors shown on this cover page is limited to 10 maximum.

Download date: 27-10-2024


SN Computer Science (2022) 3:316
https://doi.org/10.1007/s42979-022-01202-0

ORIGINAL RESEARCH

Source Camera Device Identification from Videos


Guru Swaroop Bennabhaktula1,2,3 · Derrick Timmerman1 · Enrique Alegre2,3 · George Azzopardi1

Received: 14 June 2021 / Accepted: 12 May 2022


© The Author(s) 2022

Abstract
Source camera identification is an important and challenging problem in digital image forensics. The clues of the device
used to capture the digital media are very useful for Law Enforcement Agencies (LEAs), especially to help them collect
more intelligence in digital forensics. In our work, we focus on identifying the source camera device based on digital vid-
eos using deep learning methods. In particular, we evaluate deep learning models with increasing levels of complexity for
source camera identification and show that with such sophistication the scene-suppression techniques do not aid in model
performance. In addition, we mention several common machine learning strategies that are counter-productive in achieving a
high accuracy for camera identification. We conduct systematic experiments using 28 devices from the VISION data set and
evaluate the model performance on various video scenarios—flat (i.e., homogeneous), indoor, and outdoor and evaluate the
impact on classification accuracy when the videos are shared via social media platforms such as YouTube and WhatsApp.
Unlike traditional PRNU (Photo Response Non-Uniformity) noise-based methods, which require flat frames to estimate the camera
reference pattern noise, the proposed method has no such constraint and we achieve an accuracy of 72.75 ± 1.1% on the
benchmark VISION data set. Furthermore, we also achieve state-of-the-art accuracy of 71.75% on the QUFVD data set in
identifying 20 camera devices. These two results are the best ever reported on the VISION and QUFVD data sets. Finally,
we demonstrate the runtime efficiency of the proposed approach and its advantages to LEAs.

Keywords Source camera identification · Sensor pattern noise · Digital video forensics · Constrained ConvNet

* Corresponding author: Guru Swaroop Bennabhaktula, [email protected]
Derrick Timmerman, [email protected]
Enrique Alegre, [email protected]
George Azzopardi, [email protected]

1 Information Systems Group at the Bernoulli Institute for Mathematics, Computer Science, and Artificial Intelligence, University of Groningen, Groningen, The Netherlands
2 Group for Vision and Intelligent Systems at the Department of Electrical, Systems, and Automation, University of León, León, Spain
3 Researcher at INCIBE (Spanish National Cybersecurity Institute), León, Spain

This article is part of the topical collection “Pattern Recognition Applications and Methods” guest edited by Ana Fred, Maria De Marsico and Gabriella Sanniti di Baja.

Introduction

With the widespread increase in the consumption of digital content, camera device identification has gained a lot of importance in the digital forensics community. Law Enforcement Agencies (LEAs) have a special interest in the developments in this field as the knowledge of the source camera, extracted from the digital media, can provide additional intelligence in the fight against child sexual abuse content. Our work is part of the EU-funded 4NSEEK project (https://www.incibe.es/en/european-projects/4nseek), which is aimed at the development of cyber-tools to assist LEAs in identifying the source of illicit content involving minors.

Most of the research in camera identification has been limited to the investigation of digital images generated by a camera device. In contrast, Source Camera Identification (SCI) based on videos has not seen much progress. An application of video forensics can be seen by considering the following real-world scenario.

When LEAs have a warrant to conduct a search in the properties of alleged offenders, for any device that they find with an in-built video camera, they can take multiple random videos, which can then be used as reference/training samples to learn the proposed approach. Therefore, the proposed approach can be reconfigured with every new device that LEAs find. Subsequently, any suspicious video files can be processed by the proposed method to determine if they were captured by one of the known devices or not. An additional follow-up study would be to determine whether two or more videos originated from the same device, without having to know the specific device that was used. This would help LEAs to link multiple cases to the same offender, for instance.

Another interesting application of video forensics is to identify copyright infringements of digital videos. This is relevant in the current scenario where a lot of copyrighted content is freely available on online platforms like YouTube.

A major challenge in SCI is to mitigate the impact of the presence of scene content while extracting camera traces from images or videos. The presence of scene content makes the extraction of camera noise quite difficult as state-of-the-art methods, such as convolutional neural networks (ConvNets), tend to learn the details from the scene rather than the camera noise. To address this problem, Bayer et al. [6] proposed to use a constrained convolutional layer as the first layer of a ConvNet. The constraints imposed on the convolutional filters are aimed at increasing robustness at extracting the camera noise by suppressing the scene-level details. A limitation of their approach is the requirement of the input image to be monochrome or a single-channel image. As most digital media is generated as color or multi-channel, by converting it to monochrome the native information that is present in each individual color channel is lost. To overcome this problem, an extended version of the constrained convolutional layer that can handle multi-channel inputs was proposed in [52]. These methods demonstrate the usefulness of the constrained convolutional layer for shallow ConvNets. In our work, we demonstrate that such constraints are not needed when using deep ConvNets such as MobileNet and ResNet.

The key contributions of this paper are threefold. First, we evaluate ConvNets with increasing levels of sophistication to understand the relation between network complexity and its corresponding performance on source camera identification. Second, we show that neither the constrained convolutional layer nor the residual noise images are necessary when sophisticated ConvNets are employed. Additionally, we mention a few common algorithm choices that could be detrimental to camera identification from videos. Third, we set a benchmark on the VISION data set using 28 camera devices (see Table 1) that can be used to evaluate new methods for SCI on the VISION data set. The selection of these devices ensures that multiple instances of the same camera model are always included in the data set, which allows us to test the performance at the device level. Moreover, to the best of our knowledge, this is the largest subset of devices from the VISION data set used to conduct experiments for SCI using only videos and trained using a single model. Furthermore, we also set a new benchmark result on the QUFVD data set [1] using all the 20 camera devices. Finally, we demonstrate the runtime efficiency of our algorithm during deployment, which becomes necessary to conduct forensic investigations in time-critical situations.

We also share the source code (https://github.com/bgswaroop/scd-videos) for further dissemination of our approach and experiments.

The rest of the paper is organized as follows. The next section gives an account of the state-of-the-art on SCI using videos. In the subsequent section, we describe the proposed technique, followed by the experimental details. A brief discussion of the results is elucidated next, which is finally followed by our conclusions.

Related Works

Source camera identification has become a topic of interest after the widespread popularity of digital cameras [32, 33]. Most of the literature in this area is concerned with the investigation of digital images and not involving videos. With digital videos becoming increasingly accessible and popular due to the advances in camera technology and the internet, it has now become crucial to investigate SCI using videos. We begin by reviewing the relevant literature from the investigation of digital images as these methods are closely related to video-based SCI.

When the same scene is captured by two different digital cameras at the same time and under the same conditions, the final images that are generated are never exactly the same. There are always a few visually noticeable differences, such as in color tones, radial distortions, and image noise, among others. Such variations are perceptible in images generated from different camera models. This is due to the fact that every camera model has a different recipe for image generation, which is commonly referred to as the camera pipeline. A typical camera pipeline consists of optical lenses, an anti-aliasing filter, color filter arrays, an imaging sensor, demosaicing, and post-processing operations, as depicted in Fig. 1. When the light from the scene enters the lenses, it gets processed by a sequence of hardware and software processing steps before the generation of the final digital image. As the implementation of the camera pipeline is distinct for each camera model, the final image generated by each is also unique. Even images captured by camera devices of the same model type, when examined closely, are not exactly the same.


Fig. 1 Video generation pipeline inside digital cameras. The light from the scene is continuously sampled by the digital camera at a pre-determined frequency to generate digital video frames. The camera pipeline typically involves a series of lenses, an anti-aliasing filter, color filter arrays, followed by the imaging sensor. These hardware processing steps are succeeded by a set of software processing steps consisting of demosaicing, video encoding, and other post-processing operations to generate the output digital video.

This subtle variation is due to the unique sensor pattern noise generated by every camera sensor, which makes it possible to identify individual camera devices of the same camera model. In this work, we consider the problem of camera device identification from videos.

Kurosawa et al. [32, 33], in their initial study on sensor noise, observed the noise pattern generated by dark currents and reported that such noise is unique to every camera sensor. Their experiments were conducted on nine camera devices using videos generated by the involved cameras on a flat scene. One hundred video frames were extracted from each video to determine the noise pattern of the dark currents. Their study laid the foundation for further experiments concerning camera identification. That approach [33], however, is limited as it relies on flat video frames and requires access to the physical camera device. To overcome the restriction of using only flat frames, Kharrazi et al. [30] proposed to extract 34 hand-crafted features from images and showed that those features enabled them to identify the source camera device from natural images.

Fig. 2 Noise classification in imaging sensors. The noise generated by an imaging sensor (sensor noise) can be classified into a random component, namely shot noise, and into a deterministic component commonly referred to as (sensor) pattern noise. The pattern noise is further classified into FPN and PRNU based on the incoming light intensity. FPN is generated when the scene is dark and PRNU otherwise.

As shown in Fig. 2, the sensor noise can be categorized into shot noise and pattern noise. Shot noise is a stochastic random component that is present in every image and can be suppressed by frame averaging. The resulting noise which survives frame averaging is defined as the pattern noise [23], which can be further classified into Fixed Pattern Noise (FPN) and Photo-Response Non-Uniformity noise (PRNU). FPN is generated by the dark currents when the sensor is not exposed to any light. PRNU is generated when the sensor is exposed to light and is caused by the different sensitivities of pixels to the incoming light intensity. Unlike earlier methods [32, 33] that rely on FPN, Lukas et al. [38] in their seminal work showed that it is possible to identify source camera devices by extracting PRNU noise from images. They determined the sensor pattern noise by averaging the noise obtained from multiple images using a denoising filter. In particular, they used a wavelet-based denoising filter to compute the noise residuals. The authors show that using such a filter before frame averaging helps in suppressing scene content. This idea is similar to more recent works [6, 52] that allow the neural networks to learn constrained filters for scene suppression.

Other approaches have been presented that target to enhance the sensor pattern noise [34, 36, 41]. Such methods are based on the idea proposed in [38], where a handcrafted denoising filter was used to extract camera noise. All the methods mentioned thus far target the sensor noise, or the noise generated by the imaging sensor. As shown in Fig. 1, a typical camera pipeline also consists of other processing steps. Therefore, methods to identify the source camera based on the artifacts resulting from the CFA and demosaicing operations [8, 13, 14, 49], and image compression [2, 15] were proposed. Those techniques, which target to extract the noise from a single processing step, undesirably miss out on the noise patterns generated by a combination of such steps.

In recent years, deep learning-based approaches were proposed for SCI that target a specific step in the camera pipeline. Examples include detection of forgeries involving image in-painting [54, 55], image resizing [12], median filtering [29, 51], and identifying JPEG compression artifacts [3, 4], and so on. Such approaches were shown to be more robust than methods which rely on computing handcrafted features. Though the above methods are based on deep learning, they may not cater to the camera noise generated from the remaining processing steps. Furthermore, deep learning based methods were also proposed [5, 9, 10, 42] to extract camera features accounting for the noise generated from each of the processing steps in a camera pipeline. This unified approach to SCI accounts for all sources of camera noise and is also used in our work.


Fig. 3 An overview of the proposed pipeline for source camera identification from videos. It consists of three major steps, namely frame extractor, ConvNet-based classifier, and finally an aggregation step to determine video-level predictions.

Although it is not within the scope of this work, it is worth mentioning that the techniques used for SCI can also be used to address related forensic tasks such as image forgery detection [11, 16, 35], for identifying image manipulation to help fight against fabricated evidence and fake news. Image forgery detection can be performed as a first step before performing camera identification. Such a combination can make SCI more robust in practice.

Convolutional neural networks (ConvNets) have the ability to learn high-level scene details [31]; however, in SCI we need the ConvNets to ignore the high-level scene features and learn to extract features from the camera noise. Chen et al. [14] noticed this behavior when they trained ConvNets for detecting median filtering forgeries. To suppress scene content, they used residual images of median filtering to train their ConvNets. This new approach of suppressing scene content resulted in improved accuracy, which led to the development of two related methods for scene suppression. Firstly, methods that use predefined high-pass filtering to suppress high-level scene content [45]. Secondly, methods that use constrained convolutions [5, 52], which consist of trainable filter parameters.

Most of the methods for SCI address the problem using images. Very few methods have been proposed that use videos for identification, and one such method that is closely related to our work is that of Holster et al. [24]. In their work they trained a ConvNet by discarding the constrained convolutional layer proposed by [6]. That layer was removed as it was not compatible to handle color images or multi-channel inputs. Derrick et al. [52] proposed an extended version of the constrained convolution layer and depict the scenarios in which the inclusion of a multi-channel constrained layer could be beneficial. All these works show the effectiveness of the constrained convolutional layer when used with shallow ConvNets. In our work, we show empirical evidence that such layers are unnecessary with deep ConvNets, such as MobileNet and ResNet.

Dal et al. [17] proposed a multi-modal ConvNet based approach for camera model identification from video sequences. They combine the visual and the audio signals from a video and show that such an approach would result in a more reliable identification. In our work, we focus on SCI based only on visual content, also because in practice audio content can easily be replaced or manipulated. In the visual content based ConvNet, Dal et al. [17] pick 50 frames equally spaced in time, and extract 10 patches of size 256 × 256 pixels followed by patch standardization as part of their pre-processing. Furthermore, a pre-trained EfficientNet [50] was employed for classification.

Methodology

In Fig. 3, we illustrate a high-level overview of the proposed methodology. The input videos are processed in three stages. First, the frames are extracted from a video and are then pre-processed. Second, a frame-level classifier is used to predict the class for each frame. Finally, the frame predictions are aggregated to determine the video-level prediction for the given video. We describe each of these stages in detail in the following sections.

Frame Extraction

As the duration of each video is not fixed, we attempt to extract a fixed number of I-frames from each video that are equally spaced in time. This ensures that every video is equally represented in the frame-level data set irrespective of its duration. Our approach of frame selection is different from that in [48], where the first N frames were used to represent a video. As the consecutive frames in a video share temporal content, the scene content and the camera noise will be highly correlated. Choosing the N consecutive frames strategy is favorable when the scene content is relatively homogeneous and is disadvantageous otherwise. Holster et al. [24] used a frame selection strategy based on the frame types, and extracted an equal number of frames from both categories. In our experiments, we applied and evaluated two different strategies for frame selection. With the first one, we select up to N = 50 I-frames equally spaced in time. I-frames are intra-coded frames and have better forensic traces as they do not have any temporal dependency with adjacent frames. It is, however, not always possible to extract 50 I-frames; therefore, in our second strategy we extract N = 50 frames equally spaced in time.


Furthermore, we investigate the impact of using a different number of frames per video on camera identification during test time.

We take a center-crop of all the extracted frames and normalize the resulting images to the range [0, 1] by dividing by 255. The dimensions of the center-crop are set to 480 × 800 and were determined based on the dimensions of the video with the smallest resolution in the VISION data set.
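
The frame selection and pre-processing steps described above can be summarised with a short sketch. The following is a minimal illustration, not the authors' released code (linked in the Introduction); it assumes ffprobe and OpenCV are available, approximates "equally spaced in time" by spacing picks over the list of I-frame indices, and uses the crop size and N = 50 stated above.

    import subprocess
    import numpy as np
    import cv2

    def select_iframe_indices(video_path, n=50):
        """List the indices of all I-frames with ffprobe and pick up to n of them,
        spread as evenly as possible over the video."""
        lines = subprocess.run(
            ["ffprobe", "-v", "error", "-select_streams", "v:0",
             "-show_entries", "frame=pict_type", "-of", "csv=p=0", video_path],
            capture_output=True, text=True, check=True).stdout.splitlines()
        iframes = [i for i, t in enumerate(lines) if t.startswith("I")]
        if len(iframes) <= n:
            return iframes                      # fewer than n I-frames: take all
        picks = np.linspace(0, len(iframes) - 1, n).round().astype(int)
        return [iframes[p] for p in picks]

    def extract_frames(video_path, n=50, crop_hw=(480, 800)):
        """Decode the selected frames, take a centre crop and scale to [0, 1]."""
        cap = cv2.VideoCapture(video_path)
        ch, cw = crop_hw
        frames = []
        for idx in select_iframe_indices(video_path, n):
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
            ok, frame = cap.read()
            if not ok:
                continue
            h, w = frame.shape[:2]
            y0, x0 = (h - ch) // 2, (w - cw) // 2
            crop = frame[y0:y0 + ch, x0:x0 + cw]
            frames.append(crop.astype(np.float32) / 255.0)
        cap.release()
        return np.stack(frames)
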
Convolutional Neural Networks

We investigate the performance of the proposed approach by considering two different ConvNet architectures, namely MobileNet-v3-small [25] and ResNet50 [21]. Both architectures are deep and are sophisticated in terms of network design when compared to MISLNet [6]. We use the pre-trained versions of these ConvNets in our training, where the pre-training was done on ImageNet. In Sect. 5.3 we show empirical evidence of the benefit of pre-training.

MobileNet

MobileNets [25, 26, 47] are a family of ConvNet architectures that were designed to be deployed to mobile platforms and embedded systems. Though those networks have a low memory footprint and low latency, they are sophisticated and achieve comparable results to the state-of-the-art on the benchmark ImageNet data set [31].

In MobileNet-v1 [26], depth-wise separable convolutions were used instead of conventional convolutional filters. Such a combination of depth-wise and point-wise convolutions reduces the number of parameters while retaining the representative power of ConvNets. A reduction in parameters, in general, allows the network to generalize better to unseen examples, as it becomes less specific.

In MobileNet-v2 [47], the depth-wise separable convolutions were used in conjunction with skip connections [21] between the bottleneck layers along with linear expansion layers. The architecture design was further enhanced to reduce latency and improve accuracy in MobileNet-v3 [25]. The performance gain was achieved using squeeze-and-excitation blocks [27] along with modified swish non-linearities [46]. In our work, we use the MobileNet-v3-small, as it can be easily deployed in systems with limited resources and offers a high runtime efficiency. Furthermore, we change the input dimensions to 480 × 800 × 3 pixels and the output layer to 28 units, which represent the total number of classes. For further architectural details, we refer the reader to [25].

ResNet

ResNets [21] are another popular family of deep learning architectures that incorporate skip-connections between convolutional layers. Such connections enable identity mapping and allow the gradients to freely propagate backwards, thereby making the network less prone to the problem of vanishing gradients in a traditional deep ConvNet. These networks were further enhanced with a new residual unit in [22]. In our experiments, we use ResNet50 based on the architecture proposed in [22]. We modify the dimensions of the input layer to be 480 × 800 × 3 pixels and the output layer to 28 units representing the total number of devices in the VISION data set.
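
To illustrate the input and output changes described above, the following sketch builds the frame-level classifier in Keras. It is an approximation of the described setup rather than the authors' exact code: the dropout rate and the use of global average pooling are assumptions, and newer Keras versions bundle an input-rescaling layer inside MobileNetV3, so the pre-processing must be kept consistent with the frame normalisation.

    import tensorflow as tf

    def build_frame_classifier(num_devices=28, input_shape=(480, 800, 3)):
        """Frame-level classifier: ImageNet pre-trained MobileNet-v3-small with
        a 480x800x3 input and a new 28-way softmax head (Glorot-initialised by
        default in Keras Dense layers)."""
        backbone = tf.keras.applications.MobileNetV3Small(
            input_shape=input_shape, include_top=False,
            weights="imagenet", pooling="avg")
        x = tf.keras.layers.Dropout(0.2)(backbone.output)   # rate is an assumption
        outputs = tf.keras.layers.Dense(num_devices, activation="softmax")(x)
        return tf.keras.Model(backbone.input, outputs)

    model = build_frame_classifier()

The same recipe applies to the ResNet50 variant by swapping the backbone for tf.keras.applications.ResNet50.
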
ConstrainedNet

In contrast to the sophisticated ConvNets described earlier, simple ConvNets with few layers are not robust enough to extract the camera features from natural images. The primary reason for such a lack of robustness is the presence of scene content, which obstructs the extraction of noise from images. To overcome this issue, Bayer et al. [6] proposed a constrained convolutional layer, which aims to suppress the scene content. Unlike traditional approaches [36, 38] that use a pre-determined denoising filter to suppress scene content and extract camera noise, a constrained convolutional layer can be trained to suppress scene details. This layer was originally proposed for monochrome images and later extended to process color images [6]. In this work, we explore if an augmentation with a constrained convolutional layer is beneficial to sophisticated networks such as MobileNet and ResNet. The details of these filters are briefly described below.

Relationships exist between neighboring pixels which are independent of the scene content. Such an affinity is caused by the camera noise and can be learned by jointly suppressing the scene details and learning the relationship between each pixel and its neighbors [6]. Thus, the constrained convolutional filters are restricted to learn the extraction of image noise and are not allowed to evolve freely. Essentially, these convolutional filters act as denoising filters, where for each pixel the corresponding output is obtained by subtracting the weighted sum of its neighboring pixels from itself.

Formally, such errors can be determined by placing constraints on each of the K convolutional filters with weights w(k), as follows:

    w(k)(0, 0) = −1                                                      (1)

    Σ_{m,n≠0} w(k)(m, n) = 1,                                            (2)

where w(k)(0, 0) corresponds to the center value of the filter. These constraints are enforced manually after each weight update step during the backpropagation.

The above formulation of the constrained convolutional layer was proposed by Bayer et al. [6] and was designed to process only grayscale images.


This was extended by Derrick et al. [52] to process color inputs by imposing filter constraints on all three color channels j, as shown below:

    w_j(k)(0, 0) = −1                                                    (3)

    Σ_{m,n≠0} w_j(k)(m, n) = 1,                                          (4)

where j ∈ {1, 2, 3}.
where j ∈ {1, 2, 3}. was based on the following criteria:

Video‑Level Predictions – The camera devices must contain at least 18 videos in


their native resolution encompassing the three scenarios,
The source camera device of a video is predicted as follows. namely flat, indoor, and outdoor.
First, a set of N frames are extracted from a given video – Furthermore, all the native videos should have been
as elucidated in Sect. 3.1. Second, the trained ConvNet is shared via both YouTube and WhatsApp.
used to classify the frames which results in the source video – Finally, all devices that belong to a camera model with
device predictions for each frame. Finally, all the predictions multiple devices are included too. If this criterium is sat-
belonging to a single video are compiled together by means isfied then the previous two criteria do not need to be
of a majority vote, to determine the predicted source camera satisfied.
device. In Sect. 4, we show the efficacy of this step which
significantly improves our results. The first two criteria ensure that devices with few videos
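
A sketch of this aggregation step is given below; the tie-breaking rule (lowest class index wins) is an arbitrary choice, since the text only specifies a simple majority vote.

    import numpy as np

    def predict_video(frame_probs):
        """frame_probs: (num_frames, num_devices) array of per-frame softmax
        outputs from the frame-level ConvNet. Returns the device index
        predicted for the whole video by a majority vote over frame decisions."""
        frame_labels = frame_probs.argmax(axis=1)
        votes = np.bincount(frame_labels, minlength=frame_probs.shape[1])
        return int(votes.argmax())
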
Experiments

Data Set—VISION

We use the publicly available VISION data set [48], which consists of images and videos captured from a diverse set of scenes and imaging conditions. The data set comprises a total of 35 camera devices representing 29 camera models and 11 camera brands. Specifically, the data set consists of 6 camera models with multiple instances per model, which enables us to investigate the performance of the proposed approach at the device level.

The data set consists of 648 native videos, in that they have not been modified after their generation by the camera. The native videos were shared via social media platforms including YouTube and WhatsApp, and the corresponding social media versions of the native videos are also available in the data set. Of the 648 native videos, 644 were shared via YouTube and 622 via WhatsApp. While both social media platforms compress the native videos, videos shared via YouTube maintain their original resolutions whereas WhatsApp re-scales the video to 480 × 848 pixels. Furthermore, the videos captured from each camera are categorized into three different scenarios—flat, indoor, and outdoor. The flat videos have relatively homogeneous scene content, such as skies and walls. The indoor scenario refers to videos captured inside indoor locations, such as offices and homes. Finally, the outdoor scenario contains videos of gardens and streets. With such diversity in the scene content, the VISION data set acts as a suitable benchmark to evaluate source camera identification.

Camera Device Selection Procedure

Among the 35 camera devices of the VISION data set, 28 devices were selected for our experiments. This selection was based on the following criteria:

– The camera devices must contain at least 18 videos in their native resolution encompassing the three scenarios, namely flat, indoor, and outdoor.
– Furthermore, all the native videos should have been shared via both YouTube and WhatsApp.
– Finally, all devices that belong to a camera model with multiple devices are included too. If this criterion is satisfied then the previous two criteria do not need to be satisfied.

The first two criteria ensure that devices with few videos are excluded. Additionally, this allows us to test the performance of device identification when the videos are subjected to compression. An exception is made in the final criterion, where multiple devices from the same make and model are always included. This enables us to test camera identification at the device level. By following these criteria, 29 devices were shortlisted. Furthermore, as suggested in [48], we exclude the Asus Zenphone 2 Laser camera, resulting in 28 camera devices (shown in Table 1). Having selected the camera devices, in the following section we describe the process of creating a balanced training-test set.

Data Set Balancing

The constraint on the number of videos per device for its selection is motivated from the view of creating balanced training and test sets. This is important to keep the data distribution similar for both training and test to avoid any bias towards majorly represented classes. First, we determined the lowest number of native videos present per camera device, which turned out to be 13. These videos were split between a training and a test set such that 7 videos were present in the training set and the remaining 6 in the test set. This split further ensures that at least 2 videos from each of the three scenarios (flat, indoor, and outdoor) are present in both train and test (with the exception of D02, where only 1 native indoor video was available to be included in the test set), thereby ensuring that all scenarios are equally represented in the splits. Subsequently, the training and the test splits were augmented with the social media versions (WhatsApp and YouTube) of the corresponding native videos.

Table 1 List of 28 camera devices considered for our experiments from a total of 35 devices from the VISION data set

 1  D01_Samsung_GalaxyS3Mini      15  D16_Huawei_P9Lite
 2  D02_Apple_iPhone4s            16  D18_Apple_iPhone5c
 3  D03_Huawei_P9                 17  D19_Apple_iPhone6Plus
 4  D04_LG_D290                   18  D24_Xiaomi_RedmiNote3
 5  D05_Apple_iPhone5c            19  D25_OnePlus_A3000
 6  D06_Apple_iPhone6             20  D26_Samsung_GalaxyS3Mini
 7  D07_Lenovo_P70A               21  D27_Samsung_GalaxyS5
 8  D08_Samsung_GalaxyTab3        22  D28_Huawei_P8
 9  D09_Apple_iPhone4             23  D29_Apple_iPhone5
10  D10_Apple_iPhone4s            24  D30_Huawei_Honor5c
11  D11_Samsung_GalaxyS3          25  D31_Samsung_GalaxyS4Mini
12  D12_Sony_XperiaZ1Compact      26  D32_OnePlus_A3003
13  D14_Apple_iPhone5c            27  D33_Huawei_Ascend
14  D15_Apple_iPhone6             28  D34_Apple_iPhone5

Thus, the three versions of the same video content occur in either of the sets but not in both. This ensures that the evaluation is not influenced by the inadvertent classification of the scene content. This scheme resulted in a total of 588 videos for training and 502 for the test set.

To facilitate model selection, we created a validation set consisting of 350 videos which are systematically selected such that the videos represent all the scenarios and the compression types as much as possible (subject to availability of videos). As the VISION data set is not sufficiently large, the validation set could not be fully balanced. It contains a minor data imbalance, which we believe is acceptable for model selection purposes. This resulted in a data set split of 65:35 for (training + validation):test, respectively.

Data Set—QUFVD

We also conduct experiments on the newly available Qatar University Forensic Video Database (QUFVD) data set [1]. The data set consists of 20 camera devices such that there are 5 brands, 2 camera models for each brand, and 2 identical devices for each camera model. Although the data set does not have the corresponding WhatsApp and YouTube social media versions, this is an interesting data set to test our approach at the device level. The scene content of all the images is natural, which helps to simulate real-world scenarios. In comparison to the VISION data set, QUFVD contains more recent smartphones. Furthermore, Akbari et al. [1] explicitly divide the data set into train, validation, and test sets. This allows for a fair evaluation of SCI methods on the QUFVD data set.

We conducted our experiments on all the extracted I-frames (these were already provided in the data set) without performing any frame selection. Since the videos in the data set are only a few seconds long, most of the videos have about 11 to 16 I-frames. Overall, there are 192, 42, and 60 videos for the train, validation, and test sets, respectively, for each camera device, and this corresponds to an 80:20 split.

ConvNet Training

We consider multiple ConvNet architectures as described in Sect. 3.2 and evaluate them for camera identification. To perform a fair evaluation between the architectures, we set the same hyperparameters for the learning algorithms, as much as possible. A couple of differences, however, were required and are specified below.

The optimization problem is set up to minimize the categorical cross-entropy loss. We use the stochastic gradient descent (SGD) optimizer with an initial learning rate α (more details below) and momentum of 0.95. A global l2-regularization was included in the SGD optimizer with a decay factor of 0.0005, and a batch size of 64 and 32 was used for MobileNet and ResNet, respectively. This choice of batch size was based on the limitation of the GPU memory. The ConvNet architectures also include batch-normalization and dropout layers that aid in model generalization.

Two different sets of hyperparameters were used for the learning of the ConvNets. First, we consider the hyperparameters for the experiments not involving constrained convolutions. We set the initial learning rate α = 0.1 and employ a cosine learning rate decay scheme [37] with three warm-up epochs. Overall, we train the system for a total of 20 epochs. The learning rate updates were performed at the end of each batch to ensure a smooth warm-up and decay of the learning rate. The best model was selected based on the epoch which resulted in the maximum video-level validation accuracy. In case of a tie between epochs, we select the one with the least validation loss.

A different setting of hyperparameters was used for experiments involving constrained convolutions as proposed by Bayer et al. [6]. We begin with a small learning rate α = 0.001 and train for a total of 60 epochs. A step-wise learning rate decay scheme was employed to decay the learning rate by a factor of 2 after every 6 epochs.
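
The optimizer configuration above can be sketched as follows in TensorFlow/Keras (the framework is chosen here for illustration). The per-batch warm-up plus cosine decay is written out explicitly, and steps_per_epoch is a placeholder that depends on the size of the frame-level training set.

    import math
    import tensorflow as tf

    class WarmUpCosine(tf.keras.optimizers.schedules.LearningRateSchedule):
        """Linear warm-up followed by cosine decay, updated every batch."""
        def __init__(self, base_lr, warmup_steps, total_steps):
            super().__init__()
            self.base_lr = base_lr
            self.warmup_steps = float(warmup_steps)
            self.total_steps = float(total_steps)

        def __call__(self, step):
            step = tf.cast(step, tf.float32)
            warmup = self.base_lr * step / tf.maximum(1.0, self.warmup_steps)
            progress = (step - self.warmup_steps) / tf.maximum(
                1.0, self.total_steps - self.warmup_steps)
            cosine = 0.5 * self.base_lr * (1.0 + tf.cos(math.pi * progress))
            return tf.where(step < self.warmup_steps, warmup, cosine)

    steps_per_epoch = 1000      # placeholder: depends on the frame-level data set
    schedule = WarmUpCosine(base_lr=0.1,
                            warmup_steps=3 * steps_per_epoch,   # 3 warm-up epochs
                            total_steps=20 * steps_per_epoch)   # 20 epochs total
    optimizer = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.95)
    # The global l2 penalty (decay factor 0.0005) can be added via kernel
    # regularizers or decoupled weight decay, depending on the implementation.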


Fig. 4 Epoch-wise accuracy and loss on the validation set for the VISION and the QUFVD data sets. The dot markers indicate the epoch at which the overall best validation accuracy is achieved for the ConvNets.

The experiments were conducted on NVIDIA V100 GPUs with 32 GB of video memory. Fig. 4 depicts the convergence plots of the trained models.

ConvNet Evaluation

The ConvNets were evaluated on the test set consisting of 504 videos from 28 camera devices. The evaluation was performed in two phases. We used N = 50 I-frames per video to train and test the ConvNet models in the first phase. As described in Sect. 3, 50 frames were extracted from each test video, resulting in a total of 25,200 test frames. In the second phase, we repeated the experiments by not making a selection based on frame type and selecting N = 50 frames that are equally spaced in time.

Fifty I-Frames per Video

The training was performed using up to 50 I-frames per video that are equally spaced in time. Note that, when a video has fewer than 50 I-frames, we considered all the available I-frames for training/evaluation. The trained ConvNets were used to determine the class of each of the 50 I-frames per test video v. These predictions were then aggregated using a majority vote to predict the source camera device for the video v. Having determined predictions for each of the 504 test videos, the overall video-level classification accuracy was determined using:

    Accuracy = (# of correct predictions) / (total # of predictions)          (5)

We further investigated the role of scene suppression techniques as a pre-processing step for the ConvNets. To test this scenario, we trained the ConvNets with a multi-channel constrained convolutional layer as proposed in [52]. In such experiments, the ConvNets were augmented with a constrained convolutional layer. Another popular technique for scene suppression is the extraction of PRNU noise, which is built on the wavelet-based denoising filter proposed in [19, 38]. This has achieved state-of-the-art results on camera identification based on images and is used in several research works. Therefore, we also experiment with residual PRNU noise to verify its effectiveness for videos. PRNU noise was extracted from each color channel of the input video frame, and the resulting 3-channel PRNU noise inputs were used to train and test the ConvNets. The results of these experiments are reported in Table 2.
formed in two phases. We used N = 50 I-frames per video On comparing the overall accuracy of all the experiments
to train and test the ConvNet models in the first phase. As with 50 I-frames per video, we notice that the unconstrained
described in Sect. 3, 50 frames were extracted from each MobileNet achieves the best accuracy of 72.47 ± 1.1. This
test video resulting in a total of 25, 200 test frames. In the result was obtained after running the same experiment for
second phase, we repeated the experiments by not making a 5 times and computing the overall average. The ResNet
selection based on frame type and selecting N = 50 frames achieves an average accuracy of 67.81 ± 0.5. On comparing
that are equally spaced in time. the accuracy of the unconstrained networks to their con-
strained counterparts [52], we notice that the unconstrained
Fifty I‑Frames per Video networks perform better by 19.28 and 13.63 percentage
points for MobileNet and ResNet, respectively. Moreover, we
The training was performed using up to 50 I-frames per observe that the traditional technique of scene suppression
video that are equally spaced in time. Note that, when a

Table 2  Classification accuracy Model N Constraint type Overall Flat Indoor Outdoor WA YT NA
of the proposed methods on the
VISION data set ResNet50 50 None 67.81 75.96 𝟔𝟐.𝟔𝟎 64.24 67.76 66.94 68.82
ResNet50 50 Conv [52] 54.18 61.80 50.70 40.70 51.20 50.60 60.80
MobileNet 50 None 72.47 𝟖𝟐.𝟑𝟎 60.80 𝟕𝟐.𝟒𝟔 𝟕𝟒.𝟕𝟔 69.96 72.66
MobileNet 50 Conv [52] 53.19 60.70 45.20 52.50 52.40 48.20 59.00
MobileNet 50 PRNU [19] 61.75 69.90 54.10 60.10 56.50 62.50 66.30
MobileNet all None 𝟕𝟐.𝟕𝟓 79.98 𝟔𝟐.𝟔𝟎 74.00 74.64 𝟕𝟎.𝟔𝟖 𝟕𝟐.𝟗𝟎

The third column indicates the type of pre-processing or constraints used on the input video frames. The
table presents the overall test accuracy along with the test accuracy for 3 scenarios (flat, indoor, and out-
door) and the 3 compression types (native (NA), WhatsApp (WA), and YouTube (YT)) for all the ConvNets.
Furthermore, these results correspond to experiments with N I-frames per video for both training and test-
ing. Best accuracy across experiments for each scenario and compression type are boldfaced


Table 3 Test accuracy of MobileNet when experimenting with N I-frames per video for both the training and test sets on the VISION data set

N     R1     R2     R3     R4     R5     Average
50    72.7   73.5   73.3   71.9   70.9   72.47 ± 1.07
all   71.7   71.9   74.5   72.9   72.7   72.75 ± 1.10

The columns R1–R5 indicate the overall accuracy for each of the five runs and the final column shows the respective means and standard deviations.

Fig. 5 The confusion matrices obtained by evaluating the MobileNet on the VISION test data set using all I-frames. The overall results along with the outcomes specific to each of the three scenarios are depicted. The class labels correspond to the sequence of 28 devices listed in Table 1.

Furthermore, on examining the results per scenario, we notice that the unconstrained MobileNet and ResNet consistently outperform all other variants. It is interesting to see that the PRNU-based MobileNet comes close in terms of accuracy for the native and YouTube scenarios, while the performance degrades for WhatsApp videos. This shows that the traditional PRNU-based denoising [19] is affected by WhatsApp compression. When considering 50 I-frames per video, the unconstrained ResNet performs slightly better than MobileNet for the indoor scenario, while the MobileNet performs significantly better in the other scenarios.

Instead of limiting to only 50 I-frames per video, we also experimented with all I-frames per video. We considered it beneficial as we have more data; however, we were also aware that it could cause data imbalance. In the videos considered for our experiments, we observed that the number of I-frames per video varies between 8 and 230. The model performs well even in this scenario, and the MobileNet achieves an overall accuracy of 72.75%. We further noticed that the results tend to vary between runs; therefore, we report the average accuracy across 5 runs for each experiment.

The results, presented in Table 3, indicate that the model is sensitive to a few random components in the network. Firstly, since we are starting from a pre-trained network (model weights learnt on ImageNet), a source of randomness is present in the initial weights of the output layer. In our experiments on the VISION data set, the output layer contains 28 units in contrast to 1000 units for ImageNet. The weights of the output layer are initialized using Glorot initialization [18]. Second, the dropout layers can also play a role in contributing towards this randomness. To account for this sensitivity, we repeat the experiments 5 times. We believe the deviations between the runs are pronounced due to the small test set that we have, rather than anything to do with the methodology. The best overall accuracy of 74.5 was achieved by the MobileNet when all I-frames were considered in the experiment. In Fig. 5 we illustrate the confusion matrices obtained with the best performing MobileNet.

Comparison with the State-of-the-art

Though the VISION data set consists of both images and videos, most of the works [40] use only images for their analysis. A few works have also been conducted involving videos [28, 39]. Mandelli et al. [39] estimate the reference PRNU noise for a video based on 50 frames per video extracted from the videos of the flat-still scenario. This approach limits their applicability when flat videos are unavailable for estimating the reference noise pattern K_v. Iuliani et al. [28] proposed a hybrid solution using 100 images (generated by the same device) to estimate the reference pattern noise K_iv for stabilized videos, and 100 video frames for non-stabilized videos. This approach is again limited by the availability of images from the same device. These works, moreover, require prior knowledge of whether a video is stabilized. In practice, the overall classification accuracy of these methods would therefore be limited by the accuracy of determining the presence of video stabilization.


Table 4 Comparison with Akbari et al. [1] on the QUFVD data set

Model                     Constraint type   Overall accuracy
[1] MISLNet - grayscale   Conv [6]          59.60
[1] MISLNet - color       Conv [52]         51.20
Ours: MobileNet           None              71.75
Ours: MobileNet           Conv [52]         70.50

All results are reported at the device level.

Fig. 6 The confusion matrix obtained by evaluating the MobileNet on the QUFVD test data set using all I-frames. The overall accuracy obtained is 71.75%.

The recent work of Cortivo et al. [17] included experiments at the camera model level with 25 different cameras, trained with only indoor and outdoor samples, and used a data set split of 80:20 for train+val:test. In our experiments, we considered experiments at the device level using 28 devices, used all indoor, outdoor, and flat scenarios, and applied a data set split of 65:35. Cortivo et al. [17] trained three different models, one for each compression type, using visual content. In our experiments we combine all compression types and train a single model. This is more practical for forensic investigators when they are unaware of the exact compression type of a given video. These design differences make it unsuitable to have a direct comparison between the two approaches.

To the best of our knowledge, we are the first ones to perform camera identification using videos based on 28 devices (listed in Table 1) from the VISION data set. Furthermore, unlike [28, 39], we ensure that all devices from the same camera model are always included in our data set to test the performance at the device level.

We also compare the results obtained using our methodology on the QUFVD data set. We conduct two experiments using MobileNet with and without the constrained convolution layer. The results of these experiments are reported in Table 4. The results indicate that using a pre-trained MobileNet to classify I-frames without any scene-suppression strategy yields better results. Furthermore, we achieve the best result on the QUFVD data set for SCI using all the 20 camera devices. Notable is the fact that for the QUFVD data set the performance of the constrained network is almost on par with that of the unconstrained network. We present the confusion matrix obtained with the unconstrained MobileNet in Fig. 6. It can be seen that most of the mispredictions are between the devices of the same model.

Discussion

Video Compression

It is very common for videos to be shared via social media platforms. This leads to video encoding and compression as per the policy of these platforms. As the VISION data set also includes videos from YouTube and WhatsApp, we investigate the impact of these compressions on camera identification. The native videos in the data set were shared on YouTube and WhatsApp, and both versions were included in the training set during model learning. We independently evaluate the performance on the three compression types—native, WhatsApp, and YouTube video versions—and report the results in Table 2.

As shown in Table 2, the unconstrained ConvNets perform significantly better when compared to their constrained counterparts. Furthermore, the constrained ConvNets achieve higher accuracy on the native scenarios when compared to WhatsApp and YouTube. Since the native videos contain the unaltered sensor pattern noise, we expect the native video versions to perform better than their compressed counterparts. The results indicate that the features extracted from the sensor pattern noise, when encoded into the YouTube and WhatsApp versions, still retain most of the camera signatures, especially for the unconstrained ConvNets. It is interesting to note that for the unconstrained MobileNet, the WhatsApp videos perform slightly better than the native versions. These are promising results and will enable forensic investigators to gain intelligence even from the compressed versions. Note that the sensor pattern noise is partly modified even in the native version, as the videos are always generated and stored in compressed formats to save storage.

Number of Frames per Video

In most scenarios, the duration of videos is longer than 10 seconds. Assuming a most common video capture rate of 30 frames per second, we can expect at least 300 frames per video. There could be a few scenarios where the test videos can be extremely short.


Table 5 Classification accuracy for various learning strategies on the VISION data set

Network    FT    FS       Overall  Flat    Indoor  Outdoor  WA      YT      NA
ResNet50   No    I-frame  68.13    73.40   59.60   69.90    70.20   66.70   67.50
ResNet50   Yes   I-frame  68.53    76.30   61.00   67.20    68.50   67.90   69.30
MobileNet  No    I-frame  69.52    83.20   61.00   63.40    69.00   68.50   71.10
MobileNet  Yes   I-frame  73.51    81.50   61.60   75.40    77.40   69.00   74.10
MobileNet  No    Any      67.13    80.30   59.60   60.70    68.50   66.10   66.90
MobileNet  Yes   Any      72.51    80.90   59.60   74.90    72.60   70.20   74.70

These include fine-tuning (FT) and frame selection (FS). The results shown in this table correspond to the model which obtained the best overall accuracy across all the runs.

Table 6 Test accuracy of MobileNet with different number of frames per video (fpv) on the VISION data set

# fpv   Overall  Flat   Indoor  Outdoor
1       70.32    72.3   59.6    77.0
5       72.31    80.3   56.8    77.0
10      72.91    80.9   59.6    76.0
20      73.71    82.1   60.3    76.5
50      72.51    80.9   59.6    74.9
100     73.31    81.5   58.9    74.3
200     71.91    80.9   58.9    73.8
400     72.11    80.0   58.2    74.9

Table 7 Test accuracy of MobileNet with different number of I-frames per video (I-fpv) on the VISION data set

# I-fpv   Overall  Flat   Indoor  Outdoor
1         69.12    71.1   57.5    76.5
5         72.31    79.8   59.6    75.4
30        74.10    82.1   62.3    76.0
50        73.51    81.5   61.6    75.4
100       73.71    82.1   61.6    75.4
all       73.71    82.1   61.6    75.4

Since videos may not have the required number of I-frames, we therefore test with up to the specified number of I-fpv.

To account for this scenario, we train ConvNets with a different strategy for frame selection. Instead of relying on I-frames, we now extract 50 frames that are equally spaced in time. This strategy could pick any of the three types of frames (I-, P-, or B-frames). By keeping all learning parameters the same, we train two ConvNets, a MobileNet and a ResNet, and compare their performance with the I-frame counterparts. The results are shown in Table 5. We can see that the I-frame approach is more beneficial than the any-frame approach. Furthermore, the close gap between the results indicates that the any-frame approach could also be used when encountering videos of very short duration during the training phase.

Additionally, at test time, a few videos of extremely short duration may be encountered. We simulate this scheme by testing the performance of our trained model for different numbers of video frames. In particular, we test the performance on 1, 5, 10, 20, 50, 100, 200, and 400 frames per video that are equally spaced in time, the results of which are presented in Table 6.

As shown in Table 6, MobileNet achieves its best performance when evaluated with 20 frames per video which are equally spaced in time. The difference in overall accuracy between 1 frame per video and 400 frames per video is very small, at 1.79 percentage points. Thus, a model trained with 50 frames per video can be expected to perform reasonably well on test videos with very few frames. Interestingly, we can notice that the performance increases when increasing the number of fpv from 1 to 20; however, there is a slight drop in accuracy when considering a large number of frames. The increasing trend in accuracy up to 20 fpv can be attributed to the role of the majority vote. When considering a large number of frames, the majority vote includes frames that are temporally more dense. Thereby, frames that do not result in correct classification could begin to dominate. Thus, using fewer frames that are equally spaced far apart in time is beneficial when performing a simple majority vote.

We conduct a similar experiment using the I-frames, whose results are presented in Table 7. The results indicate that even with a very small number of test I-frames the model achieves a high accuracy. Moreover, as the videos in the VISION data set are of short duration, the number of I-frames is also limited and, therefore, even though we attempt to extract a higher number of I-frames, the accuracy remains the same.

Pre-training ConvNets

In several deep learning tasks based on images, using a pre-trained network improves the classification accuracy of the models. In our experiments, we chose MobileNet and ResNet that are pre-trained on ImageNet. Note that ImageNet is a large-scale object detection data set, and the networks pre-trained on ImageNet generalize to learn the high-level scene details for object recognition.


In our setup, we require the ConvNets to extract the low-level noise, and therefore it is counter-intuitive to use a pre-trained network that works well to extract high-level scene details. Our experiments show that it is still beneficial to start training from such a pre-trained network rather than training from randomly initialized model weights. Table 5 summarizes these experiments. It can be seen that by fine-tuning, the overall accuracy for ResNet improves marginally. For MobileNet, however, the overall accuracy improves by 3.99 and 5.38 percentage points for the I-frame and the any-frame approach, respectively.

Camera Model Identification

We test the performance of the trained models for camera model identification. To perform camera model identification, we replace the target device predictions with their corresponding camera model for each of the 502 test videos. The 28-device VISION data set, considered for our experiments, consists of 22 camera models. With this evaluation, we observed an accuracy of 74.70%. This is a marginal improvement over the device-level accuracy of 73.51%. A similar evaluation on the QUFVD data set resulted in an improvement in results from 71.75% to 88.5%. In comparison to the VISION data set, this is a significant jump. A network trained directly at the model level would perform much better, which can be studied in future work.
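
The model-level evaluation described above amounts to relabelling predictions before scoring, as in the following sketch; the mapping assumes the device naming convention of Table 1 ("Dxx_Make_Model").

    def model_level_accuracy(pred_devices, true_devices):
        """Re-score device-level video predictions at the camera-model level by
        stripping the device prefix, e.g. 'D05_Apple_iPhone5c' -> 'Apple_iPhone5c'."""
        to_model = lambda name: name.split("_", 1)[1]
        correct = sum(to_model(p) == to_model(t)
                      for p, t in zip(pred_devices, true_devices))
        return correct / len(true_devices)
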
Counter-productive Learning Strategies

So far we have elucidated strategies that worked reasonably well for source camera identification from videos. However, it is equally important to discuss strategies that did not work as expected for our problem, which we believe is very beneficial when considering future work.

Majority Voting Scheme

As we are dealing with multiple frames per video, it is reasonable to explore the role of various weighted majority voting schemes. The results in Table 2 show that the flat scenarios achieve higher performance when compared to indoor and outdoor scenes. It can be reasoned that, since there is no high-level scene content in flat videos, such videos retain a higher degree of camera forensic traces. Based on this idea, we quantitatively measured the degree of uniformity by computing homogeneity, entropy, and energy based on the gray-level co-occurrence matrix [20]. On appropriately weighing frame predictions with these scores, we did not notice any improvement in accuracy. To give more importance to homogeneous frames, we further computed several no-reference image quality metrics such as NIQE, PIQE, and BRISQUE scores [43, 44, 53], but did not notice any patterns that could be exploited for better camera identification.
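
The frame-uniformity measures mentioned above can be computed, for example, with scikit-image. The sketch below scores a single grayscale frame; the distance and angle choices are assumptions, not values taken from the paper.

    import numpy as np
    from skimage.feature import graycomatrix, graycoprops

    def uniformity_scores(gray_frame_u8):
        """Homogeneity, energy and entropy of a uint8 grayscale frame, computed
        from its gray-level co-occurrence matrix (GLCM)."""
        glcm = graycomatrix(gray_frame_u8, distances=[1],
                            angles=[0, np.pi / 2], levels=256,
                            symmetric=True, normed=True)
        homogeneity = graycoprops(glcm, "homogeneity").mean()
        energy = graycoprops(glcm, "energy").mean()
        p = glcm[glcm > 0]                           # aggregate over all GLCM slices
        entropy = float(-(p * np.log2(p)).sum())
        return float(homogeneity), float(energy), entropy
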
Frame Selection Schemes

Encouraged by the higher accuracy for the flat scenario, we further investigated if such a strategy can be used for frame selection. That is, we determine the homogeneity score of each video frame and only use the top N video frames with the highest homogeneity during prediction. This methodology also did not lead to any improvement.

Scene Suppression Schemes

We already presented our experiments using two different scene-suppression strategies, namely PRNU denoising [19] and the other based on constrained convolutions [6, 52]. When the ConvNets are trained with images of varying scene content but having the same noise signature, the networks are forced to look beyond the high-level scene content and extract the camera noise. In a video, this can be achieved by mixing up the sequence of video frames and preparing special inputs for the ConvNets. Since the green color channel contains twice as many forensic traces as the red and blue color channels [7], we randomly sample 3 different video frames and prepare a new input image with the corresponding green color channels. Such an input image would contain different scene content and the same noise signature in the three color channels. With such a strategy we obtained an accuracy of 59.16% using the pre-trained MobileNet.
indoor and outdoor scenes. It can be reasoned that since
there is no high-level scene content in flat videos, such vid- Prediction Time Efficiency
eos retain higher degree of camera forensic traces. Based on
this idea, we quantitatively measured the degree of uniform- As demonstrated earlier, making the ConvNets sophisticated
ity by computing homogeneity, entropy and energy based on resulted in increased accuracy. It is important to ensure that
the gray-level co-occurrence matrix [20]. On appropriately this improvement does not come at a cost of increased com-
weighing frame predictions with these scores, we did not putation time. We tested the performance of the trained Con-
notice any improvement in accuracy. To give more impor- vNets using Intel Xeon CPU E5-2680 and NVIDIA V100
tance to homogeneous frames we further computed several GPU. The performance measurements are listed in Table 8.
no reference image quality metrics such as niqe, piqe, and In contrast, the methods proposed in [39] and [28] take 75
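
For illustration, the sketch below computes the three GLCM-based texture scores per frame and uses them to weight frame-level class probabilities. It assumes scikit-image (graycomatrix/graycoprops) and NumPy; the particular weighting rule is one plausible variant and not necessarily the exact scheme we evaluated.

    # Minimal sketch: GLCM texture scores per frame and score-weighted voting.
    import numpy as np
    from skimage.feature import graycomatrix, graycoprops

    def glcm_scores(gray_frame):
        """Homogeneity, energy and entropy of an 8-bit grayscale frame."""
        glcm = graycomatrix(gray_frame, distances=[1], angles=[0],
                            levels=256, symmetric=True, normed=True)
        homogeneity = graycoprops(glcm, "homogeneity")[0, 0]
        energy = graycoprops(glcm, "energy")[0, 0]
        p = glcm[:, :, 0, 0]
        entropy = -np.sum(p * np.log2(p + 1e-12))
        return homogeneity, energy, entropy

    def weighted_vote(frame_probs, weights):
        """Aggregate per-frame class probabilities with per-frame weights."""
        frame_probs = np.asarray(frame_probs)      # shape: (frames, classes)
        weights = np.asarray(weights, dtype=float)
        weights = weights / weights.sum()
        return int(np.argmax(weights @ frame_probs))

In our experiments, weighting the frame predictions in this manner (e.g., using the homogeneity value as the weight) did not outperform simple, unweighted majority voting.
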

Frame Selection Schemes

Encouraged by the higher accuracy obtained for the flat scenario, we further investigated whether such a strategy can be used for frame selection. That is, we determine the homogeneity score of each video frame and use only the top N video frames with the highest homogeneity during prediction. This methodology also did not lead to any improvement.
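
A minimal sketch of this top-N selection, combined with simple majority voting over the selected frames, is shown below (NumPy; the function arguments are illustrative).

    # Minimal sketch: keep the N most homogeneous frames and take a simple
    # majority vote over their device predictions.
    import numpy as np

    def predict_video(frames, classify_frame, homogeneity_score, n=5):
        """classify_frame returns a class index; homogeneity_score a scalar."""
        scores = np.array([homogeneity_score(f) for f in frames])
        top_idx = np.argsort(scores)[::-1][:n]     # N highest-homogeneity frames
        votes = [classify_frame(frames[i]) for i in top_idx]
        return int(np.bincount(votes).argmax())    # simple majority vote

Restricting the prediction to the most homogeneous frames in this way performed no better than simply taking a majority vote over the available I-frames.
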
Scene Suppression Schemes

We already presented our experiments with two different scene-suppression strategies, namely PRNU denoising [19] and constrained convolutions [6, 52]. When the ConvNets are trained with images of varying scene content but with the same noise signature, the networks are forced to look beyond the high-level scene content and extract the camera noise. In a video, this can be achieved by mixing up the sequence of video frames and preparing special inputs for the ConvNets. Since the green color channel contains twice as many forensic traces as the red and blue color channels [7], we randomly sample three different video frames and prepare a new input image from their green color channels. Such an input image contains different scene content but the same noise signature in its three color channels. With this strategy we obtained an accuracy of 59.16% using the pre-trained MobileNet.
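
The construction of such a composite input can be sketched as follows (NumPy; the sampling and channel ordering shown here reflect one plausible reading of the description above rather than the exact implementation).

    # Minimal sketch: build a 3-channel input from the green channels of
    # three randomly sampled frames of the same video.
    import numpy as np

    def green_channel_composite(frames, rng=None):
        """frames: array of shape (num_frames, H, W, 3) in RGB order."""
        rng = rng or np.random.default_rng()
        idx = rng.choice(len(frames), size=3, replace=False)
        greens = [frames[i][:, :, 1] for i in idx]  # green channel of each frame
        return np.stack(greens, axis=-1)            # shape (H, W, 3)

Each channel of the resulting image carries different scene content but the same sensor noise signature, which is precisely the property that the scene-suppression idea relies on.
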
Prediction Time Efficiency

As demonstrated earlier, increasing the sophistication of the ConvNets resulted in higher accuracy. It is important to ensure that this improvement does not come at the cost of increased computation time. We tested the performance of the trained ConvNets using an Intel Xeon E5-2680 CPU and an NVIDIA V100 GPU; the measurements are listed in Table 8. In contrast, the methods proposed in [39] and [28] take 75 ms and 10 minutes per frame, respectively, during prediction. Note that such measurements can only be compared fairly when they are obtained on the same hardware; we report them as stated in the respective works.

Table 8  Time needed to classify a single video frame for each of the ConvNets used in our experiments (measured in milliseconds)

ConvNet                   Time per frame
MobileNet                 13.92 ms
MobileNet - Constrained   15.62 ms
ResNet                    18.95 ms
ResNet - Constrained      18.72 ms

In addition to the time needed to classify each video frame, there is a fixed cost associated with the extraction of frames. This overhead is shared by all video-based SCI methods. In particular, it takes about 26.63 ms to extract a single I-frame from a 720p video and 42.06 ms from a 1080p video using the ffmpeg library (https://www.ffmpeg.org/). In practice, these numbers translate to a runtime of 202.75 ms to process a 720p video with 5 frames per video, i.e., 5 × (26.63 ms extraction + 13.92 ms MobileNet classification). Therefore, 5 videos can be examined in about a second and around 35.5k videos in about 2 hours. In situations where time is a crucial factor for LEAs to collect intelligence, our method can therefore play an important role.
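
I-frame extraction of this kind can be performed with ffmpeg's select filter; the snippet below shells out to ffmpeg from Python. It is a minimal sketch of one common way to extract only I-frames and not necessarily the exact command used for the reported timings (on recent ffmpeg releases, -vsync vfr is superseded by -fps_mode vfr but is still accepted).

    # Minimal sketch: extract only the I-frames of a video using ffmpeg's
    # select filter (requires the ffmpeg binary on the PATH; paths illustrative).
    import subprocess

    def extract_iframes(video_path, out_pattern="iframe_%04d.png"):
        cmd = [
            "ffmpeg", "-i", video_path,
            "-vf", "select='eq(pict_type,I)'",  # keep only intra-coded frames
            "-vsync", "vfr",                    # one output image per selected frame
            out_pattern,
        ]
        subprocess.run(cmd, check=True)

    extract_iframes("example_720p.mp4")
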
Future Work

As videos carry temporal information, one future direction would be to investigate how adjacent frames can be used for scene suppression and for enhancing the extraction of camera noise.

The ConvNets used in our experiments, MobileNet-v3-small and ResNet50, were both proposed to process images of size 224 × 224 × 3 pixels. The input resolution that we consider is significantly larger, at 480 × 800 × 3 pixels. We speculate that by scaling these networks further we can extract fine-grained details of the camera noise, leading to improved performance. These strategies can be considered in future work.

Another future direction is to leverage the structural content of the video file's meta-data, that is, the individual building blocks and their arrangement that make up the file's meta-data. Analyzing such non-editable meta-data would augment the information that we currently extract from the visual content of video frames. More insights can therefore be obtained from a fusion approach that relies on both the visual content and the structural meta-data.

Conclusion

Our approach for camera device identification was designed with the LEAs' practical requirement of high throughput in mind. In fact, our method requires only a relatively small set of video frames to achieve a high accuracy. Such efficiency allows the search space for cameras/videos to be scaled up. We achieved state-of-the-art performance on the QUFVD data set and also demonstrated the effectiveness of our approach on the VISION data set. For the latter data set, however, a direct comparison was not possible due to several differences in the experimental design.

For sophisticated networks such as MobileNet and ResNet, the unconstrained networks outperform their constrained counterparts. It also turned out that using PRNU noise residuals as a means to suppress scene content does not help either; in fact, this preprocessing step made the unconstrained networks less effective. Finally, we analyzed several strategies to aggregate the decisions obtained from the considered frames, including weighted majority voting, homogeneity-based frame selection, and scene-suppression strategies for sophisticated ConvNets. Simple majority voting yielded the best results.

The best results for the QUFVD and VISION data sets are achieved by fine-tuning a single pre-trained unconstrained MobileNet that takes input from up to 50 I-frames, the outcomes of which are combined by simple majority voting.

Acknowledgements  This work was supported by the framework agreement between the University of León and INCIBE (Spanish National Cybersecurity Institute) under Addendum 01. This research has been partly funded with support from the European Commission under the 4NSEEK project with Grant Agreement 821966. This publication reflects the views only of the author, and the European Commission cannot be held responsible for any use which may be made of the information contained therein. We thank the Center for Information Technology of the University of Groningen for their support and for providing access to the Peregrine high-performance computing cluster.

Declarations

Conflict of interest  The authors declare that they have no conflict of interest.

Open Access  This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

1. Akbari Y, Al-maadeed S, Almaadeed N, Al-ali A, Khelifi F, Lawgaly A, et al. A new forensic video database for source smartphone identification: Description and analysis. IEEE Access (2022)
2. Alles EJ, Geradts ZJ, Veenman CJ. Source camera identification for heavily JPEG compressed low resolution still images. J Forensic Sci. 2009;54(3):628–38.

3. Barni M, Bondi L, Bonettini N, Bestagini P, Costanzo A, Maggini M, Tondi B, Tubaro S. Aligned and non-aligned double JPEG detection using convolutional neural networks. J Vis Commun Image Represent. 2017;49:153–63.
4. Barni M, Chen Z, Tondi B. Adversary-aware, data-driven detection of double JPEG compression: How to make counter-forensics harder. In: 2016 IEEE International Workshop on Information Forensics and Security (WIFS), pp. 1–6. IEEE (2016)
5. Bayar B, Stamm MC. A deep learning approach to universal image manipulation detection using a new convolutional layer. In: Proceedings of the 4th ACM Workshop on Information Hiding and Multimedia Security, pp. 5–10 (2016)
6. Bayar B, Stamm MC. Constrained convolutional neural networks: a new approach towards general purpose image manipulation detection. IEEE Trans Inf Forensics Secur. 2018;13(11):2691–706.
7. Bayer BE. Color imaging array. United States Patent 3,971,065 (1976)
8. Bayram S, Sencar H, Memon N, Avcibas I. Source camera identification based on CFA interpolation. In: IEEE International Conference on Image Processing 2005, vol. 3, pp. III–69. IEEE (2005)
9. Bennabhaktula S, Alegre E, Karastoyanova D, Azzopardi G. Device-based image matching with similarity learning by convolutional neural networks that exploit the underlying camera sensor pattern noise. In: Proceedings of the 9th International Conference on Pattern Recognition Applications and Methods - ICPRAM, pp. 578–584 (2020). https://doi.org/10.5220/0009155505780584
10. Bondi L, Baroffio L, Güera D, Bestagini P, Delp EJ, Tubaro S. First steps toward camera model identification with convolutional neural networks. IEEE Signal Process Lett. 2016;24(3):259–63.
11. Bondi L, Lameri S, Güera D, Bestagini P, Delp EJ, Tubaro S. Tampering detection and localization through clustering of camera-based CNN features. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1855–1864. IEEE (2017)
12. Bunk J, Bappy JH, Mohammed TM, Nataraj L, Flenner A, Manjunath B, Chandrasekaran S, Roy-Chowdhury AK, Peterson L. Detection and localization of image forgeries using resampling features and deep learning. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1881–1889. IEEE (2017)
13. Cao H, Kot AC. Accurate detection of demosaicing regularity for digital image forensics. IEEE Trans Inf Forensics Secur. 2009;4(4):899–910.
14. Chen C, Stamm MC. Camera model identification framework using an ensemble of demosaicing features. In: 2015 IEEE International Workshop on Information Forensics and Security (WIFS), pp. 1–6. IEEE (2015)
15. Chuang WH, Su H, Wu M. Exploring compression effects for improved source camera identification using strongly compressed video. In: 2011 18th IEEE International Conference on Image Processing, pp. 1953–1956. IEEE (2011)
16. Cozzolino D, Verdoliva L. Noiseprint: a CNN-based camera model fingerprint. IEEE Trans Inf Forensics Secur. 2019;15:144–59.
17. Dal Cortivo D, Mandelli S, Bestagini P, Tubaro S. CNN-based multi-modal camera model identification on video sequences. J Imaging. 2021;7(8):135.
18. Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256. JMLR Workshop and Conference Proceedings (2010)
19. Goljan M, Fridrich J, Filler T. Large scale test of sensor fingerprint camera identification. In: Media Forensics and Security, vol. 7254, pp. 170–181. SPIE (2009)
20. Haralick RM, Shanmugam K, Dinstein IH. Textural features for image classification. IEEE Trans Syst Man Cybern. 1973;3(6):610–21.
21. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
22. He K, Zhang X, Ren S, Sun J. Identity mappings in deep residual networks. In: European Conference on Computer Vision, pp. 630–645. Springer (2016)
23. Holst GC. CCD arrays, cameras, and displays. Citeseer (1998)
24. Hosler B, Mayer O, Bayar B, Zhao X, Chen C, Shackleford JA, Stamm MC. A video camera model identification system using deep learning and fusion. In: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8271–8275. IEEE (2019)
25. Howard A, Sandler M, Chu G, Chen LC, Chen B, Tan M, Wang W, Zhu Y, Pang R, Vasudevan V, et al. Searching for MobileNetV3. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1314–1324 (2019)
26. Howard AG, Zhu M, Chen B, Kalenichenko D, Wang W, Weyand T, Andreetto M, Adam H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
27. Hu J, Shen L, Sun G. Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018)
28. Iuliani M, Fontani M, Shullani D, Piva A. Hybrid reference-based video source identification. Sensors. 2019;19(3):649.
29. Kang X, Stamm MC, Peng A, Liu KR. Robust median filtering forensics using an autoregressive model. IEEE Trans Inf Forensics Secur. 2013;8(9):1456–68.
30. Kharrazi M, Sencar HT, Memon N. Blind source camera identification. In: 2004 International Conference on Image Processing, ICIP'04, vol. 1, pp. 709–712. IEEE (2004)
31. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Adv Neural Inf Process Syst. 2012;25:1097–105.
32. Kurosawa K, Kuroki K, Saitoh N. Basic study on identification of video camera models by videotaped images. In: Proceedings of 6th Indo Pacific Congress on Legal Medicine and Forensic Sciences, pp. 26–30 (1998)
33. Kurosawa K, Kuroki K, Saitoh N. CCD fingerprint method - identification of a video camera from videotaped images. In: Proceedings 1999 International Conference on Image Processing (Cat. 99CH36348), vol. 3, pp. 537–540. IEEE (1999)
34. Li CT. Source camera identification using enhanced sensor pattern noise. IEEE Trans Inf Forensics Secur. 2010;5(2):280–7.
35. Li J, Li X, Yang B, Sun X. Segmentation-based image copy-move forgery detection scheme. IEEE Trans Inf Forensics Secur. 2014;10(3):507–18.
36. Lin X, Li CT. Preprocessing reference sensor pattern noise via spectrum equalization. IEEE Trans Inf Forensics Secur. 2015;11(1):126–40.
37. Loshchilov I, Hutter F. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983 (2016)
38. Lukas J, Fridrich J, Goljan M. Digital camera identification from sensor pattern noise. IEEE Trans Inf Forensics Secur. 2006;1(2):205–14.
39. Mandelli S, Bestagini P, Verdoliva L, Tubaro S. Facing device attribution problem for stabilized video sequences. IEEE Trans Inf Forensics Secur. 2019;15:14–27.

40. Marra F, Gragnaniello D, Verdoliva L. On the vulnerability of deep learning to adversarial attacks for camera model identification. Signal Process. 2018;65:240–8.
41. Marra F, Poggi G, Sansone C, Verdoliva L. A study of co-occurrence based local features for camera model identification. Multimedia Tools Appl. 2017;76(4):4765–81. https://doi.org/10.1007/s11042-016-3663-0
42. Mayer O, Bayar B, Stamm MC. Learning unified deep-features for multiple forensic tasks. In: Proceedings of the 6th ACM Workshop on Information Hiding and Multimedia Security, pp. 79–84 (2018)
43. Mittal A, Moorthy AK, Bovik AC. No-reference image quality assessment in the spatial domain. IEEE Trans Image Process. 2012;21(12):4695–708. https://doi.org/10.1109/TIP.2012.2214050
44. Mittal A, Soundararajan R, Bovik AC. Making a "completely blind" image quality analyzer. IEEE Signal Process Lett. 2012;20(3):209–12.
45. Pibre L, Pasquet J, Ienco D, Chaumont M. Deep learning is a good steganalysis tool when embedding key is reused for different images, even if there is a cover source mismatch. Electron Imaging. 2016;2016(8):1–11.
46. Ramachandran P, Zoph B, Le QV. Searching for activation functions. arXiv preprint arXiv:1710.05941 (2017)
47. Sandler M, Howard A, Zhu M, Zhmoginov A, Chen LC. MobileNetV2: Inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520 (2018)
48. Shullani D, Fontani M, Iuliani M, Al Shaya O, Piva A. VISION: a video and image dataset for source identification. EURASIP J Inform Secur. 2017;2017(1):1–16.
49. Swaminathan A, Wu M, Liu KR. Nonintrusive component forensics of visual sensors using output images. IEEE Trans Inf Forensics Secur. 2007;2(1):91–106.
50. Tan M, Le Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114. PMLR (2019)
51. Tang H, Ni R, Zhao Y, Li X. Median filtering detection of small-size image based on CNN. J Vis Commun Image Represent. 2018;51:162–8.
52. Timmerman D, Bennabhaktula G, Alegre E, Azzopardi G. Video camera identification from sensor pattern noise with a constrained ConvNet. In: Proceedings of the 10th International Conference on Pattern Recognition Applications and Methods - ICPRAM, pp. 417–425. INSTICC, SciTePress (2021). https://doi.org/10.5220/0010246804170425
53. Venkatanath N, Praneeth D, Bh MC, Channappayya SS, Medasani SS. Blind image quality evaluation using perception based features. In: 2015 Twenty First National Conference on Communications (NCC), pp. 1–6. IEEE (2015)
54. Wang X, Wang H, Niu S. An image forensic method for AI inpainting using faster R-CNN. In: Sun X, Pan Z, Bertino E, editors. Artificial Intelligence and Security: 5th International Conference, ICAIS 2019, New York, NY, USA, July 26–28, 2019, Proceedings, Part III. Cham: Springer International Publishing; 2019. p. 476–87. https://doi.org/10.1007/978-3-030-24271-8_43
55. Zhu X, Qian Y, Zhao X, Sun B, Sun Y. A deep learning approach to patch-based image inpainting forensics. Signal Process. 2018;67:90–9.

Publisher's Note  Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
