

PRNU-Net: a Deep Learning Approach for Source
Camera Model Identification based on Videos Taken
with Smartphone
Younes Akbari∗, Noor Almaadeed∗, Somaya Al-maadeed∗, Fouad Khelifi†, and Ahmed Bouridane‡
∗Department of Computer Science and Engineering, Qatar University, Doha, Qatar
†Department of Computer and Information Sciences, Northumbria University, UK
‡Center for Data Analytics and Cybernetics, University of Sharjah, Sharjah, UAE

Abstract—Recent advances in digital imaging have meant that every smartphone has a video camera that can record high-quality video for free and without restrictions. In addition, rapidly developing Internet technology has contributed significantly to the widespread distribution of digital video via web-based multimedia systems and mobile applications such as YouTube, Facebook, Twitter, WhatsApp, etc. However, as the recording and distribution of digital video have become affordable nowadays, security issues have become threatening and have spread worldwide. One of these security issues is the identification of the source camera of a video. Generally, two common categories of methods are used in this area, namely Photo Response Non-Uniformity (PRNU) and machine learning approaches. To exploit the power of both approaches, this work adds a new PRNU-based layer to a convolutional neural network (CNN), called PRNU-Net. To explore the new layer, the main structure of the CNN is based on MISLnet, which has been used in several studies to identify the source camera. The experimental results show that PRNU-Net is more successful than MISLnet and that the PRNU extracted by the layer from low-level features, namely edges or textures, is more useful than the PRNU extracted from high- and mid-level features, namely parts and objects, in classifying source camera models. On average, the network improves the results on a new database by about 4%.
I. INTRODUCTION

In the last two decades, cell phone technology has evolved significantly due to its economic advantages, functionality and accessibility [1]. The ability to create digital audiovisual content without constraints such as time, objects, or locations is a clear advantage of the technology [2]. For forensic investigations and crime prosecutions, smartphone devices provide some important information in crucial ways [1], [3]. In areas such as medicine, law, and surveillance systems, where images and videos are examined for authenticity, these types of investigations have potential significance. Lossy video compression complicates the forensic analysis of videos much more than the analysis of images, since the relevant traces can be erased or significantly damaged by high compression rates, making it impossible or difficult to recover the entire processing history. While numerous forensic methods have been developed for digital images [4], [5], [6], [7], [8], [9], the forensic analysis of videos has been less explored. It should be noted that methods based on images cannot be applied directly to videos [10], [11], [12]. This is due to challenges such as compression, stabilization, scaling, and cropping, as well as the differences between frame types that can occur when producing a video. By analyzing the video produced by digital cameras, video identification algorithms can identify and distinguish camera types. During the last few years, forensic specialists have been particularly interested in this topic. In general, images and videos can be identified in two ways: by extracting a unique fingerprint from the images or videos, or by examining the metadata associated with the images or videos (the DNA of the video). Lopez et al. [13] demonstrated that the internal elements and metadata of a video can be used for source video identification. Since metadata can be removed from an image or video, identifying videos or images based on the fingerprint is a reliable method. Moreover, two concepts are considered for identifying a camera: individual source camera identification (ISCI) and source camera model identification (SCMI). ISCI distinguishes cameras from both the same and different camera models, while SCMI is a subset of ISCI that distinguishes a particular camera model from others but cannot distinguish a particular device from other devices of the same camera model. In this paper, we focus on the SCMI scenario.

Two common categories of methods are used in this field, namely Photo Response Non-Uniformity (PRNU) [14], [15], [16], [17], [18] and machine learning approaches. PRNU, which is understood to be the unique fingerprint of the camera, is often referred to as residual noise or sensor pattern noise (SPN). PRNU is generated when the CCD (Charge Coupled Device) or CMOS (Complementary Metal Oxide Semiconductor) sensor processes the input signal (light) and converts it into a digital signal. The output of such methods can be considered low-level features. In deep learning methods, which are a popular category of machine learning, a training step should be performed to extract the fingerprint of the video captured by the camera. The main challenge for these methods is the separation of content from noise. This challenge can be addressed by introducing new methods and algorithms, for example, by adding new layers and loss functions. The architecture introduced by the Multimedia and Information Security Lab (MISL) [19] is one such architecture. The MISLnet network is based on a so-called constrained convolutional layer. A constrained convolutional layer is added at the beginning of a CNN that is to perform
forensic tasks, as shown in Figure 1 (a). As a result of this layer, low-level features are extracted to suppress the image content. To design the layer, the filters of the convolutional layer are forced to satisfy the following constraints:

\omega_{kj}^{(1)}(0, 0) = -1, \qquad \sum_{m,n \neq 0} \omega_{kj}^{(1)}(m, n) = 1 \quad (1)

where j = \{1, 2, 3\}. Moreover, \omega_{kj}^{(1)} denotes the jth kernel of the kth filter in the first layer of the network. Despite the promising results of the method [20], [21], given the sensitivity of this field, further improvement is always essential.
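As a rough illustration only (not the authors' implementation), the constraint in (1) can be enforced in PyTorch by re-projecting the first-layer kernels after each optimizer step; the class name is ours, and the three 5 × 5 output kernels follow the description above.

```python
# Minimal sketch of a constrained convolutional layer in the style of Eq. (1):
# after every optimizer step, each kernel is renormalized so that its centre
# weight is -1 and the remaining weights sum to 1.
import torch
import torch.nn as nn

class ConstrainedConv2d(nn.Conv2d):
    """5x5 convolution whose kernels are re-projected onto the constraint of Eq. (1)."""

    @torch.no_grad()
    def apply_constraint(self):
        w = self.weight                          # shape: (out_channels, in_channels, 5, 5)
        c = w.shape[-1] // 2                     # index of the centre tap
        w[:, :, c, c] = 0.0                      # exclude the centre from the normalization
        s = w.sum(dim=(-2, -1), keepdim=True)    # sum of the off-centre weights
        w /= (s + 1e-12)                         # off-centre weights now sum to 1
        w[:, :, c, c] = -1.0                     # centre weight fixed to -1

# usage: call layer.apply_constraint() after each optimizer.step()
layer = ConstrainedConv2d(in_channels=1, out_channels=3, kernel_size=5, padding=2, bias=False)
layer.apply_constraint()
```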
To add the benefits of PRNU approaches to CNNs, this paper presents a new PRNU-based layer that can improve the results of Deep Learning architectures in this application. The PRNU-based layer, which can be inserted into CNNs, adds an advantageous attribute to CNNs by taking into account the fingerprint information extracted from the frames in the network. The layer can pass the extracted fingerprint (low-level features) from each layer to the next layers. This means that the features can be extracted from layers with high-, mid- or low-level features. An overview of the structures is given in Figure 1 (b). In this structure, the new layer can be placed at any point in the network and repeated several times, like a convolutional layer. The goal of evaluating the new layer at different locations in the network is to make the effects of the PRNU extracted from high-, mid-, and low-level features during learning clearer. Forward propagation and backpropagation are based on the PRNU method and on the derivatives of the loss with respect to the input data of the layer, respectively. Two scenarios were studied in relation to the location and the number of repetitions of the layer. To evaluate the approaches, the frames need to be extracted. Generally, a video consists of intra-coded pictures (I-frames), predictive coded pictures (P-frames), and bi-predictive coded pictures (B-frames), with I-frames showing promising results [12], [22]. In our work, I-frames are extracted from the Qatar University Forensic Video Database (QUFVD), which was created as part of this investigation. The database includes 6000 videos from 20 modern smartphones representing five brands; each brand has two models, and each model has two identical smartphone devices. The experiments show that the new layer can improve the results of the CNN without it (MISLnet [19]). It is worth noting that, like all deep learning methods used to identify source cameras, this study deals with videos at the frame level instead of considering the video in a feature space representation.
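Since only I-frames are fed to the network, one common way to pull them out of a video is with FFmpeg. The snippet below is only a hedged sketch: the paper does not specify the extraction tool, and the file names and output pattern are hypothetical.

```python
# Hypothetical helper: extract the I-frames of a video as PNG images using FFmpeg.
# The select filter keeps only frames whose picture type is I; -vsync vfr prevents
# duplicated frames. Requires ffmpeg to be installed and on the PATH.
import subprocess
from pathlib import Path

def extract_iframes(video_path: str, out_dir: str) -> None:
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "ffmpeg", "-i", video_path,
            "-vf", "select=eq(pict_type\\,I)",   # keep intra-coded frames only
            "-vsync", "vfr",
            str(Path(out_dir) / "iframe_%04d.png"),
        ],
        check=True,
    )

# example (hypothetical paths): extract_iframes("video_0001.mp4", "iframes/video_0001")
```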
The paper is organized as follows. Section II gives a review of the available deep learning methods for source camera verification from videos. Our new approach is then presented in Section III. Section IV describes the database used to identify source cameras from videos. Section V discusses the evaluation of the proposed method, while the last section concludes this work.

II. LITERATURE REVIEW

Source camera identification from videos can be classified into two categories: PRNU and Deep Learning methods. Since we focus on Deep Learning in this paper, we address these methods in this section.

In [23], a CNN-based sensor pattern noise (SPN) method was presented, called SPN-CNN. The authors implemented the architecture based on the idea that a CNN has the ability to extract signals characterised by noise from a set of images [24]. Therefore, the network was trained to obtain a noise pattern. The method was tested on the VISION database [25] and the experimental results have shown that it outperforms the wavelet denoiser. The authors also showed that, when I-frames were used to feed the CNN, the results were further improved.

References [20] and [21] proposed a deep learning method (the MISLnet architecture) for source camera identification using video frames to train the network. They extended a version of the constrained convolutional layer introduced in [19], as mentioned in Section I. Moreover, a majority vote over the frames fed into the network was used to make the decision at the video level. The constrained convolutional layer was added as the first layer and used three kernels of size 5. This layer is constructed in such a way that it captures relationships between adjacent pixels that are independent of the content of the scene. The methods were tested on the VISION database [25]. The experiments showed that the layer can improve the results compared with deep learning architectures without the layer. The key difference between the two methods relates to the size of the images and the type of color mode: [20] and [21] used RGB and grayscale modes, respectively, and the patches used in the former have a size of 480 while the latter was fed patches of size 256.

Mayer et al. [26] used the CNN proposed in [19], like the two previous studies, to extract features, and a similarity network to verify the source camera. The similarity network maps two input deep feature vectors to a 2D similarity vector. To achieve this, the authors follow the design of the similarity network developed in [27]. To obtain a decision at the video level, a fusion approach based on the mean of the inactivated output layer of the similarity network was presented. This method was tested on the SOCRatES dataset [28]. The experiments showed that the method improves on traditional methods such as [29].

The structure of the CNN for the three studies is shown in Figure 1 (a). As shown in the figure, a constrained convolutional layer is added to a simple CNN.

III. PRNU-NET

Figure 1 (b) shows the structure of the network used in this study. As can be seen in the figure, only the constrained layer proposed by [19] (the MISLnet architecture) is removed from our structure, and the rest of our structure is the same as MISLnet. A new PRNU-based layer has been designed to replace the constrained convolutional layer and can be placed elsewhere in the network. The layer can extract the PRNU from the raw images (input layer) and from the feature maps of each convolutional layer. To design the new layer, forward propagation and backward propagation are considered, which are explained in the following two subsections. Since the PRNU is extracted from grayscale images, it should be noted that the frames used as input are in grayscale mode.
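To make the idea of "placing the layer at position l" concrete, here is a purely schematic PyTorch sketch: the backbone follows the block sizes printed in Figure 1, and a placeholder PRNU module (such as the PRNULayer sketched after Algorithm 1 below) is inserted in front of the l-th position. The pooling configuration, classifier head, and helper names are assumptions, not the authors' exact architecture.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, k):
    # Conv -> BatchNorm -> TanH -> MaxPool, as in the MISLnet-style backbone of Fig. 1
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=k // 2),
        nn.BatchNorm2d(out_ch),
        nn.Tanh(),
        nn.MaxPool2d(kernel_size=3, stride=2),
    )

def build_prnu_net(prnu_layer, positions=(2,), num_classes=10):
    """Insert `prnu_layer` at every 1-based position l listed in `positions`."""
    specs = [(1, 96, 7), (96, 64, 5), (64, 64, 5), (64, 128, 1)]   # block sizes from Fig. 1(b)
    blocks = []
    for i, (cin, cout, k) in enumerate(specs, start=1):
        if i in positions:                       # l = 1..4: before the i-th convolutional block
            blocks.append(prnu_layer)
        blocks.append(conv_block(cin, cout, k))
    if len(specs) + 1 in positions:              # l = 5: after the last convolutional block
        blocks.append(prnu_layer)
    return nn.Sequential(*blocks, nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                         nn.Linear(128, num_classes))

# e.g. a stand-in module while prototyping: net = build_prnu_net(nn.Identity(), positions=(2,))
```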
[Figure 1 — layer stacks: (a) Constrained Convolutional Layer (3, 5×5) → Conv2 (96, 7×7) → Conv3 (64, 5×5) → Conv3 (64, 5×5) → Conv3 (128, 1×1), each convolution followed by Batch Normalization, TanH and Max Pooling, then fully connected layers and a SoftMax output; (b) the same backbone without the constrained layer, with a PRNU layer that can be placed at different positions.]

Fig. 1. Overview of (a) the CNN (MISLnet) presented in [19] with a constrained layer as the first layer of the network and (b) the PRNU network structure with a layer based on PRNU (the dotted rectangle shows that different locations of the layer can be considered).

A. Forward propagation

Let B = \{X_{(l)}^{(j)}, Y^j\}_{j=1}^{N} be the training set with N samples, where l indicates the position of the layer as shown in Figure 1 (b). For each input of the layer, we consider X_{(l)}^{(j)} = \{x_1, x_2, ..., x_d\}, where d is the dimension of the input of the layer. For the network shown in Figure 1 (b), for l = \{1, 2, 3, 4, 5\}, d can be d = \{1, 96, 64, 64, 128\}, respectively. d guarantees that the PRNU can be extracted from the raw images (input layer) and from the feature maps of the convolutional layers that are the input of the layer. For example, for a member of X_{(2)}^{(j)}, x_1 with d = 96 denotes the first dimension of the 96-dimensional input at the second position (after the first convolutional layer, as shown in Figure 1 (b)) eligible for the layer. To obtain the PRNU for the input, we have, as in [29]:

x_i = O + OK + \Theta \quad (2)

where O refers to the original input multimedia file, K represents the PRNU factor and \Theta is a random noise factor. To estimate K, the noise residual W of the input should be obtained using a denoising filter F:

W_i = x_i - F(x_i) \quad (3)

The estimate of K is obtained by the following maximum likelihood estimator:

\hat{K}_i = \frac{W_i x_i}{(x_i)^2} \quad (4)

where \hat{K}_i is the output of the layer for input x_i.
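A minimal NumPy sketch of Eqs. (3)–(4) is given below. Note that it uses a simple Gaussian filter from SciPy as a stand-in for the wavelet-based denoiser F discussed in [4], purely to keep the example self-contained; the epsilon term is our addition to avoid division by zero.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def prnu_estimate(x: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Eqs. (3)-(4): W = x - F(x), K_hat = W * x / x**2 (element-wise).

    `x` is one grayscale patch or feature map; the Gaussian filter is only a
    stand-in for the wavelet-based denoiser F of [4].
    """
    x = x.astype(np.float64)
    residual = x - gaussian_filter(x, sigma=1.0)     # Eq. (3): noise residual
    return residual * x / (x ** 2 + eps)             # Eq. (4): PRNU estimate

patch = np.random.rand(350, 350)                     # e.g. a 350x350 grayscale patch
k_hat = prnu_estimate(patch)
```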
The process of feature extraction by the PRNU layer is illustrated in Figure 2 at three different layer positions for one sample. The original image, the output of the first convolutional layer and the output of the last convolutional layer are selected to obtain the PRNU. A patch with a size of 350 × 350 is considered as input for the original image. To create the feature maps in the convolutional layers, one of the convolutional kernels is used. As mentioned above, a denoising filter F is used to extract the pattern noise. Wavelet-based filters can be considered to have better performance than approaches such as Wiener and median filters; areas around the edges are usually misinterpreted by the latter two. For details on how this denoising filter works, see [4].

Fig. 2. Sample of PRNU layer results based on three positions. The first row shows (a) the original image, (b) the output of the first convolutional layer, and (c) the output of the last convolutional layer. The second row shows the PRNU extraction corresponding to each output (the displayed outputs of the two convolutional layers are based on scaled colors and the sizes have been adjusted to the original size).

B. Backpropagation

Backpropagation of the PRNU layer requires the input and output of the forward function of the layer. To include a user-defined layer in a network, the forward function of the layer must accept the output of the previous layer and forward propagate arrays of the size expected by the next layer. Similarly, if a backward function is specified, it must accept inputs of the same size as the corresponding output of the forward function and backward propagate derivatives of the same size. The derivative of the loss with respect to the input data X is:

\frac{\partial L}{\partial X} = \frac{\partial L}{\partial f(X)} \frac{\partial f(X)}{\partial X} \quad (5)

where \partial L / \partial f(X) is the gradient propagated from the next layer.
have better performance than approaches such as Wiener and Since in backpropagation scheme, we can use both input and
output of forward propagation to derive the derivative of the and 76,531 I-frames. Table I summarizes QUFVD with its
activation. Let us consider K̂ as: features and Figure 3 shows samples of the database. The
database is publicly available 1 .
K̂ = E ⊙ X (6)
where E shows a matrix that perform the operation to obtain
PRNU like (4), and ⊙ is the element-wise product of the two
matrix. Then, if we have f (X) = K̂, the derivation is:
∂f (X)
=E (7)
∂X
Since we have both K̂ and X from the backpropagation
operation, we can easily obtain E. Algorithms 1 enumerates the
learning schemes for forward propagation and backpropagation.
Algorithm 1 Training PRNU-Net with forward propagation and backpropagation

(Forward propagation)
Input: training data B = \{X_{(l)}^{(j)}, Y^j\}_{j=1}^{N}
Output: estimated PRNU \hat{K} = \{\hat{K}_i^{(j)}\}_{j=1}^{N}
for j = 1 to N
    for i = 1 to d
        Compute the estimated noise residual W_i^{(j)} using (3)
        Compute the PRNU \hat{K}_i^{(j)} using (4)
    end for
end for
Return: \hat{K}

(Backpropagation)
Input: \hat{K}, X, and \partial L / \partial f(X)
Output: \partial L / \partial X
Compute the matrix of the PRNU operation: E = \hat{K} \oslash X (\oslash is the element-wise division)
Compute \partial f(X) / \partial X using (7)
Return: \partial L / \partial X = (\partial L / \partial f(X)) \odot E
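The rules of Algorithm 1 map naturally onto a custom autograd function. The sketch below is an illustrative PyTorch rendering, not the authors' released code: the Gaussian blur is again only a stand-in for the wavelet denoiser F, and the backward pass follows the paper's rule (gradient times E) rather than a full analytic derivative through the denoiser.

```python
import torch
import torchvision.transforms.functional as TF

class PRNUFunction(torch.autograd.Function):
    """Custom layer: Eq. (4) in the forward pass, Eqs. (5) and (7) in the backward pass."""

    @staticmethod
    def forward(ctx, x, eps=1e-8):
        denoised = TF.gaussian_blur(x, kernel_size=5)   # stand-in for the denoiser F
        w = x - denoised                                 # Eq. (3): noise residual
        k_hat = w * x / (x * x + eps)                    # Eq. (4): PRNU estimate
        ctx.save_for_backward(x, k_hat)
        ctx.eps = eps
        return k_hat

    @staticmethod
    def backward(ctx, grad_output):
        x, k_hat = ctx.saved_tensors
        e = k_hat / (x + ctx.eps)                        # E = K_hat ./ X (element-wise division)
        return grad_output * e, None                     # dL/dX = dL/df(X) * E, per Algorithm 1

class PRNULayer(torch.nn.Module):
    def forward(self, x):
        return PRNUFunction.apply(x)

out = PRNULayer()(torch.rand(4, 1, 64, 64))              # shape-preserving, so it can sit anywhere
```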
IV. DATABASES

Most of the databases used for source video identification provide recordings captured with video cameras, and only two databases offer recordings made with smartphones, namely Daxing [1] and QUFVD [30]. Therefore, exploring methods based on smartphone databases, which are being developed at a rapid pace, can show whether the existing methods remain useful on the new databases. Comparing the results on Daxing and QUFVD with those on older databases such as VISION will also show that the new smartphone-based databases are more challenging and thus that improvement is inevitably needed. Although the Daxing database can be considered an important database in this field, as explored in [30], QUFVD is better suited for use with Deep Learning methods. In fact, in the Daxing database, source camera identification techniques based on machine learning may face a problem of unbalanced data, since the number of training videos is small and differs across the devices. To make a fair comparison, the QUFVD database has been used. It includes five popular smartphone brands with two models per brand and two devices for each model, 6000 original videos, and 76,531 I-frames. Table I summarizes QUFVD and its features, and Figure 3 shows samples from the database. The database is publicly available¹.

¹ https://www.dropbox.com/sh/nb543na9qq0wlaz/AAAc5N8ecjawk2KlVF8kfkrya?dl=0

Fig. 3. Sample frames from captured videos of the database.

V. EVALUATION

PRNU-Net is evaluated using the SCMI scenario (a 10-class problem) with different settings. We divide the experiments into different scenarios showing the influence of some conditions on the results, related to the position and the repetition of the layer during training. Our proposed network is compared with the MISLnet architecture [19], which has been used in several recent studies [20], [26], [21]. To show the impact of separating content from noise, MISLnet is considered with and without the constrained layer. Our implementation of the architecture is based mostly on [20], which considered the identification of the source camera. Stochastic gradient descent (SGD) is used to train the model. The batch size is set to 100, and the momentum and weight decay of the stochastic gradient descent are set to 0.95 and 0.0005, respectively. The CNN is trained for 10 epochs in each experiment.
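Assuming a standard PyTorch setup (the paper does not name a framework, and the learning rate is not reported), the stated hyper-parameters correspond to something like the following sketch; the toy model and random tensors stand in for PRNU-Net and the QUFVD I-frame patches.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholders for illustration only: a tiny model and a random dataset.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(64 * 64, 10))
train_set = TensorDataset(torch.randn(1000, 1, 64, 64), torch.randint(0, 10, (1000,)))

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,        # learning rate not reported; placeholder
                            momentum=0.95, weight_decay=5e-4)    # momentum / decay as stated in the text
loader = DataLoader(train_set, batch_size=100, shuffle=True)     # batch size 100 as stated
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(10):                                          # 10 epochs as stated
    for frames, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(frames), labels)
        loss.backward()
        optimizer.step()
```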
The experiments use 80% of the videos in the database for training, while the remaining 20% are kept for testing. Also, 20% of the training data is used as validation data. As mentioned earlier, the I-frames of the videos are extracted to evaluate the performance on the database. For each video and in each experimental setup, we selected all the I-frames related to the videos in the training, testing, and validation tasks. A total of 76,531 I-frames were extracted. To identify a video based on its I-frames, all I-frames in the test set are considered. The scores obtained by the CNN, based on the highest probability, show which I-frames belong to which classes. At the video level, a majority vote over all the frames belonging to a video then decides its class.
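The video-level decision described above (each I-frame votes with its most probable class, and the video takes the majority label) can be written as a small helper; frame_probs is an assumed (num_frames × num_classes) array of per-frame softmax scores.

```python
import numpy as np
from collections import Counter

def video_label(frame_probs: np.ndarray) -> int:
    """Majority vote over the per-frame argmax predictions of one video."""
    frame_preds = frame_probs.argmax(axis=1)              # highest-probability class per I-frame
    return Counter(frame_preds.tolist()).most_common(1)[0][0]

# example: 12 I-frames of one video, 10 smartphone models
probs = np.random.dirichlet(np.ones(10), size=12)
print(video_label(probs))
```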
It should be noted that all patches (about 90,000) for each class are used for training, and the best results for both methods are based on patches with a size of 350 × 350. A 64-bit operating system (Ubuntu 18) with a CPU E5-2650 v4 @ 2.20 GHz, 128.0 GB of RAM, and four NVIDIA GTX TITAN X GPUs was used to run our experiments. Table II lists the results at the frame and video levels in terms of accuracy (%) for the SCMI scenario for each smartphone model, based on PRNU-Net and on MISLnet with and without the constrained layer.
TABLE I
THE DEVICES OF OUR DATABASE WITH THEIR CHARACTERISTICS.

Brand Model Resolution Number of videos Number of I-frames Length in Secs Operating system
Samsung Galaxy A50 (device #1) 1080 × 1920 300 3654 11-15 Android 10.0
Samsung Galaxy A50 (device #2) 1080 × 1920 300 3782 11-15 Android 10.0
Samsung Note9 (device #1) 1080 × 1920 300 3956 11-15 Android 10.0
Samsung Note9 (device #2) 1080 × 1920 300 3962 12-15 Android 10.0
Huawei Y7 (device #1) 720 × 1280 300 3630 11-15 Android 9.0
Huawei Y7 (device #2) 720 × 1280 300 3642 11-15 Android 9.0
Huawei Y9 (device #1) 720 × 1280 300 4146 11-14 Android 9.0
Huawei Y9 (device #2) 720 × 1280 300 4011 11-15 Android 9.0
iPhone 8 Plus (device #1) 1080 × 1920 300 3991 11-15 iOS 13
iPhone 8 Plus (device #2) 1080 × 1920 300 4080 11-14 iOS 13
iPhone XS Max (device #1) 1080 × 1920 300 3893 11-15 iOS 13
iPhone XS Max (device #2) 1080 × 1920 300 4074 11-15 iOS 13
Nokia 5.4 (device #1) 1080 × 1920 300 3350 11-13 Android 10.0
Nokia 5.4 (device #2) 1080 × 1920 300 3531 11-14 Android 10.0
Nokia 7.1 (device #1) 1080 × 1920 300 3904 11-13 Android 10.0
Nokia 7.1 (device #2) 1080 × 1920 300 3819 11-14 Android 10.0
Xiaomi Redmi Note8 (device #1) 1080 × 1920 300 3776 11-14 Android 11.0
Xiaomi Redmi Note8 (device #2) 1080 × 1920 300 3598 11 Android 11.0
Xiaomi Redmi Note9 Pro (device #1) 1080 × 1920 300 3888 11-15 Android 11.0
Xiaomi Redmi Note9 Pro (device #2) 1080 × 1920 300 3838 11-13 Android 11.0

TABLE II
THE RESULTS OF THE FRAME AND VIDEO LEVELS IN TERMS OF ACCURACY (%) FOR THE SCMI SCENARIO FOR EACH SMARTPHONE MODEL.

Model | I-frame: [19] (without constrained layer) / [19] (with constrained layer) / Ours | Video: [19] (without constrained layer) / [19] (with constrained layer) / Ours
Galaxy A50 69.1 72.8 75.0 71.0 73.3 77.2
Note9 74.2 78.7 78.8 88.4 95.8 95.8
Y7 65.7 68.0 71.5 80.0 84.2 86.0
Y9 70.8 76.9 77.3 82.4 86.7 91.6
8 Plus 65.5 67.8 73.9 83.4 84.2 85.5
XS Max 71.5 76.8 79.2 64.8 68.3 74.9
5.4 79.8 81.8 83.1 88.7 90.8 92.7
7.1 71.8 75.5 80.6 86.5 90.0 92.2
Redmi Note8 70.5 75.8 81.9 78.2 80.8 84.2
Redmi Note9 Pro 66.1 66.4 75.0 64.4 65.8 77.3
Overall accuracy 70.5 74.0 77.6 78.8 82.0 85.7

TABLE III
IMPACT OF PLACE AND REPETITION OF THE PRNU LAYER IN THE NETWORK (l)

Place: l=1 / l=2 / l=3 / l=4 / l=5 / l={1,2} / l={1,2,3} / l={1,2,3,4} / l={1,2,3,4,5}
Accuracy (%): 77.4 / 77.6 / 74.1 / 73.8 / 73.0 / 77.4 / 73.9 / 72.5 / 72.3

With this premise, Figure 4 provides a more comprehensive picture of the camera identification performance, checking the quality of PRNU-Net against MISLnet by presenting the Receiver Operating Characteristic (ROC) curves for the selected group of ten classes (smartphone models) from our database. Two values are calculated for each threshold: the True Positive Ratio (TPR) and the False Positive Ratio (FPR). The TPR of a given class, e.g. Huawei Y7, is the number of outputs whose actual and predicted class is Huawei Y7 divided by the number of outputs whose predicted class is Huawei Y7. The FPR is calculated by dividing the number of outputs whose actual class is not Huawei Y7, but whose predicted class was Huawei Y7, by the number of outputs whose predicted class is not Huawei Y7.
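For reference, per-class ROC curves of this kind are commonly computed one-vs-rest from the softmax scores, e.g. with scikit-learn. This is a generic sketch with random placeholder data, not necessarily the exact computation used by the authors, whose TPR/FPR definitions are given above.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# y_true: integer class labels per test frame; scores: (n_frames, n_classes) softmax outputs.
def per_class_roc(y_true: np.ndarray, scores: np.ndarray, cls: int):
    fpr, tpr, _ = roc_curve((y_true == cls).astype(int), scores[:, cls])
    return fpr, tpr, auc(fpr, tpr)

y_true = np.random.randint(0, 10, size=500)
scores = np.random.dirichlet(np.ones(10), size=500)
fpr, tpr, area = per_class_roc(y_true, scores, cls=3)
print(f"AUC for class 3: {area:.3f}")
```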
The impact of the place and repetition of the layer in the network (l) is explored as shown in Table III, which indicates which positions, i.e. after layers with high-, mid-, or low-level features, are more suitable.

A. Result discussion

Recently, Deep Learning methods have been introduced to solve source camera identification. These methods can help to improve the results obtained with traditional methods such as the PRNU methods. Overall, the results obtained at the frame and video levels suggest that PRNU-Net is more successful than MISLnet for the SCMI problem for all device models. For both methods, when the results are reported at the video level, an improvement can be clearly observed.
In addition, the results of PRNU-Net and MISLnet with the constrained layer, when compared against MISLnet without the constrained layer, clearly show that the separation of content and noise is useful for source camera identification. The results are discussed in more detail below. As can be seen in Table II at the frame level, a few devices are hard to identify, such as the Y7, 8 Plus, and Redmi Note9 Pro, and this requires further analysis to find out the reason for the differences, which could be the resolution of the videos or the imaging technology used by the devices, etc. However, from the other results, resolution does not seem to be the reason, since the Y7 and Y9 have the lowest resolution but their identification results are not worse. The biggest improvement over MISLnet is for the Redmi Note9 Pro, at about 9%. At the video level, an overall improvement is achieved for all devices. The best result, at 95%, is obtained for the Note9 by both methods. The best improvement at the video level is also for the Redmi Note9 Pro compared to MISLnet, at about 12%. Figure 4 shows the TPR against the FPR for the two methods at different frame-level thresholds. As can be seen from the figure, all models achieve a larger Area Under Curve (AUC) value with PRNU-Net than with MISLnet. A per-device analysis in terms of TPR and FPR, as shown in the figure, indicates that the best performance is obtained for the Nokia 5.4, with AUC = 0.991 for PRNU-Net (Figure 4 (a)) compared to AUC = 0.989 for MISLnet (Figure 4 (b)). Table III shows that the best result is obtained when the layer position is l = 2 and when the layer is repeated twice, namely l = {1, 2}, at the frame level. When the repetition is applied to all layers, a drop in performance is observed, showing that the extracted fingerprint can be affected by the convolutional layers. This also shows that the layer gives better results when placed after layers with low-level features. This may be because PRNU extracts low-level features, and the features may be more accurate if the input is low-level.

Fig. 4. True and false positive rates (ROC) obtained in the SCMI scenario: (a) 10 classes with PRNU-Net; (b) 10 classes with MISLnet.
VI. CONCLUSION

This paper has presented a new layer based on the PRNU extracted from videos taken with a smartphone to identify the camera source. In general, PRNU methods extract low-level features from frames, and we have studied the feature extraction performed by these methods using a deep-learning approach. For the new layer, forward propagation and backpropagation are defined based on the extracted PRNU and on the derivative of the loss with respect to the input data, respectively. The method is evaluated on a database containing five popular smartphone brands with two models per brand and two devices for each model, 6000 original videos, and 76,531 I-frames. The results show that the approach achieves promising results compared to MISLnet, one of the most popular deep learning methods in the field. The best results are obtained when the layer is located after low-level inputs. However, it is obvious that it is essential to improve the results in future works.

To improve the results, especially when the layer is repeated, defining new learnable parameters can help to reduce the impact of the convolutional layers. The parallel use of other PRNU methods and filters can be considered as a bank of operators that can be used instead of convolutional layers. It is also possible to add the layer to other Deep Learning architectures. It would be a worthwhile endeavor for the future to change the architecture so that videos are seen as a sequence of frames rather than focusing on single frames. Finally, the PRNU network should be tested using other scenarios, such as ISCI, to obtain a more accurate analysis.

ACKNOWLEDGMENT

This publication was made possible by NPRP grant # NPRP12S-0312-190332 from the Qatar National Research Fund (a member of Qatar Foundation). The statements made herein are solely the responsibility of the authors.

REFERENCES

[1] H. Tian, Y. Xiao, G. Cao, Y. Zhang, Z. Xu, and Y. Zhao, "Daxing smartphone identification dataset," IEEE Access, vol. 7, pp. 101 046–101 053, 2019.
[2] S. Milani, M. Fontani, P. Bestagini, M. Barni, A. Piva, M. Tagliasacchi, and S. Tubaro, "An overview on video forensics," APSIPA Transactions on Signal and Information Processing, vol. 1, 2012.
[3] Y. Akbari, S. Al-maadeed, O. Elharrouss, F. Khelifi, A. Lawgaly, and A. Bouridane, "Digital forensic analysis for source video identification: A survey," Forensic Science International: Digital Investigation, vol. 41, p. 301390, 2022.
[4] J. Lukas, J. Fridrich, and M. Goljan, "Digital camera identification from sensor pattern noise," IEEE Transactions on Information Forensics and Security, vol. 1, no. 2, pp. 205–214, 2006.
[5] M. Chen, J. Fridrich, M. Goljan, and J. Lukás, "Determining image origin and integrity using sensor noise," IEEE Transactions on Information Forensics and Security, vol. 3, no. 1, pp. 74–90, 2008.
[6] A. Lawgaly and F. Khelifi, "Sensor pattern noise estimation based on improved locally adaptive DCT filtering and weighted averaging for source camera identification and verification," IEEE Transactions on Information Forensics and Security, vol. 12, no. 2, pp. 392–404, 2016.
[7] A. Lawgaly, F. Khelifi, and A. Bouridane, "Weighted averaging-based sensor pattern noise estimation for source camera identification," in 2014 IEEE International Conference on Image Processing (ICIP). IEEE, 2014, pp. 5357–5361.
[8] X. Kang, Y. Li, Z. Qu, and J. Huang, "Enhancing source camera identification performance with a camera reference phase sensor pattern noise," IEEE Transactions on Information Forensics and Security, vol. 7, no. 2, pp. 393–402, 2011.
[9] F. Ahmed, F. Khelifi, A. Lawgaly, and A. Bouridane, "Comparative analysis of a deep convolutional neural network for source camera identification," in 2019 IEEE 12th International Conference on Global Security, Safety and Sustainability (ICGS3). IEEE, 2019, pp. 1–6.
[10] M. Iuliani, M. Fontani, D. Shullani, and A. Piva, "Hybrid reference-based video source identification," Sensors, vol. 19, no. 3, p. 649, 2019.
[11] S. Mandelli, P. Bestagini, L. Verdoliva, and S. Tubaro, "Facing device attribution problem for stabilized video sequences," IEEE Transactions on Information Forensics and Security, vol. 15, pp. 14–27, 2019.
[12] E. Altinisik and H. T. Sencar, "Source camera verification for strongly stabilized videos," IEEE Transactions on Information Forensics and Security, vol. 16, pp. 643–657, 2020.
[13] R. R. López, E. A. Luengo, A. L. S. Orozco, and L. J. G. Villalba, "Digital video source identification based on container's structure analysis," IEEE Access, vol. 8, pp. 36 363–36 375, 2020.
[14] S. McCloskey, "Confidence weighting for sensor fingerprinting," in 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. IEEE, 2008, pp. 1–6.
[15] W.-H. Chuang, H. Su, and M. Wu, "Exploring compression effects for improved source camera identification using strongly compressed video," in 2011 18th IEEE International Conference on Image Processing. IEEE, 2011, pp. 1953–1956.
[16] M. Goljan, M. Chen, P. Comesaña, and J. Fridrich, "Effect of compression on sensor-fingerprint based camera identification," Electronic Imaging, vol. 2016, no. 8, pp. 1–10, 2016.
[17] A. Mahalanobis, B. V. Kumar, and D. Casasent, "Minimum average correlation energy filters," Applied Optics, vol. 26, no. 17, pp. 3633–3640, 1987.
[18] L. J. G. Villalba, A. L. S. Orozco, R. R. López, and J. H. Castro, "Identification of smartphone brand and model via forensic video analysis," Expert Systems with Applications, vol. 55, pp. 59–69, 2016.
[19] B. Bayar and M. C. Stamm, "Constrained convolutional neural networks: A new approach towards general purpose image manipulation detection," IEEE Transactions on Information Forensics and Security, vol. 13, no. 11, pp. 2691–2706, 2018.
[20] D. Timmerman, S. Bennabhaktula, E. Alegre, and G. Azzopardi, "Video camera identification from sensor pattern noise with a constrained ConvNet," arXiv preprint arXiv:2012.06277, 2020.
[21] B. Hosler, O. Mayer, B. Bayar, X. Zhao, C. Chen, J. A. Shackleford, and M. C. Stamm, "A video camera model identification system using deep learning and fusion," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 8271–8275.
[22] B. C. Hosler, X. Zhao, O. Mayer, C. Chen, J. A. Shackleford, and M. C. Stamm, "The video authentication and camera identification database: A new database for video forensics," IEEE Access, vol. 7, pp. 76 937–76 948, 2019.
[23] M. Kirchner and C. Johnson, "SPN-CNN: Boosting sensor-based source camera attribution with deep learning," in 2019 IEEE International Workshop on Information Forensics and Security (WIFS). IEEE, 2019, pp. 1–6.
[24] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, "Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising," IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, 2017.
[25] D. Shullani, M. Fontani, M. Iuliani, O. Al Shaya, and A. Piva, "VISION: a video and image dataset for source identification," EURASIP Journal on Information Security, vol. 2017, no. 1, pp. 1–16, 2017.
[26] O. Mayer, B. Hosler, and M. C. Stamm, "Open set video camera model verification," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 2962–2966.
[27] O. Mayer and M. C. Stamm, "Forensic similarity for digital images," IEEE Transactions on Information Forensics and Security, vol. 15, pp. 1331–1346, 2019.
[28] C. Galdi, F. Hartung, and J.-L. Dugelay, "SOCRatES: A database of realistic data for source camera recognition on smartphones," in ICPRAM, 2019, pp. 648–655.
[29] M. Goljan, J. Fridrich, and T. Filler, "Large scale test of sensor fingerprint camera identification," in Media Forensics and Security, vol. 7254. International Society for Optics and Photonics, 2009, p. 72540I.
[30] Y. Akbari, S. Al-Maadeed, N. Al-Maadeed, A. Al-Ali, F. Khelifi, A. Lawgaly et al., "A new forensic video database for source smartphone identification: Description and analysis," IEEE Access, vol. 10, pp. 20 080–20 091, 2022.

