Efficient Video Privacy Protection Against Malicious Face Recognition Models
ABSTRACT The proliferation of powerful facial recognition systems poses a serious threat to user privacy. Attackers can train highly accurate facial recognition models using public data from social platforms. Therefore, recent works have proposed image pre-processing techniques to protect user privacy. Without affecting normal human viewing, these techniques add specially crafted noise to images so that it becomes difficult for attackers to train models with high accuracy. However, existing protection techniques are mainly designed for image data, and they cannot be directly applied to video data because of their high computational overhead. In this paper, we propose an efficient protection method for video privacy that exploits features unique to video protection to eliminate redundant computation and accelerate processing. Evaluation results on various benchmarks demonstrate that our method significantly outperforms traditional methods, reducing computation overhead by 35.5%.
for every video frame. The main contributions of this paper are listed as follows:
• We find that existing privacy protection methods against malicious face recognition have high computational overhead for videos. The main overhead comes from the process of generating cloaks for faces in video frames.
• We propose a novel method that reuses existing cloaks, instead of generating new ones, to reduce computational overhead, so that video protection can be accelerated.
• We implement our method on TensorFlow and use well-known video data sets for performance evaluation. The experimental results show that 35.5% of the computation can be saved.
The rest of our paper is organized as follows. We first introduce the background and discuss the motivation in Section II. Section III presents the algorithm design. Then, we show the performance evaluation in Section IV. Finally, we draw the conclusion in Section V.

II. BACKGROUND AND RELATED WORK
The current protection techniques are outlined in this section. Additionally, we preview three aspects, including metrics, sorts of attackers, and protection methods. The current studies extend numerous types of applications and focus on the human vision and machine recognition aspects. Attacker types are separated into the authorized model and the unauthorized model based on the relationship between attackers and protectors. Additionally, we list the protections against the unauthorized model.

transformation, the authors offer a unique security-enhancement cycle-consistent generative adversarial network (GAN) [17], [19]. In PECAM, for instance, each vehicle's license plate number is regarded as private information when used for video surveillance. The monitoring video's finer details are masked by the GAN before being uploaded to the server. The structural information is also kept, and the scheme has been empirically demonstrated to be secure and reversible. Consequently, applications such as road-condition and accident analysis usually remain possible [20], [21]. Moreover, PECAM can retrieve the data for in-depth analysis. For privacy protection, this method works effectively. However, achieving the same visual impression in the context of this paper is challenging.
Some approaches alter images and videos so that they look good from both a human and a machine perspective. The backdoor attack is a typical example, in which the attacker controls the training of the DL model by supplying specific data and labels. This method is also analyzed in the following as protecting privacy via poisoning attacks. After the DL model is affected by backdoor data, the data containing the relevant inputs are classified into predetermined categories. Since changing the images or videos is the user's personal conduct in this article's context, other people's data cannot be affected. Furthermore, personal information is shared on social media networks where tags cannot be changed. Therefore, the backdoor approach is not completely applicable to our scenario.
In summary, we compare two parallel metrics in existing applications and list some applications. Moreover, the shortcomings of the above methods for the scenario in our article are compared.
available in a single query. Each access can only return results that have been processed, such as the quantity and duration of objects. Using this method, Privid divides continuous video into discrete query pieces and then applies differential privacy (DP) to the query results [26], [30]. This method mostly applies to tracking statistics, not to private social-platform videos. We have no control over how the video is edited or how the query process works. Methods based on DP are therefore unrelated to the subject matter of this paper. What needs to be regulated is not the amount of data a query can access at once, but rather the effect that an unauthorized attacker could have after training on the data.

C. PROTECTING PRIVACY METHODS
We provide a preview of cloaks, which protect privacy through pixel-level changes. As shown in Fig. 1, there are three ways to protect privacy in images. Protecting privacy via evasion attacks uses appropriate decoration to cover part of the face. Poisoning-based protection methods generate images that interfere with the attackers. Cloaking protection methods modify the image at the pixel level. In order to protect user privacy from unauthorized attackers, many techniques use attacks against DL models. This scenario can be seen as a reversal of the traditional roles of attacker and protector, where the users are the attackers and a third-party tracker with unauthorized tracing is the protector.

1) PROTECTING PRIVACY VIA EVASION ATTACKS
This type of technology requires the user to wear appropriate decorations, which is not suitable for normal use [22], [31]. To evade tracking, this kind of method needs adequate white-box access to the attacker's model. The recognition effect of the tracker is calculated as the optimization objective. Therefore, the scope of application is small and easy to defend against [32], [33]. The other kind of evasion method changes the original image obviously, which affects normal use.

2) PROTECTING PRIVACY VIA POISONING ATTACKS
Another way to avoid DL model attacks is to interfere with their training [31], [33]. A typical one is the backdoor attack, in which the attacker guides the training process of the DL model by generating specific data and labels. Model corruption attacks actively attack the tracker, which will easily lead the tracker to use more advanced attack methods. In fact, although it is difficult to eliminate the influence of backdoor data in the model, it is not difficult to detect such attacks [34]. On the other hand, the label is the owner of the image under the face recognition task. Since our images or videos are published on social platforms, they can only be protected by clean-label methods. Clean-label attacks do not change the label of the data; instead, they achieve the protection effect by modifying the original data.

3) PROTECTING PRIVACY VIA CLOAK
Protecting privacy via cloaks is more suitable for the scenario of individual privacy [10], [14], [32]. First, the cloak can be tailored to a user's personal data. Second, it has good generality and can deal with a wide range of models. Third, its concealment against anomaly detection is better. Different from backdoor attacks, this approach does not trigger wrong classifications by having the model record a particular input-output pair. Such a particular input-output pair would be independent of the original recognition task. The protection method via cloak is instead equivalent to modifying the input-output pairs within the recognition task.

III. ALGORITHM DESIGN
This section describes the threat model and assumptions as shown in Fig. 2. The method for locating key positions of the face is then demonstrated. By combining the considerations of visual effects and privacy protection, we formulate the loss function. We then go over how to re-generate the cloaks using adjacent frames.

FIGURE 2. Users would like to publish their videos or images on social media platforms, but they don't want unauthorized face recognition trackers to be able to identify who they are. Trackers have sufficient processing power. They can train the recognition model using a vast data set available on social media.

A. THREAT MODEL AND ASSUMPTIONS
In this section, we present the threat model and assumptions for both users and trackers as shown in Fig. 2. Then we analyse the intermediate results of face recognition models, which are called intermediate features. We follow existing work's assumptions about computing power, where users can obtain intermediate features [13], [14].
Users: Users' goal is to share their images or videos on social platforms while preventing facial recognition by unauthorized facial recognition trackers. In addition, changes of visual effects should be constrained to maintain regular usage [14], [34], [35]. Therefore, users should apply data pre-processing for protection before uploading. The pre-processed data can lead trackers to train faulty models that fail to recognize user faces. Unfortunately, such pre-processing is computationally
expensive, especially for videos, which brings high computation overhead for users. Therefore, we have the following design goals for the protection method.
• The protection method should not affect the visual effects of images or videos.
• The malicious models trained on the pre-processed data cannot recognize users' faces.
• The computational overhead of the protection method should be acceptable for users.
Tracker: We assume that trackers have sufficient computing power. They can access large data sets or use pre-trained feature extractors through transfer learning. The intermediate features of the faces in the data set are located in the same high-dimensional space, which is called the feature space. As a result of the large amount of data in the feature space, we have the chance to confuse our own data with other data. An attacker that only identifies a single user is out of this paper's scope. At the same time, we mainly consider the case where the social platform is the primary source of personal data, although, in reality, users' data might also be leaked from other sources. If the user provides enough pre-processed data, the effect of the real data provided by other sources is negligible.
Feature Space: We explain the intuition of the protection method using an example with four classes in the feature space, as shown in Fig. 3. Four users' data are mapped to different locations in the feature space. Note that the feature space has multiple dimensions; for visual purposes, we present a two-dimensional one here. The same user's data are distributed across nearby regions as a result of their similar features. The orange circle user pre-processes his data using the protection method. To obfuscate the malicious model, the protection method moves the user's data to the locations of another user, called the target class. In our example, the user data is moved closer to the red triangle class. Due to the large variation of features, even if the tracker selects a different model, it won't typically assign the orange circle data to its real class.
Moreover, the relationship between the data belonging to the same user must be considered. In fact, the identical person's data are not entirely mapped to the same location in the feature space. Weather, angle, and camera equipment can also have an impact on the intermediate features of the same person. In a given video, these outside variables tend not to alter drastically. As a result, compared to individual images, the intermediate features of video frames are more closely distributed, as shown in Fig. 3(c).

FIGURE 3. The operation principle of the protection method in feature space is visually demonstrated. In the figure, we show an example of a dataset with four people. Classes are distinguished by different shapes, where the circle is the protected class. (a) The data is spread over locations. (b) The protection method changes data features. (c) The feature distribution of video is more concentrated than that of individual images.
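To make the feature-space intuition above concrete, the sketch below picks a target class whose feature centroid is farthest from the user's own frames. This is only a minimal illustration of the idea of choosing a distant target class; the selection rule, the extract-features step, and the candidate data structure are our assumptions, not the paper's exact procedure.

```python
import numpy as np

def select_target_class(user_feats, candidate_feats_by_class, p=2):
    """Pick the class whose feature centroid is farthest from the user's centroid.

    user_feats: (N, D) intermediate features of the user's own frames.
    candidate_feats_by_class: dict mapping class id -> (M, D) features.
    p: order of the Minkowski distance used for the comparison.
    """
    user_centroid = user_feats.mean(axis=0)
    best_cls, best_dist = None, -1.0
    for cls, feats in candidate_feats_by_class.items():
        centroid = feats.mean(axis=0)
        dist = np.sum(np.abs(centroid - user_centroid) ** p) ** (1.0 / p)
        if dist > best_dist:
            best_cls, best_dist = cls, dist
    return best_cls, best_dist
```

A distant centroid makes it harder for a model trained on the cloaked data to place the user's real frames back in their true region of the feature space.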
Feature Space: We explain the intuition of the protection 1) LOCATING FACE POSITIONS
method using an example with four classes in the feature When pre-processing video data, the video is first divided into
space as shown in Fig. 3. Four users’ data are mapped to frames. Inspired by MTCNN [15], we use a positioning model
cascaded of three deep convolutional neural networks (CNN) Note that the SSIM [14] value is proportional to the similarity
to locate faces and their landmarks as shown in Fig. 4(b). First, between two frames.
using shallow CNN, the positioning model quickly generates While using SSIM to measure visual effects, we use the
candidate windows, each of which has a chance of locating changes in intermediate feature to reflect protection effects.
the face. Second, a more sophisticated CNN refines the can- We train a CNN facial recognition model to extract the in-
didate windows and filters out many of them. Finally, CNN termediate features of the frame. The frame’s dimensionality
outputs the face location and refines the results. Note that the can be swiftly decreased using the convolutional layer. We ba-
positioning model specifically outputs the location of the face sically need to intercept the convolutional layer’s output as an
together with the of the eyes and mouth, called key positions. intermediary feature in our proposed protection method. It is
Later in the cloak generation section, we will detail how to straightforward for us to acquire the Minkowski Distance [37]
identify the relationship between adjacent cloaks through key from the intermediate features in the lower dimensions, called
positions. feature distance. The output of the convolutional layer [38]
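As a rough sketch of this positioning step, the snippet below splits a video into frames with OpenCV and runs the cascaded MTCNN detector from the open-source mtcnn package to obtain face boxes and the eye/mouth key positions. The package choice and the input file name are assumptions for illustration, not the authors' exact implementation.

```python
import cv2
from mtcnn import MTCNN  # cascaded P-Net / R-Net / O-Net detector

detector = MTCNN()
capture = cv2.VideoCapture("input_video.mp4")  # hypothetical input file

frames, key_positions = [], []
while True:
    ok, frame_bgr = capture.read()
    if not ok:
        break
    frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    faces = detector.detect_faces(frame_rgb)
    if faces:
        face = faces[0]                 # 'box': [x, y, width, height]
        pts = face["keypoints"]         # eyes, nose, and mouth corners
        key_positions.append((pts["left_eye"], pts["right_eye"],
                              pts["mouth_left"], pts["mouth_right"]))
        frames.append(frame_rgb)
capture.release()
```

The eye and mouth coordinates collected here play the role of the key positions reused later for cloak re-generation.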
2) JOINT OPTIMIZATION OF VISUAL EFFECTS AND PRIVACY PROTECTION
The basic idea of our protection method is to superimpose on the face a pixel matrix, called a cloak. The cloak influences the frame in both its visual effect and the position of its intermediate features. We measure the visual effect changes using the Structural Similarity Metric (SSIM) [14], [31], [36]. SSIM assesses how similar two frames are, utilizing brightness, contrast, and structure as three different dimensions, as follows:

l(x, y) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1},
c(x, y) = \frac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2},
s(x, y) = \frac{1}{\sqrt{N-1}} \frac{x - \mu_x}{\sigma_x} \cdot \frac{1}{\sqrt{N-1}} \frac{y - \mu_y}{\sigma_y}.   (2)

where \mu_x and \mu_y represent the means of x and y, respectively. C_1 and C_2 are divide-by-zero protections, which are constants times the value range of the images. \sigma_x and \sigma_y stand for the standard deviations of x and y, respectively.
Then the SSIM can be obtained by multiplying the above brightness, contrast, and structure similarities as follows:

\mathrm{SSIM}(x, y) = l(x, y) \cdot c(x, y) \cdot s(x, y)
= \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}.   (3)

Note that the SSIM [14] value is proportional to the similarity between two frames.
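A compact way to check the terms in (2)-(3) is the single-window (global) form below, written with NumPy. The constants follow the common convention C_1 = (0.01L)^2 and C_2 = (0.03L)^2 for dynamic range L, which is an assumption; the excerpt only states that they are constants times the value range.

```python
import numpy as np

def ssim_global(x, y, value_range=255.0):
    """Single-window SSIM between two same-sized grayscale frames, eq. (2)-(3)."""
    x = x.astype(np.float64).ravel()
    y = y.astype(np.float64).ravel()
    c1 = (0.01 * value_range) ** 2      # divide-by-zero protection for luminance
    c2 = (0.03 * value_range) ** 2      # divide-by-zero protection for contrast
    mu_x, mu_y = x.mean(), y.mean()
    sigma_x, sigma_y = x.std(ddof=1), y.std(ddof=1)
    sigma_xy = np.cov(x, y, ddof=1)[0, 1]
    return ((2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x ** 2 + sigma_y ** 2 + c2))
```

A value close to 1 indicates that the cloaked frame is visually close to the original, which is the constraint side of the joint optimization.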
While using SSIM to measure visual effects, we use the changes in the intermediate features to reflect the protection effects. We train a CNN facial recognition model to extract the intermediate features of the frame. The frame's dimensionality can be swiftly decreased using the convolutional layers. We basically need to intercept a convolutional layer's output as the intermediate feature in our proposed protection method. It is then straightforward to compute the Minkowski distance [37] between the intermediate features in the lower dimensions, called the feature distance. The output of the convolutional layer [38] at position (k, x, y) is computed from the corresponding region of inputs, and the feature distance follows, according to the following equations:

F(x) = \sum_{c}^{C} \sum_{m}^{F_w} \sum_{n}^{F_h} W_{(c,m,n,k)} * In_{(c,x+m,y+n)} + b_j,
    0 \le k \le K, \; 0 \le x \le I_w - F_w, \; 0 \le y \le I_h - F_h,
D_t = \left( \sum_{c}^{C} \sum_{m}^{F_w} \sum_{n}^{F_h} \left| F(x_a)_{(m,n,k)} - F(x_b)_{(m,n,k)} \right|^{p} \right)^{1/p}.   (4)

where the convolution layer is defined by the weights W_{(c,m,n,k)} of height F_h, width F_w, and channel C in the network. The convolution layer scans the space of the inputs with height I_h and width I_w. p is the order of the Minkowski distance. The convolutional layer's parameters are frozen and no longer changed after the model has converged. We can map frames to their positions in the feature space using the frozen recognition model. In addition, simply using a portion of the network for computation leads to lower costs, which can be managed within the user's tolerance range [11], [13], [14].
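One way to realize the frozen intermediate-feature extractor and the feature distance D_t of (4) in TensorFlow/Keras is sketched below. The backbone (MobileNetV2), the truncation layer name, and the input size are placeholders, since the excerpt does not name the exact recognition model used by the authors.

```python
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(weights="imagenet", include_top=False,
                                          input_shape=(224, 224, 3))
base.trainable = False  # parameters are frozen after the model has converged

# Truncate at an intermediate convolutional layer; the layer name is illustrative.
feature_extractor = tf.keras.Model(inputs=base.input,
                                   outputs=base.get_layer("block_6_expand").output)

def feature_distance(frame_a, frame_b, p=2.0):
    """Minkowski distance of order p between intermediate features of two frames."""
    fa = feature_extractor(tf.expand_dims(frame_a, 0))
    fb = feature_extractor(tf.expand_dims(frame_b, 0))
    return tf.reduce_sum(tf.abs(fa - fb) ** p) ** (1.0 / p)
```

Only the truncated front of the network is evaluated, which is where the cost saving mentioned above comes from.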
To sum up, the optimization strategy of the protection method should combine the visual effects and the feature distance. The optimization objective is:

L_f = -D_t\big(F(x), F(x_m)\big) + \lambda_1 D_t\big(F(x_m), F(x_p)\big),
L_s = \lambda \max\big(|D_s(x, x_m)| - \rho, 0\big) - \lambda_2 \max\big(|D_s(x_m, x_p)| - \rho, 0\big),
\min \; L_f - \lambda_3 L_s.   (5)

where F is the feature extractor that translates the frames to the feature space. \lambda \max(|D_s(x, x_m)| - \rho, 0) calculates the SSIM term, which is the measure of the visual effects. In order to improve the consistency of the video, we also consider the feature distance and SSIM between the adjacent frame x_p and the recent frame x_m. The image is mistakenly assigned to a different category by shifting its location within the feature space. Therefore, using these pre-processed data as training guidelines, malicious face recognition likewise produces inaccurate results. The loss function is directly differentiated against the parameters of the cloak in Fig. 5. A random pixel matrix, which is the same size as the face, serves as the initialization of a cloak. By using an iterative gradient descent procedure, the cloaks are gradually updated to the optimal value.
to the optimal value. the key positions that the affine transformation method uses
to identify the plane of the face in a recent frame. Once
3) EFFICIENT CLOAK GENERATION affine matrices have been established, we can utilize the affine
There are many similarities between continuous frames in a transformation method as given in. The pixels of the frame
video, including similar positions in feature space and visual are affined in this manner [38]. The areas where the pixels in
effects. As shown in Fig. 6, we select continuous frames from the new frame do not match those in the old frame are filled
a video and measure their similarity using SSIM. For com- using area interpolation. The cloaks are created to match the
parison, we randomly select images of several people from updated frames following the affine transformation method.
the well-known VGG2 data [14], [39]. The SSIM value is Therefore, the optimization process is accelerated.
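A minimal OpenCV sketch of this cloak re-use step is shown below: it estimates the matrix [A | b] of (6) from three key positions shared by the previous and current frames and warps the previous cloak onto the new frame, so that the warped cloak can initialize the next optimization instead of random noise. Treating the two eyes and one mouth corner as the three anchor points, and filling uncovered border pixels with zeros, are our assumptions for illustration; the paper describes area interpolation for the unmatched regions.

```python
import cv2
import numpy as np

def reuse_cloak(prev_cloak, prev_keypts, cur_keypts):
    """Warp the previous frame's cloak onto the current frame via eq. (6).

    prev_cloak: HxWxC array (the cloak optimized for the previous frame).
    prev_keypts, cur_keypts: three (x, y) key positions, e.g. left eye,
    right eye, and one mouth corner, in the previous and current frames.
    """
    src = np.float32(prev_keypts)
    dst = np.float32(cur_keypts)
    affine = cv2.getAffineTransform(src, dst)    # 2x3 matrix [A | b]
    h, w = prev_cloak.shape[:2]
    warped = cv2.warpAffine(prev_cloak, affine, (w, h),
                            flags=cv2.INTER_LINEAR,
                            borderMode=cv2.BORDER_CONSTANT,
                            borderValue=0)       # uncovered pixels filled with zeros here
    return warped  # used to initialize the next cloak instead of random noise
```

Because the warped cloak already sits close to a good solution, far fewer gradient-descent iterations are needed for the new frame.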
TABLE 1. Protection Success Rate in Face Recognition Platforms

well-performing protection system [14] and list their respective calculations. Each configuration is arranged according to previous work [12], [36].
Metrics: Fawkes simultaneously optimizes two different modifications. Users' visual effects are unaffected. The attacker's training model may be misled by the pre-processed videos, misclassifying the real user data. As a result, there are two separate indicators to measure these objectives. The visual difference between videos taken before and after protection is represented by SSIM. A smaller visual change means that the protection method has less of an influence on the image or video's routine use. The protective effect measurement is the other component. We continue to use the feature extractor from previous work [14]. Through the pre-trained model, the feature extractor compares the feature distance between two frames. The method's improved protective impact is generally shown by a noticeable feature change. The total loss function, which is inversely proportional to the space loss and the square of the input loss, is summarized at the end. In addition, our method can achieve the same image protection efficiency as Fawkes, as shown in Table 1.
Hyper-parameters: Our approach includes several hyper-parameters. We undertake a thorough analysis to determine the ideal hyper-parameter settings in order to improve the performance of the computation reuse approach. The selection of hyper-parameters mainly refers to previous experimental results [14], [39] and the results obtained in actual use. The first of the hyper-parameters is the number of iterations. Each cloak is initialized, either randomly in the conventional method or through the computation reuse process by mapping the cloak from the previous frame. The cloaks are iteratively optimized after initialization, with a maximum of 40 iterations set for each image. Typically, the iterations can finish within this upper limit. On the other hand, we established a threshold and used it to gauge calculation speed throughout the comparison process. When the rate of optimization within the threshold accelerates, the aforementioned number of iterations can be reduced to boost the optimization rate.
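The iteration budget and threshold described above might be wired together as in the following sketch, where step_fn is a single gradient update such as the one in the earlier optimization snippet. The 40-iteration cap comes from the text; testing convergence on the per-step loss improvement is our own assumption about how the threshold is used.

```python
def run_cloak_iterations(step_fn, max_iters=40, threshold=1e-3):
    """Run cloak optimization steps until the loss improvement drops below a threshold.

    step_fn: callable performing one gradient update and returning the current loss.
    max_iters: upper bound on iterations per image (40 in the text).
    threshold: stop early once the per-step improvement is smaller than this.
    """
    prev_loss = float("inf")
    for i in range(max_iters):
        loss = float(step_fn())
        if prev_loss - loss < threshold:   # optimization has effectively converged
            return i + 1, loss
        prev_loss = loss
    return max_iters, prev_loss
```

With cloaks initialized from the previous frame, convergence is typically reached well before the cap, which is where the reported computation saving originates.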
B. EVALUATION
We demonstrate the effects of the traditional method and the computation reuse method on the data's visual effects in Fig. 9(a). Compared to the traditional method, our method keeps the similarity between frames closer to that of the original video. As shown in Fig. 9(a), both methods have an impact on the feature continuity of the original video, but our method remains considerably more similar to the original video.
[25] Z. Yang et al., "Neural network inversion in adversarial setting via background knowledge alignment," in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., 2019, pp. 225–240.
[26] N. Johnson, J. P. Near, and D. Song, "Towards practical differential privacy for SQL queries," Proc. VLDB Endowment, vol. 11, no. 5, pp. 526–539, 2018.
[27] F. D. McSherry, "Privacy integrated queries: An extensible platform for privacy-preserving data analysis," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2009, pp. 19–30.
[28] F. Bastani et al., "MIRIS: Fast object track queries in video," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2020, pp. 1907–1921.
[29] Z. Cai, M. Saberian, and N. Vasconcelos, "Learning complexity-aware cascades for deep pedestrian detection," in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 3361–3369.
[30] F. Cangialosi, N. Agarwal, V. Arun, S. Narayana, A. Sarwate, and R. Netravali, "Privid: Practical, privacy-preserving video analytics queries," in Proc. 19th USENIX Symp. Netw. Syst. Des. Implementation, 2022, pp. 209–228.
[31] J. Steinhardt, P. W. Koh, and P. Liang, "Certified defenses for data poisoning attacks," in Proc. 31st Int. Conf. Neural Inf. Process. Syst., 2017, pp. 3520–3532.
[32] O. Suciu, R. Marginean, Y. Kaya, H. Daume III, and T. Dumitras, "When does machine learning FAIL? Generalized transferability for evasion and poisoning attacks," in Proc. USENIX Conf. Secur. Symp., 2018, pp. 1299–1316.
[33] Y. Wu et al., "DeltaGrad: Rapid retraining of machine learning models," in Proc. 37th Int. Conf. Mach. Learn., 2020, pp. 10355–10366.
[34] B. Wang et al., "Neural cleanse: Identifying and mitigating backdoor attacks in neural networks," in Proc. IEEE Symp. Secur. Privacy, 2019, pp. 707–723.
[35] B. Wang, Y. Yao, B. Viswanath, H. Zheng, and B. Y. Zhao, "With great training comes great vulnerability: Practical attacks against transfer learning," in Proc. 27th USENIX Conf. Secur. Symp., 2018, pp. 1281–1297.
[36] M. Sharif, S. Bhagavatula, L. Bauer, and M. K. Reiter, "Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition," in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., 2016, pp. 1528–1540.
[37] O. Suciu et al., "When does machine learning FAIL? Generalized transferability for evasion and poisoning attacks," in Proc. 27th USENIX Conf. Secur. Symp., 2018, pp. 1299–1316.
[38] G. Singh, R. Ganvir, M. Püschel, and M. Vechev, "Beyond the single neuron convex barrier for neural network certification," in Proc. 33rd Int. Conf. Neural Inf. Process. Syst., 2019, Art. no. 1352.
[39] M. Abadi et al., "TensorFlow: A system for large-scale machine learning," in Proc. 12th USENIX Conf. Operating Syst. Des. Implementation, 2016, pp. 265–283.

ENTING GUO received the master's degree from the Nanjing University of Posts and Telecommunications, Nanjing, China, in 2020. He is currently working toward the Ph.D. degree with the Division of Computer Science, University of Aizu, Aizuwakamatsu, Japan. His research interests include AI systems, and AI security and privacy.

PENG LI (Senior Member, IEEE) received the B.S. degree from the Huazhong University of Science and Technology, Wuhan, China, in 2007, and the M.S. and Ph.D. degrees from the University of Aizu, Aizuwakamatsu, Japan, in 2009 and 2012, respectively. He is currently a Senior Associate Professor with the University of Aizu. He has authored or coauthored more than 100 papers in major conferences and journals. His research interests mainly include cloud/edge computing, Internet-of-Things, distributed AI systems, and AI security and privacy. He was the recipient of the Young Author Award of IEEE Computer Society Japan Chapter in 2014, Best Paper Award of IEEE TrustCom 2016, and Best Paper Award of IEEE Communication Society Big Data Technical Committee in 2019. He supervised students to win the First Prize of IEEE ComSoc Student Competition in 2016. Dr. Li was also the recipient of the 2020 Best Paper Award of IEEE Transactions on Computers. Dr. Li is the Editor of IEEE OPEN JOURNAL OF THE COMPUTER SOCIETY, and IEICE Transactions on Communications.

SHUI YU (Senior Member, IEEE) received the Ph.D. degree from Deakin University, Burwood, VIC, Australia, in 2004. He currently is a Professor with the School of Computer Science, University of Technology Sydney, Ultimo, NSW, Australia. He initiated the research field of networking for big data in 2013, and his research outputs have been widely adopted by industrial systems, such as Amazon cloud security. He has authored or coauthored four monographs and edited two books, more than 500 technical papers, including top journals and top conferences, such as IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, IEEE TRANSACTIONS ON COMPUTERS, IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, IEEE TRANSACTIONS ON MOBILE COMPUTING, IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTING, IEEE/ACM TRANSACTIONS ON NETWORKING, and INFOCOM. His research interests include Big Data, security and privacy, networking, and mathematical modeling. His h-index is 66. He is currently serving a number of prestigious editorial boards, including IEEE COMMUNICATIONS SURVEYS AND TUTORIALS as an Area Editor, IEEE Communications Magazine, IEEE INTERNET OF THINGS JOURNAL, and so on. He was a Distinguished Lecturer of IEEE Communications Society (2018–2021). He is a Distinguished Visitor of IEEE Computer Society, a Voting Member of IEEE ComSoc Educational Services Board, and an Elected Member of Board of Governor of IEEE Vehicular Technology Society.

HAO WANG (Senior Member, IEEE) is currently an Associate Professor and the Head of the Big Data Laboratory, Department of ICT and Natural Sciences, Norwegian University of Science and Technology, Trondheim, Norway. He was a Researcher with IBM Canada, McMaster, and St. Francis Xavier University, Antigonish, NS, Canada, before he moved to Norway. His research interests include Big Data analytics and industrial Internet of Things, high-performance computing, safety-critical systems, and communication security. He has authored more than 60 papers in the IEEE TVT, GlobalCom 2016, Sensors, the IEEE Design & Test, and Computer Communications. He is a Member of the IEEE IES Technical Committee on Industrial Informatics. He was a TPC Co-Chair of the IEEE DataCom 2015, IEEE CIT 2017, and ES 2017, and a Reviewer of journals, such as the IEEE TKDE, TBD, TETC, T-IFS, and ACM TOMM.