2024 - Generalizing VT For Face Anti-Spoofing
Manuscript received September 2023, revised April 2024. This work was done at the Rapid-Rich Object Search (ROSE) Lab, School of Electrical and Electronic Engineering (EEE), Nanyang Technological University (NTU). This research is supported in part by the NTU-PKU Joint Research Institute (a collaboration between NTU and Peking University sponsored by a donation from the Ng Teng Fong Charitable Foundation) and by the Science and Technology Foundation of Guangzhou Huangpu Development District under Grant 2022GH15. This work is also partially supported by the National Natural Science Foundation of China under Grants 62371301 and 62306061, by the Guangdong Basic and Applied Basic Research Foundation (Grant No. 2023A1515140037), and by the Chow Sang Sang Group Research Fund under Grant DON-RMG 9229161. (Corresponding author: Zitong Yu.)
Rizhao Cai, Chenqi Kong, and Alex Kot are with the ROSE Lab, School of EEE, Nanyang Technological University ({rzcai,chenqi.kong,eackot}@ntu.edu.sg).
Zitong Yu is with the School of Computing and Information Technology, Great Bay University, China ([email protected]).
Haoliang Li is with the Department of Electrical Engineering, City University of Hong Kong ([email protected]).
Changsheng Chen is with Shenzhen University, China ([email protected]).
Yongjian Hu is with the School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China, and with the China-Singapore International Joint Research Institute ([email protected]).

I. INTRODUCTION

Face recognition systems are threatened by face presentation attacks, a.k.a. face spoofing attacks. These attacks involve presenting spoofing examples of human faces to cameras, such as printed photos, digital displays, and 3D masks. Face Anti-Spoofing (FAS) [2], also known as Face Presentation Attack Detection (Face PAD) or Face Liveness Detection, is a crucial technology that aims to enhance the security of face recognition (FR) systems by detecting malicious spoofing attacks.

To safeguard FR systems against such malicious spoofing attacks, various techniques have been extensively researched and developed [2], [3], [4], [5]. Traditional methods are mainly based on handcrafted features and Support Vector Machines [6], [7]. Given the limited representation capability of handcrafted features, traditional methods cannot meet the security requirements of a FAS system. In recent years, deep neural networks have been increasingly incorporated into data-driven face anti-spoofing (FAS) methods to extract learnable features,
surpassing traditional methods [2]. However, the deployment of these models faces the domain shift problem, which arises from differences in data distribution between the source training data and the target testing data, caused by varying data collection conditions such as illumination, cameras, or attack mediums [8]. The resulting overfitting to the training distribution can lead to poor performance on the target testing data, hindering the effective detection of spoofing attacks. To tackle the challenge of cross-domain testing, previous research has explored various techniques, including but not limited to reinforcement learning [9], adversarial learning [10], [11], meta-learning [12], [13], [14], [15], disentanglement learning [16], [17], and causal intervention [18], [19]. Despite the progress achieved, cross-domain generalization performance remains unsatisfactory due to the critical challenge posed by the domain shift problem, and further research effort is still needed.

Recently, cutting-edge Vision Transformer (ViT) models have achieved striking performance with the self-attention mechanism on computer vision tasks [20], [21]. Inspired by the success of ViT, FAS researchers have been exploring the use of ViT to address the face anti-spoofing problem [22], [23]. While training a ViT model for the FAS task from scratch requires a large amount of data to achieve generalized performance, the model weights of an ImageNet pre-trained ViT are easily available from open-source model zoos and can be used to initialize a ViT model trained on FAS data [22], as shown in Fig. 1. However, previous works utilize the pre-trained model by fine-tuning either part or all of the ViT backbone's weights. Such utilization is straightforward but inefficient. Recent research on Efficient Parameter Transfer Learning (EPTL) has shown a more efficient way of utilizing pre-trained ViT models for the FAS problem. Huang et al. [23] utilized multi-stream linear adapters to adapt ViT efficiently and achieved promising generalization performance in the few-shot cross-domain scenario. However, in the zero-shot scenario, i.e., the unseen-target-domain testing scenario [24], the ViT's generalization performance is still inferior to the previous state-of-the-art on the four-dataset benchmark [24]. We identify the limitation as the use of vanilla adapters based on linear layers. Linear layers lack image-aware inductive bias, such as locality, and are thus ineffective in extracting local information [25]. Since FAS data are visual images and local information is crucial for the classification [9], the linear-layer-based adapter fails to capture discriminative local information to efficiently adapt ViT for FAS. Moreover, the feature/token embeddings used for FAS classification are sensitive to the imaging process, such as variations in camera modules and illumination. Such variations between the source training and target testing data cause the domain shift and lead to models' poor domain generalization performance [8]. However, it is non-trivial to learn domain-invariant information by simply using linear layers. Nevertheless, ViT with EPTL points to a promising direction for future research in the field of FAS and deserves further development.

Motivated by the above discussion, we propose a more advanced adapter, named S-Adapter, for the face anti-spoofing problem, to efficiently fine-tune a pre-trained ViT for cross-domain generalized FAS. As illustrated in Fig. 2, our S-Adapter is motivated by traditional texture analysis methods, which collect histogram features from handcrafted feature maps, such as local binary pattern maps [6], [26], to alleviate the negative impact of varying environments, such as lighting. Our S-Adapter first extracts learnable discriminative token maps. Then token histograms are extracted, which provide statistical information and improve robustness against variations in the environment. Furthermore, although this statistical information benefits the model, its effectiveness is still hindered by the style variance between different domains. To reduce the style variance, we propose Token Style Regularization (TSR). The proposed TSR extracts style components based on the Gram matrix and regularizes the style variance of real faces from different domains to be minimized. As such, the statistical information with less style variance is more generalizable for cross-domain FAS. We conduct extensive experiments to show that our proposed method surpasses the vanilla adapter by a clear margin and achieves state-of-the-art performance on existing cross-domain face anti-spoofing benchmarks. The contributions of our work can be summarized as follows:
• We propose a novel S-Adapter to efficiently adapt pre-trained ViT models for generalized face anti-spoofing by extracting statistical information via token histograms;
• We propose a new Token Style Regularization (TSR), which reduces the style variance across different domains to improve the generalization of statistical token histograms;
• The ViT model integrated with our proposed S-Adapter and TSR achieves state-of-the-art generalization performance on existing face anti-spoofing benchmarks, including zero/few-shot cross-domain generalization and unseen attack detection.

II. RELATED WORKS

A. Face Anti-Spoofing

1) Traditional FAS Methods: Traditional face anti-spoofing (FAS) methods rely on handcrafted image descriptors to extract features for classification, such as Local Binary Patterns (LBP) [27], [6], [26], Histogram of Gradient [7], Difference of Gaussian (DoG) [28], and image quality features [29], [30], [31]. The performance of these pioneering methods is limited by the representation capability of handcrafted features, and even their intra-domain performance is not satisfactory.

2) Deep Learning FAS Methods: Recently, numerous FAS methods based on deep neural networks have been proposed to exploit their powerful representation learning capabilities [2], [3]. For example, reinforcement learning has been proposed to mine local and global features for FAS [9]. Pixel-wise supervision has been studied and shows more advanced performance than binary supervision [32], [33], [34], [35]. However, models trained solely on RGB images often suffer from overfitting and poor generalization performance when there are domain shifts between training and testing data [8]. Besides, hybrid
methods, which combine handcrafted features and deep learning, have also been proposed [36], [37], [38], [39], [40]. While the above methods have achieved saturated performance in intra-domain evaluation, more evaluation scenarios have been raised and studied, such as the domain generalization scenario [24], [13], [14], [41], [42], the unsupervised domain adaptation scenario [16], [43], the unseen attack detection scenario [44], and so on.

3) FAS under Different Scenarios: The domain generalization (DG) scenario in FAS aims to learn a model with source data from one or more domains that achieves generalized performance on unseen target domains without using target domain data [24], [14]. Usually, the target domain data is unseen during training, meaning that the data distributions differ between training and testing. This scenario is also referred to as unseen domain generalization or zero-shot cross-domain generalization. In this scenario, methods are expected to learn domain-invariant feature representations, and thus various techniques have been proposed to tackle domain generalization challenges in FAS, such as causal intervention [18], disentangled representation learning [17], [45], [46], [47], meta-learning [15], [13], [14], [12], adversarial learning [24], [10], [11], and contrastive learning [48], [10], [49]. The domain generalization scenario is a crucial challenge, since domain shift would deter an FAS model from being deployed in practical environments. Meanwhile, unsupervised domain adaptation (UDA) for face anti-spoofing utilizes target domain data to adapt a model pre-trained on the source domain data, but without the labels of attack and bona fide examples [50], [43], [16]. Since the accessibility of labels is often not a problem, few-shot cross-domain face anti-spoofing has also been studied, which utilizes a few labeled target domain examples (e.g., 5-shot, 10 examples) during training to achieve strong generalization performance in the target domain [51], [23]. Likewise, one-class adaptation has been studied, where only real face examples are available [52], [53], [49]. Moreover, beyond the common replay, print, and 3D mask attacks [54], more attack types have appeared, such as makeup attacks, partial attacks, and obfuscation attacks. To evaluate a model's performance against unseen attack types, the unseen attack detection scenario has also been proposed and studied [55], [56], [44], [57]. In this work, we extensively evaluate our proposed method in the zero-shot cross-domain (DG) and few-shot domain generalization scenarios, as well as the unseen attack detection scenario.

Fig. 2. (a) The process of the traditional texture analysis method for face anti-spoofing: handcrafted features (LBP) are first extracted, which are often sensitive to illumination changes. Then, histogram features are extracted as the final representation for the classifier, which is more robust to lighting changes. (b) Our adapter extracts local information from spatial tokens and extracts a token histogram, inspired by (a), to improve cross-domain performance.

B. Efficient Parameter Transfer Learning for ViT

The Vision Transformer (ViT) has a key component called Multi-Head Self-Attention (MHSA). In MHSA, the input token X is first transformed into the query (Q), key (K), and value (V): Q = XW^Q, K = XW^K, and V = XW^V, where W^Q, W^K, and W^V are the linear layers that produce Q, K, and V, respectively. The self-attention output is then calculated as Softmax(QK^T / sqrt(d)) V, where d denotes the embedding dimension of Q and K. Efficient Parameter Transfer Learning (EPTL) aims to accelerate the training process on downstream datasets by transferring knowledge from pre-trained ViT models. Typically, only a small number of parameters are updated, while the rest are initialized from the pre-trained model and kept fixed during fine-tuning.
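To make the MHSA notation above concrete, the single-head case of the Q/K/V projection and scaled dot-product attention can be written in a few lines of PyTorch. This is an illustrative sketch only — head splitting, dropout, and the output projection of a full ViT block are omitted, and the class and variable names are ours rather than the paper's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleHeadSelfAttention(nn.Module):
    """Minimal single-head self-attention: Softmax(Q K^T / sqrt(d)) V."""
    def __init__(self, dim):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)  # W^Q
        self.w_k = nn.Linear(dim, dim, bias=False)  # W^K
        self.w_v = nn.Linear(dim, dim, bias=False)  # W^V
        self.scale = dim ** -0.5                    # 1 / sqrt(d)

    def forward(self, x):                           # x: (B, N, C) tokens
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v                             # (B, N, C)

# toy usage: 197 tokens (196 patches + class token) of dimension 768
out = SingleHeadSelfAttention(768)(torch.randn(2, 197, 768))
```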
One EPTL example is the adapter approach [58], which involves training small additional modules (adapters) on top of a pre-trained base model. The adapter contains a few task-specific layers and has been successfully applied to various computer vision tasks, such as object detection and semantic segmentation [1]. Given the input token X, the adapter A usually transforms the tokens as X ← X + A(X).
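The residual token transform X ← X + A(X) is usually realized with a small bottleneck module. The sketch below shows one common form (down-projection, non-linearity, up-projection, zero-initialized so training starts from the identity); the bottleneck width and activation are our assumptions here, not the specific multi-stream design of [23] or the S-Adapter described later.

```python
import torch
import torch.nn as nn

class VanillaAdapter(nn.Module):
    """Bottleneck adapter A(.), applied residually as x <- x + A(x)."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # project tokens down
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)     # project back up
        nn.init.zeros_(self.up.weight)           # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):                        # x: (B, N, C)
        return x + self.up(self.act(self.down(x)))
```

Zero-initializing the up-projection keeps the adapted network identical to the frozen pre-trained ViT at the start of fine-tuning; this is a common stabilization choice rather than a requirement.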
Another EPTL example is the Low-Rank Adaptation (LoRA) method [59], which approximates the weight increments of W^Q and W^K by ΔW^Q and ΔW^K. During fine-tuning, ΔW^Q and ΔW^K are approximated by extra low-rank parameters and updated via backward propagation, while W^Q and W^K are initialized from pre-trained models and kept fixed. Consequently, the query and key become Q = XW^Q + XΔW^Q and K = XW^K + XΔW^K.
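As a sketch of the LoRA update for one projection, the frozen weight is kept as-is while a trainable low-rank increment is added to its output; the rank and initialization below are illustrative choices, not values from [59].

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen projection W plus a trainable low-rank increment: x W + (x A) B."""
    def __init__(self, dim, rank=4):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(dim, dim), requires_grad=False)
        nn.init.xavier_uniform_(self.weight)       # stands in for the pre-trained W^Q or W^K
        self.lora_a = nn.Parameter(torch.zeros(dim, rank))
        self.lora_b = nn.Parameter(torch.zeros(rank, dim))
        nn.init.normal_(self.lora_a, std=0.02)     # B starts at zero, so the increment starts at zero

    def forward(self, x):                          # x: (B, N, C)
        return x @ self.weight + (x @ self.lora_a) @ self.lora_b
```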
Prompt tuning [60] is another EPTL example, in which the input tokens of one or more layers are concatenated with learnable prompt tokens P. This combination of pre-trained models and few-shot learning enables rapid adaptation to new tasks. In the layers with prompts, X ← [X, P]. The tokens X are produced by the fixed pre-trained model, while P is trainable during fine-tuning.
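The prompt-tuning update X ← [X, P] amounts to concatenating a few learnable tokens to the input of a frozen layer. A minimal sketch follows; the number of prompt tokens and the initialization scale are assumptions, and the generic encoder layer below merely stands in for a ViT block.

```python
import torch
import torch.nn as nn

class PromptedLayer(nn.Module):
    """Concatenates learnable prompt tokens P to the input of a frozen layer: [X, P]."""
    def __init__(self, block, dim, num_prompts=10):
        super().__init__()
        self.block = block                                    # a frozen transformer layer
        for p in self.block.parameters():
            p.requires_grad = False
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, dim) * 0.02)

    def forward(self, x):                                     # x: (B, N, C)
        p = self.prompts.expand(x.size(0), -1, -1)            # only P receives gradients
        return self.block(torch.cat([x, p], dim=1))

# toy usage
layer = PromptedLayer(nn.TransformerEncoderLayer(768, nhead=12, batch_first=True), dim=768)
out = layer(torch.randn(2, 197, 768))
```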
In this work, we focus on developing a more advanced adapter by introducing spoofing-aware inductive bias into A when conducting the token transformation. How to develop Prompt and LoRA variants for generalized FAS can be studied in the future. A prior work related to ours is [23], which also adopts adapters for face anti-spoofing. However, [23] only adopts simple multi-stream linear adapters, and their design lacks insight from the data properties of the FAS task, such as locality, fine-grained information, and style variances. Our work incorporates these insights into the design of our proposed S-Adapter.

III. METHODOLOGY

In this section, we first provide preliminary knowledge about how to use adapters to fine-tune the vision transformer. Subsequently, we describe how our proposed S-Adapter is developed. Finally, we describe the overall optimization method, which incorporates the proposed Token Style Regularization into the total loss function.
[Figure (architecture diagram residue). Recoverable elements: patch embedding; N transformer blocks whose MHSA and MLP weights are frozen during training; the proposed S-Adapter branch with Norm, token spatializing, vanilla convolution and central-difference (CD) convolution, Conv 1×1, and dimension reduction; a classifier head. Adapter parameters are updated during training while the backbone is frozen; the input token X has shape 196 × 768 and the token map has shape 8 × 16 × 16.]

Each transformer block contains a Multi-Head Self-Attention (MHSA) layer W_i^MSA and a Multi-Layer Perceptron (MLP) layer W_i^MLP, and each layer is accompanied by a Layer Normalization layer and a non-linear activation layer. By simplifying the skip connections, normalization layers, and activation layers, the inference procedure of each transformer block can be expressed as:

Y = W_i^B(X) = W_i^MLP(W_i^MSA(X)),   (1)

where X and Y are the input and output tokens of the block, respectively.
... local features [9]. Besides, FAS classification needs more fine-grained information [61] and expects features to be robust against variations in imaging environments, such as illumination and camera modules, which is non-trivial for linear layers to learn.

We propose our S-Adapter to address the above limitations and to adapt pre-trained ViT for generalized face anti-spoofing efficiently. As illustrated in Fig. 2, our S-Adapter is inspired by traditional texture analysis methods in face anti-spoofing [6], [26]. In the method using local LBP features, the LBP descriptor is first used to extract raw LBP maps, which are low-level and sensitive to imaging conditions [6]. Then the LBP histogram is collected, lifting the feature level with histogram statistical information. The feature representation with histograms is more robust against lighting variations, which inspires us to introduce statistical information via the proposed S-Adapter.
Given the input token X ∈ R^{N_P×C}, where N_P denotes the number of patch tokens and C denotes the token embedding dimension, we first reshape X into X_R ∈ R^{H×W×C}, with H × W = N_P (the class token is ignored). Then we permute the dimensions to obtain X_M ∈ R^{C×H×W}. Consequently, the tokens are represented in a 2D-image style, enabling the use of widely used PyTorch-style 2D convolutions for learning.

With the spatial tokens X_M, we apply a 2D convolution W^Conv to extract the token map Z in a learnable way:

Z = W^Conv(X_M).   (3)
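The token-spatializing step and Eq. (3) map directly onto standard tensor operations. The sketch below assumes ViT-Base-like shapes (196 patch tokens of dimension 768, i.e., a 14 × 14 grid after the class token is removed); the number of output channels of W^Conv is an assumption for illustration.

```python
import torch
import torch.nn as nn

def spatialize_tokens(x):
    """(B, N_P, C) patch tokens -> (B, C, H, W) spatial token map, with H * W = N_P."""
    b, n, c = x.shape
    h = w = int(n ** 0.5)                                  # e.g. 196 patch tokens -> 14 x 14
    return x.reshape(b, h, w, c).permute(0, 3, 1, 2).contiguous()

w_conv = nn.Conv2d(768, 8, kernel_size=3, padding=1)       # W^Conv; 8 output channels assumed

tokens = torch.randn(2, 196, 768)                          # class token already removed
z = w_conv(spatialize_tokens(tokens))                      # token map Z: (2, 8, 14, 14)
```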
Moreover, considering that the features of spoofing artifacts often lie in fine-grained details, which can be represented by gradients based on the Central Difference (CD), we extract token gradients based on the central difference [61] of the tokens.
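A common way to realize a central-difference convolution on the token map (following the CDC formulation of [61]; this is our paraphrase rather than released code) is to compute a vanilla convolution and subtract a θ-weighted response of the kernel sum at the centre position, so that θ = 0 recovers the vanilla convolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CDConv2d(nn.Module):
    """Central-difference convolution on a token map; theta = 0 recovers vanilla convolution."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1, theta=0.7):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding, bias=False)
        self.theta = theta

    def forward(self, x):                                   # x: (B, in_ch, H, W)
        out = self.conv(x)                                  # vanilla aggregation term
        if self.theta == 0:
            return out
        # central-difference term: kernel weights summed and applied to the centre pixel
        kernel_sum = self.conv.weight.sum(dim=(2, 3), keepdim=True)   # (out_ch, in_ch, 1, 1)
        return out - self.theta * F.conv2d(x, kernel_sum)

grad_map = CDConv2d(768, 8)(torch.randn(2, 768, 14, 14))   # token gradient map: (2, 8, 14, 14)
```

Here θ balances the vanilla response against the central-difference (gradient) information, matching the θ = 0 and θ = 0.7 settings discussed in the ablation study later.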
where λ is a constant scaling factor. When calculating L_TSR, we use the Z from the last transformer block to calculate the Gram matrix.
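The extract above does not include the full definition of L_TSR, so the following is only one plausible reading of "regularizing the style variance of real faces from different domains" with Gram matrices: compute a per-domain mean Gram matrix of the real-face token maps Z and penalize their spread around the global mean. The function names and the exact penalty form are assumptions for illustration, not the paper's definition.

```python
import torch

def gram_matrix(z):
    """Channel-wise Gram matrix of a token map: (B, C, H, W) -> (B, C, C)."""
    b, c, h, w = z.shape
    f = z.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def token_style_regularization(z_real, domain_ids):
    """Illustrative style-variance penalty over real-face token maps grouped by domain."""
    grams = gram_matrix(z_real)                                     # (B, C, C)
    means = torch.stack([grams[domain_ids == d].mean(dim=0)
                         for d in torch.unique(domain_ids)])        # (D, C, C)
    return ((means - means.mean(dim=0, keepdim=True)) ** 2).mean()

# total objective (lambda is the constant scaling factor mentioned above):
# loss = classification_loss + lam * token_style_regularization(z_real, domain_ids)
```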
IV. EXPERIMENT

A. Datasets, Protocols, and Implementations

Our experiments involve several benchmark datasets, including CASIA-FASD [64], IDIAP REPLAY-ATTACK [27], MSU MFSD [29], OULU-NPU [65], and SiW-M [56]. To evaluate our models, we employ the Half Total Error Rate (HTER), the Average Classification Error Rate (ACER), the Equal Error Rate (EER), the Area Under the Receiver Operating Characteristic Curve (AUC), and the True Positive Rate (TPR) at a False Positive Rate (FPR) of 1% (TPR@FPR=1%). These rigorous evaluation procedures ensure the reliability and validity of our findings.
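The metrics above are standard; for reference (this is not the paper's evaluation code), they can be computed from per-sample liveness scores roughly as follows, where the HTER threshold would normally be fixed on a development set rather than hard-coded.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def fas_metrics(scores, labels, threshold=0.5):
    """scores: higher = more likely bona fide; labels: 1 = bona fide, 0 = attack."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    far = np.mean(scores[labels == 0] >= threshold)          # attacks wrongly accepted
    frr = np.mean(scores[labels == 1] < threshold)           # bona fide wrongly rejected
    hter = (far + frr) / 2.0
    fpr, tpr, _ = roc_curve(labels, scores)
    eer = fpr[np.nanargmin(np.abs(fpr - (1.0 - tpr)))]       # operating point where FPR ~ FNR
    auc = roc_auc_score(labels, scores)
    tpr_at_fpr1 = tpr[np.searchsorted(fpr, 0.01, side="right") - 1]
    return hter, eer, auc, tpr_at_fpr1
```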
To conduct our experiments, we utilized the PyTorch 1.9 framework and performed training and testing on a single NVIDIA RTX 2080 Ti GPU. We follow [22], [23] and use ViT-Base as the ViT backbone. For data processing, we utilized MTCNN [66] to detect faces and resized the cropped face images to 224 × 224 as the input to the ViT-Base. During training, we employed the Adam optimizer with an initial learning rate of 0.0001.

TABLE I (fragment recovered from the extraction; Protocols 3 and 4 of OULU-NPU). Columns are APCER (%), BPCER (%), and ACER (%).
Protocol 3:
  Spoof-Trace [17]        1.6±1.6    4.0±5.4    2.8±3.3
  CDCN [61]               2.4±1.3    2.2±2.0    2.3±1.4
  CDCN++ [61]             1.7±1.5    2.0±1.2    1.8±0.7
  EPCR [67]               0.4±0.5    2.5±3.8    1.5±2.0
  S-Adapter-TSR (Ours)    0.4±0.6    1.8±1.2    1.1±1.0
Protocol 4:
  Auxiliary [32]          9.3±5.6    10.4±6.0   9.5±6.0
  DCL-FAS [9]             8.1±2.7    6.9±5.8    7.2±3.9
  Spoof-Trace [17]        2.3±3.6    5.2±5.4    3.8±4.2
  CDCN [61]               4.6±4.6    9.2±8.0    6.9±2.9
  CDCN++ [61]             4.2±3.4    5.8±4.9    5.0±2.9
  EPCR [67]               0.8±2.0    7.5±11.7   3.3±4.9
  S-Adapter-TSR (Ours)    1.5±3.1    3.9±4.6    2.7±3.5

B. Intra-Domain Evaluation

We first report the intra-domain experiments using the OULU-NPU dataset's four protocols [65]; the experimental results are given in Table I. The metrics used are the Attack Presentation Classification Error Rate (APCER), the Bona Fide Presentation Classification Error Rate (BPCER), and the Average Classification Error Rate (ACER), where ACER is the average of APCER and BPCER. Compared with the state-of-the-art methods, our method shows prominent performance, as it generally achieves the best ACER. Thus, our method is effective in the intra-dataset experiments.

C. Cross-Domain Evaluation

1) Leave-one-out cross-domain benchmark: To begin with, we compare our proposed method with state-of-the-art methods on the leave-one-out cross-domain benchmark [24], which consists of the CASIA-FASD (C) [64], IDIAP REPLAY-ATTACK (I) [27], MSU MFSD (M) [29], and OULU-NPU (O) [65] datasets. This benchmark can also be referred to as the MICO benchmark [14] and has been widely used for cross-domain performance evaluation [10], [11], [23], [15], [13], [12]. We follow the MICO benchmark's protocols described in [24] and present our HTER and AUC results in Table II. For a fair comparison, we extract the results of ViT† from [23] obtained without using any supplementary data from the CelebA-Spoof dataset [77]; ViT† utilizes the same ViT-Base backbone as our proposed approach. As presented in Table II, our method outperforms ViT† significantly in all
TABLE II
Experimental results on the leave-one-out benchmark MICO. Results are reported in terms of HTER (%) and AUC (%).

TABLE IV
Results of the 5-shot cross-domain experiment. Five bona fide examples and five attack examples from the target domain are used to fine-tune the pre-trained model. Results are reported in terms of HTER (%), AUC (%), and TPR (%)@FPR=1%.

TABLE V
Results of the LOO protocols on the SiW-M dataset [56]. The ACER (%) values reported on the testing sets are obtained with a threshold of 0.5. The best results are bolded.
Method | Metric (%) | Replay | Print | Mask (Half / Silicone / Trans / Paper / Manne) | Makeup (Obfusc / Imperson / Cosmetic) | Partial (Funny Eye / Paper Glasses / Partial Paper) | Average

Auxiliary [32]         ACER  16.8   6.9  19.3  14.9  52.1   8.0  12.8  55.8  13.7  11.7  49.0  40.5   5.3  23.6±18.5
                       EER   14.0   4.3  11.6  12.4  24.6   7.8  10.0  72.3  10.1   9.4  21.4  18.6   4.0  17.0±17.7
BCN [73]               ACER  12.8   5.7  10.7  10.3  14.9   1.9   2.4  32.3   0.8  12.9  22.9  16.5   1.7  11.2±9.2
                       EER   13.4   5.2   8.3   9.7  13.6   5.8   2.5  33.8   0.0  14.0  23.3  16.6   1.2  11.3±9.5
CDCN++ [61]            ACER  10.8   7.3   9.1  10.3  18.8   3.5   5.6  42.1   0.8  14.0  24.0  17.6   1.9  12.7±11.2
                       EER    9.2   5.6   4.2  11.1  19.3   5.9   5.0  43.5   0.0  14.0  23.3  14.3   0.0  11.9±11.8
DC-CDN [74]            ACER  12.1   9.7  14.1   7.2  14.8   4.5   1.6  40.1   0.4  11.4  20.1  16.1   2.9  11.9±10.3
                       EER   10.3   8.7  11.1   7.4  12.5   5.9   0.0  39.1   0.0  12.0  18.9  13.5   1.2  10.8±10.1
SpoofTrace [17]        ACER   7.8   7.3   7.1  12.9  13.9   4.3   6.7  53.2   4.6  19.5  20.7  21.0   5.6  14.2±13.2
                       EER    7.6   3.8   8.4  13.8  14.5   5.3   4.4  35.4   0.0  19.3  21.0  20.8   1.6  12.0±10.0
DTN [75]               ACER   9.8   6.0  15.0  18.7  36.0   4.5   7.7  48.1  11.4  14.2  19.3  19.8   8.5  16.8±11.1
                       EER   10.0   2.1  14.4  18.6  26.5   5.7   9.6  50.2  10.1  13.2  19.8  20.5   8.8  16.1±12.2
DTN (MT) [13]          ACER   9.5   7.6  13.1  16.7  20.6   2.9   5.6  34.2   3.8  12.4  19.0  20.8   3.9  13.1±8.7
                       EER    9.1   7.8  14.5  14.1  18.7   3.6   6.9  35.2   3.2  11.3  18.1  17.9   3.5  12.6±8.5
FAS-DR (Depth) [13]    ACER   7.8   5.9  13.4  11.7  17.4   5.4   7.4  39.0   2.3  12.6  19.6  18.4   2.4  12.6±9.5
                       EER    8.0   4.9  10.8  10.2  14.3   3.9   8.6  45.8   1.0  13.3  16.1  15.6   1.2  11.8±11.0
FAS-DR (MT) [13]       ACER   6.3   4.9   9.3   7.3  12.0   3.3   3.3  39.5   0.2  10.4  21.0  18.4   1.1  10.5±10.3
                       EER    7.8   4.4  11.2   5.8  11.2   2.8   2.7  38.9   0.2  10.1  20.5  18.9   1.3  10.4±10.2
ViT [76]               ACER  11.35  5.58  3.44  9.63 16.73  1.47  2.89 26.60  1.90  9.04 23.14 11.23  2.44   9.65±8.19
                       EER   11.18  7.32  3.89  9.63 14.32  0.00  3.50 23.48  1.64  9.20 20.38 11.32  1.86   9.06±7.21
ViT-S-Adapter (Ours)   ACER   8.93  4.08  1.81  2.02  1.61  0.39  0.62  4.00  1.09  6.60 13.09  0.54  0.43   3.48±3.90
                       EER    5.38  3.48  1.67  2.96  1.36  0.00  0.00  4.35  0.00  7.20 10.25  0.48  0.23   2.87±3.20
TABLE VI
Results of our S-Adapter and TSR for different ViT backbones: ViT-Large, ViT-Small, and ViT-Tiny.
... Hist (θ = 0)", where both the histogram layers and the token gradient (θ = 0) are removed. The experimental results are provided in Fig. 5. It can be seen that our S-Adapter generally outperforms the other two configurations, illustrating the advantages of extracting the token histogram. We observe that the token gradient also contributes to lower HTER values in most cases. However, in the "C&I&M to O" experiment, the inclusion of token gradient information results in an increased HTER. We conjecture that this unexpected result may be attributed to the disparity in texture between the low-resolution source domains (I, C, and M) and the high-resolution target domain (O). Although fine-grained texture information is extracted in the gradient, the domain gap might cause the texture to differ significantly between the low-resolution and high-resolution domains. In contrast, our histogram layers provide a more comprehensive representation of texture information across resolutions. This is evident in the lower HTER achieved by our S-Adapter compared to the other two configurations in the "C&I&M to O" experiment. In summary, our proposed S-Adapter demonstrates performance improvements by leveraging statistical information to enhance cross-domain performance, highlighting the benefits of incorporating a token histogram extracted from the token map together with the gradient information.
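The "histogram layers" discussed here pool a token map into per-channel statistics. As one way to picture such a layer — a simplified soft-binning form with learnable bin centres and widths, which is an assumption of ours and not the exact layer used in the paper:

```python
import torch
import torch.nn as nn

class SoftHistogram(nn.Module):
    """Differentiable per-channel histogram pooling over a token map (simplified form)."""
    def __init__(self, channels, num_bins=8):
        super().__init__()
        self.centers = nn.Parameter(torch.linspace(-1.0, 1.0, num_bins).repeat(channels, 1))
        self.widths = nn.Parameter(torch.ones(channels, num_bins))

    def forward(self, z):                                     # z: (B, C, H, W)
        b, c, h, w = z.shape
        v = z.reshape(b, c, 1, h * w)                         # (B, C, 1, HW)
        centers = self.centers.reshape(1, c, -1, 1)           # (1, C, K, 1)
        widths = self.widths.reshape(1, c, -1, 1)
        soft_counts = torch.exp(-(widths * (v - centers)) ** 2)
        return soft_counts.mean(dim=-1)                       # (B, C, K) histogram features

hist = SoftHistogram(8)(torch.randn(2, 8, 14, 14))            # (2, 8, 8)
```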
Moreover, we validate the proposed CDC layer and histogram (Hist) layer components on top of a standard vanilla adapter. To achieve this, after a standard vanilla linear adapter, we add the CDC layer (θ = 0.7) and the histogram layer; the results are denoted and reported in Fig. 6 as 'Adapter+CDC' and 'Adapter+CDC+Hist'. As illustrated in Fig. 6, our proposed histogram layer also benefits the vanilla adapter.
[Figure residue: legend entries "S-Adapter", "S-Adapter w/o Hist", "S-Adapter w/o Hist (θ = 0)"; y-axis "Half Total Error Rate"; chart title "HTER (%) and AUC (%) results of different fusion strategies"; the underlying bar values are not recoverable from this extraction.]