2024 - Generalizing VT For Face Anti-Spoofing
Manuscript received September 2023, revised April 2024. This work was done at the Rapid-Rich Object Search (ROSE) Lab, School of Electrical and Electronic Engineering (EEE), Nanyang Technological University (NTU). This research is supported in part by the NTU-PKU Joint Research Institute (a collaboration between NTU and Peking University sponsored by a donation from the Ng Teng Fong Charitable Foundation) and by the Science and Technology Foundation of Guangzhou Huangpu Development District under Grant 2022GH15. This work is also partially supported by the National Natural Science Foundation of China under Grants 62371301 and 62306061, by the Guangdong Basic and Applied Basic Research Foundation (Grant No. 2023A1515140037), and by the Chow Sang Sang Group Research Fund under Grant DON-RMG 9229161. (Corresponding author: Zitong Yu.)
Rizhao Cai, Chenqi Kong, and Alex Kot are with the ROSE Lab, School of EEE, Nanyang Technological University ({rzcai,chenqi.kong,eackot}@ntu.edu.sg).
Zitong Yu is with the School of Computing and Information Technology, Great Bay University, China ([email protected]).
Haoliang Li is with the Department of Electrical Engineering, City University of Hong Kong ([email protected]).
Changsheng Chen is with Shenzhen University, China ([email protected]).
Yongjian Hu is with the School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China, and with the China-Singapore International Joint Research Institute ([email protected]).

I. INTRODUCTION

Face recognition systems are threatened by face presentation attacks, a.k.a. face spoofing attacks. These attacks involve presenting spoofing examples of human faces to cameras, such as printed photos, digital displays, and 3D masks. Face Anti-Spoofing (FAS) [2], also known as Face Presentation Attack Detection (Face PAD) or Face Liveness Detection, is a crucial technology that aims to enhance the security of face recognition (FR) systems by detecting malicious spoofing attacks.

To safeguard FR systems against such malicious spoofing attacks, various techniques have been extensively researched and developed [2], [3], [4], [5]. Traditional methods are mainly based on handcrafted features and Support Vector Machines [6], [7]. Given the limited representation capability of handcrafted features, traditional methods cannot meet the security requirements of a FAS system. In recent years, deep neural networks have been increasingly incorporated into data-driven face anti-spoofing (FAS) methods to extract learnable features,
surpassing traditional methods [2]. However, the deployment of these models faces the domain shift problem, which arises from differences in data distribution between the source training data and the target testing data, caused by varying data collection conditions such as illumination, cameras, or attack mediums [8]. The resulting overfitting to the training distribution can lead to poor performance on the target testing data, hindering the effective detection of spoofing attacks. To tackle the challenge of cross-domain testing, previous research has explored various techniques, including but not limited to reinforcement learning [9], adversarial learning [10], [11], meta-learning [12], [13], [14], [15], disentanglement learning [16], [17], and causal intervention [18], [19]. Despite the progress achieved, cross-domain generalization performance remains unsatisfactory due to the critical challenge posed by the domain shift problem, and further research effort is still needed.

Recently, cutting-edge Vision Transformer (ViT) models have achieved striking performance with the self-attention mechanism on computer vision tasks [20], [21]. Inspired by the success of ViT, FAS researchers have been exploring the use of ViT to address the face anti-spoofing problem [22], [23]. While training a ViT model for the FAS task from scratch requires a large amount of data to achieve generalized performance, the model weights of an ImageNet pre-trained ViT are easily available from open-source model zoos and can be used to initialize a ViT model trained on FAS data [22], as shown in Fig. 1. However, previous works utilize the pre-trained model by fine-tuning either part or all of the ViT backbone's weights. Such utilization is straightforward but inefficient. Recent research on Efficient Parameter Transfer Learning (EPTL) has shown a more efficient way of utilizing pre-trained ViT models for the FAS problem. Huang et al. [23] utilized multi-stream linear adapters to adapt ViT efficiently and achieved promising generalization performance in the few-shot cross-domain scenario. However, in the zero-shot scenario, i.e., the unseen-target-domain testing scenario [24], the ViT's generalization performance is still inferior to the previous state-of-the-art on the four-dataset benchmark [24]. We identify the limitation as the use of vanilla adapters based on linear layers. Linear layers lack image-aware inductive bias, such as locality, and are thus ineffective in extracting local information [25]. Since FAS data are visual images and local information is crucial for the classification [9], the linear-layer-based adapter fails to capture discriminative local information to efficiently adapt ViT for FAS. Moreover, the feature/token embeddings used for FAS classification are sensitive to the imaging process, such as variations in camera modules and illumination. Such variations between the source training and target testing data cause the domain shift and lead to models' poor domain generalization performance [8]. However, it is non-trivial to learn domain-invariant information by simply using linear layers. Nevertheless, ViT with EPTL points to a promising direction for future research in the field of FAS and deserves further development.

Motivated by the above discussion, we propose a more advanced adapter, named S-Adapter, for the face anti-spoofing problem, to efficiently fine-tune a pre-trained ViT for cross-domain generalized FAS. As illustrated in Fig. 2, our S-Adapter is motivated by traditional texture analysis methods, which collect histogram features from handcrafted feature maps, such as local binary pattern maps [6], [26], to alleviate the negative impact of varying environments, such as lighting. Our S-Adapter first extracts learnable discriminative token maps. Then token histograms are extracted, which provide statistical information and improve robustness against variations in the environment. Furthermore, although this statistical information benefits the model, its effectiveness is still hindered by the style variance between different domains. To reduce the style variance, we propose Token Style Regularization (TSR). The proposed TSR extracts style components based on the Gram matrix and regularizes the style variance of real faces from different domains to be minimized. As such, the statistical information with less style variance is more generalizable for cross-domain FAS. We conduct extensive experiments to show that our proposed method surpasses the vanilla adapter by a clear margin and achieves state-of-the-art performance on existing cross-domain face anti-spoofing benchmarks. The contributions of our work can be summarized as follows:
• We propose a novel S-Adapter to efficiently adapt pre-trained ViT models for generalized face anti-spoofing by extracting statistical information via token histograms;
• We propose a new Token Style Regularization (TSR), which reduces the style variance across different domains to improve the generalization of statistical token histograms;
• The ViT model integrated with our proposed S-Adapter and TSR achieves state-of-the-art generalization performance on existing face anti-spoofing benchmarks, including zero/few-shot cross-domain generalization and unseen attack detection.

II. RELATED WORKS

A. Face Anti-Spoofing

1) Traditional FAS Methods: Traditional face anti-spoofing (FAS) methods rely on handcrafted image descriptors to extract features for classification, such as Local Binary Patterns (LBP) [27], [6], [26], Histogram of Gradient [7], Difference of Gaussian (DoG) [28], and image quality features [29], [30], [31]. The performance of these pioneering methods is limited by the representation capability of handcrafted features, and even their intra-domain performance is not satisfactory.

2) Deep Learning FAS Methods: Recently, numerous FAS methods based on deep neural networks have been proposed to exploit their powerful representation learning capabilities [2], [3]. For example, reinforcement learning has been proposed to mine local and global features for FAS [9]. Pixel-wise supervision has been studied and shows more advanced performance than binary supervision [32], [33], [34], [35]. However, models trained solely on RGB images often suffer from overfitting and poor generalization performance when there are domain shifts between training and testing data [8]. Besides, hybrid
methods, which combine handcrafted features and deep learning, have also been proposed [36], [37], [38], [39], [40]. While the above methods have achieved saturated performance in intra-domain evaluation, more evaluation scenarios have been raised and studied, such as the domain generalization scenario [24], [13], [14], [41], [42], the unsupervised domain adaptation scenario [16], [43], the unseen attack detection scenario [44], and so on.

3) FAS under Different Scenarios: The domain generalization (DG) scenario in FAS aims to learn a model with source data from one or more domains that achieves generalized performance on unseen target domains without using target domain data [24], [14]. Usually, the target domain data is unseen during training, meaning that the data distributions differ between training and testing. This scenario is also referred to as unseen domain generalization or zero-shot cross-domain generalization. In this scenario, methods are expected to learn domain-invariant feature representations, and thus various techniques have been proposed to tackle domain generalization challenges in FAS, such as causal intervention [18], disentangled representation learning [17], [45], [46], [47], meta-learning [15], [13], [14], [12], adversarial learning [24], [10], [11], and contrastive learning [48], [10], [49]. The domain generalization scenario is a crucial challenge, since domain shift would deter an FAS model from being deployed in practical environments. Meanwhile, unsupervised domain adaptation (UDA) for face anti-spoofing utilizes target domain data to adapt a model pre-trained on the source domain data, but without the labels of attack and bona fide examples [50], [43], [16]. Since the accessibility of labels is often not a problem, few-shot cross-domain face anti-spoofing has also been studied, which utilizes a few labeled target domain examples (e.g., 5-shot, 10 examples) during training to achieve strong generalization performance in the target domain [51], [23]. Likewise, one-class adaptation has been studied, where only real face examples are available [52], [53], [49]. Moreover, beyond the common replay, print, and 3D mask attacks [54], more attack types have appeared, such as makeup attacks, partial attacks, and obfuscation attacks. To evaluate a model's performance against unseen attack types, the unseen attack detection scenario has also been proposed and studied [55], [56], [44], [57]. In this work, we extensively evaluate our proposed method in the zero-shot cross-domain (DG) and few-shot domain generalization scenarios, as well as the unseen attack detection scenario.

Fig. 2. (a) The process of the traditional texture analysis method for face anti-spoofing: handcrafted features (LBP) are first extracted, which are often sensitive to illumination changes. Then, histogram features are extracted as the final representation for the classifier, which is more robust to lighting changes. (b) Our adapter extracts local information from spatial tokens and extracts a token histogram, inspired by (a), to improve cross-domain performance.

B. Efficient Parameter Transfer Learning for ViT

The Vision Transformer (ViT) has a key component called Multi-Head Self-Attention (MHSA). In MHSA, the input token X is first transformed into the query (Q), key (K), and value (V): Q = XW^Q, K = XW^K, and V = XW^V, where W^Q, W^K, and W^V are the linear layers that produce Q, K, and V, respectively. The self-attention output is then calculated as Softmax(QK^T / sqrt(d)) V, where d denotes the embedding dimension of Q and K. Efficient Parameter Transfer Learning (EPTL) aims to accelerate the training process on downstream datasets by transferring knowledge from pre-trained ViT models. Typically, only a small number of parameters are updated, while the rest are initialized from the pre-trained model and kept fixed during fine-tuning.
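To make the MHSA notation above concrete, the single-head case of the Q/K/V projection and scaled dot-product attention can be written in a few lines of PyTorch. This is an illustrative sketch only — head splitting, dropout, and the output projection of a full ViT block are omitted, and the class and variable names are ours rather than the paper's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleHeadSelfAttention(nn.Module):
    """Minimal single-head self-attention: Softmax(Q K^T / sqrt(d)) V."""
    def __init__(self, dim):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)  # W^Q
        self.w_k = nn.Linear(dim, dim, bias=False)  # W^K
        self.w_v = nn.Linear(dim, dim, bias=False)  # W^V
        self.scale = dim ** -0.5                    # 1 / sqrt(d)

    def forward(self, x):                           # x: (B, N, C) tokens
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v                             # (B, N, C)

# toy usage: 197 tokens (196 patches + class token) of dimension 768
out = SingleHeadSelfAttention(768)(torch.randn(2, 197, 768))
```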
One EPTL example is the adapter approach [58], which involves training small additional modules (adapters) on top of a pre-trained base model. The adapter contains a few task-specific layers and has been successfully applied to various computer vision tasks, such as object detection and semantic segmentation [1]. Given the input token X, the adapter A usually transforms the tokens as X ← X + A(X).
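The residual token transform X ← X + A(X) is usually realized with a small bottleneck module. The sketch below shows one common form (down-projection, non-linearity, up-projection, zero-initialized so training starts from the identity); the bottleneck width and activation are our assumptions here, not the specific multi-stream design of [23] or the S-Adapter described later.

```python
import torch
import torch.nn as nn

class VanillaAdapter(nn.Module):
    """Bottleneck adapter A(.), applied residually as x <- x + A(x)."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # project tokens down
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)     # project back up
        nn.init.zeros_(self.up.weight)           # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):                        # x: (B, N, C)
        return x + self.up(self.act(self.down(x)))
```

Zero-initializing the up-projection keeps the adapted network identical to the frozen pre-trained ViT at the start of fine-tuning; this is a common stabilization choice rather than a requirement.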
Another EPTL example is the Low-Rank Adaptation (LoRA) method [59], which approximates the weight increments of W^Q and W^K by ΔW^Q and ΔW^K. During fine-tuning, ΔW^Q and ΔW^K are approximated by extra low-rank parameters and updated via backward propagation, while W^Q and W^K are initialized from pre-trained models and kept fixed. Consequently, the query and key become Q = XW^Q + XΔW^Q and K = XW^K + XΔW^K.
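As a sketch of the LoRA update for one projection, the frozen weight is kept as-is while a trainable low-rank increment is added to its output; the rank and initialization below are illustrative choices, not values from [59].

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen projection W plus a trainable low-rank increment: x W + (x A) B."""
    def __init__(self, dim, rank=4):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(dim, dim), requires_grad=False)
        nn.init.xavier_uniform_(self.weight)       # stands in for the pre-trained W^Q or W^K
        self.lora_a = nn.Parameter(torch.zeros(dim, rank))
        self.lora_b = nn.Parameter(torch.zeros(rank, dim))
        nn.init.normal_(self.lora_a, std=0.02)     # B starts at zero, so the increment starts at zero

    def forward(self, x):                          # x: (B, N, C)
        return x @ self.weight + (x @ self.lora_a) @ self.lora_b
```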
Prompt tuning [60] is another EPTL example, in which the input tokens of one or more layers are concatenated with learnable prompt tokens P. This combination of pre-trained models and few-shot learning enables rapid adaptation to new tasks. In the layers with prompts, X ← [X, P]. The tokens X are produced by the fixed pre-trained model, while P is trainable during fine-tuning.
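The prompt-tuning update X ← [X, P] amounts to concatenating a few learnable tokens to the input of a frozen layer. A minimal sketch follows; the number of prompt tokens and the initialization scale are assumptions, and the generic encoder layer below merely stands in for a ViT block.

```python
import torch
import torch.nn as nn

class PromptedLayer(nn.Module):
    """Concatenates learnable prompt tokens P to the input of a frozen layer: [X, P]."""
    def __init__(self, block, dim, num_prompts=10):
        super().__init__()
        self.block = block                                    # a frozen transformer layer
        for p in self.block.parameters():
            p.requires_grad = False
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, dim) * 0.02)

    def forward(self, x):                                     # x: (B, N, C)
        p = self.prompts.expand(x.size(0), -1, -1)            # only P receives gradients
        return self.block(torch.cat([x, p], dim=1))

# toy usage
layer = PromptedLayer(nn.TransformerEncoderLayer(768, nhead=12, batch_first=True), dim=768)
out = layer(torch.randn(2, 197, 768))
```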
In this work, we focus on developing a more advanced adapter by introducing spoofing-aware inductive bias into A when conducting the token transformation. How to develop Prompt and LoRA variants for generalized FAS can be studied in the future. A prior work related to ours is [23], which also adopts adapters for face anti-spoofing. However, [23] only adopts simple multi-stream linear adapters, and their design lacks insight from the data properties of the FAS task, such as locality, fine-grained information, and style variances. Our work incorporates these insights into the design of our proposed S-Adapter.

III. METHODOLOGY

In this section, we first provide preliminary knowledge about how to use adapters to fine-tune the vision transformer. Subsequently, we describe how our proposed S-Adapter is developed. Finally, we describe the overall optimization method, which incorporates the proposed Token Style Regularization into the total loss function.
[Figure (architecture diagram residue). Recoverable elements: patch embedding; N transformer blocks whose MHSA and MLP weights are frozen during training; the proposed S-Adapter branch with Norm, token spatializing, vanilla convolution and central-difference (CD) convolution, Conv 1×1, and dimension reduction; a classifier head. Adapter parameters are updated during training while the backbone is frozen; the input token X has shape 196 × 768 and the token map has shape 8 × 16 × 16.]

Each transformer block contains a Multi-Head Self-Attention (MHSA) layer W_i^MSA and a Multi-Layer Perceptron (MLP) layer W_i^MLP, and each layer is accompanied by a Layer Normalization layer and a non-linear activation layer. By simplifying the skip connections, normalization layers, and activation layers, the inference procedure of each transformer block can be expressed as:

Y = W_i^B(X) = W_i^MLP(W_i^MSA(X)),   (1)

where X and Y are the input and output tokens of the block, respectively.
... local features [9]. Besides, FAS classification needs more fine-grained information [61] and expects features to be robust against variations in imaging environments, such as illumination and camera modules, which is non-trivial for linear layers to learn.

We propose our S-Adapter to address the above limitations and to adapt pre-trained ViT for generalized face anti-spoofing efficiently. As illustrated in Fig. 2, our S-Adapter is inspired by traditional texture analysis methods in face anti-spoofing [6], [26]. In the method using local LBP features, the LBP descriptor is first used to extract raw LBP maps, which are low-level and sensitive to imaging conditions [6]. Then the LBP histogram is collected, lifting the feature level with histogram statistical information. The feature representation with histograms is more robust against lighting variations, which inspires us to introduce statistical information via the proposed S-Adapter.
Given the input token X ∈ R^{N_P×C}, where N_P denotes the number of patch tokens and C denotes the token embedding dimension, we first reshape X into X_R ∈ R^{H×W×C}, with H × W = N_P (the class token is ignored). Then we permute the dimensions to obtain X_M ∈ R^{C×H×W}. Consequently, the tokens are represented in a 2D-image style, enabling the use of widely used PyTorch-style 2D convolutions for learning.

With the spatial tokens X_M, we apply a 2D convolution W^Conv to extract the token map Z in a learnable way:

Z = W^Conv(X_M).   (3)
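The token-spatializing step and Eq. (3) map directly onto standard tensor operations. The sketch below assumes ViT-Base-like shapes (196 patch tokens of dimension 768, i.e., a 14 × 14 grid after the class token is removed); the number of output channels of W^Conv is an assumption for illustration.

```python
import torch
import torch.nn as nn

def spatialize_tokens(x):
    """(B, N_P, C) patch tokens -> (B, C, H, W) spatial token map, with H * W = N_P."""
    b, n, c = x.shape
    h = w = int(n ** 0.5)                                  # e.g. 196 patch tokens -> 14 x 14
    return x.reshape(b, h, w, c).permute(0, 3, 1, 2).contiguous()

w_conv = nn.Conv2d(768, 8, kernel_size=3, padding=1)       # W^Conv; 8 output channels assumed

tokens = torch.randn(2, 196, 768)                          # class token already removed
z = w_conv(spatialize_tokens(tokens))                      # token map Z: (2, 8, 14, 14)
```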
Moreover, considering that the features of spoofing artifacts often lie in fine-grained details, which can be represented by gradients based on the Central Difference (CD), we extract token gradients based on the central difference [61] of the tokens.
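A common way to realize a central-difference convolution on the token map (following the CDC formulation of [61]; this is our paraphrase rather than released code) is to compute a vanilla convolution and subtract a θ-weighted response of the kernel sum at the centre position, so that θ = 0 recovers the vanilla convolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CDConv2d(nn.Module):
    """Central-difference convolution on a token map; theta = 0 recovers vanilla convolution."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1, theta=0.7):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding, bias=False)
        self.theta = theta

    def forward(self, x):                                   # x: (B, in_ch, H, W)
        out = self.conv(x)                                  # vanilla aggregation term
        if self.theta == 0:
            return out
        # central-difference term: kernel weights summed and applied to the centre pixel
        kernel_sum = self.conv.weight.sum(dim=(2, 3), keepdim=True)   # (out_ch, in_ch, 1, 1)
        return out - self.theta * F.conv2d(x, kernel_sum)

grad_map = CDConv2d(768, 8)(torch.randn(2, 768, 14, 14))   # token gradient map: (2, 8, 14, 14)
```

Here θ balances the vanilla response against the central-difference (gradient) information, matching the θ = 0 and θ = 0.7 settings discussed in the ablation study later.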
where λ is a constant scaling factor. When calculating L_TSR, we use the Z from the last transformer block to calculate the Gram matrix.
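The extract above does not include the full definition of L_TSR, so the following is only one plausible reading of "regularizing the style variance of real faces from different domains" with Gram matrices: compute a per-domain mean Gram matrix of the real-face token maps Z and penalize their spread around the global mean. The function names and the exact penalty form are assumptions for illustration, not the paper's definition.

```python
import torch

def gram_matrix(z):
    """Channel-wise Gram matrix of a token map: (B, C, H, W) -> (B, C, C)."""
    b, c, h, w = z.shape
    f = z.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def token_style_regularization(z_real, domain_ids):
    """Illustrative style-variance penalty over real-face token maps grouped by domain."""
    grams = gram_matrix(z_real)                                     # (B, C, C)
    means = torch.stack([grams[domain_ids == d].mean(dim=0)
                         for d in torch.unique(domain_ids)])        # (D, C, C)
    return ((means - means.mean(dim=0, keepdim=True)) ** 2).mean()

# total objective (lambda is the constant scaling factor mentioned above):
# loss = classification_loss + lam * token_style_regularization(z_real, domain_ids)
```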
IV. EXPERIMENT

A. Datasets, Protocols, and Implementations

Our experiments involve several benchmark datasets, including CASIA-FASD [64], IDIAP REPLAY-ATTACK [27], MSU MFSD [29], OULU-NPU [65], and SiW-M [56]. To evaluate our models, we employ the Half Total Error Rate (HTER), the Average Classification Error Rate (ACER), the Equal Error Rate (EER), the Area Under the Receiver Operating Characteristic Curve (AUC), and the True Positive Rate (TPR) at a False Positive Rate (FPR) of 1% (TPR@FPR=1%). These rigorous evaluation procedures ensure the reliability and validity of our findings.
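The metrics above are standard; for reference (this is not the paper's evaluation code), they can be computed from per-sample liveness scores roughly as follows, where the HTER threshold would normally be fixed on a development set rather than hard-coded.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def fas_metrics(scores, labels, threshold=0.5):
    """scores: higher = more likely bona fide; labels: 1 = bona fide, 0 = attack."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    far = np.mean(scores[labels == 0] >= threshold)          # attacks wrongly accepted
    frr = np.mean(scores[labels == 1] < threshold)           # bona fide wrongly rejected
    hter = (far + frr) / 2.0
    fpr, tpr, _ = roc_curve(labels, scores)
    eer = fpr[np.nanargmin(np.abs(fpr - (1.0 - tpr)))]       # operating point where FPR ~ FNR
    auc = roc_auc_score(labels, scores)
    tpr_at_fpr1 = tpr[np.searchsorted(fpr, 0.01, side="right") - 1]
    return hter, eer, auc, tpr_at_fpr1
```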
To conduct our experiments, we utilized the PyTorch 1.9 framework and performed training and testing on a single NVIDIA RTX 2080 Ti GPU. We follow [22], [23] and use ViT-Base as the ViT backbone. For data processing, we utilized MTCNN [66] to detect faces and resized the cropped face images to 224 × 224 as the input to the ViT-Base. During training, we employed the Adam optimizer with an initial learning rate of 0.0001.

TABLE I (fragment recovered from the extraction; Protocols 3 and 4 of OULU-NPU). Columns are APCER (%), BPCER (%), and ACER (%).
Protocol 3:
  Spoof-Trace [17]        1.6±1.6    4.0±5.4    2.8±3.3
  CDCN [61]               2.4±1.3    2.2±2.0    2.3±1.4
  CDCN++ [61]             1.7±1.5    2.0±1.2    1.8±0.7
  EPCR [67]               0.4±0.5    2.5±3.8    1.5±2.0
  S-Adapter-TSR (Ours)    0.4±0.6    1.8±1.2    1.1±1.0
Protocol 4:
  Auxiliary [32]          9.3±5.6    10.4±6.0   9.5±6.0
  DCL-FAS [9]             8.1±2.7    6.9±5.8    7.2±3.9
  Spoof-Trace [17]        2.3±3.6    5.2±5.4    3.8±4.2
  CDCN [61]               4.6±4.6    9.2±8.0    6.9±2.9
  CDCN++ [61]             4.2±3.4    5.8±4.9    5.0±2.9
  EPCR [67]               0.8±2.0    7.5±11.7   3.3±4.9
  S-Adapter-TSR (Ours)    1.5±3.1    3.9±4.6    2.7±3.5

B. Intra-Domain Evaluation

We first report the intra-domain experiments using the OULU-NPU dataset's four protocols [65]; the experimental results are given in Table I. The metrics used are the Attack Presentation Classification Error Rate (APCER), the Bona Fide Presentation Classification Error Rate (BPCER), and the Average Classification Error Rate (ACER), where ACER is the average of APCER and BPCER. Compared with the state-of-the-art methods, our method shows prominent performance, as it generally achieves the best ACER. Thus, our method is effective in the intra-dataset experiments.

C. Cross-Domain Evaluation

1) Leave-one-out cross-domain benchmark: To begin with, we compare our proposed method with state-of-the-art methods on the leave-one-out cross-domain benchmark [24], which consists of the CASIA-FASD (C) [64], IDIAP REPLAY-ATTACK (I) [27], MSU MFSD (M) [29], and OULU-NPU (O) [65] datasets. This benchmark can also be referred to as the MICO benchmark [14] and has been widely used for cross-domain performance evaluation [10], [11], [23], [15], [13], [12]. We follow the MICO benchmark's protocols described in [24] and present our HTER and AUC results in Table II. For a fair comparison, we extract the results of ViT† from [23] obtained without using any supplementary data from the CelebA-Spoof dataset [77]; ViT† utilizes the same ViT-Base backbone as our proposed approach. As presented in Table II, our method outperforms ViT† significantly in all
TABLE II
Experimental results on the leave-one-out benchmark MICO. Results are reported in terms of HTER (%) and AUC (%).

TABLE IV
Results of the 5-shot cross-domain experiment. Five bona fide examples and five attack examples from the target domain are used to fine-tune the pre-trained model. Results are reported in terms of HTER (%), AUC (%), and TPR (%)@FPR=1%.

TABLE V
Results of the LOO protocols on the SiW-M dataset [56]. The ACER (%) values reported on the testing sets are obtained with a threshold of 0.5. The best results are bolded.
Method | Metric (%) | Replay | Print | Mask (Half / Silicone / Trans / Paper / Manne) | Makeup (Obfusc / Imperson / Cosmetic) | Partial (Funny Eye / Paper Glasses / Partial Paper) | Average

Auxiliary [32]         ACER  16.8   6.9  19.3  14.9  52.1   8.0  12.8  55.8  13.7  11.7  49.0  40.5   5.3  23.6±18.5
                       EER   14.0   4.3  11.6  12.4  24.6   7.8  10.0  72.3  10.1   9.4  21.4  18.6   4.0  17.0±17.7
BCN [73]               ACER  12.8   5.7  10.7  10.3  14.9   1.9   2.4  32.3   0.8  12.9  22.9  16.5   1.7  11.2±9.2
                       EER   13.4   5.2   8.3   9.7  13.6   5.8   2.5  33.8   0.0  14.0  23.3  16.6   1.2  11.3±9.5
CDCN++ [61]            ACER  10.8   7.3   9.1  10.3  18.8   3.5   5.6  42.1   0.8  14.0  24.0  17.6   1.9  12.7±11.2
                       EER    9.2   5.6   4.2  11.1  19.3   5.9   5.0  43.5   0.0  14.0  23.3  14.3   0.0  11.9±11.8
DC-CDN [74]            ACER  12.1   9.7  14.1   7.2  14.8   4.5   1.6  40.1   0.4  11.4  20.1  16.1   2.9  11.9±10.3
                       EER   10.3   8.7  11.1   7.4  12.5   5.9   0.0  39.1   0.0  12.0  18.9  13.5   1.2  10.8±10.1
SpoofTrace [17]        ACER   7.8   7.3   7.1  12.9  13.9   4.3   6.7  53.2   4.6  19.5  20.7  21.0   5.6  14.2±13.2
                       EER    7.6   3.8   8.4  13.8  14.5   5.3   4.4  35.4   0.0  19.3  21.0  20.8   1.6  12.0±10.0
DTN [75]               ACER   9.8   6.0  15.0  18.7  36.0   4.5   7.7  48.1  11.4  14.2  19.3  19.8   8.5  16.8±11.1
                       EER   10.0   2.1  14.4  18.6  26.5   5.7   9.6  50.2  10.1  13.2  19.8  20.5   8.8  16.1±12.2
DTN (MT) [13]          ACER   9.5   7.6  13.1  16.7  20.6   2.9   5.6  34.2   3.8  12.4  19.0  20.8   3.9  13.1±8.7
                       EER    9.1   7.8  14.5  14.1  18.7   3.6   6.9  35.2   3.2  11.3  18.1  17.9   3.5  12.6±8.5
FAS-DR (Depth) [13]    ACER   7.8   5.9  13.4  11.7  17.4   5.4   7.4  39.0   2.3  12.6  19.6  18.4   2.4  12.6±9.5
                       EER    8.0   4.9  10.8  10.2  14.3   3.9   8.6  45.8   1.0  13.3  16.1  15.6   1.2  11.8±11.0
FAS-DR (MT) [13]       ACER   6.3   4.9   9.3   7.3  12.0   3.3   3.3  39.5   0.2  10.4  21.0  18.4   1.1  10.5±10.3
                       EER    7.8   4.4  11.2   5.8  11.2   2.8   2.7  38.9   0.2  10.1  20.5  18.9   1.3  10.4±10.2
ViT [76]               ACER  11.35  5.58  3.44  9.63 16.73  1.47  2.89 26.60  1.90  9.04 23.14 11.23  2.44   9.65±8.19
                       EER   11.18  7.32  3.89  9.63 14.32  0.00  3.50 23.48  1.64  9.20 20.38 11.32  1.86   9.06±7.21
ViT-S-Adapter (Ours)   ACER   8.93  4.08  1.81  2.02  1.61  0.39  0.62  4.00  1.09  6.60 13.09  0.54  0.43   3.48±3.90
                       EER    5.38  3.48  1.67  2.96  1.36  0.00  0.00  4.35  0.00  7.20 10.25  0.48  0.23   2.87±3.20
TABLE VI
Results of our S-Adapter and TSR for different ViT backbones: ViT-Large, ViT-Small, and ViT-Tiny.
... Hist (θ = 0)", where both the histogram layers and the token gradient (θ = 0) are removed. The experimental results are provided in Fig. 5. It can be seen that our S-Adapter generally outperforms the other two configurations, illustrating the advantages of extracting the token histogram. We observe that the token gradient also contributes to lower HTER values in most cases. However, in the "C&I&M to O" experiment, the inclusion of token gradient information results in an increased HTER. We conjecture that this unexpected result may be attributed to the disparity in texture between the low-resolution source domains (I, C, and M) and the high-resolution target domain (O). Although fine-grained texture information is extracted in the gradient, the domain gap might cause the texture to differ significantly between the low-resolution and high-resolution domains. In contrast, our histogram layers provide a more comprehensive representation of texture information across resolutions. This is evident in the lower HTER achieved by our S-Adapter compared to the other two configurations in the "C&I&M to O" experiment. In summary, our proposed S-Adapter demonstrates performance improvements by leveraging statistical information to enhance cross-domain performance, highlighting the benefits of incorporating a token histogram extracted from the token map together with the gradient information.
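The "histogram layers" discussed here pool a token map into per-channel statistics. As one way to picture such a layer — a simplified soft-binning form with learnable bin centres and widths, which is an assumption of ours and not the exact layer used in the paper:

```python
import torch
import torch.nn as nn

class SoftHistogram(nn.Module):
    """Differentiable per-channel histogram pooling over a token map (simplified form)."""
    def __init__(self, channels, num_bins=8):
        super().__init__()
        self.centers = nn.Parameter(torch.linspace(-1.0, 1.0, num_bins).repeat(channels, 1))
        self.widths = nn.Parameter(torch.ones(channels, num_bins))

    def forward(self, z):                                     # z: (B, C, H, W)
        b, c, h, w = z.shape
        v = z.reshape(b, c, 1, h * w)                         # (B, C, 1, HW)
        centers = self.centers.reshape(1, c, -1, 1)           # (1, C, K, 1)
        widths = self.widths.reshape(1, c, -1, 1)
        soft_counts = torch.exp(-(widths * (v - centers)) ** 2)
        return soft_counts.mean(dim=-1)                       # (B, C, K) histogram features

hist = SoftHistogram(8)(torch.randn(2, 8, 14, 14))            # (2, 8, 8)
```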
Moreover, we validate the proposed CDC layer and histogram (Hist) layer components on top of a standard vanilla adapter. To achieve this, after a standard vanilla linear adapter, we add the CDC layer (θ = 0.7) and the histogram layer; the results are denoted and reported in Fig. 6 as 'Adapter+CDC' and 'Adapter+CDC+Hist'. As illustrated in Fig. 6, our proposed histogram layer also benefits the vanilla adapter.
[Figure residue: legend entries "S-Adapter", "S-Adapter w/o Hist", "S-Adapter w/o Hist (θ = 0)"; y-axis "Half Total Error Rate"; chart title "HTER (%) and AUC (%) results of different fusion strategies"; the underlying bar values are not recoverable from this extraction.]