
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 46, NO. 1, JANUARY 2024

SimSwap++: Towards Faster and High-Quality Identity Swapping

Xuanhong Chen, Bingbing Ni, Yutian Liu, Naiyuan Liu, Zhilin Zeng, and Hang Wang

Abstract—Face identity editing (FIE) shows great value in AI content creation. Low-resolution FIE approaches have achieved tremendous progress, but high-quality FIE still struggles. Two major challenges hinder the development of higher-resolution, higher-performance FIE: the lack of a high-resolution dataset and an unacceptable complexity that forbids deployment on mobile platforms. To address both issues, we establish a novel large-scale, high-quality dataset tailored for FIE. Based on our SimSwap (Chen et al. 2020), we propose an upgraded version named SimSwap++ with significantly boosted model efficiency. SimSwap++ features two major innovations for high-performance model compression. First, a novel computational primitive named Conditional Dynamic Convolution (CD-Conv) is proposed to address the inefficiency of conditional schemes (e.g., AdaIN) in tiny models. CD-Conv achieves anisotropic processing and condition injection with significantly lower complexity than standard conditional operators, e.g., modulated convolution. Second, a Morphable Knowledge Distillation (MKD) scheme is presented to further trim the overall model. Unlike conventional homogeneous teacher-student structures, MKD is designed to be heterogeneous and mutually compensable, endowing the student with a multi-path morphable property; thus, our student maximally inherits the teacher's knowledge after distillation while further reducing its complexity through structure re-parameterization. Extensive experiments demonstrate that our SimSwap++ achieves state-of-the-art performance (97.55% ID accuracy on FaceForensics++) with extremely low complexity (2.5 GFLOPs).

Index Terms—Generative adversarial networks, image translation, face swapping.

Manuscript received 19 June 2022; revised 1 June 2023; accepted 3 August 2023. Date of publication 22 August 2023; date of current version 5 December 2023. This work was supported in part by the National Science Foundation of China under Grants U20B2072 and 61976137, and in part by SJTU Medical Engineering Cross Research under Grant YG2021ZD18. Recommended for acceptance by G. Hua. (Corresponding author: Bingbing Ni.) Xuanhong Chen, Bingbing Ni, Yutian Liu, Zhilin Zeng, and Hang Wang are with Shanghai Jiao Tong University, Shanghai 200240, China. Naiyuan Liu is with the University of Technology Sydney, Sydney, NSW 2007, Australia. Digital Object Identifier 10.1109/TPAMI.2023.3307156

I. INTRODUCTION

Given two images, face identity editing is the task of swapping the identity from a source face Is onto a target face It while maintaining the remaining attributes (e.g., pose, hair, expression) of It. It is of high value in both research and practical applications, especially in computer graphics, interactive digital media, and forgery detection. Contemporary FIE approaches [1], [2], [3] achieve noticeable progress in efficient swapping. However, most of these approaches can only deal with low-resolution (e.g., 224 × 224 or 256 × 256) face images and have difficulty handling ever-larger images (e.g., 512 × 512), as both computation and memory consumption grow exponentially with the increase of face size. For example, existing methods show unacceptable complexity even for low-resolution (LR) face images: FSGAN [2] (256 × 256) has 226 M parameters and 2450 GFLOPs; FaceShifter [3] (256 × 256) possesses 412 M parameters and over 97 GFLOPs; HifiFace [4] requires more than 146 M parameters and 71 GFLOPs at 256 × 256. Such a computational obstacle severely hinders FIE techniques from being deployed in real-time applications (e.g., interactive live video streaming) as well as on resource-constrained platforms (e.g., mobile phones, AR/VR glasses).

The current impediment is mainly caused by two issues, i.e., the lack of a high-quality face image dataset and heavy deep network designs in pursuit of high quality. As a typical unpaired image translation task [5], FIE highly demands robust supervision signals. Given the large number and diversity of face images, VGGFace2 [6] has become the de facto standard training set for the FIE task. However, its low quality (e.g., noisy faces and identities) and low resolution (typically 112 × 112 pixels) make VGGFace2 unhelpful for training high-quality editing models. Celeba-HQ [7] and FFHQ [8] are often used as high-quality surrogate training sets, but they contain only one image per identity, which produces insufficiently trained models. A straightforward solution [4] is to enhance the quality of existing face datasets (e.g., VGGFace2) via face restoration approaches [9], i.e., image denoising or super-resolution. Due to the limited robustness of such algorithms, these synthetic datasets often contain a large number (e.g., more than 10%) of failure cases or artifacts and exhibit a distribution gap with real images, which seriously harms the generalization ability of FIE models.

In addition, sophisticated model design hinders the deployment of FIE models. StyleGAN-based approaches [10], [11] leverage a GAN prior (i.e., StyleGANs [8], [12], [13]) to alleviate the quality issue of the generated data; however, the difficulty of decoupling latent variables leads to poor performance on swapping identities. Furthermore, model compression for StyleGAN-based pipelines [14], [15], [16], [17] is extremely difficult as it requires a fully trained StyleGAN. To achieve high-quality FIE, existing FIE models [2], [3], [18] generally design complicated architectures with large depth and width to prevent mode collapse. For example, FaceShifter designs up to four sub-networks with separate functions to realize identity editing, which makes it slow in inference with high resource requirements; FSGAN decomposes the FIE task into four processes (i.e., reenactment, segmentation, inpainting, and blending) and designs four generators to complete it, resulting in unacceptable computational complexity (i.e., 2450 GFLOPs).



It is worth mentioning that the performance of mainstream conditional injection mechanisms (e.g., AdaIN [19], modulated convolution [12]) degrades severely as the FIE model gets smaller/lighter, which prohibits the compression of high-quality models. Furthermore, GAN compression approaches often require a sophisticated hypernet (e.g., inception structures [20]) to provide sufficient search space [21], which not only greatly increases the complexity but also makes the student model hardware-unfriendly.

This work is proposed to address both issues, aiming at a high-quality FIE framework capable of handling high-quality facial images with even lower model complexity. First, we construct a novel large-scale, high-quality, and high-resolution (i.e., 512 × 512) face dataset named VGGFace2-HQ to enable high-performance FIE model training. This manual collection work is of non-trivial research value: we survey over 100,000 online videos (i.e., YouTube) and find that 98% of faces are below 512 × 512, which indicates that a 512 × 512 FIE model is sufficient for most cases. Through careful video searching, face cropping, and semi-automatic quality screening, VGGFace2-HQ contains more than 1.36 M 512 × 512 aligned face images and up to 9,630 distinct identities, with an average of 141 images per identity. The dataset consists of two parts: 1) a natural image sub-collection, which collects up to 200,000 images covering 1,000 different identities; and 2) a synthetic image sub-collection, containing 8,630 cleaned and re-annotated identities (i.e., we clean up the images with mismatched identities and low-quality faces in the cropped VGGFace2).

Second, based on the proposed dataset, we develop a high-performance and lightweight identity swapping framework called SimSwap++, which is designed to comprehensively surpass the preliminary version SimSwap [1] (https://github.com/neuralchen/SimSwap). It is worth noting that SimSwap, based on a VGG-style architecture, is capable of handling face images of 224 × 224 pixels and is one of the most popular FIE approaches. Aiming at generating high-quality and visually pleasing facial images, the structure of SimSwap++ is re-worked on top of the SimSwap architecture, with a novel and comprehensive compression scheme that integrates both operator-level and structure-level tailoring techniques, greatly alleviating the exponentially growing computational burden. At the operator level, we present a lightweight computing primitive called Conditional Dynamic Convolution (CD-Conv) to achieve unified content-dependent and condition-dependent depth-wise separable convolutions [22]. CD-Conv generates filters customized for both content and conditions by means of bilateral kernel fusion, overcoming the condition injection inefficiency of existing conditional operators (e.g., AdaIN [19], modulated convolution [12]) at negligible overhead. At the structure level, a Morphable Knowledge Distillation (MKD) scheme is presented to further trim the overall model. Note that previous distillation schemes [21] (i.e., teacher and student are structurally isomorphic) lead to heavy student structures (e.g., large inception blocks [20]) and also have difficulty in fully compressing the encoding from teacher to student, given the limited degree of freedom for knowledge distillation due to isomorphism. In contrast, the network architectures of teacher and student in our MKD are designed to be heterogeneous while endowing the student model with a multi-path morphable property based on linearly additive blocks (e.g., DBB [23]). As a result, our student model maximally inherits the teacher's knowledge after distillation while its computational complexity is further reduced through structural re-parameterization (i.e., via multi-path merging). In addition, Adaptive Kernel Alignment is proposed to further avoid negative regularization caused by undesirable characteristics of the teacher. With the cooperation of the above strategies, SimSwap++ achieves highly competitive performance with well-controlled resource overhead.

We evaluate the proposed SimSwap++ framework qualitatively and quantitatively on the 512 × 512 FIE task. Extensive experimental results demonstrate that our SimSwap++ framework achieves highly satisfying visual effects and quantitative performance (i.e., 97.55% ID retrieval and 2.55 pose error on FaceForensics++ [24]) while keeping an extremely low resource overhead (i.e., 2.5 GFLOPs). We show some visual results and model performance in Fig. 1. Compared with SimSwap, SimSwap++ achieves significant performance and efficiency gains (e.g., ID accuracy +1% with 1/63 GFLOPs).

II. RELATED WORK

A. Face Identity Editing

Face swapping methods can be broadly divided into two categories: 3DMM-based and GAN-based methods [25], [26]. Under each category, source-oriented methods first transfer attributes such as expressions of the source face to the target face before performing blending operations on the resulting image, while target-oriented methods directly blend the extracted face features.

1) 3DMM-Based Methods: 3DMM [27] projects the geometry, texture, and expression of a face to a high-dimensional subspace. Face2Face [28] achieves real-time expression transfer by fitting 3D models of the target and source faces. Nirkin et al. [29] use 3DMM and face segmentation to blend the expression and pose of the source face into the target face. Xu et al. [30] embed 3DMM coefficients, style, and identity into a learning framework, enabling control of face attributes. Unfortunately, the face segmentation stage of these methods makes it impossible to retain the face shape of the source image, and they also suffer from inaccurate facial reconstruction. In comparison, HifiFace [4] utilizes the geometric information in the 3DMM representation and achieves more realistic results. However, it is still too computationally expensive to fulfill real-time requirements.

2) GAN-Based Methods: Unlike the 3DMM-based approaches, GAN-based approaches blend faces at the feature level. Source-oriented methods, including RSGAN [31] and FSGAN [2], use a segmentation network to separate the face and then blend in this region.


Fig. 1. Our proposed SimSwap++ delivers highly pleasing visual results while requiring a remarkably low resource overhead (i.e., 2.5 GFLOPs). HifiFace [4] does not report the detailed structure of its 512 × 512 model, so only the information of the 256 × 256 HifiFace model is shown here.

Nevertheless, these methods do not perform well on identity similarity. FaceShifter [3] introduces an encoder-decoder framework to generate high-quality results, whereas it fails to preserve expression and lighting well. Similarly, DeepFakes [32] employs this structure, which has poor generalization across identities. Our SimSwap [1], the most popular open-source one-shot face swapping project, has a simple structure and performs well on low-resolution tasks but is unable to work well on higher-resolution tasks. Our proposed SimSwap++, extended from SimSwap, achieves a significant improvement in terms of the number of model parameters and inference time.

B. Model Compression

As discussed above, the tremendous computational cost of most deep models limits their deployment on lightweight devices; as a result, model compression [33], [34] has recently gained attention. Common methods are typically based on knowledge distillation and pruning. Knowledge distillation (KD) [35] is a technique that transfers the dark knowledge learned by a larger teacher model to a smaller student model. Most distillation-based GAN compression methods [36], [37], [38], [39], [40], [41] require searching and optimizing a supernet. By eliminating these operations, the overhead of our approach is significantly reduced. Recently, Jin et al. [21] utilize kernel alignment to minimize the difference between the teacher and student models at the feature level, which is further improved in our approach.

III. VGGFACE2-HQ: A HIGH-QUALITY DATASET FOR FACE IDENTITY EDITING

Popular face datasets for the FIE task, such as [6], [42], contain only low-resolution images (i.e., facial region 224 × 224), some of which are even noisy [6]. Although many high-resolution datasets (e.g., FFHQ [8], Celeba-HQ [7]) have been presented, they only provide a single image with only a few poses (i.e., all images keep one pose) for each identity, making the trained FIE model incapable of dealing with diverse poses. To alleviate this dilemma, in this work we first establish a high-quality (i.e., face size 512 × 512) dataset tailored for the FIE purpose, which contains a total of 9,630 face identities with more than 1.36 M 512 × 512 well-aligned face images. Cropped from videos and photo albums, these images cover diverse facial poses (e.g., singing, talking, kissing) and expressions and are presented in high resolution and quality. As a result, each identity possesses an average of 141 face images with varying poses and expressions.

A. Dataset Collection and Preparation

Owing to its high diversity in pose and ID information, the VGGFace2 dataset [6] is widely employed to train low-resolution (typically 224 × 224) FIE frameworks [1], [2], [3]. VGGFace2 is specially designed for face detection/recognition, and its face regions are much smaller than 512 × 512, which makes VGGFace2 unsuitable for direct application to high-quality face editing tasks. More importantly, we find that there is a large number (i.e., more than 10%) of images with mismatched identities and incomplete facial features in VGGFace2. Considering these facts, the design rule of the proposed VGGFace2-HQ dataset is to 1) build on the original VGGFace2; 2) prune mismatched or low-quality images; 3) perform super-resolution with quality screening to obtain high-quality data; and 4) collect additional high-quality images from internet-searched videos to further improve diversity. Namely, our constructed VGGFace2-HQ consists of two sub-collections: (1) a natural image sub-collection cropped from an additional video corpus, which collects up to 200,000 images covering 1,000 different identities; and (2) a synthetic image sub-collection from VGGFace2 that contains 8,630 cleaned and re-annotated identities.

Directly collecting a dataset of millions of images is a labor-intensive task.


Fig. 2. Non-compliant examples in the dataset. Please zoom in for a better view. More examples can be found on our project website https://github.com/neuralchen/SimSwapPlus.

Fig. 3. We show some randomly selected samples from our VGGFace2-HQ. The first row shows samples selected from the native VGGFace2, and the second row shows the corresponding images in VGGFace2-HQ. The third row shows two identities with 6 different poses in the natural sub-collection of VGGFace2-HQ. The last row displays images randomly sampled from the natural sub-collection.

Instead of manual data acquisition, previous works [4] obtain higher-resolution data by enhancing the image resolution of existing datasets via off-the-shelf super-resolution techniques [45]. However, those restoration approaches produce unfaithful facial details and less similar face identities. Unlike those methods, we carefully select 8,630 identities from VGGFace2 and then boost these face images to 512 × 512 via a face augmentation approach [9], which achieves SOTA performance in realness and fidelity. Due to the large number (i.e., more than 100,000 images) of failure cases (i.e., SR images with artifacts), incomplete facial features, and mismatched identity labels, we manually clean and re-annotate the results and finally form the first sub-collection. Some failure cases, which significantly harm FIE model training, are illustrated in Fig. 2. With 1.1 M face images covering 8,630 identities, these synthetic images form the bulk of the proposed VGGFace2-HQ. Although the resolution has been improved, there is a large distribution gap between synthetic and natural images, which makes the trained model suffer from performance degradation when dealing with in-the-wild data. To balance the distribution of the images, we construct a subset of natural images. Great efforts are devoted to collecting, cleaning, and annotating natural faces from open-source videos and albums, which form a large natural face sub-collection (i.e., up to 200,000 images) covering 1,000 different identities. It is not easy to obtain images of different scenes and poses of a person simultaneously, so most of the images are collected from videos. Unfortunately, the size of faces in videos changes drastically, and faces become blurred under severe motion. We develop an annotation tool to realize the automatic collection and archiving of the faces in the videos. After automatic processing, we obtain more than 300,000 raw faces in total. Then, through manual work, we remove the blurred, incomplete, ID-mismatched, and occluded (e.g., blood on the face) images, and finally obtain 200,000 natural face images. We show randomly selected samples of our dataset in Fig. 3. More details about the dataset can be found at https://github.com/neuralchen/SimSwapPlus.

IV. BASELINE FRAMEWORK REVISITED: SIMSWAP

We start with a revisit of our baseline model SimSwap [1]. As essentially an unpaired image translation task [5], training the swapping model is often unstable with a high possibility of mode collapse. Existing methods [2], [3] maintain stable training through complex multi-stage processing or functionally separated model components, thus suffering from a heavy computational burden. Also, most methods [32] require person-specific model training procedures and thus cannot be applied to unseen person ID swapping, which is prohibitive for practical FIE applications. Given the above inherent limitations, our SimSwap framework was proposed for efficient FIE with only a single VGG-style generator (i.e., fast inference speed). To balance attribute preservation and identity editing, as well as to achieve flexible face swapping for unseen inputs, SimSwap features two vital designs: an identity injection module and a weak feature matching loss.


Fig. 4. Network Structure Evolution: from SimSwap to SimSwap++. The red dotted lines and red letters indicate the parts that need to be changed in the evolution
step. We achieve model compression by two means: 1) we propose the lightweight operator CD-Conv as the basic operator to replace the classical convolution; 2)
The MKD is proposed to achieve compensable teacher-student structures and model re-parameterization. BN denotes batch normalization [43]. LReLU represents
the Leaky ReLU [44].

A. Model Architecture

The network architecture of SimSwap is illustrated in Fig. 4 in detail. The overall framework is trained in a GAN style. The generator consists of an Encoder module, an ID Injection Module (IIM), and a Decoder module. The Encoder module, which is based on classic convolution followed by batch normalization [43] and ReLU [46], extracts features Xt from the target image It. The ID Injection Module, which is designed based on a residual structure, transfers the identity information from Is into It. This module realizes the injection of identity information by first normalizing the distribution of Xt and then aligning it with the distribution of Is's identity vs via an affine transformation. The aligned features are further fed into the Decoder module (i.e., the same infrastructure as the Encoder) to reconstruct the synthetic image. The discriminator combines a GAN loss with a novel weak feature matching loss to ensure preservation of the attributes in Xt without harming the FIE performance too much.

B. Identity Injection Module

The objective of ID injection is to replace the identity information in Xt with the identity information of the source face while keeping the remaining attributes in Xt unchanged. Unfortunately, both pieces of information are highly coupled and hard to decompose, as indicated in [32]. Therefore, our ID injection module is designed in such a way that, by manipulating the feature map Xt, we replace the target face identity information with the extracted source face identity information while preserving the other facial attributes of Xt as much as possible. Explicit decomposition/disentanglement is not possible; acknowledging that the feature map Xt essentially lies on various facial attribute sub-spaces such as ID, pose, and shape, our ID injection module works together with a carefully designed training loss to encourage the network to implicitly project Xt onto the target face identity manifold while keeping its projections onto other attribute sub-spaces unchanged.

To this end, the ID injection module is composed of two parts: 1) an identity extraction part and 2) an ID embedding part. On the one hand, we employ a face recognition network [48] as an ID embedding module to extract the identity information vs of the source face. Note that through training iterations, our model is encouraged to progressively disentangle the identity information implicitly from the entangled attribute information of the target face features. On the other hand, we employ a total of 9 ID-Blocks to inject identity information. The ID-Block is a modified version of the residual block [49] with Adaptive Instance Normalization (AdaIN) [19]. The injection process that aligns the target identity to the source identity can be written as

$$\mathrm{AdaIN}(X_t, v_s) = F_1(v_s)\,\frac{X_t - \mu(X_t)}{\sigma(X_t)} + F_2(v_s), \qquad (1)$$

where μ(Xt) and σ(Xt) are the channel-wise mean and variance of Xt, and F(·) denotes a linear layer used to align the latent code dimensions.
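To make the injection in (1) concrete, the following PyTorch-style sketch shows one way an AdaIN-based ID-Block could apply a source identity embedding to target features. It is a minimal illustration written for this article rather than the authors' released implementation, and the layer sizes (e.g., a 512-dimensional ArcFace embedding) are assumptions.

```python
import torch
import torch.nn as nn

class AdaINInjection(nn.Module):
    """Minimal AdaIN-style identity injection (cf. Eq. (1)); illustrative only."""
    def __init__(self, channels: int, id_dim: int = 512):
        super().__init__()
        # F1 and F2 in Eq. (1): linear layers mapping the ID embedding
        # to per-channel scale and bias.
        self.to_scale = nn.Linear(id_dim, channels)
        self.to_bias = nn.Linear(id_dim, channels)

    def forward(self, x_t: torch.Tensor, v_s: torch.Tensor) -> torch.Tensor:
        # x_t: target features (N, C, H, W); v_s: source ID embedding (N, id_dim)
        mu = x_t.mean(dim=(2, 3), keepdim=True)           # channel-wise mean
        sigma = x_t.std(dim=(2, 3), keepdim=True) + 1e-6  # channel-wise std
        normalized = (x_t - mu) / sigma                   # remove target statistics
        scale = self.to_scale(v_s).unsqueeze(-1).unsqueeze(-1)
        bias = self.to_bias(v_s).unsqueeze(-1).unsqueeze(-1)
        return scale * normalized + bias                  # re-style with source ID

class IDBlock(nn.Module):
    """Residual block with AdaIN injection, loosely following the ID-Block idea."""
    def __init__(self, channels: int, id_dim: int = 512):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.adain1 = AdaINInjection(channels, id_dim)
        self.adain2 = AdaINInjection(channels, id_dim)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x_t, v_s):
        h = self.act(self.adain1(self.conv1(x_t), v_s))
        h = self.adain2(self.conv2(h), v_s)
        return x_t + h  # residual connection
```

In the full generator, nine such blocks would be stacked inside the ID Injection Module, all conditioned on the same source embedding vs.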

C. Preserving the Attributes of Target Faces

For the FIE task, the modification should be conducted only in the identity sub-space, and the other attributes of It should remain unchanged. However, as all these attributes embedded in the latent feature representation Xt are seriously entangled, a change of the identity embedding would affect other attributes' feature representations. To enforce attribute matching between results and target faces (namely, to eliminate the influence of the identity change), one straightforward idea is to explicitly decompose the latent feature map Xt into different independent attribute values and constrain each of them; this is impractical, as attribute space decomposition is impossible and the number of informative attributes is huge. To tackle this challenge, we propose to utilize a weak feature matching loss to achieve implicit attribute matching. The idea of feature matching originated in pix2pixHD [50], which uses discriminators to extract multiple layers of features from the ground truth image and the generated output and penalizes their semantic representation gap.


The original feature matching loss is written as

$$\mathcal{L}_{ofm}(I_r, I_{GT}) = \sum_{i=1}^{M} \frac{1}{N_i} \left\| D^{(i)}(I_r) - D^{(i)}(I_{GT}) \right\|_1, \qquad (2)$$

where D^(i) denotes the i-th layer feature extractor of the discriminator D and N_i denotes the number of elements in the i-th layer. M is the total number of layers, I_r is the generated output, and I_GT is its corresponding ground truth image.

In our problem setting, since there is no ground truth for the identity editing output, we employ the input target image It instead. In particular, we remove the first few layers and only use the last few layers to calculate our weak feature matching loss, which can be written as

$$\mathcal{L}_{wfm}(I_r, I_t) = \sum_{i=m}^{M} \frac{1}{N_i} \left\| D^{(i)}(I_r) - D^{(i)}(I_t) \right\|_1, \qquad (3)$$

where m is the layer at which we start to calculate the weak feature matching loss. The reason for using deep-layer features only is as follows. Shallow-layer features mainly contain texture information that is good at constraining pixel-level results. However, in the FIE task, introducing excessive low-level details from the target image will make the synthetic result similar to the target face and cause difficulty in identity modification. Thus, we drop the first few layers in the proposed weak feature matching loss and only encourage high-level feature alignment, which avoids explicit damage to the identity editing capability while maintaining expressive power for the other attributes. It is precisely because the shallow features are discarded that the redesigned loss focuses on matching more abstract attribute sub-spaces rather than the pixel level. It works together with the ID-Block and the ID loss to delicately realize the directional modification of ID information. By using such a loss function, even though we do not explicitly constrain the network on any specific attribute, it implicitly learns how to preserve the attributes of the input target face.
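As an illustration of (3), the sketch below computes a weak feature matching loss from a discriminator that returns its intermediate feature maps. It is a simplified reconstruction for exposition; the discriminator interface and the starting layer m are assumptions, not the released code.

```python
import torch
import torch.nn.functional as F

def weak_feature_matching_loss(disc_feats_fake, disc_feats_real, start_layer: int):
    """Eq. (3): L1 distance between deep discriminator features only.

    disc_feats_fake / disc_feats_real: lists of feature maps D^(i)(.),
    ordered from shallow to deep; start_layer is the index m at which
    matching begins (shallow layers are ignored).
    """
    loss = 0.0
    for feat_fake, feat_real in zip(disc_feats_fake[start_layer:],
                                    disc_feats_real[start_layer:]):
        # reduction="mean" divides by the number of elements, i.e., the 1/N_i term.
        loss = loss + F.l1_loss(feat_fake, feat_real.detach(), reduction="mean")
    return loss
```

With a multi-scale discriminator, this term is simply summed over all discriminators, as expressed later in (6).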
1) Overall Training Losses: SimSwap is trained in a GAN-style framework, and we use five types of losses: identity loss, reconstruction loss, adversarial loss, weak feature matching loss, and gradient penalty, which are explained as follows.

Identity Loss. To push the identity of the generated result Ir toward that of the source face Is, we introduce an identity loss. We use the above-introduced face recognition model to extract the identity representation vector vr of Ir. The model applies the cosine distance to measure the identity similarity between the resulting synthetic face and the source face, which can be written as

$$\mathcal{L}_{id} = 1 - \frac{v_r \cdot v_s}{\left\| v_r \right\|_2 \left\| v_s \right\|_2}. \qquad (4)$$
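A minimal sketch of the identity loss in (4), assuming a pretrained face recognition network (e.g., ArcFace) that maps an image batch to identity embeddings; the function name and interface are illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F

def identity_loss(arcface, generated: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
    """Eq. (4): 1 - cosine similarity between ID embeddings of result and source."""
    v_r = arcface(generated)        # identity embedding of the swapped result
    v_s = arcface(source).detach()  # identity embedding of the source face
    cos_sim = F.cosine_similarity(v_r, v_s, dim=1)
    return (1.0 - cos_sim).mean()
```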
Reconstruction Loss. If the source face and the target face are sampled from the same identity, the network should in principle reproduce the source face. Therefore, we design the reconstruction loss to regularize the results in the same-identity case, which is written as

$$\mathcal{L}_{rec} = \left\| I_r - I_s \right\|_1. \qquad (5)$$

If the source face and the target face are sampled from different identities, using the reconstruction loss makes it difficult for the identity loss to converge, so we simply set it to 0.

Adversarial Loss and Gradient Penalty. We use multi-scale discriminators [51] for better performance under large postures. We also utilize the gradient penalty [52], [53] Lgp to prevent the discriminators from gradient explosion. Note that the adversarial loss helps the generator produce good facial image quality with effective modification of identity.

Weak Feature Matching Loss. Since we are using a multi-scale discriminator, the weak feature matching loss should be calculated over all discriminators, which can be written as

$$\mathcal{L}_{wfm} = \sum_{i=1}^{3} \mathcal{L}^{i}_{wfm}\big(D_i(I_r, I_s)\big). \qquad (6)$$

The overall loss can therefore be written as

$$\mathcal{L} = w_{id}\mathcal{L}_{id} + w_{rec}\mathcal{L}_{rec} + w_{adv}\mathcal{L}_{adv} + w_{gp}\mathcal{L}_{gp} + w_{wfm}\mathcal{L}_{wfm}, \qquad (7)$$

where w_id = 10, w_rec = 10, w_gp = 10^-5, and w_wfm = 10 to balance the magnitudes.

V. TOWARDS FASTER AND HIGHER-QUALITY IDENTITY EDITING: SIMSWAP++

Although the proposed SimSwap framework achieves decent face editing quality with moderate computational load, its inherent structural and memory inefficiency (i.e., it is based on classic convolutions) prevents it from being deployed on real-time or resource-constrained platforms. More specifically, the major computational bottleneck in SimSwap is its heavy computational primitives, which are based on classic convolutions (Conv for short in this article): a forward pass of SimSwap on a TitanXp GPU consumes 48.3 GFLOPs with 1.1 GB memory occupancy at 224 × 224, far exceeding the capability of most mobile platforms. One of the most straightforward remedies is to replace the classical convolution with depth-wise convolution (DW-Conv) [22]. However, the poor capacity of this operator causes serious degradation of the reconstruction performance (refer to the 5th row and 2nd column in Fig. 5). More importantly, standard condition injection schemes such as AdaIN [19] and modulated convolution [12] (Mod-Conv for short) suffer from significant performance degradation when the model size is drastically reduced (refer to Fig. 6).

To alleviate the dilemma between efficiency and visual quality (i.e., resolution), this work proposes an upgraded version of SimSwap, called SimSwap++, which targets a faster and higher-quality face identity editing framework. In a nutshell, two innovative techniques are equipped in the SimSwap++ framework to improve computational efficiency while preserving editing fidelity: 1) an optimized primitive operator design (i.e., a content-adaptive low-cost convolution), and 2) whole-infrastructure compression via a novel high-performance knowledge distillation scheme; both are introduced in detail as follows.


Fig. 5. Comparison of complexity and performance of different operators. Pose error is tested on FF++ based on a reduced SimSwap (i.e., 1/4 channels). K is the kernel size; r is the squeeze ratio [47], usually set to 4; G is the number of processing groups for depth-wise D-Conv, usually set to C/8. The input feature is in R^{C×H×W}. * FLOPs calculation conditions: K = 3, C = 32, G = 4, r = 4, H = 16, W = 16. It is worth mentioning that our proposed CD-Conv only adds limited FLOPs (i.e., 2K²GHW + 1024K²G) over D-Conv.

Fig. 6. Comparison of editing performance of different operators. ID accuracy is tested on FF++ based on SimSwap with different numbers of channels. C is the basic number of channels.

A. Conditional Dynamic Convolution

We identified the most serious issue in FIE model compression: the performance of mainstream conditional injection methods (e.g., AdaIN, Mod-Conv) degrades severely as the FIE model gets smaller/lighter, which prohibits the compression of high-quality models. The essential reason for this severe degradation is that the support set (i.e., spanned by the possible combinations of conditions) of these operators is relatively small and cannot provide high-variance perturbations; this becomes even worse as the model gets smaller, leading to the inability to clearly separate the features of different IDs after injection. In view of this, we introduce a novel computing primitive called Conditional Dynamic Convolution (CD-Conv), a high-performance, low-cost convolution module, to address both efficiency and ID injection. A widely used operator for lightweight modules is depth-wise convolution. Unfortunately, DW-Conv is a spatially invariant, isotropic operator and cannot perform content-dependent convolution for different facial regions, which is very important for manipulating face features in finer detail. In contrast, dynamic convolutions (D-Conv) [54], [55] have recently emerged as a high-performance alternative to DW-Conv; they generate anisotropic filters to process each sliding window, making them superior to static convolution. Unfortunately, D-Conv does not have the ability to inject conditional information and needs to cooperate with AdaIN to achieve face manipulation. However, such a scheme suffers from severe editing performance degradation as the model size gets smaller (as shown in Fig. 6, AdaIN+D-Conv shows an 11.5% drop in ID accuracy), which is obviously not in line with our original intention. At the same time, it can be seen in Fig. 6 that the recently popular conditional injection operator Mod-Conv shows a similar degree of degradation to AdaIN. Therefore, the design purpose of the proposed Conditional Dynamic Convolution (CD-Conv) is to seek a novel atomic structure that simultaneously solves the injection of conditions and the encoding of content with a limited cost increase.

As shown in Fig. 5, CD-Conv is designed as a bilateral filtering scheme consisting of a content branch and a conditional branch, which enables CD-Conv to uniformly integrate the capabilities of convolutional filtering and conditional injection. Such a bilateral information architecture makes our CD-Conv significantly different from existing operators (which are content-aware only or condition-aware only): it can generate kernels tailored to both content and conditions, endowing it with robust performance even at small model sizes (refer to Fig. 6). The content branch provides the basic functionality of generating content-aware convolution kernels. Following [54], the core structure of the content branch is a squeeze-and-excitation block [47], which has been proven to be very effective in generating kernels. The most important innovation of CD-Conv is a conditional branch that achieves ID information injection, based on a simple yet effective idea of bilateral kernel fusion. The conditional branch consists of a plain linear layer that receives the conditional input and generates a set of conditional kernels, which are then fused with the kernels generated by the content branch via bilateral kernel fusion. Moreover, our CD-Conv only adds a limited amount of 2K²GHW + 1024K²G FLOPs compared to D-Conv.
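To make the bilateral design more tangible, the following sketch generates one depth-wise kernel from the content (squeeze-and-excitation style) and another from the ID condition (a linear layer), fuses them by multiplication, and applies the result. It is an interpretation of the description above and of the formal definition given later in (8); the tensor shapes, per-sample (rather than per-window) kernels, and the fusion-by-multiplication choice are our assumptions, not the published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CDConvSketch(nn.Module):
    """Simplified conditional dynamic convolution: one dynamic depth-wise
    kernel per sample (the paper generates kernels per sliding window/group)."""
    def __init__(self, channels: int, id_dim: int = 512,
                 kernel_size: int = 3, reduction: int = 4):
        super().__init__()
        self.c, self.k = channels, kernel_size
        # Content branch: squeeze-and-excitation style kernel generator.
        self.content_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels * kernel_size * kernel_size),
        )
        # Conditional branch: a plain linear layer mapping the ID embedding to kernels.
        self.cond_branch = nn.Linear(id_dim, channels * kernel_size * kernel_size)

    def forward(self, x: torch.Tensor, v_s: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        content_k = self.content_branch(x).view(n, c, 1, self.k, self.k)
        cond_k = self.cond_branch(v_s).view(n, c, 1, self.k, self.k)
        kernel = content_k * cond_k                       # bilateral kernel fusion
        # Grouped-conv trick: fold the batch into groups so each sample
        # is filtered by its own depth-wise kernel.
        out = F.conv2d(x.reshape(1, n * c, h, w),
                       kernel.reshape(n * c, 1, self.k, self.k),
                       padding=self.k // 2, groups=n * c)
        return out.view(n, c, h, w)
```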


Note that the conditional branch only needs to be run once when processing the same ID (i.e., a video), so the added complexity of CD-Conv is further reduced. We show the structure, complexity, and performance of all the above operators in Fig. 5 for comparison. The output Y ∈ R^{C×H×W} of CD-Conv can be expressed as

$$Y_{c,i,j} = \sum_{(u,v)\in\Omega_{i,j}} \left[ \Phi(v_s) \cdot \Psi_c(X_{u,v}) \right] * X_{c,u,v}, \qquad (8)$$

where Φ and Ψ represent the conditional-branch and content-branch generators, respectively, and vs is the ID embedding generated by a pretrained ArcFace. Ω_{i,j} denotes the set of positions in the K × K kernel window, written as a Cartesian product

$$\Omega_{i,j} = \left[0, 1, \ldots, 2\lfloor K/2 \rfloor\right] \times \left[0, 1, \ldots, 2\lfloor K/2 \rfloor\right]. \qquad (9)$$

For simplicity, we use u, v to index the K × K sliding window, c is the channel index, and X ∈ R^{C×H×W} represents the input feature. Compared with classical convolution, the computational overhead of CD-Conv is greatly reduced while maintaining high performance.

B. Model Compression via Morphable Knowledge Distillation

Mainstream GAN compression approaches often require a sophisticated hypernet (e.g., inception structures) to provide sufficient search space [21], [39], which not only greatly increases the complexity but also makes the student model hardware-unfriendly (e.g., inception blocks). We develop a tailored distillation scheme to further compress our models. Conventional distillation frameworks mostly favor a homogeneous teacher-student architecture: teacher and student models share a similar network structure with the same basic computation components, the only difference being the parameter scale, i.e., shallower networks, fewer channels, and so on. The underlying design principle is that similar structures share the same knowledge encoding format/style/characteristics, and therefore the knowledge encoded by the teacher is more easily fed into the student; in other words, the learned representations (from both intermediate layers and final responses) are more compatible with each other [39]. However, this homogeneous distillation scheme often leads to sub-optimal training. First, since both teacher and student share a similar network structure and similar distributions of the learned intermediate features, once the teacher model saturates during training, it becomes difficult to inject further knowledge into the student model, which has a much smaller parameter size. In other words, homogeneous feature representations are often hard to compress in distillation. Second, since the layer-wise feature representations of both models are encoded in similar subspaces, the student model may suffer from early saturation, and the layer-wise information reconstruction error is hard to reduce. Furthermore, homogeneous structures make the student network inherit the cumbersome design of the teacher network [39], which significantly weakens the compression benefits.

In view of these issues, in this work we propose a novel Morphable Knowledge Distillation (MKD) scheme. On the one hand, in contrast to previous distillation schemes [21], MKD features a heterogeneous teacher-student architecture design. Namely, the teacher and student networks are designed to have different architectures but with a mutually compensable structure that promotes complementarity by creating diverse feature distributions (e.g., processing features via diverse operators). In this way, the learned feature representations of the teacher and student networks are encoded with different data distributions, and thus knowledge flows during distillation training are more easily injected into the student model. In other words, the mismatch between the representations of both models brings more degrees of learning freedom for the student model to absorb knowledge. On the other hand, since the homogeneity design constraint no longer exists, we can freely design the student architecture to endow it with the capability of being further compressed after distillation. Thanks to the recent development of network morphism techniques [56], [57], one plausible choice is to design the student network with a multi-branch mergeable architecture; thus, the student model is easier to train (a well-known advantage offered by multi-path structures [20]) and model trimming becomes possible.

1) Morphable Dual-Structures: The most important infrastructure realizing the above ideas is the Morphable Dual-Structure (MDS), which is designed as inception-like blocks to obtain the mutually compensable characteristics. Referring to Fig. 7, MDS contains two different topologies, one for the teacher network and one for the student network. The teacher's structure is an inception-like multi-branch structure that contains convolutions of different scales, each operator followed by a BN and a nonlinear layer. The student structure adopts a topology heterogeneous to the teacher's, eliminating the nonlinear layers inside the multi-branch structure and adding a unified nonlinear layer outside to maintain the representation ability. Compared with commonly used distillation structures (e.g., residual blocks, inception-like blocks), the diversified processing branches in MDS provide the teacher and the student more freedom and richer feature matching, resulting in better distillation results. To replace the ID-Block, we use MDS and CD-Conv to build a residual module called MDS-ResBlock, whose structure is shown in Fig. 7. It is worth mentioning that the elimination of the nonlinear layers follows the idea of network morphism [57], which demonstrates that linear branches are morphable via structure re-parameterization [56], [57]. After distillation and re-parameterization, our module can finally be reduced to a convolution followed by a nonlinear layer, which is more hardware-friendly than inception-like structures.


Fig. 7. The structure shown on the left is the vital building block (MDS) for knowledge distillation. MDS-ResBlock consists of cascaded CD-Conv and MDS and injects ID information via our proposed CD-Conv.
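The following sketch illustrates the structural re-parameterization that makes the student's linear multi-branch blocks collapsible (Section V-B1): two parallel convolutions whose outputs are summed, with no nonlinearity in between, can be merged into a single convolution after training. This is a generic DBB-style merge written for illustration under the assumption of same-shaped branches; it is not the authors' released code.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def merge_parallel_convs(conv_a: nn.Conv2d, conv_b: nn.Conv2d) -> nn.Conv2d:
    """Merge y = conv_a(x) + conv_b(x) into a single equivalent convolution.

    Assumes both branches share kernel size, stride, padding, and groups,
    so their weights (and biases) can simply be added.
    """
    merged = nn.Conv2d(conv_a.in_channels, conv_a.out_channels,
                       conv_a.kernel_size, stride=conv_a.stride,
                       padding=conv_a.padding, groups=conv_a.groups, bias=True)
    merged.weight.copy_(conv_a.weight + conv_b.weight)
    bias_a = conv_a.bias if conv_a.bias is not None else torch.zeros(conv_a.out_channels)
    bias_b = conv_b.bias if conv_b.bias is not None else torch.zeros(conv_b.out_channels)
    merged.bias.copy_(bias_a + bias_b)
    return merged

# Quick equivalence check on random input.
if __name__ == "__main__":
    x = torch.randn(1, 16, 32, 32)
    branch_a = nn.Conv2d(16, 16, 3, padding=1)
    branch_b = nn.Conv2d(16, 16, 3, padding=1)
    fused = merge_parallel_convs(branch_a, branch_b)
    assert torch.allclose(branch_a(x) + branch_b(x), fused(x), atol=1e-5)
```

Branches with different kernel sizes (e.g., 1 × 1) or with BN can likewise be folded into a single K × K convolution by zero-padding the kernels and absorbing the BN statistics, which is what allows the MDS-ResBlock to shrink to one convolution plus a nonlinearity after training.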

2) Network Structure Evolution: From SimSwap to SimSwap++: To accommodate high-quality, high-resolution face editing, we describe in detail the structural evolution from SimSwap towards SimSwap++ (shown in Fig. 4), which can directly deal with 512 × 512 face images. 1) Receptive field increase. To reconstruct finer texture/geometric patterns for higher-resolution face images, it is necessary to enlarge the receptive field of our generation model. For this purpose, we add two more down-sampling layers and two more up-sampling layers in the encoder and decoder structures of SimSwap++. 2) Trainability enhancement. Although the multi-scale discriminator adopted in the original SimSwap model performs excellently, its excessive hardware overhead (typically three discriminators together) also limits the training batch size, yielding suboptimal results. To alleviate this issue, we instead use a projected discriminator [61], which shows faster convergence with fewer hardware requirements. With the above two improvements, we obtain the target model called SimSwap-512, which can effectively handle high-quality 512 × 512 face images but with huge parameters (i.e., 216 M) and high complexity (i.e., 157.7 GFLOPs). To apply our proposed high-performance network compression techniques based on CD-Conv and the teacher-student training scheme, we develop our teacher and student networks based on the SimSwap-512 model. First, we replace all down-/up-sampling layers of SimSwap-512 with CD-Conv down-/up-sampling layers and replace all ID-Blocks with MDS-ResBlock (teacher), yielding the teacher model called SimSwap++(T) with high editing performance. Then, we develop a middle-scale model, SimSwap++(M), by replacing all residual modules in SimSwap++(T) with MDS-ResBlock (student). More importantly, based on SimSwap++(M), we obtain our final tiny model, called SimSwap++(S), by significantly reducing the channels (i.e., to 1/4) of the Encoder and Decoder to meet the computational and memory budget of resource-constrained platforms. In addition, to improve the model's ability to reconstruct the background, we add two mask heads (each consisting of two depth-wise convolution layers) at the last (i.e., 512 × 512) and third (i.e., 128 × 128) layers of the decoder of the SimSwap++ (T/M/S) models. The detailed model evolution process is illustrated in Fig. 4.

3) Adaptive Similarity-Based Knowledge Distillation: Compared with discriminative models, generative models need to predict dense labels (i.e., images), making it difficult to distill the dark knowledge through response-based distillation [35] alone. To alleviate this issue, we use a feature-based scheme [63], which aims to increase the similarity of intermediate features between the teacher and student networks. Kernel alignment [64], [65] is introduced to avoid feature dimension mismatch between the teacher and student models. Experiments show that naively applying the distillation loss leads to considerable training noise. We design a dynamic weight mechanism to solve this problem: a linear function R(d_id) of the similarity d_id between the teacher's output and the ground truth. The closer the teacher's output is to the ground truth, the more reliable it is, and the larger R(d_id) becomes. We introduce the R(d_id) coefficient into the kernel alignment to obtain the Adaptive Kernel Alignment (adaKA for short) as follows:

$$K_{ada}\big(X, Y, R(d_{id})\big) = \frac{\left\| Y^{T}\left(R(d_{id}) \cdot X\right) \right\|_F}{\left\| X^{T}X \right\|_F \left\| Y^{T}Y \right\|_F}, \qquad (10)$$

where X and Y denote the teacher's and student's features, respectively. To further improve the reconstruction ability and convergence speed of the student model, we also introduce output reconstruction between the student and teacher models:

$$\mathcal{L}_{dis} = R(d_{id}) \cdot \mathcal{L}_1\big(G_t(x), G_s(x)\big) - \sum_{l\in S_{bot}} K_{ada}\big(G_t(x)^{l}, G_s(x)^{l}, R(d_{id})\big), \qquad (11)$$

where Gt and Gs denote the teacher generator and the student generator, and S_bot denotes the set of bottleneck layers on which distillation is performed.

C. Learning

We employ four loss functions derived from SimSwap to train our teacher network: the identity loss Lid, the reconstruction loss Lrec, the weak feature matching loss Lwfm, and the adversarial loss Ladv. In particular, we use the multi-scale adversarial loss from [61] instead. In addition, we use a fine-tuned FaRL [66] model to generate face masks for VGGFace2-HQ to supervise our two mask heads. The total loss of the teacher network can be expressed as

$$\mathcal{L}_T = \lambda_{id}\mathcal{L}_{id} + \lambda_{rec}\mathcal{L}_{rec} + \lambda_{wfm}\mathcal{L}_{wfm} + \lambda_{adv}\mathcal{L}_{adv} + \lambda_{m}\mathcal{L}_{m}, \qquad (12)$$

where λid = 30, λrec = 10, λwfm = 10, λadv = 1, and λm = 80. Lm is the face mask loss in the same format as HifiFace [4]. For the training of the student network, we use the same loss functions as for the teacher network and additionally introduce the aforementioned knowledge distillation loss Ldis.
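A sketch of the adaptive distillation objective in (10)-(11): features are flattened, the kernel-alignment similarity is scaled by the reliability weight R(d_id), and an L1 output-reconstruction term is added. The shapes, the flattening choice, and the helper names are our assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn.functional as F

def adaptive_kernel_alignment(x_t: torch.Tensor, y_s: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """Eq. (10): reliability-weighted kernel alignment between teacher (x_t)
    and student (y_s) features; r is a per-sample reliability weight (N, 1)."""
    x = x_t.flatten(1)  # (N, D_t)
    y = y_s.flatten(1)  # (N, D_s)
    num = torch.linalg.norm(y.T @ (r * x), ord="fro")
    den = torch.linalg.norm(x.T @ x, ord="fro") * torch.linalg.norm(y.T @ y, ord="fro")
    return num / (den + 1e-8)

def distillation_loss(teacher_out, student_out, teacher_feats, student_feats, r):
    """Eq. (11): weighted output reconstruction minus adaKA over bottleneck layers."""
    loss = r.mean() * F.l1_loss(student_out, teacher_out.detach())
    for f_t, f_s in zip(teacher_feats, student_feats):
        loss = loss - adaptive_kernel_alignment(f_t.detach(), f_s, r)
    return loss
```

Here R(d_id) would be computed from the ID similarity between the teacher's output and the ground truth, down-weighting unreliable teacher predictions so that they do not act as negative regularization on the student.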

For the student network, the overall loss function is

$$\mathcal{L}_S = \lambda_{id}\mathcal{L}_{id} + \lambda_{rec}\mathcal{L}_{rec} + \lambda_{wfm}\mathcal{L}_{wfm} + \lambda_{adv}\mathcal{L}_{adv} + \lambda_{dis}\mathcal{L}_{dis} + \lambda_{m}\mathcal{L}_{m}. \qquad (13)$$

To balance the losses, we set λid = 30, λrec = 10, λwfm = 10, λadv = 1, λdis = 10, and λm = 80.

VI. EXPERIMENTS

Comprehensive experimental evaluations have been conducted to show the ID swapping effectiveness, the visual quality of the output synthetic images, and the algorithmic efficiency of the proposed SimSwap++ framework and its variants (i.e., the preliminary version SimSwap and all intermediate models, including SimSwap, SimSwap-512, SimSwap++(T), SimSwap++(M), and SimSwap++(S)). In particular, ablation studies have been conducted to verify the effectiveness of the proposed model compression techniques, including the CD-Conv operator and the Morphable Knowledge Distillation scheme. The value of the proposed VGGFace2-HQ dataset for training a high-quality FIE model is also demonstrated.

A. Implementation Details

We down-sample our proposed VGGFace2-HQ to 224 × 224 as the training set for SimSwap. We choose VGGFace2-HQ as the training dataset for SimSwap-512 and the SimSwap++ model series. To minimize the background area, our VGGFace2-HQ maintains the cropping style of SimSwap. For the ID embedding, we use the pre-trained ArcFace model obtained from InsightFace-v2 (https://github.com/foamliu/InsightFace-v2). All images are normalized with mean tensor [0.485, 0.456, 0.406] and variance tensor [0.229, 0.224, 0.22]. We train our models using the ADAM optimizer [67] with parameters set to β1 = 0 and β2 = 0.999. All models are trained with 4 V100 GPUs. For SimSwap, we set the learning rate to 0.004 and the batch size to 64. For SimSwap-512 and the SimSwap++ teacher models, we set the learning rate to 0.008 and the batch size to 32. For the SimSwap++ student model, we set the learning rate to 0.004 and the batch size to 64. In particular, we alternately sample image pairs of the same identity and image pairs of different identities to train the model, which plays a role similar to a cycle loss [5].

B. Baseline Models

For quantitative analysis, we select the most popular face swapping models as baselines, i.e., DeepFake [32], FaceSwap (https://github.com/MarekKowalski/FaceSwap), FaceShifter [3], FaceController [30], MobileFaceSwap [58], FSGAN [2], SimSwap [1], HifiFace [4], MegaFS [10], and FSLSD-HiRes [11]. In particular, we also compare with the most popular FIE software, DeepFaceLab [62]. Since FaceShifter, FaceSwap, DeepFake, FaceController, MobileFaceSwap, FSGAN, and our SimSwap are specifically designed for the LR (i.e., 224 × 224) FIE task, we group them together for visual comparison. HifiFace, DeepFaceLab, MegaFS, and FSLSD-HiRes are specially designed for high-quality faces (i.e., 512 × 512), and we compare them together. It needs to be specifically stated that FaceController, FaceShifter, and HifiFace have not released official open-source code, and MegaFS released a GitHub repository without the pre-trained model and training code. For a fair comparison, we directly crop the result images from their published articles.

C. Quantitative Analysis

Our proposed models, together with various baseline models trained on the VGGFace2-HQ dataset, are tested on the FaceForensics++ dataset [24] for the face ID editing task, to quantitatively compare FIE accuracy in terms of ID retrieval accuracy (%) and pose recognition error (i.e., Euclidean distance). At the same time, we measure the parameter size, GFLOPs, and inference speed of these models to compare their computational complexities. The above quantitative comparison results are presented in Table I. First of all, it is directly observed from Table I that our SimSwap++ models (including the SimSwap++(T/M/S) variants) achieve excellent ID retrieval accuracy and pose error, as well as highly competitive model efficiency in terms of GFLOPs/FPS/parameter size, among all compared approaches. More specifically, the SimSwap++(T) model achieves the best ID retrieval accuracy (98.52%) and the lowest pose recognition error (2.38); its compressed model SimSwap++(S) achieves the best model efficiency (2.5 GFLOPs at 81.4 FPS) with only a negligible ID retrieval accuracy drop (i.e., from 98.52% to 97.55%, which is still higher than most of the compared methods with much larger model sizes). These results demonstrate that the proposed SimSwap++ models possess great identity encoding and editing capability with minimal interference to other facial attributes, i.e., good facial feature/quality preservation power. Also, the very competitive performance obtained by SimSwap++(S) proves that our proposed CD-Conv operator and MKD scheme offer good facial feature representation power with a limited computational burden. Second, it is noted that the model sizes/complexities of GAN Compression, CAT, and Response-based [35] (i.e., using them to directly compress the target SimSwap-512 model) are comparable to our compressed model SimSwap++(S), but their ID retrieval accuracies and pose errors are significantly inferior to those of SimSwap++(S). The reason is that these models have no means to solve the problem of AdaIN performance degradation, and the reconstruction ability of static operators is also worse than that of our CD-Conv. Furthermore, these schemes utilize conventional distillation strategies (i.e., they rely only on intermediate-layer or output-layer features for the model matching loss, or use a teacher-student homogeneous computational block design) and still retain cumbersome inception-based structures for the student model, leading to difficulty in knowledge distillation and sub-optimal FIE performance. In contrast, our proposed distillation method mitigates these issues well, yielding high-quality face images even under a very large model compression rate (i.e., from 157.7 GFLOPs to 2.5 GFLOPs).

TABLE I
QUANTITATIVE EXPERIMENTS ON FACEFORENSICS++

Fig. 8. Comparison with DeepFakes, FaceSwap, FSGAN, FaceShifter, FaceController, and MobileFaceSwap on the FaceForensics++ dataset. Result images are generated at resolutions of 224 × 224 and 256 × 256.

Last but not least, note that our SimSwap++(S) model reaches a very high frame rate (i.e., 81.4 FPS) at inference, which indicates that it is ready to be deployed on edge devices thanks to its extremely high speed and small model scale.

D. Qualitative Analysis

The synthetic facial images generated at both low resolution and high resolution (by the corresponding models) are visualized and compared as follows. First, Fig. 8 shows low-resolution (i.e., 224 × 224) face editing results from the low-resolution model group, including our preliminary SimSwap, DeepFakes, FSGAN, FaceShifter, MobileFaceSwap, and FaceController, tested on the FaceForensics++ dataset. It is observed that among all low-resolution models, our SimSwap model offers highly visually pleasing results. Even compared with the more complex FaceShifter and FaceController (whose model sizes are much larger than SimSwap's), our SimSwap model still presents better, or at least no worse, visual quality, which validates the good ID editing capability as well as the facial feature preservation power of our framework. Second, Fig. 9 compares various synthetic face results from the high-resolution model group, including our SimSwap++ model family, HifiFace, MegaFS, and FSLSD-HiRes, tested on FaceForensics++ and in-the-wild images (512 × 512). From Fig. 9, it is noticed that the images generated by MegaFS and FSLSD-HiRes are visually unnatural. This is because MegaFS/FSLSD-HiRes achieve HR face identity editing by manipulating the latent codes of a pre-trained StyleGAN model trained on FFHQ, which makes them generalize poorly to datasets other than FFHQ. In other words, since models like MegaFS/FSLSD-HiRes cannot project facial features onto maximally de-coupled attribute sub-spaces/manifolds (i.e., ID, shape, pose, expression, illumination, and others), the latent representations of the various attributes are inter-correlated, and editing certain types of features may cause large confusion/ambiguity for others, resulting in synthetic faces with serious artifacts.

Fig. 9. Comparison with MegaFS [10], HifiFace [4], FSLSD-HiRes [11], and DeepFaceLab [62]. Our method can generate high-quality results while maintaining the attributes of the target images. Our results are generated at an image size of 512 × 512. For fairness, we directly use the images provided in their articles/documents. Please zoom in for a better view.

Most importantly, among all comparison results, it is clearly observed from Fig. 9 that our SimSwap++ model family shows highly competitive visual effects that significantly outperform most prior art. Although SimSwap++(S) has the smallest model size, the visual fidelity it provides is almost the same as that of its teacher models SimSwap++(T/M), and much better than that of competing methods with larger model sizes such as MegaFS and FSLSD-HiRes. This demonstrates the high-fidelity face editing capability offered by our SimSwap++ framework as well as the effectiveness of the proposed model compression techniques. Also note that HifiFace shows good visual effects, but it has high complexity as it explicitly invokes a 3DMM model [27] for facial attribute editing (especially for shape preservation). In contrast, although our model does not introduce additional complexity by using a 3DMM model, it is still able to effectively change ID information with good face shape preservation, thanks to the proposed CD-Conv (i.e., the injected ID feature also well encodes discriminative shape characteristics) as well as our proposed knowledge distillation scheme.

E. Ablation Study

1) Ablation Analysis of Primitive Operators: We test the effect of different computational primitives based on the SimSwap++(S) structure prototype. For each alternative setting, we retrain the SimSwap++(S) structure from scratch on our VGGFace2-HQ dataset. We denote the model that uses classical convolution as the basic operator by "Conv". "DW-Conv" is the model employing depth-wise convolution as the basic operator. "D-Conv" denotes the model employing involution [54] (i.e., a SOTA dynamic convolution) as the basic operator. To maintain fairness, we replace all the BNs in the Encoder and Decoder parts of the above three networks with the same number of AdaIN conditional operators. "Mod-Conv" is the model employing modulated convolution as the basic operator. "CD-Conv" represents the standard SimSwap++(S) model with the proposed CD-Conv primitive operator. We analyze the performance of different operator combinations, and the results are shown in Table II. Our proposed operator is significantly better than the existing operator combinations (e.g., AdaIN+D-Conv, Mod-Conv). DIN [68] is the dynamic instance normalization operator. GhostNet [69] and ShuffleNet [70] respectively represent two variants of DW-Conv with AdaIN. To test the editing performance of the various models at different model sizes, we align all models' computational complexity at different levels (i.e., 20 GFLOPs, 10 GFLOPs, 4 GFLOPs, and 1 GFLOPs) by adjusting the number of channels of each model. At similar complexity, we can therefore fairly compare their FIE performances in terms of pose error and ID retrieval accuracy, tested on the FaceForensics++ dataset. As shown in Fig. 11(b), all architectures suffer significant performance degradation as model sizes continue to decrease.
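The complexity-alignment step described above can be sketched as follows. The helper names (conv_flops, width_for_budget) and the divisible-by-8 rounding are our own illustrative assumptions rather than details of the SimSwap++ code base; the only fact used is that the FLOPs of a convolutional generator grow roughly quadratically with its channel width.

# Hedged sketch: align architectures at a common GFLOPs level by scaling channel width.

def conv_flops(in_ch: int, out_ch: int, k: int, h: int, w: int) -> float:
    # Multiply-accumulate count of one k x k convolution on an h x w feature map.
    return in_ch * out_ch * k * k * h * w

def width_for_budget(base_width: int, base_flops: float, target_flops: float) -> int:
    # FLOPs grow roughly quadratically with channel width, so the width multiplier
    # is the square root of the FLOPs ratio; widths are kept divisible by 8.
    multiplier = (target_flops / base_flops) ** 0.5
    return max(8, 8 * round(base_width * multiplier / 8))

if __name__ == "__main__":
    base = 20e9  # a hypothetical 20-GFLOPs configuration with 64 base channels
    for target in (20e9, 10e9, 4e9, 1e9):
        print(target / 1e9, "GFLOPs ->", width_for_budget(64, base, target), "channels")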


TABLE II
ABLATION STUDIES ON DIFFERENT COMPONENTS IN THE DESIGN OF SIMSWAP++

Fig. 10. SimSwap++ models compared to SimSwap on in-the-wild results. Please zoom in for a better view. More challenging cases and videos can be found on our project website: https://github.com/neuralchen/SimSwapPlus.

Fig. 11. Dataset analysis and performance degradation testing. The proposed VGGFace2-HQ dataset exhibits a significant advantage in training performance; removing manual cleaning, re-labeling, or the real-data sub-collection erodes the performance benefits brought by the proposed dataset. Figure (b) shows the mild performance degradation of our proposed model compression framework, which allows our model to retain excellent editing performance at a small computational cost.


Fig. 12. Analysis of features modulated by CD-Conv and AdaIN respectively. Figure (a) shows the average feature similarity (same content with different IDs)
under the processing of CD-Conv and AdaIN. Figure (b) shows the distribution of features (same content with different IDs) under the processing of CD-Conv and
AdaIN. For the sake of simplicity, we selected 6 IDs from 100 IDs.
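As background for the analysis around Fig. 12 and the paragraphs below, the following is a minimal, illustrative sketch of a conditional dynamic convolution in the spirit of CD-Conv; it is our own simplification and not the released SimSwap++ implementation. The kernel is predicted jointly from the injected ID condition and the pooled content, so one injection predicts on the order of K²G values per sample rather than the CHW-scale modulation used by AdaIN.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalDynamicConv(nn.Module):
    # Illustrative CD-Conv-style operator (assumed structure, not the authors' code):
    # a K x K kernel per channel group is predicted from BOTH the ID condition and the
    # pooled content, then applied as a per-sample depth-wise (grouped) convolution.
    def __init__(self, channels: int, cond_dim: int, k: int = 3, groups: int = 8):
        super().__init__()
        self.channels, self.k, self.groups = channels, k, groups
        # K^2 * G values are predicted per sample (versus CHW-scale AdaIN modulation).
        self.kernel_head = nn.Linear(cond_dim + channels, k * k * groups)

    def forward(self, content: torch.Tensor, condition: torch.Tensor) -> torch.Tensor:
        b, c, h, w = content.shape
        pooled = content.mean(dim=(2, 3))                           # content summary (B, C)
        kernels = self.kernel_head(torch.cat([condition, pooled], dim=1))
        kernels = kernels.view(b * self.groups, 1, self.k, self.k)  # one kernel per group
        kernels = kernels.repeat_interleave(c // self.groups, dim=0)
        out = F.conv2d(content.reshape(1, b * c, h, w), kernels,
                       padding=self.k // 2, groups=b * c)           # batched grouped conv
        return out.reshape(b, c, h, w)

x = torch.randn(2, 64, 32, 32)   # content features
z = torch.randn(2, 512)          # injected ID embedding
print(ConditionalDynamicConv(64, 512)(x, z).shape)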

DW-Conv shows the most severe model degradation, which also shows that its representation capacity is limited. Our proposed CD-Conv exhibits superior performance robustness, and its degradation rate is significantly slower than that of Conv, DW-Conv, Mod-Conv, GhostNet, ShuffleNet, and D-Conv, since CD-Conv successfully encodes condition-aware and content-aware information uniformly. Although the ID injection performance of DIN is similar to that of our CD-Conv, its reconstruction performance degrades seriously. The reason is that DIN's kernel prediction depends only on the condition and not on the content, which makes it unable to adapt to the content. Most importantly, among all these primitive operators, the proposed CD-Conv presents the best trade-off between efficiency and visual editing quality.

2) Analysis of Condition Injection of CD-Conv: As discussed above, both AdaIN and Mod-Conv suffer severe performance degradation (i.e., in conditional injection) as the model size becomes smaller. In this section, we delve into the reasons behind this. Compared with AdaIN + D-Conv/Conv/DW-Conv, the advantage of our CD-Conv comes from the fact that we directly and conditionally encode the convolution kernel through the K²G matrix, which can inject higher-dimensional/higher-variance information (AdaIN/Mod-Conv can only provide CHW combinations; note that Mod-Conv is a variant of AdaIN+Conv that integrates AdaIN into the convolution operation. In contrast, the support set of CD-Conv, i.e., the set spanned by possible combinations of conditions, is related not only to the injected conditions but also to the content, yielding K⁴G²HW possible combinations, far more than what AdaIN/Mod-Conv can provide). It should be pointed out that the essence of condition injection is to reduce the correlation of feature distributions under different conditions. In light of this, we design the following experiments to confirm the above understanding. We extract 100 source images and 100 target images from FF++, and we employ cosine similarity to measure the similarity of target features with different source IDs after multiple injections. It can be learned from Fig. 12(a) that CD-Conv can make features significantly different under different injection times and model scales, which confirms that our CD-Conv can provide a richer combination. In addition, we used t-SNE to cluster the features corresponding to six different IDs. It can be seen from Fig. 12(b) that the features after CD-Conv perturbation show more obvious separation than those processed by AdaIN. Furthermore, injecting conditions with CD-Conv requires less computation than AdaIN (K²G versus CHW), which allows us to inject conditions more times to obtain superior editing performance.

3) Ablation Analysis on Morphable Knowledge Distillation: Besides adopting CD-Conv, we also test our proposed MKD scheme. As illustrated in Fig. 11(b), for each GFLOPs model scale, we apply our MKD scheme to the standard SimSwap++(S) model. Note that after distillation, the rate of model degradation is satisfactorily alleviated. Compared with the response-based scheme and CAT, our distillation architecture exhibits more robust degradation properties. As can be seen from Fig. 11(b), MKD shows significant performance advantages in both pose error and ID retrieval metrics at all model scales (e.g., 97.5% versus 96.7% @ 4 GFLOPs). We further conduct an in-depth analysis of the heterogeneous design of MKD. We compare MKD with the homogeneously designed CAT and naive feature-based distillation without MDS by comparing the intermediate feature similarity between the student and teacher models. Kernel alignment [64], [65], which is similar to cosine similarity, is used as the metric of feature similarity: the closer it is to 1, the more similar the features are, and vice versa. In Fig. 13, 'Branchx' represents the feature similarity of a single branch inside the inception-like structure (i.e., MDS, inception block), and 'MDS' and 'Homogeneous' both represent the output feature similarity of the entire block. As can be seen in the second row, the similarity distribution inside the inception-like module in the homogeneous distillation scheme is highly close to the distribution of the entire module, indicating that the homogeneous design forces each branch to approximate the corresponding branch in the teacher model with a low degree of learning freedom. In contrast, the branch similarity distribution of MDS (shown in the first row of Fig. 13) is obviously different from that of the entire block, which illustrates that the diversity within the MDS branches is higher than in the homogeneous design, indicating a higher degree of learning freedom. More importantly, in terms of the similarity of the entire module (shown in the 2nd row and 4th column of Fig. 13), MDS is better than the homogeneous scheme.
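For completeness, the feature-similarity score used above can be computed as centered (linear) kernel alignment in the spirit of [64], [65]. The sketch below assumes that student and teacher features have been flattened to (num_samples, dim) matrices; it is not the paper's exact evaluation code.

import torch

def kernel_alignment(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    # Centered (linear) kernel alignment between feature matrices of shape
    # (num_samples, dim_a) and (num_samples, dim_b). Returns a score in [0, 1];
    # values close to 1 mean the two representations are highly similar.
    a = feat_a - feat_a.mean(dim=0, keepdim=True)  # center every feature dimension
    b = feat_b - feat_b.mean(dim=0, keepdim=True)
    cross = torch.linalg.norm(a.T @ b) ** 2        # ||A^T B||_F^2
    self_a = torch.linalg.norm(a.T @ a)            # ||A^T A||_F
    self_b = torch.linalg.norm(b.T @ b)            # ||B^T B||_F
    return cross / (self_a * self_b)

# Example: similarity between hypothetical student and teacher block outputs.
student = torch.randn(10000, 256)
teacher = torch.randn(10000, 512)
print(float(kernel_alignment(student, teacher)))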


Fig. 13. Feature similarity distribution in the intermediate layer of teacher network and student network. We employ the kernel alignment [64], [65] as the
similarity metric. We select 10000 images from VGGFace2-HQ as test samples. Feature KD represents a naive method of using intermediate layer features for
distillation.

TABLE III
STATISTICAL ANALYSIS BETWEEN VGGFACE2-HQ, CELEBA-HQ AND FFHQ DATASET

The above results verify that our proposed MDS has a higher degree of learning freedom, and at the same time the structure pairs play a mutually compensable role, making the final matching ability better. As shown in Fig. 10, even the smallest model, SimSwap++(S), still shows a significant visual improvement compared to the previous version of SimSwap. On the contrary, as observed from Fig. 10, small models without distillation suffer from unnatural generation effects; e.g., the 7th column in Fig. 10 shows poor result consistency (e.g., face color, facial shape). These results demonstrate that the distillation scheme plays a vital role in visual effect improvement, since MKD with multiple heterogeneous teacher-student blocks allows better matching.

F. Proof of Value for VGGFace2-HQ Dataset

Our proposed VGGFace2-HQ consists of a natural face sub-collection and a synthetic image sub-collection. From Table III, it can be seen that VGGFace2-HQ not only has far more images than Celeba-HQ and FFHQ but also has extremely diverse scenes. For example, scenes such as expressions, glasses, and hats are an order of magnitude richer than in Celeba-HQ and FFHQ. Our data exhibits very high image quality, ready to help with face swapping tasks and beyond (e.g., 3D reconstruction of faces). To demonstrate the value of the proposed VGGFace2-HQ for high-resolution FIE model learning, we train our SimSwap-512 model with VGGFace2-HQ, VGGFace2-HQ (ori), Celeba-HQ, and FFHQ respectively, and compare their FIE performance. Note that VGGFace2-HQ (ori) means without data cleaning/screening, ID re-annotation, and the natural sub-collection (i.e., to show the criticality of dataset purity). We randomly select 500 faces of different identities from FF++ and exchange them with each other using the trained SimSwap-512 models based on the different training datasets. We measure the identity cosine similarity of all source-target image pairs based on the ArcFace output, and the scores are shown in Fig. 11(a). From Fig. 11(a), we see that the model trained with the VGGFace2-HQ dataset has the best average ID similarity score as well as stable performance, which demonstrates the great value of our VGGFace2-HQ in high-resolution FIE model training. Not surprisingly, uncleaned VGGFace2-HQ degrades performance and robustness, which indicates that noisy data/labels are harmful to high-fidelity FIE model learning. At the same time, the lack of real data also hurts the generalization performance of the model. Moreover, models trained with a single-pose dataset (i.e., Celeba-HQ and FFHQ) perform worse, as sufficient image variation/diversity (pose, expression, lighting condition, shape, and others) is very important for learning high-quality models. In contrast, our VGGFace2-HQ dataset possesses a large diversity of poses and IDs, facilitating high-performance model training.
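The identity-similarity protocol described above boils down to an average cosine similarity between ArcFace [48] embeddings. The sketch below illustrates it; arcface_embed and swap_model are hypothetical stand-ins for an ArcFace-style encoder and a trained face-swapping generator, not the authors' released evaluation script.

import torch
import torch.nn.functional as F

def id_similarity(swapped_embeds: torch.Tensor, source_embeds: torch.Tensor) -> float:
    # Average cosine similarity between identity embeddings of swapped results and
    # their source faces; higher means identity is transferred more faithfully.
    sims = F.cosine_similarity(swapped_embeds, source_embeds, dim=1)
    return sims.mean().item()

# Hypothetical usage: `arcface_embed` maps aligned face crops (N, 3, 112, 112) to
# identity embeddings (N, 512); `swap_model` is the trained swapping generator.
# swapped_embeds = arcface_embed(swap_model(targets, sources))
# source_embeds = arcface_embed(sources)
# print(id_similarity(swapped_embeds, source_embeds))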


VII. CONCLUSION

In this article, we propose two cooperating means to achieve the compression of the face identity editing model. The proposed CD-Conv can simultaneously utilize content-dependent and condition-dependent information in a unified manner, which overcomes the problems of the inefficiency of conditional injection and spatial information processing. More importantly, we propose morphable knowledge distillation, which can achieve further structural simplification after distillation. Noticing the lack of specially designed high-quality face swapping datasets, we present the VGGFace2-HQ dataset, ready to help with high-quality face swapping tasks and beyond. Extensive experiments have proved that our proposed SimSwap++ framework can achieve face swapping with a very high speed (i.e., up to 81 FPS), which provides options for low-resource-cost and real-time applications.

REFERENCES

[1] R. Chen, X. Chen, B. Ni, and Y. Ge, "SimSwap: An efficient framework for high fidelity face swapping," in Proc. 28th ACM Int. Conf. Multimedia, Seattle, WA, USA, 2020, pp. 2003–2011.
[2] Y. Nirkin, Y. Keller, and T. Hassner, "FSGAN: Subject agnostic face swapping and reenactment," 2019, arXiv:1908.05932.
[3] L. Li, J. Bao, H. Yang, D. Chen, and F. Wen, "FaceShifter: Towards high fidelity and occlusion aware face swapping," 2019, arXiv:1912.13457.
[4] Y. Wang et al., "HifiFace: 3D shape and semantic prior guided high fidelity face swapping," in Proc. Int. Joint Conf. Artif. Intell., 2021, pp. 1136–1142.
[5] J. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2242–2251.
[6] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman, "VGGFace2: A dataset for recognising faces across pose and age," in Proc. IEEE 13th Int. Conf. Autom. Face Gesture Recognit., 2018, pp. 67–74.
[7] T. Karras, T. Aila, S. Laine, and J. Lehtinen, "Progressive growing of GANs for improved quality, stability, and variation," 2017, arXiv:1710.10196.
[8] T. Karras, S. Laine, and T. Aila, "A style-based generator architecture for generative adversarial networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 4396–4405.
[9] X. Wang, Y. Li, H. Zhang, and Y. Shan, "Towards real-world blind face restoration with generative facial prior," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 9164–9174.
[10] Y. Zhu, Q. Li, J. Wang, C. Xu, and Z. Sun, "One shot face swapping on megapixels," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 4834–4844.
[11] Y. Xu, B. Deng, J. Wang, Y. Jing, J. Pan, and S. He, "High-resolution face swapping via latent semantics disentanglement," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 7632–7641.
[12] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, "Analyzing and improving the image quality of StyleGAN," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 8107–8116.
[13] T. Karras et al., "Alias-free generative adversarial networks," in Proc. Adv. Neural Inf. Process. Syst., 2021, pp. 852–863.
[14] Y. Shen, C. Yang, X. Tang, and B. Zhou, "InterFaceGAN: Interpreting the disentangled face representation learned by GANs," IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 4, pp. 2004–2018, Apr. 2022.
[15] Y. Xu, Y. Shen, J. Zhu, C. Yang, and B. Zhou, "Generative hierarchical features from synthesizing images," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 4430–4430.
[16] O. Tov, Y. Alaluf, Y. Nitzan, O. Patashnik, and D. Cohen-Or, "Designing an encoder for StyleGAN image manipulation," ACM Trans. Graph., vol. 40, 2021, Art. no. 133.
[17] E. Richardson et al., "Encoding in style: A StyleGAN encoder for image-to-image translation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 2287–2296.
[18] Z. Xu et al., "FaceController: Controllable attribute editing for face in the wild," in Proc. 35th AAAI Conf. Artif. Intell., 2021, pp. 3083–3091.
[19] X. Huang and S. J. Belongie, "Arbitrary style transfer in real-time with adaptive instance normalization," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 1510–1519.
[20] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2818–2826.
[21] Q. Jin et al., "Teachers do more than teach: Compressing image-to-image models," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 13595–13606.
[22] A. G. Howard et al., "MobileNets: Efficient convolutional neural networks for mobile vision applications," 2017, arXiv:1704.04861.
[23] X. Ding, X. Zhang, J. Han, and G. Ding, "Diverse branch block: Building a convolution as an inception-like unit," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 10881–10890.
[24] A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner, "FaceForensics: Learning to detect manipulated facial images," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 1–11.
[25] J. Bao, D. Chen, F. Wen, H. Li, and G. Hua, "Towards open-set identity preserving face synthesis," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 6713–6722.
[26] S. Liu, J. Ye, S. Ren, and X. Wang, "DynaST: Dynamic sparse transformer for exemplar-guided image generation," in Proc. Eur. Conf. Comput. Vis., S. Avidan, G. J. Brostow, M. Cissé, G. M. Farinella, and T. Hassner, Eds., 2022, pp. 72–90.
[27] V. Blanz and T. Vetter, "A morphable model for the synthesis of 3D faces," in Proc. 26th Annu. Conf. Comput. Graph. Interactive Techn., 1999, pp. 187–194.
[28] J. Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, and M. Nießner, "Face2Face: Real-time face capture and reenactment of RGB videos," 2020, arXiv:2007.14808.
[29] Y. Nirkin, I. Masi, A. T. Tran, T. Hassner, and G. G. Medioni, "On face segmentation, face swapping, and face perception," 2017, arXiv:1704.06729.
[30] Z. Xu et al., "FaceController: Controllable attribute editing for face in the wild," in Proc. AAAI Conf. Artif. Intell., 2021, pp. 3083–3091.
[31] R. Natsume, T. Yatagawa, and S. Morishima, "RSGAN: Face swapping and editing using face and hair representation in latent spaces," in Proc. ACM SIGGRAPH Posters, 2018, Art. no. 69.
[32] "Deepfakes," 2018. [Online]. Available: https://github.com/deepfakes/faceswap
[33] X. Yang, D. Zhou, S. Liu, J. Ye, and X. Wang, "Deep model reassembly," in Proc. Adv. Neural Inf. Process. Syst., 2022, pp. 25739–25753.
[34] X. Yang, D. Zhou, J. Feng, and X. Wang, "Diffusion probabilistic model made slim," 2022, arXiv:2211.17106.
[35] G. E. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," 2015, arXiv:1503.02531.
[36] A. Aguinaldo, P. Chiang, A. Gain, A. Patil, K. Pearson, and S. Feizi, "Compressing GANs using knowledge distillation," 2019, arXiv:1902.00159.
[37] H. Chen et al., "Distilling portable generative adversarial networks for image translation," 2020, arXiv:2003.03519.
[38] Y. Fu, W. Chen, H. Wang, H. Li, Y. Lin, and Z. Wang, "AutoGAN-distiller: Searching to compress generative adversarial networks," 2020, arXiv:2006.08198.
[39] M. Li, J. Lin, Y. Ding, Z. Liu, J. Zhu, and S. Han, "GAN compression: Efficient architectures for interactive conditional GANs," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 5283–5293.
[40] Y. Liu, Z. Shu, Y. Li, Z. Lin, F. Perazzi, and S. Kung, "Content-aware GAN compression," 2021, arXiv:2104.02244.
[41] H. Wang, S. Gui, H. Yang, J. Liu, and Z. Wang, "GAN slimming: All-in-one GAN compression by a unified optimization framework," 2020, arXiv:2008.11062.
[42] Z. Liu, P. Luo, X. Wang, and X. Tang, "Deep learning face attributes in the wild," in Proc. Int. Conf. Comput. Vis., 2015, pp. 3730–3738.
[43] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proc. 32nd Int. Conf. Mach. Learn., F. R. Bach and D. M. Blei, Eds., 2015, pp. 448–456.
[44] A. L. Maas, "Rectifier nonlinearities improve neural network acoustic models," Proc. Int. Conf. Mach. Learn., vol. 30, no. 1, p. 3, 2013.
[45] X. Li, C. Chen, S. Zhou, X. Lin, W. Zuo, and L. Zhang, "Blind face restoration via deep multi-scale component dictionaries," in Proc. 16th Eur. Conf. Comput. Vis., Glasgow, 2020, pp. 399–415.
[46] X. Glorot, A. Bordes, and Y. Bengio, "Deep sparse rectifier neural networks," in Proc. 14th Int. Conf. Artif. Intell. Statist., G. J. Gordon, D. B. Dunson, and M. Dudík, Eds., 2011, pp. 315–323.
[47] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7132–7141.
[48] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, "ArcFace: Additive angular margin loss for deep face recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 4685–4694.
[49] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.


[50] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro, "High-resolution image synthesis and semantic manipulation with conditional GANs," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 8798–8807.
[51] T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz, and B. Catanzaro, "High-resolution image synthesis and semantic manipulation with conditional GANs," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 8798–8807.
[52] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein GAN," 2017, arXiv:1701.07875.
[53] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, "Improved training of Wasserstein GANs," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5769–5779.
[54] D. Li et al., "Involution: Inverting the inherence of convolution for visual recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 12316–12325.
[55] J. Dai et al., "Deformable convolutional networks," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 764–773.
[56] T. Wei, C. Wang, Y. Rui, and C. W. Chen, "Network morphism," in Proc. 33rd Int. Conf. Mach. Learn., 2016, pp. 564–572.
[57] X. Ding, X. Zhang, N. Ma, J. Han, G. Ding, and J. Sun, "RepVGG: Making VGG-style convnets great again," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 13728–13737.
[58] Z. Xu et al., "MobileFaceSwap: A lightweight framework for video face swapping," in Proc. 36th AAAI Conf. Artif. Intell., AAAI Press, 2022, pp. 2973–2981.
[59] T. R. Shaham, M. Gharbi, R. Zhang, E. Shechtman, and T. Michaeli, "Spatially-adaptive pixelwise networks for fast image translation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 14877–14886.
[60] P. Wang et al., "QGAN: Quantized generative adversarial networks," 2019, arXiv:1901.08263.
[61] A. Sauer, K. Chitta, J. Müller, and A. Geiger, "Projected GANs converge faster," in Proc. Adv. Neural Inf. Process. Syst., 2021, pp. 17480–17492.
[62] "DeepFaceLab," 2019. [Online]. Available: https://github.com/iperov/DeepFaceLab
[63] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, "FitNets: Hints for thin deep nets," 2014, arXiv:1412.6550.
[64] C. Cortes, M. Mohri, and A. Rostamizadeh, "Algorithms for learning kernels based on centered alignment," J. Mach. Learn. Res., vol. 13, pp. 795–828, 2012.
[65] N. Cristianini, J. Shawe-Taylor, A. Elisseeff, and J. S. Kandola, "On kernel-target alignment," in Proc. Adv. Neural Inf. Process. Syst., MIT Press, 2001, pp. 367–373.
[66] Y. Zheng et al., "General facial representation learning in a visual-linguistic manner," 2021, arXiv:2112.03109.
[67] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014, arXiv:1412.6980.
[68] Y. Jing et al., "Dynamic instance normalization for arbitrary style transfer," in Proc. 34th AAAI Conf. Artif. Intell., 2020, pp. 4369–4376.
[69] K. Han, Y. Wang, Q. Tian, J. Guo, C. Xu, and C. Xu, "GhostNet: More features from cheap operations," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 1577–1586.
[70] X. Zhang, X. Zhou, M. Lin, and J. Sun, "ShuffleNet: An extremely efficient convolutional neural network for mobile devices," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 6848–6856.

Xuanhong Chen received the MAEng degree in control science and engineering from Zhejiang University, Zhejiang, China, in 2018, and he is currently working toward the PhD degree with the Department of Electronic Engineering, Shanghai Jiao Tong University, under the supervision of Prof. Bingbing Ni. His research interests mainly lie in computer vision, computer graphics, and generative model compression. He was a recipient of the Top Paper Award (i.e., Best Paper Nomination) from ACM Multimedia 2022.

Bingbing Ni received the BEng degree in electrical engineering from Shanghai Jiao Tong University, Shanghai, China, in 2005, and the PhD degree from the National University of Singapore, Singapore, in 2011. He is currently a professor with the Department of Electrical Engineering, Shanghai Jiao Tong University. Before that, he was a research scientist with the Advanced Digital Sciences Center, Singapore. He was with Microsoft Research Asia, Beijing, China, as a research intern, in 2009. He was also a software engineer intern with Google Inc., Mountain View, California, in 2010. He was a recipient of the Best Paper Award from PCM'11 and the Best Student Paper Award from PREMIA'08. He was also the recipient of the first prize in the International Contest on Human Activity Recognition and Localization in conjunction with the International Conference on Pattern Recognition, in 2012.

Yutian Liu is currently working toward the undergraduate degree with the Department of Computer Science and Engineering, Shanghai Jiao Tong University, and she also participates in the Zhiyuan Honors Program. Her research interests mainly lie in network compression and image/video generation.

Naiyuan Liu received the BEng degree from the Department of Automation, Guangdong University of Technology, and he is currently working toward the master's degree with the Australian Artificial Intelligence Institute, University of Technology Sydney. His research interests mainly lie in multimodal content analysis and facial attribute editing.

Zhilin Zeng is currently working toward the undergraduate degree, majoring in artificial intelligence, with Shanghai Jiao Tong University. His research interests mainly lie in computer vision and image translation.

Hang Wang received the BEng and MAEng degrees in electronic engineering from Shanghai Jiao Tong University under the supervision of Prof. Bingbing Ni. He is currently an algorithm researcher with Huawei Hisilicon, Shanghai, China. His research interests mainly lie in visual recognition and transfer learning.

