Few-Shot Image Generation Via Style Adaptation and Content Preservation
I. INTRODUCTION
GENERATIVE adversarial networks (GANs) learn to map a simple predefined distribution to a complex real image distribution. Despite their great success in many areas of computer vision, including image manipulation [1], image-to-image translation [2], [3], [4], [5], [6], and image compression [7], GANs require a large amount of training data and time to achieve high-quality images. Therefore, few-shot generative model adaptation has been proposed, which aims to transfer a pretrained source generative model to a target domain with extremely limited examples (e.g., ten images), as shown in Fig. 1. The practical importance of this task is twofold: 1) in some domains such as painting, it is very difficult to obtain enough data to meet the training requirements of GANs; and 2) a well-trained GAN holds a wealth of knowledge about images, which can be leveraged to train GANs in similar domains and significantly reduce the training time (from weeks to a few hours).

To achieve this, fine-tuning-based methods have been proposed, where people only fine-tune a part of the model parameters or train a few additional parameters [8], [9], [10], [11]. Most of these methods, however, still require hundreds of training images. When the target samples are limited to 10, they are prone to overfitting and fail to inherit the diversity from the source domain.

To address these issues, recent methods tried to constrain the transfer process based on some assumed correspondence between two images. Ojha et al. [12] proposed to preserve the differences of relative similarities between instances via a cross-domain correspondence (CDC) loss and a patch discriminator. Xiao et al. [13] proposed a relaxed spatial structural alignment (RSSA) method and tried to project the original latent space to a narrow subspace close to the target domain to accelerate training. These methods can generate diverse and realistic images with limited data. However, these predefined correspondence losses are either relatively weak in diversity preservation (CDC) or overemphasize diversity (RSSA) and sacrifice style adaptation, limiting the generation performance on the target domain (see Sections III and IV).

Diffusion-based methods [14], [15] have recently achieved remarkable success in various vision tasks. Custom diffusion methods [16], [17] fine-tune a pretrained diffusion model to generate personalized outputs based on user-provided prompts. However, Stable Diffusion and other diffusion-based visual
A. Contributions
Our main contribution is a novel paired image reconstruction
method to balance style adaptation and content
preservation, transferring diversity from the source domain to
the target domain. Qualitative and quantitative results show
that our method produces the best results in a variety of
settings.
Fig. 3. Overview of the proposed framework. Our approach uses a translation module F for content preservation and a discriminator D for style adaptation. F takes a pair of images (shown in the red box) generated with the same latent code as input and tries to reconstruct them. We then use the reconstruction loss $\mathcal{L}_{\mathrm{rec}}$ to encourage these paired images to have the same content.
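To make the paired reconstruction idea concrete, the following is a minimal sketch of how such a loss could be assembled. It assumes a frozen source generator `G_src`, the adapted target generator `G_tgt`, a translation module `F` called as `F(content_image, style_image)`, and a perceptual distance `lpips_fn`; these names and the cross-wise way the pair is fed to `F` are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the paired-image reconstruction loss (hypothetical names).
def paired_reconstruction_loss(G_src, G_tgt, F, lpips_fn, z):
    x_src = G_src(z)   # source-domain image
    x_tgt = G_tgt(z)   # target-domain image generated from the same latent code
    # One plausible arrangement: rebuild each image from the other's content
    # and its own style, so the pair is pushed to share the same content.
    x_src_rec = F(x_tgt, x_src)   # content from x_tgt, style from x_src
    x_tgt_rec = F(x_src, x_tgt)   # content from x_src, style from x_tgt
    return lpips_fn(x_src_rec, x_src).mean() + lpips_fn(x_tgt_rec, x_tgt).mean()
```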
C. Learning
We train the whole framework by solving the following optimization objective:

$$\min_{G_T, F} \max_{D} \; \mathcal{L}_{\mathrm{adv}}(D, G_T) + \lambda_1 \mathcal{L}_{\mathrm{rec}}(G_T) + \lambda_2 \mathcal{L}'_{\mathrm{rec}}(F) \quad (2)$$

where $\mathcal{L}_{\mathrm{adv}}$ refers to the GAN loss, and $\mathcal{L}_{\mathrm{rec}}$ and $\mathcal{L}'_{\mathrm{rec}}$ refer to the reconstruction losses used for the target generator $G_T$ and the translator $F$, respectively. The GAN loss is an adversarial loss given by

$$\mathcal{L}_G = -\mathbb{E}_{z \sim p(z)}\big[\log D(G_T(z))\big]$$
$$\mathcal{L}_D = \mathbb{E}_{x \sim \mathcal{D}_t}\big[\log\big(1 - D(x)\big)\big] + \mathbb{E}_{z \sim p(z)}\big[\log D(G_T(z))\big]. \quad (3)$$

Fig. 6. Learning process in the view of the domain. $\mathcal{D}_i$ is the corresponding domain to $G_T$ in each iteration.
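For concreteness, a minimal sketch of one optimization round for (2) and (3) is given below. It assumes PyTorch, a discriminator `D` that outputs probabilities in (0, 1), and externally defined reconstruction losses `L_rec` / `L_rec_prime` and optimizers; the default weights and the function signatures are placeholders rather than the authors' training code.

```python
import torch

# Hypothetical single training step for Eq. (2)-(3); names are illustrative.
def train_step(G_T, F, D, real_batch, z, L_rec, L_rec_prime,
               opt_D, opt_GF, lambda1=1.0, lambda2=1.0):
    eps = 1e-8  # numerical safety inside the logs

    # Discriminator step: minimize L_D of Eq. (3) with the generator frozen.
    with torch.no_grad():
        fake = G_T(z)
    loss_D = torch.log(1.0 - D(real_batch) + eps).mean() \
             + torch.log(D(fake) + eps).mean()
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator/translator step: minimize Eq. (2) with respect to G_T and F.
    fake = G_T(z)
    loss_G = -torch.log(D(fake) + eps).mean()          # L_G of Eq. (3)
    loss_GF = loss_G + lambda1 * L_rec(G_T, z) + lambda2 * L_rec_prime(F, z)
    opt_GF.zero_grad(); loss_GF.backward(); opt_GF.step()
    return loss_D.item(), loss_GF.item()
```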
Fig. 7. Comparison results with different baselines on FFHQ → Sketches and FFHQ → FFHQ-babies. For AdaIN, we randomly choose a real image as the style image. For the other GAN-based methods, we keep the latent code the same (across columns). Our method generates results of higher quality and diversity, which better correspond to the source domain images.
where neither CDC nor RSSA can preserve diverse mouth poses. They can even have some strange distortions in such settings. This indicates that the assumed correspondence losses cannot properly balance style adaptation and content preservation.

Our method achieves the best results in these two adaptations. As depicted in Fig. 7, in the FFHQ → Sketches adaptation, our output images better capture the diverse facial expressions without bringing in color and better address the light and shadow issues; in the FFHQ → FFHQ-babies adaptation, our method preserves refined details such as the mouth poses and does not output unnatural distortions.

We further compare our method with CDC and RSSA on the LSUN Cats → LSUN Spaniels and LSUN Churches → Van Gogh Houses adaptations, as shown in Fig. 8. We can see that RSSA fails to address the adaptation between two relatively distant domains such as LSUN Cats → LSUN Spaniels. Even between a more related pair of domains such as LSUN Churches → Van Gogh Houses, there still exist undesired distortions in RSSA's outputs. On the other hand, CDC can generate acceptable results in these two settings. Our method outperforms both of them, indicating that it can properly balance style adaptation and content preservation across various source → target adaptations.
Fig. 8. Comparison results with CDC and RSSA on LSUN Cats → LSUN Spaniels and LSUN Churches → Van Gogh Houses.
In Fig. 9, we show more results with different source → target adaptation settings. Supervised by the paired image reconstruction loss, the generator can produce images that successfully preserve diversity from the source domain while the style fits the target domain.

2) Quantitative Comparison: We use three datasets with abundant data that meet the requirement of evaluation: the original Sketches, FFHQ-babies, and LSUN-spaniels datasets, which contain 289, 2492, and 188 images, respectively.

The lower the FID score, the better the quality of the generated images, as it indicates a smaller distance between the distributions of real and generated images in the feature space. Table I shows the FID score of different methods. Our method achieves the best FID score on the three datasets, indicating that our method generates images that best model the true distribution of the target domain. However, FID does not directly address the diversity of the generated images or the potential issue of overfitting to the training data. Therefore, we use it as a measure of generation quality, i.e., a measure of style adaptation.

TABLE II: Intracluster pairwise LPIPS distance (↑). CDC shows a relatively lower distance (less diversity), suggesting its weakness in content preservation.

TABLE III: Balance index (↑) for quality and diversity.

We use the intracluster pairwise LPIPS distance [12] to measure the diversity level. To capture this, we assign each of the 5000 generated images to one of the k training images by using the lowest LPIPS distance. We then compute the average pairwise LPIPS distance within members of the same cluster and then average over the k clusters. A method that reproduces the original images exactly will have a score of zero by this metric. Higher LPIPS distances between generated images suggest more diversity and distinctiveness among the generated samples. Therefore, we use it as a measure of generation diversity, i.e., a measure of content preservation.
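The clustering-and-averaging procedure just described can be summarized in a short sketch. It assumes a callable `lpips_fn(a, b)` returning the LPIPS distance between two image tensors (e.g., from the `lpips` package) and lists of generated and training images; the function name and looping style are illustrative, and in practice the pairwise term would normally be batched on the GPU.

```python
from itertools import combinations

# Hypothetical sketch of the intracluster pairwise LPIPS distance.
def intracluster_lpips(generated, training, lpips_fn):
    # Assign each generated image to its closest training image (lowest LPIPS).
    clusters = [[] for _ in training]
    for g in generated:
        dists = [float(lpips_fn(g, t)) for t in training]
        clusters[dists.index(min(dists))].append(g)

    # Average pairwise LPIPS within each cluster, then average over clusters.
    scores = []
    for members in clusters:
        pairs = list(combinations(members, 2))
        if pairs:
            scores.append(sum(float(lpips_fn(a, b)) for a, b in pairs) / len(pairs))
        else:
            scores.append(0.0)  # empty or single-member cluster contributes no diversity
    return sum(scores) / len(scores)
```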
Fig. 9. Results of different adaptation settings: (i) FFHQ → Raphael Paintings; (ii) LSUN Churches → Haunted Houses; and (iii) LSUN Cars → Wrecked
Cars.
Fig. 10. Comparison results with CDC and RSSA on FFHQ → Sketches in one-shot and five-shot settings.
As shown in Table II, our method consistently achieves a higher average LPIPS distance than CDC. On the other hand, although RSSA achieves a higher LPIPS distance in the FFHQ → Sketches and Cats → Spaniels adaptations, it gets very high FID scores in these adaptations (meaning the learned distribution is very different from the target domain). This is consistent with our previous analysis: CDC is relatively weak in content preservation and fails to generate more diversity (it has the lowest LPIPS distance), while RSSA overemphasizes diversity and fails to generate a distribution similar to the target domain (it has a very high FID score).

To assess the combined performance of the model in terms of diversity and quality more clearly, we propose a balance metric incorporating the FID score (FID) and the LPIPS distance (LD) as follows:

$$\mathrm{balance} = \frac{100 \cdot \mathrm{LD}}{\mathrm{FID}}. \quad (6)$$

A higher score means better performance in balancing quality and diversity. The comparison results are shown in Table III; we can see that our method outperforms CDC and RSSA.
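Eq. (6) amounts to a one-line computation once FID and the intracluster LPIPS distance have been obtained; the small helper below is a sketch with illustrative names.

```python
# Balance index of Eq. (6): higher is better (high diversity, low FID).
def balance_index(fid: float, lpips_distance: float) -> float:
    return 100.0 * lpips_distance / fid
```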
Fig. 11. Comparison results with CDC and RSSA on LSUN Churches → Van Gogh Houses in one-shot and five-shot settings.
Fig. 12. Results of the translation module after training. The module takes the first row as the content images and the second row as the style images. The outputs show that, after training, the translation module is capable of separating style and content and translating images from the source domain to the target domain.
B. Ablation Study

1) Effect of Target Dataset Size: We further explore the effectiveness of our method compared to CDC and RSSA in one-shot and five-shot settings.

Fig. 10 shows the results on the FFHQ → Sketches adaptation. For the one-shot setting, we can observe that RSSA is so strong at content preservation that, with only one example, the outputs can still resemble the corresponding images from the source domain. However, it falls short in style adaptation, as it introduces color that is not consistent with the sketch style. On the other hand, CDC and our method just output similar faces with different poses. Our method can generate more diverse poses than CDC, such as the poses of the mouth.

The evaluation of one-shot generation is very vague, for it is very hard to tell what is style and what is content. The results of the five-shot setting are similar to those of the ten-shot setting. We can see that the results of CDC are not very identical to the corresponding images from the source domain, indicating its shortness in content preservation. On the other hand, RSSA can generate outputs resembling the original images but brings some undesired color, suggesting its weakness in style adaptation. Our method can generate diverse outputs without bringing in any color.

Fig. 11 shows the results on the LSUN Churches → Van Gogh Houses adaptation. For the one-shot setting, we can observe that our method and RSSA can generate churches with different shapes while CDC can only output similar churches. For the five-shot setting, we can see that both CDC and RSSA have unnatural distortions while our method can generate churches similar to the original images and successfully adapts the style to Van Gogh paintings.

TABLE IV: FID score on different shots on the FFHQ → Sketches adaptation. The first row refers to fine-tuning StyleGAN using only the discriminative loss.

In addition, we explored the range of dataset sizes suitable for the content preservation loss on the FFHQ → Sketches adaptation. As depicted in Table IV, when the number of samples in the target domain is fifty or more, CDC, RSSA, and our method perform worse than directly fine-tuning StyleGAN using only the discriminative loss, which tells us that the content preservation loss is only applicable in few-shot generation scenarios of a certain size.

2) Translation Module: We examine the output of the translation module to evaluate its function. As shown in Fig. 12, after training, the translation module is capable of separating style and content and translating images from the source domain to the target domain.
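As a rough illustration of what such a translation module can look like, the sketch below pairs a spatial content encoder with a global style encoder and an AdaIN-like modulation in the decoder. The concrete layer choices, channel sizes, and modulation scheme are assumptions for exposition and do not reproduce the authors' exact architecture.

```python
import torch
import torch.nn as nn

# Illustrative content/style translation module (assumed design, not the paper's).
class TranslationModule(nn.Module):
    def __init__(self, channels=64, style_dim=128):
        super().__init__()
        self.content_enc = nn.Sequential(               # spatial content code
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU())
        self.style_enc = nn.Sequential(                  # global style code
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, style_dim))
        self.modulate = nn.Linear(style_dim, 2 * channels)  # AdaIN-like scale/shift
        self.dec = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.Conv2d(channels, 3, 3, padding=1))

    def forward(self, content_img, style_img):
        c = self.content_enc(content_img)                 # content of the first input
        s = self.style_enc(style_img)                     # style of the second input
        scale, shift = self.modulate(s).chunk(2, dim=1)
        c = c * (1 + scale[:, :, None, None]) + shift[:, :, None, None]
        return self.dec(c)                                # image with swapped style
```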
3) Architecture Decisions of the Translation Module: It is possible to use a generic image-to-image translation module for reconstruction, i.e., let the translation module learn the target style and only feed it with the content image. However, this will increase its training difficulty, since the translation module then needs to quickly establish a mapping between the current domain and the source domain in each iteration. As shown in Fig. 13, the module failed to properly separate style and content, resulting in bad generation performance.

4) Choice of Reconstruction Loss: We explore four kinds of reconstruction loss in our experiments: i) compute the l1 loss between the original images and the reconstructed images; ii) compute the LPIPS loss between the original images and the reconstructed images; iii) compute the l1 loss on the content codes and the style codes; and iv) put the reconstructed images into the corresponding discriminator and use the adversarial loss as the reconstruction loss. The results are shown in Fig. 14. We can see that putting the l1 loss on the two images results in overfitting, probably because it overemphasizes pixel similarity and makes the reconstruction much harder for the translator; on the other hand, the LPIPS loss relaxes the translator, letting it focus on reconstructing meaningful information, and thus gives the best results. Putting the l1 loss directly on the style and content codes generates results similar to ii), but loses some diversity. Using the discriminator loss for reconstruction fails to produce valid results.

Fig. 14. Comparison results with different losses on FFHQ → Sketches. (i) Compute the l1 loss between the original images and the reconstructed images; (ii) compute the LPIPS loss between the original images and the reconstructed images; (iii) compute the l1 loss on the content codes of the content images and the reconstructed images, and the l1 loss on the style codes of the style images and the reconstructed images; and (iv) put the reconstructed image into the corresponding discriminator and use the adversarial loss as the reconstruction loss.

5) Choice of Reconstruction Method: As mentioned in Section III-B, we actually have three choices for reconstruction: i) reconstruct the source images only; ii) reconstruct the target images only; and iii) reconstruct both the source images and the target images. The results are shown in Fig. 15. We can observe that if we only reconstruct the target images, some edges of the face will appear blank, while only reconstructing the source images does not have this issue. Both of them have overfitting issues and cannot fully capture the facial expression details. On the other hand, by reconstructing both the source and target images, the face becomes more refined and more identical to the source images.
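The four loss variants can be summarized in one dispatch function. The sketch below reuses the illustrative `content_enc` / `style_enc` accessors from the earlier module sketch and a perceptual `lpips_fn`; the code-matching scheme in variant iii) is simplified here, so treat this as an outline of the ablation rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F_nn

# Hypothetical dispatch over the four reconstruction-loss variants of Fig. 14.
def reconstruction_loss(variant, x, x_rec, translator=None, lpips_fn=None, D=None):
    if variant == "l1_image":        # i) pixel-wise l1 between image pairs
        return F_nn.l1_loss(x_rec, x)
    if variant == "lpips_image":     # ii) perceptual LPIPS distance (best in the ablation)
        return lpips_fn(x_rec, x).mean()
    if variant == "l1_codes":        # iii) l1 on content and style codes (simplified)
        return (F_nn.l1_loss(translator.content_enc(x_rec), translator.content_enc(x))
                + F_nn.l1_loss(translator.style_enc(x_rec), translator.style_enc(x)))
    if variant == "adversarial":     # iv) adversarial loss on the reconstruction
        return -torch.log(D(x_rec) + 1e-8).mean()
    raise ValueError(f"unknown variant: {variant}")
```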
TABLE V: Training memory and time on the FFHQ → Sketches adaptation.

6) Computational Efficiency: As shown in Fig. 16, our method converges slightly slower than CDC. Specifically, we conducted experiments on NVIDIA A6000 GPUs and recorded the training memory and time for the FFHQ → Sketches adaptation, as shown in Table V.
Fig. 16. FID score over training iterations. Our method converges slightly slower than CDC.
These results demonstrate that our method does not significantly increase training time and memory usage compared to other methods.

V. CONCLUSION AND LIMITATION

In this article, we propose a novel content preservation method to address the image generation problem in extremely few-shot settings. We find that effective separation of content and style enables high content fidelity and robust style adaptation, resulting in high-quality and diverse image generation. Our method introduces a translation network to map between the source and target domains. With the help of the paired image reconstruction, the generative model can learn to preserve rich content context inherited from the source domain. Due to this flexible tradeoff strategy between style adaptation and content preservation, our approach can successfully generate diverse and realistic images under various source → target adaptation settings. In addition, we design a new balanced metric combining the FID score and the intracluster pairwise LPIPS distance to evaluate the performance of balancing the quality and diversity of generated images, which can serve as an alternative supplement to current metrics in few-shot generation scenarios.

A. Limitations

Despite the compelling results our method achieves, there are still some limitations. First, there is still some overfitting of certain details of the output. For example, in the FFHQ → Sketches adaptation, the generated sketch faces tend to show a few teeth even if the mouth is initially closed in the source image. In the FFHQ → FFHQ-babies adaptation, the hair color is a little lighter than in the corresponding source image. Besides, the adaptation should be conducted between similar domains for good results. If the two domains are too different, there will be little content in the source domain that is meaningful to the target domain.

Future research directions include exploring more robust techniques for content preservation and style transfer that do not rely heavily on domain similarity. For instance, integrating advanced multimodal models like LLaVA-OneVision [29] could offer a more sophisticated approach to managing style and content across diverse domains. These models have the potential to leverage both visual and linguistic information, enabling a finer-grained understanding of what elements should be preserved (content) and what should be adapted (style). By using these models, it may be possible to reduce overfitting and improve the generalization capabilities of the model across a wider range of domain pairs, even when the domains are significantly different.

REFERENCES

[1] W. Diao, F. Zhang, J. Sun, Y. Xing, K. Zhang, and L. Bruzzone, "ZeRGAN: Zero-reference GAN for fusion of multispectral and panchromatic images," IEEE Trans. Neural Netw. Learn. Syst., vol. 34, no. 11, pp. 8195–8209, Nov. 2023.
[2] M.-Y. Liu et al., "Few-shot unsupervised image-to-image translation," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 10551–10560.
[3] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2017, pp. 2223–2232.
[4] H. Tang, H. Liu, D. Xu, P. H. S. Torr, and N. Sebe, "AttentionGAN: Unpaired image-to-image translation using attention-guided generative adversarial networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 34, no. 4, pp. 1972–1987, Apr. 2023.
[5] F. Kong et al., "Unpaired artistic portrait style transfer via asymmetric double-stream GAN," IEEE Trans. Neural Netw. Learn. Syst., vol. 34, no. 9, pp. 5427–5439, Sep. 2023.
[6] C. Luo, Y. Zhu, L. Jin, Z. Li, and D. Peng, "SLOGAN: Handwriting style synthesis for arbitrary-length and out-of-vocabulary text," IEEE Trans. Neural Netw. Learn. Syst., vol. 34, no. 11, pp. 8503–8515, Nov. 2023.
[7] C. Huang et al., "Self-supervised attentive generative adversarial networks for video anomaly detection," IEEE Trans. Neural Netw. Learn. Syst., vol. 34, no. 11, pp. 9389–9403, Nov. 2023.
[8] A. Noguchi and T. Harada, "Image generation from small datasets via batch statistics adaptation," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 2750–2758.
[9] E. Robb, W.-S. Chu, A. Kumar, and J.-B. Huang, "Few-shot adaptation of generative adversarial networks," 2020, arXiv:2010.11943.
[10] M. Zhao, Y. Cong, and L. Carin, "On leveraging pretrained GANs for generation with limited data," in Proc. Int. Conf. Mach. Learn., 2020, pp. 11340–11351.
[11] Y. Wang et al., "MineGAN: Effective knowledge transfer from GANs to target domains with few images," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 9332–9341.
[12] U. Ojha et al., "Few-shot image generation via cross-domain correspondence," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2021, pp. 10743–10752.
[13] J. Xiao, L. Li, C. Wang, Z.-J. Zha, and Q. Huang, "Few shot generative model adaption via relaxed spatial structural alignment," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 11194–11203.
[14] J. Ho, A. Jain, and P. Abbeel, "Denoising diffusion probabilistic models," in Proc. Adv. Neural Inf. Process. Syst., vol. 33, 2020, pp. 6840–6851.
[15] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, "High-resolution image synthesis with latent diffusion models," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2022, pp. 10684–10695.
[16] N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, and K. Aberman, "DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 22500–22510.
[17] N. Kumari, B. Zhang, R. Zhang, E. Shechtman, and J.-Y. Zhu, "Multi-concept customization of text-to-image diffusion," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 1931–1941.
[18] S. Mo, M. Cho, and J. Shin, "Freeze the discriminator: A simple baseline for fine-tuning GANs," 2020, arXiv:2002.10964.
[19] X. Huang and S. Belongie, "Arbitrary style transfer in real-time with adaptive instance normalization," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 1501–1510.
[20] L. A. Gatys, A. S. Ecker, and M. Bethge, "Image style transfer using convolutional neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2414–2423.
[21] M.-Y. Liu, T. Breuel, and J. Kautz, "Unsupervised image-to-image translation networks," in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017, pp. 1–9.
[22] K. Saito, K. Saenko, and M.-Y. Liu, "COCO-FUNIT: Few-shot unsupervised image translation with a content conditioned style encoder," in Proc. 16th Eur. Conf. Comput. Vis., Glasgow, U.K.: Springer, Aug. 2020, pp. 382–398.
[23] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, "The unreasonable effectiveness of deep features as a perceptual metric," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 586–595.
[24] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, "Analyzing and improving the image quality of StyleGAN," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 8110–8119.
[25] T. Karras, S. Laine, and T. Aila, "A style-based generator architecture for generative adversarial networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 4401–4410.
[26] F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao, "LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop," 2015, arXiv:1506.03365.
[27] X. Wang and X. Tang, "Face photo-sketch synthesis and recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 11, pp. 1955–1967, Nov. 2008.
[28] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, "GANs trained by a two time-scale update rule converge to a local Nash equilibrium," in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017, pp. 1–12.
[29] B. Li et al., "LLaVA-OneVision: Easy visual task transfer," 2024, arXiv:2408.03326.

Xiaosheng He received the bachelor's degree from Shanghai Jiao Tong University, Shanghai, China, in 2021. He is currently pursuing the Ph.D. degree with the School of Computer Science and Engineering, Nanyang Technological University (NTU), Singapore. His research interests lie in the fields of machine learning and computer vision, with a focus on generative models and 3-D vision.

Fan Yang received the bachelor's degree from Nanjing University, Nanjing, China, in 2021. He is currently pursuing the Ph.D. degree with the School of Computer Science and Engineering, Nanyang Technological University, Singapore. His research interests lie in the fields of computer vision, 3-D deep learning, and 3-D generation.

Fayao Liu received the B.Eng. and M.Eng. degrees from the National University of Defense Technology, Changsha, China, in 2008 and 2010, respectively, and the Ph.D. degree in computer science from The University of Adelaide, Adelaide, SA, Australia, in December 2015. She is currently a Research Scientist with the Institute for Infocomm Research (I2R), A*STAR, Singapore. She mainly works on machine learning and computer vision problems, with particular interests in self-supervised learning, 3-D vision, and generative models.

Guosheng Lin received the Ph.D. degree from The University of Adelaide, Adelaide, SA, Australia, in 2014. He is currently an Associate Professor with the School of Computer Science and Engineering, Nanyang Technological University, Singapore. His research interests are generally in computer vision and machine learning, including scene understanding, 3-D vision, and generative learning.