
Few-Shot Image Generation via Style Adaptation and Content Preservation

Xiaosheng He, Fan Yang, Fayao Liu, and Guosheng Lin

Abstract— Training a generative model with limited data (e.g., 10 samples) is a very challenging task. Many works propose to fine-tune a pretrained GAN model. However, this can easily result in overfitting. In other words, they manage to adapt the style but fail to preserve the content, where style denotes the specific properties that define a domain while content denotes the domain-irrelevant information that represents diversity. Recent works try to maintain a predefined correspondence to preserve the content; however, the diversity is still not enough, and it may affect style adaptation. In this work, we propose a paired image reconstruction approach for content preservation. We introduce an image translation module into GAN transferring, where the module teaches the generator to separate style and content, and the generator provides training data to the translation module in return. Qualitative and quantitative experiments show that our method consistently surpasses the state-of-the-art methods in the few-shot setting.

Index Terms— Few-shot learning, generative model, model adaptation.

Fig. 1. Given a source model G_S trained on a large-scale dataset, we want to adapt it to a target domain with very few examples. The target model is expected to generate diverse images with the target style.

I. INTRODUCTION

Generative adversarial networks (GANs) learn to map a simple predefined distribution to a complex real image distribution. Despite their great success in many areas of computer vision, including image manipulation [1], image-to-image translation [2], [3], [4], [5], [6], and image compression [7], GANs require a large amount of training data and time to achieve high-quality images. Therefore, few-shot generative model adaptation has been proposed, which aims to transfer a pretrained source generative model to a target domain with extremely limited examples (e.g., ten images), as shown in Fig. 1. The practical importance of this task is twofold: 1) in some domains such as painting, it is very difficult to obtain enough data to meet the training requirements of GANs; and 2) a well-trained GAN holds a wealth of knowledge about images, which can be leveraged to train GANs in similar domains and significantly reduce the training time (from weeks to a few hours).

To achieve this, fine-tuning-based methods have been proposed, where people only fine-tune a part of the model parameters or train a few additional parameters [8], [9], [10], [11]. Most of these methods, however, still require hundreds of training images. When the target samples are limited to 10, they are prone to overfitting and fail to inherit the diversity from the source domain.

To address these issues, recent methods tried to constrain the transfer process based on some assumed correspondence between two images. Ojha et al. [12] proposed to preserve the differences of relative similarities between instances via a cross-domain correspondence (CDC) loss and a patch discriminator. Xiao et al. [13] proposed a relaxed spatial structural alignment (RSSA) method and tried to project the original latent space to a narrow subspace close to the target domain to accelerate training. These methods can generate diverse and realistic images with limited data. However, these predefined correspondence losses are either relatively weak in diversity preservation (CDC) or overemphasize diversity (RSSA) and sacrifice style adaptation, limiting the generation performance on the target domain (see Sections III and IV).

Diffusion-based methods [14], [15] have recently achieved remarkable success in various vision tasks. Custom diffusion methods [16], [17] fine-tune a pretrained diffusion model to generate personalized outputs based on user-provided prompts. However, Stable Diffusion and other diffusion-based visual models are trained on large-scale data, making fine-tuning difficult and prone to disrupting the original structure. When the reference samples are limited to 10, the success rate of custom diffusion generation methods, such as DreamBooth [16] and LoRA, is very low, with only about two to three images meeting the requirements out of ten attempts.

Received 27 November 2023; revised 13 June 2024 and 2 September 2024; accepted 1 October 2024. This work was supported in part by the Industry Alignment Fund-Industry Collaboration Projects (IAF-ICP) Funding Initiative [cash and in-kind contributions from the industry partner(s)] and in part by the Ministry of Education (MoE) Academic Research Fund (AcRF) Tier 1 under Grant RG14/22. (Corresponding author: Guosheng Lin.)
Xiaosheng He and Fan Yang are with S-Lab, Nanyang Technological University (NTU), Singapore 639798 (e-mail: [email protected]; [email protected]).
Fayao Liu is with the Agency for Science, Technology and Research (A*STAR), Singapore 138632 (e-mail: [email protected]).
Guosheng Lin is with the School of Computer Science and Engineering, NTU, Singapore 639798 (e-mail: [email protected]).
This article has supplementary downloadable material available at https://doi.org/10.1109/TNNLS.2024.3477467, provided by the authors.
Digital Object Identifier 10.1109/TNNLS.2024.3477467

We consider the few-shot image generation task as consisting of two parts: style adaptation and content preservation, where style denotes the specific properties that define a domain while content denotes the domain-irrelevant information that represents diversity. Since the target domain has very few samples, which can provide only style and very limited content, we need to effectively preserve the content of the source domain. Fine-tune-based methods such as FreezeD [18] focus little on this and thus end up overfitting.

In this work, we propose PIR, a paired image reconstruction method, to address few-shot generative model adaptation. We first make the assumption that, given the same latent code, the source model and the target model should generate a pair of images with the same content and different styles. As shown in Fig. 2, the correspondences preserved by CDC and RSSA can be seen as estimations of the real content. Motivated by previous work on style transfer [19] and image-to-image translation [2], we introduce an image translation module that learns to separate the style and content of an image during training (note that it is still impossible to train a suitable translator using only data from the source domain and ten examples from the target domain, as depicted in Section IV). In the training process, the source and the target model generate a pair of images with the same latent code, and we let the translation module reconstruct each of them with its own style and the other's content. This reconstruction thus encourages the adapted generator to inherit the diverse content from the source domain. Instead of applying predefined correspondence losses, the model dynamically balances style adaptation and content preservation to generate realistic and diverse outputs.

Fig. 2. What to preserve in different methods. (a) CDC: preserve the distances between instances. (b) RSSA: preserve the distances between pixels. (c) Our method: preserve content information learned from the translation module F. The comparison results are shown in Figs. 7 and 8.

A. Contributions

Our main contribution is a novel paired image reconstruction method to balance style adaptation and content preservation, transferring diversity from the source domain to the target domain. Qualitative and quantitative results show that our method produces the best results in a variety of settings.

II. RELATED WORK

A. Few-Shot Image Generation

Few-shot image generation aims to generate diverse and realistic images with limited training data. A popular way to do this is to adapt a source model, pretrained on sufficient data from the source domain, to the target domain with few training data. Due to the great fitting ability of the GAN model, the training can easily overfit to the training samples. To address this, many fine-tune-based methods [8], [9], [10], [11] have been proposed. However, most of them still need a relatively large amount of data (more than 100 images) and fail to produce high-quality images. Recently, Ojha et al. [12] and Xiao et al. [13] proposed to maintain a prior correspondence, such as the CDC loss and the correlation consistency loss in RSSA, to supervise training. However, CDC is relatively weak in diversity preservation, while RSSA sacrifices style adaptation. Despite their shortcomings, CDC and RSSA manage to avoid severe overfitting and achieve relatively good results in several source → target adaptations. This suggests that we can achieve better results by estimating content information in an appropriate way.

B. Image-to-Image Translation

An alternative perspective on our training process is that of image-to-image translation, where we aim to convert an image generated by the source model to the target domain. It is then natural to change only the style of the image while preserving its content. An intuitive way to do this is to directly apply arbitrary style transfer methods such as AdaIN [19]. However, the "style" in these methods is not defined by the source and target domains; it refers more to the low-level semantic information extracted by some pretrained classification network [20]. Therefore, as shown in Section IV, these methods are likely to fail to properly transfer the domain-relevant style.

Image-to-image translation focuses on converting images to another domain. However, these methods [3], [21] are not designed for few-shot settings and require a large amount of data from both the source and the target domain. Liu et al. [2] and Saito et al. [22] proposed frameworks that extract style and content information to address limited data from the target domain. However, the content and style extractors still need to be trained on sufficient labeled data (data with different class labels), which is not available in our case as the source domain is unlabeled.

C. Diffusion-Based Customized Generation

Text-to-image diffusion models [15] can generate diverse, high-fidelity images based on user-provided textual prompts. Recent research has extended these models to generate customized images: given only a few images of a subject, a pretrained text-to-image model is fine-tuned to learn to bind a unique identifier to that specific subject. Once the subject is embedded in the model's output domain, the unique identifier can be used to synthesize novel realistic images contextualized in different scenarios.

However, these methods are not suitable for our setting for the following reasons.
1) When using prompt-driven image editing methods, it is challenging to accurately describe the characteristics of the target domain.
2) Diffusion-based vision models such as Stable Diffusion are trained on large-scale data, which makes them difficult to fine-tune and easily destroys the original structure. When reference images are limited to 10, custom diffusion generation methods such as DreamBooth [16] and LoRA have relatively low success rates, with only about two to three images in ten attempts meeting the requirements.


Fig. 3. Overview of the proposed framework. Our approach uses a translation module F for content preservation and a discriminator D for style adaptation. F takes a pair of images (shown in the red box) generated with the same latent code as input and tries to reconstruct them. We then use the reconstruction loss L_rec to encourage these paired images to have the same content.

III. APPROACH

We are given a source generator G_S, which is trained on a large unlabeled dataset of the source domain. We want to use G_S and a few training samples (ten samples) to train a target generator G_T. The goal of our training is to transfer the rich content context of G_S to G_T while adapting the style context generated by G_T to the target domain. To begin with, we initialize the weights of G_T with those of G_S.

We assume that, in an ideal training process, with the same latent code z, images generated by G_T and G_S should share the same content. This is because it should be easier for the generator G_T to change only the style of an image than to make additional modifications. However, due to the limited training data, the target generator finds it hard to distinguish between content and style, which therefore leads to overfitting (see Section IV).

As a result, we introduce a translation module to help the model learn knowledge about style and content. To achieve this, we apply a paired image reconstruction procedure (see Fig. 3) to encourage content preservation. Every epoch of training contains three steps, where D, G_T, and F are trained separately.

The architecture of our translation module is described in Section III-A; the tradeoff between style adaptation and content preservation is discussed in Section III-B, where we also explain our content preservation approach. Finally, we explain our learning objective and the applied losses in Section III-C.


A. Image Translation Module

As shown in Fig. 4, our image translation module consists of a content encoder E_C, a style encoder E_S, and a decoder Dec. The content encoder E_C maps the input content image x to a content code z_x, and the style encoder E_S maps the input style image y to a style code z_y. The decoder Dec takes these two codes as input and generates an image that combines the corresponding content and style. Dec consists of a couple of adaptive instance normalization layers, which use z_y's mean and variance to normalize z_x; the convolutional layers then upscale it to the final output image.

Fig. 4. Architecture of the proposed translation module F. F consists of a style encoder E_S, a content encoder E_C, and a decoder Dec. To generate a translation output x′, F combines the style code z_y extracted from the input style image y with the content code z_x extracted from the input content image x.
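To make the structure above concrete, the following is a minimal PyTorch sketch of such a translation module: a content encoder, a style encoder, and a decoder whose first stage is an AdaIN operation driven by the style code. The channel widths, the number of layers, and the use of spatial style codes are illustrative assumptions and not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

def adain(content_feat, style_feat, eps=1e-5):
    # Adaptive instance normalization: strip the content feature's own
    # per-channel statistics, then apply the style feature's mean/std.
    b, c = content_feat.shape[:2]
    c_mean = content_feat.view(b, c, -1).mean(-1).view(b, c, 1, 1)
    c_std = content_feat.view(b, c, -1).std(-1).view(b, c, 1, 1) + eps
    s_mean = style_feat.view(b, c, -1).mean(-1).view(b, c, 1, 1)
    s_std = style_feat.view(b, c, -1).std(-1).view(b, c, 1, 1)
    return s_std * (content_feat - c_mean) / c_std + s_mean

class TranslationModule(nn.Module):
    """F(content image, style image) -> translated image (illustrative layout)."""

    def __init__(self, ch=64):
        super().__init__()
        def encoder():
            return nn.Sequential(
                nn.Conv2d(3, ch, 7, 1, 3), nn.ReLU(inplace=True),
                nn.Conv2d(ch, 2 * ch, 4, 2, 1), nn.ReLU(inplace=True),
                nn.Conv2d(2 * ch, 4 * ch, 4, 2, 1), nn.ReLU(inplace=True),
            )
        self.enc_content = encoder()   # E_C: image -> content code z_x
        self.enc_style = encoder()     # E_S: image -> style code z_y
        # Dec: AdaIN-normalized content code -> convolutional upsampling -> image.
        self.dec = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(4 * ch, 2 * ch, 3, 1, 1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(2 * ch, ch, 3, 1, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 3, 7, 1, 3), nn.Tanh(),
        )

    def forward(self, content_img, style_img):
        z_x = self.enc_content(content_img)
        z_y = self.enc_style(style_img)
        return self.dec(adain(z_x, z_y))

# Usage: keep the content of x, borrow the style statistics of y.
F = TranslationModule()
x = torch.randn(2, 3, 256, 256)
y = torch.randn(2, 3, 256, 256)
x_prime = F(x, y)
```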
B. Tradeoff Between Style Adaptation and Content Preservation

From another point of view, few-shot generative model adaptation can be seen as a tradeoff between style adaptation and content preservation. According to our definition, style represents a specific property of the target domain. In other words, images with the style of the target domain will be recognized as real images by the discriminator. Therefore, style adaptation is naturally accomplished by the discriminative loss, and we need to find a proper way to preserve content (diversity). Fine-tune-based methods such as FreezeD [18] focus little on this and thus end up overfitting.

On the other hand, CDC [12] and RSSA [13] try to preserve content by maintaining a correspondence between two images, as shown in Fig. 2. The authors used these correspondences to estimate the true content. The drawback of this is that the quality of this estimation varies with different kinds of source → target adaptations. As depicted in Fig. 5, we find that CDC is weak in content preservation, as it captures relatively little content context from the source domain. In contrast, RSSA overemphasizes content preservation and is prone to preserving a lot of related information that may not be desired for certain target domains (e.g., when we transfer from FFHQ to sketches, some of the outputs will have color). Moreover, the tradeoff between content preservation and style adaptation is very dynamic; as demonstrated in Section IV, when we transfer FFHQ to babies, neither CDC nor RSSA can preserve diverse mouth poses. We will further discuss this with qualitative and quantitative results in Section IV-A.

Fig. 5. Schematic of the tradeoff. The horizontal axis represents the style adaptation level (image quality), and the vertical axis represents content preservation (image diversity). The upper left corner is the output of the source model, and the lower right corner is the training example of the target domain.

Despite their drawbacks, it is worth noting that in several source → target adaptations, CDC and RSSA have achieved relatively good results. This suggests that there are different proper tradeoff points for different adaptations, which can be reached by estimating the content information in a suitable way. When the estimated content information is appropriate for the given source → target adaptation, CDC or RSSA can achieve good results by balancing between content preservation and style adaptation.

Accordingly, we need a dynamic strategy to find suitable tradeoffs for different target domains. Instead of using predefined correspondence losses, we introduce a translation module to learn to separate content and style. Specifically, we propose to apply paired image reconstruction to preserve the content information during the adaptation process, as shown in Fig. 3. Given a latent code z, G_T and G_S generate a pair of images x and y from different domains. We use the translation module F to reconstruct y from the content of x and the style of y. When F is frozen, the reconstruction loss encourages x to have the same content as y, for if there were any modification to the content of x, the translation module would fail to recover the original content of y. Similarly, by symmetry, we can let F take the content of y and the style of x to reconstruct x.

During our experiments, we find that the l1 loss takes every pixel equally into account, which makes reconstruction harder. On the other hand, the LPIPS loss [23] measures the deep features of two images, which makes the training much faster and more stable. L_rec is then given by

L_rec = E_{z~p(z)} [ LPIPS(F(G_T(z), G_S(z)), G_S(z)) + LPIPS(F(G_S(z), G_T(z)), G_T(z)) ].   (1)

The tradeoff is fundamentally guided by the interaction between the adversarial loss and the reconstruction loss. The adversarial loss ensures that the generated images conform to the stylistic properties of the target domain, as recognized by the discriminator. Simultaneously, the reconstruction loss is employed to preserve the structural and semantic content of the source image during the translation process.

The balance between these two losses is dynamically adjusted during training because the reconstruction task varies depending on the specific adaptation being performed. This dynamic interplay between the losses allows the model to adapt to various domain pairs effectively, finding an appropriate balance between style adaptation and content preservation for each specific task.
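As a sketch of how (1) can be implemented, the snippet below uses the publicly available lpips package as the perceptual metric and treats G_S, G_T, and F as callables; the AlexNet backbone and the helper name are our assumptions, not the authors' released code.

```python
import torch
import lpips  # perceptual metric of Zhang et al. [23]; pip install lpips

lpips_fn = lpips.LPIPS(net="alex")  # backbone choice is an assumption

def paired_reconstruction_loss(G_S, G_T, F, z):
    # Eq. (1): rebuild each image of the pair from the other's content and its
    # own style, and penalize the LPIPS distance to the original image.
    with torch.no_grad():
        y = G_S(z)      # source-domain image (G_S stays frozen throughout)
    x = G_T(z)          # target-domain image generated from the same latent code

    rec_y = F(x, y)     # content of x + style of y, should reproduce y
    rec_x = F(y, x)     # content of y + style of x, should reproduce x
    return lpips_fn(rec_y, y).mean() + lpips_fn(rec_x, x).mean()
```

When this loss is used to update G_T, the translator F is kept frozen, so any drift in the content of G_T(z) makes the reconstruction fail and is therefore penalized.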


C. Learning

We train the whole framework by solving the following optimization objective:

min_{G_T, F} max_D  L_adv(D, G_T) + λ_1 L_rec(G_T) + λ_2 L'_rec(F)   (2)

where L_adv refers to the GAN loss, and L_rec and L'_rec refer to the reconstruction losses used for the target generator G_T and the translator F, respectively. The GAN loss is an adversarial loss given by

L_G = -E_{z~p(z)} [ log D(G_T(z)) ],
L_D = E_{x~D_t} [ log(1 - D(x)) ] + E_{z~p(z)} [ log D(G_T(z)) ].   (3)

For L_D, we follow the idea of [12] and use a combination of image-wise and patch-wise discriminator losses.

The reconstruction loss L'_rec is designed to train the translator F. Similar to the previous step explained in Section III-B, G_T and G_S can generate a pair of images with the same content and different styles. We freeze G_T to let F learn to reconstruct one image with the other's content and its own style.

In order to ensure that the style and content of an image cover all of its information, we also let F perform self-reconstruction, where the input content image and style image are the same. L'_rec is then given by

L'_rec = E_{z~p(z)} [ LPIPS(F(G_T(z), G_S(z)), G_S(z)) + LPIPS(F(G_S(z), G_T(z)), G_T(z)) + LPIPS(F(G_S(z), G_S(z)), G_S(z)) + LPIPS(F(G_T(z), G_T(z)), G_T(z)) ].   (4)

Fig. 6. Learning process in the view of the domain. D_i is the corresponding domain to G_T in each iteration.

Fig. 6 intuitively shows the learning process from the domain perspective. Let D_i be the domain corresponding to G_T in the ith iteration. Our training goal can be regarded as transforming D_i from the source domain to the target domain. During the training process, the adversarial loss pushes D_i toward the target domain. If we do not add any other loss, D_i will tend to collapse into a domain space consisting of the 10 target samples. The translation module F can be regarded as the corresponding mapping between D_i and the source domain, and maintaining this mapping helps to maintain the shape of D_i. Therefore, every iteration of the learning process of the model can be regarded as consisting of two steps.

1) Maintain the mapping of D_i to the source domain (freeze G_T, train F).
2) Push D_i toward the target domain through the GAN loss, and maintain the shape of D_i through the reconstruction loss (freeze F, train G_T).

We find that when the translation module is sufficiently trained in every iteration, the training process is fairly stable. For most adaptations, we train the generator and the discriminator once, and the translation module four times, in each iteration of training. Additional training details can be found in the supplementary material.
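The schedule described above can be written as a single training iteration as follows. This sketch reuses lpips_fn and paired_reconstruction_loss from the earlier snippets and assumes helper functions adv_loss_d and adv_loss_g that implement the image-wise plus patch-wise adversarial terms of (3); the four translator updates per iteration follow the text, while everything else (optimizers, batch size, latent dimension, loss weights) is an illustrative choice.

```python
import torch

def translator_reconstruction_loss(G_S, G_T, F, z):
    # Eq. (4): cross-reconstruction plus self-reconstruction, used to train F only.
    with torch.no_grad():       # both generators are fixed while F is updated
        y = G_S(z)
        x = G_T(z)
    return (lpips_fn(F(x, y), y).mean() + lpips_fn(F(y, x), x).mean()
            + lpips_fn(F(y, y), y).mean() + lpips_fn(F(x, x), x).mean())

def train_one_iteration(G_S, G_T, F, D, real_batch, opt_d, opt_g, opt_f,
                        z_dim=512, batch=4, lambda1=1.0, lambda2=1.0, f_steps=4):
    device = real_batch.device

    # Step 1: update the discriminator D (style adaptation), Eq. (3).
    z = torch.randn(batch, z_dim, device=device)
    d_loss = adv_loss_d(D, real_batch, G_T(z).detach())
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Step 2: update G_T with F frozen. The adversarial term pushes the style
    # toward the target domain; the paired reconstruction term (1) keeps the
    # content aligned with G_S, Eq. (2).
    F.requires_grad_(False)
    z = torch.randn(batch, z_dim, device=device)
    g_loss = adv_loss_g(D, G_T(z)) + lambda1 * paired_reconstruction_loss(G_S, G_T, F, z)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    F.requires_grad_(True)

    # Step 3: update F with G_T frozen, several times per iteration so that the
    # translator keeps tracking the moving domain D_i.
    G_T.requires_grad_(False)
    for _ in range(f_steps):
        z = torch.randn(batch, z_dim, device=device)
        f_loss = lambda2 * translator_reconstruction_loss(G_S, G_T, F, z)
        opt_f.zero_grad()
        f_loss.backward()
        opt_f.step()
    G_T.requires_grad_(True)
```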
IV. EXPERIMENT

In this section, we discuss the effectiveness of our approach in several few-shot settings. We compare our method qualitatively and quantitatively with the following few-shot image generation baselines: FreezeD [18], MineGAN [11], CDC [12], and RSSA [13]. We use StyleGANv2 [24] models pretrained on four different datasets: 1) Flickr-Faces-HQ (FFHQ) [25]; 2) LSUN Churches [26]; 3) LSUN Cars [26]; and 4) LSUN Cats [26].

We adapt the source GAN models to various target domains, including: 1) face sketches [27]; 2) FFHQ-babies [25]; 3) haunted houses [12]; 4) LSUN spaniels [26]; 5) village paintings by Van Gogh [12]; 6) wrecked/abandoned cars [12]; and 7) Raphael paintings [12].

A. Performance Evaluation

1) Qualitative Comparison: Fig. 7 shows the results of different methods on two transfer settings. We can observe that direct style transfer (AdaIN [19]) fails the transfer task, for it only replaces the low-level semantic information of the output images and fails to capture the real domain-relevant features. Direct image-to-image translation (FUNIT [2]) is also prone to collapse in these settings. Without a labeled source domain that contains enough source classes, the translator finds it very hard to separate style and content information between the source domain and a very small target domain. As for fine-tune-based few-shot GAN adaptation methods (FreezeD [18] and MineGAN [11]), the output images strongly overfit to the reference images of the target domain. Though these methods perform well with hundreds of training samples, they are ineffective in handling extremely few-shot settings.

On the other hand, CDC [12] and RSSA [13] manage to generate diverse and realistic images compared to the other baselines. However, there are still some issues with their correspondence losses. For CDC, we can see that the generated sketch images are not quite identical to the source images. This indicates that CDC may lose some diversity of facial expressions when transferred to the target domain. On the contrary, RSSA can preserve visual attributes well in sketches, but sometimes overly restricts the output. This over-restriction can even result in some output images having colors, which should not be the case for sketches. Besides, RSSA does not handle light and shadow very well, resulting in unnatural spots on the face.


Moreover, despite the strong effect RSSA has achieved in the FFHQ → Sketches adaptation, it cannot provide enough constraints in the FFHQ → FFHQ-babies adaptation, where neither CDC nor RSSA can preserve diverse mouth poses. They can even produce some strange distortions in such settings. This indicates that the assumed correspondence losses cannot properly balance style adaptation and content preservation.

Fig. 7. Comparison results with different baselines on FFHQ → Sketches and FFHQ → FFHQ-babies. For AdaIN, we randomly choose a real image as the style image. For the other GAN-based methods, we keep the latent code the same (across columns). Our method generates results of higher quality and diversity, which better correspond to the source domain images.

Our method achieves the best results in these two adaptations. As depicted in Fig. 7, in the FFHQ → Sketches adaptation, our output images better capture the diverse facial expressions without bringing in color and better address the light and shadow issues; in the FFHQ → FFHQ-babies adaptation, our method preserves refined details such as the mouth poses and does not output unnatural distortions.

We further compare our method with CDC and RSSA on the LSUN Cats → LSUN Spaniels and LSUN Churches → Van Gogh Houses adaptations, as shown in Fig. 8. We can see that RSSA fails to address the adaptation between two relatively distant domains such as LSUN Cats → LSUN Spaniels. Even between a more related pair of domains such as LSUN Churches → Van Gogh Houses, there still exists undesired distortion in RSSA's outputs. On the other hand, CDC can generate acceptable results in these two settings. Our method outperforms both of them, indicating that our method can properly balance style adaptation and content preservation on various source → target adaptations.

Fig. 8. Comparison results with CDC and RSSA on LSUN Cats → LSUN Spaniels and LSUN Churches → Van Gogh Houses.

In Fig. 9, we show more results with different source → target adaptation settings. Supervised by the paired image reconstruction loss, the generator can produce images that successfully preserve diversity from the source domain while the style fits the target domain.

Fig. 9. Results of different adaptation settings: (i) FFHQ → Raphael Paintings; (ii) LSUN Churches → Haunted Houses; and (iii) LSUN Cars → Wrecked Cars.


2) Quantitative Comparison: We use three datasets with abundant data that meet the requirements of evaluation: the original Sketches, FFHQ-babies, and LSUN-spaniels datasets, which contain 289, 2492, and 188 images, respectively. We let the models trained with the different methods generate 5000 samples, which are used to calculate the FID score [28] and the intracluster pairwise LPIPS distance [12] for each method. The FID score is computed as the Fréchet distance between the mean and covariance of the feature representations of real and generated images in the Inception feature space:

FID = ||μ_r - μ_g||^2 + Tr(Σ_r + Σ_g - 2(Σ_r Σ_g)^{1/2}).   (5)

The lower the FID score, the better the quality of the generated images, as it indicates a smaller distance between the distributions of real and generated images in the feature space.

TABLE I. FID scores (↓) for domains with abundant data. Our method achieves the lowest FID score, demonstrating its superiority in adapting the style to the target domain. In contrast, RSSA shows a very high FID score, indicating that it is weak in terms of style adaptation.

Table I shows the FID scores of the different methods. Our method achieves the best FID score on the three datasets, indicating that it generates images that best model the true distribution of the target domain. However, FID does not directly address the diversity of the generated images or the potential issue of overfitting to the training data. Therefore, we use it as a measure of generation quality, i.e., a measure of style adaptation.
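For completeness, (5) can be evaluated from the Inception feature statistics with a few lines of NumPy/SciPy; the sketch below assumes the feature matrices for the real and generated sets have already been extracted with a standard pretrained Inception network.

```python
import numpy as np
from scipy import linalg

def fid_from_features(feats_real, feats_gen, eps=1e-6):
    # Eq. (5): Frechet distance between Gaussians fitted to Inception features.
    # feats_real, feats_gen: arrays of shape (num_images, feature_dim).
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)

    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if not np.isfinite(covmean).all():
        # Add a small ridge if the product is numerically singular.
        offset = np.eye(sigma_r.shape[0]) * eps
        covmean, _ = linalg.sqrtm((sigma_r + offset) @ (sigma_g + offset), disp=False)
    covmean = covmean.real  # drop tiny imaginary parts from numerical error

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```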
We use the intracluster pairwise LPIPS distance [12] to measure the diversity level. To capture this, we assign each of the 5000 generated images to one of the k training images by using the lowest LPIPS distance. We then compute the average pairwise LPIPS distance within members of the same cluster and then average over the k clusters. A method that exactly reproduces the original training images will have a score of zero by this metric. Higher LPIPS distances between generated images suggest more diversity and distinctiveness among the generated samples. Therefore, we use it as a measure of generation diversity, i.e., a measure of content preservation.
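A straightforward (if unoptimized) implementation of this diversity metric is sketched below, again using the lpips package; images are assumed to be tensors in [-1, 1], and the brute-force pairwise loop is kept for clarity rather than speed.

```python
import torch
import lpips

lpips_fn = lpips.LPIPS(net="alex")

@torch.no_grad()
def intracluster_pairwise_lpips(gen_images, train_images):
    # Assign each generated image to its nearest training image (lowest LPIPS),
    # average the pairwise distances inside each cluster, then average over the
    # k clusters. Exact memorization of the training set would score zero.
    k = len(train_images)
    assign = []
    for g in gen_images:
        d = torch.cat([lpips_fn(g[None], t[None]).flatten() for t in train_images])
        assign.append(int(d.argmin()))

    cluster_means = []
    for c in range(k):
        members = [gen_images[i] for i in range(len(gen_images)) if assign[i] == c]
        if len(members) < 2:
            continue  # a cluster with fewer than two members has no pairwise term
        dists = [lpips_fn(members[i][None], members[j][None]).flatten()
                 for i in range(len(members))
                 for j in range(i + 1, len(members))]
        cluster_means.append(torch.cat(dists).mean())
    return torch.stack(cluster_means).mean()
```

Together with the FID value from the previous sketch, this distance (LD) is also what enters the balance index defined later in (6), i.e., 100 · LD / FID.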
TABLE II. Intracluster pairwise LPIPS distance (↑). CDC shows a relatively lower distance (less diversity), suggesting its weakness in content preservation.

As shown in Table II, our method consistently achieves a higher average LPIPS distance than CDC. On the other hand, although RSSA achieves a higher LPIPS distance in the FFHQ → Sketches and Cats → Spaniels adaptations, it gets very high FID scores in these adaptations (meaning the learned distribution is very different from the target domain). This is consistent with our previous analysis: CDC is relatively weak in content preservation and fails to generate more diversity (it has the lowest LPIPS distance), while RSSA overemphasizes diversity and fails to generate a distribution similar to the target domain (it has a very high FID score).


To assess the combined performance of the model in terms of diversity and quality more clearly, we propose a balance metric incorporating the FID score (FID) and the LPIPS distance (LD) as follows:

balance = 100 · LD / FID.   (6)

A higher score means better performance in balancing quality and diversity. The comparison results are shown in Table III; we can see that our method outperforms CDC and RSSA.

TABLE III. Balance index (↑) for quality and diversity.

B. Ablation Study

1) Effect of Target Dataset Size: We further explore the effectiveness of our method compared to CDC and RSSA in one-shot and five-shot settings.

Fig. 10. Comparison results with CDC and RSSA on FFHQ → Sketches in one-shot and five-shot settings.

Fig. 10 shows the results on the FFHQ → Sketches adaptation. For the one-shot setting, we can observe that RSSA is so strong at content preservation that, with only one example, the outputs can still resemble the corresponding images from the source domain. However, it falls short in style adaptation, as it introduces color that is not consistent with the sketch style. On the other hand, CDC and our method output similar faces with different poses.


Our method can generate more diverse poses than CDC, such as the poses of the mouth. The evaluation of one-shot generation is rather vague, for it is very hard to tell what is style and what is content. The results of the five-shot setting are similar to those of the ten-shot setting. We can see that the results of CDC are not very identical to the corresponding images from the source domain, indicating its shortcomings in content preservation. On the other hand, RSSA can generate outputs resembling the original images but brings in some undesired color, suggesting its weakness in style adaptation. Our method can generate diverse outputs without bringing in any color.

Fig. 11. Comparison results with CDC and RSSA on LSUN Churches → Van Gogh Houses in one-shot and five-shot settings.

Fig. 11 shows the results on the LSUN Churches → Van Gogh Houses adaptation. For the one-shot setting, we can observe that our method and RSSA can generate churches with different shapes, while CDC can only output similar churches. For the five-shot setting, we can see that both CDC and RSSA produce unnatural distortions, while our method can generate churches similar to the original images and successfully adapts the style to Van Gogh paintings.

In addition, we explored the range of dataset sizes suitable for the content preservation loss on the FFHQ → Sketches adaptation. As depicted in Table IV, when the number of samples in the target domain is fifty or more, CDC, RSSA, and our method perform worse than directly fine-tuning StyleGAN using only the discriminative loss, which tells us that the content preservation loss is only applicable in few-shot generation scenarios of a certain size.

TABLE IV. FID scores on different shots on the FFHQ → Sketches adaptation. The first row refers to fine-tuning StyleGAN using only the discriminative loss.

2) Translation Module: We examine the output of the translation module to evaluate its function. As shown in Fig. 12, after training, the translation module is capable of separating style and content and translating images from the source domain to the target domain.

Fig. 12. Results of the translation module after training. The module takes the first row as the content image and the second row as the style image. The outputs show that, after training, the translation module is capable of separating style and content and translating images from the source domain to the target domain.

3) Architecture Decisions of the Translation Module: It is possible to use a generic image-to-image translation module for reconstruction, i.e., let the translation module learn the target style and only feed it the content image. However, this will increase its training difficulty, since the translation module needs to quickly establish a mapping between the current domain and the source domain in each iteration.


As shown in Fig. 13, the module failed to properly separate style and content, resulting in bad generation performance.

Fig. 13. A generic image-to-image translation module leads to bad results.

4) Choice of Reconstruction Loss: We explore four kinds of reconstruction loss in our experiments: i) compute the l1 loss between the original images and the reconstructed images; ii) compute the LPIPS loss between the original images and the reconstructed images; iii) compute the l1 loss on the content codes and the style codes; and iv) put the reconstructed images into the corresponding discriminator and use the adversarial loss as the reconstruction loss. The results are shown in Fig. 14. We can see that putting the l1 loss on the two images results in overfitting, probably because it overemphasizes pixel similarity and makes the reconstruction much harder for the translator; on the other hand, the LPIPS loss relaxes the translator, lets it focus on reconstructing meaningful information, and thus gives the best results. Putting the l1 loss directly on the style and content codes generates results similar to ii), but loses some diversity. Using the discriminator loss for reconstruction fails to produce valid results.

Fig. 14. Comparison results with different losses on FFHQ → Sketches. (i) Compute the l1 loss of the original images and reconstructed images; (ii) compute the LPIPS loss of the original images and reconstructed images; (iii) compute the l1 loss on the content codes of the content images and reconstructed images, and the l1 loss on the style codes of the style images and reconstructed images; and (iv) put the reconstructed image into the corresponding discriminator and use the adversarial loss as the reconstruction loss.

5) Choice of Reconstruction Method: As mentioned in Section III-B, we actually have three choices for reconstruction: i) reconstruct the source images only; ii) reconstruct the target images only; and iii) reconstruct both the source images and the target images. The results are shown in Fig. 15. We can observe that if we only reconstruct the target images, some edges of the face will appear blank, while only reconstructing the source images does not have this issue. Both of them have overfitting issues and cannot fully capture the facial expression details. On the other hand, by reconstructing both the source and the target images, the face becomes more refined and more identical to the source images.

Fig. 15. Comparison results with different reconstruction choices on FFHQ → Sketches. (i) Reconstruct the source image only; (ii) reconstruct the target image only; and (iii) reconstruct both the source image and the target image.

6) Computational Efficiency: As shown in Fig. 16, our method converges slightly slower than CDC. Specifically, we conducted experiments on NVIDIA A6000 GPUs and recorded the training memory and time for the FFHQ → Sketches adaptation, as shown in Table V. These results demonstrate that our method does not significantly increase training time and memory usage compared to the other methods.

Fig. 16. FID score over training iterations. Our method converges slightly slower than CDC.

TABLE V. Training memory and time on the FFHQ → Sketches adaptation.


V. CONCLUSION AND LIMITATION

In this article, we propose a novel content preservation method to address the image generation problem in extremely few-shot settings. We find that effective separation of content and style enables high content fidelity and robust style adaptation, resulting in high-quality and diverse image generation. Our method introduces a translation network to map between the source and target domains. With the help of the paired image reconstruction, the generative model can learn to preserve the rich content context inherited from the source domain. Due to this flexible tradeoff strategy between style adaptation and content preservation, our approach can successfully generate diverse and realistic images under various source → target adaptation settings. In addition, we design a new balance metric combining the FID score and the intracluster pairwise LPIPS distance to evaluate the performance of balancing the quality and diversity of generated images, which can serve as an alternative supplement to current metrics in few-shot generation scenarios.

A. Limitations

Despite the compelling results our method achieves, there are still some limitations. First, there is still some overfitting of certain details of the output. For example, in the FFHQ → Sketches adaptation, the generated sketch faces tend to show a few teeth even if the mouth is initially closed in the source image. In the FFHQ → FFHQ-babies adaptation, the hair color is a little lighter than in the corresponding source image. Besides, the adaptation should be conducted between similar domains for good results. If the two domains are too different, there will be little content in the source domain that is meaningful to the target domain.

Future research directions include exploring more robust techniques for content preservation and style transfer that do not rely heavily on domain similarity. For instance, integrating advanced multimodal models like LLaVA-OneVision [29] could offer a more sophisticated approach to managing style and content across diverse domains. These models have the potential to leverage both visual and linguistic information, enabling a finer-grained understanding of what elements should be preserved (content) and what should be adapted (style). By using these models, it may be possible to reduce overfitting and improve the generalization capabilities of the model across a wider range of domain pairs, even when the domains are significantly different.

REFERENCES

[1] W. Diao, F. Zhang, J. Sun, Y. Xing, K. Zhang, and L. Bruzzone, "ZeRGAN: Zero-reference GAN for fusion of multispectral and panchromatic images," IEEE Trans. Neural Netw. Learn. Syst., vol. 34, no. 11, pp. 8195–8209, Nov. 2023.
[2] M.-Y. Liu et al., "Few-shot unsupervised image-to-image translation," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 10551–10560.
[3] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2223–2232.
[4] H. Tang, H. Liu, D. Xu, P. H. S. Torr, and N. Sebe, "AttentionGAN: Unpaired image-to-image translation using attention-guided generative adversarial networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 34, no. 4, pp. 1972–1987, Apr. 2023.
[5] F. Kong et al., "Unpaired artistic portrait style transfer via asymmetric double-stream GAN," IEEE Trans. Neural Netw. Learn. Syst., vol. 34, no. 9, pp. 5427–5439, Sep. 2023.
[6] C. Luo, Y. Zhu, L. Jin, Z. Li, and D. Peng, "SLOGAN: Handwriting style synthesis for arbitrary-length and out-of-vocabulary text," IEEE Trans. Neural Netw. Learn. Syst., vol. 34, no. 11, pp. 8503–8515, Nov. 2023.
[7] C. Huang et al., "Self-supervised attentive generative adversarial networks for video anomaly detection," IEEE Trans. Neural Netw. Learn. Syst., vol. 34, no. 11, pp. 9389–9403, Nov. 2023.
[8] A. Noguchi and T. Harada, "Image generation from small datasets via batch statistics adaptation," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 2750–2758.
[9] E. Robb, W.-S. Chu, A. Kumar, and J.-B. Huang, "Few-shot adaptation of generative adversarial networks," 2020, arXiv:2010.11943.
[10] M. Zhao, Y. Cong, and L. Carin, "On leveraging pretrained GANs for generation with limited data," in Proc. Int. Conf. Mach. Learn., 2020, pp. 11340–11351.
[11] Y. Wang et al., "MineGAN: Effective knowledge transfer from GANs to target domains with few images," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 9332–9341.
[12] U. Ojha et al., "Few-shot image generation via cross-domain correspondence," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 10743–10752.
[13] J. Xiao, L. Li, C. Wang, Z.-J. Zha, and Q. Huang, "Few shot generative model adaption via relaxed spatial structural alignment," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 11194–11203.
[14] J. Ho, A. Jain, and P. Abbeel, "Denoising diffusion probabilistic models," in Proc. Adv. Neural Inf. Process. Syst., vol. 33, 2020, pp. 6840–6851.
[15] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, "High-resolution image synthesis with latent diffusion models," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 10684–10695.


[16] N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, and K. Aberman, "DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 22500–22510.
[17] N. Kumari, B. Zhang, R. Zhang, E. Shechtman, and J.-Y. Zhu, "Multi-concept customization of text-to-image diffusion," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 1931–1941.
[18] S. Mo, M. Cho, and J. Shin, "Freeze the discriminator: A simple baseline for fine-tuning GANs," 2020, arXiv:2002.10964.
[19] X. Huang and S. Belongie, "Arbitrary style transfer in real-time with adaptive instance normalization," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 1501–1510.
[20] L. A. Gatys, A. S. Ecker, and M. Bethge, "Image style transfer using convolutional neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2414–2423.
[21] M.-Y. Liu, T. Breuel, and J. Kautz, "Unsupervised image-to-image translation networks," in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017, pp. 1–9.
[22] K. Saito, K. Saenko, and M.-Y. Liu, "COCO-FUNIT: Few-shot unsupervised image translation with a content conditioned style encoder," in Proc. 16th Eur. Conf. Comput. Vis. (ECCV), Glasgow, U.K.: Springer, Aug. 2020, pp. 382–398.
[23] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, "The unreasonable effectiveness of deep features as a perceptual metric," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 586–595.
[24] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, "Analyzing and improving the image quality of StyleGAN," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 8110–8119.
[25] T. Karras, S. Laine, and T. Aila, "A style-based generator architecture for generative adversarial networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 4401–4410.
[26] F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao, "LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop," 2015, arXiv:1506.03365.
[27] X. Wang and X. Tang, "Face photo-sketch synthesis and recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 11, pp. 1955–1967, Nov. 2008.
[28] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, "GANs trained by a two time-scale update rule converge to a local Nash equilibrium," in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017, pp. 1–12.
[29] B. Li et al., "LLaVA-OneVision: Easy visual task transfer," 2024, arXiv:2408.03326.

Xiaosheng He received the bachelor's degree from Shanghai Jiao Tong University, Shanghai, China, in 2021. He is currently pursuing the Ph.D. degree with the School of Computer Science and Engineering, Nanyang Technological University (NTU), Singapore. His research interests lie in the fields of machine learning and computer vision, with a focus on generative models and 3-D vision.

Fan Yang received the bachelor's degree from Nanjing University, Nanjing, China, in 2021. He is currently pursuing the Ph.D. degree with the School of Computer Science and Engineering, Nanyang Technological University, Singapore. His research interests lie in the fields of computer vision, 3-D deep learning, and 3-D generation.

Fayao Liu received the B.Eng. and M.Eng. degrees from the National University of Defense Technology, Changsha, China, in 2008 and 2010, respectively, and the Ph.D. degree in computer science from The University of Adelaide, Adelaide, SA, Australia, in December 2015. She is currently a Research Scientist with the Institute for Infocomm Research (I2R), A*STAR, Singapore. She mainly works on machine learning and computer vision problems, with particular interests in self-supervised learning, 3-D vision, and generative models.

Guosheng Lin received the Ph.D. degree from The University of Adelaide, Adelaide, SA, Australia, in 2014. He is currently an Associate Professor with the School of Computer Science and Engineering, Nanyang Technological University, Singapore. His research interests are generally in computer vision and machine learning, including scene understanding, 3-D vision, and generative learning.
