
Style-Based Global Appearance Flow for Virtual Try-On

Sen He, Yi-Zhe Song, Tao Xiang


Center for Vision, Speech and Signal Processing, University of Surrey
iFlyTek-Surrey Joint Research Centre on Artificial Intelligence
{sen.he,y.song,t.xiang}@surrey.ac.uk

[Figure 1: two groups of panels, each showing Person, Garment, Cloth-flow, PF-AFN, Ours.]

Figure 1. Our global appearance flow based try-on model has a clear advantage over existing local flow based SOTA methods such as Cloth-flow [13] and PF-AFN [10], especially when there are large mis-alignments between the reference and garment images (top row) and difficult poses/occlusions (bottom row).

Abstract

Image-based virtual try-on aims to fit an in-shop garment into a clothed person image. To achieve this, a key step is garment warping, which spatially aligns the target garment with the corresponding body parts in the person image. Prior methods typically adopt a local appearance flow estimation model. They are thus intrinsically susceptible to difficult body poses/occlusions and large mis-alignments between person and garment images (see Fig. 1). To overcome this limitation, a novel global appearance flow estimation model is proposed in this work. For the first time, a StyleGAN based architecture is adopted for appearance flow estimation. This enables us to take advantage of a global style vector to encode a whole-image context to cope with the aforementioned challenges. To guide the StyleGAN flow generator to pay more attention to local garment deformation, a flow refinement module is introduced to add local context. Experiment results on a popular virtual try-on benchmark show that our method achieves new state-of-the-art performance. It is particularly effective in an 'in-the-wild' application scenario where the reference image is full-body, resulting in a large mis-alignment with the garment image (Fig. 1 Top). Code is available at: https://github.com/SenHe/Flow-Style-VTON.

1. Introduction

The transition from offline in-shop retail to e-commerce has been accelerated by the recent pandemic-caused lockdowns. In 2020, retail e-commerce sales worldwide amounted to 4.28 trillion US dollars, and e-retail revenues are projected to grow to 5.4 trillion US dollars in 2022. However, when it comes to fashion, one of the key offline experiences missed by online shoppers is the changing room where a garment item can be tried on. To reduce the return cost for online retailers and give shoppers the same offline experience online, image-based virtual try-on (VTON) has been studied intensively recently [9, 10, 13, 14, 19, 24, 38, 39, 42, 43].

A VTON model aims to fit an in-shop garment into a person image. A key objective of a VTON model is to align the in-shop garment with the corresponding body parts in the person image. This is due to the fact that the in-shop garment is usually not spatially aligned with the person image (see Fig. 1). Without the spatial alignment, directly applying advanced detail-preserving image-to-image translation models [18, 30] to fuse the texture in the person image and the garment image will result in unrealistic effects in the generated try-on image, especially in the occluded and misaligned regions.

Previous methods address this alignment problem through garment warping, i.e., they first warp the in-shop garment, which is then concatenated with the person image and fed into an image-to-image translation model for the final try-on image generation. Many of them [9, 14, 19, 38, 42, 43] adopt a Thin Plate Spline (TPS) [7] based warping method, exploiting the correlation between features extracted from the person and garment images. However, as analyzed in previous works [5, 13, 42], TPS has limitations in handling complex warping, e.g., when different regions in the garment require different deformations. As a result, recent SOTA methods [10, 13] estimate dense appearance flow [45] to warp the garment. This involves training a network to predict the dense appearance flow field representing the deformation required to align the garment with the corresponding body parts.

However, existing appearance flow estimation methods are limited in accurate garment warping due to the lack of global context. More specifically, all existing methods are based on local feature correspondence, e.g., local feature concatenation or correlation¹, developed for optical flow estimation [6, 17]. To estimate the appearance flow, they make the unrealistic assumption that the corresponding regions from the person image and the in-shop garment are located in the same local receptive field of the feature extractor. When there is a large mis-alignment between the garment and the corresponding body parts (Fig. 1 Top), current appearance flow based methods deteriorate drastically and generate unsatisfactory results. Lacking a global context also makes existing flow-based VTON methods vulnerable to difficult poses/occlusions (Fig. 1 Bottom), where correspondences have to be searched beyond a local neighborhood. This severely limits the use of these methods 'in-the-wild', whereby a user may have a full-body picture of herself/himself as the person image to try on multiple garment items (e.g., top, bottom, and shoes).

¹ It is worth noting that the tensor correlation methods [6, 10, 17] have the potential to reach a global receptive field. However, their computation grows quadratically with respect to the input size. To make it tractable, the actual implementation is still based on limited local neighborhoods.

To overcome this limitation, a novel global appearance flow estimation model is proposed in this work. Specifically, for the first time, a StyleGAN [21, 22] architecture is adopted for dense appearance flow estimation. This differs fundamentally from existing methods [6, 10, 13, 17], which employ a U-Net [30] architecture to preserve local spatial context. Using a global style vector extracted from the whole reference and garment images makes it easy for our model to capture global context. However, it also raises an important question: can it capture the local spatial context crucial for local alignments? After all, a single style vector seemingly has lost local spatial context. To answer this question, we first note that StyleGAN has been successfully applied to local face image manipulation tasks, where different style vectors can generate the same face at different viewpoints [34] and with different shapes [15, 28]. This suggests that a global style vector does have local spatial context encoded. However, we also note that the vanilla StyleGAN architecture [21, 22], though much more robust against large mis-alignment and difficult poses/occlusions compared to U-Net, is weaker when it comes to local deformation modeling. We therefore introduce a local flow refinement module into the StyleGAN generator to have the better of both worlds.

Concretely, our StyleGAN-based warping module (W in Fig. 2) consists of stacked warping blocks that take as inputs a global style vector, garment features and person features. The global style vector is computed from the lowest resolution feature maps of the person image and the in-shop garment for global context modeling. In each warping block in the generator, the global style vector is used to modulate the feature channels of the corresponding garment feature map to estimate the appearance flow. To enable our flow estimator to model fine-grained local appearance flow, e.g., the arm and hand regions in Fig. 5, we introduce a refinement layer in each warping block on top of the style based appearance flow estimation part. This refinement layer first warps the garment feature map, which is subsequently concatenated with the person feature map at the same resolution and then used to predict the local detailed appearance flow.

The contributions of this work are as follows: (1) We propose a novel style-based appearance flow method to warp the garment in virtual try-on. This global flow estimation approach makes our VTON model much more robust against large mis-alignments between person and garment images. This makes our method more applicable to 'in-the-wild' applications where a full-body person image with natural poses is used (see Fig. 1). (2) We conduct extensive experiments to validate our method, demonstrating clearly that it is superior to existing state-of-the-art alternatives.

2. Related Work

Image based virtual try-on. Image based (2D) VTON can be categorized into parser-based methods and parser-free methods. Their main difference is whether an off-the-shelf human parser² is required in the inference stage.

² Sometimes, pre-trained pose [3] and DensePose [12] detection models are also used in a parser based model.

Parser-based methods apply a human segmentation map to mask the garment region in the input person image for warping parameter estimation. The masked person image is concatenated with the warped garment and then fed into a generator for target try-on image generation. Most methods [9, 13, 14, 38, 42, 43] apply a pre-trained human parser [11]
to parse the person image into several pre-defined semantic regions, e.g., head, top, and pants. For better try-on image generation, [42] also transforms the segmentation map to match the target garment. The transformed parsing result, together with the warped garment and the masked person image, is used for final try-on image generation. The reliance on a parser makes these methods sensitive to bad human parsing results [10, 19], which inevitably lead to inaccurate warping and try-on results.

In contrast, parser-free methods [10, 19] only take as inputs the person image and the garment image in the inference stage. They are designed specifically to eliminate the negative effects induced by bad parsing results. These methods usually first train a parser-based teacher model and then distill a parser-free student model. [19] proposed a pipeline which distills the garment warping module and try-on generation network using paired triplets. [10] further improved [19] by introducing cycle-consistency for better distillation. Our method is also a parser-free method. However, our method focuses on the design of the garment warping part, where we propose a novel global appearance flow based garment warping module.

3D virtual try-on. Compared to image based VTON, 3D VTON provides a better try-on experience (e.g., allowing the result to be viewed from arbitrary views and poses), yet is also more challenging. Most 3D VTON works [2, 27] rely on 3D parametric human body models [25] and need scanned 3D datasets for training. Collecting large scale 3D datasets is expensive and laborious, thus posing a constraint on the scalability of a 3D VTON model. To overcome this problem, recently [44] applied a non-parametric dual human depth model [8] for monocular to 3D VTON. However, existing 3D VTON methods still generate inferior texture details compared to the 2D methods.

StyleGAN for image manipulation. StyleGAN [21, 22] has revolutionized the research on image manipulation [28, 33, 41] lately. Its successful application to image manipulation tasks is largely thanks to its suitability for learning a highly disentangled latent space. Recent efforts have focused on unsupervised latent semantics discovery [4, 34, 37]. [24] applied a pose conditioned StyleGAN for virtual try-on. However, their model cannot preserve garment details and is slow during inference.

The design of our garment warping network is inspired by StyleGAN based image manipulation, especially its superior performance in shape deformation [28, 34]. Instead of using style modulation to generate the warped garment, we use style modulation to predict the implicit appearance flow, which is then used to warp the garment via sampling. This design is much better suited to preserving garment detail compared to [24].

Appearance flow. In the context of VTON, appearance flow was first introduced by [13]. Since then, it has gained more attention and been adopted by recent state-of-the-art VTON models [5, 10]. Fundamentally, appearance flow is used as a sampling grid for garment warping; it is thus information lossless and superior in detail preserving. Beyond VTON, appearance flow is also popular in other tasks. [45] applied it to novel view synthesis. [1, 29] also applied the idea of appearance flow to warp the feature map for person pose transfer. Different from all these existing appearance flow estimation methods, our method, via style modulation, applies a global style vector to estimate the appearance flow. Our method is thus intrinsically superior in its ability to cope with large mis-alignments.

[Figure 2: framework schematic. Legend: fully connected layer, modulated convolution (conv_m), convolution, concatenation, addition, upsampling (U), sampling (S).]

Figure 2. A schematic of our framework. The pre-trained parser based model F^PB generates an output image as the input of the parser-free model F. The two feature extractors in F extract the features of the person image and the garment image, respectively. A style vector is extracted from the lowest resolution feature maps of the person image and the garment image. The warping module takes in the style vector and the feature maps from the person image and garment image, and outputs an appearance flow map. The appearance flow is then used to warp the garment. Finally, the warped garment is concatenated with the person image and fed into the generator to generate the target try-on image. Note that F^PB is only used during training.

3. Methodology

3.1. Problem definition

Given a person image (p ∈ R^{3×H×W}) and an in-shop garment image (g ∈ R^{3×H×W}), the goal of virtual try-on is to generate a try-on image (t ∈ R^{3×H×W}) where the garment in g fits the corresponding parts in p. In addition, in the generated t, both the details of g and the non-garment regions of p should be preserved. In other words, the same person in p should appear unchanged in t, except now wearing g.

To eliminate the negative effect of inaccurate human parsing, our proposed model (F in Fig. 2) is designed to be a parser-free model. Following the strategy adopted by existing parser-free models [10, 19], we first pre-train a parser-based model (F^PB). It is then used as a teacher for knowledge distillation to help train the final parser-free model F. Both F and F^PB consist of three parts, i.e., two feature extractors (E_p^PB, E_g^PB in F^PB and E_p, E_g in F), a warping module (W^PB in F^PB and W in F), and a generator (G^PB in F^PB and G in F). Each of them will be detailed in the following sections.

3.2. Pre-training a parser-based model

As per standard in existing parser-free models [10, 19], a parser-based model F^PB is first trained. It is used in two ways in the subsequent training of the proposed parser-free model F: (a) to generate the person image (p) to be used by F as input, and (b) to supervise the training of F via knowledge distillation.

Concretely, F^PB takes as inputs the semantic representation (segmentation map³, keypoint pose and dense pose) of a real person image (p_gt ∈ R^{3×H×W}) in the training set and an unpaired garment (g_un ∈ R^{3×H×W}). The output of F^PB is the image p in which the original person is wearing g_un. p will serve as the input for F during training.

This design, according to [10], benefits from the fact that we now have a paired person image p_gt and garment image g (the garment worn in p_gt) with which to train the parser-free model F, that is:

\mathcal{F}^{*} = \underset{\mathcal{F}}{\text{arg min}} \lVert t - p_{gt} \rVert, \quad (1)

where t = F(p, g) is the generated try-on image from F. Note that F^PB is only used during the training of F.

³ The garment region in the segmentation map is flipped to the background region.
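To make the teacher-student setup concrete, the following is a minimal PyTorch sketch of the data flow in Secs. 3.1-3.2 (our illustration, not the released implementation): the frozen parser-based teacher F^PB synthesizes the student's input p from the semantic representation of p_gt and an unpaired garment g_un, and the parser-free student F is trained so that F(p, g) reconstructs p_gt as in Eq. (1). The module definitions, channel counts and the 20-channel semantic representation are hypothetical stand-ins.

```python
# Minimal sketch (not the authors' code) of the teacher-student data flow in
# Secs. 3.1-3.2. All modules and channel counts are hypothetical stand-ins.
import torch
import torch.nn as nn

H, W = 256, 192

class TinyModelStub(nn.Module):
    """Placeholder for F^PB / F; maps concatenated inputs to a 3xHxW image."""
    def __init__(self, in_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Tanh())
    def forward(self, *xs):
        return self.net(torch.cat(xs, dim=1))

teacher = TinyModelStub(in_ch=20 + 3).eval()  # semantic rep (assumed 20 ch) + g_un
student = TinyModelStub(in_ch=3 + 3)          # person image p + garment g

semantic_rep = torch.randn(1, 20, H, W)       # parsing + keypoint / dense pose of p_gt
g_un = torch.randn(1, 3, H, W)                # unpaired garment
p_gt = torch.randn(1, 3, H, W)                # real person image (ground truth)
g = torch.randn(1, 3, H, W)                   # garment actually worn in p_gt

with torch.no_grad():
    p = teacher(semantic_rep, g_un)           # synthetic input for the student

t = student(p, g)                             # try-on image
loss = (t - p_gt).abs().mean()                # Eq. (1): ||t - p_gt||
loss.backward()
```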
3.3. Feature extraction

We apply two convolutional encoders (E_p and E_g) to extract the features of p and g. Both E_p and E_g share the same architecture, composed of stacked residual blocks. The extracted features from E_p and E_g can be represented as {p_i}_{i=1}^{N} and {g_i}_{i=1}^{N} (N = 4 in Fig. 2 for simplicity), where p_i ∈ R^{c_i×h_i×w_i} and g_i ∈ R^{c_i×h_i×w_i} are the feature maps extracted from the corresponding residual block in E_p and E_g, respectively. The extracted feature maps will be used in W to predict the appearance flow.
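For illustration, a pair of such encoders could be sketched as follows. This is our assumption-laden sketch, not the paper's exact architecture: the residual block design, channel widths and the choice of average pooling are invented here.

```python
# Sketch of the twin feature extractors E_p and E_g (Sec. 3.3): N stacked
# residual blocks, each followed by pooling, returning the pyramid {p_i}/{g_i}.
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1))
        self.skip = nn.Conv2d(in_ch, out_ch, 1)
    def forward(self, x):
        return torch.relu(self.body(x) + self.skip(x))

class Encoder(nn.Module):
    def __init__(self, n_blocks=5, base_ch=64):
        super().__init__()
        chs = [3] + [base_ch * min(2 ** i, 4) for i in range(n_blocks)]
        self.blocks = nn.ModuleList(
            [ResBlock(chs[i], chs[i + 1]) for i in range(n_blocks)])
        self.pool = nn.AvgPool2d(2)         # halve the spatial size after each block
    def forward(self, x):
        feats = []
        for blk in self.blocks:
            x = self.pool(blk(x))
            feats.append(x)                 # p_1 ... p_N (coarsest last)
        return feats

E_p, E_g = Encoder(), Encoder()             # same architecture, separate weights
p_feats = E_p(torch.randn(1, 3, 256, 192))
g_feats = E_g(torch.randn(1, 3, 256, 192))
print([f.shape for f in p_feats])           # 128x96 down to 8x6 for N = 5
```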
3.4. Style based appearance flow estimation

The main novel component of the proposed model is a style-based global appearance flow estimation module. Different from previous methods that estimate appearance flow based on local feature correspondence [10, 13], originally proposed for optical flow estimation [6, 17], our method, based on a global style vector, first estimates a coarse appearance flow via style modulation and then refines the predicted coarse appearance flow based on local feature correspondence.

As illustrated in Fig. 2, our warping module (W) consists of N stacked warping blocks ({W_i}_{i=1}^{N}); each block is composed of a style-based appearance flow prediction layer (orange rectangle) and a local correspondence based appearance flow refinement layer (blue rectangle). Concretely, we first extract a global style vector (s ∈ R^c) using the features output from the Nth (final) blocks of E_p and E_g, denoted p_N and g_N:

s = [f_{p}(p_{N}), f_{g}(g_{N})], \quad (2)

where f_p and f_g are fully connected layers, and [·, ·] denotes concatenation. Intrinsically, the extracted global style vector s⁴ contains the global information of the person and the garment, e.g., position, structure, etc. Similar to style based image manipulation [15, 28, 33, 34], we expect the global style vector s to capture the deformation required for warping g into p. It is thus used for style modulation in a StyleGAN-style generator for estimating an appearance flow field.

⁴ Intuitively, s = f_p(p_N) is enough to generate the appearance flow. But we empirically found that s = [f_p(p_N), f_g(g_N)] yields better results.

More specifically, in the style-based appearance flow prediction layer of each block W_i, we apply style modulation to predict a coarse flow:

\mathbf{f_{ci}} = conv_m(\mathcal{S}(g_{N+1-i}, \mathcal{U}(\mathbf{f_{i-1}})), s), \quad (3)

where conv_m denotes modulated convolution [21], S(·, ·) is the sampling operator, U is the upsampling operator, and f_{i-1} ∈ R^{2×h_{i-1}×w_{i-1}} is the predicted flow from the previous warping block. Note that the first block W_1 in W only takes in the lowest resolution garment feature map and the style vector, i.e., f_{c1} = conv_m(g_N, s). As can be seen from Equation 3, the predicted f_{ci} depends on the garment feature map and the global style vector. It thus has a global receptive field and is capable of coping with large mis-alignments between the garment and person images. However, as the style vector s is a global representation, it has, as a trade-off, a limited ability to accurately estimate the local fine-grained appearance flow (as shown in Fig. 5). The coarse flow is thus in need of a local refinement.

To refine f_{ci}, we introduce a local correspondence based appearance flow refinement layer in each block W_i. It aims to estimate a local fine-grained appearance flow:

\mathbf{f_{ri}} = conv([\mathcal{S}(g_{N+1-i}, \mathbf{f_{ci}}), p_{N+1-i}]), \quad (4)

where f_{ri} is the predicted refinement flow, and conv denotes convolution. Fundamentally, the refinement layer estimates the refinement flow through local correspondence, i.e., the correspondence between the warped garment features and the person features in the same receptive field. Note that after the warping by f_{ci}, we can assume that the corresponding regions/features in g_{N+1-i} and p_{N+1-i} are now located in the same receptive field. Therefore, we can apply the local correspondence used in previous works [10, 13] to predict the local fine-grained appearance flow.

Finally, we add the coarse flow and the local fine-grained appearance flow together as the output of each warping block:

\mathbf{f_{i}} = \mathbf{f_{ci}} + \mathbf{f_{ri}}. \quad (5)

The predicted appearance flow f_N from the last block in W is used to warp the garment:

\hat{g} = \mathcal{S}(g, \mathbf{f_{N}}). \quad (6)

The warped garment ĝ is then concatenated with the person image and fed into a generator for target try-on image generation:

t = \mathcal{G}([\hat{g}, p]). \quad (7)

The generator G has an encoder-decoder architecture with skip connections in between. We follow the designs in [18, 46] that have been proven to be effective in texture detail preservation.
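To make Eqs. (3)-(6) more concrete, below is a condensed PyTorch sketch of a single warping block and of flow-based garment warping. It is an illustrative reconstruction under assumptions rather than the released code: the modulated convolution is a simplified stand-in for the StyleGAN2 operator, the flow is taken as per-pixel (x, y) offsets resampled with grid_sample, the equations are followed literally, and all channel sizes are invented.

```python
# Sketch of one warping block W_i (Eqs. 3-5) and the garment warp (Eq. 6).
import torch
import torch.nn as nn
import torch.nn.functional as F

def warp(feat, flow):
    """S(feat, flow): bilinearly sample `feat` at (base grid + flow offsets)."""
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(feat)            # (2, h, w)
    coords = base.unsqueeze(0) + flow                               # (b, 2, h, w)
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0                   # normalise to [-1, 1]
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                            # (b, h, w, 2)
    return F.grid_sample(feat, grid, align_corners=True)

class ModulatedConv(nn.Module):
    """Simplified conv_m: the style vector rescales input channels before a 3x3 conv."""
    def __init__(self, in_ch, out_ch, style_dim):
        super().__init__()
        self.scale = nn.Linear(style_dim, in_ch)
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
    def forward(self, x, s):
        x = x * (1 + self.scale(s)).unsqueeze(-1).unsqueeze(-1)
        return self.conv(x)

class WarpingBlock(nn.Module):
    def __init__(self, g_ch, p_ch, style_dim):
        super().__init__()
        self.coarse = ModulatedConv(g_ch, 2, style_dim)             # predicts f_ci
        self.refine = nn.Conv2d(g_ch + p_ch, 2, 3, padding=1)       # predicts f_ri
    def forward(self, g_i, p_i, s, f_prev=None):
        if f_prev is None:                                          # first block: f_c1 = conv_m(g_N, s)
            g_in = g_i
        else:                                                       # Eq. (3): warp g_i by U(f_{i-1})
            f_up = 2.0 * F.interpolate(f_prev, scale_factor=2,
                                       mode="bilinear", align_corners=True)
            g_in = warp(g_i, f_up)                                  # offsets doubled with resolution
        f_c = self.coarse(g_in, s)
        f_r = self.refine(torch.cat([warp(g_i, f_c), p_i], dim=1))  # Eq. (4)
        return f_c + f_r                                            # Eq. (5): f_i = f_ci + f_ri

# Toy usage over a two-level pyramid, coarsest level first.
s = torch.randn(1, 256)
g2, p2 = torch.randn(1, 256, 8, 6), torch.randn(1, 256, 8, 6)
g1, p1 = torch.randn(1, 128, 16, 12), torch.randn(1, 128, 16, 12)
f = WarpingBlock(256, 256, 256)(g2, p2, s)
f = WarpingBlock(128, 128, 256)(g1, p1, s, f_prev=f)
warped_garment = warp(torch.randn(1, 3, 16, 12), f)                 # Eq. (6) at this toy scale
print(warped_garment.shape)                                         # torch.Size([1, 3, 16, 12])
```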
3.5. Learning objectives

To train our model, we first apply a perceptual loss [20] between the output of F and the ground truth person image p_gt:

L_{p} = \sum_{i} \lVert \phi_{i}(t) - \phi_{i}(p_{gt}) \rVert, \quad (8)

where ϕ_i is the ith block of the pre-trained VGG network [35].

To supervise the training of the warping module W, we apply a loss on the warped garment:

L_{g} = \lVert \hat{g} - m_{g} \cdot p_{gt} \rVert, \quad (9)

where m_g is the garment mask of p_gt predicted by an off-the-shelf human parsing model.

As per standard in previous appearance flow methods [10, 13], we also apply a smoothness regularization on the predicted flow from each block in W:

L_{R} = \sum_{i} \lVert \nabla \mathbf{f_{i}} \rVert, \quad (10)

where ∥∇f_i∥ is the generalized Charbonnier loss function [36].

As the inputs (segmentation map, keypoint pose and dense pose) to the parser-based person encoder (E_p^PB) contain more semantic information than those of the parser-free model F (a person image), we apply a distillation loss to guide the learning of the person encoder E_p in F:

L_{D} = \sum_{i} \lVert p_{i}^{PB} - p_{i} \rVert, \quad (11)

where p_i^PB is the output feature map from the ith block of the person encoder E_p^PB in the pre-trained parser based model F^PB.

The overall learning objective is:

L = \lambda_{p}L_{p} + \lambda_{g}L_{g} + \lambda_{R}L_{R} + \lambda_{D}L_{D}, \quad (12)

where λ_p, λ_g, λ_R and λ_D denote the hyperparameters balancing the four objectives.
flow [13]) on all evaluation metrics. The human evaluation
Evaluation metrics and baselines We evaluate our results are shown in Table 2. The result is consistent with
model both automatically and manually. In the auto- that in Table 1. Our model outperforms all compared mod-
matic evaluation, as per standard in VTON, we evaluate els with more than 10% preference rate. The qualitative re-
model performance using structure similarity (SSIM) [40] sults from different models are illustrated in Fig. 3. Overall,
and Fréchet Inception Distance (FID) [16]. According to our method generates better try-on images. For example,
[10, 31], inception score (IS) [32] is not suitable to evaluate the hard pose and occlusion in second and third rows.
VTON images, we thus do not adopt it in the evaluation. In The quantitative results on augmented testing dataset are
the manual (subjective) evaluation, we run perceptual study shown in Table 3. As can be seen that our model again
on Amazon Mechanical Turk (AMT) to compare the qual- performs best on the augmented VITON testing dataset.
ity of the generated try-on images from different models. Importantly, all other models’ performance drops dramat-
Given an input person image, a garment image and the gen- ically. And our model can still maintain the performance
erated try-on image from two models, the AMT workers (SSIM score) compared to that on the original VITON test-
were asked to vote which generated try-on image is better. ing dataset. The qualitative examples are illustrated in
Each AMT worker was randomly allocated 100 images to Fig. 4. Only our model can generate consistent (e.g., the
compare two models. 15 AMT workers participated in the garment’s left sleeve) and high quality try-on images given
evaluation for all models comparison. the large mis-alignments.
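For illustration only, the split described above could be produced along the following lines. This is a sketch under assumptions: the shift and zoom ranges are not specified in the paper and are invented here.

```python
# Sketch of building the augmented VITON test split: 1/3 shifted, 1/3 zoomed,
# 1/3 unchanged. Shift/zoom ranges are illustrative assumptions.
import random
import torchvision.transforms.functional as TF
from PIL import Image

def augment_person(img: Image.Image, mode: str) -> Image.Image:
    if mode == "shift":
        dx, dy = random.randint(-40, 40), random.randint(-40, 40)
        return TF.affine(img, angle=0, translate=(dx, dy), scale=1.0, shear=0)
    if mode == "zoom":
        s = random.uniform(0.7, 1.3)
        return TF.affine(img, angle=0, translate=(0, 0), scale=s, shear=0)
    return img                                     # "unchanged" third

def build_augmented_split(images):
    modes = ["shift", "zoom", "unchanged"]
    return [augment_person(im, modes[i % 3]) for i, im in enumerate(images)]
```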
Implementation details. Our model is implemented in PyTorch. We train our model on a single Nvidia RTX 2080-Ti GPU. We set the batch size to 4 and train the model for 100 epochs with the Adam optimizer [23]. The initial learning rate is set to 5e-4 and is linearly decayed after 50 epochs. Each residual block in E_p and E_g is followed by a pooling layer to reduce the spatial dimension. We set N = 5 and c = 256 in the implementation. We will release the code upon the acceptance of this work.
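The schedule above could be set up as follows; this is a sketch, and the decay endpoint (here the learning rate reaches 0 at epoch 100) is an assumption.

```python
# Adam with lr 5e-4, constant for 50 of 100 epochs, then linearly decayed.
import torch

model = torch.nn.Linear(8, 8)                      # stand-in for the full VTON model
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)

total_epochs, decay_start = 100, 50
def lr_lambda(epoch):                              # 1.0 for 50 epochs, then linear decay
    if epoch < decay_start:
        return 1.0
    return max(0.0, (total_epochs - epoch) / (total_epochs - decay_start))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
for epoch in range(total_epochs):
    # ... one pass over the VITON training set with batch size 4 goes here ...
    optimizer.step()                               # placeholder parameter update
    scheduler.step()
```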
Evaluation metrics and baselines. We evaluate our model both automatically and manually. In the automatic evaluation, as per standard in VTON, we evaluate model performance using structural similarity (SSIM) [40] and Fréchet Inception Distance (FID) [16]. According to [10, 31], the inception score (IS) [32] is not suitable for evaluating VTON images, so we do not adopt it. In the manual (subjective) evaluation, we run a perceptual study on Amazon Mechanical Turk (AMT) to compare the quality of the generated try-on images from different models. Given an input person image, a garment image and the generated try-on images from two models, the AMT workers were asked to vote for the better generated try-on image. Each AMT worker was randomly allocated 100 images to compare two models. 15 AMT workers participated in the evaluation for all model comparisons.

We compare our method with the parser-based methods VTON [14], CP-VTON [38], Cloth-flow [13], CP-VTON++ [26], ACGPN [42], DCTON [9] and ZFlow [5]. We also compare with the SOTA parser-free method PF-AFN [10].
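SSIM and FID can be computed with off-the-shelf metric implementations; the snippet below is one possible setup using torchmetrics (our choice for illustration, not necessarily the tooling used in the paper; the feature dimension, data range and batch sizes are assumptions, and the image extra dependencies of torchmetrics must be installed).

```python
# One possible SSIM / FID evaluation setup (illustrative only).
import torch
from torchmetrics.image import StructuralSimilarityIndexMeasure
from torchmetrics.image.fid import FrechetInceptionDistance

ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
fid = FrechetInceptionDistance(feature=2048, normalize=True)  # float images in [0, 1]

# `tryon` and `real` stand for batches of generated and ground-truth person
# images in [0, 1]; random tensors are used here only to show the call pattern.
tryon = torch.rand(4, 3, 256, 192)
real = torch.rand(4, 3, 256, 192)

print("SSIM:", ssim(tryon, real).item())
fid.update(real, real=True)
fid.update(tryon, real=False)
print("FID:", fid.compute().item())
```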

Methods            Warping   Parser   SSIM ↑   FID ↓
VTON [14]          TPS       Y        0.74     55.71
CP-VTON [38]       TPS       Y        0.72     24.45
CP-VTON++ [26]     TPS       Y        0.75     21.04
Cloth-flow [13]    AF        Y        0.84     14.43
ACGPN [42]         TPS       Y        0.84     16.64
DCTON [9]          TPS       Y        0.83     14.82
PF-AFN [10]        AF        N        0.89     10.09
ZFlow [5]          AF        Y        0.88     15.17
Cloth-flow* [13]   AF        N        0.89     10.73
Ours               AF        N        0.91     8.89

Table 1. Quantitative results of different models on VITON. Warping denotes the warping method used in each model. Parser indicates whether a human parser is used in the model during inference. TPS: Thin Plate Spline. AF: Appearance Flow. *: re-trained with the parser-free training paradigm.

Main results. The quantitative results on the VITON testing dataset are shown in Table 1. It can be seen that our model achieves new state-of-the-art performance. Importantly, given the already low FID score (10.09) achieved by the prior SOTA method PF-AFN, our method further decreases it by 11.9%. Meanwhile, the following observations can be made from Table 1. (1) Appearance flow based warping methods generally perform better than TPS based warping methods. (2) Although they take more training time, parser-free methods are much better than parser-based methods. Our model, benefiting from the proposed novel global appearance flow estimation method, outperforms the previous SOTA parser-free methods (PF-AFN [10] and Cloth-flow [13]) on all evaluation metrics. The human evaluation results are shown in Table 2. The result is consistent with that in Table 1: our model is preferred over every compared model by a margin of more than 10%. The qualitative results from different models are illustrated in Fig. 3. Overall, our method generates better try-on images, e.g., for the hard poses and occlusions in the second and third rows.

Compared methods     Preference rate (other / ours)
CP-VTON++ [26]       12.7% / 87.3%
ACGPN [42]           20.2% / 79.8%
Cloth-flow* [13]     38.5% / 61.5%
PF-AFN [10]          43.2% / 56.8%

Table 2. The preference rate comparing other models against our model (other model / our model) in the human evaluation.

[Figure 3: qualitative comparison; columns: person, garment, CP-VTON++, ACGPN, PF-AFN, Ours.]

Figure 3. Qualitative results from different models (CP-VTON++ [26], ACGPN [42], PF-AFN [10] and ours) on the VITON testing dataset.

The quantitative results on the augmented testing dataset are shown in Table 3. As can be seen, our model again performs best on the augmented VITON testing dataset. Importantly, all other models' performance drops dramatically, while our model maintains its performance (SSIM score) compared to that on the original VITON testing dataset. Qualitative examples are illustrated in Fig. 4. Only our model can generate consistent (e.g., the garment's left sleeve) and high quality try-on images given the large mis-alignments.

Methods            SSIM ↑   FID ↓    ▽SSIM / ▽FID
ACGPN [42]         0.81     20.75    0.003 / 4.11
Cloth-flow* [13]   0.86     13.05    0.003 / 2.96
PF-AFN [10]        0.87     12.19    0.002 / 2.10
Ours               0.91     9.91     0 / 1.02

Table 3. Quantitative results of different models on augmented VITON and their relative performance drop (▽SSIM / ▽FID) compared to the standard VITON testing dataset.

[Figure 4: robustness comparison; columns: person, garment, ACGPN, Cloth-flow, PF-AFN, Ours.]

Figure 4. Illustration of different VTON models' robustness to randomly positioned person images. The first row uses the original person image as input; the second row uses a vertically shifted person image as input. ACGPN [42], Cloth-flow [13], PF-AFN [10].

Ablation study. In this experiment, we validate the design of our appearance flow estimation blocks (W_i). Specifically, we first experiment with only global style modulation (SM) based appearance flow estimation, that is, only using f_ci from Equation 3 in each W_i. We then experiment with only refinement flow (RF) estimation, that is, only using f_ri from Equation 4 in each W_i. Finally, we experiment with our combined method (SM + RF), which first estimates the appearance flow globally via style modulation and then refines it locally through local correspondence. The quantitative results are shown in Table 4. Our proposed global style modulation (SM) based appearance flow method outperforms the local correspondence based method. When the two are combined, the performance is further boosted. As illustrated in Fig. 5, without local refinement, our method (global style modulation only) sometimes cannot accurately predict the local fine-grained appearance flow, e.g., in the sleeve regions, and thus generates unsatisfactory try-on images. However, with only local correspondence based appearance flow estimation, i.e., only using f_ri in W_i, the method suffers when the corresponding regions are not located in the same receptive field. As illustrated in Fig. 6, f_ri cannot accurately estimate the appearance flow when there is a large misalignment between the input person image and garment image. Once f_ci is first used to reduce the misalignment, our model can successfully overcome the problem.

Methods    SSIM ↑   FID ↓
RF         0.89     10.73
SM         0.89     9.84
SM + RF    0.91     8.89

Table 4. Results on the VITON testing dataset when different appearance flow estimation methods are used in W_i. RF: local correspondence based flow estimation. SM: style modulation based flow estimation.

[Figure 5: columns: person, garment, f_ci only, f_ci + f_ri.]

Figure 5. Comparing results with only f_ci used in W_i and with f_ci + f_ri used in W_i.

[Figure 6: columns: person, garment, f_ri only, f_ci + f_ri.]

Figure 6. Comparing results with only f_ri used in W_i and with f_ci + f_ri used in W_i in the case of a large misalignment between the input person image and garment image.

5. Conclusion

In this paper, we have proposed a style based global appearance flow estimation method to warp the garment for virtual try-on. Via style modulation, our method first estimates the appearance flow globally and then refines it locally. Our method achieves state-of-the-art performance on the VITON benchmark and is more robust against large mis-alignments between person and garment images, as well as difficult poses/occlusions. We conducted extensive experiments to show the superiority of our method and to validate our architecture design.
References

[1] Badour AlBahar, Jingwan Lu, Jimei Yang, Zhixin Shu, Eli Shechtman, and Jia-Bin Huang. Pose with style: Detail-preserving pose-guided image synthesis with conditional StyleGAN. In SIGGRAPH Asia, 2021.
[2] Bharat Lal Bhatnagar, Garvita Tiwari, Christian Theobalt, and Gerard Pons-Moll. Multi-garment net: Learning to dress 3D people from images. In ICCV, 2019.
[3] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In CVPR, 2017.
[4] Anton Cherepkov, Andrey Voynov, and Artem Babenko. Navigating the GAN parameter space for semantic image editing. In CVPR, 2021.
[5] Ayush Chopra, Rishabh Jain, Mayur Hemani, and Balaji Krishnamurthy. ZFlow: Gated appearance flow-based virtual try-on with 3D priors. In ICCV, 2021.
[6] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick Van Der Smagt, Daniel Cremers, and Thomas Brox. FlowNet: Learning optical flow with convolutional networks. In CVPR, 2015.
[7] Jean Duchon. Splines minimizing rotation-invariant semi-norms in Sobolev spaces. In Constructive Theory of Functions of Several Variables, pages 85-100. Springer, 1977.
[8] Valentin Gabeur, Jean-Sébastien Franco, Xavier Martin, Cordelia Schmid, and Gregory Rogez. Moulding humans: Non-parametric 3D human shape estimation from single images. In ICCV, 2019.
[9] Chongjian Ge, Yibing Song, Yuying Ge, Han Yang, Wei Liu, and Ping Luo. Disentangled cycle consistency for highly-realistic virtual try-on. In CVPR, 2021.
[10] Yuying Ge, Yibing Song, Ruimao Zhang, Chongjian Ge, Wei Liu, and Ping Luo. Parser-free virtual try-on via distilling appearance flows. In CVPR, 2021.
[11] Ke Gong, Xiaodan Liang, Dongyu Zhang, Xiaohui Shen, and Liang Lin. Look into person: Self-supervised structure-sensitive learning and a new benchmark for human parsing. In CVPR, 2017.
[12] Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. DensePose: Dense human pose estimation in the wild. In CVPR, 2018.
[13] Xintong Han, Xiaojun Hu, Weilin Huang, and Matthew R. Scott. ClothFlow: A flow-based model for clothed person generation. In ICCV, 2019.
[14] Xintong Han, Zuxuan Wu, Zhe Wu, Ruichi Yu, and Larry S. Davis. VITON: An image-based virtual try-on network. In CVPR, 2018.
[15] Sen He, Wentong Liao, Michael Ying Yang, Yi-Zhe Song, Bodo Rosenhahn, and Tao Xiang. Disentangled lifespan face synthesis. In ICCV, 2021.
[16] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.
[17] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In CVPR, 2017.
[18] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
[19] Thibaut Issenhuth, Jérémie Mary, and Clément Calauzenes. Do not mask what you do not need to mask: a parser-free virtual try-on. In ECCV, 2020.
[20] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
[21] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In CVPR, 2019.
[22] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In CVPR, 2020.
[23] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[24] Kathleen M. Lewis, Srivatsan Varadharajan, and Ira Kemelmacher-Shlizerman. TryOnGAN: Body-aware try-on via layered interpolation. TOG, 40(4):1-10, 2021.
[25] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model. TOG, 34(6):1-16, 2015.
[26] Matiur Rahman Minar, Thai Thanh Tuan, Heejune Ahn, Paul Rosin, and Yu-Kun Lai. CP-VTON+: Clothing shape and texture preserving image-based virtual try-on. In CVPRW, 2020.
[27] Aymen Mir, Thiemo Alldieck, and Gerard Pons-Moll. Learning to transfer texture from clothing images to 3D humans. In CVPR, 2020.
[28] Roy Or-El, Soumyadip Sengupta, Ohad Fried, Eli Shechtman, and Ira Kemelmacher-Shlizerman. Lifespan age transformation synthesis. In ECCV, pages 739-755, 2020.
[29] Yurui Ren, Xiaoming Yu, Junming Chen, Thomas H. Li, and Ge Li. Deep image spatial transformation for person image generation. In CVPR, 2020.
[30] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
[31] Mihaela Rosca, Balaji Lakshminarayanan, David Warde-Farley, and Shakir Mohamed. Variational approaches for auto-encoding generative adversarial networks. arXiv preprint arXiv:1706.04987, 2017.
[32] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In NeurIPS, 2016.
[33] Yujun Shen, Ceyuan Yang, Xiaoou Tang, and Bolei Zhou. InterFaceGAN: Interpreting the disentangled face representation learned by GANs. TPAMI, 2020.
[34] Yujun Shen and Bolei Zhou. Closed-form factorization of latent semantics in GANs. In CVPR, 2021.
[35] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[36] Deqing Sun, Stefan Roth, and Michael J. Black. A quantitative analysis of current practices in optical flow estimation and the principles behind them. IJCV, 106(2):115-137, 2014.
[37] Christos Tzelepis, Georgios Tzimiropoulos, and Ioannis Patras. WarpedGANSpace: Finding non-linear RBF paths in GAN latent space. In ICCV, 2021.
[38] Bochao Wang, Huabin Zheng, Xiaodan Liang, Yimin Chen, Liang Lin, and Meng Yang. Toward characteristic-preserving image-based virtual try-on network. In ECCV, 2018.
[39] Jiahang Wang, Tong Sha, Wei Zhang, Zhoujun Li, and Tao Mei. Down to the last detail: Virtual try-on with fine-grained details. In ACM MM, 2020.
[40] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: from error visibility to structural similarity. TIP, 13(4):600-612, 2004.
[41] Ceyuan Yang, Yujun Shen, and Bolei Zhou. Semantic hierarchy emerges in deep generative representations for scene synthesis. IJCV, 129(5):1451-1466, 2021.
[42] Han Yang, Ruimao Zhang, Xiaobao Guo, Wei Liu, Wangmeng Zuo, and Ping Luo. Towards photo-realistic virtual try-on by adaptively generating-preserving image content. In CVPR, 2020.
[43] Ruiyun Yu, Xiaoqi Wang, and Xiaohui Xie. VTNFP: An image-based virtual try-on network with body and clothing feature preservation. In ICCV, 2019.
[44] Fuwei Zhao, Zhenyu Xie, Michael Kampffmeyer, Haoye Dong, Songfang Han, Tianxiang Zheng, Tao Zhang, and Xiaodan Liang. M3D-VTON: A monocular-to-3D virtual try-on network. In ICCV, 2021.
[45] Tinghui Zhou, Shubham Tulsiani, Weilun Sun, Jitendra Malik, and Alexei A. Efros. View synthesis by appearance flow. In ECCV, 2016.
[46] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.
