
Towards Multi-pose Guided Virtual Try-on Network

Haoye Dong¹, Xiaodan Liang¹, Bochao Wang¹, Hanjiang Lai¹, Jia Zhu², Jian Yin¹
¹Sun Yat-sen University  ²South China Normal University
{donghy7@mail2, wangboch@mail2, laihanj3@mail, issjyin@mail}.sysu.edu.cn,
[email protected], [email protected]
arXiv:1902.11026v1 [cs.CV] 28 Feb 2019

Abstract

A virtual try-on system that works under arbitrary human poses has huge application potential, yet it raises many challenges, e.g., self-occlusion, heavy misalignment between diverse poses, and diverse clothes textures. Existing methods that fit new clothes onto a person can only transfer clothes under a fixed human pose, and they still show unsatisfactory performance: they often fail to preserve identity, lose texture details, and reduce the diversity of poses. In this paper, we make the first attempt towards a multi-pose guided virtual try-on system, which enables clothes to be transferred onto a person image under diverse poses. Given an input person image, a desired clothes image, and a desired pose, the proposed Multi-pose Guided Virtual Try-on Network (MG-VTON) generates a new person image after fitting the desired clothes onto the input image and manipulating the human pose. Our MG-VTON is constructed in three stages: 1) a desired human parsing map of the target image is synthesized to match both the desired pose and the desired clothes shape; 2) a deep Warping Generative Adversarial Network (Warp-GAN) warps the desired clothes appearance onto the synthesized human parsing map and alleviates the misalignment between the input human pose and the desired human pose; 3) a refinement render utilizing multi-pose composition masks recovers the texture details of the clothes and removes artifacts. Extensive experiments on well-known datasets and our newly collected, largest virtual try-on benchmark demonstrate that our MG-VTON significantly outperforms all state-of-the-art methods both qualitatively and quantitatively, with promising multi-pose virtual try-on performance.

Figure 1. Some results of our model obtained by manipulating both clothes and poses. The input clothes and poses are shown in the first row, the input person images in the first column, and the results manipulated by both clothes and pose in the remaining columns.

1. Introduction

Learning to synthesize an image of a person conditioned on an image of clothes while simultaneously manipulating the pose is a significant and valuable task for many applications such as virtual try-on, virtual reality, and human-computer interaction. In this work, we propose a multi-stage method that synthesizes a person image conditioned on both clothes and pose. Given an image of a person, a desired clothes image, and a desired pose, we generate a realistic image that preserves the appearance of both the desired clothes and the person while reconstructing the pose, as illustrated in Figure 1. A delicate and plausible synthesized outfit under an arbitrary pose clearly helps users select clothes while shopping.

However, recent image synthesis approaches [8, 29] for virtual try-on mainly focus on a fixed pose and fail to preserve fine details: the lower-body clothing and the hair of the person lose their details and style, as shown in Figure 4. To generate realistic images, those methods apply a coarse-to-fine network conditioned on clothes only. They ignore the significant cues provided by human parsing, which leads to blurry and implausible images, especially when conditioning on various poses. For instance, as shown in Figure 4, the lower-body clothing cannot be preserved while the upper-body clothing is replaced, and the identity of the head is lost under different poses. Other existing works [14, 20, 35] usually leverage 3D measurements
to solve these issues, since 3D information contains abundant details of the body shape that help generate realistic results. However, building 3D models requires expert knowledge and heavy labor, as well as 3D-annotated data and massive computation. These costs and this complexity limit the applicability to practical virtual try-on simulation.

In this paper, we study the problem of virtual try-on conditioned on 2D images and arbitrary poses, which aims to learn a mapping from an input image of a person to another image of the same person with a new outfit and a different pose, by manipulating the target clothes and pose. Although image-based virtual try-on with a fixed pose has been widely studied [8, 29, 37], the task of multi-pose virtual try-on is less explored. In addition, without modeling the intricate interplay among the appearance, the clothes, and the pose, directly applying existing virtual try-on methods to synthesize images under different poses often results in blur and artifacts.

Targeting the problems mentioned above, we propose a novel Multi-pose Guided Virtual Try-on Network (MG-VTON) that can generate a new person image after fitting the desired clothes onto the input image and manipulating the human pose. Our MG-VTON is a multi-stage framework with generative adversarial learning. Concretely, we design a pose-clothes-guided human parsing network to estimate a plausible human parsing map of the target image conditioned on the approximate body shape, the face mask, the hair mask, the desired clothes, and the target pose, which guides the synthesis effectively with precise body-part regions. To seamlessly fit the desired clothes onto the person, we warp the desired clothes image by exploiting a geometric matching model that estimates the transformation parameters between the mask of the input clothes image and the mask of the synthesized clothes extracted from the synthesized human parsing. In addition, we design a deep Warping Generative Adversarial Network (Warp-GAN) to synthesize a coarse result while alleviating the large misalignment caused by different poses and the diversity of clothes. Finally, we present a refinement network utilizing multi-pose composition masks to recover the texture details and alleviate the artifacts caused by the large misalignment between the reference pose and the target pose.

To demonstrate our model, we collected a new dataset, named MPV, consisting of various clothes images and person images of the same person in diverse poses. In addition, we conduct experiments on the DeepFashion [38] dataset for testing. Following the evaluation protocol of [30], we conduct a human subjective study on the Amazon Mechanical Turk (AMT) platform. Both quantitative and qualitative results indicate that our method achieves effective performance and high-quality images with appealing details. The main contributions are listed as follows:

• A new task of virtual try-on conditioned on multiple poses is proposed, which aims to reconstruct the person image by manipulating both diverse poses and clothes.

• We propose a novel Multi-pose Guided Virtual Try-on Network (MG-VTON) that generates a new person image after fitting the desired clothes onto the input person image and manipulating the human pose. MG-VTON contains four modules: 1) a pose-clothes-guided human parsing network designed to guide the image synthesis; 2) a Warp-GAN that learns to synthesize realistic images using a warped-feature strategy; 3) a refinement network that learns to recover texture details; 4) a mask-based geometric matching network that warps clothes and enhances the visual quality of the generated image.

• A new dataset for the multi-pose guided virtual try-on task is collected, covering person images with greater pose and clothes diversity. Extensive experiments demonstrate that our approach achieves competitive quantitative and qualitative results.

2. Related Work

Generative Adversarial Networks (GANs). A GAN [7] consists of a generator and a discriminator: the discriminator learns to classify between synthesized and real images, while the generator tries to fool the discriminator by producing realistic images that are indistinguishable from real ones. Existing works have built various applications on GANs, such as style transfer [9, 36, 12, 34], image inpainting [33], text-to-image synthesis [22], and super-resolution [16]. Inspired by these impressive results, we also apply an adversarial loss to build our virtual try-on method.

Person image synthesis. Skeleton-aided generation [32] proposed a skeleton-guided person image generation method conditioned on a person image and target skeletons. PG2 [17] applied a coarse-to-fine framework consisting of a coarse stage and a refinement stage; the same authors later proposed a model [18] that further improves the quality of the results using a decomposition strategy. DeformableGANs [27] and [1] attempted to alleviate the misalignment between different poses by applying affine transformations to coarse rectangular regions and by warping parts at the pixel level, respectively. V-UNET [5] introduced a variational U-Net [24] to synthesize person images by restructuring the shape with a stickman label. [21] applied CycleGAN [36] directly to manipulate pose. However, all of these works fail to preserve texture details consistent with the pose.
Figure 2. Overview of the proposed MG-VTON. Stage I: we first decompose the reference image into three binary masks, then concatenate them with the target clothes and target pose as the input of the conditional parsing network to predict the human parsing map. Stage II: we warp the clothes, remove the clothing from the reference image, and concatenate them with the target pose and the synthesized parsing to synthesize the coarse result using Warp-GAN. Stage III: we finally refine the coarse result with a refinement render, conditioned on the warped clothes, the target pose, and the coarse result.

The reason is that they ignore the interplay between the human parsing map and the pose in person image synthesis. The human parsing map can guide the generator to synthesize the image at a precise region level, which ensures the coherence of the body structure.

Virtual try-on. VITON [8] and CP-VTON [29] both presented image-based virtual try-on networks that transfer a desired clothes item onto a person using a warping strategy. VITON computed the transformation mapping directly with shape-context TPS warps [2], while CP-VTON introduced a learning-based method to estimate the transformation parameters. FashionGAN [37] learned to generate new clothes on the input person image conditioned on a sentence describing a different outfit. However, all of the above methods synthesize the person image only for a fixed pose, which limits their applicability to practical virtual try-on simulation. ClothNet [15] presented an image-based generative model that produces new clothes conditioned on color. CAGAN [10] proposed a conditional analogy network to synthesize person images conditioned on paired clothes, which also limits practical virtual try-on scenarios. To generate realistic-looking person images in different clothes, ClothCap [20] used a 3D scanner to automatically capture the clothes and the body shape, and [26] presented a virtual fitting system that requires the 3D body shape, making annotation collection laborious. In this paper, we introduce a novel and effective method that learns, through adversarial learning, to synthesize the image of a person with a new outfit while also manipulating the pose.

3. MG-VTON

We propose a novel Multi-pose Guided Virtual Try-on Network (MG-VTON) that learns to synthesize a new person image for virtual try-on by manipulating both clothes and pose. Given an input person image, a desired clothes image, and a desired pose, MG-VTON aims to produce a new image of the person wearing the desired clothes under the desired pose. Inspired by the coarse-to-fine idea [8, 17], we adopt an outline-coarse-fine strategy that divides this task into three subtasks: conditional parsing learning, the Warp-GAN, and the refinement render. Figure 2 illustrates the overview of MG-VTON. We first apply the pose estimator [4] to estimate the pose and encode it as 18 heatmaps, each filled with ones inside a circle of radius 4 pixels around the corresponding keypoint and zeros elsewhere. A human parser [6] is used to predict the human segmentation map, consisting of 20 labels, from which we extract the binary masks of the face and hair and the shape of the body. Following VITON [8], we downsample the body shape to a lower resolution (16 × 12) and directly resize it back to the original resolution (256 × 192), which alleviates artifacts caused by the variety of body shapes.
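For illustration, the conditioning inputs described above can be prepared with a few lines of tensor code. The following is a minimal sketch (not the authors' released code) of how the 18 keypoint heatmaps and the blurred body-shape map might be built; the keypoint format and the "negative means missing joint" convention are assumptions made for this example.

```python
import torch
import torch.nn.functional as F

def pose_to_heatmaps(keypoints, height=256, width=192, radius=4):
    """Encode 18 (x, y) keypoints as binary heatmaps: ones inside a circle of
    the given radius around each detected joint, zeros elsewhere."""
    maps = torch.zeros(18, height, width)
    ys = torch.arange(height).view(-1, 1).expand(height, width)
    xs = torch.arange(width).view(1, -1).expand(height, width)
    for i, (x, y) in enumerate(keypoints):
        if x < 0 or y < 0:            # assumed convention: negative = joint not detected
            continue
        maps[i] = ((xs - x) ** 2 + (ys - y) ** 2 <= radius ** 2).float()
    return maps

def coarse_body_shape(body_mask):
    """Downsample the binary body-shape mask to 16x12 and resize it back to
    256x192, which removes identity-specific silhouette details (as in VITON)."""
    x = body_mask.float().view(1, 1, *body_mask.shape)
    small = F.interpolate(x, size=(16, 12), mode='area')
    return F.interpolate(small, size=(256, 192), mode='bilinear', align_corners=False)[0, 0]

# Toy usage with stand-ins for a pose estimator / human parser output.
keypoints = [(torch.randint(0, 192, (1,)).item(), torch.randint(0, 256, (1,)).item()) for _ in range(18)]
pose = pose_to_heatmaps(keypoints)               # (18, 256, 192)
shape = coarse_body_shape(torch.ones(256, 192))  # (256, 192), blurred silhouette
```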
3.1. Conditional Parsing Learning

To preserve the structural coherence of the person while manipulating both the clothes and the pose, we design a pose-clothes-guided human parsing network conditioned on the image of the clothes, the pose heatmap, the approximated body shape, the face mask, and the hair mask. As shown in Figure 4, the baseline methods fail to preserve some parts of the person (e.g., the color of the trousers and the hairstyle are replaced) because they feed the person and clothes images into the model directly. In this work, we leverage human parsing maps to address these problems, which helps the generator synthesize high-quality images at the level of body parts.

Formally, given an input person image I, an input clothes image C, and the target pose P, this stage learns to predict the human parsing map S'_t conditioned on the clothes C and the pose P. As shown in Figure 3 (a), we first extract the hair mask M_h, the face mask M_f, the body shape M_b, and the target pose P using a human parser [6] and a pose estimator [4], respectively. We then concatenate them with the clothes image to form the input of the conditional parsing network.
Figure 3. The network architecture of the proposed MG-VTON. (a)(b): The conditional parsing learning module consists of a pose-clothes-guided network that predicts the human parsing map, which helps to generate a high-quality person image. (c)(d): The Warp-GAN learns to generate the realistic image using a warped-feature strategy, addressing the misalignment caused by the diversity of poses. (e): The refinement render network learns a pose-guided composition mask that enhances the visual quality of the synthesized image. (f): The geometric matching network learns to estimate the transformation mapping conditioned on the body shape and the clothes mask.

The inference of S'_t can be formulated as maximizing the posterior probability p(S'_t | M_h, M_f, M_b, C, P). Furthermore, this stage is based on the conditional generative adversarial network (CGAN) [19], which produces promising results for image manipulation. Thus, the posterior probability is expressed as:

    p(S'_t \mid M_h, M_f, M_b, C, P) = G(M_h, M_f, M_b, C, P).   (1)

We adopt a ResNet-like network as the generator G to build the conditional parsing model and adopt the discriminator D directly from pix2pixHD [30]. We apply an L1 loss to further improve performance, which is advantageous for generating smoother results [32]. Inspired by LIP [6], we also apply a pixel-wise softmax loss to encourage the generator to synthesize high-quality human parsing maps. Therefore, the problem of conditional parsing learning can be formulated as:

    \min_G \max_D F(G, D) = \mathbb{E}_{M,C,P \sim p_{data}} [\log(1 - D(G(M, C, P), M, C, P))]
        + \mathbb{E}_{S_t,M,C,P \sim p_{data}} [\log D(S_t, M, C, P)]
        + \mathbb{E}_{S_t,M,C,P \sim p_{data}} [\|S_t - G(M, C, P)\|_1]
        + \mathbb{E}_{S_t,M,C,P \sim p_{data}} [L_{parsing}(S_t, G(M, C, P))],   (2)

where M denotes the concatenation of M_h, M_f, and M_b, the loss L_{parsing} denotes the pixel-wise softmax loss [6], S_t denotes the ground-truth human parsing, and p_{data} represents the distribution of the real data.
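As an illustration of how the objective in Eq. (2) could be assembled, the sketch below combines the adversarial, L1, and pixel-wise softmax terms in PyTorch. The generator G and discriminator D are placeholders (any segmentation-style generator and a pix2pixHD-style discriminator returning probabilities would fit), and the discriminator input layout is an assumption rather than the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def one_hot(labels, num_classes):
    """Convert (B, H, W) class indices to (B, num_classes, H, W) one-hot maps."""
    return F.one_hot(labels, num_classes).permute(0, 3, 1, 2).float()

def conditional_parsing_losses(G, D, M, C, P, S_t):
    """Losses for one step of the conditional parsing stage (Eq. 2).
    M: concatenated hair/face/body-shape masks, C: clothes image,
    P: pose heatmaps, S_t: ground-truth parsing as class indices (LongTensor)."""
    cond = torch.cat([M, C, P], dim=1)              # conditioning tensor
    logits = G(cond)                                # (B, 20, H, W) parsing scores
    parsing = torch.softmax(logits, dim=1)

    # Adversarial terms; D is assumed to output probabilities in (0, 1).
    d_fake = D(torch.cat([parsing, cond], dim=1))
    d_real = D(torch.cat([one_hot(S_t, 20), cond], dim=1))
    loss_d = -(torch.log(d_real + 1e-8).mean() + torch.log(1 - d_fake + 1e-8).mean())
    loss_g_adv = -torch.log(d_fake + 1e-8).mean()

    # L1 term between synthesized and real parsing, plus the pixel-wise softmax loss.
    loss_l1 = F.l1_loss(parsing, one_hot(S_t, 20))
    loss_parsing = F.cross_entropy(logits, S_t)

    return loss_g_adv + loss_l1 + loss_parsing, loss_d
```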
3.2. Warp-GAN

Since misalignment of pixels leads to blurry results [27], we introduce a deep Warping Generative Adversarial Network (Warp-GAN) that warps the desired clothes appearance onto the synthesized human parsing map, which alleviates the misalignment between the input human pose and the desired human pose. Different from DeformableGANs [27] and [1], we warp the feature map from the bottleneck layer using both affine and TPS (Thin-Plate Spline) [3] transformations, rather than processing pixels directly with an affine transformation only. Thanks to the generalization capacity of [23], we directly use its pre-trained model to estimate the transformation mapping between the reference parsing and the synthesized parsing, and then warp the clothes-removed reference image with this mapping.

As illustrated in Figure 3 (c) and (d), the proposed deep warping network consists of the Warp-GAN generator G_warp and the Warp-GAN discriminator D_warp. We use the geometric matching module described in Section 3.4 to warp the clothes image. Formally, we take the warped clothes image C_w, the clothes-removed reference image I_{w/o clothes}, the target pose P, and the synthesized human parsing S'_t as the input of the Warp-GAN generator and synthesize the result Î = G_warp(C_w, I_{w/o clothes}, P, S'_t). Inspired by [11, 8, 16], we apply a perceptual loss that measures the distances between high-level features of a pre-trained network, which encourages the generator to synthesize high-quality and realistic-looking images. We formulate the perceptual loss as:

    L_{perceptual}(Î, I) = \sum_{i=0}^{n} \alpha_i \|\phi_i(Î) - \phi_i(I)\|_1,   (3)

where \phi_i(I) denotes the i-th (i = 0, 1, 2, 3, 4) layer feature map of the ground-truth image I in the pre-trained network \phi. We use the pre-trained VGG19 [28] as \phi and compute a weighted sum of the L1 distances between the feature maps of the last five layers to represent the perceptual loss between images; \alpha_i controls the weight of each layer. In addition, following pix2pixHD [30], since feature maps at different scales from different layers of the discriminator enhance the performance of image synthesis, we also introduce a feature loss:

    L_{feature}(Î, I) = \sum_{i=0}^{n} \gamma_i \|F_i(Î) - F_i(I)\|_1,   (4)

where F_i(I) denotes the i-th (i = 0, 1, 2) layer feature map of the trained D_warp and \gamma_i denotes the weight of the corresponding L1 term. Furthermore, we apply the adversarial loss L_{adv} [7, 19] and an L1 loss [32] to improve performance. We design a weighted sum of these losses as the loss of G_warp, which encourages G_warp to synthesize realistic and natural images in different respects:

    L_{G_{warp}} = \lambda_1 L_{adv} + \lambda_2 L_{perceptual} + \lambda_3 L_{feature} + \lambda_4 L_1,   (5)

where \lambda_i (i = 1, 2, 3, 4) denotes the weight of the corresponding loss.
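For readers who want to prototype the losses in Eqs. (3)-(5), the sketch below shows one way they might look in PyTorch with a pre-trained VGG19. The chosen layer split points, layer weights, and lambda values are illustrative assumptions, not values reported by the paper.

```python
import torch
import torch.nn.functional as F
import torchvision

class VGGPerceptualLoss(torch.nn.Module):
    """Perceptual loss of Eq. (3): weighted L1 distances between VGG19 feature maps."""
    def __init__(self, weights=(1/32, 1/16, 1/8, 1/4, 1.0)):
        super().__init__()
        vgg = torchvision.models.vgg19(pretrained=True).features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg = vgg
        self.slices = (2, 7, 12, 21, 30)   # assumed split points for five feature levels
        self.weights = weights

    def features(self, x):
        feats, start = [], 0
        for end in self.slices:
            x = self.vgg[start:end](x)
            feats.append(x)
            start = end
        return feats

    def forward(self, fake, real):
        loss = 0.0
        for w, f_fake, f_real in zip(self.weights, self.features(fake), self.features(real)):
            loss = loss + w * F.l1_loss(f_fake, f_real)
        return loss

def feature_matching_loss(d_feats_fake, d_feats_real, gammas=(1.0, 1.0, 1.0)):
    """Feature loss of Eq. (4): L1 distances between discriminator feature maps."""
    return sum(g * F.l1_loss(ff, fr.detach())
               for g, ff, fr in zip(gammas, d_feats_fake, d_feats_real))

def warp_gan_loss(l_adv, l_perc, l_feat, l_l1, lambdas=(1.0, 10.0, 10.0, 1.0)):
    """Weighted sum of Eq. (5); the lambda values here are placeholders."""
    return (lambdas[0] * l_adv + lambdas[1] * l_perc
            + lambdas[2] * l_feat + lambdas[3] * l_l1)
```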
3.3. Refinement render

In the coarse stage, the identity information and the shape of the person are preserved, but the texture details are lost due to the complexity of the clothes image. Pasting the warped clothes directly onto the target person may produce artifacts. Learning a composition mask between the warped clothes image and the coarse result also generates artifacts [8, 29] because of the diversity of poses. To solve these issues, we present a refinement render that utilizes multi-pose composition masks to recover the texture details and remove artifacts.

Formally, we define C_w as the warped clothes image obtained by the geometric matching module, Î_c as the coarse result generated by the Warp-GAN, P as the target pose heatmap, and G_p as the generator of the refinement render. As illustrated in Figure 3 (e), taking C_w, Î_c, and P as input, G_p learns to predict a multi-pose composition mask and synthesize the rendered result:

    Î_p = G_p(C_w, Î, P) \odot C_w + (1 - G_p(C_w, Î, P)) \odot Î,   (6)

where \odot denotes element-wise matrix multiplication. We also adopt the perceptual loss to enhance performance, so that the objective function of G_p can be written as:

    L_p = \mu_1 L_{perceptual}(Î_p, I) + \mu_2 \|1 - G_p(C_w, Î_c, P)\|_1,   (7)

where \mu_1 denotes the weight of the perceptual loss and \mu_2 denotes the weight of the mask loss.
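A minimal sketch of how Eqs. (6)-(7) might be implemented is given below. G_p is a placeholder generator module, perceptual_loss is any VGG-style loss (e.g., the one sketched above), and the mu weights are placeholders rather than reported hyperparameters.

```python
import torch
import torch.nn.functional as F

def refine(G_p, warped_clothes, coarse, pose, perceptual_loss, target=None,
           mu1=1.0, mu2=1.0):
    """Refinement render of Eqs. (6)-(7): G_p predicts a multi-pose composition
    mask that blends the warped clothes with the coarse Warp-GAN result."""
    mask = torch.sigmoid(G_p(torch.cat([warped_clothes, coarse, pose], dim=1)))
    rendered = mask * warped_clothes + (1.0 - mask) * coarse        # Eq. (6)

    loss = None
    if target is not None:                                          # training mode
        loss = (mu1 * perceptual_loss(rendered, target)             # Eq. (7), first term
                + mu2 * F.l1_loss(mask, torch.ones_like(mask)))     # mask regularizer
    return rendered, mask, loss
```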
3.4. Geometric matching learning

Inspired by [23], we adopt a convolutional neural network to learn the transformation parameters; it consists of feature extraction layers, feature matching layers, and transformation-parameter estimation layers. As shown in Figure 3 (f), we take the mask of the clothes image and the mask of the body shape as input, which are first passed through the feature extraction layers. We then predict the correlation map with the matching layers. Finally, a regression network estimates the TPS (Thin-Plate Spline) [3] transformation parameters for the clothes image directly from the correlation map.

Formally, given an input clothes image C and its mask C_mask, and following the conditional parsing learning stage, we obtain the approximated body shape M_b and the synthesized clothes mask Ĉ_mask from the synthesized human parsing. This subtask aims to learn a transformation mapping function T with parameters θ for warping the input clothes image C. Since the synthesized clothes are unseen but their mask is available, we learn the mapping between the original clothes mask C_mask and the synthesized clothes mask Ĉ_mask, subject to the body shape M_b. Thus, the objective function of geometric matching learning can be formulated as:

    L_{geo\_matching}(\theta) = \|T_\theta(C_{mask}) - \hat{C}_{mask}\|_1.   (8)

The warped clothes can then be obtained as C_w = T_θ(C), which helps address the misalignment problem and supports learning the composition masks in Sections 3.2 and 3.3.
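A training step for Eq. (8) can be sketched as follows. The regression network `matcher` and the differentiable warping function `tps_warp` (for example, the TPS grid generator of Rocco et al. [23], which is not reproduced here) are hypothetical placeholders assumed by this example.

```python
import torch
import torch.nn.functional as F

def geometric_matching_step(matcher, tps_warp, clothes_mask, body_shape, synth_clothes_mask):
    """One training step of Eq. (8): predict TPS parameters from the clothes mask
    and the approximate body shape, warp the clothes mask, and compare it with the
    clothes mask extracted from the synthesized human parsing."""
    theta = matcher(torch.cat([clothes_mask, body_shape], dim=1))   # transformation parameters
    warped_mask = tps_warp(clothes_mask, theta)                     # T_theta(C_mask)
    return F.l1_loss(warped_mask, synth_clothes_mask)               # ||T_theta(C_mask) - Ĉ_mask||_1

# At test time the same parameters warp the clothes image: C_w = tps_warp(C, theta).
```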
4. Experiments

In this section, we first make visual comparisons with other methods and then discuss the results quantitatively. We also conduct a human perceptual study and an ablation study, and further train our model on our newly collected MPV dataset and test it on DeepFashion to verify its generalization capacity.

Figure 4. Visual comparison with other methods on our collected dataset MPV. MG-VTON (w/o Render) denotes the model with the refinement render removed; MG-VTON (w/o Mask) denotes the model with the multi-pose composition mask removed.
4.1. Datasets

Since each person image in the datasets used in VITON [8] and CP-VTON [29] has only one fixed pose, we collected a new dataset from the Internet, named MPV, which contains 35,687 person images and 13,524 clothes images; each person in MPV appears in several poses. The images have a resolution of 256 × 192. We extract 62,780 three-tuples of the same person wearing the same clothes in diverse poses, and further divide them into a training set and a test set with 52,236 and 10,544 three-tuples, respectively. Note that we shuffle the test set with different clothes and diverse poses for quality evaluation. DeepFashion [38] only has pairs of the same person in different poses but no clothes images. To verify the generalization capacity of the proposed model, we extract 10,000 pairs from DeepFashion and randomly select clothes images from the MPV test set for testing.

4.2. Evaluation Metrics

We apply three measures, both subjective and objective, to evaluate the proposed model: 1) we perform pairwise A/B tests deployed on the Amazon Mechanical Turk (AMT) platform as a human perceptual study; 2) we use Structural SIMilarity (SSIM) [31] to measure the similarity between the synthesized image and the ground-truth image, where the target image (the same person wearing the same clothes) is taken as the ground truth; 3) we use the Inception Score (IS) [25] to measure the quality of the generated images, a common metric for image generation.
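The Inception Score is computed from class-posterior predictions of a pre-trained classifier (Inception-v3 in [25]). The helper below is a generic sketch that takes an array of softmax outputs; obtaining those predictions and matching the exact preprocessing and split protocol used in the paper is left out and would affect the numbers.

```python
import numpy as np

def inception_score(softmax_preds, num_splits=10, eps=1e-12):
    """Inception Score [25]: exp of the mean KL divergence between the conditional
    class distribution p(y|x) and the marginal p(y), averaged over splits.
    softmax_preds: array of shape (N, num_classes) with rows summing to 1."""
    scores = []
    for chunk in np.array_split(softmax_preds, num_splits):
        p_y = chunk.mean(axis=0, keepdims=True)                       # marginal p(y)
        kl = (chunk * (np.log(chunk + eps) - np.log(p_y + eps))).sum(axis=1)
        scores.append(np.exp(kl.mean()))
    return float(np.mean(scores)), float(np.std(scores))

# Toy example with random predictions (a real evaluation would use Inception-v3 outputs).
fake_preds = np.random.dirichlet(np.ones(1000), size=5000)
mean_is, std_is = inception_score(fake_preds)
```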
4.3. Implementation Details

Setting. We train the conditional parsing network, Warp-GAN, refinement render, and geometric matching network for 200, 15, 5, and 35 epochs, respectively, using the ADAM optimizer [13] with a batch size of 40, a learning rate of 0.0002, β1 = 0.5, and β2 = 0.999. We use two NVIDIA Titan Xp GPUs and the PyTorch platform on Ubuntu 14.04.

Architecture. As shown in Figure 3, each generator of MG-VTON is a ResNet-like network consisting of three downsampling layers, three upsampling layers, and nine residual blocks; each block has three convolutional layers with 3×3 kernels followed by batch normalization and a ReLU activation. The numbers of filters are 64, 128, 256, 512, 512, 512, 512, 512, 512, 512, 512, 512, 256, 128, 64. For the discriminator, we apply the same architecture as pix2pixHD [30], which handles feature maps at different scales with different layers; each discriminator contains four downsampling layers with 4×4 kernels, InstanceNorm, and LeakyReLU activations.
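The description above can be turned into a compact PyTorch module. The sketch below is one possible reading of the stated layer counts and filter sequence (64-128-256, nine 512-channel residual blocks, 256-128-64), not the authors' released code; the number of input channels, padding choices, and output activation are assumptions.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual block with three 3x3 conv layers, each followed by BatchNorm + ReLU."""
    def __init__(self, channels):
        super().__init__()
        layers = []
        for _ in range(3):
            layers += [nn.Conv2d(channels, channels, 3, padding=1),
                       nn.BatchNorm2d(channels), nn.ReLU(inplace=True)]
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return x + self.body(x)

class MGVTONGenerator(nn.Module):
    """ResNet-like generator: 3 downsampling layers, 9 residual blocks, 3 upsampling layers."""
    def __init__(self, in_channels=25, out_channels=3):   # in_channels is an assumption
        super().__init__()
        down, ch = [], in_channels
        for out_ch in (64, 128, 256):                      # three downsampling layers
            down += [nn.Conv2d(ch, out_ch, 3, stride=2, padding=1),
                     nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
            ch = out_ch
        mid = [nn.Conv2d(256, 512, 3, padding=1), nn.BatchNorm2d(512), nn.ReLU(inplace=True)]
        mid += [ResidualBlock(512) for _ in range(9)]      # nine residual blocks at 512 channels
        up, ch = [], 512
        for out_ch in (256, 128, 64):                      # three upsampling layers
            up += [nn.ConvTranspose2d(ch, out_ch, 4, stride=2, padding=1),
                   nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
            ch = out_ch
        self.net = nn.Sequential(*down, *mid, *up,
                                 nn.Conv2d(64, out_channels, 3, padding=1), nn.Tanh())

    def forward(self, x):
        return self.net(x)

def discriminator(in_channels):
    """pix2pixHD-style single-scale discriminator: four 4x4 stride-2 convs
    with InstanceNorm and LeakyReLU, followed by a prediction conv."""
    layers, ch = [], in_channels
    for out_ch in (64, 128, 256, 512):
        layers += [nn.Conv2d(ch, out_ch, 4, stride=2, padding=2),
                   nn.InstanceNorm2d(out_ch), nn.LeakyReLU(0.2, inplace=True)]
        ch = out_ch
    return nn.Sequential(*layers, nn.Conv2d(512, 1, 4, padding=2))

# Optimizer settings quoted from the paper: ADAM, lr = 0.0002, betas = (0.5, 0.999).
G = MGVTONGenerator()
optimizer = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
```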
4.4. Baselines

VITON [8] and CP-VTON [29] are state-of-the-art image-based virtual try-on methods that assume the pose of the person is fixed. Both use a warped clothes image to improve visual quality but lack the ability to generate images under arbitrary poses. In particular, VITON directly applies shape-context matching [2] to compute the transformation mapping, while CP-VTON borrows the idea from [23] and estimates the transformation mapping with a convolutional network. For a fair comparison, we first enrich the inputs of VITON and CP-VTON by adding the target pose, and then retrain them on the MPV dataset with the same train/test splits as our model.

4.5. Quantitative Results

We conduct experiments on two benchmarks and compare against two recent related works using two widely used metrics, SSIM and IS, to evaluate the quality of image synthesis; results are summarized in Table 1 (higher is better). Our proposed method achieves higher scores and consistently outperforms all baselines on both datasets, thanks to the cooperation of the conditional parsing generator, Warp-GAN, and the refinement render. Note that MG-VTON (w/o Render) achieves the best SSIM score and MG-VTON (w/o Mask) achieves the best IS score, but both produce worse visual quality and obtain lower scores in the AMT study than MG-VTON (ours), as illustrated in Table 2 and Figure 6. As shown in Figure 4, MG-VTON (ours) synthesizes more realistic-looking results than MG-VTON (w/o Render) even though the latter achieves a higher SSIM score; a similar observation is made in [11]. Hence, we believe that the proposed MG-VTON can generate high-quality person images for multi-pose virtual try-on with convincing results.

Table 1. Comparisons on MPV and DeepFashion.

Model                   MPV SSIM   MPV IS          DeepFashion IS
VITON [8]               0.639      2.394 ± 0.205   2.302 ± 0.116
CP-VTON [29]            0.705      2.519 ± 0.107   2.459 ± 0.212
MG-VTON (w/o Render)    0.754      2.694 ± 0.119   2.813 ± 0.047
MG-VTON (w/o Mask)      0.733      3.309 ± 0.137   3.368 ± 0.055
MG-VTON (Ours)          0.744      3.154 ± 0.142   3.030 ± 0.057
Figure 5. Some results from our model trained on MPV and tested on DeepFashion; the model synthesizes realistic images and captures the desired pose and clothes well.

Table 2. Pairwise comparison on MPV and DeepFashion. Each cell lists the percentage where our MG-VTON is preferred over the other method. Chance is at 50%.

Dataset       VITON   CP-VTON   MG-VTON (w/o Render)   MG-VTON (w/o Mask)
MPV           83.1%   85.9%     82.4%                  84.6%
DeepFashion   88.9%   83.3%     84.6%                  75.5%

4.6. Qualitative Results

We perform visual comparisons of the proposed method with VITON [8], CP-VTON [29], MG-VTON (w/o Render), and MG-VTON (w/o Mask) in Figure 4, which shows that our model generates reasonable results with convincing details. Although the baseline methods synthesize a few details of the clothes, they are far from practical for the multi-pose virtual try-on scenario. Specifically, the identity and the lower-body clothing cannot be preserved by the baselines, and the lower-body clothing is altered when the upper-body clothing is changed. Furthermore, the baselines cannot synthesize the hairstyle and face well, which results in blur and artifacts. The reason is that they overlook the high-level semantics of the reference image and the relationship between the reference image and the target pose in the virtual try-on task. On the contrary, we adopt a clothes-and-pose-guided network to generate the target human parsing, which helps alleviate the problem that lower-body clothing and hair cannot be preserved. In addition, we carefully design a deep warping network with an adversarial loss to address the loss of identity. Furthermore, we capture the interplay among the poses and present a multi-pose-based refinement network that learns to erase noise and artifacts.

4.7. Human Perceptual Study

We perform a human study on MPV and DeepFashion [38] to evaluate the visual quality of the generated images. Similar to pix2pixHD [30], we deploy A/B tests on the Amazon Mechanical Turk (AMT) platform, using 1,600 images of size 256 × 192. We show three reference images (reference image, clothes, pose) and two synthesized images; the workers are given unlimited time to pick the image that looks more realistic and natural, considering how well the target clothes and pose are captured and whether the identity and appearance of the person are preserved. Specifically, the workers are shown the reference image, target clothes, target pose, and the shuffled image pairs. We collected 8,000 comparisons from 100 unique workers. As illustrated in Table 2, the images synthesized by our model obtain higher human evaluation scores than the baseline methods, indicating high-quality results.
Figure 6. Ablation study on our collected dataset MPV. Zoom in for details.

Figure 7. Effect of the quality of human parsing. The quality of human parsing significantly affects the quality of the synthesized image in the virtual try-on task.

Figure 8. Effect of clothes and pose on the human parsing, which is manipulated by the pose and the clothes.
4.8. Ablation Study

We conduct an ablation study to analyze the important parts of our method. As observed from Table 1, MG-VTON (w/o Mask) achieves the best scores; however, as shown in Figure 4, it inevitably generates artifacts. In Figure 6, we further evaluate the effect of the components of our MG-VTON: the multi-pose composition mask loss, the perceptual loss, and the pose in the refinement render stage, as well as the warping module in Warp-GAN, are all important for improving performance.

We also conduct an experiment to verify the effect of human parsing in our MG-VTON. As shown in Figure 7, there is a positive correlation between the quality of the human parsing and that of the result. We further verify the effect of the synthesized human parsing by manipulating the desired pose and clothes, as illustrated in Figure 8: by manipulating the human parsing instead of the person image directly, we can synthesize the person image in an easier and more effective way. Furthermore, we train on our collected MPV dataset and test on the DeepFashion dataset to verify the generalization of the proposed model. As shown in Figure 5, our model captures the target pose and clothes well.

5. Conclusions

In this work, we make the first attempt to investigate the multi-pose guided virtual try-on system, which enables clothes to be transferred onto a person image under diverse poses. We propose a Multi-pose Guided Virtual Try-on Network (MG-VTON) that generates a new person image after fitting the desired clothes onto the input image and manipulating the human pose. Our MG-VTON decomposes the virtual try-on task into three stages: a human parsing model guides the image synthesis, a Warp-GAN learns to synthesize the realistic image by alleviating the misalignment caused by diverse poses, and a refinement render recovers the texture details. We construct a new dataset for the multi-pose guided virtual try-on task covering person images with greater pose and clothes diversity. Extensive experiments demonstrate that our MG-VTON significantly outperforms all state-of-the-art methods both qualitatively and quantitatively with promising performance.
References

[1] G. Balakrishnan, A. Zhao, A. V. Dalca, F. Durand, and J. Guttag. Synthesizing images of humans in unseen poses. In CVPR, 2018.
[2] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. IEEE TPAMI, 24(4):509–522, 2002.
[3] F. L. Bookstein. Principal warps: Thin-plate splines and the decomposition of deformations. IEEE TPAMI, 11(6):567–585, 1989.
[4] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, 2017.
[5] P. Esser, E. Sutter, and B. Ommer. A variational u-net for conditional appearance and shape generation. In CVPR, 2018.
[6] K. Gong, X. Liang, X. Shen, and L. Lin. Look into person: Self-supervised structure-sensitive learning and a new benchmark for human parsing. In CVPR, 2017.
[7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
[8] X. Han, Z. Wu, Z. Wu, R. Yu, and L. S. Davis. Viton: An image-based virtual try-on network. In CVPR, 2018.
[9] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
[10] N. Jetchev and U. Bergmann. The conditional analogy gan: Swapping fashion articles on people images. ICCVW, 2(6):8, 2017.
[11] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, pages 694–711, 2016.
[12] T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. Learning to discover cross-domain relations with generative adversarial networks. arXiv preprint arXiv:1703.05192, 2017.
[13] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[14] Z. Laehner, D. Cremers, and T. Tung. Deepwrinkles: Accurate and realistic clothing modeling. In ECCV, 2018.
[15] C. Lassner, G. Pons-Moll, and P. V. Gehler. A generative model of people in clothing. In CVPR, 2017.
[16] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
[17] L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, and L. Van Gool. Pose guided person image generation. In NIPS, 2017.
[18] L. Ma, Q. Sun, S. Georgoulis, L. Van Gool, B. Schiele, and M. Fritz. Disentangled person image generation. In CVPR, 2018.
[19] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[20] G. Pons-Moll, S. Pujades, S. Hu, and M. J. Black. Clothcap: Seamless 4d clothing capture and retargeting. ACM Transactions on Graphics (TOG), 36(4):73, 2017.
[21] A. Pumarola, A. Agudo, A. Sanfeliu, and F. Moreno-Noguer. Unsupervised person image synthesis in arbitrary poses. In CVPR, 2018.
[22] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text-to-image synthesis. In ICML, 2016.
[23] I. Rocco, R. Arandjelović, and J. Sivic. Convolutional neural network architecture for geometric matching. In CVPR, 2017.
[24] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241, 2015.
[25] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In NIPS, 2016.
[26] M. Sekine, K. Sugita, F. Perbet, B. Stenger, and M. Nishiyama. Virtual fitting by single-shot body shape estimation. In International Conference on 3D Body Scanning Technologies, pages 406–413, 2014.
[27] A. Siarohin, E. Sangineto, S. Lathuiliere, and N. Sebe. Deformable gans for pose-based human image generation. arXiv preprint arXiv:1801.00055, 2017.
[28] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[29] B. Wang, H. Zhang, X. Liang, Y. Chen, and L. Lin. Toward characteristic-preserving image-based virtual try-on network. In ECCV, 2018.
[30] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In CVPR, 2018.
[31] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. TIP, 13(4):600–612, 2004.
[32] Y. Yan, J. Xu, B. Ni, W. Zhang, and X. Yang. Skeleton-aided articulated motion generation. In ACM MM, 2017.
[33] C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang, and H. Li. High-resolution image inpainting using multi-scale neural patch synthesis. In CVPR, 2017.
[34] Z. Yi, H. Zhang, P. Tan, and M. Gong. Dualgan: Unsupervised dual learning for image-to-image translation. arXiv preprint, 2017.
[35] C. Zhang, S. Pujades, M. J. Black, and G. Pons-Moll. Detailed, accurate, human shape estimation from clothed 3d scan sequences. In CVPR, volume 2, page 3, 2017.
[36] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.
[37] S. Zhu, S. Fidler, R. Urtasun, D. Lin, and C. C. Loy. Be your own prada: Fashion synthesis with structural coherence. In ICCV, 2017.
[38] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In CVPR, pages 1096–1104, 2016.

Figure 9. Test results of our MG-VTON on the MPV dataset.



Figure 10. Test results of our MG-VTON trained on the MPV dataset and tested on the DeepFashion dataset.
