Towards Multi-pose Guided Virtual Try-on Network
Haoye Dong1 , Xiaodan Liang1 , Bochao Wang1 , Hanjiang Lai1 , Jia Zhu2 , Jian Yin1
1
Sun Yat-sen University, 2 South China Normal University
{donghy7@mail2, wangboch@mail2, laihanj3@mail, issjyin@mail}.sysu.edu.cn,
[email protected], [email protected]
1. Introduction
to solve those issues, since 3D information carries abundant details of the body shape that help to generate realistic results. However, building the 3D models needs expert knowledge and huge labor cost, requiring 3D-annotated data collection and massive computation. These costs and complexity would limit the applications of practical virtual try-on simulation.

In this paper, we study the problem of virtual try-on conditioned on 2D images and arbitrary poses, which aims to learn a mapping function from an input image of a person to another image of the same person with a new outfit and a diverse pose, by manipulating the target clothes and pose. Although image-based virtual try-on with a fixed pose has been widely studied [8, 29, 37], the task of multi-pose virtual try-on is less explored. In addition, without modeling the intricate interplay among the appearance, the clothes, and the pose, directly using existing virtual try-on methods to synthesize images under different poses often results in blurry images and artifacts.

Targeting the problems mentioned above, we propose a novel Multi-pose Guided Virtual Try-on Network (MG-VTON) that can generate a new person image after fitting the desired clothes onto the input image and manipulating human poses. Our MG-VTON is a multi-stage framework with generative adversarial learning. Concretely, we design a pose-clothes-guided human parsing network to estimate a plausible human parsing of the target image conditioned on the approximate body shape, the face mask, the hair mask, the desired clothes, and the target pose, which guides the synthesis effectively with precise regions of body parts. To seamlessly fit the desired clothes on the person, we warp the desired clothes image by exploiting a geometric matching model to estimate the transformation parameters between the mask of the input clothes image and the mask of the synthesized clothes extracted from the synthesized human parsing. In addition, we design a deep Warping Generative Adversarial Network (Warp-GAN) to synthesize the coarse result, alleviating the large misalignment caused by different poses and the diversity of clothes. Finally, we present a refinement network that utilizes multi-pose composition masks to recover the texture details and alleviate the artifacts caused by the large misalignment between the reference pose and the target pose.

To demonstrate our model, we collected a new dataset, named MPV, containing various clothes images and person images of the same person in diverse poses. In addition, we also conduct experiments on the DeepFashion [38] dataset for testing. Following the evaluation protocol of [30], we conduct a human subjective study on the Amazon Mechanical Turk (AMT) platform. Both quantitative and qualitative results indicate that our method achieves effective performance and high-quality images with appealing details. The main contributions are listed as follows:

• A new task of virtual try-on conditioned on multiple poses is proposed, which aims to restructure the person image by manipulating both diverse poses and clothes.

• We propose a novel Multi-pose Guided Virtual Try-on Network (MG-VTON) that generates a new person image after fitting the desired clothes onto the input person image and manipulating human poses. MG-VTON contains four modules: 1) a pose-clothes-guided human parsing network is designed to guide the image synthesis; 2) a Warp-GAN learns to synthesize realistic images by using a warping-features strategy; 3) a refinement network learns to recover the texture details; 4) a mask-based geometric matching network is presented to warp clothes, which enhances the visual quality of the generated image.

• A new dataset for the multi-pose guided virtual try-on task is collected, which covers person images with greater diversity of poses and clothes. Extensive experiments demonstrate that our approach achieves competitive quantitative and qualitative results.

2. Related Work

Generative Adversarial Networks (GANs). A GAN [7] consists of a generator and a discriminator: the generator aims to generate realistic images that are indistinguishable from real ones and thereby fool the discriminator, while the discriminator learns to distinguish the synthesized images from the real images. Existing works have applied GANs to various tasks, such as style transfer [9, 36, 12, 34], image inpainting [33], text-to-image synthesis [22], and super-resolution imaging [16]. Inspired by those impressive results, we also apply an adversarial loss to build our virtual try-on method with GANs.

Person image synthesis. Skeleton-aided [32] proposed a skeleton-guided person image generation method conditioned on a person image and target skeletons. PG2 [17] applied a coarse-to-fine framework that consists of a coarse stage and a refinement stage, and its authors further proposed a novel model [18] that improves the quality of results by using a decomposition strategy. The deformable GANs [27] and [1] attempted to alleviate the misalignment between different poses by applying affine transformations to coarse rectangular regions and by warping parts at the pixel level, respectively. V-UNET [5] introduced a variational U-Net [24] to synthesize the person image by restructuring the shape with a stickman label. [21] applied CycleGAN [36] directly to manipulate pose. However, all those works fail to keep the texture details consistent with the pose, because they do not consider the interplay between the human parsing map and the pose in person image synthesis. The human parsing map can guide the generator to synthesize images at a precise region level, ensuring the coherence of the body structure.
Figure 2. The overview of the proposed MG-VTON. Stage I: We first decompose the reference image into three binary masks. Then, we concatenate them with the target clothes and target pose as the input of the conditional parsing network to predict the human parsing map. Stage II: Next, we warp the clothes, remove the clothing from the reference image, and concatenate these with the target pose and the synthesized parsing to generate the coarse result with the Warp-GAN. Stage III: We finally refine the coarse result with a refinement render, conditioned on the warped clothes, the target pose, and the coarse result.
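To make the three-stage flow of Figure 2 concrete, here is a minimal Python sketch of the composition. The stage callables (`parsing_net`, `warp_clothes`, `remove_clothes`, `warp_gan`, `refine_net`) are hypothetical stand-ins for the modules described in Sections 3.1-3.4, not the authors' code; only the order of composition comes from the paper.

```python
def mg_vton(stages, reference_img, target_clothes, target_pose, masks):
    """Three-stage MG-VTON flow of Figure 2. `stages` maps names to
    hypothetical callables standing in for the modules of Sections 3.1-3.4."""
    # Stage I: predict the target human parsing map from the hair/face/
    # body-shape masks, the target clothes, and the target pose.
    parsing = stages["parsing_net"](masks, target_clothes, target_pose)
    # Stage II: warp the clothes, strip clothing from the reference image,
    # and synthesize the coarse result with the Warp-GAN.
    warped = stages["warp_clothes"](target_clothes, parsing, masks["body_shape"])
    bare = stages["remove_clothes"](reference_img)
    coarse = stages["warp_gan"](warped, bare, target_pose, parsing)
    # Stage III: refine, conditioned on warped clothes, pose, coarse result.
    return stages["refine_net"](warped, coarse, target_pose)
```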
Virtual try-on. VITON [8] and CP-VTON [29] both presented image-based virtual try-on networks that transfer a desired clothes item onto the person by using a warping strategy: VITON computed the transformation mapping directly with shape-context TPS warping [2], while CP-VTON introduced a learning method to estimate the transformation parameters. FashionGAN [37] learned to generate new clothes on the input image of the person conditioned on a sentence describing a different outfit. However, all the above methods synthesize the image of a person only in a fixed pose, which limits their applications in practical virtual try-on simulation. ClothNet [15] presented an image-based generative model to produce new clothes conditioned on color. CAGAN [10] proposed a conditional analogy network to synthesize person images conditioned on pairs of clothes, which limits practical virtual try-on scenarios. In order to generate realistic-looking person images in different clothes, ClothCap [20] utilized a 3D scanner to capture the clothes and the body shape automatically. [26] presented a virtual fitting system that requires the 3D body shape, whose annotation is laborious to collect. In this paper, we introduce a novel and effective method that learns to synthesize an image with a new outfit on the person through adversarial learning and can manipulate the pose simultaneously.

3. MG-VTON

We propose a novel Multi-pose Guided Virtual Try-on Network (MG-VTON) that learns to synthesize a new person image for virtual try-on by manipulating both clothes and pose. Given an input person image, a desired clothes image, and a desired pose, the proposed MG-VTON aims to produce a new image of the person wearing the desired clothes in the desired pose. Inspired by the coarse-to-fine idea [8, 17], we adopt an outline-coarse-fine strategy that divides this task into three subtasks: conditional parsing learning, the Warp-GAN, and the refinement render. Figure 2 illustrates the overview of MG-VTON.

We first apply the pose estimator [4] to estimate the pose and encode it as 18 heatmaps, each filled with ones inside a circle of radius 4 pixels and zeros elsewhere. A human parser [6] is used to predict the human segmentation map, consisting of 20 labels, from which we extract the binary masks of the face, the hair, and the body shape. Following VITON [8], we downsample the body shape to a lower resolution (16 × 12) and directly resize it back to the original resolution (256 × 192), which alleviates the artifacts caused by the variety of body shapes.
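As a concrete illustration of this preprocessing, the snippet below encodes 18 keypoints as radius-4 disk heatmaps and applies the downsample-then-resize blurring of the body shape. It is a sketch under the assumption that keypoints arrive as (x, y) pixel coordinates, with OpenCV used only for resizing.

```python
import numpy as np
import cv2  # assumed available; used only for resizing

def encode_pose(keypoints, h=256, w=192, radius=4):
    """Encode 18 (x, y) keypoints as binary heatmaps, filled with ones
    inside a circle of the given radius and zeros elsewhere."""
    heatmaps = np.zeros((18, h, w), dtype=np.float32)
    ys, xs = np.mgrid[0:h, 0:w]
    for i, (x, y) in enumerate(keypoints):
        if x < 0 or y < 0:  # convention: negative coords mean undetected joint
            continue
        heatmaps[i] = (xs - x) ** 2 + (ys - y) ** 2 <= radius ** 2
    return heatmaps

def coarse_body_shape(body_mask):
    """Downsample the binary body-shape mask to 16 x 12 and resize it
    straight back to 256 x 192, blurring out fine shape details."""
    small = cv2.resize(body_mask, (12, 16), interpolation=cv2.INTER_AREA)
    return cv2.resize(small, (192, 256), interpolation=cv2.INTER_LINEAR)
```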
Figure 3. The network architecture of the proposed MG-VTON. (a)(b): The conditional parsing learning module consists of a pose-clothes-guided network that predicts the human parsing map, which helps to generate a high-quality person image. (c)(d): The Warp-GAN learns to generate a realistic image by warping features, countering the misalignment caused by the diversity of poses. (e): The refinement render network learns a pose-guided composition mask that enhances the visual quality of the synthesized image. (f): The geometric matching network learns to estimate the transformation mapping conditioned on the body shape and the clothes mask.
3.1. Conditional Parsing Learning

To preserve the structural coherence of the person while manipulating both the clothes and the pose, we design a pose-clothes-guided human parsing network, conditioned on the image of clothes, the pose heatmap, the approximated body shape, the face mask, and the hair mask. As shown in Figure 4, the baseline methods fail to preserve some parts of the person (e.g., the color of the trousers and the style of the hair are replaced) because they feed the images of the person and the clothes into the model directly. In this work, we leverage human parsing maps to address those problems, which helps the generator to synthesize high-quality images at the part level.

Formally, given an input image of a person $I$, an input image of clothes $C$, and the target pose $P$, this stage learns to predict the human parsing map $S_t'$ conditioned on the clothes $C$ and the pose $P$. As shown in Figure 3 (a), we first extract the hair mask $M_h$, the face mask $M_f$, the body shape $M_b$, and the target pose $P$ by using a human parser [6] and a pose estimator [4], respectively. We then concatenate them with the image of clothes as the input, which is fed into the conditional parsing network. The inference of $S_t'$ can be formulated as maximizing the posterior probability $p(S_t' \mid (M_h, M_f, M_b, C, P))$. Furthermore, this stage is based on the conditional generative adversarial network (CGAN) [19], which generates promising results on image manipulation. Thus, the posterior probability is expressed as:

$$p(S_t' \mid (M_h, M_f, M_b, C, P)) = G(M_h, M_f, M_b, C, P). \quad (1)$$

We adopt a ResNet-like network as the generator $G$ to build the conditional parsing model, and adopt the discriminator $D$ directly from pix2pixHD [30]. We apply an L1 loss to further improve the performance, which is advantageous for generating smoother results [32]. Inspired by LIP [6], we apply a pixel-wise softmax loss to encourage the generator to synthesize high-quality human parsing maps. Therefore, the problem of conditional parsing learning can be formulated as:

$$
\begin{aligned}
\min_G \max_D F(G, D) ={}& \; \mathbb{E}_{M,C,P \sim p_{\mathrm{data}}}\big[\log(1 - D(G(M,C,P), M, C, P))\big] \\
&+ \mathbb{E}_{S_t,M,C,P \sim p_{\mathrm{data}}}\big[\log D(S_t, M, C, P)\big] \\
&+ \mathbb{E}_{S_t,M,C,P \sim p_{\mathrm{data}}}\big[\lVert S_t - G(M,C,P) \rVert_1\big] \\
&+ \mathbb{E}_{S_t,M,C,P \sim p_{\mathrm{data}}}\big[L_{\mathrm{parsing}}(S_t, G(M,C,P))\big],
\end{aligned} \quad (2)
$$

where $M$ denotes the concatenation of $M_h$, $M_f$, and $M_b$; $L_{\mathrm{parsing}}$ denotes the pixel-wise softmax loss [6]; $S_t$ denotes the ground-truth human parsing; and $p_{\mathrm{data}}$ represents the distribution of the real data.
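For readers who prefer code, a hedged PyTorch sketch of the generator-side terms of Eq. (2) follows. The non-saturating BCE form of the adversarial term and the equal term weights are our assumptions, and `G`, `D` are placeholder networks rather than the paper's exact architectures.

```python
import torch
import torch.nn.functional as F

def parsing_generator_loss(G, D, M, C, P, S_t, num_labels=20):
    """Generator-side terms of Eq. (2): fool the discriminator, match the
    ground-truth parsing under L1, and apply the pixel-wise softmax loss.
    Equal term weights are an assumption, not from the paper."""
    S_fake = G(M, C, P)                  # (N, 20, H, W) parsing logits
    d_fake = D(S_fake, M, C, P)          # discriminator score for the fake
    adv = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    target = F.one_hot(S_t, num_labels).permute(0, 3, 1, 2).float()
    l1 = F.l1_loss(S_fake, target)       # ||S_t - G(M, C, P)||_1 term
    parsing = F.cross_entropy(S_fake, S_t)  # pixel-wise softmax loss
    return adv + l1 + parsing
```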
3.2. Warp-GAN

Since pixel misalignment leads to blurry results [27], we introduce a deep Warping Generative Adversarial Network (Warp-GAN) that warps the desired clothes appearance into the synthesized human parsing map, alleviating the misalignment problem between the input human pose and the desired human pose. Different from deformable GANs [27] and [1], we warp the feature map from the bottleneck layer by using both affine and TPS (Thin-Plate Spline) [3] transformations, rather than processing pixels directly with an affine transformation only. Thanks to the generalization capacity of [23], we directly use its pre-trained model to estimate the transformation mapping between the reference parsing and the synthesized parsing, and then warp the clothes-removed reference image with this transformation mapping.

As illustrated in Figure 3 (c) and (d), the proposed deep warping network consists of the Warp-GAN generator $G_{\mathrm{warp}}$ and the Warp-GAN discriminator $D_{\mathrm{warp}}$. We use the geometric matching module to warp the clothes image, as described in Section 3.4. Formally, we take the warped clothes image $C_w$, the clothes-removed reference image $I_{\mathrm{w/o\,clothes}}$, the target pose $P$, and the synthesized human parsing $S_t'$ as the input of the Warp-GAN generator and synthesize the result $\hat{I} = G_{\mathrm{warp}}(C_w, I_{\mathrm{w/o\,clothes}}, P, S_t')$. Inspired by [11, 8, 16], we apply a perceptual loss to measure the distance between high-level features in a pre-trained model, which encourages the generator to synthesize high-quality and realistic-looking images. We formulate the perceptual loss as:

$$L_{\mathrm{perceptual}}(\hat{I}, I) = \sum_{i=0}^{n} \alpha_i \lVert \phi_i(\hat{I}) - \phi_i(I) \rVert_1, \quad (3)$$

where $\phi_i(I)$ denotes the $i$-th ($i = 0, 1, 2, 3, 4$) layer feature map of the ground-truth image $I$ in the pre-trained network $\phi$. We use the pre-trained VGG19 [28] as $\phi$ and take a weighted sum of the L1 norms over the last five layer feature maps of $\phi$ to represent the perceptual loss between images; $\alpha_i$ controls the weight of the loss for each layer. In addition, following pix2pixHD [30], since feature maps at different scales from different layers of the discriminator enhance the performance of image synthesis, we also introduce a feature-matching loss, formulated as:

$$L_{\mathrm{feature}}(\hat{I}, I) = \sum_{i=0}^{n} \gamma_i \lVert F_i(\hat{I}) - F_i(I) \rVert_1, \quad (4)$$

where $F_i(I)$ represents the $i$-th ($i = 0, 1, 2$) layer feature map of the trained $D_{\mathrm{warp}}$, and $\gamma_i$ denotes the weight of the L1 loss for the corresponding layer.

Furthermore, we also apply the adversarial loss $L_{\mathrm{adv}}$ [7, 19] and an L1 loss [32] to improve the performance. We design a weighted sum of these losses as the loss of $G_{\mathrm{warp}}$, which encourages $G_{\mathrm{warp}}$ to synthesize realistic and natural images in different aspects:

$$L_{G_{\mathrm{warp}}} = \lambda_1 L_{\mathrm{adv}} + \lambda_2 L_{\mathrm{perceptual}} + \lambda_3 L_{\mathrm{feature}} + \lambda_4 L_1, \quad (5)$$

where $\lambda_i$ ($i = 1, 2, 3, 4$) denotes the weight of the corresponding loss.
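A minimal PyTorch sketch of Eqs. (3)-(5) is given below. Here `vgg_feats` and `disc_feats` are assumed callables that return lists of layer activations (the VGG19 features $\phi_i$ and the discriminator features $F_i$, respectively), and the default λ values are illustrative, not the paper's.

```python
import torch
import torch.nn.functional as F

def warp_gan_generator_loss(I_hat, I, d_fake, vgg_feats, disc_feats,
                            alphas, gammas, lambdas=(1.0, 1.0, 1.0, 1.0)):
    """Weighted loss of Eq. (5). `vgg_feats`/`disc_feats` are assumed to
    return lists of feature maps; `d_fake` is the discriminator score of
    the synthesized image I_hat."""
    # Eq. (3): perceptual loss over VGG19 layer features phi_i.
    perceptual = sum(a * F.l1_loss(fh, f) for a, fh, f
                     in zip(alphas, vgg_feats(I_hat), vgg_feats(I)))
    # Eq. (4): feature-matching loss over discriminator layers F_i.
    feature = sum(g * F.l1_loss(fh, f) for g, fh, f
                  in zip(gammas, disc_feats(I_hat), disc_feats(I)))
    adv = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    l1 = F.l1_loss(I_hat, I)
    lam1, lam2, lam3, lam4 = lambdas
    return lam1 * adv + lam2 * perceptual + lam3 * feature + lam4 * l1
```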
3.3. Refinement render

In the coarse stage, the identity information and the shape of the person are preserved, but the texture details are lost due to the complexity of the clothes image. Pasting the warped clothes onto the target person directly may generate artifacts. Learning a composition mask between the warped clothes image and the coarse result also generates artifacts [8, 29] due to the diversity of poses. To solve these issues, we present a refinement render that utilizes a multi-pose composition mask to recover the texture details and remove artifacts.

Formally, we define $C_w$ as the image of warped clothes obtained by the geometric matching learning module, $\hat{I}_c$ as the coarse result generated by the Warp-GAN, $P$ as the target pose heatmap, and $G_p$ as the generator of the refinement render. As illustrated in Figure 3 (e), taking $C_w$, $\hat{I}_c$, and $P$ as input, $G_p$ learns to predict a multi-pose composition mask and synthesize the rendered result:

$$\hat{I}_p = G_p(C_w, \hat{I}_c, P) \odot C_w + \big(1 - G_p(C_w, \hat{I}_c, P)\big) \odot \hat{I}_c, \quad (6)$$

where $\odot$ denotes element-wise matrix multiplication. We also adopt the perceptual loss to enhance the performance, so the objective function of $G_p$ can be written as:

$$L_p = \mu_1 L_{\mathrm{perceptual}}(\hat{I}_p, I) + \mu_2 \lVert 1 - G_p(C_w, \hat{I}_c, P) \rVert_1, \quad (7)$$

where $\mu_1$ denotes the weight of the perceptual loss and $\mu_2$ denotes the weight of the mask loss.
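Eq. (6) is a per-pixel convex blend, sketched below; `G_p` is assumed to output a single-channel mask in [0, 1], e.g. via a final sigmoid. Eq. (7) then trains this mask jointly: the perceptual term keeps the blend close to the ground truth, while the $\lVert 1 - \text{mask} \rVert_1$ term biases the mask toward keeping the warped clothes.

```python
def refine(G_p, C_w, I_coarse, P):
    """Eq. (6): blend the warped clothes and the coarse result with the
    predicted multi-pose composition mask. G_p is assumed to emit a
    single-channel mask in [0, 1] (e.g., after a sigmoid)."""
    mask = G_p(C_w, I_coarse, P)
    return mask * C_w + (1.0 - mask) * I_coarse
```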
3.4. Geometric matching learning

Inspired by [23], we adopt a convolutional neural network to learn the transformation parameters, comprising feature extraction layers, feature matching layers, and transformation-parameter estimation layers. As shown in Figure 3 (f), we take the mask of the clothes image and the mask of the body shape as input, which are first passed through the feature extraction layers. Then, we predict the correlation map by using the matching layers. Finally, we apply a regression network to estimate the TPS (Thin-Plate Spline) [3] transformation parameters for the clothes image directly from the correlation map.

Formally, given an input image of clothes $C$ and its mask $C_{\mathrm{mask}}$, following the conditional parsing learning stage, we obtain the approximated body shape $M_b$ and the synthesized clothes mask $\hat{C}_{\mathrm{mask}}$ from the synthesized human parsing. This subtask aims to learn a transformation mapping function $T$ with parameters $\theta$ for warping the input image of clothes $C$. Since the synthesized clothes themselves are unseen but their mask is available, we learn the mapping between the original clothes mask $C_{\mathrm{mask}}$ and the synthesized clothes mask $\hat{C}_{\mathrm{mask}}$, subject to the body shape $M_b$. Thus, the objective function of geometric matching learning can be formulated as:

$$L_{\mathrm{geo\_matching}}(\theta) = \lVert T_\theta(C_{\mathrm{mask}}) - \hat{C}_{\mathrm{mask}} \rVert_1. \quad (8)$$

The warped clothes can then be obtained as $C_w = T_\theta(C)$, which helps address the misalignment problem and supports learning the composition mask in Sections 3.2 and 3.3 above.
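The training step for Eq. (8) reduces to regressing TPS parameters and penalizing an L1 mask mismatch. In the sketch below, `matcher` and `tps_warp` are hypothetical stand-ins for the feature-extraction/matching/regression architecture of [23] and its TPS warping operator.

```python
import torch.nn.functional as F

def geo_matching_step(matcher, tps_warp, C_mask, M_b, C_mask_synth, optimizer):
    """One optimization step of Eq. (8): regress TPS parameters theta and
    penalize the L1 distance between the warped original clothes mask and
    the clothes mask extracted from the synthesized parsing."""
    theta = matcher(C_mask, M_b)            # feature extract/match/regress
    warped_mask = tps_warp(C_mask, theta)   # T_theta(C_mask)
    loss = F.l1_loss(warped_mask, C_mask_synth)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```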
4. Experiments

In this section, we first make visual comparisons with other methods and then discuss the results quantitatively. We also conduct a human perceptual study and an ablation study, and further train our model on our newly collected MPV dataset and test it on DeepFashion to verify its generalization capacity.
Figure 4. Visual comparison with other methods on our collected MPV dataset. MG-VTON (w/o Render) denotes the model with the refinement render removed; MG-VTON (w/o Mask) denotes the model with the multi-pose composition mask removed.
Figure 6. Ablation study on our collected dataset MPV. Zoom in for details.
Figure 10. Test results of our MG-VTON, trained on the MPV dataset and tested on the DeepFashion dataset.