
C-VTON: Context-Driven Image-Based Virtual Try-On Network

Benjamin Fele 1,2, Ajda Lampe 1,2, Peter Peer 2, Vitomir Štruc 1

1 Faculty of Electrical Engineering, 2 Faculty of Computer and Information Science
University of Ljubljana, SI-1000 Ljubljana, Slovenia
{benjamin.fele, vitomir.struc}@fe.uni-lj.si, {ajda.lampe, peter.peer}@fri.uni-lj.si

Figure 1. Example results generated with the proposed Context-Driven Virtual Try-On Network (C-VTON). The original input image is
shown in the upper left corner of each example and the target clothing in the lower left. Note that C-VTON generates convincing results
even with subjects in difficult poses and realistically reconstructs on-shirt graphics.

Abstract

Image-based virtual try-on techniques have shown great promise for enhancing the user experience and improving customer satisfaction on fashion-oriented e-commerce platforms. However, existing techniques are currently still limited in the quality of the try-on results they are able to produce from input images of diverse characteristics. In this work, we propose a Context-Driven Virtual Try-On Network (C-VTON) that addresses these limitations and convincingly transfers selected clothing items to the target subjects even under challenging pose configurations and in the presence of self-occlusions. At the core of the C-VTON pipeline are: (i) a geometric matching procedure that efficiently aligns the target clothing with the pose of the person in the input images, and (ii) a powerful image generator that utilizes various types of contextual information when synthesizing the final try-on result. C-VTON is evaluated in rigorous experiments on the VITON and MPV datasets and in comparison to state-of-the-art techniques from the literature. Experimental results show that the proposed approach is able to produce photo-realistic and visually convincing results and significantly improves on the existing state-of-the-art.

1. Introduction

In the age of online shopping, virtual try-on technology is becoming increasingly important and a considerable amount of research effort is being directed towards this area as a result [2]. Especially appealing here are image-based virtual try-on solutions that do not require specialized hardware and dedicated imaging equipment, but are applicable with standard intensity images, e.g., [3, 4, 7, 8, 12]. As illustrated in Figure 1, the goal of such solutions is to replace a piece of clothing in an input image with a target garment as realistically as possible. This allows for the design of virtual fitting rooms that let consumers try on clothes without visiting (brick-and-mortar) stores, and also benefits retailers by reducing product returns and shipping costs [1, 2, 20].

Existing solutions to image-based virtual try-on are dominated by two-stage approaches (and their extensions) that typically include: (i) a geometric matching stage that aligns the target clothing to the pose of the person in the input image, and estimates an approximate position and shape of the target garment in the final try-on result, and (ii) an image synthesis stage that uses dedicated generative models (e.g., Generative Adversarial Networks (GANs) [9]), together with various refinement strategies to synthesize the final try-on image based on the aligned clothing and different auxiliary sources of information, e.g., pose keypoints, parsed body parts, or clothing annotations, among others [3, 12, 35]. To further improve on this overall framework, various enhancements have also been proposed, including: (i) refinements of the data fed to the geometric matching stage [23, 35, 36], (ii) integration of clothing segmentations into the synthesis procedure [35, 36], and (iii) the use of knowledge distillation schemes to minimize the impact of parser-related errors [8, 16].
While the outlined advances greatly improved the quality of the generated try-on results, the loss of details on the transferred garments that is often a consequence of difficulties with the geometric matching stage still represents a major challenge to image-based virtual try-on solutions [4, 12, 34]. Additionally, poor-quality (human/clothing) parsing results typically still lead to unconvincing try-on images with garment textures being placed over incorrect body parts. Although recent (distilled) parser-free models, e.g., [16, 8], address this issue to some degree, they still inherit the main characteristics of the teacher models (including parsing issues) and often struggle with the generation of realistic body parts, such as hands or arms.

In this paper, we propose a Context-Driven Virtual Try-On Network (C-VTON) that aims to address these issues. To improve the quality of the generated on-garment textures and logotypes, we design a novel geometric matching module that conditions the pose-matching procedure on body segmentations only, and, therefore, minimizes the dependence on multiple (potentially error-prone) pre-processing steps. Additionally, we formulate learning objectives for the module's training procedure that penalize the appearance of the aligned clothing solely within the body area (while ignoring other body parts) to ensure that challenging pose configurations and self-occlusions (e.g., from hands) do not adversely affect performance. This design leads to realistic virtual try-on results with convincing details, as also shown in Figure 1. Finally, we develop a powerful context-aware image generator (CAG) that utilizes contextual cues in addition to the warped clothing to steer the synthesis process. The generator is designed as a standard residual network, but relies on conditional (context-dependent) normalization operations (akin to SPADE layers [24]) to ensure the contextual information is considered to a sufficient degree when generating the virtual try-on result. In summary, we make the following main contributions in this paper:

• We propose a novel image-based approach to virtual try-on, named Context-Driven Virtual Try-On Network (C-VTON), that produces state-of-the-art results with input images of diverse characteristics.

• We design a simplified geometric matching module, termed Body-Part Geometric Matcher (BPGM), capable of producing accurate garment transformations even with subjects in challenging poses and arm configurations.

• We introduce a Context-Aware Generator (CAG) that allows for the synthesis of high-quality try-on results by making use of various sources of contextual information.

2. Related work

Image-based virtual try-on techniques have recently appeared as an appealing alternative to traditional try-on solutions that rely on 3D modeling and dedicated computer-graphics pipelines [10, 25, 26, 30]. The pioneering work from [17, 27], for example, approached the virtual try-on task as an image analogy problem with promising results. However, due to the lack of explicit (clothing) deformation modelling, the generated images exhibited only limited photo realism. To address this shortcoming, Han et al. [12] proposed a two-stage approach, named VITON, that used a coarse-to-fine image generation strategy and utilized a Thin-Plate Spline (TPS) transformation [6] to align the image of the desired clothing with the pose of the target subject. Wang et al. [34] improved on this approach with CP-VTON, which introduced a Geometric Matching Module (GMM) that allowed the TPS clothing transformations to be learned in an end-to-end manner (similarly to [28]) and led to impressive try-on results. Follow-up work further refined the geometric matching stage using various mechanisms. CP-VTON+ [23], for instance, improved the human mask fed to the GMM, VTNFP [36] designed an elaborate person representation as the input to the GMM, whereas LA-VITON [21] and ACGPN [35] introduced additional transformation constraints when training their warping/matching modules. With C-VTON we follow the outlined body of work and design a novel matching module based on simplified inputs that can be estimated reliably even in the presence of considerable appearance variability. We achieve this by conditioning the module on body parts only and leveraging the power of recent human-parsing models.

Several solutions have also been presented in the literature to improve the quality of the generated try-on results during the image synthesis stage. MG-VTON [4], ACGPN [35] and VITON-HD [3], for example, proposed using secondary neural networks that generate clothing segmentations matching the target garment and utilizing these as additional sources of information for the generator. S-WUTON [16] and PF-AFN [8] employed a teacher-student knowledge distillation scheme to alleviate the need for error-prone (intermediate) processing steps that often contribute to difficulties with existing try-on approaches. FE-GAN [5] and VITON-HD [3] followed recent developments in image synthesis [24] and introduced generators with conditional normalization layers to help with the quality and realism of the synthesized try-on results. Similarly to these techniques, C-VTON also uses an advanced image generator with conditional normalization layers in the synthesis step, but capitalizes on the use of contextual information to steer the generation process. Furthermore, three powerful discriminators are employed in an adversarial training procedure to make full use of the available contextual information and improve the realism of the generated results.
Figure 2. Overview of the proposed Context-Driven Virtual Try-On Network (C-VTON) that given an input image of a subject I and some target clothing C generates a visually convincing virtual try-on image I_C. C-VTON is designed as a two-stage pipeline comprising a Body-Part Geometric Matcher (BPGM) that pre-aligns the target clothing C with the pose of the subject in I and a Context-Aware Generator (CAG) that generates the final try-on image I_C based on the warped clothing and other sources of contextual information.

3. Context-Driven Virtual Try-On Network

We propose a Context-Driven Virtual Try-On Network (C-VTON) that relies on robust pose matching and contextualized synthesis to generate visually convincing virtual try-on results. Formally, given an input image of a subject, I ∈ R^{w×h×3}, and a reference image of some target clothing, C ∈ R^{w×h×3}, the goal of the model is to synthesize a photo-realistic output image, I_C ∈ R^{w×h×3}, of the subject from I wearing the target clothing C (see Figure 1).
As illustrated in Figure 2, C-VTON is designed as a two-stage pipeline with two main components: (i) a Body-Part Geometric Matcher (BPGM) that warps the reference image of the target clothing C to match the pose of the subject in the input image I, and (ii) a Context-Aware Generator (CAG) that uses the output of the BPGM together with various sources of contextual information to generate the final (virtual) try-on result I_C. Details on the two components are given in the following sections.
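To make the data flow of the two-stage pipeline concrete, the following sketch outlines an inference pass in PyTorch under the assumptions introduced above. The module objects `bpgm` and `cag`, the `bpgm.warp` helper and all tensor names are hypothetical stand-ins for the components described in Sections 3.1 and 3.2, not the interfaces of the released implementation.

```python
import torch

# Assumed tensor shapes (batch size B, image size h x w):
#   I:  (B, 3, h, w)   input image of the subject
#   C:  (B, 3, h, w)   reference image of the target clothing
#   S:  (B, 25, h, w)  one-hot DensePose body-part segmentation
#   Mc: (B, 1, h, w)   binary mask of the clothing area in I

def try_on(bpgm, cag, I, C, S, Mc):
    """Two-stage C-VTON inference pass (illustrative only)."""
    # Stage 1: estimate TPS parameters from (C, S) and warp the target clothing.
    theta = bpgm(C, S)                      # (B, 2 * n * n) grid offsets
    C_w = bpgm.warp(C, theta)               # warped clothing C_w

    # Stage 2: assemble the image context and synthesize the try-on result.
    I_m = I * Mc                            # I ⊙ M_c, as defined in Section 3.2
    image_context = torch.cat([S, I_m, C, C_w], dim=1)
    I_C = cag(image_context)                # final try-on image
    return I_C
```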
3.1. The Body-Part Geometric Matcher (BPGM)

The first stage of the C-VTON pipeline consists of the proposed BPGM, and is responsible for estimating the parameters of a Thin-Plate Spline (TPS) [6] transformation that is used to align the target clothing with the pose of the person in the input image I. This allows for approximate positional matching of the target garment and helps to make the task of the generator in the next stage easier. As shown in Figure 3, the BPGM takes a reference image of the target clothing C and body segmentations S as input, and then produces a warped version C_w of the target clothing at the output. The body segmentations S ∈ {0,1}^{w×h×d} are generated using the DensePose model from [11] and contain d = 25 channels (classes), each corresponding to a different body part. Compared to other virtual try-on architectures that utilize complex clothing-agnostic person representations, e.g., [3, 16, 34], to obtain geometric clothing transformations, BPGM relies on body-part segmentations only, which are sufficient for reliably matching target garments to person images, as we show in our experiments.

Figure 3. Overview of the Body-Part Geometric Matcher (BPGM). The BPGM architecture is shown on the left and the training losses designed to preserve on-garment textures and overall garment shape are on the right. Unlike competing solutions, BPGM estimates the warping function based on body-part locations only, leading to robust performance even in the presence of challenging poses, e.g., with crossed arms, arms occluding the body, etc.
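As a small illustration of the expected input format, the snippet below converts a per-pixel body-part index map (assumed to be precomputed with a DensePose-style predictor) into the d = 25 binary channels of S; the function name and the choice of a dense index map as the starting point are assumptions.

```python
import torch
import torch.nn.functional as F

def to_body_segmentation(part_index_map: torch.Tensor, d: int = 25) -> torch.Tensor:
    """Convert a (h, w) map of body-part indices in {0, ..., d-1}
    into a one-hot tensor S of shape (d, h, w) with values in {0, 1}."""
    one_hot = F.one_hot(part_index_map.long(), num_classes=d)  # (h, w, d)
    return one_hot.permute(2, 0, 1).float()                    # (d, h, w)
```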
BPGM Architecture. The proposed BPGM follows the design of the Geometric Matching Module (GMM) from [34] and consists of two distinct encoders, E_1 and E_2. The first takes the target clothing C as input and generates a corresponding feature representation ψ_{E_1} ∈ R^{w_f×h_f×d_f}. Similarly, the second encoder accepts the body segmentations S at the input and produces a feature representation ψ_{E_2} ∈ R^{w_f×h_f×d_f} at the output. Here, w_f and h_f represent spatial dimensions that depend on the depth of the encoders and d_f denotes the number of output channels. Next, the feature representations are normalized channel-wise to unit L2 norm, spatially flattened and organized into a matrix Ψ_{E_1,E_2} ∈ R^{d_f × w_f h_f}, which serves as the basis for computing the correlation matrix Corr [28], i.e.:

Corr = \Psi_{E_1}^\top \Psi_{E_2} \in \mathbb{R}^{(w_f h_f) \times (w_f h_f)}.   (1)

The correlation matrix Corr is then fed into a regressor, R, which predicts a parameter vector θ (with 2n² dimensions) that corresponds to x and y offsets on an n × n grid, according to which the target clothing C is warped.
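A minimal sketch of the correlation step in Eq. (1), assuming the two encoder outputs are available as standard 4D PyTorch feature maps; it mirrors the formulation above rather than the exact layer implementation of the released code.

```python
import torch
import torch.nn.functional as F

def correlation_matrix(psi_e1: torch.Tensor, psi_e2: torch.Tensor) -> torch.Tensor:
    """Compute Corr = Psi_E1^T Psi_E2 for a batch of encoder features.

    psi_e1, psi_e2: (B, d_f, h_f, w_f) feature maps from E1 and E2.
    Returns: (B, h_f * w_f, h_f * w_f) correlation matrices.
    """
    B, d_f, h_f, w_f = psi_e1.shape
    # Channel-wise L2 normalization, then flattening of the spatial dimensions.
    psi_e1 = F.normalize(psi_e1, p=2, dim=1).view(B, d_f, h_f * w_f)
    psi_e2 = F.normalize(psi_e2, p=2, dim=1).view(B, d_f, h_f * w_f)
    # Batched matrix product between all spatial positions of the two encoders.
    return torch.bmm(psi_e1.transpose(1, 2), psi_e2)
    # The regressor R then maps the (flattened) correlation matrix to theta
    # with 2 * n * n entries, i.e. the x/y offsets of an n x n TPS grid.
```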
Training Objectives. Three loss functions are used to learn the parameters of the BPGM, i.e.:

• A target shape loss (L_shp) that encourages the warping procedure to render the target clothing in a shape that matches the pose of the subject in I, i.e.:

\mathcal{L}_{shp} = \lVert M_w - M_c \rVert_1 = \lVert T_\theta(M_t) - M_c \rVert_1,   (2)

where M_t and M_w are binary masks corresponding to the original (C) and warped target clothing (C_w), respectively, M_c is a binary mask corresponding to the clothing area in the input image (generated with the segmentation model of Li et al. from [22]), and T_θ denotes the TPS transformation parameterized by θ.
• An appearance loss (L_app) that forces the visual appearance of the warped clothing C_w within the body area M_b to be as similar as possible to the input image I, i.e.:

\mathcal{L}_{app} = \lVert C_w \odot M_b - I_b \rVert_1,   (3)

where ⊙ is the Hadamard product, M_b is a binary mask of the body area (a channel in S), and I_b = I ⊙ M_b.

• A perceptual loss (L_vgg) that ensures that the target clothing and its warped version contain the same semantic content within the body area, i.e. [19]:

\mathcal{L}_{vgg} = \sum_i^n \lambda_i \lVert \phi_i(C_w \odot M_b) - \phi_i(I \odot M_b) \rVert_1,   (4)

where φ_i(·) is a feature map generated before each (of the n = 5) max-pooling layers of a VGG19 [32] model (pretrained on ImageNet), and λ_i is the corresponding weight.

Among the above losses, L_shp aims to match the general garment area, while L_app and L_vgg are designed to specifically match the on-garment graphics, without forcing the BPGM matcher to align sleeves, which are often a source of unrealistic transformations used later by the generator. Finally, the joint learning objective for the BPGM is:

\mathcal{L}_{BPGM} = \lambda_{shp} \mathcal{L}_{shp} + \lambda_{app} \mathcal{L}_{app} + \lambda_{vgg} \mathcal{L}_{vgg},   (5)

where λ_shp, λ_app and λ_vgg are balancing weights. The parameters of the BPGM are learned over a dataset of input images I with matched images of target clothing C.
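The BPGM objective can be assembled as in the sketch below. Here `warp` applies the TPS transformation T_θ and `vgg_features` is a hypothetical helper returning the n = 5 VGG19 feature maps taken before each max-pooling layer; the mean-reduced L1 terms are scaled versions of the ℓ1 norms in Eqs. (2)-(4), and the default weights follow the values reported in Section 4.2.

```python
import torch

def bpgm_loss(warp, vgg_features, theta, C, M_t, M_c, M_b, I,
              lambda_shp=1.0, lambda_app=1.0, lambda_vgg=0.1,
              vgg_weights=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Joint BPGM objective of Eq. (5), following Eqs. (2)-(4)."""
    C_w = warp(C, theta)            # warped clothing
    M_w = warp(M_t, theta)          # warped clothing mask

    # Eq. (2): shape loss between the warped mask and the clothing-area mask.
    L_shp = torch.mean(torch.abs(M_w - M_c))

    # Eq. (3): appearance loss restricted to the body area M_b.
    I_b = I * M_b
    L_app = torch.mean(torch.abs(C_w * M_b - I_b))

    # Eq. (4): VGG19 perceptual loss, also restricted to the body area.
    L_vgg = 0.0
    for w, f_cw, f_i in zip(vgg_weights,
                            vgg_features(C_w * M_b),
                            vgg_features(I * M_b)):
        L_vgg = L_vgg + w * torch.mean(torch.abs(f_cw - f_i))

    return lambda_shp * L_shp + lambda_app * L_app + lambda_vgg * L_vgg
```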
3.2. The Context-Aware Generator (CAG)

The second stage of the C-VTON pipeline consists of the Context-Aware Generator (CAG) and is responsible for synthesizing the final virtual try-on image I_C given the warped target clothing and other contextual cues as input. To simplify the discussion in the remainder of the section, we jointly refer to all inputs of the generator as Image Context (IC) hereafter, and define it as a channel-wise concatenation of the body-part segmentations S, the input image with masked clothing area I_m (computed as I_m = I ⊙ M_c), and the target and warped clothing images, C and C_w, respectively, i.e., IC = S ⊕ I_m ⊕ C ⊕ C_w. A visual illustration of IC is presented at the top of Figure 4(a).

Figure 4. Overview of the Context-Aware Generator (CAG): (a) the proposed generator with the associated losses used during training, (b) a schematic representation of a ResNet block with Context-Aware Normalization (CAN). The proposed CAG is designed to exploit contextual information for the synthesis step - at the input, for activation normalization and through a series of discriminators.
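Constructing the image context then amounts to a channel-wise concatenation, as in the following sketch (the tensor names follow the text; the exact channel ordering is an assumption):

```python
import torch

def build_image_context(S, I, C, C_w, M_c):
    """IC = S ⊕ I_m ⊕ C ⊕ C_w (channel-wise), with I_m = I ⊙ M_c.

    S:   (B, 25, h, w) body-part segmentations
    I:   (B, 3, h, w)  input image
    C:   (B, 3, h, w)  target clothing
    C_w: (B, 3, h, w)  warped target clothing
    M_c: (B, 1, h, w)  clothing-area mask
    Returns: (B, 34, h, w) image context tensor.
    """
    I_m = I * M_c
    return torch.cat([S, I_m, C, C_w], dim=1)
```

During synthesis, this same tensor is simply resized to each ResNet block's input resolution (cf. Section 4.2).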
CAG Design. The context-aware generator consists of a sequence of ResNet blocks [13] and (2×) upsampling layers augmented with what we refer to as Context-Aware Normalization (CAN) operations. Similarly to recent spatially-adaptive normalization mechanisms used in the field of conditional image synthesis [24, 31], the proposed CAN layers are designed to efficiently utilize the information from the image context IC and feed the generator with critical contextual information. As illustrated in Figure 4(a), this is done at different resolutions to ensure (i) that the activations of the generator are spatially normalized at different levels of granularity, and (ii) that the information on the targeted semantic layout and desired appearance of the synthesized output is propagated efficiently throughout the generator.

Each ResNet block of the generator has two inputs: the image context IC, and the activation map from the previous model layer. The only exception here is the first ResNet block of the generator that uses the image context IC at the smallest resolution (8 × 6 pixels) for both inputs, as shown in Figure 4(a). The utilized ResNet blocks consist of a sequence of batch-normalization and convolutional layers repeated twice, with CAN operations preceding the convolutional layers. If the output of the batch normalization is denoted as X_BN, then the context-aware normalization can formally be defined as: X_CAN = X_BN ⊙ γ + β, where ⊙ denotes the Hadamard product and X_CAN is the normalized output. γ and β stand for (spatial) scale and bias parameters with the same dimensionality as X_BN. The parameters are learned during the training procedure and computed using three convolutional layers, one of which is shared to first project IC onto a joint embedding space before estimating the values of γ and β with distinct convolutional operations.
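A compact sketch of a CAN operation consistent with the description above: parameter-free batch normalization, a shared convolutional projection of the (resized) image context, and two separate convolutions for γ and β. The hidden width, kernel sizes and the use of bilinear resizing are assumptions rather than the released configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextAwareNorm(nn.Module):
    """X_CAN = X_BN ⊙ γ + β, with γ and β predicted from the image context."""

    def __init__(self, num_features: int, context_channels: int, hidden: int = 128):
        super().__init__()
        self.bn = nn.BatchNorm2d(num_features, affine=False)
        # Shared projection of the image context onto a joint embedding space.
        self.shared = nn.Sequential(
            nn.Conv2d(context_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # Distinct convolutions estimating the scale and bias maps.
        self.to_gamma = nn.Conv2d(hidden, num_features, kernel_size=3, padding=1)
        self.to_beta = nn.Conv2d(hidden, num_features, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor, image_context: torch.Tensor) -> torch.Tensor:
        x_bn = self.bn(x)
        # Resize the image context to the spatial size of the current activations.
        ctx = F.interpolate(image_context, size=x.shape[2:], mode="bilinear",
                            align_corners=False)
        shared = self.shared(ctx)
        gamma, beta = self.to_gamma(shared), self.to_beta(shared)
        return x_bn * gamma + beta
```

Because γ and β are full spatial maps with the same dimensionality as X_BN, the normalization can inject the semantic layout of IC into the activations at every resolution, which is the behaviour described above.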
Training Objectives. Two types of losses are designed to learn the parameters of C-VTON's generator, i.e.:

• A perceptual loss (L_per) that encourages the generator to produce a virtual try-on result as close as possible to the reference input image I in terms of semantics. Because the loss assumes the desired target appearance I_C is known, the image context IC is constructed with target clothing C that matches the one in the input image I, i.e.:

\mathcal{L}_{per} = \sum_i^n \tau_i \lVert \phi_i(I'_C) - \phi_i(I) \rVert_1,   (6)

where φ_i(·) are feature maps produced by a pretrained VGG19 [32] model before each of the n = 5 max-pooling layers, τ_i is the i-th balancing weight, and I'_C is the try-on result generated with the matching target clothing.

• Three adversarial losses defined through three discriminators, each aimed at realistic generation of different aspects of the final try-on image, i.e.:

– The segmentation discriminator (D_seg), inspired by [31], aims to ensure realistic body-part generation by predicting (per-pixel) segmentation maps S and their origin (real or fake). Given an input image (I or I_C), D_seg outputs a w×h×(d+1) dimensional tensor, where the first d channels contain segmented body parts and the (d+1)-st channel encodes whether a pixel is from a real or generated data distribution. D_seg is trained by minimizing a (d+1)-class cross-entropy loss, i.e.:

\mathcal{L}_{D_{seg}} = -\mathbb{E}_{(I,S)} \left[ \sum_{k=1}^d \alpha_k \left( S_k \odot \log D_{seg}(I)_k \right) \right] - \mathbb{E}_{I_C} \left[ \log D_{seg}(I_C)_{d+1} \right],   (7)

where the first (segmentation-related) term applies to the (real) input images I and penalizes the first d output channels, and the second term penalizes the last remaining channel generated from the synthesized image I_C = CAG(IC). α_k is a balancing weight calculated as the inverse frequency of the body part in the given channels of the segmentation map S, i.e., α_k = hw/⌊S_k⌋, where ⌊·⌋ is a cardinality operator. The corresponding adversarial loss (L_seg) for the generator is finally defined as:

\mathcal{L}_{seg} = -\mathbb{E}_{(I_C,S)} \left[ \sum_{k=1}^{d} \alpha_k \left( S_k \odot \log D_{seg}(I_C)_k \right) \right].   (8)

– The matching discriminator (D_mth) aims to encourage the generator to synthesize output images with the desired target clothing by predicting whether the target garment C corresponds to the clothing being worn in either I or I_C. Formally, we train D_mth by minimizing the following learning objective:

\mathcal{L}_{D_{mth}} = -\mathbb{E}_{(I,C)} \left[ \log D_{mth}(I,C) \right] - \mathbb{E}_{(I_C,C)} \left[ \log(1 - D_{mth}(I_C,C)) \right],   (9)

leading to the following generator loss (L_mth):

\mathcal{L}_{mth} = -\mathbb{E}_{(I_C,C)} \left[ \log D_{mth}(I_C,C) \right].   (10)

– The patch discriminator (D_ptc) contributes towards realistic body-part generation by focusing on the appearances of local patches, P = {p_0, ..., p_m}, p_i ∈ R^{w_p×h_p×3}, centered at m = 5 characteristic body parts, i.e., the neck, and both upper arms and forearms. Different from PatchGAN [15], where patches are extracted implicitly through convolutional operations in the discriminator, we sample the patches from fixed locations based on the segmentation map S (see the sketch after this list). The discriminator is trained to distinguish between real and generated body areas based on the following objective:

\mathcal{L}_{D_{ptc}} = -\mathbb{E}_{P_{real}} \left[ \log D_{ptc}(P_{real}) \right] - \mathbb{E}_{P_{fake}} \left[ \log(1 - D_{ptc}(P_{fake})) \right],   (11)

where P_real and P_fake correspond to patches extracted from the real and generated images I and I_C, respectively. The generator loss then takes the following form:

\mathcal{L}_{ptc} = -\mathbb{E}_{P_{fake}} \left[ \log D_{ptc}(P_{fake}) \right].   (12)
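The fixed-location patch sampling used by D_ptc can be sketched as follows. The patch size, the use of part centroids as patch centers, and the zero-patch fallback for occluded parts are assumptions; in practice, patches are extracted at the same locations from I and I_C so that P_real and P_fake are aligned.

```python
import torch

def extract_body_part_patches(image, S, part_channels, patch_size=32):
    """Crop patches from `image` centered at the m characteristic body parts.

    image:         (3, h, w) real or generated image
    S:             (d, h, w) one-hot body-part segmentation
    part_channels: indices of the m = 5 parts (neck, upper arms, forearms)
    Returns: (m, 3, patch_size, patch_size) tensor of patches.
    """
    _, h, w = image.shape
    half = patch_size // 2
    patches = []
    for k in part_channels:
        ys, xs = torch.nonzero(S[k], as_tuple=True)
        if len(ys) == 0:                      # part not visible: use a zero patch
            patches.append(torch.zeros(3, patch_size, patch_size))
            continue
        # Center the patch at the centroid of the body part, clamped to the image.
        cy = int(ys.float().mean().clamp(half, h - half - 1))
        cx = int(xs.float().mean().clamp(half, w - half - 1))
        patches.append(image[:, cy - half:cy + half, cx - half:cx + half])
    return torch.stack(patches)
```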
The joint objective for training C-VTON's context-aware generator is a weighted sum of all loss terms, i.e.:

\mathcal{L}_G = \lambda_{per} \mathcal{L}_{per} + \lambda_{seg} \mathcal{L}_{seg} + \lambda_{mth} \mathcal{L}_{mth} + \lambda_{ptc} \mathcal{L}_{ptc},   (13)

where λ_per, λ_seg, λ_mth and λ_ptc denote hyperparameters that determine the relative importance of each loss term.
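For illustration, the generator-side terms of Eq. (13) can be assembled as below, assuming the discriminators return probabilities as in Eqs. (7)-(12) (D_seg with a softmax over its d + 1 channels) and reusing the hypothetical `vgg_features` helper from Section 3.1; the default weights follow Section 4.2.

```python
import torch

def generator_loss(I, I_C_prime, I_C, C, S, patches_fake, alpha,
                   D_seg, D_mth, D_ptc, vgg_features,
                   tau=(1.0, 1.0, 1.0, 1.0, 1.0),
                   lambda_per=10.0, lambda_seg=1.0, lambda_mth=1.0, lambda_ptc=1.0):
    """Weighted generator objective of Eq. (13) (illustrative sketch)."""
    eps = 1e-8

    # Eq. (6): VGG19 perceptual loss between the paired result I'_C and I.
    L_per = sum(t * torch.mean(torch.abs(f_g - f_r))
                for t, f_g, f_r in zip(tau, vgg_features(I_C_prime), vgg_features(I)))

    # Eq. (8): generated pixels should be segmented into the correct body parts.
    probs = D_seg(I_C)                         # (B, d + 1, h, w) class probabilities
    d = S.shape[1]
    weighted_log = alpha.view(1, d, 1, 1) * S * torch.log(probs[:, :d] + eps)
    L_seg = -torch.mean(torch.sum(weighted_log, dim=1))

    # Eq. (10): the matching discriminator should accept (I_C, C) as a matching pair.
    L_mth = -torch.mean(torch.log(D_mth(I_C, C) + eps))

    # Eq. (12): the patch discriminator should accept generated body-part patches.
    L_ptc = -torch.mean(torch.log(D_ptc(patches_fake) + eps))

    return (lambda_per * L_per + lambda_seg * L_seg
            + lambda_mth * L_mth + lambda_ptc * L_ptc)
```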
4. Experiments

In this section we present experiments that: (i) compare C-VTON to competing models, (ii) demonstrate the impact of key components on performance, and (iii) explore the characteristics of the proposed model.

4.1. Datasets

Following prior work [8, 16, 23, 35], two datasets are selected for the experiments, i.e., VITON [12] and MPV [4].

VITON [12] is a popular dataset for evaluating virtual try-on solutions and consists of 14,221 training and 2032 testing pairs of images (i.e., subjects and target clothing) with a resolution of 256 × 192 pixels. For the experiments, duplicate images in the training and test sets are filtered out, leaving 8586 image pairs in the training set and 416 image pairs in the test set. After duplicate removal, the test set contains unique images not seen during training and allows for a fair comparison between different approaches.
MPV [4] represents another virtual try-on dataset with 35,687 person images (256 × 192) wearing 13,524 unique garments. Different from VITON, MPV exhibits a higher degree of appearance variability with larger differences in zoom level and view point. For the experiments, the images in MPV are prefiltered to feature only (close to) frontal views in accordance with standard methodology, e.g., [8, 16]. The final train and test sets contain 17,400 paired and 3662 unpaired person and clothing images.

4.2. Implementation Details

C-VTON is implemented in Python using PyTorch. Most modules utilized in the processing pipeline build on ResNet-like blocks [13] that consist of two conv+ReLU layers and a trainable shortcut connection. Architectural details for the main C-VTON components are given below.

The Body-Part Geometric Matcher (BPGM) consists of 2 encoders, E_1 and E_2, with 5 stacked convolutional layers, followed by a downsampling operation, a ReLU activation function and batch normalization. The feature regressor R is implemented with 4 convolutional layers, each followed by a ReLU activation and batch normalization layers. An 18-dimensional linear output layer is used to obtain the parameters (θ) for the thin-plate spline transformation.

The Context-Aware Generator (CAG) consists of ResNet blocks with context-aware normalization added before every convolutional layer. We use 6 such blocks, each followed by a (2×) upsampling layer. Contextual inputs are resized to match each block's input resolution. An exponential moving average (EMA) is applied over the generator weights with a decay value of 0.9999, similarly to [33].

Discriminators. The matching discriminator D_mth is implemented with two encoders (one for C and one for I_C) consisting of 6 ResNet blocks each. The output of the encoders is concatenated and fed to a linear layer that produces the final output. The patch discriminator D_ptc comprises 4 ResNet blocks arranged in an encoder architecture, with a fully-connected layer on top. The segmentation discriminator D_seg has a UNet [29] encoder-decoder architecture and consists of a total of 12 ResNet blocks.

Training Details. The ADAM optimizer [18] is used for the training procedure with a learning rate of lr_BPGM = 0.0001 for the BPGM, lr_G = 0.0001 for the generator and lr_D = 0.0004 for the discriminators. All weights in the learning objectives from Eqs. (4), (5), (6) and (13) are set to 1, except for λ_vgg = 0.1 and λ_per = 10. The geometric matcher is trained for 30 epochs and the generator for 100 in all configurations. Source code is available at https://github.com/benquick123/C-VTON.
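The optimization setup can be reproduced roughly as follows; only the learning rates and the EMA decay are taken from the text, while the remaining optimizer hyperparameters are left at PyTorch defaults as an assumption.

```python
import copy
import torch

def build_optimizers(bpgm, generator, discriminators):
    """ADAM optimizers with the learning rates reported in Section 4.2."""
    opt_bpgm = torch.optim.Adam(bpgm.parameters(), lr=1e-4)
    opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
    opt_d = torch.optim.Adam(
        [p for disc in discriminators for p in disc.parameters()], lr=4e-4)
    return opt_bpgm, opt_g, opt_d

class GeneratorEMA:
    """Exponential moving average over generator weights (decay 0.9999)."""

    def __init__(self, generator, decay=0.9999):
        self.decay = decay
        self.shadow = copy.deepcopy(generator).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, generator):
        # shadow <- decay * shadow + (1 - decay) * current weights
        for p_ema, p in zip(self.shadow.parameters(), generator.parameters()):
            p_ema.mul_(self.decay).add_(p, alpha=1.0 - self.decay)
```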
4.3. Quantitative Results

To demonstrate the performance of C-VTON, we first analyze Fréchet Inception Distances (FID [14]) and Learned Perceptual Image Patch Similarities (LPIPS) [37] over processed VITON and MPV test images and conduct a human perceptual study (similarly to [8, 12, 34]) on the MTurk platform. For comparison purposes, we also report results for multiple state-of-the-art models, i.e., CP-VTON [34], CP-VTON+ [23], ACGPN [35], PF-AFN [8] and S-WUTON [16]. Pretrained (publicly released) models are used for the experiments to ensure a fair comparison, except for S-WUTON, where synthesized test images were made available for scoring by the authors of the model.

Table 1. Quantitative comparison of C-VTON and competing state-of-the-art models in terms of FID and LPIPS scores - lower is better, as also indicated by the corresponding arrows.

Data    Model           Published     FID↓     LPIPS↓ (µ ± σ)
VITON   CP-VTON [34]    ECCV 2018     47.36    0.303 ± 0.043
VITON   CP-VTON+ [23]   CVPRW 2020    41.37    0.278 ± 0.047
VITON   ACGPN [35]      CVPR 2020     37.94    0.233 ± 0.047
VITON   PF-AFN [8]      CVPR 2021     27.23    0.237 ± 0.049
VITON   C-VTON          This work     19.54    0.108 ± 0.033
MPV     S-WUTON [16]    ECCV 2020     8.188    0.161 ± 0.070
MPV     PF-AFN† [8]     CVPR 2021     6.429    n/a
MPV     C-VTON          This work     4.846    0.073 ± 0.039
† As reported in the original publication.

Table 2. Results of the human perceptual study reported in terms of the frequency with which C-VTON-generated results were preferred over others. The study was conducted with 100 randomly selected images for each dataset and 70 human participants.

Dataset  Model           Published     vs. C-VTON
VITON    CP-VTON [34]    ECCV 2018     0.766
VITON    CP-VTON+ [23]   CVPRW 2020    0.756
VITON    ACGPN [35]      CVPR 2020     0.674
VITON    PF-AFN [8]      CVPR 2021     0.527
MPV      S-WUTON [16]    ECCV 2020     0.607
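For reference, the two metrics can be computed with off-the-shelf packages, e.g., `torchmetrics` for FID and the `lpips` package for LPIPS. The snippet below is purely an illustration of such a protocol (network choice, batching and preprocessing are assumptions) and is not the evaluation code behind Table 1.

```python
import torch
import lpips                                    # pip install lpips
from torchmetrics.image.fid import FrechetInceptionDistance

@torch.no_grad()
def evaluate(real_batches, fake_batches, paired_batches, device="cuda"):
    """Compute FID over generated results and mean LPIPS over paired results.

    real_batches / fake_batches: iterables of uint8 image tensors (B, 3, H, W).
    paired_batches: iterable of (generated, reference) float tensors in [-1, 1].
    """
    fid = FrechetInceptionDistance(feature=2048).to(device)
    for real in real_batches:
        fid.update(real.to(device), real=True)
    for fake in fake_batches:
        fid.update(fake.to(device), real=False)

    lpips_fn = lpips.LPIPS(net="alex").to(device)
    scores = [lpips_fn(gen.to(device), ref.to(device)).mean()
              for gen, ref in paired_batches]

    return fid.compute().item(), torch.stack(scores).mean().item()
```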
FID and LPIPS Scores. A quantitative comparison of C-VTON and the selected competitors is presented in Table 1. We note that the results for PF-AFN on MPV are borrowed from [8], since no pretrained model is publicly available for this dataset. As can be seen, C-VTON significantly outperforms all competing models on both datasets. On VITON it reduces the FID score by 28.2% compared to the runner-up and the LPIPS measure by 53.6%. Similar (relative) performances are also observed on MPV, where C-VTON again leads to comparable reductions in FID and LPIPS scores when compared to the runner-ups. We attribute these results to the simplified geometric matching procedure used in C-VTON and the inclusion of diverse contextual information in the final image synthesis step.

Human Perceptual Study. We also evaluate C-VTON through a human perceptual study to analyze the (subjectively) perceived quality of the generated try-on images. In the scope of the study, participants were shown the original input image, the target garment and two distinct try-on results, where one was always the result of C-VTON and the other was generated by one of the competing solutions. The participants had to choose the more convincing of the two images based on multiple factors, i.e.: texture transfer quality, arm generation capabilities, pose preservation, and overall quality of results. 100 randomly selected images from each dataset were used for the study, which featured 70 participants in total. The results in Table 2, reported in terms of the frequency with which C-VTON-generated results were preferred over others, show that the proposed approach was clearly favored among the human raters.
Figure 5. Comparison of C-VTON (ours) and several recent state-of-the-art models on the VITON (left) and MPV (right) datasets. Areas of interest in the synthesized images are marked with a red bounding box. C-VTON performs considerably better than competing models when synthesizing arms and hands and also better preserves on-shirt graphics. Best viewed electronically and zoomed-in for details.

Table 3. Ablation study results. For each C-VTON variant (A1-A4) one key component is ablated to demonstrate its contribution.

Model          VITON FID↓   VITON LPIPS↓     MPV FID↓   MPV LPIPS↓
C-VTON         19.535       0.108 ± 0.033    4.846      0.073 ± 0.039
A1: w/o CAN    24.521       0.162 ± 0.037    12.096     0.159 ± 0.049
A2: w/o BPGM   24.422       0.140 ± 0.036    6.728      0.096 ± 0.046
A3: w/o D†     21.359       0.109 ± 0.033    5.898      0.076 ± 0.040
A4: w/o EMA    24.571       0.150 ± 0.035    5.304      0.102 ± 0.043
† D stands for the set of discriminators D = {D_seg, D_mth, D_ptc}.

Figure 6. Qualitative ablation-study results. The aggregation of all components results in noticeable improvements in sleeve and arm generation, on-garment graphics and realistic garment shapes.

4.4. Qualitative Results

Next, we explore the performance of C-VTON through visual comparisons with competing models in Figure 5. Due to the unavailability of a pretrained PF-AFN model for MPV, C-VTON is only compared to S-WUTON on this dataset. As can be seen from the presented examples, the proposed approach generates the most convincing virtual try-on results and performs particularly well with hand and on-shirt graphics synthesis. The results clearly show that visually convincing virtual try-on results can be produced with C-VTON even with subjects imaged in difficult poses and with challenging arm/hand configurations.

Among the evaluated competitors, PF-AFN produces the most convincing results on the VITON dataset. However, as illustrated by the presented examples, the method sometimes does not preserve arms, the initial body shape and/or the pose of the subjects, whereas C-VTON fares much better in this regard. The remaining approaches, i.e., CP-VTON, CP-VTON+ and ACGPN, produce less convincing results and often fail to preserve certain (non-transferable) image parts (e.g., trousers and skirts) and textures from the target garment. On MPV, S-WUTON similarly struggles to preserve arms and body shape, while our model synthesizes both well. The excellent performance of C-VTON in this regard is the result of the body-part segmentation procedure used and the set of carefully designed discriminators that ensure realism of the generated images.

4.5. Ablation Study

C-VTON relies on several key components to facilitate image-based virtual try-on. To demonstrate the impact of these components on performance, an ablation study is conducted. Specifically, four C-VTON variants are implemented, i.e.: (i) C-VTON without CAN operations (A1), (ii) C-VTON without the BPGM (A2), (iii) C-VTON without the discriminators (A3), and (iv) C-VTON without the exponential moving average - EMA (A4).

The results in Table 3 show that FID and LPIPS scores increase when any of the key components is ablated. Interestingly, the absence of CAN operations seems to affect results the most, while the discriminators appear to contribute the least. However, when looking at the visual examples in Figure 6, we see that the discriminators critically affect the final image quality despite the somewhat smaller change in quantitative scores. The difference is especially noticeable when comparing sharpness and artefacts when training C-VTON with or without the discriminators. Furthermore, CAN operations contribute to sleeve and arm generation, the BPGM to a higher quality of on-garment graphics, and EMA to more realistic garment shapes. C-VTON combines these contributions without creating new artefacts, as illustrated by the results for the complete model.
Figure 7. Sample results with multiple target garments and different subjects. Challenging examples with long to short-sleeved garment transfer are presented. Best viewed zoomed-in for details.

Figure 8. Comparison of the geometric matching module (GMM) from [34] and our Body-Part Geometric Matcher (BPGM). S-w-GMM: Synthesis with GMM, S-w-BPGM: Synthesis with BPGM.

Figure 9. Examples of less convincing try-on results. With certain image characteristics C-VTON generates blurry clothing edges and only partially transfers target garments. Zoom in for details.

4.6. Strengths and Weaknesses

Finally, we demonstrate some of C-VTON's strengths and weaknesses by: (i) presenting virtual try-on results for multiple target garments and different subjects, (ii) highlighting the benefits of the proposed geometric matcher, and (iii) illustrating some of the model's limitations.

Multiple Targets and Subjects. Figure 7 shows visual try-on results for two distinct subjects and 6 different target garments with varying sleeve lengths. Note that despite the fact that the arms are completely covered in the input images, C-VTON is able to generate realistic try-on results with convincing arm appearances. Varying sleeve lengths are also transferred well onto the synthesized images.

GMM vs. BPGM. Figure 8 shows a comparison between the original geometric matching module (GMM) from [34] and the proposed body-part geometric matcher (BPGM). Note that our BPGM generates more realistic warps that better preserve the shape and texture of the target clothing in the final try-on result. As illustrated by the example in the top row, the better alignment ensured by the BPGM leads to a correctly rendered V-neck. Similarly, on-shirt graphics are better preserved when the proposed BPGM is used instead of the original GMM, as seen by the example in the bottom row of Figure 8.

Limitations. Issues with the masking procedure (of I_m) when generating the image context IC, loose clothing in the input images, and the inability of the model to differentiate between the front and backside of the target garment C are among the main causes for some of the less convincing virtual try-on results produced with C-VTON. These causes lead to unrealistic and soft garment edges, incorrectly synthesized clothing types and improperly rendered neck areas. Similar limitations are also observed with the competing models, as seen from the presented examples in Figure 9.

5. Conclusion

In this paper, we proposed C-VTON, a novel approach to image-based virtual try-on capable of synthesizing high-quality try-on results across a wide range of input-image characteristics. The model was evaluated in extensive experiments on the VITON and MPV datasets and was shown to clearly outperform the state-of-the-art. Additional results that further highlight the merits of the proposed approach are available in the Supplementary Material.
References

[1] Rose Francoise Bertram and Ting Chi. A Study of Companies' Business Responses to Fashion E-Commerce's Environmental Impact. International Journal of Fashion Design, Technology and Education, 11(2):254–264, 2018.
[2] Wen-Huang Cheng, Sijie Song, Chieh-Yun Chen, Shintami Chusnul Hidayati, and Jiaying Liu. Fashion Meets Computer Vision: A Survey. ACM Computing Surveys, 54(4):1–41, 2021.
[3] Seunghwan Choi, Sunghyun Park, Minsoo Lee, and Jaegul Choo. VITON-HD: High-Resolution Virtual Try-On via Misalignment-Aware Normalization. In Computer Vision and Pattern Recognition (CVPR), pages 14131–14140, 2021.
[4] Haoye Dong, Xiaodan Liang, Xiaohui Shen, Bochao Wang, Hanjiang Lai, Jia Zhu, Zhiting Hu, and Jian Yin. Towards Multi-Pose Guided Virtual Try-on Network. In International Conference on Computer Vision (ICCV), pages 9026–9035, 2019.
[5] Haoye Dong, Xiaodan Liang, Yixuan Zhang, Xujie Zhang, Xiaohui Shen, Zhenyu Xie, Bowen Wu, and Jian Yin. Fashion Editing With Adversarial Parsing Learning. In Computer Vision and Pattern Recognition (CVPR), pages 8120–8128, 2020.
[6] Jean Duchon. Splines Minimizing Rotation-Invariant Semi-Norms in Sobolev Spaces. In Constructive Theory of Functions of Several Variables, pages 85–100. Springer, 1977.
[7] Chongjian Ge, Yibing Song, Yuying Ge, Han Yang, Wei Liu, and Ping Luo. Disentangled Cycle Consistency for Highly-realistic Virtual Try-On. In Computer Vision and Pattern Recognition (CVPR), 2021.
[8] Yuying Ge, Yibing Song, Ruimao Zhang, Chongjian Ge, Wei Liu, and Ping Luo. Parser-Free Virtual Try-on via Distilling Appearance Flows. In Computer Vision and Pattern Recognition (CVPR), pages 8485–8493, 2021.
[9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. In Advances in Neural Information Processing Systems (NIPS), 2014.
[10] Peng Guan, Loretta Reiss, David A Hirshberg, Alexander Weiss, and Michael J Black. DRAPE: Dressing Any Person. ACM Transactions on Graphics, 31(4):1–10, 2012.
[11] Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. DensePose: Dense Human Pose Estimation in the Wild. In Computer Vision and Pattern Recognition (CVPR), pages 7297–7306, 2018.
[12] Xintong Han, Zuxuan Wu, Zhe Wu, Ruichi Yu, and Larry S Davis. VITON: An Image-based Virtual Try-on Network. In Computer Vision and Pattern Recognition (CVPR), pages 7543–7552, 2018.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[14] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs Trained by a Two Time-scale Update Rule Converge to a Local Nash Equilibrium. In Advances in Neural Information Processing Systems (NIPS), pages 6626–6637, 2017.
[15] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-Image Translation with Conditional Adversarial Networks. In Computer Vision and Pattern Recognition (CVPR), pages 1125–1134, 2017.
[16] Thibaut Issenhuth, Jérémie Mary, and Clément Calauzènes. Do Not Mask What You Do Not Need to Mask: A Parser-Free Virtual Try-On. In European Conference on Computer Vision (ECCV), pages 619–635, 2020.
[17] Nikolay Jetchev and Urs Bergmann. The Conditional Analogy GAN: Swapping Fashion Articles on People Images. In International Conference on Computer Vision Workshops (ICCV-W), pages 2287–2292, 2017.
[18] Diederik P Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations (ICLR), 2015.
[19] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. In Computer Vision and Pattern Recognition (CVPR), pages 4681–4690, 2017.
[20] Hanna Lee and Yingjiao Xu. Classification of Virtual Fitting Room Technologies in the Fashion Industry: From the Perspective of Consumer Experience. International Journal of Fashion Design, Technology and Education, 13(1):1–10, 2020.
[21] Hyug Jae Lee, Rokkyu Lee, Minseok Kang, Myounghoon Cho, and Gunhan Park. LA-VITON: A Network for Looking-Attractive Virtual Try-On. In International Conference on Computer Vision Workshops (ICCV-W), 2019.
[22] Peike Li, Yunqiu Xu, Yunchao Wei, and Yi Yang. Self-Correction for Human Parsing. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–12, 2021.
[23] Matiur Rahman Minar, Thai Thanh Tuan, Heejune Ahn, Paul Rosin, and Yu-Kun Lai. CP-VTON+: Clothing Shape and Texture Preserving Image-based Virtual Try-on. In Computer Vision and Pattern Recognition Workshops (CVPR-W), 2020.
[24] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic Image Synthesis with Spatially-Adaptive Normalization. In Computer Vision and Pattern Recognition (CVPR), pages 2337–2346, 2019.
[25] Chaitanya Patel, Zhouyingcheng Liao, and Gerard Pons-Moll. TailorNet: Predicting Clothing in 3D as a Function of Human Pose, Shape and Garment Style. In Computer Vision and Pattern Recognition (CVPR), pages 7365–7375, 2020.
[26] Gerard Pons-Moll, Sergi Pujades, Sonny Hu, and Michael Black. ClothCap: Seamless 4D Clothing Capture and Retargeting. ACM Transactions on Graphics, 36(4):1–15, 2017.
[27] Amit Raj, Patsorn Sangkloy, Huiwen Chang, James Hays, Duygu Ceylan, and Jingwan Lu. SwapNet: Image Based Garment Transfer. In European Conference on Computer Vision (ECCV), pages 679–695, 2018.
[28] Ignacio Rocco, Relja Arandjelovic, and Josef Sivic. Convolutional Neural Network Architecture for Geometric Matching. In Computer Vision and Pattern Recognition (CVPR), pages 6148–6157, 2017.
[29] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), pages 234–241, 2015.
[30] Igor Santesteban, Miguel A Otaduy, and Dan Casas. Learning-Based Animation of Clothing for Virtual Try-On. Computer Graphics Forum, 38(2):355–366, 2019.
[31] Edgar Schönfeld, Vadim Sushko, Dan Zhang, Juergen Gall, Bernt Schiele, and Anna Khoreva. You Only Need Adversarial Supervision for Semantic Image Synthesis. In International Conference on Learning Representations (ICLR), 2021.
[32] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In International Conference on Learning Representations (ICLR), 2015.
[33] Vadim Sushko, Edgar Schönfeld, Dan Zhang, Juergen Gall, Bernt Schiele, and Anna Khoreva. You Only Need Adversarial Supervision for Semantic Image Synthesis. In International Conference on Learning Representations (ICLR), 2020.
[34] Bochao Wang, Huabin Zheng, Xiaodan Liang, Yimin Chen, Liang Lin, and Meng Yang. Toward Characteristic-Preserving Image-based Virtual Try-on Network. In European Conference on Computer Vision (ECCV), pages 589–604, 2018.
[35] Han Yang, Ruimao Zhang, Xiaobao Guo, Wei Liu, Wangmeng Zuo, and Ping Luo. Towards Photo-Realistic Virtual Try-on by Adaptively Generating-Preserving Image Content. In Computer Vision and Pattern Recognition (CVPR), pages 7850–7859, 2020.
[36] Ruiyun Yu, Xiaoqi Wang, and Xiaohui Xie. VTNFP: An Image-Based Virtual Try-On Network With Body and Clothing Feature Preservation. In International Conference on Computer Vision (ICCV), pages 10510–10519, 2019.
[37] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In Computer Vision and Pattern Recognition (CVPR), pages 586–595, 2018.
