Taming the Power of Diffusion Models for High-Quality Virtual Try-On with Appearance Flow
Junhong Gou, Siyu Sun, Jianfu Zhang, Jianlou Si, Chen Qian, and Liqing Zhang
Figure 1: Comparison of three methods on the VITON-HD dataset at 512 × 384 resolution. Our method generates high-quality results while faithfully restoring the clothes.
ABSTRACT
Virtual try-on is a critical image synthesis task that aims to transfer clothes from one image to another while preserving the details of both humans and clothes. While many existing methods rely on Generative Adversarial Networks (GANs) to achieve this, flaws can still occur, particularly at high resolutions. Recently, the diffusion model has emerged as a promising alternative for generating high-quality images in various applications. However, simply using clothes as a condition for guiding the diffusion model to inpaint is insufficient to maintain the details of the clothes. To overcome this challenge, we propose an exemplar-based inpainting approach that leverages a warping module to guide the diffusion model's generation effectively. The warping module performs initial processing on the clothes, which helps to preserve their local details. We then combine the warped clothes with the clothes-agnostic person image and add noise to form the input of the diffusion model. Additionally, the warped clothes are used as a local condition at each denoising step to ensure that the resulting output retains as much detail as possible. Our approach, namely Diffusion-based Conditional Inpainting for Virtual Try-ON (DCI-VTON), effectively utilizes the power of the diffusion model, and the incorporation of the warping module helps to produce high-quality and realistic virtual try-on results. Experimental results on VITON-HD demonstrate the effectiveness and superiority of our method. Source code and trained models will be publicly released at: https://github.com/bcmi/DCI-VTON-Virtual-Try-On.
CCS CONCEPTS
• Computing methodologies → Computational photography.

KEYWORDS
virtual try-on, diffusion models, appearance flow, high-resolution image synthesis

ACM Reference Format:
Junhong Gou, Siyu Sun, Jianfu Zhang, Jianlou Si, Chen Qian, and Liqing Zhang. 2023. Taming the Power of Diffusion Models for High-Quality Virtual Try-On with Appearance Flow. In Proceedings of the 31st ACM International Conference on Multimedia (MM '23), October 29–November 3, 2023, Ottawa, ON, Canada. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3581783.3612255

1 INTRODUCTION
Virtual try-on is a prevalently researched technology that can enhance consumers' shopping experiences. This technique seeks to transfer the clothes in one image onto the target person in another image, resulting in a realistic and plausible composite image. The key point of this task is that, on the presumption that the synthetic results are sufficiently realistic, the textural details of the garment and other attributes of the target person (e.g., appearance and pose) should be well maintained.
Most previous virtual try-on works were based on Generative Adversarial Networks (GANs) [12] in order to generate more realistic pictures. To further preserve details, previous studies [11, 15-17, 28, 42, 45] employed an explicit warping module that aligns the target clothes with the human body. After obtaining the warped clothes, they fed them into a generator along with the clothes-agnostic image of the person to produce the final result. Building on these, some works [6, 22] additionally extend the task to high-resolution scenarios. However, the reliability of such a framework is heavily contingent on the quality of the warped garments: low-quality warped garments impede faithful generation. Furthermore, GAN-based generators inherit the weaknesses of the GAN model, i.e., convergence that depends heavily on the choice of hyperparameters [1, 14] and mode drop in the output distribution [4, 29]. Even though these works have produced some positive outcomes, there are still issues such as unrealistic and poor details, as shown in Figure 1 (a).
More recently, diffusion models [19, 35, 39, 40] have gradually emerged and are considered as alternative generative models. Compared to GANs, diffusion models offer desirable qualities, including distribution coverage, a fixed training objective, and scalability [8, 32]. Although the diffusion model has excellent performance in many image generation tasks [5, 27, 34, 35], virtual try-on remains a very challenging task, for which preserving the detailed features in the reference image (i.e., the garment) is critical and essential. For our virtual try-on task, a naive method is to describe the clothes style through text and then use a mature text-to-image diffusion model framework [34, 35, 37] to complete the try-on task. However, it is difficult for text to accurately depict some complicated garment texture patterns, resulting in an inability to yield results that are completely consistent with our expectations. Recently, Yang et al. [44] have proposed a method for exemplar-based image inpainting with diffusion models, which can fill the target region of the source image seamlessly with the objects in the reference image while maintaining overall fidelity and harmony. Similar to this task, we can also regard virtual try-on as an inpainting task; the primary difference is that the task now involves inpainting garments onto humans. In this way we can indeed generate high-quality synthetic results, as shown in Figure 1 (b). However, it is evident that such an approach cannot fully preserve the details of the clothes image, and the clothes style (e.g., color, pattern) is biased. In this example, the color of the clothes and the arrangement of the stripes are completely different from the target clothes.
Motivated by the above points, we propose a virtual try-on framework based on the diffusion model. To fully utilize the diffusion model's powerful generation capabilities while also improving the model's controllability for the try-on task, we divide the entire framework into two major modules, namely the warping module and the refinement module. Similar to previous virtual try-on methods [11, 15, 17, 22], we predict an appearance flow field in the warping module to fit the clothes to the pose of the target person. Then, the warped clothes are directly combined with the image of the person whose torso and arms are masked to get a coarse result. This coarse result is input to our refinement module after adding noise, and an improved result is obtained after being denoised by the diffusion model. A high-quality synthetic result can be produced via such a process, and the powerful generative ability of the diffusion model also ensures that our results do not contain as many artifacts as previous GAN-based methods. Besides the initial guidance of the rough result and the global conditional guidance of the original clothes image, we also refer to [44] and concatenate the inpainting image and the inpainting mask together as input to control the generation of the diffusion model. Moreover, the warped clothes are combined with the inpainting image as a local condition to guide each step of the denoising process. In this way, the issue that the simple inpainting process cannot preserve the details of the clothes is overcome, as illustrated in Figure 1 (c).
To evaluate our proposed method, we conduct extensive experiments on the VITON-HD dataset [6] and the DressCode dataset [30], and compare it with previous works, which demonstrates that our method achieves excellent performance. Furthermore, we additionally conduct experiments on the virtual try-on task in more complex scenarios on the DeepFashion [25] dataset. Specifically, we use another person's clothes as a reference and transfer them to the target person. This task involves transfer across various human poses, which is more challenging than the setting where template clothes are provided.

2 RELATED WORK
2.1 Virtual Try-On
Virtual try-on has always been an appealing research subject since it may significantly enhance the shopping experience of consumers. According to [9, 17], we can divide the existing virtual try-on technologies into 2D and 3D categories. 3D virtual try-on technology can bring a better user experience, but it relies on 3D parametric human models and unfortunately building a large-scale 3D dataset
Figure 3: The training pipeline of the diffusion model in our method. There are two branches: the reconstruction branch (above) and the refinement branch (below); they differ in their inputs and optimization objectives. For better visualization, we show the images corresponding to the variables in the latent space.
I_p, and the part where m is 1 should contain all the elements of I_c and integrate seamlessly with the person.
To ensure that the clothes in the inpainting region not only retain most of the original clothes' characteristics but can also be "worn" by the person in a reasonable manner, we first warp the clothes to align them with the person and create a preliminary composite result, and then refine the inpainting region via the diffusion model. Figure 2 shows the overall process of our method, where the light blue and light green areas represent the warping and refinement processes respectively. In order to exclude the influence of the clothes worn by the target person in I_p on the succeeding steps, we use person representations extracted from off-the-shelf models [13, 23] as input. For the warping phase, the clothes-agnostic segmentation map S_p is concatenated with the DensePose representation P, and then, together with the clothes I_c, is fed into the warping network to predict an appearance flow field that warps the clothes. The warped clothes Ĩ_c and the clothes-agnostic person image I_a are combined to generate the coarse result I'_0, which is then noised for subsequent refinement by the diffusion model to obtain the finer result Î.
In the training process, since it is impossible to obtain data pairs of the same person wearing different clothes in the same posture, we use the clothes-agnostic image I_a extracted from I_p and the template image I_c of the clothes worn by the target person in I_p to reconstruct I_p.

3.1 Warping Network
There are currently two common methods for warping clothes, namely TPS warping and appearance flow-based warping. The warping method based on appearance flow has a higher degree of freedom and can accordingly adapt to more flexible transformations. The objective of the warping network is to predict the dense correspondences between the clothes image and the person image for warping the clothes. Similar to previous works [11, 15, 22], the final flow is obtained by an iterative refinement strategy. This method enables us to capture the long-range correspondence between I_p and I_c, allowing us to deal with significant misalignment more effectively.
Specifically, for the two kinds of input, I_c and the pair S_p & P, we use two symmetrical encoders to extract the feature pyramids {E_c^i}_{i=1}^N and {E_p^i}_{i=1}^N. Correspondingly, the flow F_i predicted in each layer is passed to the next layer for refinement to output F_{i+1}, until the final output is obtained. In each layer, the output flow F_{i-1} of the previous layer is first up-sampled to the same size and used to warp the corresponding features E_c^i; the result is then correlated with E_p^i to predict the increment of the flow. The final output F_N \in \mathbb{R}^{H \times W \times 2} is a set of 2D coordinate vectors, each of which indicates which pixels in the clothes image I_c should be used to fill the given pixel in the person image I_p.
Loss Functions: Since the appearance flow is a variable with a high degree of freedom, a total-variation (TV) loss is used to encourage smoothness of the final warping result. L_TV is calculated by the following formula:

    L_{TV} = \sum_{i=1}^{N} \| \nabla F_i \|_1 .    (1)

Referring to [11], we also add a second-order smoothness constraint, which is calculated by:

    L_{sec} = \sum_{i=1}^{N} \sum_{t} \sum_{\pi \in N_t} P( F_i^{t-\pi} + F_i^{t+\pi} - 2 F_i^{t} ) ,    (2)

in which F_i^t indicates the t-th point in the flow map F_i, N_t indicates the set of horizontal, vertical, and diagonal neighborhoods around the t-th point, and P is the generalized Charbonnier loss function [41].
Table 1: Quantitative comparison with baselines. We multiply KID by 100 for better comparison. For the user study result "a / b", a is the frequency with which each method is chosen as the best at restoring the clothes, and b is the frequency with which it is chosen as producing the most realistic result.
Moreover, for the warped clothes and the corresponding warped mask, a perceptual loss [20] and an L1 loss are used as constraints to encourage the network to warp the clothes to fit the person's pose. Formally, L_L1 and L_VGG are defined as follows:

    L_{L1} = \sum_{i=1}^{N} \| W( D_i(M_c), F_i ) - D_i(S_c) \|_1 ,    (3)

    L_{VGG} = \sum_{i=3}^{N} \sum_{m=1}^{5} \| \Phi_m( W( D_i(I_c), F_i ) ) - \Phi_m( D_i( S_c \odot I_p ) ) \|_1 ,    (4)

where M_c and S_c indicate the mask of I_c and the clothes mask of I_p, respectively. W represents the warping function, and D_i represents the downsampling function. \Phi_m indicates the m-th feature map of a VGG-19 [38] network pre-trained on ImageNet [7].
The total loss function of the entire warping network can be expressed as:

    L_w = L_{L1} + \lambda_{VGG} L_{VGG} + \lambda_{TV} L_{TV} + \lambda_{sec} L_{sec} ,    (5)

where \lambda_{VGG}, \lambda_{TV} and \lambda_{sec} denote the hyper-parameters controlling the relative importance of the different losses.

3.2 Diffusion Model
As indicated in the overview of our strategy in Figure 2, we intend to apply the diffusion model to refine the coarse synthesis results. To make better use of the initial rough results, we divide the training process into two branches: reconstruction and refinement. Figure 3 depicts our diffusion model training pipeline. During training, we optimize the two branches simultaneously. Intuitively, in the process of optimizing the reconstruction branch, our model learns to rely on the global and local conditions to generate a corresponding real person image. The refinement branch improves the similarity between the model's predictions and the rough results by controlling the initial noise. The global condition c indicates the condition extracted from I_c by a frozen pretrained CLIP [33] image encoder. Due to the cross-attention mechanism in LDM [35], it is easy to use the global attributes of the inpainting object (e.g., shape and pattern category) to guide the generation of the diffusion model, but it is challenging to effectively provide information about some fine-grained attributes (e.g., text, pattern content, and color composition). The lack of details is compensated for by using local conditions. Specifically, we add the warped clothes to the inpainting image I_a as input for each denoising step of the diffusion model. Note that we have not changed the inpainting mask m, which means that the clothes in I_lc are only used to provide detailed information, and the final inpainting result will redraw the entire mask area. As a result, the clothes in the final composite result might not exactly match the initial warping result. The benefit of this is that it can prevent certain adverse repercussions of poor warping results, and it can connect the human body part and the clothes part more effectively. In order to make better use of the spatial information contained in the pre-warped clothes and align the final result with the rough result I'_0, we also use the rough result as the initial condition: we add noise to it and input it into the diffusion model for refinement.
Reconstruction Branch: The reconstruction branch behaves like the vanilla diffusion model, which generates realistic images by learning the reverse diffusion process. For the target image I_0, we first perform a forward diffusion process q(·) on it, gradually adding noise according to a Markov chain and converting it into a Gaussian distribution. To reduce computational complexity, we employ a latent diffusion model [35], which embeds images from image space into latent space through a pretrained encoder E and reconstructs images with a pretrained decoder D. The forward process is performed on the latent variable z_0 = E(I_0) at an arbitrary timestamp t:

    z_t = \sqrt{\alpha_t} z_0 + \sqrt{1 - \alpha_t} \, \epsilon ,    (6)

where \alpha_t := \prod_{s=1}^{t} (1 - \beta_s) and \epsilon \sim N(0, I); \beta is a pre-defined variance schedule over T steps.
Afterwards, we obtain z_lc by feeding I_lc into E, and then concatenate it with z_t and the downsampled mask m to form the input {z_t, z_lc, m}. During denoising, an enhanced diffusion UNet [36] is used to predict a denoised variant of its input. The global condition c extracted from I_c is injected into the diffusion UNet through the cross-attention mechanism. The objective of this branch is thus defined as:

    L_{simple} = \| \epsilon - \epsilon_\theta( z_t, z_{lc}, m, c, t ) \|^2 .    (7)
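A minimal sketch of the reconstruction-branch objective in Eqs. (6)-(7). The callables eps_model, vae_encoder and clip_image_encoder, as well as the channel-wise concatenation layout, are illustrative assumptions rather than the released code.

```python
import torch
import torch.nn.functional as F

def forward_diffuse(z0, t, alphas_cumprod):
    """Eq. (6): z_t = sqrt(alpha_t) * z_0 + sqrt(1 - alpha_t) * eps."""
    eps = torch.randn_like(z0)
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)               # cumulative product of (1 - beta_s)
    z_t = a_t.sqrt() * z0 + (1.0 - a_t).sqrt() * eps
    return z_t, eps

def reconstruction_loss(eps_model, vae_encoder, clip_image_encoder,
                        I0, I_lc, mask, I_c, alphas_cumprod, T):
    """Eq. (7): noise-prediction loss on {z_t, z_lc, m} with the global condition c."""
    z0 = vae_encoder(I0)                                     # target person image -> latent
    z_lc = vae_encoder(I_lc)                                 # local condition: agnostic person + warped clothes
    m = F.interpolate(mask, size=z0.shape[-2:])              # downsample mask to latent resolution
    c = clip_image_encoder(I_c)                              # frozen CLIP features of the clothes image

    t = torch.randint(0, T, (z0.shape[0],), device=z0.device)
    z_t, eps = forward_diffuse(z0, t, alphas_cumprod)

    unet_in = torch.cat([z_t, z_lc, m], dim=1)               # channel-wise concatenation {z_t, z_lc, m}
    eps_pred = eps_model(unet_in, t, context=c)              # c enters via cross-attention inside the UNet
    return F.mse_loss(eps_pred, eps)
```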
Refinement Branch: This branch starts from the rough synthesis result I'_0; it inpaints the human body area, handles the regions where the clothes meet the human body, and can also eliminate the negative effects of some inappropriate warping results. After training the reconstruction branch, the diffusion model can generate a synthetic image that basically restores the characteristics of the clothes under the guidance of the local and global conditions, but the lack of spatial guidance means the generated images cannot fully restore the layout of the clothes pattern. For example, for striped clothes, the global condition may prompt the model to build a striped pattern, and the local condition adds information such as the thickness and color of the stripes, but this information is still insufficient. The initial condition further infuses information into the model, such as the arrangement and layout of these stripes.
Similar to the reconstruction branch, we first employ the encoder E to extract z'_0 from I'_0 by z'_0 = E(I'_0), and then perform the forward process on z'_0 to get z'_t. Then, {z'_t, z_lc, m} is fed into the diffusion model for denoising. Once the noise \hat{\epsilon} predicted by the model is obtained, according to Eq. (6) we can obtain the refined latent variable \hat{z} by reversing the equation, and the final image can be recovered as \hat{I} = D(\hat{z}). After obtaining \hat{I}, we optimize it with a perceptual loss [20], which is calculated by:

    L_{VGG} = \sum_{m=1}^{5} \| \phi_m( \hat{I} ) - \phi_m( I_{gt} ) \|_1 .    (8)

In total, our diffusion model is trained end-to-end using the following objective function:

    L_d = L_{simple} + \lambda_{perceptual} L_{VGG} ,    (9)

where \lambda_{perceptual} is the hyper-parameter used to balance the two losses.
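A hedged sketch of the refinement-branch loss in Eqs. (8)-(9), continuing the assumptions of the previous snippet. A single-step latent reversal of Eq. (6) is shown for brevity, and vgg_features is a hypothetical helper returning a list of VGG-19 feature maps; the released code may structure this differently.

```python
import torch
import torch.nn.functional as F

def refinement_loss(eps_model, vae_encoder, vae_decoder, vgg_features,
                    I_coarse, I_gt, I_lc, mask, c, alphas_cumprod, T,
                    lambda_perceptual=1e-4):
    """Eqs. (8)-(9): perceptual loss on the image recovered from the noised coarse result."""
    z0c = vae_encoder(I_coarse)                               # coarse composite I'_0 -> latent
    z_lc = vae_encoder(I_lc)
    m = F.interpolate(mask, size=z0c.shape[-2:])

    t = torch.randint(0, T, (z0c.shape[0],), device=z0c.device)
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(z0c)
    z_t = a_t.sqrt() * z0c + (1.0 - a_t).sqrt() * eps         # noised initial condition (Eq. 6)

    eps_hat = eps_model(torch.cat([z_t, z_lc, m], dim=1), t, context=c)
    z_hat = (z_t - (1.0 - a_t).sqrt() * eps_hat) / a_t.sqrt()  # invert Eq. (6) with the predicted noise
    I_hat = vae_decoder(z_hat)

    # Eq. (8): L1 distance between VGG-19 feature maps of the refined and ground-truth images.
    perceptual = sum(F.l1_loss(fh, fg)
                     for fh, fg in zip(vgg_features(I_hat), vgg_features(I_gt)))
    # Eq. (9): combined with a noise-prediction term (in the paper, L_simple comes from the
    # reconstruction branch trained jointly with this one).
    return F.mse_loss(eps_hat, eps) + lambda_perceptual * perceptual
```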
4 EXPERIMENTS
4.1 Experiments Setting
Datasets: Our experiments are mainly carried out on the VITON-HD dataset [6], which contains 13,679 frontal-view woman and top clothes image pairs at a resolution of 1024 × 768. Following previous work [6, 22], we split the dataset into a training set and a test set with 11,647 and 2,032 pairs respectively, and conduct experiments at three different resolutions. Moreover, in order to verify that our method can function in more complicated situations, we also conduct experiments on the DeepFashion dataset [25] and the DressCode dataset [30]; the corresponding results are provided in the supplementary material.
Evaluation Metrics: For the two test settings, we employ different metrics to evaluate the performance of our method. For the paired setting, in which the clothes image is used to reconstruct the person image, we use two widely used metrics: Structural Similarity (SSIM) [43] and Learned Perceptual Image Patch Similarity (LPIPS) [47]. For the unpaired setting, in which we change the clothes of the person image, we measure Frechet Inception Distance (FID) [18] and Kernel Inception Distance (KID) [3]. We also consider human perception and include a user study for a more comprehensive comparison. Specifically, we collect the composite images generated by the different methods for 300 pairs randomly selected from the test set at 512 × 384 resolution. 20 human raters are asked to select, for each test tuple, the method that best restores the clothes and the method that produces the most realistic result. We then report the frequency with which each method is selected as the best in these two aspects.
Implementation Details: The two major modules of our model, the warping module and the refinement module, are trained separately. We train the warping network for 100 epochs with the Adam optimizer [21] and a learning rate of 5 × 10^-5. The hyper-parameters \lambda_{VGG}, \lambda_{TV} and \lambda_{sec} are set to 0.2, 0.01 and 6. Note that the warping module is trained at 256 × 192 resolution; referring to [22], at inference time we upsample the predicted appearance flow to the corresponding size. For the diffusion model, we use a KL-regularized autoencoder with a latent-space downsampling factor f = 8. Therefore, the spatial dimension of the latent space is c × (H/f) × (W/f), where the channel dimension c is 4. For the denoising UNet, we follow the architecture of [44]. We use the AdamW [26] optimizer with a learning rate of 1 × 10^-5, and the hyper-parameter \lambda_{perceptual} is set to 1 × 10^-4. We utilize [44] as initialization to provide a strong image prior and basic inpainting ability, and then train on 2 NVIDIA Tesla A100 GPUs for 40 epochs. During inference, we use the PLMS [24] sampling method with the number of sampling steps set to 100.

4.2 Quantitative Evaluation
We compare our method with previous virtual try-on methods, CP-VTON [42], PF-AFN [11], VITON-HD [6] and HR-VTON [22], and with the diffusion inpainting method Paint-by-Example [44]. Table 1 shows the quantitative comparison with these methods. Among the virtual try-on methods, HR-VTON achieves state-of-the-art performance at all three resolutions. After being fine-tuned on the VITON-HD dataset, Paint-by-Example is also very competitive: thanks to the strong image priors embedded in the diffusion model, in the unpaired setting its FID and KID metrics even surpass HR-VTON at some resolutions. However, in the paired setting its performance decreases significantly, owing to the difficulty of preserving most clothes details. In comparison, our method achieves the best results on the various metrics and has superior performance at all three resolutions. Combining the powerful generation ability of the diffusion model with the strong guidance of our three conditions on the generation process, our model generates realistic and natural images while retaining the original clothes to the greatest extent possible.

4.3 Ablation Study
Taking 512 × 384 resolution on the VITON-HD dataset as the basic setting, we conduct ablation studies to validate the effectiveness of each component in our network; the results are shown in Table 2.
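For reference, the paired and unpaired metrics reported in Tables 1 and 2 can be computed with standard implementations. The sketch below assumes the torchmetrics package (not part of our codebase) and images given as float tensors in [0, 1]; exact preprocessing may differ from our evaluation scripts.

```python
import torch
from torchmetrics.image import StructuralSimilarityIndexMeasure
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance

def evaluate(paired_batches, unpaired_fake_batches, real_batches, device="cuda"):
    ssim = StructuralSimilarityIndexMeasure(data_range=1.0).to(device)
    lpips = LearnedPerceptualImagePatchSimilarity(net_type="vgg", normalize=True).to(device)
    fid = FrechetInceptionDistance(normalize=True).to(device)
    kid = KernelInceptionDistance(subset_size=50, normalize=True).to(device)  # subset_size <= #samples

    # Paired setting: reconstructed person vs. ground truth.
    for fake, real in paired_batches:
        ssim.update(fake.to(device), real.to(device))
        lpips.update(fake.to(device), real.to(device))

    # Unpaired setting: distribution of try-on results vs. real images.
    for real in real_batches:
        fid.update(real.to(device), real=True)
        kid.update(real.to(device), real=True)
    for fake in unpaired_fake_batches:
        fid.update(fake.to(device), real=False)
        kid.update(fake.to(device), real=False)

    kid_mean, _ = kid.compute()
    return {"SSIM": ssim.compute().item(),
            "LPIPS": lpips.compute().item(),
            "FID": fid.compute().item(),
            "KID x100": 100.0 * kid_mean.item()}   # KID is multiplied by 100 in the tables
```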
Table 2: Ablation studies of the network components in our model. We multiply KID by 100 for better comparison.

Method                  LPIPS↓   SSIM↑   FID↓    KID↓
w/o warping module      0.054    0.891   8.13    0.034
w/o global condition    0.045    0.896   8.18    0.030
w/o local condition     0.065    0.888   8.14    0.032
w/o initial condition   0.064    0.871   10.26   0.180
Ours                    0.043    0.896   8.09    0.028

First, we explore how much the warping module affects the subsequent synthesis process (w/o warping module). Referring to [48], we no longer use the warping network to finely warp the clothes, but instead transform the clothes to a reasonable size and position through a basic affine transformation, which serves as the warping result and is input into the diffusion model. Specifically, we first center-align the clothes image with the inpainting area, and then roughly scale the clothes to fill the inpainting area. This process can be expressed by the following formula:

    I_c^{aff} = \begin{bmatrix} R & 0 \\ 0 & R \end{bmatrix} I_c + \begin{bmatrix} x_{I_a} - x_{I_c} \\ y_{I_a} - y_{I_c} \end{bmatrix} ,    (10)

where R denotes the scale factor computed from the aspect ratio, while (x_{I_a}, y_{I_a}) and (x_{I_c}, y_{I_c}) represent the centers of I_a and I_c, respectively. The results show that the warping module facilitates subsequent synthesis, particularly in complex scenes wherein a person's posture changes significantly and it is difficult to correctly put clothes on the person without pre-warping. This also demonstrates that our method is capable of coping with the negative impacts of certain poor warping results.
Afterwards, we explore the influence of the three conditions on the model. First, we remove the global condition (w/o global condition), which means we no longer feed the CLIP features into the network but instead replace them with a learnable variable vector. Among the three conditions, the global condition has the least effect on the model. The primary cause of its limited impact may be that such coarse-grained features are mostly contained within the fine-grained features of the other conditions. We then remove the local condition by using I_a instead of I_lc in the input of the diffusion model (w/o local condition), providing guidance only outside the inpainting region. It is evident that the lack of local conditions results in some performance reduction. Following that, we remove the refinement branch, thereby discarding the initial condition (w/o initial condition). Compared with the local condition, the lack of the initial condition has a greater impact on performance, which largely shows that our new refinement branch can make good use of the rough results to guide the generation more accurately. These results demonstrate that the guidance of the three conditions during generation is complementary and indispensable.
To show the impact of these components on the final result more intuitively, we visualize them in Figure 4. For the plaid shirt in the first row, our full-fledged method restores the texture and color of the clothes well. The model that lacks the global condition, apart from a difference in overall color, can still restore the characteristics of the clothes to a large extent. In the absence of the initial condition, although the stripe arrangement is roughly the same, the distribution and color of each stripe are quite different. In the other cases, none of the ablative methods preserve the clothes details well. For the meaningful pattern in the second row, only our full-fledged model can preserve it well. In the absence of the global condition, there is still a certain chromatic aberration. By comparing the results of the third and fourth columns, it can be found that the initial condition is a good complement to the local condition, arranging the local conditions spatially. From the results in the last column, it is not difficult to conclude that pre-warping the clothes is beneficial for restoring such patterns with practical significance.

4.4 Qualitative Evaluation
The composite images produced by the various methods on the VITON-HD dataset at 512 × 384 are exhibited in Figure 5. Although some previous virtual try-on methods properly synthesize the human body and clothes, dealing with the interaction between the two is difficult. Paint-by-Example [44] cannot guarantee that the clothes in the generated results are identical to the given clothes, and there are texture and pattern differences. It can be seen that our method generates more realistic and reasonable results than previous methods and sufficiently restores the texture characteristics of the clothes. In the first row, we can see that the previous methods cannot handle the crossed hands of the person well, whereas our method copes with such complicated poses well. Similarly, in the second row, at the neckline of the clothes and the part where the clothes meet the left hand, our method obtains more realistic results. Moreover, for some transparent materials or hollow styles of clothes, our method achieves excellent results, as shown in the last row of samples; it is obvious that our method achieves a more realistic try-on effect for these clothes, such as the mesh style in the last row. More examples of composite results and a discussion of the limitations of our method are presented in the supplementary materials.

5 CONCLUSION
In this work, we treat the virtual try-on task as an inpainting task and solve it using the diffusion model. In order to allow the diffusion model to better retain the characteristics of the clothes during the inpainting process and improve the authenticity of the generated image, we use a warping network to predict the appearance flow to warp the clothes before inpainting. Under the premise of using the global condition, we add the warped clothes to the input of the diffusion model as the local condition. Meanwhile, a new branch is introduced to assist the model in making better use of the coarse synthesis results obtained in the previous step. The experimental results on the VITON-HD dataset demonstrate the superiority of our method.

ACKNOWLEDGMENTS
The work was supported by the Shanghai Municipal Science and Technology Major / Key Project, China (Grant No. 20511100300 / 2021SHZDZX0102) and the National Natural Science Foundation of China (Grant No. 62076162).
Supplementary for Taming the Power of Diffusion Models for High-Quality Virtual Try-On with Appearance Flow

ACM Reference Format:
Junhong Gou, Siyu Sun, Jianfu Zhang, Jianlou Si, Chen Qian, and Liqing Zhang. 2023. Supplementary for Taming the Power of Diffusion Models for High-Quality Virtual Try-On with Appearance Flow. In Proceedings of the 31st ACM International Conference on Multimedia (MM '23), October 29–November 3, 2023, Ottawa, ON, Canada. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3581783.3612255

In this document, we provide additional materials to supplement our main text. In Appendix A, we show more qualitative comparison results on the VITON-HD [1] dataset; we additionally perform experiments on the DressCode [5] and DeepFashion [4] datasets and show further qualitative results. We then compare our proposed approach to a text-to-image based inpainting approach in Appendix A.4. Finally, we show failure cases generated by our method and discuss its limitations in Appendix B.

A MORE QUALITATIVE RESULTS
A.1 Results on VITON-HD
In this section, we show more composite images produced by various methods on the VITON-HD dataset in Figures 1, 2 and 3. The previous methods include CP-VTON [7], PF-AFN [2], VITON-HD [1], HR-VTON [3], and the diffusion inpainting method Paint-by-Example [8]. It is evident that our method outperforms the previous methods in terms of the characteristic restoration of the clothes and the authenticity of the synthesized pictures. Other methods generally suffer from insufficient restoration of the clothing and excessively blurry, unrealistic results.
In the third row of Figure 1, it is difficult for many previous methods to maintain the fluidity of the stripes on the clothes, especially in the part that is in contact with the hair. Our approach effectively addresses these issues while preserving the original stripe layout. As for the clothes made of tulle in the fourth row, due to the image prior contained in the diffusion model, we can better restore the characteristics of such materials. In addition, when facing the densely spotted clothing texture in the last row, we can see that most of these spots in the results generated by the previous methods are blurred or disappear. In Figure 2, for people standing sideways as in the fourth row, the previous methods cannot handle such a situation well; in contrast, our method obtains a more reasonable result. Additionally, in the fifth row our method more effectively produces stacking wrinkles on the clothing to improve realism. Moreover, in the 1st, 5th and 6th rows of Figure 3, the texture of the clothing is blocked by the arms of the person. Other methods cannot reasonably guarantee the pattern layout of the clothing, but our method handles such occlusion more effectively.

A.2 Results on DressCode
Similar to VITON-HD [1], DressCode [5] is a dataset containing high-quality try-on data pairs, and it consists of three sub-datasets, namely dresses, upper-body and lower-body. Overall, the dataset is composed of 53,795 image pairs: 15,366 pairs for upper-body clothes, 8,951 pairs for lower-body clothes, and 29,478 pairs for dresses. For training, we followed the method of extracting the agnostic mask in [5], and the rest of the settings were consistent with the VITON-HD setting. All experiments on the DressCode dataset are performed at 512 × 384 resolution.
Table 1 shows the quantitative comparison among PF-AFN [2], HR-VTON [3] and ours. We measure the LPIPS and FID metrics of the three methods for the paired and unpaired settings respectively. It can be seen that our method achieves the best performance on all three sub-datasets. Besides that, we visualize the results of our method, as shown in Figure 4. On all three sub-datasets, our method achieves realistic and natural try-on results.

A.3 Results on DeepFashion
In the DeepFashion [4] dataset, our task goal is to transfer the clothes worn by the person in one image to the person in another image.
Compared with the previous task setting of providing template clothes, this task is undoubtedly more challenging.
We train our model on the DeepFashion dataset at 512 resolution, and the training process is the same as on the VITON-HD dataset. During training, we use two images of the same person in the same dress in different poses as a training pair, extract the clothes from one of the images, and put them on the person in the other image. Following the training/test split used in PATN [9] for pose transfer, we first obtained 101,966 data pairs for training. On this basis, we eliminated the data pairs in which the clothes accounted for too small a portion of the image, and finally obtained 51,644 pairs for training.
In Figure 5, we show some visualization results on the DeepFashion dataset. It can be seen that even when the clothes are transferred between persons in different poses, our method can preserve the characteristics of the clothes effectively and generate a realistic composite image.

A.4 Comparisons to Text-to-Image Approach
Additionally, we experiment with an existing text-to-image inpainting method on DeepFashion to compare with our method. Specifically, we use the pretrained stable diffusion inpainting model [6], and then use the text description corresponding to the clothes as the condition to generate the final result. Similarly, in the input we mask the upper half of the human body.
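One way to reproduce this text-to-image baseline is with the diffusers StableDiffusionInpaintPipeline; the snippet below is an illustrative sketch only, and the checkpoint name, prompt, file paths and preprocessing are assumptions rather than the exact setup used in our experiments.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# Pretrained Stable Diffusion inpainting checkpoint (assumed).
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16,
).to("cuda")

person = Image.open("person.jpg").convert("RGB").resize((512, 512))
# Mask covering the upper half of the body (white = region to repaint).
mask = Image.open("upper_body_mask.png").convert("L").resize((512, 512))

prompt = "a woman wearing a blue and white striped short-sleeve t-shirt"  # text description of the clothes
result = pipe(prompt=prompt, image=person, mask_image=mask).images[0]
result.save("text_to_image_tryon.png")
```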
The comparison results are shown in Figure 6. It is clear that utilizing text as the condition alone makes it impossible to recover the qualities of the clothing we need, as the clothing's color, material, and pattern details vary.

B DISCUSSIONS ON LIMITATIONS
Despite producing excellent results, our method does not entirely cover all cases. As shown in Figure 7, we display some less satisfactory composite results. In these two examples, our approach fails to accurately reproduce the clothing patterns. This demonstrates that for some relatively tiny and complex patterns, our method cannot accurately preserve every detail. It is challenging for our method to exactly replicate small writing on clothing, although for less strict patterns the produced results can be fairly consistent. One reason for this could be that the inpainting process takes place in the latent space, which results in a certain loss, especially for such small and precise targets.

REFERENCES
[1] Seunghwan Choi, Sunghyun Park, Minsoo Lee, and Jaegul Choo. 2021. VITON-HD: High-Resolution Virtual Try-On via Misalignment-Aware Normalization. In CVPR.
[2] Yuying Ge, Yibing Song, Ruimao Zhang, Chongjian Ge, Wei Liu, and Ping Luo. 2021. Parser-Free Virtual Try-On via Distilling Appearance Flows. In CVPR.
[3] Sangyun Lee, Gyojung Gu, Sunghyun Park, Seunghwan Choi, and Jaegul Choo. 2022. High-Resolution Virtual Try-On with Misalignment and Occlusion-Handled Conditions. In ECCV.
[4] Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. 2016. DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations. In CVPR.
[5] Davide Morelli, Matteo Fincato, Marcella Cornia, Federico Landi, Fabio Cesari, and Rita Cucchiara. 2022. Dress Code: High-Resolution Multi-Category Virtual Try-On. In CVPR.
[6] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-Resolution Image Synthesis with Latent Diffusion Models. In CVPR.
[7] Bochao Wang, Huabin Zheng, Xiaodan Liang, Yimin Chen, Liang Lin, and Meng Yang. 2018. Toward Characteristic-Preserving Image-Based Virtual Try-On Network. In ECCV.
[8] Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, and Fang Wen. 2022. Paint by Example: Exemplar-Based Image Editing with Diffusion Models. arXiv preprint arXiv:2211.13227 (2022).
[9] Zhen Zhu, Tengteng Huang, Baoguang Shi, Miao Yu, Bofei Wang, and Xiang Bai. 2019. Progressive Pose Attention Transfer for Person Image Generation. In CVPR.
Figure 6: Visual comparison of our method and text-to-image method on DeepFashion dataset.
Figure 7: Visualization of our failure cases on VITON-HD dataset. Each sample tuple is the target person, target clothes and
composite image from left to right. For these examples, we zoom in for better observation.