0% found this document useful (0 votes)
65 views15 pages

LaDI VTON

This document proposes LaDI-VTON, a new model for virtual clothing try-ons that uses latent diffusion to generate realistic images. The model enhances an autoencoder with learnable skip connections to better preserve details. It also uses a textual inversion module to condition generations on garment textures without losing information. Experiments on standard datasets show the approach outperforms state-of-the-art methods, demonstrating diffusion models can achieve higher realism for virtual try-ons than GANs.

Uploaded by

laure9239
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
65 views15 pages

LaDI VTON

This document proposes LaDI-VTON, a new model for virtual clothing try-ons that uses latent diffusion to generate realistic images. The model enhances an autoencoder with learnable skip connections to better preserve details. It also uses a textual inversion module to condition generations on garment textures without losing information. Experiments on standard datasets show the approach outperforms state-of-the-art methods, demonstrating diffusion models can achieve higher realism for virtual try-ons than GANs.

Uploaded by

laure9239
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 15

LaDI-VTON:

Latent Diffusion Textual-Inversion Enhanced Virtual Try-On


Davide Morelli∗ Alberto Baldrati∗ Giuseppe Cartella
University of Modena and Reggio University of Florence University of Modena and Reggio
Emilia Florence, Italy Emilia
Modena, Italy [email protected] Modena, Italy
[email protected] [email protected]

Marcella Cornia Marco Bertini Rita Cucchiara


arXiv:2305.13501v3 [cs.CV] 3 Aug 2023

University of Modena and Reggio University of Florence University of Modena and Reggio
Emilia Florence, Italy Emilia
Modena, Italy [email protected] Modena, Italy
[email protected] [email protected]

Figure 1: Images generated by the proposed LaDI-VTON model, given an input target model and a try-on clothing item from
both Dress Code [45] (1st row) and VITON-HD [9] (2nd row) datasets.
ABSTRACT these powerful generative solutions. This work introduces LaDI-
The rapidly evolving fields of e-commerce and metaverse continue VTON, the first Latent Diffusion textual Inversion-enhanced model
to seek innovative approaches to enhance the consumer experience. for the Virtual Try-ON task. The proposed architecture relies on
At the same time, recent advancements in the development of diffu- a latent diffusion model extended with a novel additional autoen-
sion models have enabled generative networks to create remarkably coder module that exploits learnable skip connections to enhance
realistic images. In this context, image-based virtual try-on, which the generation process preserving the model’s characteristics. To
consists in generating a novel image of a target model wearing effectively maintain the texture and details of the in-shop garment,
a given in-shop garment, has yet to capitalize on the potential of we propose a textual inversion component that can map the vi-
sual features of the garment to the CLIP token embedding space
∗ Both and thus generate a set of pseudo-word token embeddings capable
authors contributed equally to this research.
of conditioning the generation process. Experimental results on
Dress Code and VITON-HD datasets demonstrate that our approach
MM ’23, October 29-November 3, 2023, Ottawa, ON, Canada
outperforms the competitors by a consistent margin, achieving a
© 2023 Association for Computing Machinery.
This is the author’s version of the work. It is posted here for your personal use. Not significant milestone for the task. Source code and trained models
for redistribution. The definitive Version of Record was published in Proceedings of the are publicly available at: https://github.com/miccunifi/ladi-vton.
31st ACM International Conference on Multimedia (MM ’23), October 29-November 3,
2023, Ottawa, ON, Canada, https://doi.org/10.1145/3581783.3612137.
MM ’23, October 29-November 3, 2023, Ottawa, ON, Canada Davide Morelli, Alberto Baldrati, Giuseppe Cartella, Marcella Cornia, Marco Bertini, and Rita Cucchiara

KEYWORDS the encoding phase to the corresponding decoding one, improving


Virtual Try-On, Latent Diffusion Models, Generative Architectures. the autoencoder reconstruction capabilities.
We extensively validate our architecture on two widely-used
ACM Reference Format: virtual try-on benchmarks (i.e., Dress Code [45] and VITON-HD [9]),
Davide Morelli, Alberto Baldrati, Giuseppe Cartella, Marcella Cornia, Marco demonstrating superior quantitative and qualitative performance
Bertini, and Rita Cucchiara. 2023. LaDI-VTON: Latent Diffusion Textual-
than state-of-the-art methods and showing that diffusion models
Inversion Enhanced Virtual Try-On. In Proceedings of the 31st ACM In-
applied to the virtual try-on field can achieve higher realism than
ternational Conference on Multimedia (MM ’23), October 29-November 3,
2023, Ottawa, ON, Canada. ACM, New York, NY, USA, 15 pages. https: GAN-based counterparts (Figure 1).
//doi.org/10.1145/3581783.3612137 Contributions. To sum up, our contributions are as follows:
• We employ LDMs to solve the task of image-based virtual
try-on, an approach that, to the best of our knowledge, has
1 INTRODUCTION never been previously explored in this field.
The disruptive success of e-commerce and online shopping is • To reduce the reconstruction error of LDMs, we enhance the
steadily demanding a more streamlined and enjoyable customer autoencoder with learnable skip connections, enabling the
shopping experience, from personalized garment recommenda- preservation of details outside the inpainting region.
tion [10, 12, 31, 57] to visual product search [5, 24, 39, 44, 66]. Given • Additionally, to increase detail retention of the generation
the large availability of online images as accessories, garments, and process, we define a forward-only textual inversion module
other related products, Computer Vision and Multimedia research to further condition the model on the input try-on garment
play a crucial role by offering valuable tools for a more personal- without losing texture information.
ized user experience. Among them, image-based virtual try-on has • Extensive experiments validate the effectiveness of each com-
recently attracted significant interest in the research community ponent of our architecture, which achieves state-of-the-art
with the introduction of several architectures [26, 45, 62] that, given results on two widely used benchmarks for the task. We
an image of a person and a garment taken from a catalog, allow to believe our results can highlight how virtual try-on can
dress the person with the given try-on garment. strongly benefit from using LDMs and serve as a starting
The generation process carried out by current state-of-the-art point for future research in the field.
methods for the task [3, 21, 37, 45] entirely relies on Generative
Adversarial Networks (GANs) [22]. During the last years, a new
family of generative architectures, namely diffusion models [29,
59], have shown superior image generation quality compared to 2 RELATED WORK
GANs [14], also with a more stable training procedure. However, Image-Based Virtual Try-On. Image-based virtual try-on [3, 9,
considering the high computational demand typical of diffusion 18, 26, 32, 45, 62] aims to transfer a desired garment onto the corre-
models, Rombach et al. [52] have recently tackled the problem by sponding region of a target subject while preserving human pose
introducing a latent-based version that works in the latent space of and identity. One of the pioneering works in this field is VITON [26],
a pre-trained autoencoder, thus finding the best trade-off between a framework composed of an encoder-decoder generator that pro-
computational load and image quality. duces a coarse result further improved by a refinement network that
Motivated by the tremendous success of these generative models, exploits the warped clothing item obtained through a TPS trans-
in this work, we introduce and explore for the first time an image- formation [15]. Some follow-up works have been oriented towards
based virtual try-on method based on Latent Diffusion Models the enhancement of the warping module. Wang et al. [62] proposed
(LDMs) [52], demonstrating their successful possible applications a learnable TPS module to mitigate the problem of clothing de-
in this field. We design a novel diffusion-based architecture con- tails preservation, which has subsequently been improved either by
ditioned on the target in-shop garment and human keypoints to combining TPS with affine transformations [19, 38] or taking into
keep the model’s body pose unchanged. To preserve the target account generated semantic layouts [68] and body information [17].
garment texture in the generation process, we propose to augment Another research line focuses on the generation phase and re-
LDMs with a textual inversion network able to map the visual fea- finement of the result [21, 32, 45]. Issenhuth et al. [32], for example,
tures of the in-shop garment to the CLIP textual token embedding presented a distillation-based teacher-student architecture that does
space [49]. We then condition the LDM generation through the not leverage a predicted semantic layout during the generation. This
cross-attention mechanism using the predicted tokens embeddings. idea has further been explored in [21] with the introduction of an
While LDMs can generate highly realistic images, one of their additional tutor knowledge module to improve the generation qual-
drawbacks is that they struggle when dealing with high-frequency ity. Differently, Morelli et al. [45] focused on the semantics of the
details in the pixel space. This problem stems from the spatial generated results and proposed a semantic-aware discriminator
compression performed by the autoencoder, which gives access working at the pixel level instead of the image or patch level. Lee et
to a lower-dimensional latent space where high-frequency details al. [37] solved the misalignment problem by designing a unified
may not be accurately represented [52]. In our setting, this can pipeline that combines the warping and segmentation stages to
lead to details loss in the final generated images, especially when achieve better high-resolution results.
handling the model’s hands, feet, and face. To address this issue, we A common aspect linking all current methods is that the genera-
introduce the Enhanced Mask-Aware Skip Connection (EMASC) tion phase relies on GANs [22]. Driven by the enormous success
module, a learnable skip connection that transfers the details from of diffusion models [29] in different fields, we are the first, to the
LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On MM ’23, October 29-November 3, 2023, Ottawa, ON, Canada

best of our knowledge, to propose an image-based virtual try-on input clothing item details, we propose to add a novel forward-only
architecture entirely relying on the aforesaid generative models. textual inversion technique during the generation process. Finally,
Diffusion Models. A fundamental line of research in the image we enhance the image reconstruction autoencoder of Stable Dif-
synthesis field is the one marked by diffusion models [14, 29, 30, fusion with masked skip connections, thus improving the quality
46, 59, 60]. Inspired by non-equilibrium statistical physics, Sohl- of generated images and better preserving the fine-grained details
Dickstein et al. [59] defined a tractable generative model of data of the original model image. Figure 2 depicts an overview of the
distribution by iteratively destroying the data structure through a proposed model.
forward diffusion process and then reconstructing with a learned
reverse diffusion process. Some years later, Ho et al. [29] success- 3.1 Preliminaries
fully demonstrated that this process is applicable to generate high- Stable Diffusion. It consists of an autoencoder A with an en-
quality images. Nichol et al. [46] further improved the work pre- coder E and a decoder D, a text time-conditional U-Net denoising
sented in [29] by learning the variance parameter of the reverse model 𝜖𝜃 , and a CLIP text encoder 𝑇𝐸 , which takes text 𝑌 as input.
diffusion process and generating the output with fewer forward The encoder E compresses an image 𝐼 ∈ R3×𝐻 ×𝑊 into a lower-
passes without sacrificing sample quality. While these methods dimensional latent space in R4×ℎ×𝑤 , where ℎ = 𝐻8 and 𝑤 = 𝑊8 ,
work in the pixel space, Rombach et al. [52] proposed a variant while the decoder D performs the inverse operation and decodes a
working in the latent space of a pre-trained autoencoder, enabling latent variable into the pixel space. For clarity, we refer to the 𝜖𝜃 con-
higher computational efficiency. volutional input as the spatial input 𝛾 (e.g., 𝑧𝑡 ) since convolutions
The impact of diffusion models has rapidly become disruptive preserve the spatial structure, and to the attention conditioning
in diverse tasks such as text-to-image synthesis [23, 47, 50, 56], input as 𝜓 (e.g., [𝑡,𝑇𝐸 (𝑌 )]). The training of the denoising network
image-to-image translation [55, 63, 70], image editing [2, 42, 67], 𝜖𝜃 is performed by minimizing the following loss function:
and inpainting [41, 47]. Strictly related to virtual try-on is the task
𝐿 = E E (𝐼 ),𝑌 ,𝜖∼N (0,1),𝑡 ∥𝜖 − 𝜖𝜃 (𝛾,𝜓 )∥ 22 ,
 
of human image generation, where pose preservation is often a (1)
strict constraint. On this line, Jiang et al. [33] focused on synthesiz- where 𝑡 represents the diffusing time step, 𝛾 = 𝑧𝑡 , 𝑧𝑡 is the encoded
ing full-body images given human pose and textual descriptions of image E (𝐼 ) where we stochastically add Gaussian noise 𝜖 ∼ N (0, 1),
shapes and textures of clothes, generating the output via sampling and 𝜓 = [𝑡;𝑇𝐸 (𝑌 )].
from a learned texture-aware codebook. Bhunia et al. [7] tackled the We aim to generate a new image 𝐼˜ that replaces a target garment
task of pose-guided human generation by developing a texture diffu- in the model input image 𝐼 with an in-shop garment 𝐶 provided
sion block based on cross attention and conditioned on multi-scale by the user while retaining the model’s physical characteristics,
texture patterns from the encoded source image. Baldrati et al. [6], pose, and identity. This task can be seen as a particular type of
instead, proposed to guide the generation process constraining a inpainting, specialized in replacing garment information in human-
latent diffusion model with the model pose, the garment sketch, based images according to a target garment image provided by the
and a textual description of the garment itself. user. For this reason, we use the Stable Diffusion inpainting pipeline
Textual Inversion. Textual inversion is a recent technique pro- as the starting point of our approach. It takes as spatial input 𝛾 the
posed in [20] to learn a pseudo word in the embedding space of the channel-wise concatenation of an encoded masked image E (𝐼𝑀 ), a
text encoder starting from visual concepts. Following [20], several resized binary inpainting mask 𝑚 ∈ {0, 1}1×ℎ×𝑤 , and the denoising
promising methods [11, 25, 43, 54] have been designed to enable network input 𝑧𝑡 . Specifically, 𝐼𝑀 is the model image 𝐼 masked
personalized image generation and editing. Ruiz et al. [54] pre- according to the inpainting mask 𝑀 ∈ {0, 1}1×𝐻 ×𝑊 , and the binary
sented a fine-tuning technique to bind an identifier with a subject inpainting mask 𝑚 is the resized version according to the latent
represented by a few images and adopted a class-specific prior space spatial dimension of the original inpainting mask 𝑀. To
preservation loss to mitigate language drift. Similarly, Kumari et summarize, the spatial input of the inpainting denoising network
al. [36] proposed a different fine-tuning method to enable multi- is 𝛾 = [𝑧𝑡 ; 𝑚; E (𝐼𝑀 )] ∈ R (4+1+4) ×ℎ×𝑤 .
concept composition and showed that updating only a small subset CLIP. It is a vision-language model [49] which aligns visual and
of model weights is sufficient to integrate new concepts. On a differ- textual inputs in a shared embedding space. In particular, CLIP
ent line, Han et al. [25] decomposed the CLIP embedding space [49] consists of a visual encoder 𝑉𝐸 and a text encoder 𝑇𝐸 that extract
based on semantics and enabled image manipulation without re- feature representations 𝑉𝐸 (𝐼 ) ∈ R𝑑 and 𝑇𝐸 (𝐸𝐿 (𝑌 )) ∈ R𝑑 for an
quiring any additional fine-tuning. input image 𝐼 and its corresponding text caption 𝑌 , respectively.
Here, 𝑑 is the size of the CLIP embedding space, and 𝐸𝐿 is the
3 PROPOSED METHOD embedding lookup layer which maps each 𝑌 tokenized word to the
While most of the existing virtual try-on approaches leverage gener- token embedding space W.
ative adversarial networks [26, 32, 45, 62], we propose a novel solu- The proposed approach introduces a novel textual inversion
tion based, for the first time, on Latent Diffusion Models (LDMs). In technique to generate a representation of the in-shop garment 𝐶.
particular, our work employs the Stable Diffusion architecture [52] We feed this representation to the CLIP text encoder and use it to
as a starting point to perform the virtual try-on task. To augment condition the diffusion process. It consists in mapping the visual
the text-to-image model with try-on capabilities, we modify the features of 𝐶 into a set of 𝑁 new token embeddings 𝑉𝑛∗ ∈ W, 𝑛 =
architecture to take as input both the try-on garment and the pose {1, . . . , 𝑁 }. Following the terminology introduced in [4], we refer to
information of the target model. In addition, to better preserve the these embeddings as Pseudo-word Tokens Embeddings (PTEs) since
MM ’23, October 29-November 3, 2023, Ottawa, ON, Canada Davide Morelli, Alberto Baldrati, Giuseppe Cartella, Marcella Cornia, Marco Bertini, and Rita Cucchiara

Embedding lookup
a photo of a
Va Vphoto Vof Va Vmodel

Tokenizer
V� model

ViT Layer
MLP
Vdress V �
*
VN*
wearing a
C dress q
Textual inversion Module

T� C Concatena�on V* Predicted pseudo-word

z
p
m

C
CW

EMASC

EMASC

EMASC
Denoising UNet
IM ~
I
Figure 2: Overview of the proposed LaDI-VTON model. On the top, the textual inversion module generates a representation
of the in-shop garment. This information conditions the Stable Diffusion model along with other convolutional inputs. The
decoder D is enriched with the Enhanced Mask-Aware Skip Connection (EMASC) modules to reduce the reconstruction error,
improving the high-frequency details in the final image.

they do not correspond to any linguistically meaningful entity but Textual Inversion. Given the in-shop image 𝐶, the aim of the
rather are a representation of the in-shop garment visual features textual inversion adapter 𝐹𝜃 is to predict a set of pseudo-word
in the token embedding space W. token embeddings {𝑉1∗, . . . , 𝑉𝑁∗ } able to well represent the image 𝐶
in the CLIP token embedding space W. We then use the predicted
3.2 Textual-Inversion Enhanced Virtual Try-On PTEs to condition the Stable Diffusion denoising network 𝜖𝜃 and
To tackle the virtual try-on task, we propose injecting in the Stable obtain the final image 𝐼˜ where the model in 𝐼 is wearing the garment
Diffusion textual conditioning branch additional information from in 𝐶. For clarity, we intend that a set of PTEs represent well a target
the target garment 𝐶 extracted through textual inversion. In partic- image if a Stable Diffusion model conditioned on the concatenation
ular, starting from the features of the in-shop garment 𝐶 extracted of a generic prompt and the predicted pseudo-words can reconstruct
from the CLIP visual encoder, we learn a textual inversion adapter the target image itself.
𝐹𝜃 to predict a set of fine-grained PTEs describing the in-shop gar- We first build a textual prompt 𝑞 that guides the diffusion process
ment 𝐶 itself. These PTEs lie in the CLIP token embedding space to perform the virtual try-on task, tokenize it and map each token
W and thus can be used as an additional conditioning signal. into the token embedding space using the CLIP embedding lookup
We also propose to extend the Stable Diffusion inpainting module, obtaining 𝑉𝑞 . Then, we encode the image 𝐶 using the CLIP
pipeline to accept the model pose map 𝑃 ∈ R18×𝐻 ×𝑊 , where each visual encoder 𝑉𝐸 and feed the features extracted from the last
channel is associated with a human keypoint, and the warped in- hidden layer to the textual inversion adapter 𝐹𝜃 , which maps the
shop garment 𝐶𝑊 ∈ R3×𝐻 ×𝑊 , representing the target garment 𝐶 input visual features to the CLIP token embedding space W. We
warped according to the model body pose. While the pose map then concatenate the prompt embedding vectors with the predicted
𝑃 enables the method to preserve the original human pose of the pseudo-word token embeddings as follows:
model 𝐼 , the warped garment 𝐶𝑊 helps the generation process to
properly fit the garment onto the model. 𝑌ˆ = Concat(𝑉𝑞 , 𝐹𝜃 (𝑉𝐸 (𝐶))). (2)
Data Preparation. The warped garment 𝐶𝑊 is obtained by training
a module that warps the in-shop garment 𝐶 fitting the model body We feed the embedded concatenation 𝑌ˆ to the CLIP text encoder 𝑇𝐸
shape in 𝐼 . We employ the geometric matching module proposed and use the output to condition the denoising network 𝜖𝜃 leveraging
in [62] and refine the results with a U-Net-based component [53]. the existing Stable Diffusion textual cross-attention.
The virtual try-on task involves replacing one or more garments the To train the textual inversion adapter 𝐹𝜃 , we use the inpainting
target model is wearing. With this aim, we define the inpainting area pipeline of the out-of-the-box Stable Diffusion model as 𝜖𝜃 . Specifi-
determined by the mask 𝑀 to fully encompass the target garment. cally, it takes as input the encoded masked target model E (𝐼𝑀 ), the
We adopt the method proposed in previous works such as [32, 45] inpainting mask 𝑀, and the latent variable 𝑧. When training the
to ensure the mask completely covers the target garment. adapter 𝐹𝜃 , we freeze all the other model parameters.
LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On MM ’23, October 29-November 3, 2023, Ottawa, ON, Canada

3.3 Enhanced Mask-Aware Skip Connections


The autoencoder A of LDMs enables the denoising network 𝜖𝜃 to
work within a latent space smaller than the pixel space. Compared to
standard diffusion networks, this behavior is essential to reduce the
IM M

EMASC

EMASC

EMASC
parameters 𝜖𝜃 of the latent diffusion denoising network allowing it
to reach the best trade-off between image quality and computational
Shared M M M ~
Param load [52]. We remind that given an image 𝐼 ∈ R3×𝐻 ×𝑊 , the Stable
𝐻 𝑊
Diffusion encoder E compresses it in a latent space 𝑍 ∈ R4× 8 × 8 ,
resulting in a total compression of 48×. However, this trade-off
comes at a cost especially when dealing with human images and
I Î small high-frequency details such as hands, feet, and faces. We
argue that the autoencoder reconstruction error partially depends
Not M Masking Train Freezed on the data loss deriving from the latent space compression.
~
To address the problem, we propose to extend the autoen-
Figure 3: Overview of the proposed autoencoder with En- coder architecture with an Enhanced Mask-Aware Skip Connection
hanced Mask-Aware Skip Connection (EMASC) modules. (EMASC) module whose aim is to learn to propagate relevant in-
formation from different layers of the encoder E to corresponding
ones of the decoder D. In particular, instead of skipping the in-
To the best of our knowledge, this study marks the first instance formation of the encoded image 𝐼 to reconstruct, we pass to the
in which a textual inversion approach has been employed in the EMASC modules the intermediate features of the masked image
domain of virtual try-on. As shown in the experimental section, this 𝐼𝑀 encoding process, using the encoder E. This procedure allows
innovative conditioning methodology can significantly strengthen only the features not modified in the inpainting task to percolate,
the final results and contribute to preserving the details and texture keeping the process cloth agnostic. We implement EMASC employ-
of the original in-shop garment. Note that our proposed approach ing additive non-linear learned skip connections in which we mask
differs from traditional textual inversion techniques [20, 36, 54]. the output according to the inverted inpainting mask. Since the
Rather than directly optimizing the pseudo-word token embeddings EMASC inputs are the intermediate features of the masked model
through iterative methods, in our solution, the adapter 𝐹𝜃 is trained 𝐼𝑀 encoding process, masking the EMASC output features helps
to generate these embeddings in a single forward pass. avoid propagating the masked regions through the skip connections.
Formally, the EMASC module is defined as follows:
Diffusion Virtual Try-On Model. To perform the complete vir- 𝐸𝑀𝐴𝑆𝐶𝑖 = 𝑓 (𝐸𝑖 ) ∗ 𝑁𝑂𝑇 (𝑚𝑖 )
tual try-on task, we employ the additional inputs described above (4)
(i.e., textual-inverted information 𝑌ˆ of the in-shop garment, the 𝐷𝑖 = 𝐷𝑖 −1 + 𝐸𝑀𝐴𝑆𝐶𝑖
pose map 𝑃, and the garment fitted to the model body shape 𝐶𝑊 ) where 𝑓 is a learned non-linear function, 𝐸𝑖 is the 𝑖-th feature map
to condition the Stable Diffusion inpainting pipeline. In particular, coming from the encoder E, 𝐷𝑖 is the corresponding 𝑖-th decoder
we extend the spatial input 𝛾 ∈ R9×ℎ×𝑤 of the denoising network feature map, and 𝑚𝑖 is obtained by resizing the mask 𝑀 according to
𝜖𝜃 concatenating it with the resized pose map 𝑝 ∈ R18×ℎ×𝑤 and the 𝐸𝑖 spatial dimension. An overview of the proposed autoencoder
the encoded warped garment E (𝐶𝑊 ) ∈ R4×ℎ×𝑤 . The final spatial enhanced with EMASC modules is reported in Figure 3.
input results in 𝛾 = [𝑧𝑡 ; 𝑚; E (𝐼𝑀 ); 𝑝; E (𝐶𝑊 )] ∈ R (9+18+4) ×ℎ×𝑤 . Notice that the EMASC modules only depend on the Stable Dif-
To enrich the input capacity of the denoising network 𝜖𝜃 with- fusion denoising autoencoder, and once trained, they can be easily
out needing to retrain it from scratch [6, 52], we propose to extend added to the standard Stable Diffusion pipeline in a plug-and-play
the kernel channels of the first convolutional layer by adding zero- manner without requiring additional training. We show that this
initialized weights to match the new input channel dimension. In simple proposed modification can reduce the compression informa-
such a way, we can retain the knowledge embedded in the original tion loss in the inpainting task, resulting in better high-frequency
denoising network while allowing the model to deal with the newly human-related reconstructed details.
proposed inputs. Since the warped garment 𝐶𝑊 is not always able
to properly represent the contextualization of the in-shop garment 4 EXPERIMENTAL EVALUATION
with the target model information, we also modify the Stable Dif- 4.1 Datasets and Evaluation Metrics
fusion textual input by using 𝑌ˆ obtained from the output of the
We perform experiments on two virtual try-on datasets, namely
trained textual inversion adapter 𝐹𝜃 as described in Eq. 2.
Dress Code [45] and VITON-HD [9], that feature high-resolution
As in standard LDMs, we train the proposed denoising network
image pairs of in-shop garments and model images in both paired
to predict the noise stochastically added to an encoded input 𝑧𝑡 =
and unpaired settings. While in the paired setting the in-shop gar-
E (𝐼 ). We specify the corresponding objective function as:
ment is the same as the model is wearing, in the unpaired one, a
different garment is selected for the virtual try-on task.
𝐿 = E E (𝐼 ),𝑌ˆ ,𝜖∼N (0,1),𝑡,E (𝐼 ),𝑀,𝑝,E (𝐶 ) ∥𝜖 − 𝜖𝜃 (𝛾,𝜓 )∥ 22 ,
 
(3) The Dress Code dataset [45] features over 53,000 image pairs
𝑀 𝑊
of clothes and human models wearing them. The dataset includes
where 𝜓 = 𝑡;𝑇𝐸 (𝑌ˆ ) .
 
high-resolution images (i.e., 1024 × 768) and garments belonging to
MM ’23, October 29-November 3, 2023, Ottawa, ON, Canada Davide Morelli, Alberto Baldrati, Giuseppe Cartella, Marcella Cornia, Marco Bertini, and Rita Cucchiara

Table 1: Quantitative results on the Dress Code dataset [45]. The * marker indicates results reported in previous works, which
may differ in terms of metric implementation. Best results are reported in bold.

Upper-body Lower-body Dresses All


Model FIDu ↓ KIDu ↓ FIDu ↓ KIDu ↓ FIDu ↓ KIDu ↓ LPIPS ↓ SSIM ↑ FIDp ↓ KIDp ↓ FIDu ↓ KIDu ↓
PF-AFN* [21] 14.32 - 18.32 - 13.59 - - - - - - -
HR-VITON* [37] 16.86 - 22.81 - 16.12 - - - - - - -
CP-VTON [62] 48.31 35.25 51.29 38.48 25.94 15.81 0.186 0.842 28.44 21.96 31.19 25.17
CP-VTON† [62] 22.18 12.09 18.85 10.24 21.83 12.31 0.095 0.898 12.90 9.81 13.77 10.12
PSAD [45] 17.51 7.15 19.68 8.90 17.07 6.66 0.058 0.918 8.01 4.90 10.61 6.17
LaDI-VTON 13.26 2.67 14.80 3.13 13.40 2.50 0.064 0.906 4.14 1.21 6.48 2.20

different macro-categories, such as upper-body clothes, lower-body Table 2: Quantitative results on the VITON-HD dataset [9].
clothes, and dresses. In our experiments, we employ the original The * marker indicates results reported in previous works.
splits of the dataset where 5,400 image pairs (1,800 for each category)
Model LPIPS ↓ SSIM ↑ FIDp ↓ KIDp ↓ FIDu ↓ KIDu ↓
compose the test set and the rest the training one. The VITON-HD
dataset [9] instead comprises 13,679 image pairs, each composed CP-VTON* [62] - 0.791 - - 30.25 40.12
of a frontal-view woman and an upper-body clothing item with a ACGPN* [68] - 0.858 - - 14.43 5.87
resolution equal to 1024 × 768. The dataset is divided into training VITON-HD [9] 0.116 0.863 11.01 3.71 12.96 4.09
and test sets of 11,647 and 2,032 pairs, respectively. HR-VITON [37] 0.097 0.878 10.88 4.48 13.06 4.72
To quantitatively evaluate our model, we employ evaluation LaDI-VTON 0.091 0.876 6.66 1.08 9.41 1.60
metrics to estimate the coherence and realism of the generation. In
particular, we use the Learned Perceptual Image Patch Similarity for each condition. This allows the later use of the classifier-free
(LPIPS) [69] and the Structural Similarity (SSIM) [64] to evaluate the guidance technique [30] at inference time. Following [1], we use the
coherence of the generated image compared to the ground-truth. fast variant of the multi-conditional classifier-free guidance, which
We compute these metrics on the paired setting of both datasets. allows computing the final result with a computational complexity
To measure the realism, we instead employ the Fréchet Inception independent from the amount of the input constraints.
Distance [28] and the Kernel Inception Distance [8] in both paired
Autoencoder with EMASC. We apply the proposed EMASC mod-
(i.e., FIDp and KIDp ) and unpaired (i.e., FIDu and KIDu ) settings.
ules to the variational autoencoder of the Stable Diffusion model.
For the LPIPS and SSIM implementation, we use the torch-metrics
In particular, each EMASC module consists of two convolutional
Python package [13], while for the FID and KID scores, we employ
layers, where a SiLU non-linearity [16] activates the first one. We
the implementation in [48].
apply the EMASC modules to the conv_in layer output and the
feature before the down_block connecting each encoder layer to its
4.2 Implementation Details corresponding decoder one. The convolutional layers have a kernel
We first train the EMASC modules, the textual-inversion adapter, size of 3, padding of 1, and stride of 1. The first convolutional layer
and the warping component. Then, we freeze the weights of all maintains the number of channels constant, while the second one
modules except for the textual inversion adapter and train the pro- adapts the channel axis dimension to the decoder features. Finally,
posed enhanced Stable Diffusion pipeline∗ . In all our experiments, we sum the EMASC output to the corresponding decoder features.
we generate images at 512 × 384 resolution. We train the EMASC modules for 40k steps with batch size 16,
Textual Inversion. The textual inversion network 𝐹𝜃 consists of a learning rate equal to 1e-5, AdamW as optimizer with 𝛽 1 = 0.9,
single ViT layer followed by a multi-layer perception composed of 𝛽 2 = 0.999, and weight decay 1e-2. Also, in this case, we perform 500
three fully-connected layers separated by a GELU non-linearity [27] warm-up steps with a linear schedule. We employ a combination
and a dropout layer [61]. We set the number of PTEs generated by of the L1 and VGG [34] loss functions, scaling the perceptual VGG
𝐹𝜃 to 16. We train 𝐹𝜃 for 200k steps, with batch size 16, learning rate loss term by a factor of 0.5. In our setting, we found the VGG loss
1e-5 with 500 warm-up steps using a linear schedule, AdamW [40] essential to avoid blurriness in the reconstructed images. During
as optimizer with 𝛽 1 = 0.9, 𝛽 2 = 0.999, and weight decay equal to training, the encoder E and decoder D are frozen (see Figure 3),
1e-2. As the visual encoder 𝑉𝐸 , we leverage the OpenCLIP ViT-H/14 and only the EMASC modules are learned.
model [65] pre-trained on LAION-2B [58].
Diffusion Virtual Try-On Model. We train the proposed virtual 4.3 Experimental Results
try-on pipeline for 200k iterations, with batch size 16 and the same Comparison with State-of-the-Art Models. We compare our
optimizer and scheduling strategy used to train the textual inver- method with several state-of-the-art competitors. For the Dress
sion network. At training time, we randomly mask the text, the Code dataset, we compare our method with CP-VTON [62] and
warped garment, and the pose map input with a probability of 0.2 PSAD [45], retrained from scratch using the same image resolu-
tion of our model (i.e., 512 × 384) using the source codes when
∗ https://huggingface.co/stabilityai/stable-diffusion-2-inpainting available or otherwise implementing them. Following [45], we also
LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On MM ’23, October 29-November 3, 2023, Ottawa, ON, Canada

CP-VTON† PSAD LaDI-VTON VITON-HD HR-VITON LaDI-VTON

Figure 4: Qualitative results generated by LaDI-VTON and competitors on Dress Code [45] (left) and VITON-HD [9] (right).

include an improved version of CP-VTON (i.e., CP-VTON† ) where Table 3: User study results on the unpaired test set of both
we add as additional input the masked image 𝐼𝑀 . For the VITON- datasets. We report the percentage of times an image from
HD dataset, instead, we compare our model with VITON-HD [9] LaDI-VTON is preferred against a competitor.
and HR-VITON [37] using source codes and checkpoints released
by the authors to extract the results. Given that some evaluation Dataset Model Realism Coherence
scores (e.g., LPIPS and FID) are very sensitive to different implemen- CP-VTON [62] 93.10 89.68
tations, to ensure a fair comparison, we compute the quantitative Dress Code CP-VTON† [62] 80.21 75.69
results of these methods using the same metric implementation of PSAD [45] 74.14 70.83
our model. For completeness, we also include in the comparison VITON-HD [9] 79.19 71.48
VITON-HD
some additional virtual try-on methods for which the results are HR-VITON [37] 77.95 60.98
from previous works and, therefore, may have been obtained using
different evaluation source codes. Table 4: Quantitative results on the entire Dress Code test
Table 1 reports the quantitative results on the Dress Code set [45] using different model configurations.
dataset. As can be seen, LaDI-VTON achieves comparable results
to PSAD [45] in terms of coherence with the inputs (i.e., LPIPS and Model LPIPS ↓ SSIM ↑ FIDp ↓ KIDp ↓ FIDu ↓ KIDu ↓
SSIM), while significantly outperforming all competitors in terms w/o text 0.071 0.902 4.99 1.61 8.50 3.70
of realism in both paired and unpaired settings. In particular, on w/ retrieved text 0.070 0.903 4.85 1.61 7.49 2.93
the Dress Code test set, our model reaches a FID score of 4.14 and w/ 𝐹𝜃 and standard SD 0.105 0.876 5.42 1.87 7.50 2.83
w/o warped garment 0.068 0.904 4.50 1.44 6.30 1.99
6.48 for the paired and unpaired settings, respectively. These results
are considerably lower than the best-performing competitor (i.e., LaDI-VTON 0.064 0.906 4.14 1.21 6.48 2.20
PSAD). In Table 2, we instead show the quantitative analysis of
the VITON-HD dataset. Also, in this case, LaDI-VTON surpasses is always selected more than 60% of the time, further confirming
all other competitors by a large margin in terms of FID and KID, the progress over previous methods.
demonstrating its effectiveness in this setting.
Configuration Analysis. In Table 4, we study the model perfor-
To qualitatively evaluate our results, we report in Figure 4 sample
mance by varying its configuration. We conduct this analysis on
images generated by our model and by the competitors. Notably,
the Dress Code test set. In particular, the experiment in the first row
our solution can generate high-realistic images and preserve the
replaces the Stable Diffusion textual input 𝑌ˆ with an empty string.
texture and details of the original in-shop garments, as well as the
The one in the second row replaces the Stable Diffusion textual
physical characteristics of target models.
input 𝑌ˆ with textual elements retrieved using the in-shop garment
Human Evaluation. To further evaluate the generation quality of image 𝐶 as the query for a CLIP-based model [6]. The results show
our model, we conduct a user study to measure both the realism of that the proposed textual inversion adapter outperforms the other
generated images and their coherence with the inputs given to the textual input alternatives. The third experiment regards the tex-
virtual try-on model. Overall, we collect around 2,000 evaluations tual inversion adapter condition abilities, in particular, we can see
for each test, involving more than 50 unique users. In Table 3, we that it is possible to obtain excellent results by using the proposed
report the percentage of times in which an image generated by our textual inversion adapter to condition an out-of-the-box Stable Dif-
model is preferred against a competitor. As can be seen, LaDI-VTON fusion model. Finally, we test the warped garment 𝐶𝑊 input in the
MM ’23, October 29-November 3, 2023, Ottawa, ON, Canada Davide Morelli, Alberto Baldrati, Giuseppe Cartella, Marcella Cornia, Marco Bertini, and Rita Cucchiara

Table 5: Quantitative analysis changing the number of pre- ORIGINAL w/o EMASC w/ EMASC
dicted 𝑉 ∗ . Results are reported on the Dress Code test set [45]
using the out-of-the-box Stable Diffusion as backbone.

#𝑉∗ LPIPS ↓ SSIM ↑ FIDp ↓ KIDp ↓ FIDu ↓ KIDu ↓


1 0.115 0.867 6.14 2.24 8.19 3.14
4 0.108 0.873 5.87 2.15 8.17 3.10
16 0.105 0.876 5.42 1.87 7.50 2.83
32 0.103 0.878 5.37 1.80 7.66 2.92

Table 6: Analysis on the effectiveness of the proposed En-


hanced Mask Aware Skip Connection modules. Results are
reported on Dress Code [45] and VITON-HD [9].

Model EMASC Masked LPIPS ↓ SSIM ↑


SD VAE None - 0.0214 0.9538
Dress Code

SD VAE Linear ✓ 0.0196 0.9636


SD VAE Non-Linear ✗ 0.0183 0.9646
SD VAE Non-Linear ✓ 0.0181 0.9652
LaDI-VTON None - 0.0642 0.8985
LaDI-VTON Non-Linear ✓ 0.0640 0.9060
Figure 5: Image reconstruction results from the Stable Diffu-
SD VAE None - 0.0260 0.9336 sion autoencoder with and without the EMASC modules.
VITON-HD

SD VAE Linear ✓ 0.0220 0.9545


SD VAE Non-Linear ✗ 0.0203 0.9560 To better assess the contribution of the EMASC modules in the
SD VAE Non-Linear ✓ 0.0200 0.9561 autoencoder analysis, we compare the proposed EMASC method
LaDI-VTON None - 0.0960 0.8491 with two of its variants. The first variant involves removing the fea-
LaDI-VTON Non-Linear ✓ 0.0907 0.8758 ture masking after the final convolutional layer, while in the second
variant, we use only one convolutional layer without any non-linear
overall pipeline by removing it. In this case, we can see that this activation. We can notice that the masked non-linear EMASC mod-
additional input helps in the paired setting, but interestingly does ules achieve better results in all metrics on both datasets. In Figure 5,
not appreciably contribute to the unpaired one. we also show sample qualitative results of the Stable Diffusion au-
Analysis on V∗ . In Table 5, we show the results of the out-of-the- toencoder with and without EMASC modules. As it is possible to
box Stable Diffusion inpainting model conditioned using the textual see, the proposed learnable mask-aware skip-connections reduce
inversion module when varying the number of PTEs generated by the reconstruction loss resulting in better faces, hands, and feet.
𝐹𝜃 . Overall, we obtain the best scores in terms of FID and KID on Note that we achieve such results without retraining or fine-tuning
the unpaired setting using 16 pseudo-word tokens embeddings, the autoencoder.
while for the metrics on the paired setting employing 32 PTEs leads
to slightly better results. Since increasing the number of PTEs can 5 CONCLUSION
increase memory usage, the best trade-off between computational In this work, we propose the first latent diffusion-based approach for
load and performance is reached when using 16 PTEs. virtual try-on. To increase the detail retention of the input in-shop
Effectiveness of EMASC modules. We test the proposed EMASC garment, we exploit the textual inversion technique for the first
modules on the paired settings of Dress Code and VITON-HD. In time in this task, demonstrating its capability in conditioning the
particular, we test the EMASC performance on both the autoencoder generation process. Moreover, we introduce the EMASC modules
A for image reconstruction and the final model (i.e., LaDI-VTON) that can enhance the inpainting output image quality reducing the
for the complete virtual try-on task. In the first experiment, we autoencoder compression loss of LDMs. This advancement notably
simply encode and then decode the model image 𝐼 obtaining the improves the human perceived quality of high-frequency human
reconstructed image 𝐼ˆ (i.e., 𝐼ˆ = D (E (𝐼 ))). In the second experiment, body details such as hands, faces, and feet. Results show that the
we compare the performance of our complete model with and proposed LaDI-VTON model outperforms by a large margin the
without the EMASC modules. While for the first experiment, LPIPS competitors in terms of realism on both Dress Code and VITON-HD
and SSIM are computed by comparing the model image 𝐼 with its datasets, two widely used benchmarks for the task.
reconstruction 𝐼ˆ, in the second experiment, we evaluate the metrics
by comparing the model image 𝐼 with its reconstruction 𝐼˜, where we ACKNOWLEDGMENTS
define 𝐼˜ as the output of the virtual try-on pipeline. Results reported This work has partially been supported by the European Horizon
in Table 6 show that the proposed method can enhance both the 2020 Programme (grant number 101004545 - ReInHerit) and by
reconstruction capabilities of the Stable Diffusion autoencoder and the PRIN project “CREATIVE: CRoss-modal understanding and
the output performance of the final virtual try-on pipeline leading gEnerATIon of Visual and tExtual content” (CUP B87G22000460001),
to better evaluation scores. co-funded by the Italian Ministry of University.
LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On MM ’23, October 29-November 3, 2023, Ottawa, ON, Canada

A CLOTHES WARPING PROCEDURE Table 7: Quantitative results per category on the Dress Code
dataset [45].
To warp the in-shop garment 𝐶 to fit the model’s body shape shown
in 𝐼 and obtain the warped in-shop garment 𝐶𝑊 , we exploit the Upper-body
geometric matching module proposed in [62] and a U-Net based [53]
Model LPIPS ↓ SSIM ↑ FIDp ↓ KIDp ↓ FIDu ↓ KIDu ↓
refinement component.
Specifically, the geometric matching module computes a corre- CP-VTON [62] 0.176 0.851 46.47 33.82 48.31 35.25
lation map between the encoded representations of the in-shop CP-VTON† [62] 0.078 0.918 19.70 11.69 22.18 12.09
PSAD [45] 0.049 0.938 13.87 6.40 17.51 7.15
garment 𝐶 and a cloth-agnostic person representation composed
LaDI-VTON 0.049 0.928 9.53 1.98 13.26 2.67
of the pose map 𝑃 and the masked model image 𝐼𝑀 . We obtain
these encoded representations using two separate convolutional Lower-body
networks. Based on the computed correlation map, we predict the Model LPIPS ↓ SSIM ↑ FIDp ↓ KIDp ↓ FIDu ↓ KIDu ↓
spatial transformation parameters 𝜃 of a thin-plate spline geometric CP-VTON [62] 0.220 0.828 47.29 32.40 51.29 38.48
transformation [15, 51] represented by TPS𝜃 . We use the 𝜃 param- CP-VTON† [62] 0.083 0.913 18.85 10.33 18.85 10.24
eters to compute the coarse warped garment 𝐶ˆ starting from the PSAD [45] 0.051 0.932 13.14 5.59 19.68 8.90
in-shop garment 𝐶 (i.e., 𝐶ˆ = TPS𝜃 (𝐶)). LaDI-VTON 0.051 0.922 8.52 1.04 14.80 3.13
To further refine the result, we use a U-Net model that takes as Dresses
input the concatenation of the coarse warped garment 𝐶, ˆ the pose Model LPIPS ↓ SSIM ↑ FIDp ↓ KIDp ↓ FIDu ↓ KIDu ↓
map 𝑃, and the masked model image 𝐼𝑀 and predicts the refined CP-VTON [62] 0.162 0.847 22.54 13.21 25.94 15.81
warped garment 𝐶𝑊 as follows: CP-VTON† [62] 0.123 0.863 18.75 11.07 21.83 12.31
PSAD [45] 0.074 0.885 12.38 4.68 17.07 6.66
ˆ 𝑃, 𝐼𝑀 ).
𝐶𝑊 = U-Net(𝐶, (5) LaDI-VTON 0.089 0.868 9.07 1.12 13.40 2.50

Training details. We first train the geometric matching module for


50 epochs with batch size 32 using the L1 loss function. Then, the findings, as illustrated in Figure 9 and Figure 10, validate the quan-
U-Net refinement module is trained for another 50 epochs using a titative results shown in the main paper, demonstrating that the
combination of the L1 and VGG [34] loss functions, where we scale LaDI-VTON method produces more realistic outcomes. As it can
the perceptual loss by a factor of 0.25. For both training phases, we be seen, our approach surpasses other methods in terms of visual
set the learning rate to 1e-4 and use Adam [35] as optimizer with quality, generating highly realistic images that showcase the effec-
𝛽 1 = 0.5 and 𝛽 2 = 0.99. tiveness and robustness of our pipeline. These additional qualitative
results further support the conclusion drawn from the quantitative
B ADDITIONAL RESULTS analysis and have significant implications for the future develop-
ment of virtual try-on technology.
As a complement of Table 1 of the main paper, Table 7 presents
the complete quantitative results for each category of the Dress
Code dataset. Our method, denoted as LaDI-VTON, demonstrates C LIMITATIONS
superior performance compared to all competitors across all three Beyond the difficulties in reproducing high-frequency details, cur-
Dress Code categories in terms of realism metrics such as FID rent Stable Diffusion-based architectures still present some defi-
and KID in both the paired and unpaired settings. When assessing ciencies in reproducing readable and coherent textual details. On
input adherence metrics such as LPIPS and SSIM, our approach this line, one of the key limitations of LaDI-VTON is that it can not
achieves better results than CP-VTON [62] and CP-VTON† while always synthesize logos and words depicted on the try-on garment
still obtaining comparable results to PSAD [45]. faithfully. Some failure cases of the proposed approach are depicted
in Figure 6. Our model is able to well reproduce the general shape of
Qualitative results. To provide further evidence of the effective- logos and texts preserving the overall structure of the pattern (see
ness of the proposed EMASC modules in improving the quality of second row), but however, struggles to define precise and highly
image reconstruction, we present additional qualitative results on comprehensible letters or numbers. We argue that this flaw results
the Dress Code and VITON-HD datasets. Specifically, Figures 7 and from our model’s reliance on Stable Diffusion, and it could be ad-
8 depict examples of reconstructed images with and without the dressed by using a non-latent diffusion approach, however, with
EMASC modules, showing the original image, the reconstructed im- higher computational load and resource demand.
age without EMASC, the reconstructed image with linear EMASC,
and the reconstructed image with the proposed non-linear EMASC.
REFERENCES
The results show that the EMASC modules improve the quality of
[1] Omri Avrahami, Thomas Hayes, Oran Gafni, Sonal Gupta, Yaniv Taigman, Devi
high-frequency human body details, such as hands, feet, and faces. Parikh, Dani Lischinski, Ohad Fried, and Xi Yin. 2023. SpaText: Spatio-Textual
In the Dress Code dataset (Figure 7), the EMASC module helps Representation for Controllable Image Generation. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition.
preserve the shapes of toes in the third and fourth rows. Similarly, [2] Omri Avrahami, Dani Lischinski, and Ohad Fried. 2022. Blended diffusion for
in the VITON-HD dataset (Figure 8), the EMASC module helps text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference
preserve the color and shape of eyes and avoid artifacts. on Computer Vision and Pattern Recognition.
[3] Shuai Bai, Huiling Zhou, Zhikang Li, Chang Zhou, and Hongxia Yang. 2022.
Finally, we report additional qualitative results of our proposed Single stage virtual try-on via deformable attention flows. In Proceedings of the
virtual try-on pipeline and its competitors on two datasets. Our European Conference on Computer Vision.
MM ’23, October 29-November 3, 2023, Ottawa, ON, Canada Davide Morelli, Alberto Baldrati, Giuseppe Cartella, Marcella Cornia, Marco Bertini, and Rita Cucchiara

[19] Matteo Fincato, Federico Landi, Marcella Cornia, Fabio Cesari, and Rita Cucchiara.
2021. VITON-GT: An Image-based Virtual Try-On Model with Geometric Trans-
formations. In Proceedings of the International Conference on Pattern Recognition.
[20] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal
Chechik, and Daniel Cohen-Or. 2023. An Image is Worth One Word: Personal-
izing Text-to-Image Generation using Textual Inversion. In Proceedings of the
International Conference on Learning Representations.
[21] Yuying Ge, Yibing Song, Ruimao Zhang, Chongjian Ge, Wei Liu, and Ping Luo.
2021. Parser-free virtual try-on via distilling appearance flows. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[22] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley,
Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial
nets. In Advances in Neural Information Processing Systems.
[23] Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu
Yuan, and Baining Guo. 2022. Vector quantized diffusion model for text-to-image
synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition.
[24] M Hadi Kiapour, Xufeng Han, Svetlana Lazebnik, Alexander C Berg, and Tamara L
Berg. 2015. Where to buy it: Matching street clothing photos in online shops. In
Proceedings of the IEEE/CVF International Conference on Computer Vision.
[25] Inhwa Han, Serin Yang, Taesung Kwon, and Jong Chul Ye. 2023. Highly Personal-
ized Text Embedding for Image Manipulation by Stable Diffusion. arXiv preprint
arXiv:2303.08767 (2023).
[26] Xintong Han, Zuxuan Wu, Zhe Wu, Ruichi Yu, and Larry S Davis. 2018. Viton: An
image-based virtual try-on network. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition.
[27] Dan Hendrycks and Kevin Gimpel. 2016. Gaussian Error Linear Units (GELUs).
arXiv preprint arXiv:1606.08415 (2016).
Figure 6: Failure cases of LaDI-VTON on Dress Code [45] (1st [28] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Günter
row) and VITON-HD [9] (2nd row) Klambauer, and Sepp Hochreiter. 2017. GANs trained by a two time-scale update
rule converge to a Nash equilibrium. In Advances in Neural Information Processing
[4] Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, and Alberto Del Bimbo. 2023. Systems.
Zero-Shot Composed Image Retrieval with Textual Inversion. In Proceedings of [29] Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising Diffusion Probabilistic
the IEEE/CVF International Conference on Computer Vision. Models. In Advances in Neural Information Processing Systems.
[5] Alberto Baldrati, Marco Bertini, Tiberio Uricchio, and Alberto Del Bimbo. 2022. [30] Jonathan Ho and Tim Salimans. 2021. Classifier-Free Diffusion Guidance. In
Conditioned and Composed Image Retrieval Combining and Partially Fine-Tuning Advances in Neural Information Processing Systems Workshops.
CLIP-Based Features. In Proceedings of the IEEE/CVF Conference on Computer [31] Wei-Lin Hsiao and Kristen Grauman. 2018. Creating capsule wardrobes from
Vision and Pattern Recognition Workshops. fashion images. In Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition.
[6] Alberto Baldrati, Davide Morelli, Giuseppe Cartella, Marcella Cornia, Marco Bertini, and Rita Cucchiara. 2023. Multimodal Garment Designer: Human-Centric Latent Diffusion Models for Fashion Image Editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
[7] Ankan Kumar Bhunia, Salman Khan, Hisham Cholakkal, Rao Muhammad Anwer, Jorma Laaksonen, Mubarak Shah, and Fahad Shahbaz Khan. 2023. Person Image Synthesis via Denoising Diffusion Model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[8] Mikołaj Bińkowski, Dougal J Sutherland, Michael Arbel, and Arthur Gretton. 2018. Demystifying MMD GANs. In Proceedings of the International Conference on Learning Representations.
[9] Seunghwan Choi, Sunghyun Park, Minsoo Lee, and Jaegul Choo. 2021. VITON-HD: High-Resolution Virtual Try-On via Misalignment-Aware Normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[10] Guillem Cucurull, Perouz Taslakian, and David Vazquez. 2019. Context-aware visual compatibility prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[11] Giannis Daras and Alexandros G Dimakis. 2022. Multiresolution Textual Inversion. In Advances in Neural Information Processing Systems Workshops.
[12] Lavinia De Divitiis, Federico Becattini, Claudio Baecchi, and Alberto Del Bimbo. 2023. Disentangling features for fashion recommendation. ACM Transactions on Multimedia Computing, Communications and Applications 19, 1s (2023), 1–21.
[13] Nicki Skafte Detlefsen, Jiri Borovec, Justus Schock, Ananya Harsh Jha, Teddy Koker, Luca Di Liello, Daniel Stancl, Changsheng Quan, Maxim Grechkin, and William Falcon. 2022. TorchMetrics - Measuring Reproducibility in PyTorch. Journal of Open Source Software 7, 70 (2022), 4101.
[14] Prafulla Dhariwal and Alexander Nichol. 2021. Diffusion Models Beat GANs on Image Synthesis. In Advances in Neural Information Processing Systems.
[15] Jean Duchon. 1977. Splines minimizing rotation-invariant semi-norms in Sobolev spaces. In Constructive Theory of Functions of Several Variables.
[16] Stefan Elfwing, Eiji Uchibe, and Kenji Doya. 2018. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks 107 (2018), 3–11.
[17] Benjamin Fele, Ajda Lampe, Peter Peer, and Vitomir Struc. 2022. C-VTON: Context-Driven Image-Based Virtual Try-On Network. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision.
[18] Emanuele Fenocchi, Davide Morelli, Marcella Cornia, Lorenzo Baraldi, Fabio Cesari, and Rita Cucchiara. 2022. Dual-Branch Collaborative Transformer for Virtual Try-On. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops.
[32] Thibaut Issenhuth, Jérémie Mary, and Clément Calauzenes. 2020. Do not mask what you do not need to mask: a parser-free virtual try-on. In Proceedings of the European Conference on Computer Vision.
[33] Yuming Jiang, Shuai Yang, Haonan Qiu, Wayne Wu, Chen Change Loy, and Ziwei Liu. 2022. Text2human: Text-driven controllable human image generation. ACM Transactions on Graphics 41, 4 (2022), 1–11.
[34] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. 2016. Perceptual losses for real-time style transfer and super-resolution. In Proceedings of the European Conference on Computer Vision.
[35] Diederik P Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations.
[36] Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. 2023. Multi-Concept Customization of Text-to-Image Diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[37] Sangyun Lee, Gyojung Gu, Sunghyun Park, Seunghwan Choi, and Jaegul Choo. 2022. High-Resolution Virtual Try-On with Misalignment and Occlusion-Handled Conditions. In Proceedings of the European Conference on Computer Vision.
[38] Kedan Li, Min Jin Chong, Jeffrey Zhang, and Jingen Liu. 2021. Toward accurate and realistic outfits visualization with attention to details. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[39] Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. 2016. DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[40] Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations.
[41] Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. 2022. RePaint: Inpainting Using Denoising Diffusion Probabilistic Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[42] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. 2022. SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations. In Proceedings of the International Conference on Learning Representations.
[43] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. 2023. Null-text Inversion for Editing Real Images using Guided Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[44] Davide Morelli, Marcella Cornia, Rita Cucchiara, et al. 2021. FashionSearch++: Improving consumer-to-shop clothes retrieval with hard negatives. In CEUR Workshop Proceedings.
[45] Davide Morelli, Matteo Fincato, Marcella Cornia, Federico Landi, Fabio Cesari, and Rita Cucchiara. 2022. Dress Code: High-Resolution Multi-Category Virtual Try-On. In Proceedings of the European Conference on Computer Vision.
[46] Alexander Quinn Nichol and Prafulla Dhariwal. 2021. Improved denoising diffusion probabilistic models. In Proceedings of the International Conference on Machine Learning.
[47] Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob Mcgrew, Ilya Sutskever, and Mark Chen. 2022. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. In Proceedings of the International Conference on Machine Learning.
[48] Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. 2022. On Aliased Resizing and Surprising Subtleties in GAN Evaluation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[49] Alec Radford, Jong Wook Kim, Chris Hallacy, A. Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the International Conference on Machine Learning.
[50] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv preprint arXiv:2204.06125 (2022).
[51] Ignacio Rocco, Relja Arandjelovic, and Josef Sivic. 2017. Convolutional neural network architecture for geometric matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[52] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-Resolution Image Synthesis With Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[53] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer Assisted Intervention.
[54] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. 2023. DreamBooth: Fine Tuning Text-to-image Diffusion Models for Subject-Driven Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[55] Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. 2022. Palette: Image-to-image diffusion models. In Proceedings of the ACM SIGGRAPH Conference.
[56] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo-Lopes, Burcu Karagol Ayan, Tim Salimans, et al. 2022. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. In Advances in Neural Information Processing Systems.
[57] Rohan Sarkar, Navaneeth Bodla, Mariya I Vasileva, Yen-Liang Lin, Anurag Beniwal, Alan Lu, and Gerard Medioni. 2023. OutfitTransformer: Learning Outfit Representations for Fashion Recommendation. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision.
[58] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa R Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. 2022. LAION-5B: An open large-scale dataset for training next generation image-text models. In Advances in Neural Information Processing Systems.
[59] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning.
[60] Jiaming Song, Chenlin Meng, and Stefano Ermon. 2021. Denoising Diffusion Implicit Models. In Proceedings of the International Conference on Learning Representations.
[61] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15, 56 (2014), 1929–1958.
[62] Bochao Wang, Huabin Zheng, Xiaodan Liang, Yimin Chen, Liang Lin, and Meng Yang. 2018. Toward characteristic-preserving image-based virtual try-on network. In Proceedings of the European Conference on Computer Vision.
[63] Tengfei Wang, Ting Zhang, Bo Zhang, Hao Ouyang, Dong Chen, Qifeng Chen, and Fang Wen. 2022. Pretraining is All You Need for Image-to-Image Translation. arXiv preprint arXiv:2205.12952 (2022).
[64] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. 2004. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13, 4 (2004), 600–612.
[65] Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, et al. 2022. Robust fine-tuning of zero-shot models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[66] Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, and Rogerio Feris. 2021. Fashion IQ: A New Dataset Towards Retrieving Images by Natural Language Feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[67] Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, and Fang Wen. 2023. Paint by Example: Exemplar-based Image Editing with Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[68] Han Yang, Ruimao Zhang, Xiaobao Guo, Wei Liu, Wangmeng Zuo, and Ping Luo. 2020. Towards photo-realistic virtual try-on by adaptively generating-preserving image content. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[69] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. 2018. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[70] Min Zhao, Fan Bao, Chongxuan Li, and Jun Zhu. 2022. EGSDE: Unpaired Image-to-Image Translation via Energy-Guided Stochastic Differential Equations. In Advances in Neural Information Processing Systems.
Figure 7: Image reconstruction qualitative results from the autoencoder of Stable Diffusion on sample images from the Dress
Code dataset. From left to right: the original image, the image from the out-of-the-box Stable Diffusion autoencoder without
EMASC modules, the image from the autoencoder with linear EMASC connections, and the image from the autoencoder with
non-linear EMASC connections.
Figure 8: Image reconstruction qualitative results from the autoencoder of Stable Diffusion on sample images from the VITON-
HD dataset. From left to right: the original image, the image from the out-of-the-box Stable Diffusion autoencoder without
EMASC modules, the image from the autoencoder with linear EMASC connections, and the image from the autoencoder with
non-linear EMASC connections.
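To make the comparison in Figures 7 and 8 more concrete, the snippet below is a minimal PyTorch sketch of how a linear and a non-linear learnable skip-connection module of this kind could look. It is an illustration under stated assumptions rather than the actual LaDI-VTON code: the class names, channel sizes, the SiLU non-linearity [16], and the additive merge into the decoder features are all choices made here for clarity.

```python
# Illustrative sketch only: a possible linear vs. non-linear learnable
# skip-connection module in the spirit of the EMASC comparison above.
# Names, channel sizes, and the additive merge are assumptions.
import torch
import torch.nn as nn


class LinearEMASC(nn.Module):
    """Linear variant: a single convolution projecting encoder features."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, enc_feat: torch.Tensor) -> torch.Tensor:
        return self.proj(enc_feat)


class NonLinearEMASC(nn.Module):
    """Non-linear variant: two convolutions with a SiLU activation in between."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
        )

    def forward(self, enc_feat: torch.Tensor) -> torch.Tensor:
        return self.block(enc_feat)


if __name__ == "__main__":
    # Toy usage: merge a processed encoder feature map into a decoder feature
    # map of the same spatial resolution via a simple residual addition.
    enc_feat = torch.randn(1, 128, 64, 48)  # hypothetical encoder activation
    dec_feat = torch.randn(1, 256, 64, 48)  # hypothetical decoder activation

    emasc = NonLinearEMASC(in_channels=128, out_channels=256)
    dec_feat = dec_feat + emasc(enc_feat)   # detail-preserving skip connection
    print(dec_feat.shape)                   # torch.Size([1, 256, 64, 48])
```

In this sketch, the non-linear variant differs from the linear one only by the intermediate activation and the extra convolution, which is the distinction the two rightmost columns of Figures 7 and 8 aim to visualize.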
Figure 9: Qualitative results generated by LaDI-VTON and competitors on the Dress Code dataset. From left to right: the original
image, the in-shop garment, and images generated by CP-VTON† [62], PSAD [45], LaDI-VTON (ours).
Figure 10: Qualitative results generated by LaDI-VTON and competitors on the VITON-HD dataset. From left to right: the original image, the in-shop garment, and images generated by VITON-HD [9], HR-VITON [37], LaDI-VTON (ours).