LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On
Davide Morelli, Alberto Baldrati, Giuseppe Cartella, Marcella Cornia, Marco Bertini, and Rita Cucchiara
University of Modena and Reggio Emilia, Modena, Italy
University of Florence, Florence, Italy
Figure 1: Images generated by the proposed LaDI-VTON model, given an input target model and a try-on clothing item from
both Dress Code [45] (1st row) and VITON-HD [9] (2nd row) datasets.
ABSTRACT
The rapidly evolving fields of e-commerce and metaverse continue to seek innovative approaches to enhance the consumer experience. At the same time, recent advancements in the development of diffusion models have enabled generative networks to create remarkably realistic images. In this context, image-based virtual try-on, which consists in generating a novel image of a target model wearing a given in-shop garment, has yet to capitalize on the potential of these powerful generative solutions. This work introduces LaDI-VTON, the first Latent Diffusion textual Inversion-enhanced model for the Virtual Try-ON task. The proposed architecture relies on a latent diffusion model extended with a novel additional autoencoder module that exploits learnable skip connections to enhance the generation process while preserving the model's characteristics. To effectively maintain the texture and details of the in-shop garment, we propose a textual inversion component that can map the visual features of the garment to the CLIP token embedding space and thus generate a set of pseudo-word token embeddings capable of conditioning the generation process. Experimental results on the Dress Code and VITON-HD datasets demonstrate that our approach outperforms the competitors by a consistent margin, achieving a significant milestone for the task. Source code and trained models are publicly available at: https://github.com/miccunifi/ladi-vton.

∗ Both authors contributed equally to this research.

MM '23, October 29-November 3, 2023, Ottawa, ON, Canada
© 2023 Association for Computing Machinery.
This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in Proceedings of the 31st ACM International Conference on Multimedia (MM '23), October 29-November 3, 2023, Ottawa, ON, Canada, https://doi.org/10.1145/3581783.3612137.
To the best of our knowledge, we are the first to propose an image-based virtual try-on architecture entirely relying on the aforesaid generative models.

Diffusion Models. A fundamental line of research in the image synthesis field is the one marked by diffusion models [14, 29, 30, 46, 59, 60]. Inspired by non-equilibrium statistical physics, Sohl-Dickstein et al. [59] defined a tractable generative model of the data distribution by iteratively destroying the data structure through a forward diffusion process and then reconstructing it with a learned reverse diffusion process. Some years later, Ho et al. [29] successfully demonstrated that this process can be applied to generate high-quality images. Nichol et al. [46] further improved the work presented in [29] by learning the variance parameter of the reverse diffusion process and generating the output with fewer forward passes without sacrificing sample quality. While these methods work in the pixel space, Rombach et al. [52] proposed a variant working in the latent space of a pre-trained autoencoder, enabling higher computational efficiency.

The impact of diffusion models has rapidly become disruptive in diverse tasks such as text-to-image synthesis [23, 47, 50, 56], image-to-image translation [55, 63, 70], image editing [2, 42, 67], and inpainting [41, 47]. Strictly related to virtual try-on is the task of human image generation, where pose preservation is often a strict constraint. On this line, Jiang et al. [33] focused on synthesizing full-body images given a human pose and textual descriptions of the shapes and textures of clothes, generating the output via sampling from a learned texture-aware codebook. Bhunia et al. [7] tackled the task of pose-guided human generation by developing a texture diffusion block based on cross attention and conditioned on multi-scale texture patterns from the encoded source image. Baldrati et al. [6], instead, proposed to guide the generation process by constraining a latent diffusion model with the model pose, the garment sketch, and a textual description of the garment itself.

Textual Inversion. Textual inversion is a recent technique proposed in [20] to learn a pseudo word in the embedding space of the text encoder starting from visual concepts. Following [20], several promising methods [11, 25, 43, 54] have been designed to enable personalized image generation and editing. Ruiz et al. [54] presented a fine-tuning technique to bind an identifier with a subject represented by a few images and adopted a class-specific prior preservation loss to mitigate language drift. Similarly, Kumari et al. [36] proposed a different fine-tuning method to enable multi-concept composition and showed that updating only a small subset of model weights is sufficient to integrate new concepts. On a different line, Han et al. [25] decomposed the CLIP embedding space [49] based on semantics and enabled image manipulation without requiring any additional fine-tuning.

3 PROPOSED METHOD
While most of the existing virtual try-on approaches leverage generative adversarial networks [26, 32, 45, 62], we propose a novel solution based, for the first time, on Latent Diffusion Models (LDMs). In particular, our work employs the Stable Diffusion architecture [52] as a starting point to perform the virtual try-on task. To augment the text-to-image model with try-on capabilities, we modify the architecture to take as input both the try-on garment and the pose information of the target model. In addition, to better preserve the input clothing item details, we propose to add a novel forward-only textual inversion technique during the generation process. Finally, we enhance the image reconstruction autoencoder of Stable Diffusion with masked skip connections, thus improving the quality of generated images and better preserving the fine-grained details of the original model image. Figure 2 depicts an overview of the proposed model.

3.1 Preliminaries
Stable Diffusion. It consists of an autoencoder A with an encoder E and a decoder D, a text- and time-conditional U-Net denoising model 𝜖𝜃, and a CLIP text encoder 𝑇𝐸, which takes text 𝑌 as input. The encoder E compresses an image 𝐼 ∈ R^{3×𝐻×𝑊} into a lower-dimensional latent space in R^{4×ℎ×𝑤}, where ℎ = 𝐻/8 and 𝑤 = 𝑊/8, while the decoder D performs the inverse operation and decodes a latent variable into the pixel space. For clarity, we refer to the convolutional input of 𝜖𝜃 as the spatial input 𝛾 (e.g., 𝑧𝑡), since convolutions preserve the spatial structure, and to the attention conditioning input as 𝜓 (e.g., [𝑡, 𝑇𝐸(𝑌)]). The denoising network 𝜖𝜃 is trained by minimizing the following loss function:

    𝐿 = E_{E(𝐼), 𝑌, 𝜖∼N(0,1), 𝑡} ‖𝜖 − 𝜖𝜃(𝛾, 𝜓)‖²₂,    (1)

where 𝑡 represents the diffusion time step, 𝛾 = 𝑧𝑡 is the encoded image E(𝐼) to which Gaussian noise 𝜖 ∼ N(0, 1) is stochastically added, and 𝜓 = [𝑡; 𝑇𝐸(𝑌)].
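For readers who prefer code, the following PyTorch-style sketch illustrates the training objective of Eq. 1 under simplifying assumptions: `vae_encoder`, `unet`, and `text_encoder` are stand-in callables for the corresponding Stable Diffusion components, and the noise schedule is a toy linear one rather than the actual DDPM schedule.

```python
import torch
import torch.nn.functional as F

def ldm_denoising_loss(vae_encoder, unet, text_encoder, image, caption_ids,
                       num_train_timesteps=1000):
    """Sketch of the noise-prediction objective of Eq. 1 (illustrative, not the authors' code)."""
    z0 = vae_encoder(image)                       # (B, 4, H/8, W/8): latent of the clean image
    noise = torch.randn_like(z0)                  # epsilon ~ N(0, 1)
    t = torch.randint(0, num_train_timesteps, (z0.shape[0],), device=z0.device)
    # Toy linear alpha-bar schedule; a real DDPM schedule would be used in practice.
    alpha_bar = (1.0 - t.float() / num_train_timesteps).view(-1, 1, 1, 1)
    z_t = alpha_bar.sqrt() * z0 + (1.0 - alpha_bar).sqrt() * noise   # spatial input gamma
    psi = text_encoder(caption_ids)               # attention conditioning input psi
    noise_pred = unet(z_t, t, psi)                # epsilon_theta(gamma, psi)
    return F.mse_loss(noise_pred, noise)
```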
We aim to generate a new image 𝐼̃ that replaces a target garment in the model input image 𝐼 with an in-shop garment 𝐶 provided by the user, while retaining the model's physical characteristics, pose, and identity. This task can be seen as a particular type of inpainting, specialized in replacing garment information in human-based images according to a target garment image provided by the user. For this reason, we use the Stable Diffusion inpainting pipeline as the starting point of our approach. It takes as spatial input 𝛾 the channel-wise concatenation of an encoded masked image E(𝐼𝑀), a resized binary inpainting mask 𝑚 ∈ {0, 1}^{1×ℎ×𝑤}, and the denoising network input 𝑧𝑡. Specifically, 𝐼𝑀 is the model image 𝐼 masked according to the inpainting mask 𝑀 ∈ {0, 1}^{1×𝐻×𝑊}, and the binary inpainting mask 𝑚 is the original inpainting mask 𝑀 resized to the latent space spatial dimension. To summarize, the spatial input of the inpainting denoising network is 𝛾 = [𝑧𝑡; 𝑚; E(𝐼𝑀)] ∈ R^{(4+1+4)×ℎ×𝑤}.

CLIP. It is a vision-language model [49] which aligns visual and textual inputs in a shared embedding space. In particular, CLIP consists of a visual encoder 𝑉𝐸 and a text encoder 𝑇𝐸 that extract feature representations 𝑉𝐸(𝐼) ∈ R^𝑑 and 𝑇𝐸(𝐸𝐿(𝑌)) ∈ R^𝑑 for an input image 𝐼 and its corresponding text caption 𝑌, respectively. Here, 𝑑 is the size of the CLIP embedding space, and 𝐸𝐿 is the embedding lookup layer which maps each tokenized word of 𝑌 to the token embedding space W.

The proposed approach introduces a novel textual inversion technique to generate a representation of the in-shop garment 𝐶. We feed this representation to the CLIP text encoder and use it to condition the diffusion process. It consists in mapping the visual features of 𝐶 into a set of 𝑁 new token embeddings 𝑉*𝑛 ∈ W, 𝑛 = {1, . . . , 𝑁}. Following the terminology introduced in [4], we refer to these embeddings as Pseudo-word Token Embeddings (PTEs) since they do not correspond to any linguistically meaningful entity but rather are a representation of the in-shop garment visual features in the token embedding space W.
Figure 2: Overview of the proposed LaDI-VTON model. On the top, the textual inversion module generates a representation
of the in-shop garment. This information conditions the Stable Diffusion model along with other convolutional inputs. The
decoder D is enriched with the Enhanced Mask-Aware Skip Connection (EMASC) modules to reduce the reconstruction error,
improving the high-frequency details in the final image.
3.2 Textual-Inversion Enhanced Virtual Try-On
To tackle the virtual try-on task, we propose injecting into the Stable Diffusion textual conditioning branch additional information from the target garment 𝐶 extracted through textual inversion. In particular, starting from the features of the in-shop garment 𝐶 extracted from the CLIP visual encoder, we learn a textual inversion adapter 𝐹𝜃 to predict a set of fine-grained PTEs describing the in-shop garment 𝐶 itself. These PTEs lie in the CLIP token embedding space W and thus can be used as an additional conditioning signal.

We also propose to extend the Stable Diffusion inpainting pipeline to accept the model pose map 𝑃 ∈ R^{18×𝐻×𝑊}, where each channel is associated with a human keypoint, and the warped in-shop garment 𝐶𝑊 ∈ R^{3×𝐻×𝑊}, representing the target garment 𝐶 warped according to the model body pose. While the pose map 𝑃 enables the method to preserve the original human pose of the model 𝐼, the warped garment 𝐶𝑊 helps the generation process to properly fit the garment onto the model.

Data Preparation. The warped garment 𝐶𝑊 is obtained by training a module that warps the in-shop garment 𝐶 to fit the model body shape in 𝐼. We employ the geometric matching module proposed in [62] and refine the results with a U-Net-based component [53]. The virtual try-on task involves replacing one or more garments the target model is wearing. With this aim, we define the inpainting area determined by the mask 𝑀 to fully encompass the target garment, adopting the method proposed in previous works such as [32, 45] to ensure the mask completely covers it.

Textual Inversion. Given the in-shop image 𝐶, the aim of the textual inversion adapter 𝐹𝜃 is to predict a set of pseudo-word token embeddings {𝑉*₁, . . . , 𝑉*𝑁} able to represent the image 𝐶 well in the CLIP token embedding space W. We then use the predicted PTEs to condition the Stable Diffusion denoising network 𝜖𝜃 and obtain the final image 𝐼̃ in which the model in 𝐼 is wearing the garment in 𝐶. For clarity, we say that a set of PTEs represents a target image well if a Stable Diffusion model conditioned on the concatenation of a generic prompt and the predicted pseudo-words can reconstruct the target image itself.

We first build a textual prompt 𝑞 that guides the diffusion process to perform the virtual try-on task, tokenize it, and map each token into the token embedding space using the CLIP embedding lookup module, obtaining 𝑉𝑞. Then, we encode the image 𝐶 using the CLIP visual encoder 𝑉𝐸 and feed the features extracted from the last hidden layer to the textual inversion adapter 𝐹𝜃, which maps the input visual features to the CLIP token embedding space W. We then concatenate the prompt embedding vectors with the predicted pseudo-word token embeddings as follows:

    𝑌̂ = Concat(𝑉𝑞, 𝐹𝜃(𝑉𝐸(𝐶))).    (2)

We feed the embedded concatenation 𝑌̂ to the CLIP text encoder 𝑇𝐸 and use the output to condition the denoising network 𝜖𝜃, leveraging the existing Stable Diffusion textual cross-attention. To train the textual inversion adapter 𝐹𝜃, we use the inpainting pipeline of the out-of-the-box Stable Diffusion model as 𝜖𝜃. Specifically, it takes as input the encoded masked target model E(𝐼𝑀), the inpainting mask 𝑀, and the latent variable 𝑧. When training the adapter 𝐹𝜃, we freeze all the other model parameters.
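The following PyTorch sketch shows one plausible shape for the adapter 𝐹𝜃 described above and in Section 4.2 (a single ViT-style layer followed by a three-layer MLP with GELU and dropout, producing 16 PTEs). The layer widths, the use of nn.TransformerEncoderLayer as the ViT layer, and the mean pooling are illustrative assumptions, not the authors' exact implementation.

```python
import torch
from torch import nn

class TextualInversionAdapter(nn.Module):
    """Sketch of F_theta: maps CLIP visual features to N pseudo-word token embeddings (PTEs)."""

    def __init__(self, clip_dim=1280, token_dim=1024, num_ptes=16, dropout=0.1):
        super().__init__()
        self.num_ptes = num_ptes
        self.token_dim = token_dim
        # A single ViT-style self-attention block over the patch features (assumed stand-in).
        self.vit_layer = nn.TransformerEncoderLayer(
            d_model=clip_dim, nhead=8, dim_feedforward=4 * clip_dim,
            activation="gelu", batch_first=True)
        # Three fully-connected layers separated by GELU non-linearities and dropout.
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, clip_dim), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(clip_dim, clip_dim), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(clip_dim, num_ptes * token_dim))

    def forward(self, clip_hidden_states):
        # clip_hidden_states: (B, num_patches, clip_dim) from the CLIP visual encoder's last hidden layer.
        x = self.vit_layer(clip_hidden_states).mean(dim=1)          # (B, clip_dim)
        return self.mlp(x).view(-1, self.num_ptes, self.token_dim)  # (B, N, token_dim): the PTEs

# Eq. (2): concatenate the embedded prompt with the predicted PTEs before the CLIP text encoder.
# y_hat = torch.cat([prompt_token_embeddings, adapter(clip_hidden_states)], dim=1)
```

During adapter training, only these parameters would be optimized while the rest of the pipeline stays frozen, matching the description above.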
To the best of our knowledge, this study marks the first instance in which a textual inversion approach has been employed in the domain of virtual try-on. As shown in the experimental section, this innovative conditioning methodology can significantly strengthen the final results and contribute to preserving the details and texture of the original in-shop garment. Note that our proposed approach differs from traditional textual inversion techniques [20, 36, 54]. Rather than directly optimizing the pseudo-word token embeddings through iterative methods, in our solution, the adapter 𝐹𝜃 is trained to generate these embeddings in a single forward pass.

Diffusion Virtual Try-On Model. To perform the complete virtual try-on task, we employ the additional inputs described above (i.e., the textual-inverted information 𝑌̂ of the in-shop garment, the pose map 𝑃, and the garment fitted to the model body shape 𝐶𝑊) to condition the Stable Diffusion inpainting pipeline. In particular, we extend the spatial input 𝛾 ∈ R^{9×ℎ×𝑤} of the denoising network 𝜖𝜃 by concatenating it with the resized pose map 𝑝 ∈ R^{18×ℎ×𝑤} and the encoded warped garment E(𝐶𝑊) ∈ R^{4×ℎ×𝑤}. The final spatial input results in 𝛾 = [𝑧𝑡; 𝑚; E(𝐼𝑀); 𝑝; E(𝐶𝑊)] ∈ R^{(9+18+4)×ℎ×𝑤}.

To enrich the input capacity of the denoising network 𝜖𝜃 without needing to retrain it from scratch [6, 52], we propose to extend the kernel channels of the first convolutional layer by adding zero-initialized weights to match the new input channel dimension. In such a way, we can retain the knowledge embedded in the original denoising network while allowing the model to deal with the newly proposed inputs. Since the warped garment 𝐶𝑊 is not always able to properly represent the contextualization of the in-shop garment with the target model information, we also modify the Stable Diffusion textual input by using 𝑌̂ obtained from the output of the trained textual inversion adapter 𝐹𝜃, as described in Eq. 2.
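A minimal PyTorch sketch of this zero-initialization trick is shown below. The helper name and the assumption that the U-Net exposes its first convolution as `conv_in` (as in common Stable Diffusion implementations) are illustrative.

```python
import torch
from torch import nn

def extend_conv_in(old_conv: nn.Conv2d, new_in_channels: int) -> nn.Conv2d:
    """Extend the first convolution of the denoising U-Net to accept extra input channels.

    The pretrained kernels are copied and the weights for the newly added channels are
    zero-initialized, so the original behaviour is preserved at the start of training.
    """
    new_conv = nn.Conv2d(new_in_channels, old_conv.out_channels,
                         kernel_size=old_conv.kernel_size, stride=old_conv.stride,
                         padding=old_conv.padding, bias=old_conv.bias is not None)
    with torch.no_grad():
        new_conv.weight.zero_()
        new_conv.weight[:, :old_conv.in_channels] = old_conv.weight
        if old_conv.bias is not None:
            new_conv.bias.copy_(old_conv.bias)
    return new_conv

# Example: the inpainting U-Net expects 9 channels, while the extended spatial input
# gamma = [z_t; m; E(I_M); p; E(C_W)] has 4 + 1 + 4 + 18 + 4 = 31 channels.
# unet.conv_in = extend_conv_in(unet.conv_in, new_in_channels=31)
```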
As in standard LDMs, we train the proposed denoising network to predict the noise stochastically added to an encoded input 𝑧𝑡 = E(𝐼). We specify the corresponding objective function as:

    𝐿 = E_{E(𝐼), 𝑌̂, 𝜖∼N(0,1), 𝑡, E(𝐼𝑀), 𝑀, 𝑝, E(𝐶𝑊)} ‖𝜖 − 𝜖𝜃(𝛾, 𝜓)‖²₂,    (3)

where 𝜓 = [𝑡; 𝑇𝐸(𝑌̂)].

Enhanced Mask-Aware Skip Connections. Performing the diffusion process in a compressed latent space reduces the burden on the parameters 𝜖𝜃 of the latent diffusion denoising network, allowing it to reach the best trade-off between image quality and computational load [52]. We remind that, given an image 𝐼 ∈ R^{3×𝐻×𝑊}, the Stable Diffusion encoder E compresses it into a latent space 𝑍 ∈ R^{4×𝐻/8×𝑊/8}, resulting in a total compression of 48×. However, this trade-off comes at a cost, especially when dealing with human images and small high-frequency details such as hands, feet, and faces. We argue that the autoencoder reconstruction error partially depends on the data loss deriving from the latent space compression.

To address the problem, we propose to extend the autoencoder architecture with an Enhanced Mask-Aware Skip Connection (EMASC) module whose aim is to learn to propagate relevant information from different layers of the encoder E to the corresponding ones of the decoder D. In particular, instead of skipping the information of the encoded image 𝐼 to reconstruct, we pass to the EMASC modules the intermediate features produced while encoding the masked image 𝐼𝑀 with the encoder E. This procedure allows only the features not modified in the inpainting task to percolate, keeping the process cloth agnostic. We implement EMASC employing additive non-linear learned skip connections in which we mask the output according to the inverted inpainting mask. Since the EMASC inputs are the intermediate features of the masked model 𝐼𝑀 encoding process, masking the EMASC output features helps avoid propagating the masked regions through the skip connections. Formally, the EMASC module is defined as follows:

    EMASC𝑖 = 𝑓(𝐸𝑖) ∗ NOT(𝑚𝑖),
    𝐷𝑖 = 𝐷𝑖₋₁ + EMASC𝑖,    (4)

where 𝑓 is a learned non-linear function, 𝐸𝑖 is the 𝑖-th feature map coming from the encoder E, 𝐷𝑖 is the corresponding 𝑖-th decoder feature map, and 𝑚𝑖 is obtained by resizing the mask 𝑀 according to the 𝐸𝑖 spatial dimension. An overview of the proposed autoencoder enhanced with EMASC modules is reported in Figure 3.

Figure 3: Overview of the proposed autoencoder with Enhanced Mask-Aware Skip Connection (EMASC) modules.

Notice that the EMASC modules only depend on the Stable Diffusion autoencoder and, once trained, they can be easily added to the standard Stable Diffusion pipeline in a plug-and-play manner without requiring additional training. We show that this simple proposed modification can reduce the compression information loss in the inpainting task, resulting in better reconstruction of high-frequency human-related details.
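A possible PyTorch sketch of a single EMASC module implementing Eq. 4 is given below. The channel widths, the nearest-neighbour mask resizing, and the choice of where to tap the encoder are assumptions; the paper applies the modules at the conv_in output and at the features preceding the down blocks (see Section 4.2).

```python
import torch
from torch import nn
import torch.nn.functional as F

class EMASC(nn.Module):
    """Sketch of an Enhanced Mask-Aware Skip Connection (Eq. 4)."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1)   # keeps channels constant
        self.conv2 = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)  # adapts to decoder channels

    def forward(self, enc_feat, inpaint_mask):
        # enc_feat: intermediate feature of the *masked* image I_M from the encoder E_i.
        # inpaint_mask: binary mask M, resized here to the spatial size of enc_feat (m_i).
        m = F.interpolate(inpaint_mask, size=enc_feat.shape[-2:], mode="nearest")
        skip = self.conv2(F.silu(self.conv1(enc_feat)))
        return skip * (1.0 - m)   # NOT(m_i): only unmasked regions percolate

# Inside the decoder forward pass: D_i = D_{i-1} + EMASC_i(E_i, m_i)
```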
4 EXPERIMENTAL EVALUATION
4.1 Datasets and Evaluation Metrics
We perform experiments on two virtual try-on datasets, namely Dress Code [45] and VITON-HD [9], that feature high-resolution image pairs of in-shop garments and model images in both paired and unpaired settings. While in the paired setting the in-shop garment is the same as the one the model is wearing, in the unpaired setting a different garment is selected for the virtual try-on task.

The Dress Code dataset [45] features over 53,000 image pairs of clothes and human models wearing them. The dataset includes high-resolution images (i.e., 1024 × 768) and garments belonging to different macro-categories, such as upper-body clothes, lower-body clothes, and dresses.
In our experiments, we employ the original splits of the dataset, where 5,400 image pairs (1,800 for each category) compose the test set and the rest the training one. The VITON-HD dataset [9] instead comprises 13,679 image pairs, each composed of a frontal-view woman and an upper-body clothing item with a resolution equal to 1024 × 768. The dataset is divided into training and test sets of 11,647 and 2,032 pairs, respectively.

To quantitatively evaluate our model, we employ evaluation metrics to estimate the coherence and realism of the generation. In particular, we use the Learned Perceptual Image Patch Similarity (LPIPS) [69] and the Structural Similarity (SSIM) [64] to evaluate the coherence of the generated image compared to the ground-truth. We compute these metrics on the paired setting of both datasets. To measure realism, we instead employ the Fréchet Inception Distance [28] and the Kernel Inception Distance [8] in both paired (i.e., FIDp and KIDp) and unpaired (i.e., FIDu and KIDu) settings. For the LPIPS and SSIM implementation, we use the torchmetrics Python package [13], while for the FID and KID scores, we employ the implementation in [48].
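As a rough illustration (not the authors' exact evaluation code, which relies on [13] for LPIPS/SSIM and on [48] for FID/KID), the metrics can be accumulated with torchmetrics along the following lines; class locations and constructor arguments may vary across torchmetrics versions.

```python
import torch
from torchmetrics import StructuralSimilarityIndexMeasure
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

fid = FrechetInceptionDistance(feature=2048)                    # expects uint8 images in [0, 255]
kid = KernelInceptionDistance(subset_size=50)
lpips = LearnedPerceptualImagePatchSimilarity(net_type="alex", normalize=True)  # inputs in [0, 1]
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)

def update_metrics(generated, ground_truth):
    # generated / ground_truth: float tensors in [0, 1] with shape (B, 3, H, W).
    fid.update((ground_truth * 255).to(torch.uint8), real=True)
    fid.update((generated * 255).to(torch.uint8), real=False)
    kid.update((ground_truth * 255).to(torch.uint8), real=True)
    kid.update((generated * 255).to(torch.uint8), real=False)
    lpips.update(generated, ground_truth)
    ssim.update(generated, ground_truth)

# After iterating over the test set:
# print(fid.compute(), kid.compute()[0], lpips.compute(), ssim.compute())
```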
4.2 Implementation Details
We first train the EMASC modules, the textual inversion adapter, and the warping component. Then, we freeze the weights of all modules except for the textual inversion adapter and train the proposed enhanced Stable Diffusion pipeline*. In all our experiments, we generate images at 512 × 384 resolution.

* https://huggingface.co/stabilityai/stable-diffusion-2-inpainting

Textual Inversion. The textual inversion network 𝐹𝜃 consists of a single ViT layer followed by a multi-layer perceptron composed of three fully-connected layers separated by a GELU non-linearity [27] and a dropout layer [61]. We set the number of PTEs generated by 𝐹𝜃 to 16. We train 𝐹𝜃 for 200k steps, with batch size 16, learning rate 1e-5 with 500 warm-up steps using a linear schedule, AdamW [40] as optimizer with 𝛽₁ = 0.9, 𝛽₂ = 0.999, and weight decay equal to 1e-2. As the visual encoder 𝑉𝐸, we leverage the OpenCLIP ViT-H/14 model [65] pre-trained on LAION-2B [58].

Diffusion Virtual Try-On Model. We train the proposed virtual try-on pipeline for 200k iterations, with batch size 16 and the same optimizer and scheduling strategy used to train the textual inversion network. At training time, we randomly mask the text, the warped garment, and the pose map input with a probability of 0.2 for each condition. This allows the later use of the classifier-free guidance technique [30] at inference time. Following [1], we use the fast variant of the multi-conditional classifier-free guidance, which allows computing the final result with a computational complexity independent of the number of input constraints.
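A sketch of this condition dropout is shown below. In practice the text condition is often dropped by swapping in a null-prompt embedding rather than zeroing it, and the exact masking used by the authors is not shown here; treat the function as an assumption-laden illustration.

```python
import torch

def drop_conditions(text_emb, warped_garment, pose_map, p_drop=0.2):
    """Independently drop each conditioning input with probability p_drop (per sample).

    Randomly removing conditions at training time is what later enables the
    (multi-conditional) classifier-free guidance used at inference.
    """
    def mask_like(x):
        keep = (torch.rand(x.shape[0], device=x.device) > p_drop).float()
        return x * keep.view(-1, *([1] * (x.dim() - 1)))

    # Dropped text is modelled here by zeroing; a null-prompt embedding is a common alternative.
    return mask_like(text_emb), mask_like(warped_garment), mask_like(pose_map)
```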
Autoencoder with EMASC. We apply the proposed EMASC modules to the variational autoencoder of the Stable Diffusion model. In particular, each EMASC module consists of two convolutional layers, where a SiLU non-linearity [16] activates the first one. We apply the EMASC modules to the conv_in layer output and to the features before the down_blocks, connecting each encoder layer to its corresponding decoder one. The convolutional layers have a kernel size of 3, padding of 1, and stride of 1. The first convolutional layer maintains the number of channels constant, while the second one adapts the channel axis dimension to the decoder features. Finally, we sum the EMASC output to the corresponding decoder features. We train the EMASC modules for 40k steps with batch size 16, learning rate equal to 1e-5, AdamW as optimizer with 𝛽₁ = 0.9, 𝛽₂ = 0.999, and weight decay 1e-2. Also in this case, we perform 500 warm-up steps with a linear schedule. We employ a combination of the L1 and VGG [34] loss functions, scaling the perceptual VGG loss term by a factor of 0.5. In our setting, we found the VGG loss essential to avoid blurriness in the reconstructed images. During training, the encoder E and decoder D are frozen (see Figure 3), and only the EMASC modules are learned.

4.3 Experimental Results
Comparison with State-of-the-Art Models. We compare our method with several state-of-the-art competitors. For the Dress Code dataset, we compare our method with CP-VTON [62] and PSAD [45], retrained from scratch at the same image resolution as our model (i.e., 512 × 384), using the source codes when available or otherwise implementing them. Following [45], we also include an improved version of CP-VTON (i.e., CP-VTON†) in which we add the masked image 𝐼𝑀 as an additional input.
Figure 4: Qualitative results generated by LaDI-VTON and competitors on Dress Code [45] (left) and VITON-HD [9] (right).
For the VITON-HD dataset, instead, we compare our model with VITON-HD [9] and HR-VITON [37], using the source codes and checkpoints released by the authors to extract the results. Given that some evaluation scores (e.g., LPIPS and FID) are very sensitive to different implementations, to ensure a fair comparison, we compute the quantitative results of these methods using the same metric implementation as for our model. For completeness, we also include in the comparison some additional virtual try-on methods for which the results are taken from previous works and, therefore, may have been obtained using different evaluation source codes.

Table 1: Quantitative results on the Dress Code dataset [45]. The * marker indicates results reported in previous works, which may differ in terms of metric implementation. Best results are reported in bold.

Table 2: Quantitative results on the VITON-HD dataset [9]. The * marker indicates results reported in previous works.

Model           LPIPS ↓  SSIM ↑  FIDp ↓  KIDp ↓  FIDu ↓  KIDu ↓
CP-VTON* [62]   -        0.791   -       -       30.25   40.12
ACGPN* [68]     -        0.858   -       -       14.43   5.87
VITON-HD [9]    0.116    0.863   11.01   3.71    12.96   4.09
HR-VITON [37]   0.097    0.878   10.88   4.48    13.06   4.72
LaDI-VTON       0.091    0.876   6.66    1.08    9.41    1.60

Table 1 reports the quantitative results on the Dress Code dataset. As can be seen, LaDI-VTON achieves comparable results to PSAD [45] in terms of coherence with the inputs (i.e., LPIPS and SSIM), while significantly outperforming all competitors in terms of realism in both paired and unpaired settings. In particular, on the Dress Code test set, our model reaches a FID score of 4.14 and 6.48 for the paired and unpaired settings, respectively. These results are considerably lower than those of the best-performing competitor (i.e., PSAD). In Table 2, we instead show the quantitative analysis on the VITON-HD dataset. Also in this case, LaDI-VTON surpasses all other competitors by a large margin in terms of FID and KID, demonstrating its effectiveness in this setting.

To qualitatively evaluate our results, we report in Figure 4 sample images generated by our model and by the competitors. Notably, our solution can generate highly realistic images and preserve the texture and details of the original in-shop garments, as well as the physical characteristics of the target models.

Human Evaluation. To further evaluate the generation quality of our model, we conduct a user study to measure both the realism of generated images and their coherence with the inputs given to the virtual try-on model. Overall, we collect around 2,000 evaluations for each test, involving more than 50 unique users. In Table 3, we report the percentage of times in which an image generated by our model is preferred against a competitor. As can be seen, LaDI-VTON is always selected more than 60% of the time, further confirming the progress over previous methods.

Table 3: User study results on the unpaired test set of both datasets. We report the percentage of times an image from LaDI-VTON is preferred against a competitor.

Dataset      Model           Realism  Coherence
Dress Code   CP-VTON [62]    93.10    89.68
Dress Code   CP-VTON† [62]   80.21    75.69
Dress Code   PSAD [45]       74.14    70.83
VITON-HD     VITON-HD [9]    79.19    71.48
VITON-HD     HR-VITON [37]   77.95    60.98

Configuration Analysis. In Table 4, we study the model performance by varying its configuration. We conduct this analysis on the Dress Code test set. In particular, the experiment in the first row replaces the Stable Diffusion textual input 𝑌̂ with an empty string. The one in the second row replaces the Stable Diffusion textual input 𝑌̂ with textual elements retrieved using the in-shop garment image 𝐶 as the query for a CLIP-based model [6]. The results show that the proposed textual inversion adapter outperforms the other textual input alternatives. The third experiment regards the conditioning abilities of the textual inversion adapter: we can see that it is possible to obtain excellent results by using the proposed adapter to condition an out-of-the-box Stable Diffusion model. Finally, we test the warped garment 𝐶𝑊 input in the fourth row of Table 4.

Table 4: Quantitative results on the entire Dress Code test set [45] using different model configurations.

Model                   LPIPS ↓  SSIM ↑  FIDp ↓  KIDp ↓  FIDu ↓  KIDu ↓
w/o text                0.071    0.902   4.99    1.61    8.50    3.70
w/ retrieved text       0.070    0.903   4.85    1.61    7.49    2.93
w/ 𝐹𝜃 and standard SD   0.105    0.876   5.42    1.87    7.50    2.83
w/o warped garment      0.068    0.904   4.50    1.44    6.30    1.99
LaDI-VTON               0.064    0.906   4.14    1.21    6.48    2.20
Table 5: Quantitative analysis changing the number of predicted 𝑉*. Results are reported on the Dress Code test set [45] using the out-of-the-box Stable Diffusion as backbone.

Table 7: Quantitative results per category on the Dress Code dataset [45].

Upper-body
Model           LPIPS ↓  SSIM ↑  FIDp ↓  KIDp ↓  FIDu ↓  KIDu ↓
CP-VTON [62]    0.176    0.851   46.47   33.82   48.31   35.25
CP-VTON† [62]   0.078    0.918   19.70   11.69   22.18   12.09
PSAD [45]       0.049    0.938   13.87   6.40    17.51   7.15
LaDI-VTON       0.049    0.928   9.53    1.98    13.26   2.67

Lower-body
Model           LPIPS ↓  SSIM ↑  FIDp ↓  KIDp ↓  FIDu ↓  KIDu ↓
CP-VTON [62]    0.220    0.828   47.29   32.40   51.29   38.48
CP-VTON† [62]   0.083    0.913   18.85   10.33   18.85   10.24
PSAD [45]       0.051    0.932   13.14   5.59    19.68   8.90
LaDI-VTON       0.051    0.922   8.52    1.04    14.80   3.13

Dresses
Model           LPIPS ↓  SSIM ↑  FIDp ↓  KIDp ↓  FIDu ↓  KIDu ↓
CP-VTON [62]    0.162    0.847   22.54   13.21   25.94   15.81
CP-VTON† [62]   0.123    0.863   18.75   11.07   21.83   12.31
PSAD [45]       0.074    0.885   12.38   4.68    17.07   6.66
LaDI-VTON       0.089    0.868   9.07    1.12    13.40   2.50

A CLOTHES WARPING PROCEDURE
To warp the in-shop garment 𝐶 to fit the model's body shape shown in 𝐼 and obtain the warped in-shop garment 𝐶𝑊, we exploit the geometric matching module proposed in [62] and a U-Net-based [53] refinement component.

Specifically, the geometric matching module computes a correlation map between the encoded representations of the in-shop garment 𝐶 and a cloth-agnostic person representation composed of the pose map 𝑃 and the masked model image 𝐼𝑀. We obtain these encoded representations using two separate convolutional networks. Based on the computed correlation map, we predict the spatial transformation parameters 𝜃 of a thin-plate spline geometric transformation [15, 51], represented by TPS𝜃. We use the 𝜃 parameters to compute the coarse warped garment 𝐶̂ starting from the in-shop garment 𝐶 (i.e., 𝐶̂ = TPS𝜃(𝐶)).

To further refine the result, we use a U-Net model that takes as input the concatenation of the coarse warped garment 𝐶̂, the pose map 𝑃, and the masked model image 𝐼𝑀 and predicts the refined warped garment 𝐶𝑊 as follows:

    𝐶𝑊 = U-Net(𝐶̂, 𝑃, 𝐼𝑀).    (5)
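A high-level PyTorch sketch of this two-stage warping is given below. The two feature extractors, the regressor of the TPS parameters 𝜃, the TPS grid generator, and the refinement U-Net are all placeholders standing in for the components of [62] and [53]; only the overall data flow (correlation map, TPS warp, refinement) follows the appendix description.

```python
import torch
from torch import nn
import torch.nn.functional as F

class ClothesWarping(nn.Module):
    """Sketch of Appendix A: geometric matching + TPS warp + U-Net refinement (placeholders)."""

    def __init__(self, feat_cloth, feat_person, theta_regressor, tps_grid, refine_unet):
        super().__init__()
        self.feat_cloth = feat_cloth            # conv network encoding the in-shop garment C
        self.feat_person = feat_person          # conv network encoding the cloth-agnostic person repr.
        self.theta_regressor = theta_regressor  # flattened correlation map -> TPS parameters theta
        self.tps_grid = tps_grid                # (theta, output shape) -> sampling grid (B, H, W, 2)
        self.refine_unet = refine_unet          # refinement component of Eq. (5)

    def forward(self, cloth, pose_map, masked_model):
        fc = self.feat_cloth(cloth)                                    # (B, C, h, w)
        fp = self.feat_person(torch.cat([pose_map, masked_model], 1))  # (B, C, h, w)
        # Correlation map between all spatial locations of the two feature maps.
        b, c, h, w = fc.shape
        corr = torch.bmm(fc.flatten(2).transpose(1, 2), fp.flatten(2))  # (B, h*w, h*w)
        theta = self.theta_regressor(corr.view(b, -1))
        coarse = F.grid_sample(cloth, self.tps_grid(theta, cloth.shape),
                               align_corners=False)                     # C_hat = TPS_theta(C)
        refined = self.refine_unet(torch.cat([coarse, pose_map, masked_model], dim=1))
        return refined                                                   # C_W in Eq. (5)
```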
Figure 6: Failure cases of LaDI-VTON on Dress Code [45] (1st row) and VITON-HD [9] (2nd row).

REFERENCES
[4] Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, and Alberto Del Bimbo. 2023. Zero-Shot Composed Image Retrieval with Textual Inversion. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
[5] Alberto Baldrati, Marco Bertini, Tiberio Uricchio, and Alberto Del Bimbo. 2022. Conditioned and Composed Image Retrieval Combining and Partially Fine-Tuning CLIP-Based Features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops.
[6] Alberto Baldrati, Davide Morelli, Giuseppe Cartella, Marcella Cornia, Marco Bertini, and Rita Cucchiara. 2023. Multimodal Garment Designer: Human-Centric Latent Diffusion Models for Fashion Image Editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
[7] Ankan Kumar Bhunia, Salman Khan, Hisham Cholakkal, Rao Muhammad Anwer, Jorma Laaksonen, Mubarak Shah, and Fahad Shahbaz Khan. 2023. Person Image Synthesis via Denoising Diffusion Model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[8] Mikołaj Bińkowski, Dougal J Sutherland, Michael Arbel, and Arthur Gretton. 2018. Demystifying MMD GANs. In Proceedings of the International Conference on Learning Representations.
[9] Seunghwan Choi, Sunghyun Park, Minsoo Lee, and Jaegul Choo. 2021. VITON-HD: High-Resolution Virtual Try-On via Misalignment-Aware Normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[10] Guillem Cucurull, Perouz Taslakian, and David Vazquez. 2019. Context-aware visual compatibility prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[11] Giannis Daras and Alexandros G Dimakis. 2022. Multiresolution Textual Inversion. In Advances in Neural Information Processing Systems Workshops.
[12] Lavinia De Divitiis, Federico Becattini, Claudio Baecchi, and Alberto Del Bimbo. 2023. Disentangling features for fashion recommendation. ACM Transactions on Multimedia Computing, Communications and Applications 19, 1s (2023), 1–21.
[13] Nicki Skafte Detlefsen, Jiri Borovec, Justus Schock, Ananya Harsh Jha, Teddy Koker, Luca Di Liello, Daniel Stancl, Changsheng Quan, Maxim Grechkin, and William Falcon. 2022. TorchMetrics - Measuring Reproducibility in PyTorch. Journal of Open Source Software 7, 70 (2022), 4101.
[14] Prafulla Dhariwal and Alexander Nichol. 2021. Diffusion Models Beat GANs on Image Synthesis. In Advances in Neural Information Processing Systems.
[15] Jean Duchon. 1977. Splines minimizing rotation-invariant semi-norms in Sobolev spaces. In Constructive Theory of Functions of Several Variables.
[16] Stefan Elfwing, Eiji Uchibe, and Kenji Doya. 2018. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks 107 (2018), 3–11.
[17] Benjamin Fele, Ajda Lampe, Peter Peer, and Vitomir Struc. 2022. C-VTON: Context-Driven Image-Based Virtual Try-On Network. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision.
[18] Emanuele Fenocchi, Davide Morelli, Marcella Cornia, Lorenzo Baraldi, Fabio Cesari, and Rita Cucchiara. 2022. Dual-Branch Collaborative Transformer for Virtual Try-On. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops.
[19] Matteo Fincato, Federico Landi, Marcella Cornia, Fabio Cesari, and Rita Cucchiara. 2021. VITON-GT: An Image-based Virtual Try-On Model with Geometric Transformations. In Proceedings of the International Conference on Pattern Recognition.
[20] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. 2023. An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion. In Proceedings of the International Conference on Learning Representations.
[21] Yuying Ge, Yibing Song, Ruimao Zhang, Chongjian Ge, Wei Liu, and Ping Luo. 2021. Parser-free virtual try-on via distilling appearance flows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[22] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems.
[23] Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. 2022. Vector quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[24] M Hadi Kiapour, Xufeng Han, Svetlana Lazebnik, Alexander C Berg, and Tamara L Berg. 2015. Where to buy it: Matching street clothing photos in online shops. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
[25] Inhwa Han, Serin Yang, Taesung Kwon, and Jong Chul Ye. 2023. Highly Personalized Text Embedding for Image Manipulation by Stable Diffusion. arXiv preprint arXiv:2303.08767 (2023).
[26] Xintong Han, Zuxuan Wu, Zhe Wu, Ruichi Yu, and Larry S Davis. 2018. VITON: An image-based virtual try-on network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[27] Dan Hendrycks and Kevin Gimpel. 2016. Gaussian Error Linear Units (GELUs). arXiv preprint arXiv:1606.08415 (2016).
[28] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Günter Klambauer, and Sepp Hochreiter. 2017. GANs trained by a two time-scale update rule converge to a Nash equilibrium. In Advances in Neural Information Processing Systems.
[29] Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising Diffusion Probabilistic Models. In Advances in Neural Information Processing Systems.
[30] Jonathan Ho and Tim Salimans. 2021. Classifier-Free Diffusion Guidance. In Advances in Neural Information Processing Systems Workshops.
[31] Wei-Lin Hsiao and Kristen Grauman. 2018. Creating capsule wardrobes from fashion images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[32] Thibaut Issenhuth, Jérémie Mary, and Clément Calauzenes. 2020. Do not mask what you do not need to mask: a parser-free virtual try-on. In Proceedings of the European Conference on Computer Vision.
[33] Yuming Jiang, Shuai Yang, Haonan Qiu, Wayne Wu, Chen Change Loy, and Ziwei Liu. 2022. Text2Human: Text-driven controllable human image generation. ACM Transactions on Graphics 41, 4 (2022), 1–11.
[34] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. 2016. Perceptual losses for real-time style transfer and super-resolution. In Proceedings of the European Conference on Computer Vision.
[35] Diederik P Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations.
[36] Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. 2023. Multi-Concept Customization of Text-to-Image Diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[37] Sangyun Lee, Gyojung Gu, Sunghyun Park, Seunghwan Choi, and Jaegul Choo. 2022. High-Resolution Virtual Try-On with Misalignment and Occlusion-Handled Conditions. In Proceedings of the European Conference on Computer Vision.
[38] Kedan Li, Min Jin Chong, Jeffrey Zhang, and Jingen Liu. 2021. Toward accurate and realistic outfits visualization with attention to details. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[39] Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. 2016. DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[40] Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations.
[41] Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. 2022. RePaint: Inpainting Using Denoising Diffusion Probabilistic Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[42] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. 2022. SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations. In Proceedings of the International Conference on Learning Representations.
[43] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. 2023. Null-text Inversion for Editing Real Images using Guided Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[44] Davide Morelli, Marcella Cornia, Rita Cucchiara, et al. 2021. FashionSearch++: Improving consumer-to-shop clothes retrieval with hard negatives. In CEUR Workshop Proceedings.
[45] Davide Morelli, Matteo Fincato, Marcella Cornia, Federico Landi, Fabio Cesari, and Rita Cucchiara. 2022. Dress Code: High-Resolution Multi-Category Virtual Try-On. In Proceedings of the European Conference on Computer Vision.
[46] Alexander Quinn Nichol and Prafulla Dhariwal. 2021. Improved denoising diffusion probabilistic models. In Proceedings of the International Conference on Machine Learning.
[47] Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob Mcgrew, Ilya Sutskever, and Mark Chen. 2022. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. In Proceedings of the International Conference on Machine Learning.
[48] Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. 2022. On Aliased Resizing and Surprising Subtleties in GAN Evaluation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[49] Alec Radford, Jong Wook Kim, Chris Hallacy, A. Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the International Conference on Machine Learning.
[50] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv preprint arXiv:2204.06125 (2022).
[51] Ignacio Rocco, Relja Arandjelovic, and Josef Sivic. 2017. Convolutional neural network architecture for geometric matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[52] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-Resolution Image Synthesis With Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[53] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer Assisted Intervention.
[54] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. 2023. DreamBooth: Fine Tuning Text-to-image Diffusion Models for Subject-Driven Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[55] Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. 2022. Palette: Image-to-image diffusion models. In Proceedings of the ACM SIGGRAPH Conference.
[56] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo-Lopes, Burcu Karagol Ayan, Tim Salimans, et al. 2022. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. In Advances in Neural Information Processing Systems.
[57] Rohan Sarkar, Navaneeth Bodla, Mariya I Vasileva, Yen-Liang Lin, Anurag Beniwal, Alan Lu, and Gerard Medioni. 2023. OutfitTransformer: Learning Outfit Representations for Fashion Recommendation. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision.
[58] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa R Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. 2022. LAION-5B: An open large-scale dataset for training next generation image-text models. In Advances in Neural Information Processing Systems.
[59] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning.
[60] Jiaming Song, Chenlin Meng, and Stefano Ermon. 2021. Denoising Diffusion Implicit Models. In Proceedings of the International Conference on Learning Representations.
[61] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15, 56 (2014), 1929–1958.
[62] Bochao Wang, Huabin Zheng, Xiaodan Liang, Yimin Chen, Liang Lin, and Meng Yang. 2018. Toward characteristic-preserving image-based virtual try-on network. In Proceedings of the European Conference on Computer Vision.
[63] Tengfei Wang, Ting Zhang, Bo Zhang, Hao Ouyang, Dong Chen, Qifeng Chen, and Fang Wen. 2022. Pretraining is All You Need for Image-to-Image Translation. arXiv preprint arXiv:2205.12952 (2022).
[64] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. 2004. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13, 4 (2004), 600–612.
[65] Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, et al. 2022. Robust fine-tuning of zero-shot models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[66] Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, and Rogerio Feris. 2021. Fashion IQ: A New Dataset Towards Retrieving Images by Natural Language Feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[67] Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, and Fang Wen. 2023. Paint by Example: Exemplar-based Image Editing with Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[68] Han Yang, Ruimao Zhang, Xiaobao Guo, Wei Liu, Wangmeng Zuo, and Ping Luo. 2020. Towards photo-realistic virtual try-on by adaptively generating-preserving image content. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[69] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. 2018. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[70] Min Zhao, Fan Bao, Chongxuan Li, and Jun Zhu. 2022. EGSDE: Unpaired Image-to-Image Translation via Energy-Guided Stochastic Differential Equations. In Advances in Neural Information Processing Systems.
Figure 7: Image reconstruction qualitative results from the autoencoder of Stable Diffusion on sample images from the Dress
Code dataset. From left to right: the original image, the image from the out-of-the-box Stable Diffusion autoencoder without
EMASC modules, the image from the autoencoder with linear EMASC connections, and the image from the autoencoder with
non-linear EMASC connections.
Figure 8: Image reconstruction qualitative results from the autoencoder of Stable Diffusion on sample images from the VITON-
HD dataset. From left to right: the original image, the image from the out-of-the-box Stable Diffusion autoencoder without
EMASC modules, the image from the autoencoder with linear EMASC connections, and the image from the autoencoder with
non-linear EMASC connections.
Figure 9: Qualitative results generated by LaDI-VTON and competitors on the Dress Code dataset. From left to right: the original
image, the in-shop garment, and images generated by CP-VTON† [62], PSAD [45], LaDI-VTON (ours).
Figure 10: Qualitative results generated by LaDI-VTON and competitors on the Dress Code dataset. From left to right: the
original image, the in-shop garment, and images generated by VITON-HD [9], HR-VITON [37], LaDI-VTON (ours).