Fashion Diffusion Control
{name.surname}@unifi.it {name.surname}@unimore.it
Figure 1: In this work, we propose a novel multimodal garment designer framework based on latent diffusion models that can generate a novel fashion image conditioned on text, human keypoints, and a garment sketch.
input model wearing a new clothing item corresponding to the given textual description. In this context, only a few works [19, 35, 59] have been proposed, exclusively employing GAN-based approaches for the generative step.

Recently, diffusion models [10, 17, 32, 44] have attracted more and more attention due to their outstanding generation capabilities, allowing the improvement of a variety of downstream tasks in several domains, while their applicability to the fashion domain is still unexplored. Many different solutions have been introduced and can roughly be identified based on the denoising conditions used to guide the diffusion process, which can enable greater control of the synthesized output. A particular type of diffusion model has been proposed in [39] that, instead of applying the diffusion process in the pixel space, defines the forward and the reverse processes in the latent space of a pre-trained autoencoder, becoming one of the leading choices thanks to its reduced computational cost. Although this solution can generate highly realistic images, it does not perform well in human-centric generation tasks and cannot deal with multiple conditioning signals to guide the generation phase.

In this work, we address an extended and more general framework and define the new task of multimodal-conditioned fashion image editing, which allows guiding the generative process via multimodal prompts while preserving the identity and body shape of a given person (Fig. 1). To tackle this task, we introduce a new architecture, called Multimodal Garment Designer (MGD), that emulates the process of a designer conceiving a new garment on a model shape, based on preliminary indications provided through a textual sentence or a garment sketch. In particular, starting from Stable Diffusion [39], we propose a denoising network that can be conditioned by multiple modalities and also takes into account the pose consistency between input and generated images, thus improving the effectiveness of human-centric diffusion models.

To address the newly proposed task, we present a semi-automatic framework to extend existing datasets with multimodal data. Specifically, we start from two famous virtual try-on datasets (i.e. Dress Code [30] and VITON-HD [7]) and extend them with textual descriptions and garment sketches. Experimental results on the two proposed multimodal fashion benchmarks show both quantitatively and qualitatively that our proposed architecture generates high-quality images based on the given multimodal inputs and outperforms all considered competitors and baselines, also according to human evaluations.

To sum up, our contributions are as follows: (1) We propose a novel task of multimodal-conditioned fashion image editing, which entails the use of multimodal data to guide the generation. (2) We introduce a new human-centric generative architecture based on latent diffusion models, capable of following multimodal prompts while preserving the model's characteristics. (3) To tackle the new task, we extend two existing fashion datasets with textual sentences and garment sketches, devising a semi-automatic annotation framework. (4) Extensive experiments demonstrate that the proposed approach outperforms other competitors in terms of realism and coherence with multimodal inputs.

2. Related Work

Text-Guided Image Generation. Creating an image that faithfully reflects the provided textual prompt is the goal of text-to-image synthesis. In this context, early approaches were based on GANs [48, 54, 56, 58], while most recent solutions exploit the effectiveness of diffusion models [33, 37, 39]. In the fashion domain, only a few attempts at text-to-image synthesis have been proposed [19, 35, 59]. Specifically, Zhu et al. [59] presented a GAN-based solution that generates the final image conditioned on both textual descriptions and semantic layouts. A different approach is the one introduced in [35], where a latent code regularization technique is employed to augment the GAN inversion process by exploiting CLIP textual embeddings [36] to guide the image editing process. Instead, Jiang et al. [19] proposed an architecture that synthesizes full-body images by mapping the textual descriptions of clothing items into one-hot vectors, limiting however the expressiveness capability of the conditioning signal.

Multimodal Image Generation with Diffusion Models. A related line of works aims to condition existing diffusion models on different modalities, thus enabling greater control over the generation process [5, 6, 27, 31, 51]. For example, Choi et al. [6] proposed to refine the generative process of an unconditional denoising diffusion probabilistic model [32] by matching each latent variable with the given reference image. On a different line, the approach introduced in [27] adds noise to a stroke-based input and applies the reverse stochastic differential equation to synthesize images, without additional training. Wang et al. [51], instead, proposed to learn a highly semantic latent space and perform conditional finetuning for each downstream task to map the guidance signals to the pre-trained space. Other recent works proposed to add sketches as additional conditioning signals, either concatenating them with the model input [5] or training an MLP-based edge predictor to map latent features to spatial maps [49].

Among contemporary works that aim to condition pre-trained latent diffusion models, ControlNet [57] proposes to extend the Stable Diffusion model [39] with an additional conditioning input. This process involves creating two versions of the original model's weights: one that remains fixed and unchanged (locked copy) and another that can be updated during training (trainable copy). The purpose of this is to allow the trainable version to learn the newly introduced condition while the locked version retains the original model knowledge.
Figure 2: Overview of the proposed Multimodal Garment Designer (MGD), a human-centric latent diffusion model conditioned on multiple modalities (i.e. text, human pose, and garment sketch).
On the other hand, T2I-Adapter [31] learns modality-specific adapter modules that enable Stable Diffusion conditioning on new modalities.

In contrast, we focus on the fashion domain and propose a human-centric architecture based on latent diffusion models that directly exploits the conditioning of textual sentences and other modalities such as human body poses and garment sketches.

3. Proposed Method

In this section, we propose a novel task to automatically edit a human-centric fashion image conditioned on multiple modalities. Specifically, given the model image I ∈ R^{H×W×3}, its pose map P ∈ R^{H×W×18}, where the channels represent the human keypoints, a textual description Y of a garment, and a sketch of the same S ∈ R^{H×W×1}, we want to generate a new image Ĩ ∈ R^{H×W×3} that retains the information of the input model while substituting the target garment according to the multimodal inputs. To tackle the task, we propose a novel latent diffusion approach, called Multimodal Garment Designer (MGD), that can effectively combine multimodal information when generating the new image Ĩ. Our proposed architecture is a general framework that can easily be extended to other modalities such as texture and 3D information. We strongly believe this task can foster research in the field and enhance the design process of new fashion items with greater customization. An overview of our model is shown in Fig. 2.

3.1. Preliminaries

While diffusion models [44] are latent variable architectures that work in the same dimensionality as the data (i.e. in the pixel space), latent diffusion models (LDMs) [39] operate in the latent space of a pre-trained autoencoder, achieving higher computational efficiency while preserving the generation quality. In our work, we leverage the Stable Diffusion model [39], a text-to-image implementation of LDMs, as a starting point to perform multimodal conditioning for human-centric fashion image editing. Stable Diffusion is composed of an autoencoder with an encoder E and a decoder D, a text-time-conditional U-Net denoising model ϵ_θ, and a CLIP-based text encoder T_E taking as input a text Y. The encoder E compresses an image I into a lower-dimensional latent space defined in R^{h×w×4}, where h = H/8 and w = W/8. The decoder D performs the opposite operation, decoding a latent variable into the pixel space. For the sake of clarity, we define the convolutional input of ϵ_θ (i.e. z_t in this case) as the spatial input γ, because of the property of convolutions to preserve the spatial structure, and the attention conditioning input as ψ. The denoising network ϵ_θ is trained according to the following loss:

L = \mathbb{E}_{\mathcal{E}(I), Y, \epsilon \sim \mathcal{N}(0,1), t} \left[ \lVert \epsilon - \epsilon_{\theta}(\gamma, \psi) \rVert_2^2 \right],   (1)

where t is the diffusion time step, γ = z_t, ψ = [t; T_E(Y)], and ϵ ∼ N(0, 1) is the Gaussian noise added to E(I).
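To make the objective in Eq. (1) concrete, the following is a minimal PyTorch-style sketch of a single training step of a latent diffusion model. The callables `encode`, `unet`, and `text_encoder` stand in for the Stable Diffusion components E, ϵ_θ, and T_E, and `alphas_cumprod` for the cumulative noise schedule; they are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def ldm_training_step(encode, unet, text_encoder, alphas_cumprod, image, tokens):
    """One optimization step of the loss in Eq. (1)."""
    with torch.no_grad():
        z = encode(image)              # z = E(I), shape (B, 4, h, w)
        psi = text_encoder(tokens)     # psi = T_E(Y)

    b = z.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=z.device)
    eps = torch.randn_like(z)          # eps ~ N(0, 1)

    # Forward diffusion: z_t = sqrt(a_t) * z + sqrt(1 - a_t) * eps  (gamma = z_t).
    a_t = alphas_cumprod[t].view(b, 1, 1, 1)
    z_t = a_t.sqrt() * z + (1.0 - a_t).sqrt() * eps

    eps_pred = unet(z_t, t, psi)       # eps_theta(gamma, psi)
    return F.mse_loss(eps_pred, eps)   # || eps - eps_theta(gamma, psi) ||_2^2
```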
3.2. Human-Centric Image Editing

Our task aims to generate a new image Ĩ by replacing the target garment in the input image I according to multimodal inputs, while preserving the model's identity and physical characteristics. As a natural consequence, this task can be identified as a particular type of inpainting tailored for human body data. Instead of using a standard text-to-image model, we perform inpainting by concatenating, along the channel dimension of the denoising network input z_t, an encoded masked image E(I_M) and the relative resized binary inpainting mask m ∈ {0, 1}^{h×w×1}, which is derived from the original inpainting mask M ∈ {0, 1}^{H×W×1}. Hence, the spatial input of the denoising network becomes γ = [z_t; m; E(I_M)], with γ ∈ R^{h×w×9}. Thanks to the fully convolutional nature of the encoder E and the decoder D, this LDM-based architecture can preserve the spatial information in the latent space. Exploiting this feature, our method can thus optionally add conditioning constraints to the generation.
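Before introducing the additional constraints, the assembly of the 9-channel spatial input γ = [z_t; m; E(I_M)] described above can be sketched as follows; `encode` again stands in for the autoencoder E, and the interpolation mode is an assumption.

```python
import torch
import torch.nn.functional as F

def build_spatial_input(z_t, mask, masked_image, encode):
    """Assemble gamma = [z_t; m; E(I_M)] with shape (B, 4 + 1 + 4, h, w)."""
    # m: resize the H x W binary inpainting mask M to the latent size h x w.
    h, w = z_t.shape[-2:]
    m = F.interpolate(mask.float(), size=(h, w), mode="nearest")   # (B, 1, h, w)

    # E(I_M): encode the masked model image into the 4-channel latent space.
    with torch.no_grad():
        z_masked = encode(masked_image)                            # (B, 4, h, w)

    return torch.cat([z_t, m, z_masked], dim=1)                    # (B, 9, h, w)
```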
In particular, we propose to add two generation constraints in addition to the textual information: the model pose map P, to preserve the original human pose of the input model, and the garment sketch S, to allow the final users to better condition the garment generation process.

Pose Map Conditioning. In most cases [23, 26, 47], inpainting is performed with the objective of either removing or entirely replacing the content of the masked region. However, in our task, we aim to remove all information regarding the garment worn by the model while preserving the model's body information and identity. Thus, we propose to improve the garment inpainting process by using the bounding box of the segmentation mask along with pose map information representing body keypoints. This approach enables the preservation of the model's physical characteristics in the masked region while allowing the inpainting of garments with different shapes. Differently from conventional inpainting techniques, we focus on selectively retaining and discarding specific information within the masked region to achieve the desired outcome. To enhance the performance of the denoising network with human body keypoints, we modify the first convolution layer of the network by adding 18 additional channels, one for each keypoint. Adding new inputs would usually require retraining the model from scratch, thus consuming time, data, and resources, especially in the case of data-hungry models like diffusion ones. Therefore, we propose to extend the kernels of the pre-trained input layer of the denoising network with randomly initialized weights sampled from a uniform distribution [14] and retrain the whole network. This consistently reduces the number of training steps and enables training with less data. Our experiments show that such an improvement enhances the consistency of the body information between the generated image and the original one.
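As an illustration of this strategy, the sketch below widens a pre-trained input convolution by 18 channels while keeping the original kernels; the uniform (Kaiming) initialization mirrors the reference [14], but the helper function itself is hypothetical rather than the authors' code.

```python
import torch
import torch.nn as nn

def extend_input_conv(old_conv: nn.Conv2d, extra_in_channels: int = 18) -> nn.Conv2d:
    """Return a copy of `old_conv` that accepts `extra_in_channels` more inputs.

    Pre-trained kernels are kept; kernels for the new channels are randomly
    initialized so the network can be retrained instead of trained from scratch.
    """
    new_conv = nn.Conv2d(
        old_conv.in_channels + extra_in_channels,
        old_conv.out_channels,
        kernel_size=old_conv.kernel_size,
        stride=old_conv.stride,
        padding=old_conv.padding,
        dilation=old_conv.dilation,
        bias=old_conv.bias is not None,
    )
    nn.init.kaiming_uniform_(new_conv.weight)          # uniform init for all kernels
    with torch.no_grad():
        # Copy the pre-trained kernels for the original input channels.
        new_conv.weight[:, : old_conv.in_channels] = old_conv.weight
        if old_conv.bias is not None:
            new_conv.bias.copy_(old_conv.bias)
    return new_conv
```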
Incorporating Sketches. Fully describing a garment using only textual descriptions is a challenging task due to the complexity and ambiguity of natural language. While text can convey specific attributes like the style, color, and patterns of a garment, it may not provide sufficient information about its spatial characteristics, such as shape and size. This limitation can hinder the customization of the generated clothing item, as well as the ability to accurately match the user's intended style. Therefore, we propose to leverage garment sketches to enrich the textual input with additional fine-grained spatial details. We achieve this following the same approach described for pose map conditioning. The final spatial input of our denoising network is γ = [z_t; m; E(I_M); p; s], with [p; s] ∈ R^{h×w×(18+1)}, where p and s are obtained by resizing P and S to match the latent space dimensions. In the case of sketches, we only condition the early steps of the denoising process, as the final steps have little influence on the shapes [2].

Mask Composition. To preserve the model identity when performing human-centric inpainting, we perform mask composition as the final step of the proposed approach. Defining Î = D(z_0) ∈ R^{H×W×3} as the output of the decoder D and M_head ∈ {0, 1}^{H×W×1} as the binary face mask of the model in image I, the final output image Ĩ is obtained as Ĩ = M_head ⊙ I + (1 − M_head) ⊙ Î, where ⊙ denotes the element-wise multiplication operator.
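The composition step reduces to the element-wise blend above; a minimal sketch, assuming all tensors share the same H × W resolution:

```python
import torch

def compose_output(decoded: torch.Tensor, original: torch.Tensor, head_mask: torch.Tensor) -> torch.Tensor:
    """I_tilde = M_head * I + (1 - M_head) * I_hat, applied element-wise."""
    return head_mask * original + (1.0 - head_mask) * decoded
```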
3.3. Training and Inference

As in standard latent diffusion models, given an encoded input z = E(I), the proposed denoising network is trained to predict the noise stochastically added to z. The corresponding objective function can be specified as

L = \mathbb{E}_{\mathcal{E}(I), Y, \epsilon \sim \mathcal{N}(0,1), t, \mathcal{E}(I_M), m, p, s} \left[ \lVert \epsilon - \epsilon_{\theta}(\gamma, \psi) \rVert_2^2 \right],   (2)

where γ = [z_t; m; E(I_M); p; s] and ψ = [t; T_E(Y)].

Classifier-Free Guidance. Classifier-free guidance is an inference technique that requires the denoising network to work both conditioned and unconditioned. This method modifies the unconditional model's predicted noise, moving it toward the conditioned one. Specifically, the predicted diffusion process at time t, given a generic condition c, is computed as follows:

\hat{\epsilon}_{\theta}(z_t \mid c) = \epsilon_{\theta}(z_t \mid \emptyset) + \alpha \cdot (\epsilon_{\theta}(z_t \mid c) - \epsilon_{\theta}(z_t \mid \emptyset)),   (3)

where ϵ_θ(z_t | c) is the predicted noise at time t given the condition c, ϵ_θ(z_t | ∅) is the predicted noise at time t given the null condition, and the guidance scale α controls the degree of extrapolation towards the condition.

Since our model deals with three conditions (i.e. text, pose map, and sketch), we use the fast variant of multi-condition classifier-free guidance proposed in [1]. Instead of performing classifier-free guidance according to each condition probability, it computes the direction of the joint probability of all the conditions, ∆^t_joint = ϵ_θ(z_t | {c_i}^N_{i=1}) − ϵ_θ(z_t | ∅):

\hat{\epsilon}_{\theta}(z_t \mid \{ c_i \}_{i=1}^{N}) = \epsilon_{\theta}(z_t \mid \emptyset) + \alpha \cdot \Delta_{\text{joint}}^{t}.   (4)

This reduces the number of feed-forward executions from N + 1 to 2.
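A minimal sketch of the multi-condition classifier-free guidance of Eqs. (3)-(4): the denoiser is evaluated once with all conditions active and once with all of them nulled, and the joint direction is scaled by the guidance weight α. The `denoiser` interface and the null-condition convention are assumptions of this sketch.

```python
def joint_classifier_free_guidance(denoiser, z_t, t, conditions, null_conditions, alpha=7.5):
    """eps_hat = eps(z_t | null) + alpha * (eps(z_t | c_1..c_N) - eps(z_t | null)).

    Only two forward passes are needed, independently of the number N of
    conditions (text, pose map, sketch, ...).
    """
    eps_uncond = denoiser(z_t, t, **null_conditions)   # eps_theta(z_t | null)
    eps_cond = denoiser(z_t, t, **conditions)          # eps_theta(z_t | {c_i})
    delta_joint = eps_cond - eps_uncond
    return eps_uncond + alpha * delta_joint
```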
Unconditional Training. Ensuring the ability of the denoising model to work both with and without conditions is achieved by replacing, at training time, the condition with a null one according to a fixed probability. This approach allows the model to learn from both conditional and unconditional samples, resulting in improved mode coverage and sample fidelity. Moreover, this technique also allows the model to optionally use the control signals at prediction time. Since our approach considers multiple conditions, we propose to extend the input masking to each condition independently. Experiments show that tuning this parameter can effectively affect the quality of the final result.
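The independent masking of each condition can be implemented as a per-sample dropout of the conditioning inputs, as sketched below; the masking probability and the use of zero tensors as null pose/sketch conditions follow the description above, while the function itself is illustrative.

```python
import torch

def drop_conditions(text_emb, pose, sketch, null_text_emb, p=0.2):
    """Independently replace each condition with its null counterpart with
    probability p, so the network also learns the unconditional case.
    `null_text_emb` must be broadcastable to the shape of `text_emb`."""
    b = text_emb.shape[0]
    device = text_emb.device

    def keep_mask(ndim):
        keep = (torch.rand(b, device=device) > p)
        return keep.view(b, *([1] * (ndim - 1)))

    text_emb = torch.where(keep_mask(text_emb.dim()), text_emb, null_text_emb)
    pose = pose * keep_mask(pose.dim()).float()       # zero tensor acts as the null pose map
    sketch = sketch * keep_mask(sketch.dim()).float() # zero tensor acts as the null sketch
    return text_emb, pose, sketch
```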
4. Collecting Multimodal Fashion Datasets
Currently available datasets for fashion image generation often contain low-resolution images and lack all the required multimodal information. In this section, we provide a complete description of how to enrich Dress Code and VITON-HD with garment-related text and sketches. We call our extended versions of these datasets Dress Code Multimodal and VITON-HD Multimodal, respectively. Sample images and multimodal data of the collected datasets can be found in Fig. 3.

Dataset Text Pose Sketch # Images # Garments # Texts # Words
VITON-HD [7] ✗ ✓ ✗ 27,358 13,679 - -
Dress Code [30] ✗ ✓ ✗ 107,584 53,792 - -
Be Your Own Prada [59] ✓ ✓ ✗ 78,979 N/A 3,972 445
DF-Multimodal [19] ✓ ✓ ✗ 44,096 N/A 10,253 77
VITON-HD Multimodal ✓ ✓ ✓ 27,358 13,679 5,143 1,613
Dress Code Multimodal ✓ ✓ ✓ 107,584 53,792 25,596 2,995

Table 1: Comparison of Dress Code and VITON-HD Multimodal with other fashion datasets with multimodal annotations.

4.1. Dataset Collection and Annotation
Data Preparation. We start the annotation from the Dress Code dataset, which contains more than 53k model-garment pairs of multiple categories. As a first step, we need to associate each garment with a textual description containing fashion-specific and non-generic terms, sufficiently detailed but not extremely lengthy, to be exploited for constraining the generation. Motivated by recent findings in the field showing that humans tend to describe fashion items using only a few words [3], we propose to use noun chunks (i.e. short textual sentences composed of a noun along with its modifiers) that can effectively capture important information while reducing unnecessary words or details. Given that manually annotating all the images would be time-consuming and resource-intensive¹, we propose a novel framework to semi-automatically annotate the dataset using noun chunks. Firstly, domain-specific captions are collected from two available fashion datasets, namely FashionIQ [53] and Fashion200k [12], standardizing them with word lemmatization and reducing each word to its root form with the NLTK library². Then, we extract noun chunks from the captions, filtering the results by removing all textual items that start with or contain special characters. After this pre-processing stage, we obtain more than 60k unique noun chunks, divided into three different categories (i.e. upper-body clothes, lower-body clothes, and dresses).

¹ Since the Dress Code dataset consists of over 53k fashion items and assuming that each annotation requires approximately 5 minutes, a single annotator working 8 hours per day, 5 days a week, and 260 working days per year would take more than 2 years to complete the annotation task.
² https://www.nltk.org/
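A possible implementation of the noun-chunk extraction step is sketched below with spaCy; lemmatization here relies on spaCy's lemmatizer as a stand-in for the NLTK pipeline used in the paper, and the filtering rule mirrors the special-character criterion described above.

```python
import re
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_noun_chunks(captions):
    """Lemmatize captions, extract noun chunks, and filter out items
    that start with or contain special characters."""
    chunks = set()
    for doc in nlp.pipe(captions):
        for chunk in doc.noun_chunks:
            # Reduce each word to its root form and drop determiners (articles).
            tokens = [t.lemma_.lower() for t in chunk if t.pos_ != "DET"]
            text = " ".join(tokens).strip()
            if text and re.fullmatch(r"[a-z0-9' \-]+", text) and text[0] not in "'-":
                chunks.add(text)
    return sorted(chunks)

# Example: extract_noun_chunks(["a long red sleeveless dress with floral print"])
```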
To determine the most relevant noun chunks for each garment, we employ the CLIP model [36] and its open-source adaptation (i.e. OpenCLIP [52]). We select the ViT-L/14@336 and RN50×64 models for CLIP, and the ViT-L/14, ViT-H/14, and ViT-g/14 models for OpenCLIP. Prompt ensembling is performed to improve the results and, for each image, we select 25 noun chunks based on the top-5 noun chunks per model, ranked by cosine similarity between image and text embeddings, avoiding repetitions.
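This zero-shot association can be sketched as follows, with a single Hugging Face CLIP model standing in for the five CLIP/OpenCLIP backbones used in the paper; prompt ensembling averages the text embeddings of several templates (taken from the list in the supplementary) before ranking noun chunks by cosine similarity with the image embedding.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

TEMPLATES = ["a photo of a {}", "a fashion studio shot of a {}", "a cropped photo of a {}"]

@torch.no_grad()
def rank_noun_chunks(image_path, noun_chunks, top_k=5):
    """Return the top-k noun chunks by image-text cosine similarity."""
    image_inputs = processor(images=Image.open(image_path), return_tensors="pt")
    img = model.get_image_features(**image_inputs)
    img = img / img.norm(dim=-1, keepdim=True)

    scores = []
    for chunk in noun_chunks:
        prompts = [t.format(chunk) for t in TEMPLATES]          # prompt ensembling
        text_inputs = processor(text=prompts, return_tensors="pt", padding=True)
        txt = model.get_text_features(**text_inputs)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        scores.append((img @ txt.mean(dim=0, keepdim=True).T).item())

    order = sorted(range(len(noun_chunks)), key=lambda i: scores[i], reverse=True)
    return [noun_chunks[i] for i in order[:top_k]]
```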
Fine-Grained Textual Annotation. To ensure the accuracy and representativeness of our annotations, we manually annotate a significant portion of Dress Code images. In particular, we select the three most representative noun chunks, among the 25 automatically associated ones, for each garment image. To minimize the annotation effort, we develop a custom annotation tool that constrains the annotation time to an average of 60 seconds per item and allows the annotator to manually insert noun chunks in case none of the automatically extracted ones are suitable for the image. Overall, we manually annotate 26,400 different garments (8,800 for each category) out of the 53,792 products included in the dataset, ensuring to include all fashion items of the original test set [30].

Coarse-Grained Textual Annotation. To complete the annotation, we first finetune the OpenCLIP ViT-B/32 model, pre-trained on the English portion of the LAION-5B dataset [42], using the newly annotated image-text pairs.
We then use this model and the collected set of noun chunks to automatically tag all the remaining elements of the Dress Code dataset with the three most similar noun chunks, always determined via cosine similarity between multimodal embeddings. We employ the same strategy to automatically annotate all garment images of the VITON-HD dataset. In this case, since this dataset only contains upper-body clothes, we limit the set of candidate noun chunks to the ones describing upper-body garments.

Extracting Sketches. The introduction of garment sketches can provide valuable design details that are not easily discernible from text alone. In this way, the dataset can provide a more accurate and comprehensive representation of the garments, leading to improved quality and better control of the generated design details. To extract sketches for both the Dress Code and VITON-HD datasets, we employ PiDiNet [46], a pre-trained edge detection network.

Given that the selected datasets have originally been introduced for virtual try-on, they consist of both paired and unpaired test sets. While for the paired set we can directly use the human parsing mask to extract the garment of interest worn by the model and then feed it to the edge detection network, for the unpaired set we first need to create a warped version of the in-shop garment matching the body pose and shape of the target model. Following virtual try-on methods [50, 55], we train a geometric transformation module that performs a thin-plate spline transformation [38] of the input garment and then refines the warped result using a U-Net model [40]. From each warped garment, we extract the sketch image, enabling the use of the proposed solution even in unpaired settings.
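For the paired setting, the sketch-extraction step can be approximated as below: the worn garment is isolated with the human parsing mask, pasted on a white background, and passed through an edge detector. The paper uses PiDiNet [46]; OpenCV's Canny detector is used here only as a readily available stand-in.

```python
import cv2
import numpy as np

def extract_garment_sketch(image_path, parsing_mask_path, low=50, high=150):
    """Isolate the worn garment with the parsing mask and extract its edges."""
    image = cv2.imread(image_path)                                  # H x W x 3 (BGR)
    mask = cv2.imread(parsing_mask_path, cv2.IMREAD_GRAYSCALE) > 0  # garment region

    garment = np.full_like(image, 255)                              # white background
    garment[mask] = image[mask]

    gray = cv2.cvtColor(garment, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, low, high)                              # binary edge map
    return 255 - edges                                              # dark strokes on white
```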
4.2. Comparison with Other Datasets

The only two text-to-image generation datasets available in the fashion domain [19, 59] are both based on images from the DeepFashion dataset [24]. While the dataset introduced in [59] contains short textual descriptions, DeepFashion-Multimodal [19] is annotated with attributes (e.g. category, color, fabric, etc.) that can be composed into longer captions. In Table 1, we summarize the main statistics of the textual annotations of the publicly available datasets compared with those of our newly extended datasets. As can be seen, our datasets contain more variety in terms of textual items and words, confirming the appropriateness of our annotation procedure and enabling a more personalized control of the generation process. Also, it is worth noting that the other datasets have no in-shop garment images, making them difficult to employ in our case.

5. Experimental Evaluation

5.1. Implementation Details and Competitors

Training and Inference. All models are trained on the original splits of the Dress Code Multimodal and VITON-HD Multimodal datasets on a single NVIDIA A100 GPU for 150k steps, using a batch size of 16, a learning rate of 10^−5 with a linear warmup for the first 500 iterations, and AdamW [25] as optimizer with weight decay 10^−2. To speed up training and save memory, we use mixed precision [28]. We set both the fraction of steps conditioned by the sketch and the portion of masked conditions during training to 0.2. During inference, we employ DDIM [45] with 50 steps as noise scheduler and set the classifier-free guidance parameter α to 7.5.

Baselines and Competitors. As first competitor, we use the out-of-the-box implementation of the inpainting Stable Diffusion pipeline³ provided by Hugging Face. Moreover, we adapt two existing models, namely FICE [35] and SDEdit [27], to work in our setting. In particular, we retrain all main components of the FICE model on the newly collected datasets. We employ the same resolution used by the authors (i.e. 256 × 256), downsampling each image to 256 × 192 and applying padding to match the desired size (which is then removed during evaluation). To compare our model with a different conditioning strategy, we employ the approach proposed in [27], using our model trained only with text and human poses as input modalities and performing the sketch guidance by using the sketch image with added random noise as the starting latent variable. Following the original paper's instructions, we use 0.8 as the strength parameter.

³ https://huggingface.co/runwayml/stable-diffusion-inpainting

5.2. Evaluation Metrics

To assess the realism of generated images, we employ the Fréchet Inception Distance (FID) [16] and the Kernel Inception Distance (KID) [4]. For both metrics, we adopt the implementation proposed in [34]. Instead, to evaluate the adherence of the image to the textual conditioning input, we employ the CLIP Score (CLIP-S) [15] provided in the TorchMetrics library [9], using the OpenCLIP ViT-H/14 model as cross-modal architecture. We compute the score on the inpainted region of the generated output pasted on a 224 × 224 white background.

Pose Distance (PD). We propose a novel pose distance metric that measures the coherence of human body poses between the generated image and the original one by estimating the distance between the human keypoints extracted from the two images. Specifically, we employ the OpenPifPaf [22] human pose estimation network and compute the ℓ2 distance between each pair of corresponding keypoints estimated on the real and generated images. We only consider the keypoints involved in the generation (i.e. those that fall within the mask M) and weigh each keypoint distance with the detector confidence to take into account possible estimation errors.

Sketch Distance (SD). To quantify the adherence of the generated image to the sketch constraint, we propose a novel sketch distance metric that compares the sketches extracted from the original and generated garments; its formal definition is provided in the supplementary material.
Modalities Dress Code Multimodal VITON-HD Multimodal
Model Resolution Text Pose Sketch FID ↓ KID ↓ CLIP-S ↑ PD ↓ SD ↓ FID ↓ KID ↓ CLIP-S ↑ PD ↓ SD ↓
Paired setting
Stable Diffusion [39] 256×192 ✓ 17.05 9.28 28.71 4.62 - 15.18 6.38 30.40 5.04 -
FICE [35] 256×192 ✓ ✓ 30.63 23.54 28.72 6.87 - 49.44 44.74 29.26 6.37 -
MGD (ours) 256×192 ✓ ✓ 5.57 1.67 31.33 2.37 - 10.11 3.14 31.85 2.90 -
Paired setting
Stable Diffusion [39] 512×384 ✓ 17.43 9.48 29.18 9.24 0.467 16.28 6.56 30.70 10.78 0.410
SDEdit [27] 512×384 ✓ ✓ ✓ 10.19 5.03 29.21 5.41 0.398 13.07 4.66 30.58 6.76 0.306
MGD (ours) 512×384 ✓ ✓ ✓ 5.74 2.11 31.68 4.72 0.374 10.60 3.26 32.39 5.94 0.253
Unpaired setting
Stable Diffusion [39] 256×192 ✓ 19.11 10.69 27.53 5.07 - 17.37 7.55 28.40 5.50 -
FICE [35] 256×192 ✓ ✓ 34.14 26.86 26.03 7.15 - 52.74 48.58 25.94 6.58 -
MGD (ours) 256×192 ✓ ✓ 7.01 2.19 29.58 2.96 - 11.54 3.18 29.95 3.30 -
Unpaired setting
Stable Diffusion [39] 512×384 ✓ 19.55 10.80 28.02 9.89 0.582 18.45 7.87 28.74 11.60 0.561
SDEdit [27] 512×384 ✓ ✓ ✓ 11.38 5.69 27.10 6.16 0.509 15.12 5.67 28.61 7.35 0.406
MGD (ours) 512×384 ✓ ✓ ✓ 7.73 2.82 30.04 6.79 0.458 12.81 3.86 30.75 7.22 0.317
Table 2: Quantitative results on the Dress Code Multimodal and VITON-HD Multimodal datasets for both paired and unpaired settings.
Figure 4: Sample generated images on Dress Code Multimodal and VITON-HD Multimodal (bottom left) using all multimodal inputs.
Stable Diffusion performs worse in terms of the pose distance than both SDEdit and MGD, owing to the lack of pose information in its inputs. It is noteworthy that SDEdit performs worse than our model in all metrics. We attribute this behavior to the way sketch conditioning happens: in SDEdit, it occurs only at the beginning, by initializing z_t with the sketch image plus noise added according to the conditioning strength, while our model conditions the denoising process in multiple steps, depending on the sketch conditioning parameter. Qualitative results reported in Fig. 4 highlight how our model better follows the given conditions and generates highly realistic images.

Uncond. Portion Sketch Cond. FID ↓ KID ↓ CLIP-S ↑ PD ↓ SD ↓
0.1 1.0 9.64 3.76 30.24 7.66 0.459
0.2 1.0 8.62 3.24 29.06 7.51 0.430
0.3 1.0 10.93 4.78 28.47 7.69 0.432
0.2 0.8 8.56 3.28 29.31 7.32 0.433
0.2 0.6 8.43 3.21 29.51 7.32 0.436
0.2 0.4 8.11 3.00 29.79 7.13 0.440
0.2 0.2 7.73 2.82 30.04 6.79 0.458
0.2 0.0 7.82 2.85 29.93 6.26 0.519

Table 5: Ablation analysis of our complete model, varying the unconditional training portion and the fraction of sketch conditioning steps. Results refer to the unpaired setting on Dress Code Multimodal.
To validate our results based on human judgment, we conduct a user study that evaluates both the realism of the generation and the adherence to the multimodal inputs. Overall, we collect about 7k evaluations involving more than 150 users. Additional details are reported in the supplementary. Table 4 shows the user study results. Also in this case, our model outperforms the competitors, thus confirming the effectiveness of our proposal.

Varying Input Modalities. In Table 3, we study the behavior of our MGD model when the input modalities are masked (i.e. when we feed the model with a zero tensor instead of the considered modality). In particular, we focus on the CLIP-S for text adherence and on the newly proposed pose and sketch distances for pose and sketch coherency, respectively. Notice that the text input anchors the CLIP-S metrics of all experiments and makes them comparable in all cases. Starting from the fully conditioned model (i.e. text, pose, sketch), we mask the sketch. As the decrease of the sketch distance in Table 3 confirms, this input actually influences the generation process of our model on both the considered datasets. Also, this modality slightly affects the pose distance, as the sketch implicitly contains information about the model's body pose. We further mask the pose map input and compare the output with the previous results. In this case, we can also notice a consistent difference with respect to the text-only conditioned model, according to all metrics except CLIP-S, as expected. These results confirm that our MGD model can effectively deal with the conditions in a disentangled way, making them optional.
Unconditional Training and Sketch Conditioning. In Table 5, we analyze the performance of the fully conditioned network when varying the portion of unconditional training. Additionally, we evaluate the results obtained by varying the fraction of sketch conditioning steps. As can be seen, the best results are achieved by using 0.2 for both parameters. In particular, for unconditional training, we train three different models (i.e. with 0.1, 0.2, and 0.3). When evaluating the sketch conditioning parameter, we test our model with values between 0 and 1 with a stride of 0.2. It is worth noting that the sketch distance consistently decreases as the number of sketch conditioning steps increases, showing the robustness of the approach.

6. Conclusion

The Multimodal Garment Designer proposed in this paper is the first latent diffusion model defined for human-centric fashion image editing, conditioned by multimodal inputs such as text, body pose, and sketches. The novel architecture, trained on two new semi-automatically annotated datasets and evaluated with standard and newly proposed metrics, as well as through user studies, is very promising. The result is one of the first successful attempts to mimic the designers' job in the creative process of fashion design and could be a starting point for the widespread adoption of diffusion models in creative industries, under the oversight of human input.

Acknowledgments

This work has partially been supported by the European Commission under the PNRR-M4C2 project "FAIR - Future Artificial Intelligence Research" and the European Horizon 2020 Programme (grant number 101004545 - ReInHerit), and by the PRIN project "CREATIVE: CRoss-modal understanding and gEnerATIon of Visual and tExtual content" (CUP B87G22000460001), co-funded by the Italian Ministry of University.

References

[1] Omri Avrahami, Thomas Hayes, Oran Gafni, Sonal Gupta, Yaniv Taigman, Devi Parikh, Dani Lischinski, Ohad Fried, and Xi Yin. SpaText: Spatio-Textual Representation for Controllable Image Generation. In CVPR, 2023.
[2] Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, et al. eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers. arXiv preprint arXiv:2211.01324, 2022.
[3] Federico Bianchi, Jacopo Tagliabue, and Bingqing Yu. Query2Prod2Vec: Grounded Word Embeddings for eCommerce. In NAACL, 2021.
[4] Mikołaj Bińkowski, Dougal J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. In ICLR, 2018.
[5] Shin-I Cheng, Yu-Jie Chen, Wei-Chen Chiu, Hung-Yu Tseng, and Hsin-Ying Lee. Adaptively-Realistic Image Generation from Stroke and Sketch with Diffusion Model. In WACV, 2023.
[6] Jooyoung Choi, Sungwon Kim, Yonghyun Jeong, Youngjune Gwon, and Sungroh Yoon. ILVR: Conditioning Method for Denoising Diffusion Probabilistic Models. In ICCV, 2021.
[7] Seunghwan Choi, Sunghyun Park, Minsoo Lee, and Jaegul Choo. VITON-HD: High-Resolution Virtual Try-On via Misalignment-Aware Normalization. In CVPR, 2021.
[8] Guillem Cucurull, Perouz Taslakian, and David Vazquez. Context-Aware Visual Compatibility Prediction. In CVPR, 2019.
[9] Nicki Skafte Detlefsen, Jiri Borovec, Justus Schock, Ananya Harsh Jha, Teddy Koker, Luca Di Liello, Daniel Stancl, Changsheng Quan, Maxim Grechkin, and William Falcon. TorchMetrics - Measuring Reproducibility in PyTorch. Journal of Open Source Software, 7(70):4101, 2022.
[10] Prafulla Dhariwal and Alexander Nichol. Diffusion Models Beat GANs on Image Synthesis. In NeurIPS, 2021.
[11] M. Hadi Kiapour, Xufeng Han, Svetlana Lazebnik, Alexander C. Berg, and Tamara L. Berg. Where to Buy It: Matching Street Clothing Photos in Online Shops. In ICCV, 2015.
[12] Xintong Han, Zuxuan Wu, Phoenix X. Huang, Xiao Zhang, Menglong Zhu, Yuan Li, Yang Zhao, and Larry S. Davis. Automatic Spatially-Aware Fashion Concept Discovery. In ICCV, 2017.
[13] Xintong Han, Zuxuan Wu, Zhe Wu, Ruichi Yu, and Larry S. Davis. VITON: An Image-Based Virtual Try-On Network. In CVPR, 2018.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In ICCV, 2015.
[15] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A Reference-free Evaluation Metric for Image Captioning. In EMNLP, 2021.
[16] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Günter Klambauer, and Sepp Hochreiter. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In NeurIPS, 2017.
[17] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. In NeurIPS, 2020.
[18] Wei-Lin Hsiao and Kristen Grauman. Creating Capsule Wardrobes from Fashion Images. In CVPR, 2018.
[19] Yuming Jiang, Shuai Yang, Haonan Qiu, Wayne Wu, Chen Change Loy, and Ziwei Liu. Text2Human: Text-Driven Controllable Human Image Generation. ACM Transactions on Graphics, 41(4):1-11, 2022.
[20] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. In ECCV, 2016.
[21] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In ICLR, 2015.
[22] Sven Kreiss, Lorenzo Bertoni, and Alexandre Alahi. OpenPifPaf: Composite Fields for Semantic Keypoint Detection and Spatio-Temporal Association. IEEE Transactions on Intelligent Transportation Systems, 23(8):13498-13511, 2021.
[23] Wenbo Li, Zhe Lin, Kun Zhou, Lu Qi, Yi Wang, and Jiaya Jia. MAT: Mask-Aware Transformer for Large Hole Image Inpainting. In CVPR, 2022.
[24] Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations. In CVPR, 2016.
[25] Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In ICLR, 2019.
[26] Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. RePaint: Inpainting Using Denoising Diffusion Probabilistic Models. In CVPR, 2022.
[27] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations. In ICLR, 2022.
[28] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed Precision Training. In ICLR, 2018.
[29] Davide Morelli, Alberto Baldrati, Giuseppe Cartella, Marcella Cornia, Marco Bertini, and Rita Cucchiara. LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On. In ACM Multimedia, 2023.
[30] Davide Morelli, Matteo Fincato, Marcella Cornia, Federico Landi, Fabio Cesari, and Rita Cucchiara. Dress Code: High-Resolution Multi-Category Virtual Try-On. In ECCV, 2022.
[31] Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models. arXiv preprint arXiv:2302.08453, 2023.
[32] Alexander Quinn Nichol and Prafulla Dhariwal. Improved Denoising Diffusion Probabilistic Models. In ICML, 2021.
[33] Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. In ICML, 2022.
[34] Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On Aliased Resizing and Surprising Subtleties in GAN Evaluation. In CVPR, 2022.
[35] Martin Pernuš, Clinton Fookes, Vitomir Štruc, and Simon Dobrišek. FICE: Text-Conditioned Fashion Image Editing With Guided GAN Inversion. arXiv preprint arXiv:2301.02110, 2023.
[36] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. In ICML, 2021.
[37] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv preprint arXiv:2204.06125, 2022.
[38] Ignacio Rocco, Relja Arandjelovic, and Josef Sivic. Convolutional Neural Network Architecture for Geometric Matching. In CVPR, 2017.
[39] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis With Latent Diffusion Models. In CVPR, 2022.
[40] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. In MICCAI, 2015.
[41] Rohan Sarkar, Navaneeth Bodla, Mariya I. Vasileva, Yen-Liang Lin, Anurag Beniwal, Alan Lu, and Gerard Medioni. OutfitTransformer: Learning Outfit Representations for Fashion Recommendation. In WACV, 2023.
[42] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W. Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa R. Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5B: An Open Large-Scale Dataset for Training Next Generation Image-Text Models. In NeurIPS, 2022.
[43] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR, 2015.
[44] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep Unsupervised Learning Using Nonequilibrium Thermodynamics. In ICML, 2015.
[45] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising Diffusion Implicit Models. In ICLR, 2021.
[46] Zhuo Su, Wenzhe Liu, Zitong Yu, Dewen Hu, Qing Liao, Qi Tian, Matti Pietikäinen, and Li Liu. Pixel Difference Networks for Efficient Edge Detection. In ICCV, 2021.
[47] Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lempitsky. Resolution-Robust Large Mask Inpainting with Fourier Convolutions. In WACV, 2022.
[48] Ming Tao, Hao Tang, Fei Wu, Xiao-Yuan Jing, Bing-Kun Bao, and Changsheng Xu. DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis. In CVPR, 2022.
[49] Andrey Voynov, Kfir Aberman, and Daniel Cohen-Or. Sketch-Guided Text-to-Image Diffusion Models. arXiv preprint arXiv:2211.13752, 2022.
[50] Bochao Wang, Huabin Zheng, Xiaodan Liang, Yimin Chen, Liang Lin, and Meng Yang. Toward Characteristic-Preserving Image-Based Virtual Try-On Network. In ECCV, 2018.
[51] Tengfei Wang, Ting Zhang, Bo Zhang, Hao Ouyang, Dong Chen, Qifeng Chen, and Fang Wen. Pretraining is All You Need for Image-to-Image Translation. arXiv preprint arXiv:2205.12952, 2022.
[52] Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, et al. Robust Fine-Tuning of Zero-Shot Models. In CVPR, 2022.
[53] Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, and Rogerio Feris. Fashion IQ: A New Dataset Towards Retrieving Images by Natural Language Feedback. In CVPR, 2021.
[54] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks. In CVPR, 2018.
[55] Han Yang, Ruimao Zhang, Xiaobao Guo, Wei Liu, Wangmeng Zuo, and Ping Luo. Towards Photo-Realistic Virtual Try-On by Adaptively Generating-Preserving Image Content. In CVPR, 2020.
[56] Han Zhang, Jing Yu Koh, Jason Baldridge, Honglak Lee, and Yinfei Yang. Cross-Modal Contrastive Learning for Text-to-Image Generation. In CVPR, 2021.
[57] Lvmin Zhang and Maneesh Agrawala. Adding Conditional Control to Text-to-Image Diffusion Models. arXiv preprint arXiv:2302.05543, 2023.
[58] Minfeng Zhu, Pingbo Pan, Wei Chen, and Yi Yang. DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-to-Image Synthesis. In CVPR, 2019.
[59] Shizhan Zhu, Raquel Urtasun, Sanja Fidler, Dahua Lin, and Chen Change Loy. Be Your Own Prada: Fashion Synthesis with Structural Coherence. In ICCV, 2017.

A. Dress Code Multimodal and VITON-HD Multimodal Datasets

In this section, we give additional details about the dataset collection and annotation process and provide statistics and further examples of the collected datasets.

A.1. Data Preparation

Before extracting noun chunks from the textual sentences of FashionIQ [53] and Fashion200k [12], we perform word lemmatization to reduce each word to its root form. Such a pre-processing stage is crucial for the FashionIQ dataset, as its captions do not describe a single garment but instead express the properties to modify in a given image to match its target. Fig. 5 shows two examples of FashionIQ annotations.

Figure 5: Examples of FashionIQ data type.

We use the spaCy NLP toolkit⁵ to extract noun chunks from textual sentences. To facilitate prompt engineering at a later stage, we remove the articles at the beginning of each noun chunk. Subsequently, we filter out all noun chunks starting with or containing special characters and keep unique elements. Table 6 reports detailed statistics about the number of unique captions and extracted noun chunks from which we start the annotation.

Unique Captions / Unique Noun Chunks
Dataset Upper Lower Dresses Upper Lower Dresses
FashionIQ [53] 27,339 0 15,101 7,801 0 3,592
Fashion200k [12] 25,959 11,022 16,694 22,898 13,420 15,890

Table 6: Number of unique captions and noun chunks for each category of the FashionIQ and Fashion200k datasets.

Textual Prompts. As described in the main paper, we rely on the cosine similarity between CLIP-based image and text embeddings to associate each garment with the 25 most representative noun chunks. We exploit prompt ensembling to perform such a zero-shot association, as it is shown in [36] that this technique improves performance. The employed textual prompts are:
• a photo of a [noun chunk],
• a photo of a nice [noun chunk],
• a photo of a cool [noun chunk],
• a photo of an expensive [noun chunk],
• a good photo of a [noun chunk],
• a bright photo of a [noun chunk],
• a fashion studio shot of a [noun chunk],
• a fashion magazine photo of a [noun chunk],
• a fashion brochure photo of a [noun chunk],
• a fashion catalog photo of a [noun chunk],
• a fashion press photo of a [noun chunk],
• a yoox photo of a [noun chunk],
• a yoox web image of a [noun chunk],
• a high-resolution photo of a [noun chunk],
• a cropped photo of a [noun chunk],
• a close-up photo of a [noun chunk],
• a photo of one [noun chunk].

⁵ https://spacy.io/

A.2. Annotation Tool for Fine-Grained Annotation

We develop a custom annotation tool using the Django and Angular web frameworks to ease and speed up the fine-grained annotation process. Fig. 6 depicts the user interface. In the annotation phase, users are provided with both the model's image and the corresponding in-shop garment and should select the three most representative noun chunks per item (Fig. 6a). If the automatic selection process fails to suggest three correct noun chunks, the user can manually insert them (Fig. 6b).

A.3. Coarse-Grained Annotation

After completing the manual annotation process on Dress Code, we obtain 26,400 different model-garment pairs (with 8,800 items per category), each associated with three different noun chunks. To annotate the remaining 27,392 items of Dress Code Multimodal and the 13,679 items of VITON-HD Multimodal, we leverage the manually annotated image-text pairs and finetune the OpenCLIP ViT-B/32 [52] model pre-trained on the English portion of the LAION-5B dataset.

CLIP Finetuning. We finetune both encoders of the OpenCLIP model using a single NVIDIA A100 GPU for 400 steps, with a batch size of 2048 and a learning rate of 10^−6.
Figure 6: User interface of the custom annotation tool. In (a) the user can select the noun chunks among the proposed ones, while in (b) the user can manually annotate the garment.
As optimizer, we use AdamW [25] with a weight decay of 0.2. We use mixed precision [28] to speed up training and save memory. During the training process, we monitor the model performance using the top-3 accuracy metric on the test split of the Dress Code Multimodal dataset. We choose this metric because we aim to associate each image with three distinct noun chunks. The out-of-the-box model achieves a top-3 accuracy of 12.95%, which improves to 16.60% after finetuning. The OpenCLIP ViT-g/14 model instead achieves a top-3 accuracy of 16.21%, while being computationally heavier than the ViT-B/32 version. Since the ViT-g/14 model predicts the set of noun chunks from which we extract the ground-truth, the actual difference in performance between the finetuned ViT-B/32 model and the out-of-the-box ViT-g/14 model could be even higher.

Images / Unique Noun Chunks
Dataset Ann. Split Upper Lower Dresses Upper Lower Dresses
Dress Code M. F Train 7,000 7,000 7,000 4,751 5,914 4,410
Dress Code M. F Test 1,800 1,800 1,800 2,337 2,861 2,144
Dress Code M. F ∪ 8,800 8,800 8,800 5,284 6,509 4,915
Dress Code M. F ∩ - - - 1,804 2,266 1,639
Dress Code M. C Train 6,563 151 20,666 7,198 320 8,650
Dress Code M. C Test 0 0 0 0 0 0
Dress Code M. C ∪ 6,563 151 20,666 7,198 320 8,650
Dress Code M. C ∩ - - - 0 0 0
Dress Code M. F+C Train 13,563 7,151 27,666 9,163 6,037 9,465
Dress Code M. F+C Test 1,800 1,800 1,800 2,337 2,861 2,144
Dress Code M. F+C ∪ 15,363 8,951 29,466 9,431 6,597 9,568
Dress Code M. F+C ∩ - - - 2,069 2,301 2,041
VITON-HD M. C Train 11,647 - - 4,823 - -
VITON-HD M. C Test 2,032 - - 2,149 - -
VITON-HD M. C ∪ 13,679 - - 5,143 - -
VITON-HD M. C ∩ - - - 1,829 - -

Table 7: Number of images and unique noun chunks per category for both Dress Code Multimodal and VITON-HD Multimodal. (F) indicates the fine-grained annotation, while (C) indicates the coarse-grained annotation.

A.4. Extracting Sketches

As mentioned in the main paper, we train a warping module to generate input sketches for the unpaired setting (i.e. when the multimodal information given as input corresponds to a garment different from the one originally worn by the model). In particular, our method involves the transformation of a given in-shop garment C ∈ R^{H×W×3} into a warped image of the same garment that fits the model of a target image I. We employ the warping module proposed in [50], refining the results with a U-Net based component [40].

The warping module computes a correlation map between the encoded representations of the in-shop garment C and a cloth-agnostic person representation composed of the pose map P ∈ R^{H×W×18} and the masked model image I_M ∈ R^{H×W×3}. We use two separate convolutional networks to obtain these encoded representations. Based on the computed correlation map, we predict the spatial transformation parameters θ of a thin-plate spline geometric transformation [38] (i.e. TPS_θ). We then use the θ parameters to compute the coarse warped garment Ĉ starting from the in-shop garment C as follows:

\hat{C} = \text{TPS}_{\theta}(C).   (5)

To refine the result, we employ a U-Net model that takes as input the concatenation of the coarse warped garment Ĉ, the pose map P, and the masked model image I_M, and predicts the refined warped garment C̃.

We train this model on the training sets of both Dress Code Multimodal and VITON-HD Multimodal using a combination of an L1 loss between generated and target in-shop garments and a perceptual loss (also known as VGG loss [20]) computed between the feature maps of generated and target garments extracted with a VGG-19 [43]. We train with a resolution of 256 × 192, Adam [21] as optimizer with β1 = 0.5, β2 = 0.99, and a learning rate equal to 10^−4. We train the network on the VITON-HD dataset for 30 epochs, while the training on the Dress Code dataset converges after 80 epochs.
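A hedged sketch of the refinement objective described above, combining an L1 term with a VGG-19 perceptual loss; the chosen feature layer and the loss weight are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class RefinementLoss(nn.Module):
    """L1 + VGG-19 perceptual loss between the refined warped garment and the target."""

    def __init__(self, vgg_weight=1.0):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
        self.vgg = vgg[:16]                       # features up to an intermediate conv block
        for p in self.vgg.parameters():
            p.requires_grad_(False)
        self.vgg_weight = vgg_weight

    def forward(self, refined, target):
        l1 = torch.abs(refined - target).mean()
        perceptual = torch.abs(self.vgg(refined) - self.vgg(target)).mean()
        return l1 + self.vgg_weight * perceptual
```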
Pose Distance (PD). Given a real image I and the corresponding generated image Ĩ, we run a human pose estimator K on both (i.e. in our case, we use OpenPifPaf [22]) and identify the set of keypoints falling in the mask M as K(·)_M. We compute the final score with an ℓ2 distance between each pair of corresponding real-generated keypoints (i.e. k ∈ K(I)_M and k̃ ∈ K(Ĩ)_M, respectively), weighting each keypoint distance with the detector confidence to consider possible estimation errors. Formally, our pose distance metric is defined as follows:

\text{PD}(I, \tilde{I}) = \frac{\sum_{k \in \mathcal{K}(I)_M,\, \tilde{k} \in \mathcal{K}(\tilde{I})_M} \sqrt{(k_x - \tilde{k}_x)^2 + (k_y - \tilde{k}_y)^2} \cdot \text{CF}_{k\tilde{k}}}{\sum_{k\tilde{k}} \text{CF}_{k\tilde{k}}},   (7)

where, for each pair of real-generated keypoints, CF_{kk̃} is 1 if the confidence of the detector K on both keypoints is greater than or equal to 0.5, and 0 otherwise.
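A direct implementation of Eq. (7), assuming keypoints are provided as (x, y, confidence) triplets returned by the same detector on the real and generated images:

```python
import numpy as np

def pose_distance(kpts_real, kpts_gen, mask, conf_thr=0.5):
    """Eq. (7): confidence-gated L2 distance between corresponding keypoints.

    `kpts_real` and `kpts_gen` are (K, 3) arrays of (x, y, confidence);
    `mask` is the H x W binary inpainting mask M.
    """
    num, den = 0.0, 0.0
    for (x, y, c), (xg, yg, cg) in zip(kpts_real, kpts_gen):
        # Only keypoints involved in the generation, i.e. falling inside M.
        if not mask[int(round(y)), int(round(x))]:
            continue
        cf = 1.0 if (c >= conf_thr and cg >= conf_thr) else 0.0
        num += np.sqrt((x - xg) ** 2 + (y - yg) ** 2) * cf
        den += cf
    return num / den if den > 0 else 0.0
```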
Sketch Distance (SD). To evaluate the adherence of the generated images to the constraints imposed by the input sketch, we propose a new sketch distance metric. To compute the metric, we first extract the label maps of the ground-truth and generated garments using an off-the-shelf semantic segmentation model⁶. We segment the garment according to its category and paste it on a white background of shape 512 × 384. We refer to these new images as I_S and Ĩ_S, respectively. Then, we extract the garment sketches of both the ground-truth and the generated images using an edge detector network Edge (i.e. PiDiNet [46]). Finally, we compute the mean squared error between the extracted sketches, weighting the per-pixel results by the inverse frequency of the activated pixels. Formally, the introduced sketch distance metric is defined as follows:

\text{SD}(I_S, \tilde{I}_S) = \text{MSE}\left(\text{Edge}(I_S), \text{Edge}(\tilde{I}_S)\right) \cdot p,   (8)

where p is the inverse pixel frequency. It is noteworthy that sketch thresholding could be applied before the distance computation. Nevertheless, we argue that avoiding thresholding enables an effective comparison with hand-drawn grayscale ground-truth sketches. This approach can facilitate the evaluation of methods that generate images conditioned on a sketch. Therefore, we think the proposed metric can be a valuable tool for comparing sketch-guided generative architectures.

⁶ https://github.com/levindabhi/cloth-segmentation
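Eq. (8) can be computed as below; `edge_detector` stands in for PiDiNet, and reading the inverse pixel frequency p as the ratio between the total number of pixels and the number of activated sketch pixels is an assumption of this sketch.

```python
import torch

def sketch_distance(garment_real, garment_gen, edge_detector):
    """Eq. (8): MSE between extracted sketches, weighted by the inverse
    frequency of activated (non-zero) sketch pixels."""
    with torch.no_grad():
        sketch_real = edge_detector(garment_real)   # (B, 1, H, W), values in [0, 1]
        sketch_gen = edge_detector(garment_gen)

    mse = torch.mean((sketch_real - sketch_gen) ** 2)

    # Inverse pixel frequency p: total pixels over activated pixels in the target sketch.
    activated = (sketch_real > 0).float().sum().clamp(min=1.0)
    p = sketch_real.numel() / activated
    return (mse * p).item()
```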
Figure 9: User study interface, where (a) corresponds to the realism evaluation and (b) refers to the coherence analysis between generated images and the given multimodal inputs.

C. User Study

As mentioned in the main paper, we conduct a user study to evaluate the realism of generated images and their adherence to the given multimodal inputs, comparing our results with those of the considered competitors. To this aim, we develop a custom web interface presenting two different surveys. The former (Fig. 9a) assesses the realism of the generated output, asking the user to select, for each comparison, the image that seems more realistic. In the latter (Fig. 9b), given the model's image, the set of noun chunks describing the garment, and the sketch, the user is asked to select which of the two proposed outputs looks more coherent with the multimodal inputs, also taking into account the model's body pose. Overall, we collect around 7k evaluations, 3.5k for each test, involving more than 150 users.

D. Additional Results

In this section, we provide additional experimental results to understand the strengths and limitations of our approach. Table 8 extends Table 2 of the main paper, showing quantitative results on each garment category of Dress Code Multimodal. Since each category contains only 1,800 images, the FID score presents a high variance in the reported results.
Modalities Upper-body Lower-body Dresses
Model Resolution Text Keypoints Sketch FID ↓ KID ↓ CLIP-S ↑ PD ↓ SD ↓ FID ↓ KID ↓ CLIP-S ↑ PD ↓ SD ↓ FID ↓ KID ↓ CLIP-S ↑ PD ↓ SD ↓
Paired setting
Stable Diff. [39] 256×192 ✓ 22.86 9.73 28.31 4.29 - 28.78 13.93 26.41 4.97 - 36.31 20.74 27.84 5.67 -
FICE [35] 256×192 ✓ ✓ 46.41 32.26 28.58 7.46 - 41.68 27.22 28.14 7.54 - 34.06 20.58 29.47 6.06 -
MGD (ours) 256×192 ✓ ✓ 11.88 2.82 31.48 1.91 - 10.24 1.55 30.50 2.58 - 11.87 2.03 32.05 2.57 -
Paired setting
Stable Diff. [39] 512×384 ✓ 21.00 8.59 30.17 7.95 0.310 28.40 14.48 28.02 9.96 0.345 33.12 17.39 29.36 9.86 0.450
SDEdit [27] 512×384 ✓ ✓ ✓ 15.78 5.52 29.73 4.21 0.222 16.64 6.07 29.00 6.51 0.256 21.53 9.02 28.89 5.67 0.270
MGD (ours) 512×384 ✓ ✓ ✓ 12.42 3.71 31.90 3.72 0.190 10.70 2.01 31.10 5.70 0.210 11.38 1.89 32.02 4.93 0.194
Unpaired setting
Stable Diff. [39] 256×192 ✓ 22.86 9.73 28.31 4.29 - 28.78 13.93 26.41 4.97 - 36.31 20.74 27.84 5.67 -
FICE [35] 256×192 ✓ ✓ 49.77 35.37 26.48 7.64 - 44.94 30.39 25.42 7.84 - 39.04 25.27 26.14 6.39 -
MGD (ours) 256×192 ✓ ✓ 14.50 3.48 29.24 2.39 - 13.70 2.48 29.09 3.32 - 13.72 2.50 30.37 3.17 -
Unpaired setting
Stable Diff. [39] 512×384 ✓ 24.23 10.39 28.64 8.59 0.413 30.90 15.38 27.03 10.43 0.453 35.96 19.94 28.37 10.60 0.609
SDEdit [27] 512×384 ✓ ✓ ✓ 17.86 6.50 27.36 4.78 0.357 19.16 6.85 27.08 7.53 0.399 22.97 9.98 26.85 6.42 0.411
MGD (ours) 512×384 ✓ ✓ ✓ 15.99 4.50 29.76 5.41 0.291 14.82 2.81 29.96 7.96 0.289 14.71 3.63 30.41 7.15 0.252
Table 8: Quantitative results on the upper-body, lower-body, and dress categories of Dress Code Multimodal for both paired and unpaired settings.
Modalities Dress Code Multimodal VITON-HD Multimodal
Model Resolution Text Pose Sketch FID ↓ KID ↓ CLIP-S ↑ PD ↓ SD ↓ FID ↓ KID ↓ CLIP-S ↑ PD ↓ SD ↓
Paired setting
ControlNet [57] 512×384 ✓ ✓ 18.36 9.82 29.00 7.46 0.462 19.08 9.35 30.03 7.72 0.392
MGD (ours) 512×384 ✓ ✓ 6.31 2.33 31.67 5.31 0.405 11.07 3.36 32.27 6.77 0.318
ControlNet [57] 512×384 ✓ ✓ 27.23 19.01 27.07 7.54 0.436 25.44 17.05 28.31 8.16 0.298
MGD (ours) 512×384 ✓ ✓ 5.72 2.15 31.69 4.94 0.373 10.64 3.26 32.31 6.18 0.255
Unpaired setting
ControlNet [57] 512×384 ✓ ✓ 20.66 11.58 27.57 8.15 0.577 21.03 10.34 28.11 8.38 0.534
MGD (ours) 512×384 ✓ ✓ 7.82 2.85 29.93 6.26 0.519 12.40 3.36 30.34 7.53 0.435
ControlNet [57] 512×384 ✓ ✓ 29.61 20.83 25.75 9.74 0.544 27.41 18.66 26.63 9.53 0.416
MGD (ours) 512×384 ✓ ✓ 7.65 2.70 30.21 7.50 0.456 12.65 3.59 30.69 7.49 0.320
Table 12: Performance comparison with ControlNet on the Dress Code Multimodal and VITON-HD Multimodal datasets for
both paired and unpaired settings.
Instead, the third example of the first row and the first two samples of the second row highlight the dependence of our model's performance on the given sketch. When the geometric warping module fails to generate a sketch able to fit
Figure 11: Sample images and multimodal data from our newly collected Dress Code Multimodal dataset (fine-grained textual annotations).
Figure 12: Sample images and multimodal data from our newly collected Dress Code Multimodal dataset (coarse-grained textual annotations).
Figure 13: Sample images and multimodal data from our newly collected VITON-HD Multimodal dataset (coarse-grained textual annotations).
Figure 14: Qualitative comparison on Dress Code Multimodal. From left to right: model's image, input sketch, pose map, image generated by Stable Diffusion [39], image generated by SDEdit [27], image generated by MGD (ours), and noun chunks.
Figure 15: Qualitative comparison on VITON-HD Multimodal. From left to right: model's image, input sketch, pose map, image generated by Stable Diffusion [39], image generated by SDEdit [27], image generated by MGD (ours), and noun chunks.
Figure 16: Qualitative comparison with low-resolution images on Dress Code Multimodal. From left to right: model's image, input sketch, pose map, image generated by Stable Diffusion [39], image generated by FICE [35], image generated by MGD (ours), and noun chunks.
Figure 17: Qualitative comparison with low-resolution images on VITON-HD Multimodal. From left to right: model's image, input sketch, pose map, image generated by Stable Diffusion [39], image generated by FICE [35], image generated by MGD (ours), and noun chunks.
Figure 18: Qualitative comparison of images generated by our model on Dress Code Multimodal using different conditioning modalities. From left to right: model's image, input sketch, pose map, image generated using only text, image generated using text and pose map, image generated with all input modalities (i.e. text, pose map, and sketch).
Figure 19: Qualitative results generated by MGD increasing the sketch conditioning steps.

Figure 20: Failure cases on Dress Code Multimodal (first row) and VITON-HD Multimodal (second row).