Fashion Diffusion Control
{name.surname}@unifi.it {name.surname}@unimore.it
Figure 1: In this work, we propose a novel multimodal garment designer framework based on latent diffusion models that can generate a novel fashion image conditioned on text, human keypoints, and a garment sketch.
input model wearing a new clothing item corresponding to the given textual description. In this context, only a few works [19, 35, 59] have been proposed, exclusively employing GAN-based approaches for the generative step.

Recently, diffusion models [10, 17, 32, 44] have attracted more and more attention due to their outstanding generation capabilities, allowing the improvement of a variety of downstream tasks in several domains, while their applicability to the fashion domain is still unexplored. Many different solutions have been introduced and can roughly be identified based on the denoising conditions used to guide the diffusion process, which can enable greater control of the synthesized output. A particular type of diffusion model has been proposed in [39] that, instead of applying the diffusion process in the pixel space, defines the forward and the reverse processes in the latent space of a pre-trained autoencoder, becoming one of the leading choices thanks to its reduced computational cost. Although this solution can generate highly realistic images, it does not perform well in human-centric generation tasks and cannot deal with multiple conditioning signals to guide the generation phase.

In this work, we address an extended and more general framework and define the new task of multimodal-conditioned fashion image editing, which allows guiding the generative process via multimodal prompts while preserving the identity and body shape of a given person (Fig. 1). To tackle this task, we introduce a new architecture, called Multimodal Garment Designer (MGD), that emulates the process of a designer conceiving a new garment on a model shape, based on preliminary indications provided through a textual sentence or a garment sketch. In particular, starting from Stable Diffusion [39], we propose a denoising network that can be conditioned by multiple modalities and also takes into account the pose consistency between input and generated images, thus improving the effectiveness of human-centric diffusion models.

To address the newly proposed task, we present a semi-automatic framework to extend existing datasets with multimodal data. Specifically, we start from two famous virtual try-on datasets (i.e. Dress Code [30] and VITON-HD [7]) and extend them with textual descriptions and garment sketches. Experimental results on the two proposed multimodal fashion benchmarks show both quantitatively and qualitatively that our proposed architecture generates high-quality images based on the given multimodal inputs and outperforms all considered competitors and baselines, also according to human evaluations.

To sum up, our contributions are as follows: (1) We propose a novel task of multimodal-conditioned fashion image editing, which entails the use of multimodal data to guide the generation. (2) We introduce a new human-centric generative architecture based on latent diffusion models, capable of following multimodal prompts while preserving the model's characteristics. (3) To tackle the new task, we extend two existing fashion datasets with textual sentences and garment sketches, devising a semi-automatic annotation framework. (4) Extensive experiments demonstrate that the proposed approach outperforms other competitors in terms of realism and coherence with multimodal inputs.

2. Related Work

Text-Guided Image Generation. Creating an image that faithfully reflects the provided textual prompt is the goal of text-to-image synthesis. In this context, early approaches were based on GANs [48, 54, 56, 58], while most recent solutions exploit the effectiveness of diffusion models [33, 37, 39]. In the fashion domain, only a few attempts at text-to-image synthesis have been proposed [19, 35, 59]. Specifically, Zhu et al. [59] presented a GAN-based solution that generates the final image conditioned on both textual descriptions and semantic layouts. A different approach is the one introduced in [35], where a latent code regularization technique is employed to augment the GAN inversion process by exploiting CLIP textual embeddings [36] to guide the image editing process. Instead, Jiang et al. [19] proposed an architecture that synthesizes full-body images by mapping the textual descriptions of clothing items into one-hot vectors, limiting however the expressiveness capability of the conditioning signal.

Multimodal Image Generation with Diffusion Models. A related line of works aims to condition existing diffusion models on different modalities, thus enabling greater control over the generation process [5, 6, 27, 31, 51]. For example, Choi et al. [6] proposed to refine the generative process of an unconditional denoising diffusion probabilistic model [32] by matching each latent variable with the given reference image. On a different line, the approach introduced in [27] adds noise to a stroke-based input and applies the reverse stochastic differential equation to synthesize images, without additional training. Wang et al. [51], instead, proposed to learn a highly semantic latent space and perform conditional finetuning for each downstream task to map the guidance signals to the pre-trained space. Other recent works proposed to add sketches as additional conditioning signals, either concatenating them with the model input [5] or training an MLP-based edge predictor to map latent features to spatial maps [49].

Among contemporary works that aim to condition pre-trained latent diffusion models, ControlNet [57] proposes to extend the Stable Diffusion model [39] with an additional conditioning input. This process involves creating two versions of the original model's weights: one that remains fixed and unchanged (locked copy) and another that can be updated during training (trainable copy). The purpose of this is to allow the trainable version to learn the newly introduced condition while the locked version retains the original model knowledge.
Figure 2: Overview of the proposed Multimodal Garment Designer (MGD), a human-centric latent diffusion model conditioned on multiple modalities (i.e. text, human pose, and garment sketch).
On the other hand, T2I-Adapter [31] learns modality-specific adapter modules that enable Stable Diffusion conditioning on new modalities.

In contrast, we focus on the fashion domain and propose a human-centric architecture based on latent diffusion models that directly exploits the conditioning of textual sentences and other modalities such as human body poses and garment sketches.

3. Proposed Method

In this section, we propose a novel task to automatically edit a human-centric fashion image conditioned on multiple modalities. Specifically, given the model image I ∈ R^{H×W×3}, its pose map P ∈ R^{H×W×18}, where the channels represent the human keypoints, a textual description Y of a garment, and a sketch of the same S ∈ R^{H×W×1}, we want to generate a new image Ĩ ∈ R^{H×W×3} that retains the information of the input model while substituting the target garment according to the multimodal inputs. To tackle the task, we propose a novel latent diffusion approach, called Multimodal Garment Designer (MGD), that can effectively combine multimodal information when generating the new image Ĩ. Our proposed architecture is a general framework that can easily be extended to other modalities such as texture and 3D information. We strongly believe this task can foster research in the field and enhance the design process of new fashion items with greater customization. An overview of our model is shown in Fig. 2.

3.1. Preliminaries

While diffusion models [44] are latent variable architectures that work in the same dimensionality as the data (i.e. in the pixel space), latent diffusion models (LDMs) [39] operate in the latent space of a pre-trained autoencoder, achieving higher computational efficiency while preserving the generation quality. In our work, we leverage the Stable Diffusion model [39], a text-to-image implementation of LDMs, as a starting point to perform multimodal conditioning for human-centric fashion image editing. Stable Diffusion is composed of an autoencoder with an encoder E and a decoder D, a text-time-conditional U-Net denoising model ϵ_θ, and a CLIP-based text encoder T_E taking as input a text Y. The encoder E compresses an image I into a lower-dimensional latent space defined in R^{h×w×4}, where h = H/8 and w = W/8. The decoder D performs the opposite operation, decoding a latent variable into the pixel space. For the sake of clarity, we define the convolutional input of ϵ_θ (i.e. z_t in this case) as the spatial input γ, because of the property of convolutions to preserve the spatial structure, and the attention conditioning input as ψ. The denoising network ϵ_θ is trained according to the following loss:

L = \mathbb{E}_{\mathcal{E}(I), Y, \epsilon \sim \mathcal{N}(0,1), t} \left[ \lVert \epsilon - \epsilon_{\theta}(\gamma, \psi) \rVert_2^2 \right],   (1)

where t is the diffusion time step, γ = z_t, ψ = [t; T_E(Y)], and ϵ ∼ N(0, 1) is the Gaussian noise added to E(I).
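To make the objective in Eq. (1) concrete, the following is a minimal PyTorch-style sketch of a single training step of a latent diffusion model. The callables `encode`, `unet`, and `text_encoder` stand in for the Stable Diffusion components E, ϵ_θ, and T_E, and `alphas_cumprod` for the cumulative noise schedule; they are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def ldm_training_step(encode, unet, text_encoder, alphas_cumprod, image, tokens):
    """One optimization step of the loss in Eq. (1)."""
    with torch.no_grad():
        z = encode(image)              # z = E(I), shape (B, 4, h, w)
        psi = text_encoder(tokens)     # psi = T_E(Y)

    b = z.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=z.device)
    eps = torch.randn_like(z)          # eps ~ N(0, 1)

    # Forward diffusion: z_t = sqrt(a_t) * z + sqrt(1 - a_t) * eps  (gamma = z_t).
    a_t = alphas_cumprod[t].view(b, 1, 1, 1)
    z_t = a_t.sqrt() * z + (1.0 - a_t).sqrt() * eps

    eps_pred = unet(z_t, t, psi)       # eps_theta(gamma, psi)
    return F.mse_loss(eps_pred, eps)   # || eps - eps_theta(gamma, psi) ||_2^2
```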
3.2. Human-Centric Image Editing

Our task aims to generate a new image Ĩ by replacing the target garment in the input image I according to multimodal inputs, while preserving the model's identity and physical characteristics. As a natural consequence, this task can be identified as a particular type of inpainting tailored for human body data. Instead of using a standard text-to-image model, we perform inpainting by concatenating, along the channel dimension of the denoising network input z_t, an encoded masked image E(I_M) and the relative resized binary inpainting mask m ∈ {0, 1}^{h×w×1}, which is derived from the original inpainting mask M ∈ {0, 1}^{H×W×1}. Hence, the spatial input of the denoising network becomes γ = [z_t; m; E(I_M)], with γ ∈ R^{h×w×9}. Thanks to the fully convolutional nature of the encoder E and the decoder D, this LDM-based architecture can preserve the spatial information in the latent space. Exploiting this feature, our method can thus optionally add conditioning constraints to the generation.
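Before introducing the additional constraints, the assembly of the 9-channel spatial input γ = [z_t; m; E(I_M)] described above can be sketched as follows; `encode` again stands in for the autoencoder E, and the interpolation mode is an assumption.

```python
import torch
import torch.nn.functional as F

def build_spatial_input(z_t, mask, masked_image, encode):
    """Assemble gamma = [z_t; m; E(I_M)] with shape (B, 4 + 1 + 4, h, w)."""
    # m: resize the H x W binary inpainting mask M to the latent size h x w.
    h, w = z_t.shape[-2:]
    m = F.interpolate(mask.float(), size=(h, w), mode="nearest")   # (B, 1, h, w)

    # E(I_M): encode the masked model image into the 4-channel latent space.
    with torch.no_grad():
        z_masked = encode(masked_image)                            # (B, 4, h, w)

    return torch.cat([z_t, m, z_masked], dim=1)                    # (B, 9, h, w)
```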
In particular, we propose to add two generation constraints in addition to the textual information: the model pose map P, to preserve the original human pose of the input model, and the garment sketch S, to allow the final users to better condition the garment generation process.

Pose Map Conditioning. In most cases [23, 26, 47], inpainting is performed with the objective of either removing or entirely replacing the content of the masked region. However, in our task, we aim to remove all information regarding the garment worn by the model while preserving the model's body information and identity. Thus, we propose to improve the garment inpainting process by using the bounding box of the segmentation mask along with pose map information representing body keypoints. This approach enables the preservation of the model's physical characteristics in the masked region while allowing the inpainting of garments with different shapes. Differently from conventional inpainting techniques, we focus on selectively retaining and discarding specific information within the masked region to achieve the desired outcome. To enhance the performance of the denoising network with human body keypoints, we modify the first convolution layer of the network by adding 18 additional channels, one for each keypoint. Adding new inputs would usually require retraining the model from scratch, thus consuming time, data, and resources, especially in the case of data-hungry models like diffusion ones. Therefore, we propose to extend the kernels of the pre-trained input layer of the denoising network with randomly initialized weights sampled from a uniform distribution [14] and retrain the whole network. This consistently reduces the number of training steps and enables training with less data. Our experiments show that such an improvement enhances the consistency of the body information between the generated image and the original one.
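As an illustration of this strategy, the sketch below widens a pre-trained input convolution by 18 channels while keeping the original kernels; the uniform (Kaiming) initialization mirrors the reference [14], but the helper function itself is hypothetical rather than the authors' code.

```python
import torch
import torch.nn as nn

def extend_input_conv(old_conv: nn.Conv2d, extra_in_channels: int = 18) -> nn.Conv2d:
    """Return a copy of `old_conv` that accepts `extra_in_channels` more inputs.

    Pre-trained kernels are kept; kernels for the new channels are randomly
    initialized so the network can be retrained instead of trained from scratch.
    """
    new_conv = nn.Conv2d(
        old_conv.in_channels + extra_in_channels,
        old_conv.out_channels,
        kernel_size=old_conv.kernel_size,
        stride=old_conv.stride,
        padding=old_conv.padding,
        dilation=old_conv.dilation,
        bias=old_conv.bias is not None,
    )
    nn.init.kaiming_uniform_(new_conv.weight)          # uniform init for all kernels
    with torch.no_grad():
        # Copy the pre-trained kernels for the original input channels.
        new_conv.weight[:, : old_conv.in_channels] = old_conv.weight
        if old_conv.bias is not None:
            new_conv.bias.copy_(old_conv.bias)
    return new_conv
```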
Incorporating Sketches. Fully describing a garment using only textual descriptions is a challenging task due to the complexity and ambiguity of natural language. While text can convey specific attributes like the style, color, and patterns of a garment, it may not provide sufficient information about its spatial characteristics, such as shape and size. This limitation can hinder the customization of the generated clothing item, as well as the ability to accurately match the user's intended style. Therefore, we propose to leverage garment sketches to enrich the textual input with additional fine-grained spatial details. We achieve this following the same approach described for pose map conditioning. The final spatial input of our denoising network is γ = [z_t; m; E(I_M); p; s], with [p; s] ∈ R^{h×w×(18+1)}, where p and s are obtained by resizing P and S to match the latent space dimensions. In the case of sketches, we only condition the early steps of the denoising process, as the final steps have little influence on the shapes [2].

Mask Composition. To preserve the model identity when performing human-centric inpainting, we perform mask composition as the final step of the proposed approach. Defining Î = D(z_0) ∈ R^{H×W×3} as the output of the decoder D and M_head ∈ {0, 1}^{H×W×1} as the binary face mask of the model in image I, the final output image Ĩ is obtained as Ĩ = M_head ⊙ I + (1 − M_head) ⊙ Î, where ⊙ denotes the element-wise multiplication operator.
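The composition step reduces to the element-wise blend above; a minimal sketch, assuming all tensors share the same H × W resolution:

```python
import torch

def compose_output(decoded: torch.Tensor, original: torch.Tensor, head_mask: torch.Tensor) -> torch.Tensor:
    """I_tilde = M_head * I + (1 - M_head) * I_hat, applied element-wise."""
    return head_mask * original + (1.0 - head_mask) * decoded
```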
3.3. Training and Inference

As in standard latent diffusion models, given an encoded input z = E(I), the proposed denoising network is trained to predict the noise stochastically added to z. The corresponding objective function can be specified as

L = \mathbb{E}_{\mathcal{E}(I), Y, \epsilon \sim \mathcal{N}(0,1), t, \mathcal{E}(I_M), m, p, s} \left[ \lVert \epsilon - \epsilon_{\theta}(\gamma, \psi) \rVert_2^2 \right],   (2)

where γ = [z_t; m; E(I_M); p; s] and ψ = [t; T_E(Y)].

Classifier-Free Guidance. Classifier-free guidance is an inference technique that requires the denoising network to work both conditioned and unconditioned. This method modifies the unconditional model's predicted noise, moving it toward the conditioned one. Specifically, the predicted diffusion process at time t, given a generic condition c, is computed as follows:

\hat{\epsilon}_{\theta}(z_t \mid c) = \epsilon_{\theta}(z_t \mid \emptyset) + \alpha \cdot (\epsilon_{\theta}(z_t \mid c) - \epsilon_{\theta}(z_t \mid \emptyset)),   (3)

where ϵ_θ(z_t | c) is the predicted noise at time t given the condition c, ϵ_θ(z_t | ∅) is the predicted noise at time t given the null condition, and the guidance scale α controls the degree of extrapolation towards the condition.

Since our model deals with three conditions (i.e. text, pose map, and sketch), we use the fast variant of multi-condition classifier-free guidance proposed in [1]. Instead of performing classifier-free guidance according to each condition probability, it computes the direction of the joint probability of all the conditions, ∆^t_joint = ϵ_θ(z_t | {c_i}^N_{i=1}) − ϵ_θ(z_t | ∅):

\hat{\epsilon}_{\theta}(z_t \mid \{ c_i \}_{i=1}^{N}) = \epsilon_{\theta}(z_t \mid \emptyset) + \alpha \cdot \Delta_{\text{joint}}^{t}.   (4)

This reduces the number of feed-forward executions from N + 1 to 2.
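A minimal sketch of the multi-condition classifier-free guidance of Eqs. (3)-(4): the denoiser is evaluated once with all conditions active and once with all of them nulled, and the joint direction is scaled by the guidance weight α. The `denoiser` interface and the null-condition convention are assumptions of this sketch.

```python
def joint_classifier_free_guidance(denoiser, z_t, t, conditions, null_conditions, alpha=7.5):
    """eps_hat = eps(z_t | null) + alpha * (eps(z_t | c_1..c_N) - eps(z_t | null)).

    Only two forward passes are needed, independently of the number N of
    conditions (text, pose map, sketch, ...).
    """
    eps_uncond = denoiser(z_t, t, **null_conditions)   # eps_theta(z_t | null)
    eps_cond = denoiser(z_t, t, **conditions)          # eps_theta(z_t | {c_i})
    delta_joint = eps_cond - eps_uncond
    return eps_uncond + alpha * delta_joint
```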
Unconditional Training. Ensuring the ability of the denoising model to work both with and without conditions is achieved by replacing, at training time, the condition with a null one according to a fixed probability. This approach allows the model to learn from both conditional and unconditional samples, resulting in improved mode coverage and sample fidelity. Moreover, this technique also allows the model to optionally use the control signals at prediction time. Since our approach considers multiple conditions, we propose to extend the input masking to each condition independently. Experiments show that tuning this parameter can effectively affect the quality of the final result.
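The independent masking of each condition can be implemented as a per-sample dropout of the conditioning inputs, as sketched below; the masking probability and the use of zero tensors as null pose/sketch conditions follow the description above, while the function itself is illustrative.

```python
import torch

def drop_conditions(text_emb, pose, sketch, null_text_emb, p=0.2):
    """Independently replace each condition with its null counterpart with
    probability p, so the network also learns the unconditional case.
    `null_text_emb` must be broadcastable to the shape of `text_emb`."""
    b = text_emb.shape[0]
    device = text_emb.device

    def keep_mask(ndim):
        keep = (torch.rand(b, device=device) > p)
        return keep.view(b, *([1] * (ndim - 1)))

    text_emb = torch.where(keep_mask(text_emb.dim()), text_emb, null_text_emb)
    pose = pose * keep_mask(pose.dim()).float()       # zero tensor acts as the null pose map
    sketch = sketch * keep_mask(sketch.dim()).float() # zero tensor acts as the null sketch
    return text_emb, pose, sketch
```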
4. Collecting Multimodal Fashion Datasets
Currently available datasets for fashion image generation often contain low-resolution images and lack all the required multimodal information. In this section, we provide a complete description of how to enrich Dress Code and VITON-HD with garment-related text and sketches. We call our extended versions of these datasets Dress Code Multimodal and VITON-HD Multimodal, respectively. Sample images and multimodal data of the collected datasets can be found in Fig. 3.

Dataset Text Pose Sketch # Images # Garments # Texts # Words
VITON-HD [7] ✗ ✓ ✗ 27,358 13,679 - -
Dress Code [30] ✗ ✓ ✗ 107,584 53,792 - -
Be Your Own Prada [59] ✓ ✓ ✗ 78,979 N/A 3,972 445
DF-Multimodal [19] ✓ ✓ ✗ 44,096 N/A 10,253 77
VITON-HD Multimodal ✓ ✓ ✓ 27,358 13,679 5,143 1,613
Dress Code Multimodal ✓ ✓ ✓ 107,584 53,792 25,596 2,995

Table 1: Comparison of Dress Code and VITON-HD Multimodal with other fashion datasets with multimodal annotations.

4.1. Dataset Collection and Annotation
Data Preparation. We start the annotation from the Dress Code dataset, which contains more than 53k model-garment pairs of multiple categories. As a first step, we need to associate each garment with a textual description containing fashion-specific and non-generic terms, sufficiently detailed but not extremely lengthy, to be exploited for constraining the generation. Motivated by recent findings in the field showing that humans tend to describe fashion items using only a few words [3], we propose to use noun chunks (i.e. short textual sentences composed of a noun along with its modifiers) that can effectively capture important information while reducing unnecessary words or details. Given that manually annotating all the images would be time-consuming and resource-intensive¹, we propose a novel framework to semi-automatically annotate the dataset using noun chunks. Firstly, domain-specific captions are collected from two available fashion datasets, namely FashionIQ [53] and Fashion200k [12], standardizing them with word lemmatization and reducing each word to its root form with the NLTK library². Then, we extract noun chunks from the captions, filtering the results by removing all textual items that start with or contain special characters. After this pre-processing stage, we obtain more than 60k unique noun chunks, divided into three different categories (i.e. upper-body clothes, lower-body clothes, and dresses).

¹ Since the Dress Code dataset consists of over 53k fashion items and assuming that each annotation requires approximately 5 minutes, a single annotator working 8 hours per day, 5 days a week, and 260 working days per year would take more than 2 years to complete the annotation task.
² https://www.nltk.org/
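A possible implementation of the noun-chunk extraction step is sketched below with spaCy; lemmatization here relies on spaCy's lemmatizer as a stand-in for the NLTK pipeline used in the paper, and the filtering rule mirrors the special-character criterion described above.

```python
import re
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_noun_chunks(captions):
    """Lemmatize captions, extract noun chunks, and filter out items
    that start with or contain special characters."""
    chunks = set()
    for doc in nlp.pipe(captions):
        for chunk in doc.noun_chunks:
            # Reduce each word to its root form and drop determiners (articles).
            tokens = [t.lemma_.lower() for t in chunk if t.pos_ != "DET"]
            text = " ".join(tokens).strip()
            if text and re.fullmatch(r"[a-z0-9' \-]+", text) and text[0] not in "'-":
                chunks.add(text)
    return sorted(chunks)

# Example: extract_noun_chunks(["a long red sleeveless dress with floral print"])
```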
To determine the most relevant noun chunks for each garment, we employ the CLIP model [36] and its open-source adaptation (i.e. OpenCLIP [52]). We select the ViT-L/14@336 and RN50×64 models for CLIP, and the ViT-L/14, ViT-H/14, and ViT-g/14 models for OpenCLIP. Prompt ensembling is performed to improve the results and, for each image, we select 25 noun chunks based on the top-5 noun chunks per model, ranked by cosine similarity between image and text embeddings, avoiding repetitions.
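This zero-shot association can be sketched as follows, with a single Hugging Face CLIP model standing in for the five CLIP/OpenCLIP backbones used in the paper; prompt ensembling averages the text embeddings of several templates (taken from the list in the supplementary) before ranking noun chunks by cosine similarity with the image embedding.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

TEMPLATES = ["a photo of a {}", "a fashion studio shot of a {}", "a cropped photo of a {}"]

@torch.no_grad()
def rank_noun_chunks(image_path, noun_chunks, top_k=5):
    """Return the top-k noun chunks by image-text cosine similarity."""
    image_inputs = processor(images=Image.open(image_path), return_tensors="pt")
    img = model.get_image_features(**image_inputs)
    img = img / img.norm(dim=-1, keepdim=True)

    scores = []
    for chunk in noun_chunks:
        prompts = [t.format(chunk) for t in TEMPLATES]          # prompt ensembling
        text_inputs = processor(text=prompts, return_tensors="pt", padding=True)
        txt = model.get_text_features(**text_inputs)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        scores.append((img @ txt.mean(dim=0, keepdim=True).T).item())

    order = sorted(range(len(noun_chunks)), key=lambda i: scores[i], reverse=True)
    return [noun_chunks[i] for i in order[:top_k]]
```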
Fine-Grained Textual Annotation. To ensure the accuracy and representativeness of our annotations, we manually annotate a significant portion of Dress Code images. In particular, we select the three most representative noun chunks, among the 25 automatically associated ones, for each garment image. To minimize the annotation effort, we develop a custom annotation tool that constrains the annotation time to an average of 60 seconds per item and allows the annotator to manually insert noun chunks in case none of the automatically extracted ones are suitable for the image. Overall, we manually annotate 26,400 different garments (8,800 for each category) out of the 53,792 products included in the dataset, ensuring to include all fashion items of the original test set [30].

Coarse-Grained Textual Annotation. To complete the annotation, we first finetune the OpenCLIP ViT-B/32 model, pre-trained on the English portion of the LAION-5B dataset [42], using the newly annotated image-text pairs.
We then use this model and the collected set of noun chunks to automatically tag all the remaining elements of the Dress Code dataset with the three most similar noun chunks, always determined via cosine similarity between multimodal embeddings. We employ the same strategy to automatically annotate all garment images of the VITON-HD dataset. In this case, since this dataset only contains upper-body clothes, we limit the set of candidate noun chunks to the ones describing upper-body garments.

Extracting Sketches. The introduction of garment sketches can provide valuable design details that are not easily discernible from text alone. In this way, the dataset can provide a more accurate and comprehensive representation of the garments, leading to improved quality and better control of the generated design details. To extract sketches for both the Dress Code and VITON-HD datasets, we employ PiDiNet [46], a pre-trained edge detection network.

Given that the selected datasets have originally been introduced for virtual try-on, they consist of both paired and unpaired test sets. While for the paired set we can directly use the human parsing mask to extract the garment of interest worn by the model and then feed it to the edge detection network, for the unpaired set we first need to create a warped version of the in-shop garment matching the body pose and shape of the target model. Following virtual try-on methods [50, 55], we train a geometric transformation module that performs a thin-plate spline transformation [38] of the input garment and then refines the warped result using a U-Net model [40]. From each warped garment, we extract the sketch image, enabling the use of the proposed solution even in unpaired settings.
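For the paired setting, the sketch-extraction step can be approximated as below: the worn garment is isolated with the human parsing mask, pasted on a white background, and passed through an edge detector. The paper uses PiDiNet [46]; OpenCV's Canny detector is used here only as a readily available stand-in.

```python
import cv2
import numpy as np

def extract_garment_sketch(image_path, parsing_mask_path, low=50, high=150):
    """Isolate the worn garment with the parsing mask and extract its edges."""
    image = cv2.imread(image_path)                                  # H x W x 3 (BGR)
    mask = cv2.imread(parsing_mask_path, cv2.IMREAD_GRAYSCALE) > 0  # garment region

    garment = np.full_like(image, 255)                              # white background
    garment[mask] = image[mask]

    gray = cv2.cvtColor(garment, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, low, high)                              # binary edge map
    return 255 - edges                                              # dark strokes on white
```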
4.2. Comparison with Other Datasets

The only two text-to-image generation datasets available in the fashion domain [19, 59] are both based on images from the DeepFashion dataset [24]. While the dataset introduced in [59] contains short textual descriptions, DeepFashion-Multimodal [19] is annotated with attributes (e.g. category, color, fabric, etc.) that can be composed into longer captions. In Table 1, we summarize the main statistics of the textual annotations of the publicly available datasets compared with those of our newly extended datasets. As can be seen, our datasets contain more variety in terms of textual items and words, confirming the appropriateness of our annotation procedure and enabling a more personalized control of the generation process. Also, it is worth noting that the other datasets have no in-shop garment images, making them difficult to employ in our case.

5. Experimental Evaluation

5.1. Implementation Details and Competitors

Training and Inference. All models are trained on the original splits of the Dress Code Multimodal and VITON-HD Multimodal datasets on a single NVIDIA A100 GPU for 150k steps, using a batch size of 16, a learning rate of 10^−5 with a linear warmup for the first 500 iterations, and AdamW [25] as optimizer with weight decay 10^−2. To speed up training and save memory, we use mixed precision [28]. We set both the fraction of steps conditioned by the sketch and the portion of masked conditions during training to 0.2. During inference, we employ DDIM [45] with 50 steps as noise scheduler and set the classifier-free guidance parameter α to 7.5.

Baselines and Competitors. As first competitor, we use the out-of-the-box implementation of the inpainting Stable Diffusion pipeline³ provided by Hugging Face. Moreover, we adapt two existing models, namely FICE [35] and SDEdit [27], to work in our setting. In particular, we retrain all main components of the FICE model on the newly collected datasets. We employ the same resolution used by the authors (i.e. 256 × 256), downsampling each image to 256 × 192 and applying padding to match the desired size (which is then removed during evaluation). To compare our model with a different conditioning strategy, we employ the approach proposed in [27], using our model trained only with text and human poses as input modalities and performing the sketch guidance by using the sketch image with added random noise as the starting latent variable. Following the original paper's instructions, we use 0.8 as the strength parameter.

³ https://huggingface.co/runwayml/stable-diffusion-inpainting

5.2. Evaluation Metrics

To assess the realism of generated images, we employ the Fréchet Inception Distance (FID) [16] and the Kernel Inception Distance (KID) [4]. For both metrics, we adopt the implementation proposed in [34]. Instead, to evaluate the adherence of the image to the textual conditioning input, we employ the CLIP Score (CLIP-S) [15] provided in the TorchMetrics library [9], using the OpenCLIP ViT-H/14 model as cross-modal architecture. We compute the score on the inpainted region of the generated output pasted on a 224 × 224 white background.

Pose Distance (PD). We propose a novel pose distance metric that measures the coherence of human body poses between the generated image and the original one by estimating the distance between the human keypoints extracted from the two images. Specifically, we employ the OpenPifPaf [22] human pose estimation network and compute the ℓ2 distance between each pair of corresponding keypoints estimated on the real and generated images. We only consider the keypoints involved in the generation (i.e. those that fall within the mask M) and weigh each keypoint distance with the detector confidence to take into account possible estimation errors.

Sketch Distance (SD). To quantify the adherence of the generated image to the sketch constraint, we propose a novel sketch distance metric that compares the sketches extracted from the original and generated garments; its formal definition is provided in the supplementary material.
Modalities Dress Code Multimodal VITON-HD Multimodal
Model Resolution Text Pose Sketch FID ↓ KID ↓ CLIP-S ↑ PD ↓ SD ↓ FID ↓ KID ↓ CLIP-S ↑ PD ↓ SD ↓
Paired setting
Stable Diffusion [39] 256×192 ✓ 17.05 9.28 28.71 4.62 - 15.18 6.38 30.40 5.04 -
FICE [35] 256×192 ✓ ✓ 30.63 23.54 28.72 6.87 - 49.44 44.74 29.26 6.37 -
MGD (ours) 256×192 ✓ ✓ 5.57 1.67 31.33 2.37 - 10.11 3.14 31.85 2.90 -
Paired setting
Stable Diffusion [39] 512×384 ✓ 17.43 9.48 29.18 9.24 0.467 16.28 6.56 30.70 10.78 0.410
SDEdit [27] 512×384 ✓ ✓ ✓ 10.19 5.03 29.21 5.41 0.398 13.07 4.66 30.58 6.76 0.306
MGD (ours) 512×384 ✓ ✓ ✓ 5.74 2.11 31.68 4.72 0.374 10.60 3.26 32.39 5.94 0.253
Unpaired setting
Stable Diffusion [39] 256×192 ✓ 19.11 10.69 27.53 5.07 - 17.37 7.55 28.40 5.50 -
FICE [35] 256×192 ✓ ✓ 34.14 26.86 26.03 7.15 - 52.74 48.58 25.94 6.58 -
MGD (ours) 256×192 ✓ ✓ 7.01 2.19 29.58 2.96 - 11.54 3.18 29.95 3.30 -
Unpaired setting
Stable Diffusion [39] 512×384 ✓ 19.55 10.80 28.02 9.89 0.582 18.45 7.87 28.74 11.60 0.561
SDEdit [27] 512×384 ✓ ✓ ✓ 11.38 5.69 27.10 6.16 0.509 15.12 5.67 28.61 7.35 0.406
MGD (ours) 512×384 ✓ ✓ ✓ 7.73 2.82 30.04 6.79 0.458 12.81 3.86 30.75 7.22 0.317
Table 2: Quantitative results on the Dress Code Multimodal and VITON-HD Multimodal datasets for both paired and unpaired settings.
Figure 4: Sample generated images on Dress Code Multimodal and VITON-HD Multimodal (bottom left) using all multimodal inputs.
Stable Diffusion performs worse in terms of the pose distance than both SDEdit and MGD, owing to the lack of pose information in its inputs. It is noteworthy that SDEdit performs worse than our model in all metrics. We attribute this behavior to the way sketch conditioning happens: in SDEdit, it occurs only at the beginning, by initializing z_t with the sketch image plus noise added according to the conditioning strength, while our model conditions the denoising process in multiple steps, depending on the sketch conditioning parameter. Qualitative results reported in Fig. 4 highlight how our model better follows the given conditions and generates highly realistic images.

Uncond. Portion Sketch Cond. FID ↓ KID ↓ CLIP-S ↑ PD ↓ SD ↓
0.1 1.0 9.64 3.76 30.24 7.66 0.459
0.2 1.0 8.62 3.24 29.06 7.51 0.430
0.3 1.0 10.93 4.78 28.47 7.69 0.432
0.2 0.8 8.56 3.28 29.31 7.32 0.433
0.2 0.6 8.43 3.21 29.51 7.32 0.436
0.2 0.4 8.11 3.00 29.79 7.13 0.440
0.2 0.2 7.73 2.82 30.04 6.79 0.458
0.2 0.0 7.82 2.85 29.93 6.26 0.519

Table 5: Ablation analysis of our complete model, varying the unconditional training portion and the fraction of sketch conditioning steps. Results refer to the unpaired setting on Dress Code Multimodal.
To validate our results based on human judgment, we conduct a user study that evaluates both the realism of the generation and the adherence to the multimodal inputs. Overall, we collect about 7k evaluations involving more than 150 users. Additional details are reported in the supplementary. Table 4 shows the user study results. Also in this case, our model outperforms the competitors, thus confirming the effectiveness of our proposal.

Varying Input Modalities. In Table 3, we study the behavior of our MGD model when the input modalities are masked (i.e. when we feed the model with a zero tensor instead of the considered modality). In particular, we focus on the CLIP-S for text adherence and on the newly proposed pose and sketch distances for pose and sketch coherency, respectively. Notice that the text input anchors the CLIP-S metrics of all experiments and makes them comparable in all cases. Starting from the fully conditioned model (i.e. text, pose, sketch), we mask the sketch. As the decrease of the sketch distance in Table 3 confirms, this input actually influences the generation process of our model on both the considered datasets. Also, this modality slightly affects the pose distance, as the sketch implicitly contains information about the model's body pose. We further mask the pose map input and compare the output with the previous results. In this case, we can also notice a consistent difference with respect to the text-only conditioned model, according to all metrics except CLIP-S, as expected. These results confirm that our MGD model can effectively deal with the conditions in a disentangled way, making them optional.
Unconditional Training and Sketch Conditioning. In Table 5, we analyze the performance of the fully conditioned network when varying the portion of unconditional training. Additionally, we evaluate the results obtained by varying the fraction of sketch conditioning steps. As can be seen, the best results are achieved by using 0.2 for both parameters. In particular, for unconditional training, we train three different models (i.e. with 0.1, 0.2, and 0.3). When evaluating the sketch conditioning parameter, we test our model with values between 0 and 1 with a stride of 0.2. It is worth noting that the sketch distance consistently decreases as the number of sketch conditioning steps increases, showing the robustness of the approach.

6. Conclusion

The Multimodal Garment Designer proposed in this paper is the first latent diffusion model defined for human-centric fashion image editing, conditioned by multimodal inputs such as text, body pose, and sketches. The novel architecture, trained on two new semi-automatically annotated datasets and evaluated with standard and newly proposed metrics, as well as through user studies, is very promising. The result is one of the first successful attempts to mimic the designers' job in the creative process of fashion design and could be a starting point for the widespread adoption of diffusion models in creative industries, under the oversight of human input.

Acknowledgments

This work has partially been supported by the European Commission under the PNRR-M4C2 project "FAIR - Future Artificial Intelligence Research" and the European Horizon 2020 Programme (grant number 101004545 - ReInHerit), and by the PRIN project "CREATIVE: CRoss-modal understanding and gEnerATIon of Visual and tExtual content" (CUP B87G22000460001), co-funded by the Italian Ministry of University.

References

[1] Omri Avrahami, Thomas Hayes, Oran Gafni, Sonal Gupta, Yaniv Taigman, Devi Parikh, Dani Lischinski, Ohad Fried, and Xi Yin. SpaText: Spatio-Textual Representation for Controllable Image Generation. In CVPR, 2023.
[2] Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, et al. eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers. arXiv preprint arXiv:2211.01324, 2022.
[3] Federico Bianchi, Jacopo Tagliabue, and Bingqing Yu. Query2Prod2Vec: Grounded Word Embeddings for eCommerce. In NAACL, 2021.
[4] Mikołaj Bińkowski, Dougal J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. In ICLR, 2018.
[5] Shin-I Cheng, Yu-Jie Chen, Wei-Chen Chiu, Hung-Yu Tseng, and Hsin-Ying Lee. Adaptively-Realistic Image Generation from Stroke and Sketch with Diffusion Model. In WACV, 2023.
[6] Jooyoung Choi, Sungwon Kim, Yonghyun Jeong, Youngjune Gwon, and Sungroh Yoon. ILVR: Conditioning Method for Denoising Diffusion Probabilistic Models. In ICCV, 2021.
[7] Seunghwan Choi, Sunghyun Park, Minsoo Lee, and Jaegul Choo. VITON-HD: High-Resolution Virtual Try-On via Misalignment-Aware Normalization. In CVPR, 2021.
[8] Guillem Cucurull, Perouz Taslakian, and David Vazquez. Context-Aware Visual Compatibility Prediction. In CVPR, 2019.
[9] Nicki Skafte Detlefsen, Jiri Borovec, Justus Schock, Ananya Harsh Jha, Teddy Koker, Luca Di Liello, Daniel Stancl, Changsheng Quan, Maxim Grechkin, and William Falcon. TorchMetrics - Measuring Reproducibility in PyTorch. Journal of Open Source Software, 7(70):4101, 2022.
[10] Prafulla Dhariwal and Alexander Nichol. Diffusion Models Beat GANs on Image Synthesis. In NeurIPS, 2021.
[11] M. Hadi Kiapour, Xufeng Han, Svetlana Lazebnik, Alexander C. Berg, and Tamara L. Berg. Where to Buy It: Matching Street Clothing Photos in Online Shops. In ICCV, 2015.
[12] Xintong Han, Zuxuan Wu, Phoenix X. Huang, Xiao Zhang, Menglong Zhu, Yuan Li, Yang Zhao, and Larry S. Davis. Automatic Spatially-Aware Fashion Concept Discovery. In ICCV, 2017.
[13] Xintong Han, Zuxuan Wu, Zhe Wu, Ruichi Yu, and Larry S. Davis. VITON: An Image-Based Virtual Try-On Network. In CVPR, 2018.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In ICCV, 2015.
[15] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A Reference-free Evaluation Metric for Image Captioning. In EMNLP, 2021.
[16] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Günter Klambauer, and Sepp Hochreiter. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In NeurIPS, 2017.
[17] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. In NeurIPS, 2020.
[18] Wei-Lin Hsiao and Kristen Grauman. Creating Capsule Wardrobes from Fashion Images. In CVPR, 2018.
[19] Yuming Jiang, Shuai Yang, Haonan Qiu, Wayne Wu, Chen Change Loy, and Ziwei Liu. Text2Human: Text-Driven Controllable Human Image Generation. ACM Transactions on Graphics, 41(4):1-11, 2022.
[20] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. In ECCV, 2016.
[21] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In ICLR, 2015.
[22] Sven Kreiss, Lorenzo Bertoni, and Alexandre Alahi. OpenPifPaf: Composite Fields for Semantic Keypoint Detection and Spatio-Temporal Association. IEEE Transactions on Intelligent Transportation Systems, 23(8):13498-13511, 2021.
[23] Wenbo Li, Zhe Lin, Kun Zhou, Lu Qi, Yi Wang, and Jiaya Jia. MAT: Mask-Aware Transformer for Large Hole Image Inpainting. In CVPR, 2022.
[24] Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations. In CVPR, 2016.
[25] Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In ICLR, 2019.
[26] Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. RePaint: Inpainting Using Denoising Diffusion Probabilistic Models. In CVPR, 2022.
[27] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations. In ICLR, 2022.
[28] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed Precision Training. In ICLR, 2018.
[29] Davide Morelli, Alberto Baldrati, Giuseppe Cartella, Marcella Cornia, Marco Bertini, and Rita Cucchiara. LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On. In ACM Multimedia, 2023.
[30] Davide Morelli, Matteo Fincato, Marcella Cornia, Federico Landi, Fabio Cesari, and Rita Cucchiara. Dress Code: High-Resolution Multi-Category Virtual Try-On. In ECCV, 2022.
[31] Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models. arXiv preprint arXiv:2302.08453, 2023.
[32] Alexander Quinn Nichol and Prafulla Dhariwal. Improved Denoising Diffusion Probabilistic Models. In ICML, 2021.
[33] Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. In ICML, 2022.
[34] Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On Aliased Resizing and Surprising Subtleties in GAN Evaluation. In CVPR, 2022.
[35] Martin Pernuš, Clinton Fookes, Vitomir Štruc, and Simon Dobrišek. FICE: Text-Conditioned Fashion Image Editing With Guided GAN Inversion. arXiv preprint arXiv:2301.02110, 2023.
[36] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. In ICML, 2021.
[37] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv preprint arXiv:2204.06125, 2022.
[38] Ignacio Rocco, Relja Arandjelovic, and Josef Sivic. Convolutional Neural Network Architecture for Geometric Matching. In CVPR, 2017.
[39] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis With Latent Diffusion Models. In CVPR, 2022.
[40] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. In MICCAI, 2015.
[41] Rohan Sarkar, Navaneeth Bodla, Mariya I. Vasileva, Yen-Liang Lin, Anurag Beniwal, Alan Lu, and Gerard Medioni. OutfitTransformer: Learning Outfit Representations for Fashion Recommendation. In WACV, 2023.
[42] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W. Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa R. Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5B: An Open Large-Scale Dataset for Training Next Generation Image-Text Models. In NeurIPS, 2022.
[43] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR, 2015.
[44] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep Unsupervised Learning Using Nonequilibrium Thermodynamics. In ICML, 2015.
[45] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising Diffusion Implicit Models. In ICLR, 2021.
[46] Zhuo Su, Wenzhe Liu, Zitong Yu, Dewen Hu, Qing Liao, Qi Tian, Matti Pietikäinen, and Li Liu. Pixel Difference Networks for Efficient Edge Detection. In ICCV, 2021.
[47] Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lempitsky. Resolution-Robust Large Mask Inpainting with Fourier Convolutions. In WACV, 2022.
[48] Ming Tao, Hao Tang, Fei Wu, Xiao-Yuan Jing, Bing-Kun Bao, and Changsheng Xu. DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis. In CVPR, 2022.
[49] Andrey Voynov, Kfir Aberman, and Daniel Cohen-Or. Sketch-Guided Text-to-Image Diffusion Models. arXiv preprint arXiv:2211.13752, 2022.
[50] Bochao Wang, Huabin Zheng, Xiaodan Liang, Yimin Chen, Liang Lin, and Meng Yang. Toward Characteristic-Preserving Image-Based Virtual Try-On Network. In ECCV, 2018.
[51] Tengfei Wang, Ting Zhang, Bo Zhang, Hao Ouyang, Dong Chen, Qifeng Chen, and Fang Wen. Pretraining is All You Need for Image-to-Image Translation. arXiv preprint arXiv:2205.12952, 2022.
[52] Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, et al. Robust Fine-Tuning of Zero-Shot Models. In CVPR, 2022.
[53] Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, and Rogerio Feris. Fashion IQ: A New Dataset Towards Retrieving Images by Natural Language Feedback. In CVPR, 2021.
[54] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks. In CVPR, 2018.
[55] Han Yang, Ruimao Zhang, Xiaobao Guo, Wei Liu, Wangmeng Zuo, and Ping Luo. Towards Photo-Realistic Virtual Try-On by Adaptively Generating-Preserving Image Content. In CVPR, 2020.
[56] Han Zhang, Jing Yu Koh, Jason Baldridge, Honglak Lee, and Yinfei Yang. Cross-Modal Contrastive Learning for Text-to-Image Generation. In CVPR, 2021.
[57] Lvmin Zhang and Maneesh Agrawala. Adding Conditional Control to Text-to-Image Diffusion Models. arXiv preprint arXiv:2302.05543, 2023.
[58] Minfeng Zhu, Pingbo Pan, Wei Chen, and Yi Yang. DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-to-Image Synthesis. In CVPR, 2019.
[59] Shizhan Zhu, Raquel Urtasun, Sanja Fidler, Dahua Lin, and Chen Change Loy. Be Your Own Prada: Fashion Synthesis with Structural Coherence. In ICCV, 2017.

A. Dress Code Multimodal and VITON-HD Multimodal Datasets

In this section, we give additional details about the dataset collection and annotation process and provide statistics and further examples of the collected datasets.

A.1. Data Preparation

Before extracting noun chunks from the textual sentences of FashionIQ [53] and Fashion200k [12], we perform word lemmatization to reduce each word to its root form. Such a pre-processing stage is crucial for the FashionIQ dataset, as its captions do not describe a single garment but instead express the properties to modify in a given image to match its target. Fig. 5 shows two examples of FashionIQ annotations.

Figure 5: Examples of FashionIQ data type.

We use the spaCy NLP toolkit⁵ to extract noun chunks from textual sentences. To facilitate prompt engineering at a later stage, we remove the articles at the beginning of each noun chunk. Subsequently, we filter out all noun chunks starting with or containing special characters and keep unique elements. Table 6 reports detailed statistics about the number of unique captions and extracted noun chunks from which we start the annotation.

Unique Captions / Unique Noun Chunks
Dataset Upper Lower Dresses Upper Lower Dresses
FashionIQ [53] 27,339 0 15,101 7,801 0 3,592
Fashion200k [12] 25,959 11,022 16,694 22,898 13,420 15,890

Table 6: Number of unique captions and noun chunks for each category of the FashionIQ and Fashion200k datasets.

Textual Prompts. As described in the main paper, we rely on the cosine similarity between CLIP-based image and text embeddings to associate each garment with the 25 most representative noun chunks. We exploit prompt ensembling to perform such a zero-shot association, as it is shown in [36] that this technique improves performance. The employed textual prompts are:
• a photo of a [noun chunk],
• a photo of a nice [noun chunk],
• a photo of a cool [noun chunk],
• a photo of an expensive [noun chunk],
• a good photo of a [noun chunk],
• a bright photo of a [noun chunk],
• a fashion studio shot of a [noun chunk],
• a fashion magazine photo of a [noun chunk],
• a fashion brochure photo of a [noun chunk],
• a fashion catalog photo of a [noun chunk],
• a fashion press photo of a [noun chunk],
• a yoox photo of a [noun chunk],
• a yoox web image of a [noun chunk],
• a high-resolution photo of a [noun chunk],
• a cropped photo of a [noun chunk],
• a close-up photo of a [noun chunk],
• a photo of one [noun chunk].

⁵ https://spacy.io/

A.2. Annotation Tool for Fine-Grained Annotation

We develop a custom annotation tool using the Django and Angular web frameworks to ease and speed up the fine-grained annotation process. Fig. 6 depicts the user interface. In the annotation phase, users are provided with both the model's image and the corresponding in-shop garment and should select the three most representative noun chunks per item (Fig. 6a). If the automatic selection process fails to suggest three correct noun chunks, the user can manually insert them (Fig. 6b).

A.3. Coarse-Grained Annotation

After completing the manual annotation process on Dress Code, we obtain 26,400 different model-garment pairs (with 8,800 items per category), each associated with three different noun chunks. To annotate the remaining 27,392 items of Dress Code Multimodal and the 13,679 items of VITON-HD Multimodal, we leverage the manually annotated image-text pairs and finetune the OpenCLIP ViT-B/32 [52] model pre-trained on the English portion of the LAION-5B dataset.

CLIP Finetuning. We finetune both encoders of the OpenCLIP model using a single NVIDIA A100 GPU for 400 steps, with a batch size of 2048 and a learning rate of 10^−6.
Figure 6: User interface of the custom annotation tool. In (a) the user can select the noun chunks among the proposed ones, while in (b) the user can manually annotate the garment.
As optimizer, we use AdamW [25] with a weight decay of 0.2. We use mixed precision [28] to speed up training and save memory. During the training process, we monitor the model performance using the top-3 accuracy metric on the test split of the Dress Code Multimodal dataset. We choose this metric because we aim to associate each image with three distinct noun chunks. The out-of-the-box model achieves a top-3 accuracy of 12.95%, which improves to 16.60% after finetuning. The OpenCLIP ViT-g/14 model instead achieves a top-3 accuracy of 16.21%, while being computationally heavier than the ViT-B/32 version. Since the ViT-g/14 model predicts the set of noun chunks from which we extract the ground-truth, the actual difference in performance between the finetuned ViT-B/32 model and the out-of-the-box ViT-g/14 model could be even higher.

Images / Unique Noun Chunks
Dataset Ann. Split Upper Lower Dresses Upper Lower Dresses
Dress Code M. F Train 7,000 7,000 7,000 4,751 5,914 4,410
Dress Code M. F Test 1,800 1,800 1,800 2,337 2,861 2,144
Dress Code M. F ∪ 8,800 8,800 8,800 5,284 6,509 4,915
Dress Code M. F ∩ - - - 1,804 2,266 1,639
Dress Code M. C Train 6,563 151 20,666 7,198 320 8,650
Dress Code M. C Test 0 0 0 0 0 0
Dress Code M. C ∪ 6,563 151 20,666 7,198 320 8,650
Dress Code M. C ∩ - - - 0 0 0
Dress Code M. F+C Train 13,563 7,151 27,666 9,163 6,037 9,465
Dress Code M. F+C Test 1,800 1,800 1,800 2,337 2,861 2,144
Dress Code M. F+C ∪ 15,363 8,951 29,466 9,431 6,597 9,568
Dress Code M. F+C ∩ - - - 2,069 2,301 2,041
VITON-HD M. C Train 11,647 - - 4,823 - -
VITON-HD M. C Test 2,032 - - 2,149 - -
VITON-HD M. C ∪ 13,679 - - 5,143 - -
VITON-HD M. C ∩ - - - 1,829 - -

Table 7: Number of images and unique noun chunks per category for both Dress Code Multimodal and VITON-HD Multimodal. (F) indicates the fine-grained annotation, while (C) indicates the coarse-grained annotation.

A.4. Extracting Sketches

As mentioned in the main paper, we train a warping module to generate input sketches for the unpaired setting (i.e. when the multimodal information given as input corresponds to a garment different from the one originally worn by the model). In particular, our method involves the transformation of a given in-shop garment C ∈ R^{H×W×3} into a warped image of the same garment that fits the model of a target image I. We employ the warping module proposed in [50], refining the results with a U-Net based component [40].

The warping module computes a correlation map between the encoded representations of the in-shop garment C and a cloth-agnostic person representation composed of the pose map P ∈ R^{H×W×18} and the masked model image I_M ∈ R^{H×W×3}. We use two separate convolutional networks to obtain these encoded representations. Based on the computed correlation map, we predict the spatial transformation parameters θ of a thin-plate spline geometric transformation [38] (i.e. TPS_θ). We then use the θ parameters to compute the coarse warped garment Ĉ starting from the in-shop garment C as follows:

\hat{C} = \text{TPS}_{\theta}(C).   (5)

To refine the result, we employ a U-Net model that takes as input the concatenation of the coarse warped garment Ĉ, the pose map P, and the masked model image I_M, and predicts the refined warped garment C̃.

We train this model on the training sets of both Dress Code Multimodal and VITON-HD Multimodal using a combination of an L1 loss between generated and target in-shop garments and a perceptual loss (also known as VGG loss [20]) computed between the feature maps of generated and target garments extracted with a VGG-19 [43]. We train with a resolution of 256 × 192, Adam [21] as optimizer with β1 = 0.5, β2 = 0.99, and a learning rate equal to 10^−4. We train the network on the VITON-HD dataset for 30 epochs, while the training on the Dress Code dataset converges after 80 epochs.
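A hedged sketch of the refinement objective described above, combining an L1 term with a VGG-19 perceptual loss; the chosen feature layer and the loss weight are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class RefinementLoss(nn.Module):
    """L1 + VGG-19 perceptual loss between the refined warped garment and the target."""

    def __init__(self, vgg_weight=1.0):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
        self.vgg = vgg[:16]                       # features up to an intermediate conv block
        for p in self.vgg.parameters():
            p.requires_grad_(False)
        self.vgg_weight = vgg_weight

    def forward(self, refined, target):
        l1 = torch.abs(refined - target).mean()
        perceptual = torch.abs(self.vgg(refined) - self.vgg(target)).mean()
        return l1 + self.vgg_weight * perceptual
```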
Pose Distance (PD). Given a real image I and the corresponding generated image Ĩ, we run a human pose estimator K on both (i.e. in our case, we use OpenPifPaf [22]) and identify the set of keypoints falling in the mask M as K(·)_M. We compute the final score with an ℓ2 distance between each pair of corresponding real-generated keypoints (i.e. k ∈ K(I)_M and k̃ ∈ K(Ĩ)_M, respectively), weighting each keypoint distance with the detector confidence to consider possible estimation errors. Formally, our pose distance metric is defined as follows:

\text{PD}(I, \tilde{I}) = \frac{\sum_{k \in \mathcal{K}(I)_M,\, \tilde{k} \in \mathcal{K}(\tilde{I})_M} \sqrt{(k_x - \tilde{k}_x)^2 + (k_y - \tilde{k}_y)^2} \cdot \text{CF}_{k\tilde{k}}}{\sum_{k\tilde{k}} \text{CF}_{k\tilde{k}}},   (7)

where, for each pair of real-generated keypoints, CF_{kk̃} is 1 if the confidence of the detector K on both keypoints is greater than or equal to 0.5, and 0 otherwise.
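A direct implementation of Eq. (7), assuming keypoints are provided as (x, y, confidence) triplets returned by the same detector on the real and generated images:

```python
import numpy as np

def pose_distance(kpts_real, kpts_gen, mask, conf_thr=0.5):
    """Eq. (7): confidence-gated L2 distance between corresponding keypoints.

    `kpts_real` and `kpts_gen` are (K, 3) arrays of (x, y, confidence);
    `mask` is the H x W binary inpainting mask M.
    """
    num, den = 0.0, 0.0
    for (x, y, c), (xg, yg, cg) in zip(kpts_real, kpts_gen):
        # Only keypoints involved in the generation, i.e. falling inside M.
        if not mask[int(round(y)), int(round(x))]:
            continue
        cf = 1.0 if (c >= conf_thr and cg >= conf_thr) else 0.0
        num += np.sqrt((x - xg) ** 2 + (y - yg) ** 2) * cf
        den += cf
    return num / den if den > 0 else 0.0
```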
Sketch Distance (SD). To evaluate the adherence of the generated images to the constraints imposed by the input sketch, we propose a new sketch distance metric. To compute the metric, we first extract the label maps of the ground-truth and generated garments using an off-the-shelf semantic segmentation model⁶. We segment the garment according to its category and paste it on a white background of shape 512 × 384. We refer to these new images as I_S and Ĩ_S, respectively. Then, we extract the garment sketches of both the ground-truth and the generated images using an edge detector network Edge (i.e. PiDiNet [46]). Finally, we compute the mean squared error between the extracted sketches, weighting the per-pixel results by the inverse frequency of the activated pixels. Formally, the introduced sketch distance metric is defined as follows:

\text{SD}(I_S, \tilde{I}_S) = \text{MSE}\left(\text{Edge}(I_S), \text{Edge}(\tilde{I}_S)\right) \cdot p,   (8)

where p is the inverse pixel frequency. It is noteworthy that sketch thresholding could be applied before the distance computation. Nevertheless, we argue that avoiding thresholding enables an effective comparison with hand-drawn grayscale ground-truth sketches. This approach can facilitate the evaluation of methods that generate images conditioned on a sketch. Therefore, we think the proposed metric can be a valuable tool for comparing sketch-guided generative architectures.

⁶ https://github.com/levindabhi/cloth-segmentation
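Eq. (8) can be computed as below; `edge_detector` stands in for PiDiNet, and reading the inverse pixel frequency p as the ratio between the total number of pixels and the number of activated sketch pixels is an assumption of this sketch.

```python
import torch

def sketch_distance(garment_real, garment_gen, edge_detector):
    """Eq. (8): MSE between extracted sketches, weighted by the inverse
    frequency of activated (non-zero) sketch pixels."""
    with torch.no_grad():
        sketch_real = edge_detector(garment_real)   # (B, 1, H, W), values in [0, 1]
        sketch_gen = edge_detector(garment_gen)

    mse = torch.mean((sketch_real - sketch_gen) ** 2)

    # Inverse pixel frequency p: total pixels over activated pixels in the target sketch.
    activated = (sketch_real > 0).float().sum().clamp(min=1.0)
    p = sketch_real.numel() / activated
    return (mse * p).item()
```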
Figure 9: User study interface, where (a) corresponds to the realism evaluation and (b) refers to the coherence analysis between generated images and the given multimodal inputs.

C. User Study

As mentioned in the main paper, we conduct a user study to evaluate the realism of generated images and their adherence to the given multimodal inputs, comparing our results with those of the considered competitors. To this aim, we develop a custom web interface presenting two different surveys. The former (Fig. 9a) assesses the realism of the generated output, asking the user to select, for each comparison, the image that seems more realistic. In the latter (Fig. 9b), given the model's image, the set of noun chunks describing the garment, and the sketch, the user is asked to select which of the two proposed outputs looks more coherent with the multimodal inputs, also taking into account the model's body pose. Overall, we collect around 7k evaluations, 3.5k for each test, involving more than 150 users.

D. Additional Results

In this section, we provide additional experimental results to understand the strengths and limitations of our approach. Table 8 extends Table 2 of the main paper, showing quantitative results on each garment category of Dress Code Multimodal. Since each category contains only 1,800 images, the FID score presents a high variance in the reported results.
Modalities Upper-body Lower-body Dresses
Model Resolution Text Keypoints Sketch FID ↓ KID ↓ CLIP-S ↑ PD ↓ SD ↓ FID ↓ KID ↓ CLIP-S ↑ PD ↓ SD ↓ FID ↓ KID ↓ CLIP-S ↑ PD ↓ SD ↓
Paired setting
Stable Diff. [39] 256×192 ✓ 22.86 9.73 28.31 4.29 - 28.78 13.93 26.41 4.97 - 36.31 20.74 27.84 5.67 -
FICE [35] 256×192 ✓ ✓ 46.41 32.26 28.58 7.46 - 41.68 27.22 28.14 7.54 - 34.06 20.58 29.47 6.06 -
MGD (ours) 256×192 ✓ ✓ 11.88 2.82 31.48 1.91 - 10.24 1.55 30.50 2.58 - 11.87 2.03 32.05 2.57 -
Paired setting
Stable Diff. [39] 512×384 ✓ 21.00 8.59 30.17 7.95 0.310 28.40 14.48 28.02 9.96 0.345 33.12 17.39 29.36 9.86 0.450
SDEdit [27] 512×384 ✓ ✓ ✓ 15.78 5.52 29.73 4.21 0.222 16.64 6.07 29.00 6.51 0.256 21.53 9.02 28.89 5.67 0.270
MGD (ours) 512×384 ✓ ✓ ✓ 12.42 3.71 31.90 3.72 0.190 10.70 2.01 31.10 5.70 0.210 11.38 1.89 32.02 4.93 0.194
Unpaired setting
Stable Diff. [39] 256×192 ✓ 22.86 9.73 28.31 4.29 - 28.78 13.93 26.41 4.97 - 36.31 20.74 27.84 5.67 -
FICE [35] 256×192 ✓ ✓ 49.77 35.37 26.48 7.64 - 44.94 30.39 25.42 7.84 - 39.04 25.27 26.14 6.39 -
MGD (ours) 256×192 ✓ ✓ 14.50 3.48 29.24 2.39 - 13.70 2.48 29.09 3.32 - 13.72 2.50 30.37 3.17 -
Unpaired setting
Stable Diff. [39] 512×384 ✓ 24.23 10.39 28.64 8.59 0.413 30.90 15.38 27.03 10.43 0.453 35.96 19.94 28.37 10.60 0.609
SDEdit [27] 512×384 ✓ ✓ ✓ 17.86 6.50 27.36 4.78 0.357 19.16 6.85 27.08 7.53 0.399 22.97 9.98 26.85 6.42 0.411
MGD (ours) 512×384 ✓ ✓ ✓ 15.99 4.50 29.76 5.41 0.291 14.82 2.81 29.96 7.96 0.289 14.71 3.63 30.41 7.15 0.252
Table 8: Quantitative results on the upper-body, lower-body, and dress categories of Dress Code Multimodal for both paired and unpaired settings.
Modalities Dress Code Multimodal VITON-HD Multimodal
Model Resolution Text Pose Sketch FID ↓ KID ↓ CLIP-S ↑ PD ↓ SD ↓ FID ↓ KID ↓ CLIP-S ↑ PD ↓ SD ↓
Paired setting
ControlNet [57] 512×384 ✓ ✓ 18.36 9.82 29.00 7.46 0.462 19.08 9.35 30.03 7.72 0.392
MGD (ours) 512×384 ✓ ✓ 6.31 2.33 31.67 5.31 0.405 11.07 3.36 32.27 6.77 0.318
ControlNet [57] 512×384 ✓ ✓ 27.23 19.01 27.07 7.54 0.436 25.44 17.05 28.31 8.16 0.298
MGD (ours) 512×384 ✓ ✓ 5.72 2.15 31.69 4.94 0.373 10.64 3.26 32.31 6.18 0.255
Unpaired setting
ControlNet [57] 512×384 ✓ ✓ 20.66 11.58 27.57 8.15 0.577 21.03 10.34 28.11 8.38 0.534
MGD (ours) 512×384 ✓ ✓ 7.82 2.85 29.93 6.26 0.519 12.40 3.36 30.34 7.53 0.435
ControlNet [57] 512×384 ✓ ✓ 29.61 20.83 25.75 9.74 0.544 27.41 18.66 26.63 9.53 0.416
MGD (ours) 512×384 ✓ ✓ 7.65 2.70 30.21 7.50 0.456 12.65 3.59 30.69 7.49 0.320
Table 12: Performance comparison with ControlNet on the Dress Code Multimodal and VITON-HD Multimodal datasets for
both paired and unpaired settings.
Instead, the third example of the first row and the first two samples of the second row highlight the dependence of our model's performance on the given sketch. When the geometric warping module fails to generate a sketch able to fit
Figure 11: Sample images and multimodal data from our newly collected Dress Code Multimodal dataset (fine-grained textual annotations).
Figure 12: Sample images and multimodal data from our newly collected Dress Code Multimodal dataset (coarse-grained textual annotations).
Figure 13: Sample images and multimodal data from our newly collected VITON-HD Multimodal dataset (coarse-grained textual annotations).
Figure 14: Qualitative comparison on Dress Code Multimodal. From left to right: model's image, input sketch, pose map, image generated by Stable Diffusion [39], image generated by SDEdit [27], image generated by MGD (ours), and noun chunks.
Figure 15: Qualitative comparison on VITON-HD Multimodal. From left to right: model's image, input sketch, pose map, image generated by Stable Diffusion [39], image generated by SDEdit [27], image generated by MGD (ours), and noun chunks.
Figure 16: Qualitative comparison with low-resolution images on Dress Code Multimodal. From left to right: model's image, input sketch, pose map, image generated by Stable Diffusion [39], image generated by FICE [35], image generated by MGD (ours), and noun chunks.
Figure 17: Qualitative comparison with low-resolution images on VITON-HD Multimodal. From left to right: model's image, input sketch, pose map, image generated by Stable Diffusion [39], image generated by FICE [35], image generated by MGD (ours), and noun chunks.
Figure 18: Qualitative comparison of images generated by our model on Dress Code Multimodal using different conditioning modalities. From left to right: model's image, input sketch, pose map, image generated using only text, image generated using text and pose map, image generated with all input modalities (i.e. text, pose map, and sketch).
Figure 19: Qualitative results generated by MGD increasing the sketch conditioning steps.

Figure 20: Failure cases on Dress Code Multimodal (first row) and VITON-HD Multimodal (second row).