
TryOffDiff: Virtual-Try-Off via High-Fidelity Garment Reconstruction

using Diffusion Models

Riza Velioglu , Petra Bevandic, Robin Chan, Barbara Hammer


Machine Learning Group, CITEC, Bielefeld University, Germany
{rvelioglu, pbevandic, rchan, bhammer}@techfak.de
arXiv:2411.18350v1 [cs.CV] 27 Nov 2024

Figure 1. Virtual try-off results generated by our method. The first row shows the input reference image, the second row our model’s
prediction, and the third row the ground truth. Our approach naturally renders the garment against a clean background, preserving the
standard pose and capturing complex details of the target garment, such as patterns and logos, from a single reference image.

Abstract

This paper introduces Virtual Try-Off (VTOFF), a novel task focused on generating standardized garment images from single photos of clothed individuals. Unlike traditional Virtual Try-On (VTON), which digitally dresses models, VTOFF aims to extract a canonical garment image, posing unique challenges in capturing garment shape, texture, and intricate patterns. This well-defined target makes VTOFF particularly effective for evaluating reconstruction fidelity in generative models. We present TryOffDiff, a model that adapts Stable Diffusion with SigLIP-based visual conditioning to ensure high fidelity and detail retention. Experiments on a modified VITON-HD dataset show that our approach outperforms baseline methods based on pose transfer and virtual try-on with fewer pre- and post-processing steps. Our analysis reveals that traditional image generation metrics inadequately assess reconstruction quality, prompting us to rely on DISTS for more accurate evaluation. Our results highlight the potential of VTOFF to enhance product imagery in e-commerce applications, advance generative model evaluation, and inspire future work on high-fidelity reconstruction. Demo, code, and models are available at: https://rizavelioglu.github.io/tryoffdiff/

1. Introduction

Image-based virtual try-on (VTON) [23] is a key computer vision task aimed at generating images of a person wearing a specified garment. Typically, two input images are required: one showing the garment in a standardized form (often from an e-commerce catalog) and another of the person that needs to be 'dressed'. Recent methods focus on a modified formulation where the catalog image is replaced with a photo of another person wearing the target garment. This introduces additional processing complexity [55] as the model does not have access to full garment information.

From an application perspective, VTON offers an interactive shopping experience that helps users make better-informed purchasing decisions. On the research side, it raises intriguing research questions, particularly around human pose detection as well as clothing shape, pattern, and texture analysis [17]. Best-performing models are usually guided generative models focused on creating specific, physically accurate outputs. Unlike general generative tasks that produce diverse outputs, reconstruction requires models to generate images that align with the correct appearance of the garment on a person.

However, one drawback of VTON is the lack of a clearly defined target output, often resulting in stylistic variations
that complicate evaluation. Generated images may show
garments tucked, untucked, or altered in fit, introducing
plausible yet inconsistent visual variations and making it
difficult to assess the true quality of garment representa-
tion [47]. This is why current evaluation methods generally
rely on a broad assessment of generative quality [20], with-
out considering the similarity between individual garment-
person ground truth pairs. Common image quality met-
rics often exhibit sensitivity to differences in non-salient re-
gions, such as the background, which complicates pinpoint-
ing the precise sources of performance variability [11, 45].
We therefore introduce Virtual Try-Off (VTOFF), a novel task focused on generating standardized product images from real-world photos of clothed individuals, as illustrated in Figure 1 and Figure 2. Even though the goal is reversed when compared to VTON, the two tasks address similar challenges such as pose analysis, geometric and appearance transformations, potential occlusions, and preservation of fine-grained details such as textures, patterns, and logos. Additionally, the acquisition diversity of real-world photos — varying in background, lighting, and camera quality — introduces unique challenges in domain adaptation and robust feature extraction. Still, this switch in the target presents a crucial advantage of VTOFF over VTON: the reduced stylistic variability on the output side simplifies the assessment of reconstruction quality.

Figure 2. Illustration of the differences between Virtual Try-On and Virtual Try-Off. Top: Basic inference pipeline of a Virtual Try-On model, which takes an image of a clothed person as reference and an image of a garment to generate an image of the same person but wearing the specified garment. Bottom: Virtual Try-Off setup, where the objective is to predict the canonical form of the garment from a single input reference image.

The potential impact of VTOFF extends well beyond research. It could enhance the flexibility of various e-commerce applications that rely on consistent product images. For instance, generated images can be integrated seamlessly into existing virtual try-on solutions, enabling the more complex person-to-person try-on by substituting the ground truth with the generated garment image. Recommendation and other customer-to-product retrieval systems [14] could also benefit from access to standardized garment representations. Moreover, it could support the creation of large-scale, high-quality fashion datasets, thereby accelerating the development of fashion-oriented AI. From an environmental standpoint, these applications should help customers with purchasing decisions, thus reducing product returns and the environmental footprint of the fashion industry. Finally, generating standardized garment images from everyday photos is an interesting task in itself, as it could simplify the maintenance of e-commerce catalogs by reducing the need for expensive photography equipment and time-consuming editing, benefiting smaller vendors who lack the resources for professional-quality product photography.

Our work highlights that reconstructing e-commerce images is a challenging task that requires significant modifications to existing VTON models. Moreover, we show that traditional image generation metrics fall short in capturing reconstruction quality. Our primary contributions are:
• We introduce VTOFF, a novel task to generate standardized product images from real-world photos of clothed individuals, unlocking promising real-world applications while raising important new research questions.
• We present TryOffDiff, a novel framework that adapts pretrained diffusion models for VTOFF by aligning image features with text-based diffusion priors, ensuring high visual fidelity and consistent product details.
• Extensive experiments on the VITON-HD dataset demonstrate that TryOffDiff generates high-quality, detail-rich product images of garments, outperforming state-of-the-art view synthesis and virtual try-on methods.

2. Related Work

Virtual Try-Off seeks to reconstruct a canonical image of clothing, typically resembling garments worn by a person in a neutral pose. While virtual try-on and pose-transfer methods could be adapted to produce these standardized outputs, our experiments indicate that such adaptations underperform. Instead, we base our solution on conditional diffusion models, which have demonstrated robust performance across diverse generative tasks.

Image-based Virtual Try-On. The objective of image-based virtual try-on is to produce composite images that realistically depict a specific garment on a target person, preserving the person's identity, pose, and body shape, while capturing fine garment details. CAGAN [23] introduced this task with a cycle-GAN approach, while VITON [17] formalized it as a two-step, supervised framework: warping the garment via a geometric transformation [3], followed by blending it onto the person. CP-VTON [52] refined this process by implementing a learnable thin-plate spline (TPS) transformation using a geometric matcher, later improved with dense flow [18] and appearance flow [15] to enhance the pixel-level alignment of garment details. Despite progress in warping-based approaches, limitations remain, especially with complex garment textures, folds, and logos.

To address these drawbacks, recent works adopted GAN-based and diffusion-based methods. FW-GAN [12] synthesized try-on videos, while PASTA-GAN [55] modified StyleGAN2 for person-to-person try-on. However, GANs suffer from issues like unstable training and mode collapse, leading VTON research to favor diffusion models, which have proven to be more reliable. M&M-VTO [63] introduced a single-stage diffusion model capable of synthesizing multi-garment try-on results from an input person image and multiple garment images. IDM-VTON [8] proposed two modules to encode the semantics of the garment image, extracting high- and low-level features with cross-attention and self-attention layers. OOTDiffusion [57] leveraged pretrained latent diffusion models to learn garment features, which are incorporated into a denoising UNet using outfitting fusion. In a more lightweight approach, CatVTON [9] eliminated the need for heavy feature extraction, proposing a compact model based on a pretrained latent diffusion model that achieved promising results with fewer parameters. Modifying existing VTON models for VTOFF is not necessarily straightforward, as VTON models often depend on additional inputs like text prompts, keypoints, or segmentation masks, which must be carefully selected and manually tailored for effective adaptation.

It is important to note that, while both VTON and VTOFF tasks involve garment manipulation, they are fundamentally different. VTON models have access to complete garment details, allowing them to primarily focus on warping the item to fit a target pose. In contrast, VTOFF models must work with only partial garment information from a reference image, where occlusions and deformations are common, requiring them to reconstruct missing details from limited visual cues.

Image-based View Synthesis & Pose Transfer. Novel View Synthesis (NVS) aims to generate realistic images from unseen viewpoints. While early methods required hundreds of training images per instance [26, 43, 46, 61, 62], recent approaches enable synthesis from sparse views [22, 48]. However, NVS alone cannot fully address garment reconstruction, as the pose of the observed person cannot be changed. Pose transfer, a related task, can be seen as a type of view synthesis that also allows for object deformation. It requires additional capabilities for inferring potentially occluded body parts.

DiOr [10] proposed a generation framework for pose transfer, using a recurrent architecture that sequentially dresses a person in garments to create different looks from the same input. [36] introduced a GAN-based pose transfer model that uses a multi-scale attention-guided approach, significantly improving on existing methods and showing potential for VTON applications. DreamPose [24] synthesizes try-on videos from an image and a sequence of human body poses using a pretrained latent diffusion model. PoCoLD [19] trained a latent diffusion model conditioned on dense pose maps for person image synthesis. ViscoNet [7] integrates adapter layers into a pretrained latent diffusion model and extends ControlNet to incorporate multiple image conditions, enhancing control over visual inputs. PCDM [39] proposed a three-stage pipeline for pose-guided person image synthesis, achieving texture restoration and enhancing fine-detail consistency.

It should be mentioned that pose transfer focuses on preserving the original scene attributes, such as lighting, background, and subject appearance. In contrast, the virtual try-off task should adhere to strict e-commerce presentation standards, including consistent front/back views, uniform sizing, and catalog-specific styling.

Conditional Diffusion Models. Latent Diffusion Models [35] (LDMs) achieved great success in recent years, offering control over the generative process through the introduction of the cross-attention mechanism [49]. The conditioning works with diverse input modalities such as text [2, 4, 13] and image [32, 37, 38]. In text-guided image synthesis, models like ControlNet [60] and T2I-Adapter [30] extend pretrained models with additional blocks that offer more precise spatial control. IP-Adapter [58] advances this flexibility by decoupling the cross-attention mechanism for text and image features, allowing image-guided generation with optional structural conditions. Prompt-Free Diffusion [56] discards text prompts altogether, generating images solely from a reference image and optional structural inputs.

Despite the advancements, these models cannot be applied for garment reconstruction out-of-the-box: text-guided approaches require impractically detailed prompts for each sample to specify product attributes, while existing image-guided models lack mechanisms to enforce the strict requirements of standardized product photography. While these techniques have advanced image manipulation capabilities, they fall short of addressing the specific challenges associated with generating standardized e-commerce product images. Recently, Wang et al. [53] incorporated a VTOFF-like objective in their models, but only as an auxiliary loss term. To the best of our knowledge, we are the first to formally define Virtual Try-Off (VTOFF) as a standalone task and to propose a tailored approach for it.
3. Methodology

This section provides the formal definition of the virtual try-off task. We propose a suitable evaluation setup and performance metrics. We further provide details of our TryOffDiff model, which relies on Stable Diffusion and SigLIP features for image-based conditioning.

3.1. Virtual Try-Off

Problem Formulation. Let $I \in \mathbb{R}^{H \times W \times 3}$ be an RGB image with height $H \in \mathbb{N}$ and width $W \in \mathbb{N}$, respectively. In the task of virtual try-off, I represents a reference image displaying a clothed person. Given the reference image, VTOFF aims to generate a standardized product image $G \in \{0, \dots, 255\}^{H \times W \times 3}$, displaying the garment according to commercial catalog standards.

Formally, the goal is to train a generative model that learns the conditional distribution P(G|C), where G and C represent the variables corresponding to garment images and reference images (serving as condition), respectively. Suppose the model approximates this target distribution with Q(G|C). Then, given a specific reference image I as conditioning input, the objective is for a sample Ĝ ∼ Q(G|C = I) to resemble a true sample of a garment image G ∼ P(G|C = I) as closely as possible.

Performance Measures. To evaluate VTOFF performance effectively, evaluation metrics must capture both reconstruction and perceptual quality. Reconstruction quality quantifies how accurately the model's prediction Ĝ matches the ground truth G, focusing on pixel-level fidelity. In contrast, perceptual quality assesses how natural and visually appealing the generated image appears to human observers, aligning with common visual standards.

To estimate reconstruction, we may use full-reference metrics such as the Structural Similarity Index Measure (SSIM) [54]. However, neither SSIM nor its multi-scale (MS-SSIM) and complex-wavelet (CW-SSIM) variants align well with human perception, as noted in prior studies [11, 45]. We observe similar behavior in our experiments as well, and illustrate our findings in Figure 3.

Perceptual quality may be captured with no-reference metrics like Fréchet Inception Distance (FID) [20] and Kernel Inception Distance (KID) [5]. These metrics usually compare distributions of image feature representations between generated and real images. They are, however, unsuitable for single image pair comparison since they are sensitive to sample size and potential outliers. Additionally, both FID and KID rely on features from the classical Inception [44] model, which does not necessarily align with human judgment in assessing perceptual quality, especially in the context of modern generative models such as diffusion models [42].

A metric that addresses these shortcomings is the Deep Image Structure and Texture Similarity (DISTS) [11] metric, designed to measure perceptual similarity between images by capturing both structural and textural information. DISTS leverages the VGG model [40], where lower-level features are used to capture structural elements, while higher-level features focus on finer textural details. The final DISTS score is computed through a weighted combination of these two components, with weighting parameters optimized based on human ratings, resulting in a perceptual similarity score that aligns more closely with human judgment. For these reasons, DISTS represents our main metric for VTOFF.

Figure 3. Examples demonstrating the (un)suitability of performance metrics (SSIM↑ / DISTS↓) to VTON and VTOFF. In the top row, a reference image is compared against: (a) an image with a masked-out garment (82.4 / 20.6); (b) an image with changed colors of the model (96.8 / 17.9); and (c) an image after applying color jittering (88.3 / 20.3). In the bottom row, a garment image is compared against: (d) a plain white image (86.0 / 70.3); (e) a slightly rotated image (75.0 / 8.2); and (f) a randomly posterized image, reducing the number of bits for each color channel (86.4 / 24.7). While the SSIM score is consistently high across all examples, in particular including failure cases, the DISTS score more accurately reflects variations aligned with human judgment.

3.2. TryOffDiff

We base our TryOffDiff model on Stable Diffusion [35] (v1.4), a latent diffusion model originally designed for text-conditioned image generation using CLIP's [34] text encoder. We replace the text prompts with image features to enable direct image-guided image generation.

Image Conditioning. A core challenge in image-guided generation is effectively incorporating visual features into the conditioning mechanism of the generative model. CLIP's ViT [34] has become a popular choice for image feature extraction due to its general-purpose capabilities. Recently, SigLIP [59] introduced modifications that improve performance, particularly for tasks requiring more detailed and domain-specific visual representations.
Figure 4. Overview of TryOffDiff. The SigLIP image encoder [59] extracts features from the reference image, which are subsequently
processed by adapter modules. These extracted image features are embedded into a pre-trained text-to-image Stable Diffusion-v1.4 [35]
by replacing the original text features in the cross-attention layers. By conditioning on image features in place of text features, TryOffDiff
directly targets the VTOFF task. Simultaneous training of the adapter layers and the diffusion model enables effective garment transfor-
mation.

Therefore, we use the SigLIP model as image feature extractor and retain the entire sequence of token representations in its final layer to preserve spatial information, which we find essential for the capture of fine-grained visual details and accurate garment reconstruction.

Given input image I, our proposed adapter module processes these representations as follows:

$C(I) = (\mathrm{LN} \circ \mathrm{Linear} \circ \psi \circ \mathrm{SigLIP})(I) \in \mathbb{R}^{n \times m}$    (1)

where $\psi$ is a standard transformer encoder [49] processing SigLIP embeddings, followed by a linear projection layer and layer normalization (LN) [1], cf. Figure 4.

The adapted image features are integrated into the denoising U-Net of Stable Diffusion via cross-attention. Specifically, the key K and value V of the attention mechanism at each layer are derived from the image features through linear transformations:

$K = C(I) \cdot W_k \in \mathbb{R}^{n \times d_k}, \quad V = C(I) \cdot W_v \in \mathbb{R}^{n \times d_v}$    (2)

where $W_k \in \mathbb{R}^{m \times d_k}$ and $W_v \in \mathbb{R}^{m \times d_v}$. This formulation enables the cross-attention mechanism to condition the denoising process on the features of the external reference image I, enhancing alignment in the generated output.

We only train the adapter modules and fine-tune the denoising U-Net of the Stable Diffusion model, while keeping the SigLIP image encoder, VAE encoder, and VAE decoder frozen. This training strategy preserves the robust image processing capabilities of the pretrained components while adjusting the generative components to the specific requirements of garment reconstruction.
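To make Eq. (1) concrete, the following is a minimal PyTorch sketch of the adapter, assuming the stated shapes: a single transformer encoder layer ψ over the SigLIP token sequence, followed by a linear projection and layer normalization reducing 1024 tokens of dimension 768 to n = 77 conditioning embeddings of dimension m = 768. The module names and the exact way the token count is reduced (a linear map over the sequence axis) are our own illustration, not the released implementation.

# Hypothetical sketch of the adapter in Eq. (1); names and the token-count
# reduction (a linear map over the sequence axis) are assumptions based on the
# stated shapes, not the authors' released code.
import torch
import torch.nn as nn

class ImageConditioningAdapter(nn.Module):
    def __init__(self, tokens_in=1024, tokens_out=77, dim=768, heads=8):
        super().__init__()
        # psi: a single standard transformer encoder layer over SigLIP tokens
        self.psi = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        # Linear: reduce the token count (1024 -> 77) along the sequence axis
        self.proj = nn.Linear(tokens_in, tokens_out)
        # LN: layer normalization over the feature dimension
        self.norm = nn.LayerNorm(dim)

    def forward(self, siglip_tokens):      # (B, 1024, 768)
        x = self.psi(siglip_tokens)        # (B, 1024, 768)
        x = self.proj(x.transpose(1, 2))   # (B, 768, 77)
        x = x.transpose(1, 2)              # (B, 77, 768)
        return self.norm(x)                # C(I) with shape (B, n, m)

# The adapter output plays the role of the text embeddings in Stable Diffusion:
# it is passed to the U-Net as `encoder_hidden_states`, from which the K and V
# projections of Eq. (2) are computed inside the cross-attention layers.
cond = ImageConditioningAdapter()(torch.randn(2, 1024, 768))
print(cond.shape)  # torch.Size([2, 77, 768])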
4. Experiments

We establish several baseline approaches for the virtual try-off task, adapting virtual try-on and pose transfer models as discussed in Section 2, and compare them against our proposed TryOffDiff method described in Section 3. To ensure reproducibility, we detail our experimental setup. We use DISTS as the primary evaluation metric, while also reporting other standard generative metrics for comparison. Additionally, we provide extensive qualitative results to illustrate how our model manages various challenging inputs.

4.1. Experimental Setup

Dataset. Our experiments are conducted on the publicly available VITON-HD [27] dataset, which consists of 13,679 high-resolution (1024 × 768) image pairs of frontal half-body models and corresponding upper-body garments. While the VITON-HD dataset was originally curated for the VTON task, it is also well-suited to our purposes as it provides the required (I, G) image pairs, where I represents the reference image of a clothed person and G the corresponding garment image.

Upon closer inspection of VITON-HD, we identified 95 duplicate image pairs (0.8%) in the training set and 6 duplicate pairs (0.3%) in the test set. Additionally, we found 36 pairs (1.8%) in the training set that had been included in the original test split. To ensure the integrity of our experiments, we cleaned the dataset by removing all duplicates in both subsets as well as all leaked examples from the test set. The resulting cleaned dataset contains 11,552 unique image pairs for training and 1,990 unique image pairs for testing. We provide the script for cleaning the dataset in our code repository.
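The cleaning script itself is provided in the code repository; the snippet below is only a hedged illustration of how such duplicate and leakage detection can be performed, using a byte-level hash of each (person, garment) pair. The folder layout and the hashing criterion are assumptions and may differ from the official script.

# Illustrative duplicate/leakage check on VITON-HD-style folders (assumed
# layout: <split>/image/*.jpg and <split>/cloth/*.jpg); not the official script.
import hashlib
from pathlib import Path

def file_hash(path: Path) -> str:
    return hashlib.md5(path.read_bytes()).hexdigest()

def pair_hashes(split_dir: Path) -> dict:
    """Map filename -> combined hash of the (person image, garment image) pair."""
    hashes = {}
    for person in sorted((split_dir / "image").glob("*.jpg")):
        garment = split_dir / "cloth" / person.name
        hashes[person.name] = file_hash(person) + file_hash(garment)
    return hashes

train = pair_hashes(Path("viton_hd/train"))  # placeholder paths
test = pair_hashes(Path("viton_hd/test"))

# Duplicates within a split: identical content under different filenames.
seen, train_dups = set(), []
for name, h in train.items():
    if h in seen:
        train_dups.append(name)
    seen.add(h)

# Leakage: training pairs whose content also appears in the test split.
test_hashes = set(test.values())
leaked = [name for name, h in train.items() if h in test_hashes]
print(len(train_dups), "duplicate train pairs,", len(leaked), "leaked pairs")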
(a) Left to right: reference image, fixed pose heatmap derived from target image, initial model output, SAM prompts, and final processed output.
(b) Left to right: masked conditioning image, mask image, pose image, initial model output with SAM prompts, and final processed output.
(c) Left to right: masked garment image, model image, masked model image, initial model output with SAM prompts, and final processed output.
(d) Left to right: conditioning garment image, blank model image, mask image, initial model output with SAM prompts, final processed output.

Figure 5. Adapting existing state-of-the-art methods to VTOFF. (a) GAN-Pose [36] and (b) ViscoNet [7] are approaches based on pose transfer and view synthesis, respectively; (c) OOTDiffusion [57] and (d) CatVTON [9] are based on recent virtual try-on methods.

Implementation Details. We train TryOffDiff by building on the pretrained Stable Diffusion v1.4 [35], focusing on fine-tuning the denoising U-Net and training adapter layers from scratch, cf. Section 3.2. As a preprocessing step, we pad the input reference image along the width to obtain a square aspect ratio, then resize it to a resolution of 512 × 512 to match the expected input format of the pretrained SigLIP and VAE encoder. For training, we preprocess the garment images in the same way. We use SigLIP-B/16-512 as image feature extractor, which outputs 1024 token embeddings of dimension 768. Our adapter, consisting of a single transformer encoder layer with 8 attention heads, followed by linear and normalization layers, reduces these to n = 77 conditioning embeddings of dimension m = 768.

Training occurs over 220k iterations on a single node with 4 NVIDIA A40 GPUs, requiring approximately 9 days with a batch size of 16. We employ the AdamW optimizer [29], with an initial learning rate of 1e-4 that increases linearly from 0 during the first 1,000 warmup steps, then follows a cosine decay to 0 with a hard restart at 90k steps. As proposed in [28], we use the PNDM scheduler with 1,000 steps. We optimize using the standard Mean Squared Error (MSE) loss, which measures the difference between the added and the predicted noise at each step. This loss function is commonly employed in diffusion models to guide the model in learning to reverse the noising process effectively. During inference, we run TryOffDiff with a PNDM scheduler over 50 timesteps with a guidance scale of 2.0. On a single NVIDIA A6000 GPU, this process takes 12 seconds per image and requires 4.6GB of memory.
mizer [29], with an initial learning rate of 1e-4 that in- a generic target mask. Since ViscoNet is originally trained
creases linearly from 0 during the first 1,000 warmup steps, with masked conditioning images, we apply an off-the-shelf
then follows a cosine decay to 0 with a hard restart at 90k fashion parser [50] to mask the upper-body garment, which
steps. As proposed in [28], we use the PNDM scheduler is then provided as input.
with 1,000 steps. We optimize using the standard Mean
Squared Error (MSE) loss, which measures the difference OOTDiffusion [57] takes a garment image and a refer-
between the added and the predicted noise at each step. This ence image to generate a VTON output. To adapt this model
loss function is commonly employed in diffusion models for VTOFF, we again apply the fashion parser [50] to mask
to guide the model in learning to reverse the noising pro- the upper-body garment to create the garment image. We
cess effectively. During inference, we run TryOffDiff with select a reference image with a mannequin in a neutral pose
a PNDM scheduler over 50 timesteps with a guidance scale as further input. An intermediate step involves masking the
of 2.0. On a single NVIDIA A6000 GPU, this process takes upper-body within the reference image, for which we use a
12 seconds per image and requires 4.6GB of memory. hand-crafted masked version of the reference image.

6
CatVTON [9] is a model that generates a VTON image using a reference image and a conditioning garment image as inputs. An intermediate step incorporates upper-body masks to guide the try-on process. For adaptation to VTOFF, we replace the reference image with a plain white image and use a handcrafted mask in a neutral pose, enabling CatVTON to perform garment transfer independently of any specific person.

In all of our baselines, we post-process the outputs with Segment Anything (SAM) [25] and point prompts to isolate the garment mask. We cut out the identified garment sections and paste them onto a white background for the final garment image output.
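This post-processing step can be sketched with the public segment-anything API as follows; the checkpoint path and the choice of foreground point prompts are illustrative placeholders, since the paper does not specify how the prompt points are chosen.

# Hedged sketch of SAM-based post-processing: isolate the garment with point
# prompts and paste it onto a white background. Checkpoint path and prompt
# coordinates are placeholders.
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

def garment_on_white(image: np.ndarray, points: np.ndarray) -> np.ndarray:
    """image: HxWx3 uint8 model output; points: Nx2 (x, y) foreground prompts."""
    predictor.set_image(image)
    masks, scores, _ = predictor.predict(
        point_coords=points,
        point_labels=np.ones(len(points)),   # 1 = foreground prompt
        multimask_output=True,
    )
    mask = masks[scores.argmax()]            # keep the highest-scoring mask
    out = np.full_like(image, 255)           # white background
    out[mask] = image[mask]                  # paste the garment region
    return out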
Method SSIM↑ MS-SSIM↑ CW-SSIM↑ LPIPS↓ FID↓ CLIP-FID↓ KID↓ DISTS↓
GAN-Pose [36] 77.4 63.8 32.5 44.2 73.2 30.9 55.8 30.4
ViscoNet [7] 58.5 50.7 28.9 54.0 42.3 12.1 25.5 31.2
OOTDiffusion [57] 65.1 50.6 26.1 49.5 54.0 17.5 33.2 32.4
CatVTON [9] 72.8 56.9 32.0 45.9 31.4 9.7 17.8 28.2
Ours: TryOffDiff 79.5 70.4 46.2 32.4 25.1 9.4 8.9 23.0

Table 1. Quantitative comparison. Evaluation metrics for various methods on the VITON-HD-test dataset in the VTOFF task.

4.3. Quantitative Results

The numerical results of our experiments on the VITON-HD dataset are reported in Table 1. Our tailored TryOffDiff approach outperforms all baseline methods across all generative performance metrics. However, baseline rankings vary significantly depending on the chosen metric. For example, GAN-Pose has the second best results when using full-reference metrics like SSIM, MS-SSIM, and CW-SSIM. In contrast, for no-reference metrics such as FID, CLIP-FID, and KID, CatVTON emerges as the strongest baseline, while GAN-Pose has the lowest performance.

The DISTS metric is our main metric as it balances structural and textural information, offering a more nuanced assessment of generated image quality. When examining the ranking of the baseline methods, CatVTON slightly outperforms GAN-Pose, which in turn shows marginally better performance than ViscoNet and OOTDiffusion. This ranking aligns well with our own subjective visual perception, which will be further discussed in the following Section 4.4. We emphasize that TryOffDiff shows a significant improvement of 5.2 percentage points over the next best performing baseline method.

4.4. Qualitative Analysis

The qualitative results are shown in Figure 6. We find that they align with the quantitative results and illustrate how each metric emphasizes different aspects of garment reconstruction, leading to inconsistent rankings, as discussed in Section 3.1.

GAN-Pose generates outputs that manage to approximate the main color and shape of the target garment. However, the predicted images often contain small regions where parts of the garment are missing. Although these gaps do not significantly affect full-reference metrics since the overall garment structure is still largely intact, they noticeably reduce visual fidelity, giving the images an unnatural appearance. This degradation is reflected in the no-reference metrics, which are more sensitive to such visual artifacts.

ViscoNet generally produces more realistic outputs than GAN-Pose but struggles to accurately capture the garment's shape, often resulting in deformed representations. Additionally, ViscoNet displays a bias towards generating long sleeves, regardless of the target garment's actual design. Most outputs also lack textural details, further highlighting ViscoNet's limitations for the garment reconstruction task.

OOTDiffusion, originally designed as a virtual try-on method, encounters similar difficulties as GAN-Pose in generating realistic images. While it generally struggles to retain detailed textures, it performs better in preserving fine elements like logos compared to previous methods. Nonetheless, its inability to consistently capture overall textural details underscores its limitations in virtual try-off.

CatVTON also demonstrates the ability to preserve logo elements. Furthermore, it generally manages to produce texture details that closely resemble those of the target garment. The garment shapes this method generates appear natural, making CatVTON's outputs visually appealing and the strongest baseline method in terms of visual fidelity. Although CatVTON produces garments with a natural appearance, the shapes do not consistently match the target garment's actual shape, undermining its full-reference metric performance and limiting its overall effectiveness for VTOFF.

Our TryOffDiff model consistently captures the shape of target garments, even reconstructing portions of the garment that are occluded in the reference image. For instance, TryOffDiff can correctly infer the shape of high-cut bodysuits, even when models in the reference images are wearing pants. Subtle indicators, such as garment tightness or features like shoulder straps, enable this reconstruction. Additionally, TryOffDiff reliably recovers detailed textures, including colors, patterns, buttons, ribbons, and logos, making it superior over all baseline methods and the top-performing model for VTOFF in our experiments.

While we note that TryOffDiff is the only method specifically designed for VTOFF, it stands out as the only approach capable of accurately reconstructing textural details. This underscores the effectiveness of our proposed image conditioning mechanism, which enables precise texture recovery and overall high-quality garment reconstruction.
(a) Reference (b) Gan-Pose (c) ViscoNet (d) OOTDiffusion (e) CatVTON (f) TryOffDiff (g) Target

Figure 6. Qualitative comparison. In comparison to the baseline approaches, TryOffDiff is capable of generating garment images with
accurate structural details as well as fine textural details.

5. Conclusion

In this paper, we introduced VTOFF, a novel task focused on reconstructing a standardized garment image based on one reference image of a person wearing it. While VTOFF shares similarities with VTON, we demonstrate it is better suited for evaluating the garment reconstruction accuracy of generative models since it targets a clearly defined output.

We further propose TryOffDiff, a first tailored VTOFF model which adapts Stable Diffusion. We substitute Stable Diffusion text conditioning with adapted SigLIP features to guide the generative process. In our experiments, we repurpose the existing VITON-HD dataset, enabling direct comparisons of our method against several baselines based on existing VTON approaches. TryOffDiff significantly outperforms these baselines, with fewer requirements for pre- and post-processing steps. In particular, we find that we are better at preserving fine details like patterns and logos. We also observe that this advantage is not reflected when using conventional metrics for generative model reconstruction quality. To better capture visual fidelity, we adopt DISTS as our primary evaluation metric.

VTOFF highlights the potential for advancing our understanding of guided generative model performance. Our results show promise, but there is still room for improvement in preserving complex structures, such as logos and printed designs. Future work could benefit from exploring newer generative models, alternative visual conditioning methods, and additional losses to enhance detail preservation. Finally, our findings underscore the need for improved quality metrics, potentially combined with user studies, to better align qualitative impressions with quantitative evaluations.
Acknowledgment

This work has been funded by the German federal state of North Rhine-Westphalia as part of the research training group "DataNinja" (Trustworthy AI for Seamless Problem Solving: Next Generation Intelligence Joins Robust Data Analysis) and the research funding program KI-Starter. We would like to thank UniZG-FER for providing access to their hardware.

References

[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. stat, 1050:21, 2016.
[2] Jason Baldridge, Jakob Bauer, Mukul Bhutani, Nicole Brichtova, Andrew Bunner, Kelvin Chan, et al. Imagen 3. arXiv, 2024. https://doi.org/nqr4.
[3] Serge Belongie, Jitendra Malik, and Jan Puzicha. Shape matching and object recognition using shape contexts. IEEE TPAMI, 2002.
[4] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, et al. Improving image generation with better captions. preprint, 2023.
[5] Mikołaj Bińkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying mmd gans. In ICLR, 2018.
[6] Chaofeng Chen and Jiadi Mo. IQA-PyTorch: Pytorch toolbox for image quality assessment. https://github.com/chaofengc/IQA-PyTorch, 2022.
[7] Soon Yau Cheong, Armin Mustafa, and Andrew Gilbert. Visconet: Bridging and harmonizing visual and textual conditioning for controlnet. In ECCVW, 2024.
[8] Yisol Choi, Sangkyung Kwak, Kyungmin Lee, Hyungwon Choi, and Jinwoo Shin. Improving diffusion models for virtual try-on. arXiv, 2024. https://doi.org/np47.
[9] Zheng Chong, Xiao Dong, Haoxiang Li, Shiyue Zhang, Wenqing Zhang, Xujie Zhang, Hanqing Zhao, and Xiaodan Liang. Catvton: Concatenation is all you need for virtual try-on with diffusion models. arXiv, 2024. https://doi.org/npf6.
[10] Aiyu Cui, Daniel McKee, and Svetlana Lazebnik. Dressing in order: Recurrent person image generation for pose transfer, virtual try-on and outfit editing. In ICCV, 2021.
[11] Keyan Ding, Kede Ma, Shiqi Wang, and Eero P Simoncelli. Image quality assessment: Unifying structure and texture similarity. IEEE TPAMI, 2020.
[12] Haoye Dong, Xiaodan Liang, Xiaohui Shen, Bowen Wu, Bing-Cheng Chen, and Jian Yin. Fw-gan: Flow-navigated warping gan for video virtual try-on. In ICCV, 2019.
[13] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024.
[14] Yuying Ge, Ruimao Zhang, Lingyun Wu, Xiaogang Wang, Xiaoou Tang, and Ping Luo. A versatile benchmark for detection, pose estimation, segmentation and re-identification of clothing images. In CVPR, 2019.
[15] Yuying Ge, Yibing Song, Ruimao Zhang, Chongjian Ge, Wei Liu, and Ping Luo. Parser-free virtual try-on via distilling appearance flows. In CVPR, 2021.
[16] Sylvain Gugger, Lysandre Debut, Thomas Wolf, Philipp Schmid, Zachary Mueller, Sourab Mangrulkar, et al. Accelerate: Training and inference at scale made simple, efficient and adaptable. https://github.com/huggingface/accelerate, 2022.
[17] Xintong Han, Zuxuan Wu, Zhe Wu, Ruichi Yu, and Larry S Davis. Viton: An image-based virtual try-on network. In CVPR, 2018.
[18] Xintong Han, Xiaojun Hu, Weilin Huang, and Matthew R Scott. Clothflow: A flow-based model for clothed person generation. In CVPR, 2019.
[19] Xiao Han, Xiatian Zhu, Jiankang Deng, Yi-Zhe Song, and Tao Xiang. Controllable person image synthesis with pose-constrained latent diffusion. In ICCV, 2023.
[20] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, 2017.
[21] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.
[22] Wonbong Jang and Lourdes Agapito. Nvist: In the wild new view synthesis from a single image with transformers. In CVPR, 2024.
[23] Nikolay Jetchev and Urs Bergmann. The conditional analogy gan: Swapping fashion articles on people images. In ICCVW, 2017.
[24] Johanna Karras, Aleksander Holynski, Ting-Chun Wang, and Ira Kemelmacher-Shlizerman. Dreampose: Fashion image-to-video synthesis via stable diffusion. In ICCV, 2023.
[25] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In ICCV, 2023.
[26] Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional inverse graphics network. In NeurIPS, 2015.
[27] Sangyun Lee, Gyojung Gu, Sunghyun Park, Seunghwan Choi, and Jaegul Choo. High-resolution virtual try-on with misalignment and occlusion-handled conditions. In ECCV, 2022.
[28] Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. Pseudo numerical methods for diffusion models on manifolds. In ICLR, 2022.
[29] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
[30] Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, and Ying Shan. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In AAAI, 2024.
[31] Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in gan evaluation. In CVPR, 2022.
[32] Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, and Jun-Yan Zhu. Zero-shot image-to-image translation. In SIGGRAPH, 2023.
[33] Xuebin Qin, Zichen Zhang, Chenyang Huang, Masood Dehghan, Osmar R Zaiane, and Martin Jagersand. U2-net: Going deeper with nested u-structure for salient object detection. Pattern Recognit., 2020.
[34] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
[35] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
[36] Prasun Roy, Saumik Bhattacharya, Subhankar Ghosh, and Umapada Pal. Multi-scale attention guided pose transfer. Pattern Recognit., 2023.
[37] Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, et al. Palette: Image-to-image diffusion models. In SIGGRAPH, 2022.
[38] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. IEEE TPAMI, 2022.
[39] Fei Shen, Hu Ye, Jun Zhang, Cong Wang, Xiao Han, and Wei Yang. Advancing pose-guided image synthesis with progressive conditional diffusion models. In ICLR, 2024.
[40] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[41] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In ICLR, 2021.
[42] George Stein, Jesse Cresswell, Rasa Hosseinzadeh, Yi Sui, Brendan Ross, Valentin Villecroze, Zhaoyan Liu, Anthony L Caterini, Eric Taylor, and Gabriel Loaiza-Ganem. Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models. In NeurIPS, 2024.
[43] Shao-Hua Sun, Minyoung Huh, Yuan-Hong Liao, Ning Zhang, and Joseph J Lim. Multi-view to novel view: Synthesizing novel views with self-learned confidence. In ECCV, 2018.
[44] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[45] Huixuan Tang, Neel Joshi, and Ashish Kapoor. Learning a blind measure of perceptual image quality. In CVPR, 2011.
[46] Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Multi-view 3d models from single images with a convolutional network. In ECCV, 2016.
[47] Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. In ICLR, 2016.
[48] Dmitry Tochilkin, David Pankratz, Zexiang Liu, Zixuan Huang, Adam Letts, Yangguang Li, Ding Liang, Christian Laforte, Varun Jampani, and Yan-Pei Cao. Triposr: Fast 3d object reconstruction from a single image. arXiv, 2024. https://doi.org/nq56.
[49] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[50] Riza Velioglu, Robin Chan, and Barbara Hammer. Fashionfail: Addressing failure cases in fashion object detection and segmentation. In IJCNN, 2024.
[51] Patrick von Platen, Suraj Patil, Anton Lozhkov, Pedro Cuenca, Nathan Lambert, Kashif Rasul, et al. Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers, 2022.
[52] Bochao Wang, Huabin Zheng, Xiaodan Liang, Yimin Chen, Liang Lin, and Meng Yang. Toward characteristic-preserving image-based virtual try-on network. In ECCV, 2018.
[53] Chenhui Wang, Tao Chen, Zhihao Chen, Zhizhong Huang, Taoran Jiang, Qi Wang, and Hongming Shan. Fldm-vton: Faithful latent diffusion model for virtual try-on. In IJCAI, 2024.
[54] Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process., 2004.
[55] Zhenyu Xie, Zaiyu Huang, Fuwei Zhao, Haoye Dong, Michael Kampffmeyer, and Xiaodan Liang. Towards scalable unpaired virtual try-on via patch-routed spatially-adaptive gan. In NeurIPS, 2021.
[56] Xingqian Xu, Jiayi Guo, Zhangyang Wang, Gao Huang, Irfan Essa, and Humphrey Shi. Prompt-free diffusion: Taking "text" out of text-to-image diffusion models. In CVPR, 2024.
[57] Yuhao Xu, Tao Gu, Weifeng Chen, and Chengcai Chen. Ootdiffusion: Outfitting fusion based latent diffusion for controllable virtual try-on. arXiv, 2024. https://doi.org/npf9.
[58] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv, 2023. https://doi.org/np3v.
[59] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In ICCV, 2023.
[60] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, 2023.
[61] Bo Zhao, Xiao Wu, Zhi-Qi Cheng, Hao Liu, Zequn Jie, and Jiashi Feng. Multi-view image generation from a single-view. In ACM MM, 2018.
[62] Tinghui Zhou, Shubham Tulsiani, Weilun Sun, Jitendra Malik, and Alexei A Efros. View synthesis by appearance flow. In ECCV, 2016.
[63] Luyang Zhu, Yingwei Li, Nan Liu, Hao Peng, Dawei Yang, and Ira Kemelmacher-Shlizerman. M&m vto: Multi-garment virtual try-on and editing. In CVPR, 2024.
TryOffDiff: Virtual-Try-Off via High-Fidelity Garment Reconstruction
using Diffusion Models
Supplementary Material
The supplementary material provides additional experi-
mental results. The first section presents an ablation study
that examines the contributions of individual components of
TryOffDiff and evaluates the effect of various inference hy-
perparameters. We then demonstrate how the proposed approach can be integrated with virtual try-on models for person-to-person try-on, achieving competitive performance with specialized models. Further qualitative comparisons with baseline methods are included, alongside visualizations of TryOffDiff predictions on 10% of the test dataset. Finally, we give further details regarding implementation.

Figure 7. Examples demonstrating the un-/suitability of performance metrics (SSIM↑ / DISTS↓) and an Autoencoder model applied to VTOFF. In each figure, the left image is the ground truth image and the right image is the model prediction of the Autoencoder (top, a-c) and TryOffDiff (bottom, d-f). Panel scores (SSIM↑ / DISTS↓): (a) 81.9 / 36.2; (b) 81.5 / 40.4; (c) 81.7 / 39.7; (d) 80.3 / 24.2; (e) 75.3 / 25.0; (f) 80.3 / 19.4. Notice the higher SSIM scores for the Autoencoder compared to TryOffDiff despite the poor visual quality of the reconstructed garment images.

6. Ablation Studies

Our ablation experiments investigate the impact of various TryOffDiff configurations. We analyze the differences between operating in the pixel and latent space, evaluate adapter design choices, and assess the influence of different image encoders and conditioning features. Additionally, we compare the effectiveness of fine-tuning versus training from scratch. Finally, we further look into the role of denoising hyperparameters during the inference phase of our method.

6.1. Impact of TryOffDiff configurations

Our first set of experiments explores different TryOffDiff setups, focusing only on methods that achieved comparable results in our evaluations. All models were trained from scratch, except for TryOffDiff.

The Autoencoder is based on a nested U-Net [33], originally proposed for salient object detection. We trained the model from scratch using MSE. This approach is able to reconstruct the general shape of the garment, but it lacks detailed features such as logos, text, and patterns.

The PixelModel, a diffusion model operating in pixel space based on the original diffusion architecture [21], shows improved pixel-level details but suffers from slow inference, rendering it impractical for real-world applications.

For the Latent Diffusion Models (LDMs), we leverage the recent VAE encoder from StableDiffusion-3 [13], conditioning it with images via cross-attention layers in the U-Net. The overall architecture mirrors StableDiffusion-1.4 [35], with variations through different image encoders, adapter layers, and mixed precision settings.

Precise model details are listed in Table 2, and the corresponding quantitative results for the VTOFF task on the VITON-HD dataset are summarized in Table 3. Unlike earlier experiments, here we evaluate the raw outputs of the generative model without applying background removal. Previously, background removal was necessary to ensure comparability with baseline methods designed for VTON models adapted to the VTOFF task. Unnecessary elements (e.g. anything except the upper-body garment) were removed through segmentation-based post-processing with SAM. However, since all models in this comparison are specifically trained for the VTOFF task, they are expected to handle background removal directly. TryOffDiff achieves slightly better performance metrics when evaluated without SAM post-processing.

Figure 8 shows the qualitative results for different configurations of our approach. These results further highlight the shortcomings of existing image generation metrics, which often fail to align with human perception of image quality. For instance, the autoencoder in column 1 achieves high scores despite its lack of fine details, a limitation also illustrated in Figure 7.

6.2. Hyper-parameter choice in the denoising process

Figure 9 shows how various guidance scales and inference steps impact FID and DISTS. We find that the performance of our approach remains relatively stable with respect to the number of denoising steps.
Method VAE Img. Encoder Emb.shape Adapter Cond.shape Sched. Prec. Steps
Autoencoder - - - - - - fp32 290k
PixelModel - SigLIP-B/16 (1024,768) Linear+LN (64,768) DDPM fp16 300k
LDM-1 SD3 CLIP ViT-B/32 (50,768) - (50,768) DDPM fp16 180k
LDM-2 SD3 SigLIP-B/16 (1024,768) Linear+LN (64,768) DDPM fp16 320k
LDM-3 SD3 SigLIP-B/16 (1024,768) Linear+LN (64,768) DDPM fp32 120k
TryOffDiff SD1.4 SigLIP-B/16 (1024,768) Trans.+Linear+LN (77,768) PNDM fp32 220k

Table 2. Training configurations of ablations.

Method Sched. s n SSIM↑ MS-SSIM↑ CW-SSIM↑ LPIPS↓ FID ↓ CLIP-FID↓ KID↓ DISTS↓
Autoencoder - - - 81.4 72.0 37.3 39.5 108.7 31.7 66.8 32.5
PixelModel DDPM - 50 76.0 66.3 37.0 52.1 75.4 20.7 56.4 32.6
LDM-1 DDPM - 50 79.6 70.5 42.0 33.0 26.6 9.14 11.5 24.3
LDM-2 DDPM - 50 80.2 72.3 48.3 31.8 18.9 7.5 5.4 21.8
LDM-3 DDPM - 50 79.5 71.3 46.9 32.6 18.6 7.5 6.7 22.7
TryOffDiff PNDM 2.0 50 79.4 71.5 47.2 33.2 20.2 8.3 6.8 22.5

Table 3. Quantitative comparison. Evaluation metrics for different methods on VITON-HD-test dataset for VTOFF task. Results are
reported on raw predictions, with no background removal. Note that while LDM-2 may achieve better performance metrics, we still choose
TryOffDiff over LDM-2 due to its better subjective visual quality in garment image generation, see also Figure 8.

Still, it is affected by the value of the guidance scale, which we further demonstrate with qualitative results in Figure 10. Lower guidance values result in a loss of detail, whereas higher values compromise realism, introducing artifacts such as excessive contrast and color saturation.

Figure 11 and Figure 12 demonstrate the effect of varying the noising seed on reconstruction quality. Overall, the generated garment images show strong consistency across inference runs. However, for certain examples, slight variations in the shape of the garment can occur. This is noticeable in upper-body apparel with challenging features, such as ribbons or short tops. Similarly, complex patterns, such as printed designs or text on shirts, may exhibit slight differences in reconstruction. In contrast, simpler garments, those with solid colors or basic patterns like stripes, show high consistency across all runs and closely match the ground truth.
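For reference, the guidance scale s studied in this section enters the sampler through the usual classifier-free guidance combination of the conditional and unconditional noise predictions. The paper does not spell the formula out; the following is the standard formulation under a common convention, with C(I) the image condition and ∅ the null condition:

% classifier-free guidance (standard form, stated here as an assumption)
\hat{\epsilon}_\theta(z_t, C(I)) = \epsilon_\theta(z_t, \varnothing) + s \, \bigl( \epsilon_\theta(z_t, C(I)) - \epsilon_\theta(z_t, \varnothing) \bigr)

Larger s weighs the conditional branch more heavily relative to the unconditional one, which is consistent with the trade-off between detail and realism observed above.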
7. Person-to-person Try-On

TryOffDiff can be used to adapt existing Virtual Try-On models for person-to-person try-on. In this setup, our method generates the target garment from the target model, which is then used as input to classical VTON models instead of the ground truth garment image. We conduct experiments using OOTDiffusion [57] and compare the quality of virtual try-on using the ground truth garment versus our predicted garment. Additionally, we evaluate against CatVTON [9], a state-of-the-art person-to-person try-on model, using its default inference settings from the official GitHub repository. The quantitative results are summarized in Table 4. Since the VITON-HD dataset lacks person-to-person try-on ground truth data, we report only metrics that assess perceptual quality.

Method FID↓ CLIP-FID↓ KID↓
CatVTON 12.0 3.5 3.9
OOTDiffusion + GT 10.8 2.8 2.0
OOTDiffusion + TryOffDiff 12.0 3.5 2.5

Table 4. Quantitative comparison of Virtual Try-On models. We compare the results of OOTDiffusion when the ground truth (GT) garment is used and when the garment predicted by TryOffDiff is used. We further show the results of CatVTON, a specialized person-to-person try-on model. Our TryOffDiff model in combination with a VTON model achieves competitive performance in person-to-person VTON.

Replacing the ground truth garment with TryOffDiff's predictions leads to a slight drop in quality, as the reconstructions are not perfect. Our approach also slightly outperforms CatVTON. This may be partly attributed to CatVTON's difficulties with person reconstruction, despite its strength in preserving clothing details. This observation further highlights the limitations of the VTON task and commonly used VTON metrics, which fail to adequately distinguish between person and garment reconstruction quality.

Qualitative results are shown in Figure 13 and Figure 14.
(a) Autoencoder (b) PixelModel (c) LDM-1 (d) LDM-2 (e) LDM-3 (f) TryOffDiff (g) Target

Figure 8. Qualitative comparison between different configurations explored in our ablation study. See also Table 2 for more details.

(a) Guidance Scale (b) Inference steps

Figure 9. Ablation study on the impact of guidance scale (s) and inference steps (n) on DISTS and FID scores. Experiments are
conducted on VITON-HD-test with TryOffDiff using the DDIM [41] noise scheduler.

Overall, there is no definitive winner between CatVTON and OOTDiffusion combined with TryOffDiff. CatVTON excels in preserving texture and pattern details but occasionally suffers from diffusion artifacts (Figure 13, row 3; Figure 14, row 2). Additionally, CatVTON sometimes transfers attributes of the target model to the source model (Figure 13, rows 3 and 4; Figure 14, row 4), a limitation not observed in classical try-on models.
Finally, complex clothing items remain challenging,
even when using ground truth images for virtual try-on (Fig-
ure 13, row 1; Figure 14, rows 1 and 4).
Nonetheless, these results highlight the potential of the
Virtual Try-Off task and the TryOffDiff model. Although
TryOffDiff was not specifically trained for person-to-person
virtual try-on, its integration with VTON models presents
a promising approach, already demonstrating competitive
performance compared to state-of-the-art person-to-person
virtual try-on methods.
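Schematically, the indirect person-to-person pipeline evaluated here chains the two stages as in the sketch below; `try_off` and `try_on` are hypothetical wrappers around TryOffDiff and OOTDiffusion inference, named only for illustration.

# Hypothetical two-stage person-to-person try-on: VTOFF followed by VTON.
# `try_off` and `try_on` stand in for TryOffDiff and OOTDiffusion inference.
from PIL import Image

def person_to_person(source_person: Image.Image,
                     target_person: Image.Image) -> Image.Image:
    # Stage 1 (VTOFF): reconstruct a canonical garment image from the person
    # wearing the target garment, replacing the ground-truth catalog image.
    garment = try_off(target_person)
    # Stage 2 (VTON): dress the source person with the reconstructed garment.
    return try_on(person=source_person, garment=garment)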

8. Additional Qualitative Results


This section offers additional qualitative results. We present
further comparisons with our baseline models, as intro-
duced in Section 4.2, in Figure 15.
We also visualize TryOffDiff’s output on 10% of the test
set, which is obtained by sorting the test images alphabet-
ically and selecting every 10th image. These results are
shown in Figure 16 and Figure 17.

9. Implementation Details
The implementation relies on PyTorch as the core frame-
work, with HuggingFace’s Diffusers library [51] for diffu-
sion model components and the Accelerate library [16] for
efficient multi-GPU training.
For evaluation, we use 'IQA-PyTorch' [6] to compute SSIM, MS-SSIM, CW-SSIM, and LPIPS, and the 'clean-fid' [31] library for FID, CLIP-FID, and KID. Finally, we employ the original implementation of DISTS [11] for evaluating perceptual image quality. For readability purposes, the values of SSIM, MS-SSIM, CW-SSIM, LPIPS, and DISTS presented in this paper are multiplied by 100, and KID is multiplied by 1000.
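A minimal sketch of this evaluation setup is given below, assuming the metric identifiers exposed by IQA-PyTorch ('ssim', 'ms_ssim', 'cw_ssim', 'lpips', 'dists') and clean-fid's folder-based API; the exact options used in the paper (e.g. for CLIP-FID) and the folder names are assumptions.

# Hedged sketch of the metric computation; identifiers, options and folder
# paths are assumptions about the IQA-PyTorch and clean-fid APIs.
import pyiqa
import torch
from cleanfid import fid

device = "cuda" if torch.cuda.is_available() else "cpu"
full_reference = {name: pyiqa.create_metric(name, device=device)
                  for name in ["ssim", "ms_ssim", "cw_ssim", "lpips", "dists"]}

def score_pair(pred: torch.Tensor, target: torch.Tensor) -> dict:
    """pred/target: (1, 3, H, W) tensors in [0, 1]; scaled by 100 as in the paper."""
    return {name: 100 * metric(pred, target).item()
            for name, metric in full_reference.items()}

# Distribution-level metrics over folders of generated vs. ground-truth images.
fid_score = fid.compute_fid("predictions/", "ground_truth/")
clip_fid = fid.compute_fid("predictions/", "ground_truth/", model_name="clip_vit_b_32")
kid_x1000 = 1000 * fid.compute_kid("predictions/", "ground_truth/")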
s=0 s = 1.2 s = 1.5 s = 1.8 s = 2.0 s = 2.5 s = 3.0 s = 3.5 Ground Truth
Figure 10. Qualitative results for different guidance. Left: no guidance applied (s = 0). Middle: varying guidance scale (s ∈
[1.2, 1.5, 1.8, 2.0, 2.5, 3.0, 3.5]). Right: ground-truth.

Examples generated from multiple inference runs using our TryOffDiff model Target
Figure 11. Sample Variations. While minor variations in shape and pattern may occur with complex garments, the overall output of
TryOffDiff demonstrates consistent garment reconstructions across multiple inference runs with different random seeds.

Examples generated from multiple inference runs using our TryOffDiff model Target
Figure 12. Sample Variations. While minor variations in shape and pattern may occur with complex garments, the overall output of
TryOffDiff demonstrates consistent garment reconstructions across multiple inference runs with different random seeds.

Figure 13. Qualitative comparison on the (person-to-person) VTON task. Columns show: (a) the person to be dressed, which all of the models use as one of the reference inputs; (b) output of the CatVTON model, which uses an image of a person wearing the target garment as condition for direct person-to-person VTON; (c) output of the OOTDiffusion model, which takes in an image of the target garment; and (d) output of the OOTDiffusion model, which takes in the output of our TryOffDiff model for indirect person-to-person VTON.

Figure 14. Qualitative comparison on the (person-to-person) VTON task. Columns show: (a) the person to be dressed, which all of the models use as one of the reference inputs; (b) output of the CatVTON model, which uses an image of a person wearing the target garment as condition for direct person-to-person VTON; (c) output of the OOTDiffusion model, which takes in an image of the target garment; and (d) output of the OOTDiffusion model, which takes in the output of our TryOffDiff model for indirect person-to-person VTON.

(a) Gan-Pose (b) ViscoNet (c) OOTDiffusion (d) CatVTON (e) TryOffDiff (f) Target

Figure 15. Qualitative comparison between baselines and TryOffDiff. In comparison to the baseline approaches, TryOffDiff is more
capable of generating garment images with accurate structural details as well as fine textural details.

Figure 16. TryOffDiff predictions on the VITON-HD-test dataset (samples 1–100). Visualized are the first 100 predictions, sampled by
selecting every 10th sample from the test set after sorting filenames alphabetically.

Figure 17. TryOffDiff predictions on the VITON-HD-test dataset (samples 101–200). Visualized are the next 100 predictions, sampled
by selecting every 10th sample from the test set after sorting filenames alphabetically.

