
MV-VTON: Multi-View Virtual Try-On with Diffusion Models

Haoyu Wang 1,2,*, Zhilu Zhang 2,†, Donglin Di 3, Shiliang Zhang 1, Wangmeng Zuo 2
1 State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
2 Harbin Institute of Technology
3 Space AI, Li Auto
[email protected], [email protected], [email protected], [email protected], [email protected]

* This work was done while the first author was an undergraduate at Harbin Institute of Technology.
† Corresponding Author.
Copyright © 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

arXiv:2404.17364v4 [cs.CV] 5 Jan 2025

Abstract

The goal of image-based virtual try-on is to generate an image of the target person naturally wearing the given clothing. However, existing methods solely focus on frontal try-on using frontal clothing. When the views of the clothing and person are significantly inconsistent, particularly when the person's view is non-frontal, the results are unsatisfactory. To address this challenge, we introduce Multi-View Virtual Try-ON (MV-VTON), which aims to reconstruct the dressing results from multiple views using the given clothes. Given that single-view clothes provide insufficient information for MV-VTON, we instead employ two images, i.e., the frontal and back views of the clothing, to encompass the complete view as much as possible. Moreover, we adopt diffusion models that have demonstrated superior abilities to perform our MV-VTON. In particular, we propose a view-adaptive selection method where hard-selection and soft-selection are applied to the global and local clothing feature extraction, respectively. This ensures that the clothing features roughly fit the person's view. Subsequently, we suggest joint attention blocks to align and fuse clothing features with person features. Additionally, we collect an MV-VTON dataset, MVG, in which each person has multiple photos with diverse views and poses. Experiments show that the proposed method not only achieves state-of-the-art results on the MV-VTON task using our MVG dataset, but also has superiority on the frontal-view virtual try-on task using the VITON-HD and DressCode datasets.

Code — https://github.com/hywang2002/MV-VTON

Figure 1: Motivation of this work. Previous VTON methods, e.g., StableVITON (Kim et al. 2023), can only be used for the frontal-view person, and fail when facing a person with multiple views. Our MV-VTON can faithfully present the try-on results for a person with various views.

Introduction

Virtual Try-On (VTON) is a classic yet intriguing technology. It can be applied in the field of fashion and online clothes shopping to improve user experience. VTON aims to render the visual effect of a person wearing a specified garment. The emphasis of this technology lies in reconstructing a realistic image that faithfully preserves personal attributes and accurately represents clothing shape and details.

Early VTON methods (Lee et al. 2022; Xie et al. 2023; Bai et al. 2022; He, Song, and Xiang 2022) are based on generative adversarial networks (GANs) (Goodfellow et al. 2020). They generally align the clothing to the person's pose, and then employ a generator to fuse the warped clothing with the person. However, it is challenging to ensure that the warped clothing fits the target person's pose, and inaccurate clothing features easily lead to distorted results. Recently, diffusion models (Rombach et al. 2022) have made remarkable strides in the field of image generation (Ruiz et al. 2023). Leveraging their potent generative capabilities, some researchers (Morelli et al. 2023; Kim et al. 2023) have integrated them into the virtual try-on field, building upon previous work and achieving commendable results.

Although VTON has made great progress, most existing methods focus on performing frontal try-on. In practical applications, such as online shopping for clothes, customers may expect to obtain the dressing effect from multiple views (e.g., side or back). In this case, the pose of the garment may be seriously inconsistent with the person's posture, and single-view clothing may not be enough to provide complete try-on information. Thus, these methods easily generate results with poorly deformed clothing and lose high-frequency details such as texts, patterns, and other textures on the clothing, as shown in Figure 1.
To address these issues, we introduce Multi-View Virtual Try-ON (MV-VTON), which aims to reconstruct the appearance and attire of a person from multiple views. For example, the clothing in Figure 1 may exhibit significant differences between its frontal and back styles, and MV-VTON should be able to display try-on results in various views, including front, back, and side ones. Thus, providing a single clothing image cannot meet the needs of dressing up, as it contains only partial information. Instead, we utilize both the frontal and back views of the clothing, which cover an approximately complete view with as few images as possible.

Given the frontal and back clothing, we utilize the popular diffusion approach to achieve MV-VTON. It is natural, but does not work well, to simply concatenate the two pieces of clothing together as conditions of diffusion models, as it is difficult for the model to learn how to assign the two-view clothes to a person, especially when the person is sideways. Instead, we propose a view-adaptive selection mechanism, which picks appropriate features of the two-view clothes based on the posture information of the person and clothes. Therein, the hard-selection module chooses one of the two clothes for global feature extraction, and the soft-selection module modulates the local features of the two clothes. We utilize CLIP (Radford et al. 2021) and a multi-scale encoder to extract the global and local clothing features, respectively. Moreover, to enhance the preservation of high-frequency details in clothing, we present joint attention blocks. They independently align global and local features with the person features, and selectively fuse them to refine the local clothing details while preserving global semantic information.

Furthermore, we collect a multi-view virtual try-on dataset, named Multi-View Garment (MVG). It contains thousands of samples, and each sample contains 5 images under different views and poses. We conduct extensive experiments not only on the MV-VTON task using the MVG dataset, but also on the frontal-view VTON task using the VITON-HD (Lee et al. 2022) and DressCode (Morelli et al. 2022) datasets. The results demonstrate that our method outperforms existing methods on both tasks.

In summary, our contributions are outlined below:
• We introduce a novel Multi-View Virtual Try-ON (MV-VTON) task, which aims at generating realistic dressing-up results of the multi-view person by using the given frontal and back clothing.
• We propose a view-adaptive selection method, where hard-selection and soft-selection are applied to global and local clothing feature extraction, respectively. It ensures that the clothing features roughly fit the person's view.
• We propose joint attention blocks to align the global and local features of the selected clothing with the person ones, and fuse them.
• We collect a multi-view virtual try-on dataset. Extensive experiments demonstrate that our method outperforms previous approaches quantitatively and qualitatively in both frontal-view and multi-view virtual try-on tasks.

Related Work

GAN-Based Virtual Try-On

Existing methods are aimed at the frontal-view VTON task. To reconstruct realistic results, these methods, based on generative adversarial networks (GANs) (Goodfellow et al. 2020), are typically divided into two steps. Firstly, the frontal-view clothing is deformed to align with the target person's pose. Afterward, the warped clothing and target person are fused through a GAN-based generator. In the warping step, some methods (Yang et al. 2020; Ge et al. 2021a; Wang et al. 2018) use TPS transformation to deform the frontal-view clothing, and others (Lee et al. 2022; Ge et al. 2021b; Xie et al. 2023) predict the global and local optical flow required for clothing deformation. However, when the clothing possesses intricate high-frequency details and the person's pose is complex, the effectiveness of clothing deformation is often diminished. Moreover, GAN-based generators generally encounter challenges in convergence and are highly susceptible to mode collapse (Miyato et al. 2018), leading to noticeable artifacts at the junction between the warped clothing and the target person in the final results. In addition, previous multi-pose virtual try-on methods (Dong et al. 2019; Wang et al. 2020; Yu et al. 2023) can change the person's pose, but are also limited by the GAN-based generator and insufficient clothing information.

Diffusion-Based Virtual Try-On

Thanks to the rapid advancement of diffusion models, recent works have sought to utilize the generative prior of large-scale pre-trained diffusion models (Ho, Jain, and Abbeel 2020; Song, Meng, and Ermon 2020; Rombach et al. 2022; Yang et al. 2023) to tackle frontal-view virtual try-on tasks. TryOnDiffusion (Zhu et al. 2023) introduces two U-Nets to encode the target person and frontal-view clothing images respectively, and interacts the features of the two branches through the cross-attention mechanism. LaDI-VTON (Morelli et al. 2023) encodes the frontal-view clothing image through textual inversion (Gal et al. 2022; Wei et al. 2023) and uses it as the conditional input of the backbone. DCI-VTON (Gou et al. 2023) first conducts an initial deformation of the frontal-view clothing by incorporating a pre-trained warping network (Ge et al. 2021b). Subsequently, it attaches the deformed clothing to the target person image and feeds it into the diffusion model. While their frontal-view virtual try-on results seem more natural compared to GAN-based methods, they face difficulties in preserving high-frequency details due to the loss of details from the CLIP image encoder (Radford et al. 2021). To address this problem, StableVITON (Kim et al. 2023) attempts to introduce an additional encoder (Zhang, Rao, and Agrawala 2023) to encode the features of frontal-view clothing, and aligns the obtained clothing features through a zero cross-attention block. However, due to the absence of adequate clothing priors, the generated results often struggle to remain faithful to the original clothing. Therefore, we introduce joint attention blocks to extract the global and local features of clothing, and employ view-adaptive selection to choose the clothing features from the two views.

Method

Preliminaries for Diffusion Models

Diffusion models (Ho, Jain, and Abbeel 2020; Rombach et al. 2022) have demonstrated strong capabilities in visual generation, transforming a Gaussian distribution into a target distribution by iterative denoising. In particular, Stable Diffusion (Rombach et al. 2022) is a widely used generative diffusion model, which consists of a CLIP text encoder E_T, a VAE encoder E as well as a decoder D, and a time-conditional denoising model \epsilon_\theta. The text encoder E_T encodes the input text prompt y as conditional input. The VAE encoder E compresses the input image I into latent space to get the latent variable z_0 = E(I). In contrast, the VAE decoder D decodes the output of the backbone from latent space back to pixel space. Through the VAE encoder E, at an arbitrary time step t, the forward process is performed:

    \alpha_t := \prod_{s=1}^{t} (1 - \beta_s), \qquad z_t = \sqrt{\alpha_t}\, z_0 + \sqrt{1 - \alpha_t}\, \epsilon,   (1)

where \epsilon \sim \mathcal{N}(0, 1) is random Gaussian noise and \beta is a predefined variance schedule. The training objective is to acquire a noise prediction network that minimizes the disparity between the predicted noise and the noise added to the ground truth. The loss function can be defined as

    L_{LDM} = \mathbb{E}_{E(I), y, \epsilon \sim \mathcal{N}(0,1), t}\big[ \lVert \epsilon - \epsilon_\theta(z_t, t, E_T(y)) \rVert_2^2 \big],   (2)

where z_t represents the encoded image E(I) with random Gaussian noise \epsilon \sim \mathcal{N}(0, 1) added.

In our work, we use an exemplar-based inpainting model (Yang et al. 2023) as the backbone, which employs an image c rather than text as the prompt and encodes c with the image encoder E_I of CLIP. Thus, the loss function in Eq. (2) can be modified as

    L_{LDM} = \mathbb{E}_{E(I), c, \epsilon \sim \mathcal{N}(0,1), t}\big[ \lVert \epsilon - \epsilon_\theta(z_t, t, E_I(c)) \rVert_2^2 \big].   (3)

Figure 2: Comparison between previous datasets and our proposed MVG dataset. (a) is the dataset used by previous work, which only has clothing and person in the frontal view. In contrast, our dataset (b) offers images from five different views.
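To make Eqs. (1) and (3) concrete, a minimal PyTorch-style sketch of the forward noising step and the exemplar-conditioned denoising loss is given below. The `vae`, `clip_image_encoder`, and `unet` arguments stand in for the VAE encoder E, the CLIP image encoder E_I, and the denoising network \epsilon_\theta; they are assumed interfaces for illustration, not the authors' actual module names.

```python
import torch
import torch.nn.functional as F

def forward_diffuse(z0, t, alphas_cumprod):
    """Eq. (1): z_t = sqrt(alpha_t) * z_0 + sqrt(1 - alpha_t) * eps."""
    eps = torch.randn_like(z0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)   # cumulative product of (1 - beta_s)
    zt = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps
    return zt, eps

def ldm_loss(image, cloth_exemplar, vae, clip_image_encoder, unet,
             alphas_cumprod, num_timesteps=1000):
    """Eq. (3): predict the noise added to E(I), conditioned on E_I(c)."""
    z0 = vae.encode(image)                                  # latent z_0 = E(I)
    t = torch.randint(0, num_timesteps, (z0.shape[0],), device=z0.device)
    zt, eps = forward_diffuse(z0, t, alphas_cumprod)
    cond = clip_image_encoder(cloth_exemplar)               # global condition E_I(c)
    eps_pred = unet(zt, t, cond)                            # eps_theta(z_t, t, E_I(c))
    return F.mse_loss(eps_pred, eps)
```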

Method Overview

While existing virtual try-on methods are designed solely for frontal-view scenarios, we present a novel approach to handle both frontal-view and multi-view virtual try-on tasks, along with a multi-view virtual try-on dataset MVG comprising try-on images captured from five different views. Examples are shown in Figure 2(b). Formally, given a person image x in an arbitrary view, along with a frontal-view clothing c_f and a back-view clothing c_b, our goal is to generate the result of the person wearing the clothing in its view. Considering the substantial differences between the front and back of most clothing, another challenge is to make informed decisions regarding the two provided clothing images based on the target person's pose, ensuring a natural try-on result across multiple views.

In this work, we use an image inpainting diffusion model (Yang et al. 2023) as our backbone. Denote by m the inpainting mask and by a the masked person image x. The model concatenates z_t (with z_0 = E(x)), the encoded clothing-agnostic image E(a), and the resized clothing-agnostic mask m in the channel dimension, and feeds them into the backbone as spatial input. Besides, we use an existing method to pre-warp the clothing and paste it on a. While utilizing the CLIP image encoder to encode clothing as the global condition of the diffusion model, we also introduce an additional encoder (Zhang, Rao, and Agrawala 2023) to encode clothing to provide more refined local conditions. Since both the frontal and back view clothing need to be encoded, directly sending both into the backbone as conditions may result in confusion of clothing features. To alleviate this problem, we propose a view-adaptive selection mechanism. Based on the similarity between the poses of the person and the two clothes, it conducts hard-selection when extracting global features and soft-selection when extracting local features. To preserve semantic information in clothing and enhance high-frequency details in global features using local ones, we introduce joint attention blocks. They first independently align the global and local features to the person features and then selectively fuse them. Figure 3(a) depicts an overview of our proposed method.
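As an illustration of the spatial input described in the overview, the latent z_t, the encoded clothing-agnostic image E(a) (with the pre-warped garment pasted on), and the resized inpainting mask m can be stacked along the channel dimension before entering the U-Net. The sketch below assumes 4-channel latents and a 1-channel mask; it is not the released implementation.

```python
import torch
import torch.nn.functional as F

def build_spatial_input(zt, masked_person, mask, vae):
    """Concatenate [z_t, E(a), resized m] along channels, as in the method overview."""
    agnostic_latent = vae.encode(masked_person)                           # E(a), e.g. B x 4 x h x w
    mask_small = F.interpolate(mask, size=zt.shape[-2:], mode="nearest")  # B x 1 x h x w
    return torch.cat([zt, agnostic_latent, mask_small], dim=1)            # B x 9 x h x w
```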

Figure 3: (a) Overview of MV-VTON. It encodes the frontal and back view clothing into global features using the CLIP image encoder and extracts multi-scale local features through an additional encoder E_l. Both features act as conditional inputs for the decoder of the backbone, and both are selectively extracted through the view-adaptive selection mechanism. (b) Soft-selection modulates the clothing features of the frontal and back views, respectively, based on the similarity between the clothing's pose and the person's pose. Then the features from both views are concatenated in the channel dimension.

tween the garments’ pose and the person’s pose. It means Whi and Wfi , respectively. We also map cif to Cfi through a
that we only select one piece of clothing that is closest to linear layer with weights Wci . Then, we calculate the simi-
the person’s pose as the input of the image encoder, since it larity between the person’s pose and frontal-view clothing’s
is enough to cover global semantic information. When gen- pose to get the selection weights of frontal-view clothing,
erating pre-warped clothing for E(a), the selection is also i.e.,
performed. Implementation details of hard-selection can be Phi (Pfi )T
found in the supplementary material. weights = sof tmax( √ ), (4)
d
Soft-Selection for Local Clothing Features. We utilize an where weights represents the selection weights of frontal-
additional encoder El to extract the multi-scale local features view clothing, and d represents the dimension of these ma-
of frontal and back view clothing, which in the i-th scale are trices. Assuming that the person’s pose is biased towards
denoted as cif and cib , respectively. When reconstructing the the front, as depicted in the second column of Figure 2(b),
try-on results, it may be insufficient to rely solely on the the similarity between the person’s pose and the front view
clothing from either frontal or back view under certain spe- clothing’s pose will be higher. Consequently, the corre-
cific scenes, such as the third column shown in Figure 2(b). sponding clothing features will be enhanced by weights,
In these cases, it may be necessary to incorporate clothing and vice versa. The features of back view clothing cib un-
features from both views. However, simply combining the dergo similar processing. Finally, the two selected clothing
two may lead to confusion of features. Instead, we introduce features are concatenated along the channel dimension as the
soft-selection block to modulate their features, respectively, local condition cil of backbone.
as shown in Figure 3(b).
First, the person’s pose ph , frontal-view clothing’s pose Joint Attention Blocks
pf , and back view clothing’s pose pb are encoded by the Global clothing features cg provide identical conditions for
pose encoder Ep to obtain their respective features Ep (ph ), blocks at each scale of U-Net, and multi-scale local clothing
Ep (pf ), and Ep (pb ). Details of the pose encoder can be found features cl allow for reconstructing more accurate details.
in the supplementary material. When processing frontal- We present joint attention blocks to align cg and cl with the
view clothing, in i-th soft-selection block, we map Ep (ph ) current person features, as shown in Figure 4. To retain most
and Ep (pf ) to Phi and Pfi through a linear layer with weights of the semantic information in global features cg , we use
𝑐𝑙i Experiments
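A sketch of one soft-selection block implementing Eq. (4) is shown below. Token-shaped features (B × N × d), the separate linear projections, and the way the softmax weights are applied to the clothing features (here as an attention-style product) are assumptions made for illustration; only the pose-similarity softmax itself follows the equation above.

```python
import math
import torch
import torch.nn as nn

class SoftSelectionBlock(nn.Module):
    """Weight each view's local clothing features by the similarity between
    the person's pose and that view's clothing pose (a sketch of Eq. (4))."""

    def __init__(self, dim):
        super().__init__()
        self.to_person = nn.Linear(dim, dim)   # W_h: person pose features -> P_h
        self.to_front = nn.Linear(dim, dim)    # W_f: frontal clothing pose -> P_f
        self.to_back = nn.Linear(dim, dim)     # W_b: back clothing pose -> P_b
        self.to_cloth_f = nn.Linear(dim, dim)  # W_c: frontal clothing features -> C_f
        self.to_cloth_b = nn.Linear(dim, dim)  # back clothing features -> C_b
        self.dim = dim

    def select(self, person_pose, cloth_pose, cloth_feat, pose_proj, feat_proj):
        p_h = self.to_person(person_pose)      # B x N x d
        p_c = pose_proj(cloth_pose)            # B x N x d
        c = feat_proj(cloth_feat)              # B x N x d
        weights = torch.softmax(p_h @ p_c.transpose(1, 2) / math.sqrt(self.dim), dim=-1)
        return weights @ c                     # pose-similarity-modulated clothing features

    def forward(self, person_pose, front_pose, back_pose, cloth_front, cloth_back):
        sel_f = self.select(person_pose, front_pose, cloth_front, self.to_front, self.to_cloth_f)
        sel_b = self.select(person_pose, back_pose, cloth_back, self.to_back, self.to_cloth_b)
        return torch.cat([sel_f, sel_b], dim=-1)   # concatenated local condition c_l^i
```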
Figure 4: Overview of the proposed joint attention blocks.

Joint Attention Blocks

Global clothing features c_g provide identical conditions for blocks at each scale of the U-Net, while multi-scale local clothing features c_l allow for reconstructing more accurate details. We present joint attention blocks to align c_g and c_l with the current person features, as shown in Figure 4. To retain most of the semantic information in the global features c_g, we use the local features c_l to refine lost or erroneous detailed texture information in c_g by selective fusion.

Specifically, in the i-th joint attention block, we first calculate self-attention for the current features f_in^i. Then, we deploy a double cross-attention. The queries (Q) come from f_in^i, the global features c_g serve as one set of keys (K) and values (V), while the local features c_l^i serve as another set of keys (K) and values (V). After aligning to the person's pose through cross-attention, the clothing features c_g and c_l^i are selectively fused in the channel-wise dimension, i.e.,

    f_{out}^i = \mathrm{softmax}\!\left( \frac{Q_g^i (K_g^i)^T}{\sqrt{d}} \right) V_g^i + \lambda \odot \mathrm{softmax}\!\left( \frac{Q_l^i (K_l^i)^T}{\sqrt{d}} \right) V_l^i,   (5)

where Q_g^i, K_g^i, V_g^i represent the Q, K, V of the global branch, Q_l^i, K_l^i, V_l^i represent the Q, K, V of the local branch, \lambda is the learnable fusion vector, \odot represents channel-wise multiplication, and f_{out}^i represents the clothing features after selective fusion. By engaging and fusing the global and local clothing features, we can enhance the retention of high-frequency garment details, e.g., texts and patterns.
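The joint attention block of Eq. (5) could be sketched as follows. The two separate `nn.MultiheadAttention` modules, the feed-forward layer, and the head count are assumptions; the paper's figure additionally shares weights between the two cross-attention branches, which this simplified sketch does not do.

```python
import torch
import torch.nn as nn

class JointAttentionBlock(nn.Module):
    """Self-attention on person features, then double cross-attention against
    global (c_g) and local (c_l) clothing features, fused channel-wise with a
    learnable vector lambda (a sketch of Eq. (5))."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_global = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_local = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Parameter(torch.zeros(dim))   # lambda: one weight per channel
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, f_in, c_global, c_local):
        x, _ = self.self_attn(f_in, f_in, f_in)           # person features f_in^i
        g, _ = self.cross_global(x, c_global, c_global)   # align with global clothing features
        l, _ = self.cross_local(x, c_local, c_local)      # align with local clothing features
        f_out = g + self.fuse * l                         # selective channel-wise fusion
        return f_out + self.ff(f_out)
```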
Training Objectives

As stated in the preliminaries, diffusion models learn to generate images from random Gaussian noise. However, the training objective in Eq. (3) is performed in latent space and does not explicitly constrain the generated results in visible image space, resulting in slight differences in color from the ground truth. To alleviate the problem, we additionally employ an \ell_1 loss L_1 and a perceptual loss (Johnson, Alahi, and Fei-Fei 2016) L_{perc}. The L_1 loss is calculated by

    L_1 = \lVert \hat{x} - x \rVert_1,   (6)

where \hat{x} is the reconstructed image using Eq. (1). The perceptual loss is calculated as

    L_{perc} = \sum_{k=1}^{5} \lVert \phi_k(\hat{x}) - \phi_k(x) \rVert_1,   (7)

where \phi_k represents the k-th layer of VGG (Simonyan and Zisserman 2014). In total, the overall training objective can be written as

    L = L_{LDM} + \lambda_1 L_1 + \lambda_{perc} L_{perc},   (8)

where \lambda_1 and \lambda_{perc} are the balancing weights.
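A compact sketch of how the overall objective in Eq. (8) might be assembled is given below. The loss-weight defaults mirror the values reported later in the implementation details (\lambda_1 = 1e-1, \lambda_{perc} = 1e-4); the VGG-16 layer taps and the module layout are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torchvision

class VGGPerceptualLoss(nn.Module):
    """Rough sketch of Eq. (7): L1 distance between VGG features of x_hat and x."""

    def __init__(self):
        super().__init__()
        vgg = torchvision.models.vgg16(weights="DEFAULT").features.eval()  # pretrained feature extractor
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg = vgg
        self.taps = [3, 8, 15, 22, 29]   # assumed indices for the 5 feature levels

    def forward(self, x_hat, x):
        loss, a, b = 0.0, x_hat, x
        for i, layer in enumerate(self.vgg):
            a, b = layer(a), layer(b)
            if i in self.taps:
                loss = loss + (a - b).abs().mean()
        return loss

def total_loss(ldm_loss, x_hat, x, perc_fn, lambda_1=1e-1, lambda_perc=1e-4):
    """Eq. (8): L = L_LDM + lambda_1 * L1 + lambda_perc * L_perc."""
    l1 = (x_hat - x).abs().mean()
    return ldm_loss + lambda_1 * l1 + lambda_perc * perc_fn(x_hat, x)
```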
Experiments

Experiments Settings

Datasets. For the proposed multi-view virtual try-on task, we collect the MVG dataset containing 1,009 samples. Each sample contains five images of the same person wearing the same garment from five different views, for a total of 5,045 images, as shown in Figure 2(b). The image resolution is about 1K. We explain how the dataset is collected and how it is used for MV-VTON in the supplementary material. The proposed method can also be applied to the frontal-view virtual try-on task. Our frontal-view experiments are carried out on the VITON-HD (Lee et al. 2022) and DressCode (Morelli et al. 2022) datasets. They contain more than 10,000 frontal-view person and upper-body clothing image pairs. We follow previous work for the use of them.

Evaluation Metrics. Following previous works (Kim et al. 2023; Morelli et al. 2023), we use four metrics to evaluate the performance of our method: Structural Similarity (SSIM) (Wang et al. 2004), Learned Perceptual Image Patch Similarity (LPIPS) (Zhang et al. 2018), Fréchet Inception Distance (FID) (Heusel et al. 2017), and Kernel Inception Distance (KID) (Bińkowski et al. 2018). Specifically, for the paired test setting, which means directly using the paired data in the dataset, we utilize the above four metrics for evaluation. For the unpaired test setting, which means that the given garment is different from the garment originally worn by the target person, we use FID and KID for evaluation, and in order to distinguish them from the paired setting, we name them FIDu and KIDu, respectively.

Implementation Details. We use Paint by Example (Yang et al. 2023) as the backbone of our method and copy the weights of its encoder to initialize E_l. The hyper-parameter \lambda_1 is set to 1e-1, and \lambda_{perc} is set to 1e-4. We train our model on 2 NVIDIA Tesla A100 GPUs for 40 epochs with a batch size of 4 and a learning rate of 1e-5. We use the AdamW (Loshchilov and Hutter 2017) optimizer with \beta_1 = 0.9, \beta_2 = 0.999.

Comparison Settings. We compare our method with Paint By Example (Yang et al. 2023), PF-AFN (Ge et al. 2021b), GP-VTON (Xie et al. 2023), LaDI-VTON (Morelli et al. 2023), DCI-VTON (Gou et al. 2023), StableVITON (Kim et al. 2023), and IDM-VTON (Choi et al. 2024) on both frontal-view and multi-view virtual try-on tasks. For multi-view virtual try-on, we compare these methods on the proposed MVG dataset. For the sake of fairness, we fine-tune the previous methods on the MVG dataset according to their original training settings. Since previous methods can only input a single clothing image, we input the frontal and back view clothing respectively and select the best result. For frontal-view virtual try-on, we compare these methods on the VITON-HD and DressCode datasets. Following previous works' settings, the proposed MV-VTON only inputs one frontal-view garment during training and inference.

Methods | Reference | MVG (LPIPS↓ / SSIM↑ / FID↓ / KID↓) | VITON-HD (LPIPS↓ / SSIM↑ / FID↓ / KID↓) | DressCode - Upper Body (LPIPS↓ / SSIM↑ / FID↓ / KID↓)
Paint by Example | CVPR23 | 0.120 / 0.880 / 54.38 / 14.95 | 0.150 / 0.843 / 13.78 / 4.48 | 0.078 / 0.899 / 15.21 / 4.51
PF-AFN | CVPR21 | 0.139 / 0.873 / 49.47 / 12.81 | 0.141 / 0.855 / 7.76 / 4.19 | 0.091 / 0.902 / 13.11 / 6.29
GP-VTON | CVPR23 | - / - / - / - | 0.085 / 0.889 / 6.25 / 0.77 | 0.236 / 0.781 / 19.37 / 8.07
LaDI-VTON | MM23 | 0.069 / 0.921 / 29.14 / 4.39 | 0.094 / 0.872 / 7.08 / 1.49 | 0.063 / 0.922 / 11.85 / 3.20
DCI-VTON | MM23 | 0.062 / 0.929 / 25.71 / 0.95 | 0.074 / 0.893 / 5.52 / 0.57 | 0.043 / 0.937 / 11.87 / 1.91
StableVITON | CVPR24 | 0.063 / 0.929 / 23.52 / 0.46 | 0.073 / 0.888 / 6.15 / 1.34 | 0.040 / 0.937 / 10.18 / 1.70
IDM-VTON | ECCV24 | 0.095 / 0.896 / 34.66 / 5.33 | 0.135 / 0.826 / 14.36 / 8.63 | 0.066 / 0.912 / 13.88 / 5.39
Ours | - | 0.050 / 0.936 / 22.18 / 0.35 | 0.069 / 0.897 / 5.43 / 0.49 | 0.040 / 0.941 / 8.26 / 1.39

Table 1: Quantitative comparison with previous work on the paired setting. For the multi-view virtual try-on task, we show results on our proposed MVG dataset. For the frontal-view virtual try-on task, we show results on the VITON-HD (Lee et al. 2022) and DressCode (Morelli et al. 2022) datasets. The best results are bolded. Note that all previous works have been finetuned on our proposed MVG dataset when comparing on the multi-view virtual try-on task.
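For reference, the paired and unpaired protocols described under Evaluation Metrics could be scripted with the torchmetrics package as sketched below; import paths and default arguments may differ across torchmetrics versions, so treat this as an assumption-laden example rather than the evaluation code used in the paper.

```python
import torch
from torchmetrics.image import StructuralSimilarityIndexMeasure
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
lpips = LearnedPerceptualImagePatchSimilarity(net_type="vgg")
fid = FrechetInceptionDistance(feature=2048)
kid = KernelInceptionDistance(subset_size=50)   # subset size must not exceed the sample count

def update_unpaired(pred, real):
    """Unpaired setting (FIDu / KIDu): only distribution-level metrics apply."""
    fid.update((real * 255).to(torch.uint8), real=True)
    fid.update((pred * 255).to(torch.uint8), real=False)
    kid.update((real * 255).to(torch.uint8), real=True)
    kid.update((pred * 255).to(torch.uint8), real=False)

def update_paired(pred, target):
    """Paired setting: ground-truth try-on images are available, so SSIM/LPIPS apply too."""
    ssim.update(pred, target)
    lpips.update(pred.clamp(0, 1) * 2 - 1, target.clamp(0, 1) * 2 - 1)  # LPIPS expects [-1, 1]
    update_unpaired(pred, target)
```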


Figure 5: Qualitative comparisons on multi-view virtual try-on task with MVG dataset.

Figure 6: Qualitative comparisons on frontal-view virtual try-on task with VITON-HD and DressCode datasets.

Quantitative Evaluation

Table 1 reports the quantitative results on the paired setting, and Table 2 shows the unpaired setting's results. On the multi-view virtual try-on task, as can be seen, thanks to the view-adaptive selection mechanism, our method can reasonably select clothing features according to the person's pose, so it is better than existing methods on various metrics, especially LPIPS and SSIM. Furthermore, owing to the joint attention blocks, our approach excels in preserving high-frequency details of the original garments across both frontal-view and multi-view virtual try-on scenarios, thus achieving superior performance in these metrics.

Qualitative Evaluation

Multi-View Virtual Try-On. As shown in Figure 5, MV-VTON generates more realistic multi-view results compared to the previous five methods. Specifically, in the first row, due to the lack of adaptive selection of clothes, previous methods have difficulty in generating the hood of the original cloth. Moreover, in the second row, previous methods often struggle to maintain fidelity to the original garments. In contrast, our method effectively addresses the aforementioned problems and generates high-fidelity results. We provide more results of multi-view virtual try-on in the supplementary materials.

Frontal-View Virtual Try-On. As shown in Figure 6, our method also demonstrates superior performance over existing methods on the frontal-view virtual try-on task, particularly in retaining clothing details. Specifically, our method not only faithfully generates complex patterns (in the first row), but also better preserves the literal 'Wrangler' in the clothing (in the second row). We provide more qualitative comparisons in the supplementary materials, as well as dressing results under complex human pose conditions.
Figure 7: Visualization of view-adaptive selection's effect.

Method | MVG (FIDu↓ / KIDu↓) | VITON-HD (FIDu↓ / KIDu↓)
Paint by Example | 43.79 / 5.92 | 17.27 / 4.56
PF-AFN | 47.38 / 7.04 | 21.18 / 6.57
GP-VTON | - / - | 9.11 / 1.21
LaDI-VTON | 36.61 / 3.39 | 9.55 / 1.83
DCI-VTON | 36.03 / 3.79 | 8.93 / 1.07
StableVITON | 35.85 / 4.22 | 9.86 / 1.09
IDM-VTON | 40.73 / 5.74 | 18.27 / 10.43
Ours | 33.44 / 2.69 | 8.67 / 0.78

Table 2: Unpaired setting's quantitative results on our MVG dataset and the VITON-HD dataset. The best results are bolded.

Hard | Soft | LPIPS↓ / SSIM↑ / FID↓ / KID↓ / FIDu↓ / KIDu↓
× | × | 0.068 / 0.925 / 25.13 / 0.77 / 35.28 / 3.24
√ | × | 0.064 / 0.928 / 24.58 / 0.62 / 34.67 / 3.05
× | √ | 0.052 / 0.934 / 22.18 / 0.43 / 33.47 / 2.74
√ | √ | 0.050 / 0.936 / 22.18 / 0.35 / 33.44 / 2.69

Table 3: Ablation study of our proposed view-adaptive selection mechanism on the MVG dataset.

Dataset | Global | Local | LPIPS↓ / SSIM↑ / FID↓ / KID↓ / FIDu↓ / KIDu↓
MVG | √ | × | 0.062 / 0.929 / 25.71 / 0.95 / 36.01 / 3.78
MVG | × | √ | 0.058 / 0.931 / 26.16 / 1.21 / 36.29 / 3.91
MVG | √ | √ | 0.050 / 0.936 / 22.18 / 0.35 / 33.44 / 2.69
VITON-HD | √ | × | 0.074 / 0.893 / 5.52 / 0.57 / 8.93 / 1.07
VITON-HD | × | √ | 0.070 / 0.896 / 5.76 / 0.81 / 9.15 / 1.09
VITON-HD | √ | √ | 0.069 / 0.897 / 5.43 / 0.49 / 8.67 / 0.78

Table 4: Ablation study of joint attention blocks on the MVG and VITON-HD datasets.

Figure 8: Visualization of joint attention blocks' effect.
Ablation Studies

Effect of View-Adaptive Selection. We investigate the effect of view-adaptive selection on the multi-view virtual try-on task. Specifically, no hard-selection means that we directly concatenate the two garments' features encoded by CLIP, and no soft-selection means that the two clothing features are concatenated without passing through the soft-selection blocks. Comparison results are shown in Table 3 and Figure 7. As can be seen, the performance is greatly reduced without hard-selection and soft-selection. No hard-selection confuses the two views' cloth features, as shown by the blurriness of the 'POP' text in Figure 7. In addition, no soft-selection causes the model to lose some cloth information when processing the side-view situation, such as the missing white hood and cuffs in Figure 7.

Effect of Joint Attention Blocks. In order to demonstrate the effectiveness of fusing global and local features through joint attention blocks, we discard the global feature extraction branch and the local feature extraction branch, respectively. Results are shown in Table 4 and Figure 8. As can be seen, relying solely on global features may lead to loss of details, such as the distorted text 'VANS' in the first row and the missing letter 'C' in the second row. Moreover, if only local features are provided, the results may also have unfaithful textures, such as artifacts on the person's chest. Compared to them, we fuse global and local features through joint attention blocks, which can refine details in garments while preserving semantic information.

Conclusion

We introduce a novel and practical Multi-View Virtual Try-ON (MV-VTON) task, which aims at using the frontal and back clothing to reconstruct the dressing results of a person from multiple views. To achieve the task, we propose a diffusion-based method. Specifically, the view-adaptive selection mechanism extracts more reasonable clothing features based on the similarity between the poses of a person and the two clothes. The joint attention block aligns the global and local features of the selected clothing to the target person, and fuses them. In addition, we collect a multi-view garment dataset for this task. Extensive experiments demonstrate that the proposed method achieves state-of-the-art performance on both frontal-view and multi-view virtual try-on tasks, compared with existing methods.
Acknowledgments

This work was supported by the National Key R&D Program of China (2022YFA1004100).

Appendix

IMPLEMENTATION DETAILS

Hard-Selection. In this section, we present more details about the proposed hard-selection for global clothing features. Specifically, in the multi-view virtual try-on task, we use OpenPose (Cao et al. 2017; Simon et al. 2017; Wei et al. 2016) to extract the skeleton images of the target person, frontal clothing, and back clothing as pose information p_h, p_f, and p_b, respectively. After that, we decide whether to use the frontal-view clothing or the back-view clothing based on the relative positions of the target person's left arm and right arm in the skeleton image. As shown in Figure A, if the right arm appears positioned to the left of the left arm in the skeleton image (columns one to three in Figure A), the frontal-view clothing is chosen; otherwise, the back-view clothing is preferred (columns four and five in Figure A). In addition, following previous works (Gou et al. 2023; Xie et al. 2023; Morelli et al. 2023), we adopt an additional warping network (Ge et al. 2021b; Kim et al. 2023) to obtain the pre-warped cloth.

Figure A: Visualization of the person and corresponding poses. We select one of the garments based on the relative positions of the left and right arms in the skeleton image when performing hard-selection on the multi-view virtual try-on task.
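A minimal sketch of this hard-selection rule, assuming OpenPose-style keypoints are available as an (num_joints × 3) array, is given below. The specific joint indices are illustrative assumptions; the text above only specifies that the decision depends on the relative horizontal positions of the left and right arms in the skeleton image.

```python
import numpy as np

# OpenPose BODY_25 / COCO-style wrist indices; these particular indices are an
# assumption for illustration, not a detail stated in the paper.
RIGHT_WRIST, LEFT_WRIST = 4, 7

def hard_select(person_keypoints, cloth_front, cloth_back):
    """Pick the garment view whose pose best matches the person: if the right
    arm appears to the left of the left arm in image coordinates, the person is
    seen roughly from the front, so the frontal garment is chosen."""
    kp = np.asarray(person_keypoints)          # (num_joints, 3): x, y, confidence
    right_x, left_x = kp[RIGHT_WRIST, 0], kp[LEFT_WRIST, 0]
    return cloth_front if right_x < left_x else cloth_back
```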

Pose Encoder. The pose encoder is used to extract features of skeleton images. It is a tiny network that contains three blocks, followed by layer normalization (Ba, Kiros, and Hinton 2016). Each block comprises one convolution layer, one GELU (Hendrycks and Gimpel 2016) activation layer, and a down-sampling operation. We utilize the acquired pose embeddings as input for the proposed soft-selection block.
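Based on the description above, the pose encoder could look roughly like the following; the channel widths, kernel size, and pooling-based down-sampling are assumptions, while the three conv + GELU + down-sample blocks followed by layer normalization mirror the text.

```python
import torch.nn as nn

class PoseEncoder(nn.Module):
    """Tiny pose encoder: three blocks of convolution + GELU + down-sampling,
    followed by layer normalization (a sketch with assumed channel widths)."""

    def __init__(self, in_ch=3, dims=(64, 128, 256)):
        super().__init__()
        blocks, prev = [], in_ch
        for d in dims:
            blocks += [
                nn.Conv2d(prev, d, kernel_size=3, padding=1),
                nn.GELU(),
                nn.AvgPool2d(kernel_size=2),   # down-sampling operation
            ]
            prev = d
        self.blocks = nn.Sequential(*blocks)
        self.norm = nn.LayerNorm(dims[-1])

    def forward(self, skeleton_img):
        feat = self.blocks(skeleton_img)           # B x C x H' x W'
        tokens = feat.flatten(2).transpose(1, 2)   # B x (H'W') x C
        return self.norm(tokens)                   # pose embeddings for soft-selection
```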
MVG Dataset. To construct the dataset for multi-view virtual try-on (MV-VTON), we first collect a large number of videos from the YOOX NET-A-PORTER [1], Taobao [2], and TikTok [3] websites, then filter out 1,009 videos in which the person turns at least 180 degrees while wearing clothes. Afterwards, we divide each video into frames and handpick 5 frames to constitute a sample within our MVG dataset. Across these 5 frames, the person is captured from various angles, approximately spanning 0 (i.e., frontal view), 45, 90, 135, and 180 (i.e., back view) degrees, as shown in the first row of Figure B. In addition, following the previous DressCode dataset (Morelli et al. 2022), we employ the SCHP model (Li et al. 2020) to extract the corresponding human parsing maps and utilize DensePose (Güler, Neverova, and Kokkinos 2018) to obtain the person's dense labels, as shown in the second and third rows of Figure B. Human parsing maps can be utilized to generate cloth-agnostic person images, which are necessary for the training and inference processes.

[1] https://net-a-porter.com
[2] https://taobao.com
[3] https://douyin.com

Figure B: Examples of human parsing maps and dense pose in our dataset. The parsing maps can be used to synthesize cloth-agnostic person images.
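To illustrate how an MVG sample (five person views plus the frontal and back garment images) could be consumed for training, a hypothetical PyTorch Dataset is sketched below. The directory layout and file names are invented for illustration and are not the dataset's actual format.

```python
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset

class MVGSample(Dataset):
    """Minimal sketch of reading one MVG sample from disk (hypothetical layout)."""

    VIEWS = ["000", "045", "090", "135", "180"]   # approximate turning angles in degrees

    def __init__(self, root, transform=None):
        self.samples = sorted(Path(root).iterdir())
        self.transform = transform or (lambda x: x)

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        d = self.samples[idx]
        persons = [self.transform(Image.open(d / f"person_{v}.jpg")) for v in self.VIEWS]
        cloth_f = self.transform(Image.open(d / "cloth_front.jpg"))
        cloth_b = self.transform(Image.open(d / "cloth_back.jpg"))
        return persons, cloth_f, cloth_b
```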
MORE QUALITATIVE RESULTS

Comparison Results on Multi-View VTON. In this section, we present more comparison results on the MVG dataset. Specifically, in the first row of Figure C, due to the lack of adaptive selection of clothes, previous methods have difficulty in generating the hood of the original cloth. Moreover, in the second and third rows, previous methods often struggle to maintain fidelity to the original garments. In contrast, our method effectively addresses the aforementioned problems and generates high-fidelity results.

Figure C: Comparison results on multi-view virtual try-on task.

Multi-View Results on Multi-View VTON. In this section, we present more multi-view results on the MVG dataset. Specifically, as shown in Figure E, we show multiple groups of try-on results for the same person under different views, using the proposed method. In Figure E, the first column displays the frontal-view and back-view garments, the second to fourth columns depict the person from different views, while the fifth to seventh columns showcase the corresponding try-on results. As can be seen, our method can generate realistic dressing-up results of the multi-view person from the given two views of clothing. Furthermore, our method retains the details of the original clothing well (e.g., the buttons in the fifth row) and generates high-fidelity try-on images even under occlusion (e.g., hair occlusion in the second row). In conclusion, the proposed method exhibits outstanding performance on the multi-view virtual try-on task.

Complex Human Pose Results on Frontal-View VTON. In this section, we provide more VTON results under complex human pose conditions in Figure F. It can be seen that our method can also generate high-fidelity try-on results even when the target person has a more complex pose.

Comparison Results on Frontal-View VTON. In this section, we show more visual comparison results on the VITON-HD (Choi et al. 2021) and DressCode (Morelli et al. 2022) datasets. The previous works include Paint By Example (Yang et al. 2023), PF-AFN (Ge et al. 2021b), GP-VTON (Xie et al. 2023), LaDI-VTON (Morelli et al. 2023), DCI-VTON (Gou et al. 2023), and StableVITON (Kim et al. 2023). The results are shown in Figure G. In the first and second rows of Figure G, it can be seen that our method better preserves the shape of the original clothing (e.g., the cuff in the second row), compared to the previous methods. In addition, our method outperforms previous methods in preserving high-frequency details, such as the patterns on clothing in the fourth and sixth rows. Moreover, in contrast to previous methods, MV-VTON is not constrained by specific types of clothing and can achieve highly realistic effects across a wide range of garment styles (e.g., the garment in the third row and the collar in the eighth row). In summary, our method also has superiority on the frontal-view virtual try-on task.

Figure D: Visualization of bad cases on VITON-HD dataset.

High Resolution Results on Frontal-View VTON. In this section, we present more results at 1024×768 resolution on the VITON-HD (Choi et al. 2021) and DressCode (Morelli et al. 2022) datasets, as shown in Figure H. Specifically, we utilize the model trained at 512×384 resolution to directly test at 1024×768 resolution. Despite the difference in resolution between training and testing, our method can still produce high-fidelity try-on results. For instance, the generated images can preserve both the intricate patterns and text adorning the clothing (in the first row) while also effectively maintaining their original shapes (in the last row).

LIMITATIONS

Despite outperforming previous methods on both frontal-view and multi-view virtual try-on tasks, our method does not perform well in all cases. Figure D displays some unsatisfactory try-on results. As can be seen, although our method can preserve the shape and texture of the original clothing (e.g., the 'DIESEL' text in the first row), it is difficult for it to fully preserve some smaller or more complex details (e.g., the parts circled in red). The reason for this phenomenon may be that these details are easily lost when inpainting in latent space. We will try to solve this issue in future work.

Figure E: Multi-view results on multi-view virtual try-on task.



Figure F: Complex human pose results on frontal-view virtual try-on task.



Figure G: Qualitative comparisons on frontal-view virtual try-on task.



Figure H: Qualitative results of 1024×768 resolution on frontal-view virtual try-on task.


References

Ba, J. L.; Kiros, J. R.; and Hinton, G. E. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
Bai, S.; Zhou, H.; Li, Z.; Zhou, C.; and Yang, H. 2022. Single stage virtual try-on via deformable attention flows. In European Conference on Computer Vision, 409–425. Springer.
Bińkowski, M.; Sutherland, D. J.; Arbel, M.; and Gretton, A. 2018. Demystifying mmd gans. arXiv preprint arXiv:1801.01401.
Cao, Z.; Simon, T.; Wei, S.-E.; and Sheikh, Y. 2017. Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. In CVPR.
Choi, S.; Park, S.; Lee, M.; and Choo, J. 2021. Viton-hd: High-resolution virtual try-on via misalignment-aware normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14131–14140.
Choi, Y.; Kwak, S.; Lee, K.; Choi, H.; and Shin, J. 2024. Improving diffusion models for virtual try-on. arXiv preprint arXiv:2403.05139.
Dong, H.; Liang, X.; Shen, X.; Wang, B.; Lai, H.; Zhu, J.; Hu, Z.; and Yin, J. 2019. Towards multi-pose guided virtual try-on network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 9026–9035.
Gal, R.; Alaluf, Y.; Atzmon, Y.; Patashnik, O.; Bermano, A. H.; Chechik, G.; and Cohen-Or, D. 2022. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618.
Ge, C.; Song, Y.; Ge, Y.; Yang, H.; Liu, W.; and Luo, P. 2021a. Disentangled cycle consistency for highly-realistic virtual try-on. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16928–16937.
Ge, Y.; Song, Y.; Zhang, R.; Ge, C.; Liu, W.; and Luo, P. 2021b. Parser-free virtual try-on via distilling appearance flows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8485–8493.
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2020. Generative adversarial networks. Communications of the ACM, 63(11): 139–144.
Gou, J.; Sun, S.; Zhang, J.; Si, J.; Qian, C.; and Zhang, L. 2023. Taming the Power of Diffusion Models for High-Quality Virtual Try-On with Appearance Flow. In Proceedings of the 31st ACM International Conference on Multimedia, 7599–7607.
Güler, R. A.; Neverova, N.; and Kokkinos, I. 2018. Densepose: Dense human pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7297–7306.
He, S.; Song, Y.-Z.; and Xiang, T. 2022. Style-based global appearance flow for virtual try-on. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3470–3479.
Hendrycks, D.; and Gimpel, K. 2016. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415.
Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; and Hochreiter, S. 2017. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in Neural Information Processing Systems, 30.
Ho, J.; Jain, A.; and Abbeel, P. 2020. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33: 6840–6851.
Johnson, J.; Alahi, A.; and Fei-Fei, L. 2016. Perceptual losses for real-time style transfer and super-resolution. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, 694–711. Springer.
Kim, J.; Gu, G.; Park, M.; Park, S.; and Choo, J. 2023. StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On. arXiv preprint arXiv:2312.01725.
Lee, S.; Gu, G.; Park, S.; Choi, S.; and Choo, J. 2022. High-resolution virtual try-on with misalignment and occlusion-handled conditions. In European Conference on Computer Vision, 204–219. Springer.
Li, P.; Xu, Y.; Wei, Y.; and Yang, Y. 2020. Self-correction for human parsing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(6): 3260–3271.
Loshchilov, I.; and Hutter, F. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
Miyato, T.; Kataoka, T.; Koyama, M.; and Yoshida, Y. 2018. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957.
Morelli, D.; Baldrati, A.; Cartella, G.; Cornia, M.; Bertini, M.; and Cucchiara, R. 2023. LaDI-VTON: latent diffusion textual-inversion enhanced virtual try-on. In Proceedings of the 31st ACM International Conference on Multimedia, 8580–8589.
Morelli, D.; Fincato, M.; Cornia, M.; Landi, F.; Cesari, F.; and Cucchiara, R. 2022. Dress code: high-resolution multi-category virtual try-on. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2231–2235.
Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 8748–8763. PMLR.
Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; and Ommer, B. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10684–10695.
Ruiz, N.; Li, Y.; Jampani, V.; Pritch, Y.; Rubinstein, M.; and Aberman, K. 2023. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 22500–22510.
Simon, T.; Joo, H.; Matthews, I.; and Sheikh, Y. 2017. Hand Keypoint Detection in Single Images using Multiview Bootstrapping. In CVPR.
Simonyan, K.; and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Song, J.; Meng, C.; and Ermon, S. 2020. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502.
Wang, B.; Zheng, H.; Liang, X.; Chen, Y.; Lin, L.; and Yang, M. 2018. Toward characteristic-preserving image-based virtual try-on network. In Proceedings of the European Conference on Computer Vision (ECCV), 589–604.
Wang, J.; Sha, T.; Zhang, W.; Li, Z.; and Mei, T. 2020. Down to the last detail: Virtual try-on with fine-grained details. In Proceedings of the 28th ACM International Conference on Multimedia, 466–474.
Wang, Z.; Bovik, A. C.; Sheikh, H. R.; and Simoncelli, E. P. 2004. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4): 600–612.
Wei, S.-E.; Ramakrishna, V.; Kanade, T.; and Sheikh, Y. 2016. Convolutional pose machines. In CVPR.
Wei, Y.; Zhang, Y.; Ji, Z.; Bai, J.; Zhang, L.; and Zuo, W. 2023. Elite: Encoding visual concepts into textual embeddings for customized text-to-image generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 15943–15953.
Xie, Z.; Huang, Z.; Dong, X.; Zhao, F.; Dong, H.; Zhang, X.; Zhu, F.; and Liang, X. 2023. Gp-vton: Towards general purpose virtual try-on via collaborative local-flow global-parsing learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 23550–23559.
Yang, B.; Gu, S.; Zhang, B.; Zhang, T.; Chen, X.; Sun, X.; Chen, D.; and Wen, F. 2023. Paint by example: Exemplar-based image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18381–18391.
Yang, H.; Zhang, R.; Guo, X.; Liu, W.; Zuo, W.; and Luo, P. 2020. Towards photo-realistic virtual try-on by adaptively generating-preserving image content. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7850–7859.
Yu, F.; Hua, A.; Du, C.; Jiang, M.; Wei, X.; Peng, T.; Xu, L.; and Hu, X. 2023. VTON-MP: Multi-Pose Virtual Try-On via Appearance Flow and Feature Filtering. IEEE Transactions on Consumer Electronics.
Zhang, L.; Rao, A.; and Agrawala, M. 2023. Adding Conditional Control to Text-to-Image Diffusion Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 3836–3847.
Zhang, R.; Isola, P.; Efros, A. A.; Shechtman, E.; and Wang, O. 2018. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 586–595.
Zhu, L.; Yang, D.; Zhu, T.; Reda, F.; Chan, W.; Saharia, C.; Norouzi, M.; and Kemelmacher-Shlizerman, I. 2023. Tryondiffusion: A tale of two unets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4606–4615.
