
Cloth2Tex: A Customized Cloth Texture Generation Pipeline for 3D Virtual

Try-On

Daiheng Gao1∗  Xu Chen2,3∗  Xindi Zhang1  Qi Wang1  Ke Sun1  Bang Zhang1  Liefeng Bo1  Qixing Huang4
1 Alibaba XR Lab   2 ETH Zurich, Department of Computer Science   3 Max Planck Institute for Intelligent Systems   4 The University of Texas at Austin

arXiv:2308.04288v1 [cs.CV] 8 Aug 2023

Figure 1. We propose Cloth2Tex, a novel pipeline for converting 2D images of clothing to high-quality 3D textured meshes that can be
draped onto 3D humans. In contrast to previous methods, Cloth2Tex supports a variety of clothing types. Results of 3D textured meshes
produced by our method as well as the corresponding input images are shown above.

Abstract

Fabricating and designing 3D garments has become extremely demanding with the increasing need for synthesizing realistic dressed persons for a variety of applications, e.g. 3D virtual try-on, digitalization of 2D clothes into 3D apparel, and cloth animation. This necessitates a simple and straightforward pipeline to obtain high-quality textures from simple input, such as 2D reference images. However, traditional warping-based texture generation methods require a significant number of control points to be selected manually for each type of garment, which is a time-consuming and tedious process. We propose a novel method, called Cloth2Tex, which eliminates this human burden. Cloth2Tex is a self-supervised method that generates texture maps with reasonable layout and structural consistency. Another key feature of Cloth2Tex is that it can be used to support high-fidelity texture inpainting. This is done by combining Cloth2Tex with a prevailing latent diffusion model. We evaluate our approach both qualitatively and quantitatively and demonstrate that Cloth2Tex can generate high-quality texture maps and achieve the best visual effects in comparison to other methods. Project page: tomguluson92.github.io/projects/cloth2tex/

1. Introduction

The advancement of AR/VR and 3D graphics has opened up new possibilities for the fashion e-commerce industry. Customers can now virtually try on clothes on their avatars in 3D, which can help them make more informed purchase decisions. However, most clothing assets are currently presented in 2D catalog images, which are incompatible with 3D graphics pipelines. It is therefore critical to produce 3D clothing assets automatically from these existing 2D images, making 3D virtual try-on accessible to everyone.
Figure 2. Problem of warping-based texture generation algorithms: partially filled UV texture maps with large missing holes, as highlighted in yellow.

Towards this goal, the research community has been developing algorithms [19, 20, 37] that can transfer 2D images into 3D textures of clothing mesh models. The key to producing 3D textures from 2D images is to determine the correspondences between the catalog images and the UV textures. Conventionally, this is achieved via the Thin-Plate-Spline (TPS) method [3], which approximates the dense correspondences from a small set of corresponding key points. In industrial applications, these key points are annotated manually and densely for each clothing instance to achieve good quality. With deep learning models, automatic key point detectors [19, 35] have been proposed to detect key points for clothing. However, as seen in Fig. 2, the inherent self-occlusions (e.g. sleeves occluded by the main fabric) are intractable for TPS warping-based approaches, leading to erroneous and incomplete texture maps. Several works have attempted to use generative models to refine texture maps. However, such a refinement strategy has demonstrated success only on a small set of clothing types, i.e. T-shirts, pants, and shorts. This is because TPS cannot produce satisfactory initial texture maps on all clothing types, and a large training dataset covering high-quality texture maps of diverse clothing types is missing. Pix2Surf [20], a SMPL [18]-based virtual try-on algorithm, has automated the process of texture generation with no apparent cavity or void. However, due to its clothing-specific model, Pix2Surf is limited in its ability to generalize to clothes with arbitrary shapes.

This paper aims to automatically convert 2D reference clothing images into 3D textured clothing meshes for a larger diversity of clothing types. To this end, we first contribute template mesh models for 10+ different clothing types (well beyond current SOTAs: Pix2Surf (4) and [19] (2)). Next, instead of using the Thin-Plate-Spline (TPS) method as previous methods do, we incorporate neural mesh rendering [17] to directly establish dense correspondences between 2D catalog images and the UV textures of the meshes. This results in higher-quality initial texture maps for all clothing types. We achieve this by optimizing the 3D clothing mesh models and textures to align with the catalog images' color, silhouette, and key points.

Although the texture maps from neural rendering are of higher quality, they still need refinement due to missing regions. Learning to refine these texture maps across different clothing types requires a large dataset of high-quality 3D textures, which is infeasible to acquire. We tackle this problem by leveraging the recently emerging latent diffusion model (LDM) [24] as a data simulator. Specifically, we use the canny-edge version of ControlNet [39] to generate large-scale, high-quality texture maps with various patterns and colors. In addition to the high-quality ground-truth textures, the refinement network requires the corresponding initial defective texture maps obtained from neural rendering. To get such data, we render the high-quality texture maps into catalog images and then run our neural rendering pipeline to re-obtain the texture maps from the catalog images, which now contain defects as desired. With these pairs of high-quality complete texture maps and defective texture maps from the neural renderer, we train a high-resolution image translation model that refines the defective texture maps.

Our method can produce high-quality 3D textured clothing from 2D catalog images of various clothing types. In our experiments, we compare our approach with state-of-the-art techniques for inferring 3D clothing textures and find that our method supports more clothing types and demonstrates superior texture quality. In addition, we carefully verify the effectiveness of individual components via a thorough ablation study.

In summary, we contribute Cloth2Tex, a pipeline that can produce high-quality 3D textured clothing of various types based on 2D catalog images, which is achieved via
• a) 3D parametric clothing mesh models of 10+ different categories that will be publicly available,
• b) an approach based on neural mesh rendering to transfer 2D catalog images into texture maps of clothing meshes,
• c) a data simulation approach for training a texture refinement network, built on top of blendshape-driven meshes and LDM-based textures.
2. Related Works

Learning 3D Textures. Our method is related to learning texture maps for 3D meshes. Texturify [27] learns to generate high-fidelity texture maps by rendering multiple 2D images from different viewpoints and aligning the distribution of rendered images with real image observations. Yu et al. [38] adopt a similar method, rendering images from different viewpoints and then discriminating the images with separate discriminators. With the emergence of diffusion models [7, 31], the recent work Text2Tex [5] exploits 2D diffusion models for 3D texture synthesis. Due to the strong generalization ability of diffusion models [11, 24] trained on the large-scale corpus LAION-5B [26], i.e. Stable Diffusion [24], the textured meshes generated by Text2Tex are of superior quality and contain rich details. Our method is related to these approaches in that we also utilize diffusion models for 3D texture learning. However, different from previous approaches, we use latent diffusion models only to generate synthetic texture maps to train our texture inpainting model, and our focus lies in learning 3D textures corresponding to a specific pair of 2D reference images instead of random or text-guided generation.

Texture-based 3D Virtual Try-On. Wang et al. [34] provide a sketch-based network that infers both 2D garment sewing patterns and the draped 3D garment mesh from 2D sketches. In practice, however, many applications require inferring the 3D garment and its texture from 2D catalog images. To achieve this goal, Pix2Surf [20] is the first work that creates textured 3D garments automatically from front/back view images of a garment. This is achieved by predicting dense correspondences between the 2D images and the 3D mesh template using a trained network. However, due to erroneous correspondence predictions, particularly on unseen test samples, Pix2Surf has difficulty preserving high-frequency details and tends to blur out fine-grained details such as thin lines and logos. To avoid such problems, Majithia et al. [19] propose to use a warping-based method (TPS) [3] instead, combined with a deep texture inpainting network built upon MADF [40]. However, as mentioned in the introduction, warping-based methods generally require dense and accurate corresponding key points in images and UV maps, and have only demonstrated successful results on two simple clothing categories, T-shirts and trousers. In contrast to previous work, Cloth2Tex aims to achieve automatic high-quality texture learning for a broader range of garment categories. To this end, we use neural rendering instead of warping, which yields better texture quality on more complex garment categories. We further utilize latent diffusion models (LDMs) to synthesize high-quality texture maps of various clothing categories to train the inpainting network.

3. Method

We propose Cloth2Tex, a two-stage approach that converts 2D images into textured 3D garments. The garments are represented as polygon meshes, which can be draped and simulated on 3D human bodies. The overall pipeline is illustrated in Fig. 3. The pipeline's first stage (Phase I) determines the 3D garment shape and coarse texture. We do this by registering our parametric garment meshes onto catalog images using a neural mesh renderer. The pipeline's second stage (Phase II) recovers fine textures from the coarse estimate. We use image translation networks trained on large-scale data synthesized by pre-trained latent diffusion models. The mesh templates for individual clothing categories are a pre-requirement for our pipeline. We obtain these templates through manual artist design and will make them publicly available. Implementation details are placed in the supp. material due to the page limit.

3.1. Pre-requirement: Template Meshes

For the sake of both practicality and convenience, we design a cloth template mesh (with fixed topology) M for each common garment type (e.g., T-shirts, sweatshirts, baseball jackets, trousers, shorts, skirts, etc.). We then build a deformation graph D [29] to optimize the template mesh vertices, because per-vertex image-based optimization is subject to errors and artifacts due to the high degrees of freedom. Specifically, we construct D with k nodes, which are parameterized with axis angles A ∈ R3 and translations T ∈ R3. The vertex displacements are then derived from the deformation nodes (the number of nodes k depends on the garment type, since different templates have different numbers of vertices and faces). We also manually select several vertices on the mesh templates as landmarks K. The specific requirements of the template mesh are as follows: fewer than 10,000 vertices V, a uniform mesh topology, and an integral UV layout. The vertex count of the templates ranges from skirt (6,116) to windbreaker (9,881). For uniformity, we set the downsampling factor of D to 20 for all templates (details of the template meshes are placed in the supp. material). An integral UV layout means that the UV should be laid out as a whole in terms of front and back, without the further subdivision used in traditional computer graphics. Fabricating an integral UV is not complicated, and it makes the templates well suited for later diffusion-based texture generation. See Sec. 3.3.1 for more details.

3.2. Phase I: Shape and Coarse Texture Generation

The goal of Phase I is to determine the garment shape and a coarse estimate of the UV texture T from the input catalog images (front & back views). We adopt a differentiable rendering approach [17] to determine the UV textures in a self-supervised way without involving trained neural networks. Precisely, we fit our template model to the catalog images by minimizing the difference between the 2D rendering of our mesh model and the target images. The fitting procedure consists of two stages, namely Silhouette Matching and Image-based Optimization. We elaborate on these stages below.
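As an illustration of the deformation-graph parameterization from Sec. 3.1 (a schematic sketch rather than the released implementation; the node positions, skinning weights and helper names are assumptions), the following PyTorch fragment shows how k nodes carrying axis-angle rotations A ∈ R3 and translations T ∈ R3 can drive per-vertex displacements of a fixed-topology template:

```python
import torch
from pytorch3d.transforms import axis_angle_to_matrix

def deform_vertices(verts, node_pos, node_axis_angle, node_trans, skin_w):
    """Embedded-deformation-style warp of template vertices.
    verts:           (V, 3) template vertex positions
    node_pos:        (K, 3) rest positions of the deformation-graph nodes
    node_axis_angle: (K, 3) per-node rotation in axis-angle form (optimized)
    node_trans:      (K, 3) per-node translation (optimized)
    skin_w:          (V, K) normalized vertex-to-node blending weights
    """
    R = axis_angle_to_matrix(node_axis_angle)                   # (K, 3, 3)
    # Node k maps a vertex v to R_k (v - g_k) + g_k + t_k.
    local = verts[None, :, :] - node_pos[:, None, :]            # (K, V, 3)
    warped = torch.einsum('kij,kvj->kvi', R, local)
    warped = warped + node_pos[:, None, :] + node_trans[:, None, :]
    # Blend the K candidate positions with the skinning weights.
    return torch.einsum('vk,kvi->vi', skin_w, warped)           # (V, 3)

# Example sizes follow the T-shirt template (8,523 vertices, 427 nodes).
V, K = 8523, 427
verts = torch.rand(V, 3)
node_pos = verts[torch.randperm(V)[:K]]                         # downsampled nodes
axis_angle = torch.zeros(K, 3, requires_grad=True)              # A in R^3
trans = torch.zeros(K, 3, requires_grad=True)                   # T in R^3
skin_w = torch.softmax(-torch.cdist(verts, node_pos), dim=1)    # toy weights
new_verts = deform_vertices(verts, node_pos, axis_angle, trans, skin_w)
```

With zero rotations and translations the warp reduces to the identity, so the fitting described next starts from the undeformed template.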
Figure 3. Method overview: Cloth2Tex consists of two stages. In Phase I, we determine the 3D garment shape and coarse texture by
registering our parametric garment meshes onto catalog images using a neural mesh renderer. Next, in Phase II, we refine the coarse
estimate of the texture to obtain high-quality fine textures using image translation networks trained on large-scale data synthesized by
pre-trained latent diffusion models. Note that the only component that requires training is the inpainting network. Please watch our video
on the project page for an animated explanation of Cloth2Tex.

3.2.1 Silhouette Matching

We first align the corresponding template mesh to the 2D images based on the 2D landmarks and the silhouette. Here, we use BCRNN [35] to detect landmarks L2d and DenseCLIP [22] to extract the silhouette M. To fit our various types of garments, we finetune BCRNN with 2,000+ manually annotated clothing images per type.

After the mask and landmarks of the input images are obtained, we first perform a global rigid alignment with an automatic cloth scaling method, which adjusts the scaling factor of the mesh vertices according to the overlap of the initial silhouettes of the mesh and the input images and thus ensures a rough agreement of the yielded texture map (see Fig. 8). Specifically, we implement this mechanism by checking the silhouettes of the rendered and reference images and then enlarging or shrinking the scale of the mesh vertices accordingly. After an optimal Intersection over Union (IoU) has been reached, we fix the coefficient and send the scaled template to the next step.

We then fit the silhouette and the landmarks of the template mesh (the landmarks on the template mesh are pre-defined as described in Sec. 3.1) to those detected from the 2D catalog images. To this end, we optimize the deformations of the nodes in the deformation graph by minimizing the following energy terms.

2D Landmark Alignment. Elmk measures the distance between the 2D landmarks L2d detected by BCRNN and the 2D projection of the 3D template mesh keypoints:

Elmk = ∥ Π(K) − L2d ∥2   (1)

where Π denotes the 2D projection of the 3D keypoints.

2D Silhouette Alignment. Esil measures the overlap between the silhouette of M and the mask M predicted by DenseCLIP:

Esil = MaskIoU(Sproj(M), M)   (2)

where Sproj(M) is the silhouette rendered by the differentiable mesh renderer SoftRas [17] and the MaskIoU loss is derived from Kaolin [9].

Merely minimizing Elmk and Esil does not lead to satisfactory results, and the optimization procedure can easily get trapped in local minima. To alleviate this issue, we introduce a couple of regularization terms. We first regularize the deformation using the as-rigid-as-possible loss Earap [28], which penalizes the deviation of the estimated local surface deformations from rigid transformations. Moreover, we further enforce the normal consistency loss Enorm, which measures the normal consistency of each pair of neighboring faces. The overall optimization objective is given as:

wsil Esil + wlmk Elmk + warap Earap + wnorm Enorm   (3)

where w∗ are the respective weights of the losses.

We set large regularization weights warap, wnorm in the initial iterations and then reduce their values progressively during the optimization, so that the final rendered texture aligns with the input images. Please refer to the supp. material for more details.
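The Silhouette Matching stage can be summarized by the following optimization sketch. This is a simplified illustration, not the actual implementation: deform_template, render_silhouette, project_landmarks, mask_iou_loss, arap_loss and normal_consistency_loss stand in for the SoftRas/Kaolin/PyTorch3D components, and the learning rate is illustrative.

```python
import math
import torch

def fit_silhouette(axis_angle, trans, target_mask, target_lmk, n_iters=1000,
                   w_sil=50.0, w_lmk=0.01, w_arap0=50.0, w_norm0=10.0):
    """Optimize the deformation-graph node parameters so that the rendered
    template matches the detected silhouette and landmarks (Eqs. (1)-(3))."""
    opt = torch.optim.Adam([axis_angle, trans], lr=1e-2)
    for it in range(n_iters):
        opt.zero_grad()
        verts = deform_template(axis_angle, trans)        # embedded deformation (Sec. 3.1)
        e_lmk = ((project_landmarks(verts) - target_lmk) ** 2).sum(-1).mean()  # Eq. (1)
        e_sil = mask_iou_loss(render_silhouette(verts), target_mask)           # Eq. (2)
        e_arap = arap_loss(verts)                         # as-rigid-as-possible regularizer
        e_norm = normal_consistency_loss(verts)           # neighboring-face normal agreement
        # Regularization weights start large and are decayed over the schedule.
        decay = 0.5 * (1.0 + math.cos(math.pi * it / n_iters))
        loss = (w_sil * e_sil + w_lmk * e_lmk
                + w_arap0 * decay * e_arap + w_norm0 * decay * e_norm)          # Eq. (3)
        loss.backward()
        opt.step()
    return axis_angle.detach(), trans.detach()
```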

3.2.2 Image-based Optimization

After the shape of the template mesh is aligned with the image silhouette, we optimize the UV texture map T to minimize the difference between the rendered image Irend = Srend(M, T) and the given input catalog images Iin from both sides simultaneously. To avoid any outside interference during the optimization, we only preserve the ambient color and set both the diffuse and specular components to zero in the settings of SoftRas [17] and PyTorch3D [23].

Since the front and back views do not cover the full clothing texture, e.g. the seams between the front and back bodice cannot be recovered well due to occlusions, we use the total variation method [25] to fill in the blanks of the seam-affected UV areas. The total variation loss Etv is defined as the norm of the spatial gradients of the rendered image, ∇x Irend and ∇y Irend:

Etv = ∥∇x Irend∥2 + ∥∇y Irend∥2   (4)

In summary, the energy function for the image-based optimization is defined as:

wimg ∥Iin − Irend∥2 + wtv Etv   (5)

where Iin and Irend are the reference and rendered images. As shown in Fig. 3, T implicitly changes towards the final coarse texture Tcoarse, which ensures that the final rendering is as similar as possible to the input. Please refer to our attached video for a vivid illustration.
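The image-based optimization of Eqs. (4)–(5) can likewise be sketched as follows. This is a simplified illustration that assumes a differentiable render_views(texture) producing ambient-only front and back renderings of the fitted mesh; it is not the authors' code, and the resolution and learning rate are illustrative.

```python
import torch

def tv(img):
    """Squared-gradient total variation of a rendered image, cf. Eq. (4)."""
    return ((img[:, :, 1:] - img[:, :, :-1]) ** 2).mean() + \
           ((img[:, 1:, :] - img[:, :-1, :]) ** 2).mean()

def optimize_texture(ref_front, ref_back, tex_res=512, n_iters=1000,
                     w_img=100.0, w_tv=1.0):
    """ref_front, ref_back: (3, H, W) catalog images in [0, 1]. Returns T_coarse."""
    texture = torch.full((3, tex_res, tex_res), 0.5, requires_grad=True)
    opt = torch.optim.Adam([texture], lr=1e-2)
    for _ in range(n_iters):
        opt.zero_grad()
        rend_f, rend_b = render_views(texture)             # differentiable rendering
        e_img = ((rend_f - ref_front) ** 2).mean() + ((rend_b - ref_back) ** 2).mean()
        e_tv = tv(rend_f) + tv(rend_b)                     # fills occluded/seam regions smoothly
        loss = w_img * e_img + w_tv * e_tv                 # Eq. (5)
        loss.backward()
        opt.step()
    return texture.detach().clamp(0, 1)
```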
3.3. Phase II: Fine texture generation

In Phase II, we refine the coarse texture from Sec. 3.2 and fill in the missing regions. Our approach takes inspiration from the strong and comprehensive capacity of Stable Diffusion (SD), which is a remarkable model by itself for image inpainting, completion, and text-to-image tasks. In fact, there is an entire, growing ecosystem around it: LoRA [12], ControlNet [39], textual inversion [10] and the Stable Diffusion WebUI [1]. A straightforward idea is therefore to resolve our texture completion via SD.

However, we find poor content consistency between the inpainted blanks and the original textured UV. This is because UV data in our setting rarely appears in LAION-5B [26], the training dataset of SD. In other words, the semantic compositions of LAION-5B and UV (cloth) textures are quite different, which makes it challenging for SD to generalize.

To address this issue, we first leverage ControlNet [39] to generate ∼2,000+ high-quality complete textures per template and render emission-only images under the front and back views. Next, we use Phase I again to recover the corresponding coarse textures. After collecting the pairs of coarse and fine textures, we train an inpainting network to fill the missing regions in the coarse texture maps.

3.3.1 Diffusion-based Data Generation

We employ diffusion models [7, 24, 39] to generate realistic and diverse training data.

We generate texture maps following the UV template configuration, adopting the pre-trained ControlNet with edge maps as input conditions. ControlNet finetunes text-to-image diffusion models to incorporate additional structural conditions as input. The input edge maps are obtained through canny edge detection on clothing-specific UVs, and the input text prompts are generated by applying image captioning models, namely Lavis-BLIP [16], OFA [32] and MPlug [15], to tens of thousands of clothing images crawled from Amazon and Taobao.

After generating the fine UV texture maps, we can already render synthetic front and back 2D catalog images, which will be used to train the inpainting network. We leverage the rendering power of Blender's native EEVEE engine to get the best visual result. A critical step of our approach is to perform data augmentation so that the inpainting network captures invariant features instead of details that differ between synthetic and testing images, which do not generalize. To this end, we vary the blendshape parameters of the template mesh to generate 2D catalog images in different shape and pose configurations and to simulate self-occlusions, which frequently exist in reality and lead to erroneous textures, as shown in Fig. 2. We hand-craft three common blendshapes (Fig. 4) that are enough to simulate the diverse cloth-sleeve correlations/layouts in reality.

Next, we run Phase I to produce coarse textures from the rendered synthetic 2D catalog images, yielding the coarse, defective textures corresponding to the fine textures. These pairs of coarse-fine textures serve as the training data for the subsequent inpainting network.
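The spirit of this data generation step can be reproduced with off-the-shelf components; the sketch below uses the Hugging Face diffusers ControlNet pipeline with canny-edge conditioning. The checkpoint names, file paths and prompt are illustrative assumptions, and in the actual pipeline the prompts come from the captioning models named above.

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

# Canny edges of a clothing-specific UV template serve as the structural condition.
uv_template = cv2.imread("tshirt_uv_template.png")             # illustrative path
edges = cv2.Canny(uv_template, 100, 200)
edge_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

# Prompts would be produced by image captioning (BLIP / OFA / mPLUG) on crawled product photos.
prompt = "a green t-shirt with a small embroidered chest logo, flat product photo"
texture = pipe(prompt, image=edge_image, num_inference_steps=30).images[0]
texture.save("synthetic_fine_texture.png")
```

Rendering such textures onto the template in Blender (emission only) and re-running Phase I then yields the coarse-fine training pairs.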
Figure 4. Illustration of the three sleeve-related blendshapes of our template mesh model. These blendshapes allow rendering clothing
images in diverse pose configurations to facilitate simulating real-world clothing image layouts.

3.3.2 Texture Inpainting

Given the training data simulated by LDMs, we then train our inpainting network. Note that we train a single network for all clothing categories, making it general-purpose.

For the inpainting network, we choose Pix2PixHD [33], which shows better results than alternative approaches such as conditional TransUNet [6] and ControlNet: it produces a color-consistent output To, in contrast to the prompt-guided ControlNet (please check our supp. material for a visual comparison). All of these models take the full input UV as the condition. During texture repairing, we first locate the missing holes and the broken edges and lines in the coarse UV as the residual mask Mr (the left corner of the bottom row of Fig. 9), and then linearly blend those blank areas with the model's output. Formally, we compute the output as:

Tfine = BilateralFilter(Tcoarse + Mr ∗ To)   (6)

where BilateralFilter is a non-linear filter that smooths the irregular and rough seams between Tcoarse and To while keeping edges fairly sharp. More details can be seen in our attached video.
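A simplified NumPy/OpenCV rendition of the blending in Eq. (6) is given below. It assumes the UV template mask is available and that blank texels are near-black; the threshold and filter parameters are illustrative, and the explicit convex blend matches the addition in Eq. (6) when blank texels are zero.

```python
import cv2
import numpy as np

def refine_texture(t_coarse, t_out, uv_mask, blank_thresh=8):
    """t_coarse: coarse texture from Phase I, (H, W, 3) uint8
    t_out:    inpainting network output T_o, same shape
    uv_mask:  (H, W) binary mask of the valid UV area"""
    # Residual mask M_r: blank texels inside the UV area (holes, broken edges and lines).
    blank = (t_coarse.max(axis=-1) < blank_thresh) & (uv_mask > 0)
    m_r = blank.astype(np.float32)[..., None]
    # T_coarse + M_r * T_o: keep observed texels, fill the blanks from the network output.
    blended = t_coarse.astype(np.float32) * (1.0 - m_r) + t_out.astype(np.float32) * m_r
    blended = np.clip(blended, 0, 255).astype(np.uint8)
    # Bilateral filtering smooths the seams between the two sources while keeping edges sharp.
    return cv2.bilateralFilter(blended, 9, 75, 75)
```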

4. Experiments
Our goal is to generate 3D garments from 2D catalog images. We verify the effectiveness of Cloth2Tex via thorough evaluation and comparison with state-of-the-art baselines. Furthermore, we conduct a detailed ablation study to demonstrate the effects of individual components.

4.1. Comparison with SOTA

We first compare our method with SOTA virtual try-on algorithms, both 3D and 2D approaches.

Comparison with 3D SOTA: We compare Cloth2Tex with SOTA methods that produce 3D mesh textures from 2D clothing images, including the model-based Pix2Surf [20] and TPS-based Warping [19] (we replace the original MADF with a locally modified, UV-constrained Navier-Stokes method; the differences between our UV-constrained Navier-Stokes and the original version are described in the suppl. material). As shown in Fig. 5, our method produces high-fidelity 3D textures with sharp, high-frequency details of the patterns on clothing, such as the leaves and characters in the top row. In addition, our method accurately preserves the spatial configuration of the garment, particularly the overall aspect ratio of the patterns and the relative locations of the logos. In contrast, the baseline method Pix2Surf [20] tends to produce blurry textures due to a smooth mapping network, and the Warping [19] baseline introduces undesired spatial distortions (e.g., second row in Fig. 5) due to sparse correspondences.

Figure 5. Comparison with Pix2Surf [20] and Warping [19] on T-shirts. Please zoom in for more details.

Comparison with 2D SOTA: We further compare Cloth2Tex with 2D virtual try-on methods: the flow-based DAFlow [2] and the StyleGAN-enhanced Deep-Generative-Projection (DGP) [8]. As shown in Fig. 6, Cloth2Tex achieves better quality than 2D virtual try-on methods in sharpness and semantic consistency. More importantly, our outputs, namely 3D textured clothing meshes, are naturally compatible with cloth physics simulation, allowing the synthesis of realistic try-on effects in various body poses. In contrast, 2D methods rely on priors learned from training images and are hence limited in their generalization ability to extreme poses outside the training distribution.

Figure 6. Comparison with 2D virtual try-on methods, including DAFlow [2] and DGP [8].

User Study: Finally, we conduct a user study to evaluate the overall perceptual quality of our method and the 2D and 3D baselines, as well as their consistency with the provided input catalog images. We consider DGP the 2D baseline and TPS the 3D baseline due to their best performance among existing work. Each participant is shown three randomly selected pairs of results, one produced by our method and the other by one of the baseline methods, and is asked to choose the one that appears more realistic and matches the reference clothing image better. In total, we received 643 responses from 72 users aged between 15 and 60. The results are reported in Fig. 7. Compared to DGP [8] and TPS, Cloth2Tex is favored by the participants with preference rates of 74.60% and 81.65%, respectively. This user study verifies the quality and consistency of our method.

Figure 7. User preferences among 643 responses from 72 participants. Our method is favored by significantly more users.

4.2. Ablation Study

To demonstrate the effect of individual components in our pipeline, we perform an ablation study for both of its stages.

Neural Rendering vs. TPS Warping: TPS warping has been widely used in previous work on generating 3D garment textures. However, we found that it suffers from the challenging cases illustrated in Fig. 2, so we propose a new pipeline based on neural rendering. We compare our method with TPS warping quantitatively to verify this design choice. Our test set consists of 10+ clothing categories, including T-shirts, Polos, sweatshirts, jackets, hoodies, shorts, trousers, and skirts, with 500 samples per category. We report the structural similarity (SSIM [36]) and peak signal-to-noise ratio (PSNR) between the recovered textures and the ground-truth textures.

As shown in Tab. 1, our neural rendering-based pipeline achieves superior SSIM and PSNR compared to TPS warping. This improvement is also preserved after inpainting and refinement, leading to a much better quality of the final texture. A comprehensive comparison of various inpainting methods is provided in the supp. material.

Figure 8. Ablation Study on Phase I. From left to right: base, base + total variation loss Etv, base + Etv + automatic scaling.

Table 1. Neural Rendering vs. TPS Warping. We evaluate the texture quality of neural rendering and TPS-based warping, with and without inpainting.

Baseline   Inpainting   SSIM ↑   PSNR ↑
TPS        None         0.70     20.29
TPS        Pix2PixHD    0.76     23.81
Phase I    None         0.80     21.72
Phase I    Pix2PixHD    0.83     24.56
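For reference, the SSIM/PSNR protocol of Tab. 1 corresponds to standard scikit-image metrics computed between recovered and ground-truth texture maps; the snippet below is an illustration of this evaluation (file lists are hypothetical), not the authors' script.

```python
import numpy as np
from skimage.io import imread
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def evaluate_textures(pred_paths, gt_paths):
    """Average SSIM and PSNR over pairs of recovered / ground-truth textures."""
    ssims, psnrs = [], []
    for p, g in zip(pred_paths, gt_paths):
        pred, gt = imread(p), imread(g)
        ssims.append(structural_similarity(gt, pred, channel_axis=-1, data_range=255))
        psnrs.append(peak_signal_noise_ratio(gt, pred, data_range=255))
    return float(np.mean(ssims)), float(np.mean(psnrs))
```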
Figure 9. Comparison with SOTA inpainting methods (Navier-Stokes [4], LaMa [30], MADF [40] and Stable Diffusion v2 [24]) on texture inpainting. The upper left corner of each column shows the conditional mask input. Blue in the first column shows that our method is capable of maintaining a consistent boundary and curvature w.r.t. the reference image, while green highlights the blank regions that need inpainting.

Total Variation Loss & Automatic Scaling (Phase I): As shown in Fig. 8, when the total variation loss Etv and automatic scaling are dropped, the textures are incomplete and cannot maintain a semantically correct layout. With Etv, Cloth2Tex produces more complete textures by exploiting the local consistency of textures. Further applying automatic scaling results in better alignment between the template mesh and the input images, and hence a more semantically correct texture map.

Inpainting Methods (Phase II): Next, to demonstrate the need for training an inpainting model specifically for UV clothing textures, we compare our task-specific inpainting model with general-purpose inpainting algorithms, including the Navier-Stokes algorithm [4] and off-the-shelf deep learning models with pre-trained checkpoints, namely LaMa [30], MADF [40] and Stable Diffusion v2 [24]. Here, we modify the traditional Navier-Stokes [4] algorithm into a UV-constrained version, because a texture map occupies only part of the square image grid and the large non-UV regions adversely affect texture inpainting (please see the supp. material for a comparison).

As shown in Fig. 9, our method, trained on the synthetic dataset generated by the diffusion model, outperforms general-purpose inpainting methods in the task of refining and completing clothing textures, especially in terms of the color consistency between inpainted regions and the original image.

4.3. Limitations

As shown in Fig. 10, Cloth2Tex can produce high-quality textures for common garments, e.g. T-shirts, shorts, trousers, etc. (blue bounding box (bbox)). However, we have observed that it has difficulty recovering textures for garments with complex patterns: e.g., inaccurate and inconsistent local textures (belt, collarband) occur on the windbreaker (red bbox). We attribute this to the extra accessories on the garment, which inevitably add partial textures in addition to the main UV.

Another imperfection is that our method cannot maintain the uniformity of checked shirts with densely assembled grids. As shown in the second row of Fig. 6, our method is inferior to 2D VTON methods in preserving textures composed of thousands of fine and tiny checkerboard-like grids; checked shirts and pleated skirts are representative of such garments. We attribute this to the subtle position changes during deformation graph optimization, which eventually make the template mesh less uniform, as the regularization terms, i.e. as-rigid-as-possible, are not strong enough constraints to obtain a conformal mesh. We acknowledge this challenge and leave exploring the generation of a homogeneous mesh with uniformly spaced triangles to future work.

5. Conclusion

This paper presents a novel pipeline, Cloth2Tex, for synthesizing high-quality textures for 3D meshes from pictures taken from only the front and back views. Cloth2Tex adopts a two-stage process to obtain visually appealing textures, where Phase I performs coarse texture generation and Phase II performs texture refinement. Training a generalized texture inpainting network is non-trivial due to the high topological variability of UV space; obtaining paired data under such circumstances is therefore important. To the best of our knowledge, this is the first study to combine a diffusion model with a 3D engine (Blender) to collect coarse-fine paired textures for 3D texturing tasks. We show the generalizability of this approach on a variety of examples.
Figure 10. Visualization of 3D virtual try-on. We obtain textured 3D meshes from 2D reference images shown on the left. The 3D meshes
are then draped onto 3D humans.
To avoid distortion and stretching artifacts across clothes, we automatically adjust the scale of the template mesh vertices and thus best prepare them for the later image-based optimization, which effectively guides the implicitly learned texture towards a complete and distortion-free structure. Extensive experiments demonstrate that our method can effectively synthesize consistent and highly detailed textures for typical clothes without extra manual effort.

In summary, we hope our work can inspire more future research in 3D texture synthesis and shed some light on this area.
References

[1] AUTOMATIC1111. Stable Diffusion web UI. https://github.com/AUTOMATIC1111/stable-diffusion-webui, 2022.
[2] Shuai Bai, Huiling Zhou, Zhikang Li, Chang Zhou, and Hongxia Yang. Single stage virtual try-on via deformable attention flows. In Computer Vision – ECCV 2022: 17th European Conference, Proceedings, Part XV, pages 409–425. Springer, 2022.
[3] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(4):509–522, 2002.
[4] Marcelo Bertalmio, Andrea L. Bertozzi, and Guillermo Sapiro. Navier-Stokes, fluid dynamics, and image and video inpainting. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), volume 1, pages I–I. IEEE, 2001.
[5] Dave Zhenyu Chen, Yawar Siddiqui, Hsin-Ying Lee, Sergey Tulyakov, and Matthias Nießner. Text2Tex: Text-driven texture synthesis via diffusion models. arXiv preprint arXiv:2303.11396, 2023.
[6] Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L. Yuille, and Yuyin Zhou. TransUNet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306, 2021.
[7] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
[8] Ruili Feng, Cheng Ma, Chengji Shen, Xin Gao, Zhenjiang Liu, Xiaobo Li, Kairi Ou, Deli Zhao, and Zheng-Jun Zha. Weakly supervised high-fidelity clothing model generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3440–3449, 2022.
[9] Clement Fuji Tsang, Maria Shugrina, Jean Francois Lafleche, Towaki Takikawa, Jiehan Wang, Charles Loop, Wenzheng Chen, Krishna Murthy Jatavallabhula, Edward Smith, Artem Rozantsev, Or Perel, Tianchang Shen, Jun Gao, Sanja Fidler, Gavriel State, Jason Gorski, Tommy Xiang, Jianing Li, Michael Li, and Rev Lebaredian. Kaolin: A PyTorch library for accelerating 3D deep learning research. https://github.com/NVIDIAGameWorks/kaolin, 2022.
[10] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
[11] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
[12] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
[13] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[14] Mikhail Konstantinov, Alex Shonenkov, Daria Bakshandaeva, and Ksenia Ivanova. DeepFloyd: Text-to-image model with a high degree of photorealism and language understanding. https://deepfloyd.ai/, 2023.
[15] Chenliang Li, Haiyang Xu, Junfeng Tian, Wei Wang, Ming Yan, Bin Bi, Jiabo Ye, Hehong Chen, Guohai Xu, Zheng Cao, et al. mPLUG: Effective and efficient vision-language learning by cross-modal skip-connections. arXiv preprint arXiv:2205.12005, 2022.
[16] Dongxu Li, Junnan Li, Hung Le, Guangsen Wang, Silvio Savarese, and Steven C. H. Hoi. LAVIS: A library for language-vision intelligence, 2022.
[17] Shichen Liu, Tianye Li, Weikai Chen, and Hao Li. Soft Rasterizer: A differentiable renderer for image-based 3D reasoning. In The IEEE International Conference on Computer Vision (ICCV), 2019.
[18] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (TOG), 34(6):1–16, 2015.
[19] Sahib Majithia, Sandeep N. Parameswaran, Sadbhavana Babar, Vikram Garg, Astitva Srivastava, and Avinash Sharma. Robust 3D garment digitization from monocular 2D images for 3D virtual try-on systems. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3428–3438, 2022.
[20] Aymen Mir, Thiemo Alldieck, and Gerard Pons-Moll. Learning to transfer texture from clothing images to 3D humans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7023–7034, 2020.
[21] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
[22] Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu. DenseCLIP: Language-guided dense prediction with context-aware prompting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[23] Nikhila Ravi, Jeremy Reizenstein, David Novotny, Taylor Gordon, Wan-Yen Lo, Justin Johnson, and Georgia Gkioxari. Accelerating 3D deep learning with PyTorch3D. arXiv:2007.08501, 2020.
[24] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
[25] Leonid I. Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4):259–268, 1992.
[26] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022.
[27] Yawar Siddiqui, Justus Thies, Fangchang Ma, Qi Shan, Matthias Nießner, and Angela Dai. Texturify: Generating textures on 3D shape surfaces. In Computer Vision – ECCV 2022: 17th European Conference, Proceedings, Part III, pages 72–88. Springer, 2022.
[28] Olga Sorkine and Marc Alexa. As-rigid-as-possible surface modeling. In Symposium on Geometry Processing, volume 4, pages 109–116, 2007.
[29] Robert W. Sumner, Johannes Schmid, and Mark Pauly. Embedded deformation for shape manipulation. In ACM SIGGRAPH 2007 Papers, pages 80–es, 2007.
[30] Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lempitsky. Resolution-robust large mask inpainting with Fourier convolutions. arXiv preprint arXiv:2109.07161, 2021.
[31] Brandon Trabucco, Kyle Doherty, Max Gurinas, and Ruslan Salakhutdinov. Effective data augmentation with diffusion models. arXiv preprint arXiv:2302.07944, 2023.
[32] Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International Conference on Machine Learning, pages 23318–23340. PMLR, 2022.
[33] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8798–8807, 2018.
[34] Tuanfeng Y. Wang, Duygu Ceylan, Jovan Popovic, and Niloy J. Mitra. Learning a shared shape space for multimodal garment design. ACM Transactions on Graphics, 37(6):1:1–1:14, 2018.
[35] Wenguan Wang, Yuanlu Xu, Jianbing Shen, and Song-Chun Zhu. Attentive fashion grammar network for fashion landmark detection and clothing category classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4271–4280, 2018.
[36] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
[37] Yi Xu, Shanglin Yang, Wei Sun, Li Tan, Kefeng Li, and Hui Zhou. 3D virtual garment modeling from RGB images. In 2019 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pages 37–45. IEEE, 2019.
[38] Rui Yu, Yue Dong, Pieter Peers, and Xin Tong. Learning texture generators for 3D shape collections from internet photo sets. In British Machine Vision Conference, 2021.
[39] Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models, 2023.
[40] Manyu Zhu, Dongliang He, Xin Li, Chao Li, Fu Li, Xiao Liu, Errui Ding, and Zhaoxiang Zhang. Image inpainting by end-to-end cascaded refinement with mask awareness. IEEE Transactions on Image Processing, 30:4855–4866, 2021.
Cloth2Tex: A Customized Cloth Texture Generation Pipeline for 3D Virtual
Try-On
Supplementary Material
6. Implementation Details
In Phase I, we fix the optimization steps of both silhouette matching and image-based optimization to 1,000, which makes each coarse texture generation process take less than 1 minute on an NVIDIA Ampere A100 (80GB VRAM). The initial weights of the energy terms are wsil = 50, wlmk = 0.01, warap = 50, wnorm = 10, wimg = 100, wtv = 1; we then use a cosine scheduler to decay warap and wnorm to 5 and 1, respectively.

During the Blender-enhanced rendering process, we augment the data by randomly sampling the blendshapes of the upper cloth in the range [0.1, 1.0]. The synthetic images were rendered using the Blender EEVEE engine at a resolution of 512², emission only (disentangled from the impact of shading, which is a notoriously difficult problem, as dissected in Text2Tex [5]).

The synthetic data used for training the texture inpainting network are produced by a pretrained ControlNet from prompts (generated by Lavis-BLIP [16], OFA [32] and MPlug [15]) and UV templates (UV maps manually crafted by artists), as shown in Fig. 14, and cover more garment types than previous methods, e.g. Pix2Surf [20] (4) and Warping [19] (2).

The only trainable component, Pix2PixHD in Phase II, is optimized by Adam [13] with lr = 2e−4 for 200 epochs. Our implementation is built on top of PyTorch [21] alongside PyTorch3D [23] for silhouette matching, rendering and inpainting.

Figure 11. Visualization of the Navier-Stokes method on a UV template. Our locally constrained NS method fills the blanks thoroughly (though with a lack of precision) compared to the original global counterpart.

The detailed parameters of the template meshes in Cloth2Tex are summarized in Tab. 4; sketches of all template meshes and their UV maps are shown in Fig. 12 and Fig. 13, respectively.

Table 2. SOTA inpainting methods applied to our synthetic data.

Baseline   Inpainting                   SSIM ↑
Phase I    None                         0.80
Phase I    Navier-Stokes [4]            0.80
Phase I    LaMa [30]                    0.78
Phase I    Stable Diffusion (v2) [24]   0.77
Phase I    Deep Floyd [14]              0.80

Table 3. Inpainting methods trained on our synthetic data.

Baseline   Inpainting              SSIM ↑
Phase I    None                    0.80
Phase I    Cond-TransUNet [6]      0.78
Phase I    ControlNet [39]         0.77
Phase I    Pix2PixHD [33]          0.83

7. Self-modified UV-constrained Navier-Stokes Method

In Fig. 11, we display the results of our self-modified UV-constrained Navier-Stokes (NS) method (local) and the original NS (global) method. Specifically, we add a reference branch (the UV template) to NS and thus confine the inpainting-affected region to the given UV template of each garment, which contributes directly to the interpolation result. Our locally constrained NS method allows blanks to be filled thoroughly compared to the original global NS method.

The sole aim of modifying the original global NS method is to conduct a fair comparison with the deep learning based methods, as depicted in the main paper.

It is noteworthy that for small blank areas (e.g. columns 1 and 3 of Fig. 11), the texture uniformity and consistency are well preserved, producing plausible textures.
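The UV-constrained variant amounts to restricting the inpainting mask to blank texels that lie inside the garment's UV template; a minimal OpenCV sketch follows (the threshold, radius and array handling are illustrative, not the exact implementation).

```python
import cv2
import numpy as np

def uv_constrained_ns_inpaint(coarse_tex, uv_template_mask, blank_thresh=8, radius=3):
    """coarse_tex: (H, W, 3) uint8 coarse texture from Phase I
    uv_template_mask: (H, W) uint8, >0 inside the garment's UV area"""
    # Only texels that are blank AND inside the UV template are inpainted, so the
    # surrounding non-UV background cannot bleed into the interpolation.
    blank = (coarse_tex.max(axis=-1) < blank_thresh).astype(np.uint8)
    mask = cv2.bitwise_and(blank, (uv_template_mask > 0).astype(np.uint8)) * 255
    return cv2.inpaint(coarse_tex, mask, radius, cv2.INPAINT_NS)
```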
Figure 12. Visualization of all template meshes used in Cloth2Tex.

Figure 13. All UV maps of template meshes used in Cloth2Tex.


Table 4. Detailed parameters of the template meshes in Cloth2Tex. Each template has fewer than 10,000 vertices, and all templates are animatable by means of Style3D, which is the best-suited software for clothing animation.

Category Vertices Faces Key Nodes (Deformation Graph) Animatable


T-shirts 8,523 16,039 427 ✓
Polo 8,922 16,968 447 ✓
Shorts 8,767 14,845 435 ✓
Trousers 9,323 16,995 466 ✓
Dress 7,752 14,959 388 ✓
Skirt 6,116 11,764 306 ✓
Windbreaker 9,881 17,341 494 ✓
Jacket 8,168 15,184 409 ✓
Hoodie (Zipup) 8,537 15,874 427 ✓
Sweatshirt 9,648 18,209 483 ✓
One-piece Dress 9,102 17,111 455 ✓

Figure 14. Texture maps for training the instance-map-guided Pix2PixHD, synthesized by ControlNet from canny edges.
Figure 15. Comparison with representative image-to-image methods with conditional input: the autoencoder-based TransUNet [6] (we modify the base model and add an extra branch for the UV map, aiming to train it on all garment types together), the diffusion-based ControlNet [39], and the GAN-based Pix2PixHD [33]. It is rather obvious that the prompt-sensitive ControlNet is limited in recovering a globally color-consistent texture map. The upper right corner of each method shows the conditional input.

8. Efficiency of Mainstream Inpainting Methods

As depicted in the main paper, our neural rendering-based pipeline achieves superior SSIM compared to TPS warping. This improvement is also preserved after inpainting and refinement, leading to a much better quality of the final texture.

Free from the page limit of the main paper, here we conduct a comprehensive comparison study of various inpainting methods acting directly upon the coarse texture maps derived from Phase I, to demonstrate the efficiency of mainstream inpainting methods.

First, we compare the state-of-the-art inpainting methods quantitatively on our synthetic coarse-fine paired dataset. One thing to note is that the checkpoints of all deep learning based inpainting methods are open and free; no finetuning or modification is involved in this comparison. As described in Tab. 2, none of these methods produce a noticeable positive impact on the SSIM score compared to the original coarse texture (None version).

Next, we revise TransUNet [6] to take a conditional UV map as input, for unity of the input and output with ControlNet [39] and Pix2PixHD [33]. We then train cond-TransUNet, ControlNet, and Pix2PixHD on the synthetic data for a fair comparison. All three take the original coarse texture maps as input and the UV maps as conditional input, and output fine texture maps. The selection of TransUNet, ControlNet, and Pix2PixHD covers the main generative paradigms: TransUNet is a basic autoencoder-based supervised image-to-image model, ControlNet is a diffusion-based generative model, and Pix2PixHD is a GAN-based generative model. We want to explore the feasibility of these methods for our task. As depicted in Tab. 3 and Fig. 15, Pix2PixHD is superior in obtaining satisfactory texture maps from both qualitative and quantitative views.
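For completeness, the conditioning scheme shared by these three baselines (coarse texture in, UV template as condition, fine texture out) can be written schematically as below. How each model actually consumes the condition differs (ControlNet uses a separate conditioning branch, Pix2PixHD an instance-map input), so the channel-wise concatenation and the generator call here are purely illustrative assumptions.

```python
import torch

coarse = torch.rand(1, 3, 512, 512)            # coarse texture map from Phase I
uv_cond = torch.rand(1, 3, 512, 512)           # UV template used as the condition
net_in = torch.cat([coarse, uv_cond], dim=1)   # 6-channel conditional input
fine_texture = generator(net_in)               # stands in for cond-TransUNet / Pix2PixHD
```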
