Cloth2Tex: A Customized Cloth Texture Generation Pipeline for 3D Virtual Try-On
Figure 1. We propose Cloth2Tex, a novel pipeline for converting 2D images of clothing to high-quality 3D textured meshes that can be
draped onto 3D humans. In contrast to previous methods, Cloth2Tex supports a variety of clothing types. Results of 3D textured meshes
produced by our method as well as the corresponding input images are shown above.
4. Experiments

Our goal is to generate 3D garments from 2D catalog images. We verify the effectiveness of Cloth2Tex via thorough evaluation and comparison with state-of-the-art baselines. Furthermore, we conduct a detailed ablation study to demonstrate the effects of individual components.

Figure 5. Comparison with Pix2Surf [20] and Warping [19] on T-shirts. Please zoom in for more details.
4.1. Comparison with SOTA
We first compare our method with SOTA virtual try-on algorithms, covering both 3D and 2D approaches.

Comparison with 3D SOTA: We compare Cloth2Tex with SOTA methods that produce 3D mesh textures from 2D clothing images, including model-based Pix2Surf [20] and TPS-based Warping [19] (we replace the original MADF with a locally changed UV-constrained Navier-Stokes method; the differences between our UV-constrained Navier-Stokes and the original version are described in the suppl. material). As shown in Fig. 5, our method produces high-fidelity 3D textures with sharp, high-frequency details of the patterns on clothing, such as the leaves and characters in the top row. In addition, our method accurately preserves the spatial configuration of the garment, particularly the overall aspect ratio of the patterns and the relative locations of the logos. In contrast, the baseline method Pix2Surf [20] tends to produce blurry textures due to a smooth mapping network, and the Warping [19] baseline introduces undesired spatial distortions (e.g., second row in Fig. 5) due to sparse correspondences.

Comparison with 2D SOTA: We further compare Cloth2Tex with 2D virtual try-on methods: flow-based DAFlow [2] and StyleGAN-enhanced Deep-Generative-Projection (DGP) [8]. As shown in Fig. 6, Cloth2Tex achieves better quality than 2D virtual try-on methods in sharpness and semantic consistency. More importantly, our outputs, namely 3D textured clothing meshes, are naturally compatible with cloth physics simulation, allowing the synthesis of realistic try-on effects in various body poses. In contrast, 2D methods rely on priors learned from training images and are hence limited in their generalization ability to extreme poses outside the training distribution.
User Study: Finally, we conduct a user study to evaluate the overall perceptual quality of our results and their consistency with the provided input catalog images, against the 2D and 3D baselines. We consider DGP the 2D baseline and TPS the 3D baseline due to their best performance among existing work. Each participant is shown three randomly selected pairs of results, one produced by our method and the other by one of the baseline methods. The participant is asked to choose the one that appears more realistic and matches the reference clothing image better. In total, we received 643 responses from 72 users aged between 15 and 60. The results are reported in Fig. 7. Compared to DGP [8] and TPS, Cloth2Tex is favored by the participants with preference rates of 74.60% and 81.65%, respectively. This user study verifies the quality and consistency of our method.

Figure 7. User preferences among 643 responses from 72 participants. Our method is favored by significantly more users.

Figure 8. Ablation Study on Phase I. From left to right: base, base + total variation loss E_tv, base + E_tv + automatic scaling.

Table 1. Neural Rendering vs. TPS Warping. We evaluate the texture quality of neural rendering and TPS-based warping, with and without inpainting.

Baseline   Inpainting   SSIM ↑   PSNR ↑
TPS        None         0.70     20.29
TPS        Pix2PixHD    0.76     23.81
Phase I    None         0.80     21.72
Phase I    Pix2PixHD    0.83     24.56

4.2. Ablation Study
To demonstrate the effect of individual components in our pipeline, we perform an ablation study for both of its stages.

Neural Rendering vs. TPS Warping: TPS warping has been widely used in previous work on generating 3D garment textures. However, we found that it suffers from the challenging cases illustrated in Fig. 2, so we propose a new pipeline based on neural rendering. We compare our method with TPS warping quantitatively to verify this design choice. Our test set consists of 10+ clothing categories, including T-shirts, Polos, sweatshirts, jackets, hoodies, shorts, trousers, and skirts, with 500 samples per category. We report the structural similarity (SSIM [36]) and peak signal-to-noise ratio (PSNR) between the recovered textures and the ground-truth textures. As shown in Tab. 1, our neural rendering-based pipeline achieves superior SSIM and PSNR compared to TPS warping. This improvement is also preserved after inpainting and refinement, leading to a much better quality of the final texture. We provide a comprehensive comparison of various inpainting methods in the supp. material.
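For reference, the SSIM and PSNR numbers in Tab. 1 can be reproduced with standard metric implementations; the snippet below is a minimal sketch (not our exact evaluation script) that assumes each recovered texture and its ground truth are stored as same-sized RGB images and uses scikit-image:

```python
import numpy as np
from skimage.io import imread
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def texture_metrics(pred_path: str, gt_path: str):
    """Compute SSIM and PSNR between a recovered texture and its ground truth."""
    pred = imread(pred_path).astype(np.float64) / 255.0
    gt = imread(gt_path).astype(np.float64) / 255.0
    # channel_axis=-1 treats the last dimension as the RGB channels.
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    return ssim, psnr

# Average the two metrics over a list of (prediction, ground-truth) pairs (hypothetical paths).
pairs = [("recovered/tshirt_000.png", "gt/tshirt_000.png")]
scores = [texture_metrics(p, g) for p, g in pairs]
print("mean SSIM:", np.mean([s for s, _ in scores]),
      "mean PSNR:", np.mean([p for _, p in scores]))
```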
Total Variation Loss & Automatic Scaling (Phase I): As shown in Fig. 8, when the total variation loss E_tv and automatic scaling are dropped, the textures are incomplete and cannot maintain a semantically correct layout. With E_tv, Cloth2Tex produces more complete textures by exploiting the local consistency of textures. Further applying automatic scaling results in better alignment between the template mesh and the input images, leading to a more semantically correct texture map.
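To make the role of E_tv concrete, a total variation term of this kind penalizes differences between neighboring texels so that color propagates from observed into unobserved regions during optimization. The PyTorch sketch below shows one common isotropic formulation; the exact variant and weighting used for E_tv in our implementation may differ:

```python
import torch

def total_variation_loss(texture: torch.Tensor) -> torch.Tensor:
    """Isotropic total variation on a texture map of shape (C, H, W).

    Penalizes absolute differences between horizontally and vertically
    adjacent texels, encouraging locally consistent color.
    """
    dh = (texture[:, 1:, :] - texture[:, :-1, :]).abs().mean()
    dw = (texture[:, :, 1:] - texture[:, :, :-1]).abs().mean()
    return dh + dw

# Usage inside an optimization loop (weight w_tv = 1 as in the implementation details):
texture = torch.rand(3, 512, 512, requires_grad=True)
loss = 1.0 * total_variation_loss(texture)  # added to the silhouette/image terms
loss.backward()
```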
Inpainting Methods (Phase II): Next, to demonstrate the need for training an inpainting model specifically for UV clothing textures, we compare our task-specific inpainting model with general-purpose inpainting algorithms, including the Navier-Stokes [4] algorithm and off-the-shelf deep learning models with pre-trained checkpoints: LaMa [30], MADF [40] and Stable Diffusion v2 [24]. Here, we modify the traditional Navier-Stokes [4] algorithm into a UV-constrained version, because a texture map occupies only part of the whole squared image grid, and the plentiful non-UV regions have an adverse effect on texture inpainting (please see the supp. material for a comparison).

As shown in Fig. 9, our method, trained on our synthetic dataset generated by the diffusion model, outperforms general-purpose inpainting methods in the task of refining and completing clothing textures, especially in terms of the color consistency between inpainted regions and the original image.

Figure 9. Comparison with SOTA inpainting methods (Navier-Stokes [4], LaMa [30], MADF [40] and Stable Diffusion v2 [24]) on texture inpainting. The upper left corner of each column shows the conditional mask input. Blue in the first column shows that our method is capable of maintaining a consistent boundary and curvature w.r.t. the reference image, while Green highlights the blank regions that need inpainting.
4.3. Limitations
As shown in Fig. 10, Cloth2Tex can produce high-quality textures for common garments, e.g., T-shirts, shorts and trousers (blue bounding box (bbox)). However, we have observed that it has difficulty in recovering textures for garments with complex patterns: e.g., inaccurate and inconsistent local textures (belt, collarband) occur for the windbreaker (red bbox). We attribute this to the extra accessories on the garment, which inevitably add partial texture on top of the main UV.

Another imperfection is that our method cannot maintain the uniformity of checked shirts with densely assembled grids: as shown in the second row of Fig. 6, our method is inferior to 2D VTON methods in preserving textures composed of thousands of fine and tiny checkerboard-like grids; checked shirts and pleated skirts are representative garments of this type.

We attribute this to subtle position changes during the deformation graph optimization, which gradually make the template mesh less uniform, as the regularization term, i.e., as-rigid-as-possible, is not a strong enough constraint energy for obtaining a conformal mesh. We acknowledge this challenge and leave it to future work to explore the possibility of generating a homogeneous mesh with uniformly-spaced triangles.
Figure 10. Visualization of 3D virtual try-on. We obtain textured 3D meshes from 2D reference images shown on the left. The 3D meshes
are then draped onto 3D humans.
5. Conclusion

This paper presents a novel pipeline, Cloth2Tex, for synthesizing high-quality textures for 3D meshes from pictures taken from only the front and back views. Cloth2Tex adopts a two-stage process to obtain visually appealing textures, where Phase I performs coarse texture generation and Phase II performs texture refinement. Training a generalized texture inpainting network is non-trivial due to the high topological variability of UV space, so obtaining paired data under such circumstances is important. To the best of our knowledge, this is the first study to combine a diffusion model with a 3D engine (Blender) to collect coarse-fine paired textures for 3D texturing tasks. We show the generalizability of this approach on a variety of examples. To avoid distortion and stretching artifacts across clothes, we automatically adjust the scale of the vertices of the template meshes and thus best prepare them for the later image-based optimization, which effectively guides the implicitly learned texture toward a complete and distortion-free structure. Extensive experiments demonstrate that our method can effectively synthesize consistent and highly detailed textures for typical clothes without extra manual effort.

In summary, we hope our work can inspire more future research in 3D texture synthesis and shed some light on this area.
References

[1] AUTOMATIC1111. Stable diffusion web ui. https://github.com/AUTOMATIC1111/stable-diffusion-webui, 2022. 5
[2] Shuai Bai, Huiling Zhou, Zhikang Li, Chang Zhou, and Hongxia Yang. Single stage virtual try-on via deformable attention flows. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XV, pages 409–425. Springer, 2022. 6, 7
[3] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(4):509–522, 2002. 2, 3
[4] Marcelo Bertalmio, Andrea L Bertozzi, and Guillermo Sapiro. Navier-stokes, fluid dynamics, and image and video inpainting. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), volume 1, pages I–I. IEEE, 2001. 8, 2
[5] Dave Zhenyu Chen, Yawar Siddiqui, Hsin-Ying Lee, Sergey Tulyakov, and Matthias Nießner. Text2tex: Text-driven texture synthesis via diffusion models. arXiv preprint arXiv:2303.11396, 2023. 3, 2
[6] Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L Yuille, and Yuyin Zhou. Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306, 2021. 5, 2
[7] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021. 3, 5
[8] Ruili Feng, Cheng Ma, Chengji Shen, Xin Gao, Zhenjiang Liu, Xiaobo Li, Kairi Ou, Deli Zhao, and Zheng-Jun Zha. Weakly supervised high-fidelity clothing model generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3440–3449, 2022. 6, 7
[9] Clement Fuji Tsang, Maria Shugrina, Jean Francois Lafleche, Towaki Takikawa, Jiehan Wang, Charles Loop, Wenzheng Chen, Krishna Murthy Jatavallabhula, Edward Smith, Artem Rozantsev, Or Perel, Tianchang Shen, Jun Gao, Sanja Fidler, Gavriel State, Jason Gorski, Tommy Xiang, Jianing Li, Michael Li, and Rev Lebaredian. Kaolin: A pytorch library for accelerating 3d deep learning research. https://github.com/NVIDIAGameWorks/kaolin, 2022. 4
[10] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022. 5
[11] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020. 3
[12] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021. 5
[13] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. 2
[14] Mikhail Konstantinov, Alex Shonenkov, Daria Bakshandaeva, and Ksenia Ivanova. Deepfloyd: Text-to-image model with a high degree of photorealism and language understanding. https://deepfloyd.ai/, 2023. 2
[15] Chenliang Li, Haiyang Xu, Junfeng Tian, Wei Wang, Ming Yan, Bin Bi, Jiabo Ye, Hehong Chen, Guohai Xu, Zheng Cao, et al. mplug: Effective and efficient vision-language learning by cross-modal skip-connections. arXiv preprint arXiv:2205.12005, 2022. 5, 2
[16] Dongxu Li, Junnan Li, Hung Le, Guangsen Wang, Silvio Savarese, and Steven C. H. Hoi. Lavis: A library for language-vision intelligence, 2022. 5, 2
[17] Shichen Liu, Tianye Li, Weikai Chen, and Hao Li. Soft rasterizer: A differentiable renderer for image-based 3d reasoning. In The IEEE International Conference on Computer Vision (ICCV), 2019. 2, 3, 4, 5
[18] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. Smpl: A skinned multi-person linear model. ACM Transactions on Graphics (TOG), 34(6):1–16, 2015. 2
[19] Sahib Majithia, Sandeep N Parameswaran, Sadbhavana Babar, Vikram Garg, Astitva Srivastava, and Avinash Sharma. Robust 3d garment digitization from monocular 2d images for 3d virtual try-on systems. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3428–3438, 2022. 2, 3, 6
[20] Aymen Mir, Thiemo Alldieck, and Gerard Pons-Moll. Learning to transfer texture from clothing images to 3d humans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7023–7034, 2020. 2, 3, 6
[21] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017. 2
[22] Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu. Denseclip: Language-guided dense prediction with context-aware prompting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 4
[23] Nikhila Ravi, Jeremy Reizenstein, David Novotny, Taylor Gordon, Wan-Yen Lo, Justin Johnson, and Georgia Gkioxari. Accelerating 3d deep learning with pytorch3d. arXiv:2007.08501, 2020. 5, 2
[24] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022. 2, 3, 5, 8
[25] Leonid I Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4):259–268, 1992. 5
[26] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022. 3, 5
[27] Yawar Siddiqui, Justus Thies, Fangchang Ma, Qi Shan, Matthias Nießner, and Angela Dai. Texturify: Generating textures on 3d shape surfaces. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part III, pages 72–88. Springer, 2022. 2
[28] Olga Sorkine and Marc Alexa. As-rigid-as-possible surface modeling. In Symposium on Geometry Processing, volume 4, pages 109–116, 2007. 4
[29] Robert W Sumner, Johannes Schmid, and Mark Pauly. Embedded deformation for shape manipulation. In ACM SIGGRAPH 2007 Papers, pages 80–es. 2007. 3
[30] Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lempitsky. Resolution-robust large mask inpainting with fourier convolutions. arXiv preprint arXiv:2109.07161, 2021. 8, 2
[31] Brandon Trabucco, Kyle Doherty, Max Gurinas, and Ruslan Salakhutdinov. Effective data augmentation with diffusion models. arXiv preprint arXiv:2302.07944, 2023. 3
[32] Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International Conference on Machine Learning, pages 23318–23340. PMLR, 2022. 5, 2
[33] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8798–8807, 2018. 5, 2
[34] Tuanfeng Y. Wang, Duygu Ceylan, Jovan Popovic, and Niloy J. Mitra. Learning a shared shape space for multimodal garment design. ACM Transactions on Graphics, 37(6):1:1–1:14, 2018. 3
[35] Wenguan Wang, Yuanlu Xu, Jianbing Shen, and Song-Chun Zhu. Attentive fashion grammar network for fashion landmark detection and clothing category classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4271–4280, 2018. 2, 4
[36] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004. 7
[37] Yi Xu, Shanglin Yang, Wei Sun, Li Tan, Kefeng Li, and Hui Zhou. 3d virtual garment modeling from rgb images. In 2019 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pages 37–45. IEEE, 2019. 2
[38] Rui Yu, Yue Dong, Pieter Peers, and Xin Tong. Learning texture generators for 3d shape collections from internet photo sets. In British Machine Vision Conference, 2021. 3
[39] Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models, 2023. 2, 5
[40] Manyu Zhu, Dongliang He, Xin Li, Chao Li, Fu Li, Xiao Liu, Errui Ding, and Zhaoxiang Zhang. Image inpainting by end-to-end cascaded refinement with mask awareness. IEEE Transactions on Image Processing, 30:4855–4866, 2021. 3, 8
Cloth2Tex: A Customized Cloth Texture Generation Pipeline for 3D Virtual Try-On
Supplementary Material
6. Implementation Details
In Phase I, we fix the number of optimization steps for both silhouette matching and image-based optimization to 1,000, which makes each coarse texture generation take less than 1 minute on an NVIDIA Ampere A100 (80GB VRAM). The initial weights of the energy terms are w_sil = 50, w_lmk = 0.01, w_arap = 50, w_norm = 10, w_img = 100 and w_tv = 1; we then use a cosine scheduler to decay w_arap and w_norm to 5 and 1, respectively.
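A minimal sketch of this weight schedule is shown below, assuming the decay is applied over the 1,000 optimization steps (the exact schedule boundaries in the released code may differ):

```python
import math

def cosine_decay(step: int, total_steps: int, w_start: float, w_end: float) -> float:
    """Cosine decay of a loss weight from w_start to w_end over total_steps."""
    t = min(step, total_steps) / total_steps
    return w_end + 0.5 * (w_start - w_end) * (1.0 + math.cos(math.pi * t))

# Decay the ARAP and normal-consistency weights during the 1,000-step optimization.
for step in range(1000):
    w_arap = cosine_decay(step, 1000, w_start=50.0, w_end=5.0)
    w_norm = cosine_decay(step, 1000, w_start=10.0, w_end=1.0)
    # total = 50*E_sil + 0.01*E_lmk + w_arap*E_arap + w_norm*E_norm + 100*E_img + 1*E_tv
```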
During the Blender-enhanced rendering process, we augment the data by randomly sampling the blendshapes of the upper cloth in the range [0.1, 1.0]. The synthetic images were rendered with the Blender EEVEE engine at a resolution of 512², emission only (to disentangle the impact of shading, which is a notoriously difficult problem, as dissected in Text2Tex [5]).
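For illustration, this rendering setup can be scripted with Blender's Python API roughly as follows; the object and output names are placeholders rather than the actual asset names in our pipeline:

```python
import random
import bpy

scene = bpy.context.scene
scene.render.engine = 'BLENDER_EEVEE'          # EEVEE renderer
scene.render.resolution_x = 512
scene.render.resolution_y = 512

# Randomly sample the upper-cloth blendshape strengths in [0.1, 1.0].
cloth = bpy.data.objects['upper_cloth']        # placeholder object name
for key in cloth.data.shape_keys.key_blocks[1:]:   # skip the Basis key
    key.value = random.uniform(0.1, 1.0)

# Emission-only materials are assumed to be set on the garment, so the
# rendered texture is independent of lighting and shading.
scene.render.filepath = '/tmp/render_0001.png'
bpy.ops.render.render(write_still=True)
```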
The synthetic data used for training the texture inpainting network are yielded from a pretrained ControlNet using prompts (generated by Lavis-BLIP [16], OFA [32] and mPLUG [15]) and UV templates (UV maps manually crafted by artists), as shown in Fig. 14, and cover more garment types than previous methods, e.g., Pix2Surf [20] (4) and Warping [19] (2).
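A sketch of this synthesis step with the diffusers library is given below; the checkpoint names and the prompt are illustrative assumptions, not the exact configuration used to build the dataset:

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

# Canny edges of an artist-drawn UV template condition the texture synthesis.
uv_template = cv2.imread("uv_template_tshirt.png")          # placeholder path
edges = cv2.Canny(uv_template, 100, 200)
control_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

# The prompt would come from BLIP/OFA/mPLUG captions of catalog images.
prompt = "a t-shirt texture with a floral pattern, flat colors"
texture = pipe(prompt, image=control_image, num_inference_steps=30).images[0]
texture.save("synthetic_texture_0001.png")
```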
The only trainable component in Phase II, Pix2PixHD, is optimized with Adam [13] (lr = 2e-4) for 200 epochs. Our implementation is built on top of PyTorch [21] alongside PyTorch3D [23] for silhouette matching, rendering and inpainting.

The detailed parameters of the template meshes in Cloth2Tex are summarized in Tab. 4; sketches of all template meshes and UV maps are shown in Fig. 12 and Fig. 13, respectively.
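As a reference for the silhouette-matching setup, a differentiable silhouette renderer can be instantiated in PyTorch3D along the following lines; the camera and blur settings are illustrative defaults, not necessarily the values used in our experiments:

```python
import math
import torch
from pytorch3d.renderer import (FoVPerspectiveCameras, RasterizationSettings,
                                MeshRasterizer, MeshRenderer, SoftSilhouetteShader)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
cameras = FoVPerspectiveCameras(device=device)

# Soft rasterization: a small blur radius keeps the silhouette differentiable.
sigma = 1e-4
raster_settings = RasterizationSettings(
    image_size=512,
    blur_radius=math.log(1.0 / 1e-4 - 1.0) * sigma,
    faces_per_pixel=50,
)
silhouette_renderer = MeshRenderer(
    rasterizer=MeshRasterizer(cameras=cameras, raster_settings=raster_settings),
    shader=SoftSilhouetteShader(),
)
# silhouette = silhouette_renderer(template_mesh)[..., 3]  # alpha channel = rendered mask
```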
Table 2. SOTA inpainting methods applied to our synthetic data.

Baseline   Inpainting                   SSIM ↑
Phase I    None                         0.80
Phase I    Navier-Stokes [4]            0.80
Phase I    LaMa [30]                    0.78
Phase I    Stable Diffusion (v2) [24]   0.77
Phase I    Deep Floyd [14]              0.80

Table 3. Inpainting methods trained on our synthetic data.

Baseline   Inpainting             SSIM ↑
Phase I    None                   0.80
Phase I    Cond-TransUNet [6]     0.78
Phase I    ControlNet [39]        0.77
Phase I    Pix2PixHD [33]         0.83

Figure 11. Visualization of the Navier-Stokes method on the UV template. Our locally constrained NS method fills the blanks thoroughly (though lacking precision) compared to the original global counterpart.

7. Self-modified UV-constrained Navier-Stokes Method

In Fig. 11, we display the results of our self-modified UV-constrained Navier-Stokes (NS) method (local) against the original NS (global) method. Specifically, we add a reference branch (UV template) to NS and thus confine the inpainting-affected region to the given UV template of each garment, which contributes directly to the interpolation result. Our locally constrained NS method allows blanks to be filled thoroughly compared to the original global NS method.

The sole aim of modifying the original global NS method is to conduct a fair comparison with the deep learning based methods, as depicted in the main paper.

It is noteworthy that for small blank areas (e.g., columns 1 and 3 of Fig. 11), the texture uniformity and consistency are well preserved, so the method is capable of producing plausible textures.
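A minimal sketch of this UV-constrained variant built on OpenCV's Navier-Stokes inpainting is given below; restricting the inpainting mask and the final composite to the UV template is the essential change, while the blank threshold and names are illustrative:

```python
import cv2
import numpy as np

def uv_constrained_ns_inpaint(coarse_texture: np.ndarray, uv_template: np.ndarray,
                              radius: int = 3) -> np.ndarray:
    """Navier-Stokes inpainting confined to the valid UV region.

    coarse_texture: (H, W, 3) uint8 Phase-I texture with unfilled (blank) texels.
    uv_template:    (H, W) uint8 mask, non-zero inside the garment's UV layout.
    """
    inside_uv = uv_template > 0
    blank = coarse_texture.max(axis=-1) <= 5                 # near-black texels are unfilled
    mask = (blank & inside_uv).astype(np.uint8) * 255        # only inpaint blanks inside the UV
    filled = cv2.inpaint(coarse_texture, mask, radius, cv2.INPAINT_NS)
    # Keep non-UV regions untouched so they cannot bleed into the texture.
    out = coarse_texture.copy()
    out[inside_uv] = filled[inside_uv]
    return out
```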
Figure 12. Visualization of all template meshes used in Cloth2Tex.
Figure 14. Texture maps for training the instance-map-guided Pix2PixHD, synthesized by ControlNet with canny-edge conditioning.
Figure 15. Comparison with representative image-to-image methods with conditional input: the autoencoder-based TransUNet [6] (we modify the base model and add an extra branch for the UV map, aiming to train it on all garment types together), the diffusion-based ControlNet [39] and the GAN-based Pix2PixHD [33]. It is rather obvious that the prompt-sensitive ControlNet is limited in recovering globally color-consistent texture maps. The upper right corner of each method shows the conditional input.