
2022 IEEE International Conference on Internet of Things and Intelligence Systems (IoTaIS)

Human Parsing for Image-Based Virtual Try-On Using Pix2Pix
M. Haswin Anugrah Pratama, Willy Anugrah Cahyadi, Fiky Yosef Suratman

School of Electrical Engineering, Telkom University, Bandung, Indonesia

Abstract— Image-based virtual try-on is a method that can let people try on clothes virtually. One of the challenges in image-based virtual try-on is segmentation. The segmentation needed in the virtual try-on implementation is one that can divide humans into several objects based on their body parts, such as hair, face, neck, hands, upper body, and lower body. This type of segmentation is called human parsing. There are several human parsing methods and datasets that have achieved great results. Unfortunately, some limitations make them unsuitable for an image-based virtual try-on model. We propose human parsing using the Pix2Pix model with the VITON dataset. Our model yields an average accuracy of 89.76%, an average F1-score of 86.80%, and an average IoU of 76.79%. These satisfactory results allow our model to be used in upcoming image-based virtual try-on research.

Keywords—Human Parsing, Pix2Pix, Image Segmentation, Generative Adversarial Networks

I. INTRODUCTION

In recent years, online shopping trends for fashion products have been increasing. This is driving the growth of research on image-based virtual try-on. Image-based virtual try-on is a method that can let people try on clothes virtually. After Jetchev et al. [1] introduced CAGAN, various image-based virtual try-on methods have been proposed [2], [3], [4]. One of the challenges in image-based virtual try-on is segmentation. There are several types of image segmentation, such as semantic segmentation [5], instance segmentation [6], and panoptic segmentation [7], which can distinguish each object. Unfortunately, when applied to person image segmentation, a person is counted as one object. Meanwhile, the segmentation needed in the virtual try-on implementation is one that can divide humans into several objects based on their body parts, such as hair, face, neck, hands, upper body, and lower body. This type of segmentation is called human parsing. Human parsing is a method to decompose human images into semantic body regions, which is an important component for image-based virtual try-on research.

Previously, convolutional neural networks (CNNs) have achieved great results in human parsing. Liang et al. [8] proposed Active Template Regression (ATR), which designed two separate convolutional neural networks to build the end-to-end relation between the input image and the structure outputs. The results obtained are an average accuracy rate of 91.11% and an average F1-score rate of 64.38%. Liang et al. [9] proposed the Contextualized Convolutional Neural Network (Co-CNN), which integrated the cross-layer context, global image-level context, within-super-pixel context, and cross-super-pixel neighborhood context into a unified network. The results obtained are an average accuracy rate of 96.02% and an average F1-score rate of 80.14%. Liang et al. [10] proposed Local-Global Long Short-Term Memory (LG-LSTM), which seamlessly incorporates short-distance and long-distance spatial dependencies into the feature learning over all pixel positions. The results obtained are an average accuracy rate of 96.85% and an average F1-score rate of 84.12%. However, the performance of those CNN-based approaches heavily relies on the availability of annotated training images from the ATR dataset [8]. Unfortunately, the ATR dataset [8] does not come with the pose label annotations required in an image-based virtual try-on model, and the dataset is no longer published.

Inspired by the limitation of the ATR dataset, Gong et al. [11] introduced the LIP dataset, which is annotated with both human parsing and pose labels. There are various pieces of research in human parsing using the LIP dataset [11]. Liang et al. [12] proposed the Self-supervised Structure-sensitive Joint human Parsing and Pose estimation Network (SS-JPPNet), which imposed human pose structures on the parsing results without resorting to extra supervision. The results obtained are an average accuracy rate of 54.94% and an average IoU rate of 44.73%. Chen et al. [13] proposed DeepLabv3+, which extended DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results, especially along object boundaries. The results obtained are an average accuracy rate of 55.62% and an average IoU rate of 44.80%. Wang et al. [14] proposed Hierarchical Human Parsing (HHP), which addressed inference over the loopy human structure. The results obtained are an average accuracy rate of 70.58% and an average IoU rate of 59.25%. The quantitative results obtained are not better than those on the ATR dataset [8] because the LIP dataset [11] contains images of people appearing with challenging poses and viewpoints, heavy occlusions, various appearances, and in a wide range of resolutions. Furthermore, the background of the images in the LIP dataset is also more complex and diverse than in the ATR dataset [8]. For that reason, the LIP dataset [11] is not suitable for use in image-based virtual try-on models.

Recently, Choi et al. [15] proposed High-Resolution Virtual Try-On (VITON-HD), which shows outstanding results as an image-based virtual try-on model. VITON-HD [15] adopts the U-Net architecture [16] for the segmentation generator, which successfully generates segmentation maps. These results inspired us to use the U-Net architecture [16] as part of our human parsing model. Unfortunately, the VITON-HD dataset is not publicly available. Therefore, we use the most widely used dataset for image-based virtual try-on models, namely the VITON dataset [2]. To experiment on VITON, Han et al. [2] collected a dataset consisting of around 19,000 frontal-view women and top clothing image pairs.


Figure 1. Overview of Pix2Pix model for human parsing task


Due to some noisy images being removed, this dataset is reduced to 16,253 images divided into 14,221 training sets and 2,032 testing sets. However, VITON [2] uses an annotation style from the LIP dataset [11] to compute a human segmentation map, where different regions represent different parts of the human body, such as arms and legs. Unlike the LIP dataset [11], the VITON dataset [2] uses a simple plain background for its images, so it can be easily used in an image-based virtual try-on model.

In this work, we want to improve the human parsing model not only by using the U-Net architecture [16] but also by combining it with Generative Adversarial Networks [17]. Therefore, we propose human parsing using the Pix2Pix [18] model, which consists of the U-Net [16] and PatchGAN [18] architectures. The objective of this research is to examine the Pix2Pix model for human parsing tasks using the VITON dataset [2], so that it can be used in upcoming image-based virtual try-on research.

II. RELATED WORK

Human Parsing. Human parsing is a method to decompose human images into semantic body regions, which is an important component for image-based virtual try-on research. The CNN-based human parsing methods [8], [9], [10] use the ATR dataset [8], which contains 17,700 images divided into 16,000 training sets, 1,000 test sets, and 700 validation sets. These studies have achieved great results on human parsing. However, the dataset cannot be used in an image-based virtual try-on model because it has no pose label annotations and is no longer published. The other human parsing methods [12], [13], [14] use the LIP dataset [11], which contains 50,462 images divided into 30,642 training sets, 10,000 test sets, and 10,000 validation sets. The quantitative results obtained may not be better than expected, but the dataset is superior because it has pose label annotations. Unfortunately, it is not suitable for use in image-based virtual try-on models because it has complex and diverse backgrounds, which can complicate the segmentation process. The VITON dataset [2] contains 16,253 images divided into 14,221 training sets and 2,032 test sets. Recent image-based virtual try-on research [15] uses the VITON dataset and adopts the U-Net architecture [16] for the segmentation model. Our purpose is to improve the model by using Pix2Pix, which is part of the Generative Adversarial Network family.

Image Translation with Generative Adversarial Networks. Generative Adversarial Networks (GAN) [17] are an emergent class of deep learning algorithms that generate very realistic images such as faces [19], indoor scenes [20], panoramas [21], and clothes [22]. A GAN consists of two models, namely a generative model (generator) and a discriminative model (discriminator). The generator learns to create fake information that looks real, while the discriminator learns to distinguish the real from the fake. The two models compete with each other and improve their capabilities until the generator is able to generate fake information that the discriminator cannot distinguish. Mirza et al. [23] proposed Conditional Generative Adversarial Networks (CGAN) to allow generating images of a specific class, which became the basic idea for text-to-image synthesis [24], [25] and image-to-image translation [18], [26]. Isola et al. [18] proposed Pix2Pix, which learns the mapping from the input image to the output image. The Pix2Pix model consists of a generator with a U-Net-based architecture [16] and a discriminator with a convolutional PatchGAN classifier architecture [18]. Pix2Pix models can also be used for image segmentation. Therefore, in this work, we use the Pix2Pix model for the human parsing task.

III. PROPOSED METHOD

Pix2Pix is a Conditional Generative Adversarial Network (CGAN) that learns the mapping from the input image to the output image. Technically, the generator G learns the mapping from the real image x together with the noise vector z, G : {x, z} → y, while the discriminator D learns a representation from the labels y together with the real images x, i.e., the pairs {x, y}. In Figure 1, we use the person image x as an input and put it into the generator G. The generator creates the fake parsing image G(x, z), which is then fed into the discriminator D along with the person image x. At the same time, the real parsing image y and the person image x are fed into the discriminator D. So, in the discriminator D we now have two pairs of images: a pair of real images {x, y} and a pair of fake images {x, G(x, z)}. The discriminator distinguishes the real from the fake using the loss function. The discriminator then produces the discriminator loss, while the generator produces the generator loss. After the losses are obtained, the model applies backpropagation to update the weights of the generator G and the discriminator D using the optimizer. The generator G and the discriminator D continue to learn until the generator G is able to generate a fake parsing image G(x, z) that the discriminator D cannot distinguish.
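To make the training procedure above concrete, the following is a minimal TensorFlow sketch of one training step (TensorFlow is the framework named in Section IV). It is an illustration under assumptions rather than the authors' implementation: the generator and discriminator objects and the optimizers are placeholders, the L1 weight of 100 follows Isola et al. [18] because the paper does not state its value, and, as in [18], randomness enters through dropout instead of an explicit noise vector z.

import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

@tf.function
def train_step(x, y, generator, discriminator, gen_opt, disc_opt, lam=100.0):
    """One adversarial update: D sees the real pair {x, y} and the fake pair {x, G(x)}."""
    with tf.GradientTape() as gen_tape, tf.GradientTape() as disc_tape:
        fake_parse = generator(x, training=True)                      # fake parsing image G(x)
        real_logits = discriminator([x, y], training=True)            # real pair {x, y}
        fake_logits = discriminator([x, fake_parse], training=True)   # fake pair {x, G(x)}

        # Discriminator loss: classify the real pair as real and the fake pair as fake.
        disc_loss = (bce(tf.ones_like(real_logits), real_logits)
                     + bce(tf.zeros_like(fake_logits), fake_logits))

        # Generator loss: fool the discriminator and stay close to the ground-truth parse.
        gen_loss = (bce(tf.ones_like(fake_logits), fake_logits)
                    + lam * tf.reduce_mean(tf.abs(y - fake_parse)))

    # Backpropagation: update each network with its own optimizer.
    gen_grads = gen_tape.gradient(gen_loss, generator.trainable_variables)
    disc_grads = disc_tape.gradient(disc_loss, discriminator.trainable_variables)
    gen_opt.apply_gradients(zip(gen_grads, generator.trainable_variables))
    disc_opt.apply_gradients(zip(disc_grads, discriminator.trainable_variables))
    return gen_loss, disc_loss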


Figure 2. U-Net architecture used

A. Loss Function

We use the loss function of a Conditional Generative Adversarial Network, which can be expressed as:

\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))].   (1)

The adversarial loss above consists of one generator and one discriminator that play a minimax game. The discriminator tries to maximize its probability of correctly classifying the real image and the fake image, while the generator tries to minimize the probability that the discriminator will predict its output to be fake.

The adversarial loss is combined with the L1 loss to build a generator that not only tricks the discriminator but also delivers parsing images close to the ground truth:

\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}[\lVert y - G(x, z) \rVert_{1}].   (2)

Because the adversarial loss is combined with the L1 loss for the generator, the total objective is as follows:

G^{*} = \arg\min_{G}\max_{D} \mathcal{L}_{cGAN}(G, D) + \lambda \, \mathcal{L}_{L1}(G).   (3)
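For illustration, Eqs. (1)-(3) can be expressed directly as loss functions in TensorFlow. This is a sketch under assumptions: the logits-based binary cross-entropy form and the weight lambda = 100 follow Isola et al. [18], since the paper does not report the value of lambda.

import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def discriminator_loss(real_logits, fake_logits):
    """Discriminator side of Eq. (1): push D(x, y) toward 1 and D(x, G(x, z)) toward 0."""
    return (bce(tf.ones_like(real_logits), real_logits)
            + bce(tf.zeros_like(fake_logits), fake_logits))

def generator_loss(fake_logits, target, generated, lam=100.0):
    """Generator side of Eq. (1) plus the L1 term of Eq. (2), combined as in Eq. (3)."""
    adversarial = bce(tf.ones_like(fake_logits), fake_logits)
    l1 = tf.reduce_mean(tf.abs(target - generated))    # ||y - G(x, z)||_1
    return adversarial + lam * l1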

B. Network Architecture

The Pix2Pix model consists of a generator with a U-Net-based architecture [16] and a discriminator with a convolutional PatchGAN classifier architecture [18].

U-Net is a CNN-based architecture designed for semantic segmentation tasks on biomedical images. The network is built on a fully convolutional network [5], which has been modified and extended to work with fewer training images and produce more precise segmentation.

U-Net is made up of two paths: a contracting path and an expansive path. The convolutional layers in the contracting path downsample the image while extracting information. The expansive path is constructed from transposed convolutional layers that upsample the image. Figure 2 depicts the procedure that takes place in the U-Net architecture. It begins with the downsampling layers (blue), where every convolutional block extracts data and sends it to the following convolutional block, which extracts more data before it reaches the bottleneck in the center (purple). After the bottleneck, the process continues to the upsampling layers (orange), where every transposed convolutional block expands the data from the preceding block while concatenating the data from the related downsampling layer. Based on this concatenated data, the network can learn to provide an accurate output.
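The contracting and expansive paths with skip connections described above could be sketched as follows. The filter counts, the 4 x 4 kernels, and the stride-2 sampling are assumptions borrowed from the pix2pix-style U-Net in [18] rather than values reported here; batch normalization, the activation choices, and the dropout placement are specified in Section IV-A and are omitted from this skeleton.

import tensorflow as tf
from tensorflow.keras import layers

def build_unet_generator(input_shape=(256, 256, 3), out_channels=3):
    """U-Net skeleton: a contracting path, a bottleneck, and an expansive path
    whose blocks concatenate the features of the matching downsampling layer."""
    inputs = tf.keras.Input(shape=input_shape)

    # Contracting path: strided convolutions downsample while extracting features.
    skips, x = [], inputs
    for filters in (64, 128, 256, 512, 512, 512, 512):
        x = layers.Conv2D(filters, 4, strides=2, padding="same")(x)
        x = layers.LeakyReLU(0.2)(x)
        skips.append(x)

    # Bottleneck at the center of the network.
    x = layers.Conv2D(512, 4, strides=2, padding="same", activation="relu")(x)

    # Expansive path: transposed convolutions upsample and concatenate the
    # corresponding downsampling features (the skip connections).
    for filters, skip in zip((512, 512, 512, 512, 256, 128, 64), reversed(skips)):
        x = layers.Conv2DTranspose(filters, 4, strides=2, padding="same", activation="relu")(x)
        x = layers.Concatenate()([x, skip])

    outputs = layers.Conv2DTranspose(out_channels, 4, strides=2, padding="same",
                                     activation="tanh")(x)
    return tf.keras.Model(inputs, outputs)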
The discriminator uses the PatchGAN classifier architecture [18], as seen in Figure 3. Several convolutional blocks are stacked in this architecture. To determine whether an image is real or fake, the discriminator checks an N x N patch of the image, where N can be any size. The patch may be much smaller than the input image yet still deliver an excellent outcome. The discriminator is applied convolutionally across the entire image. Furthermore, the discriminator has fewer parameters than the generator, so its forward pass can be faster.

Figure 3. PatchGAN architecture used
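A minimal sketch of such a patch-based discriminator, following the PatchGAN design of [18], is given below. The filter counts and the 4 x 4 stride-2 convolutions are assumptions, and the resulting patch size N is simply whatever spatial extent remains after the convolution stack; the paper does not report the value it uses.

import tensorflow as tf
from tensorflow.keras import layers

def build_patchgan_discriminator(input_shape=(256, 256, 3)):
    """PatchGAN sketch: the person image and a parsing map are concatenated and
    reduced to an N x N grid of logits, one real/fake decision per patch."""
    person = tf.keras.Input(shape=input_shape, name="person_image")
    parse = tf.keras.Input(shape=input_shape, name="parsing_map")
    x = layers.Concatenate()([person, parse])   # condition the discriminator on the input image

    # Strided convolutions shrink the pair into a coarse patch map.
    for filters, stride in ((64, 2), (128, 2), (256, 2), (512, 1)):
        x = layers.Conv2D(filters, 4, strides=stride, padding="same")(x)
        x = layers.LeakyReLU(0.2)(x)

    # One logit per patch; no sigmoid because the loss is computed from logits.
    patch_logits = layers.Conv2D(1, 4, strides=1, padding="same")(x)
    return tf.keras.Model([person, parse], patch_logits)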
IV. EXPERIMENTS

A. Experiment Setup

Dataset. We used the VITON dataset [2], which contains 16,253 images divided into 14,221 training sets and 2,032 test sets at a 192 x 256 resolution. The dataset consists of four categories: cloth, cloth-mask, image, and image-parse. For the human parsing model, the only categories used are image and image-parse. Before the dataset is used in the model, there are several pre-processing steps. First, the images must be resized to a 256 x 256 resolution. Then, the dataset must be concatenated for each pair of the image and image-parse categories. Finally, the dataset is ready to use.
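The pre-processing steps above could be implemented with tf.data roughly as follows. The file paths, the batch size, and the [-1, 1] normalization are illustrative assumptions, not details given in the paper; only the 256 x 256 resizing and the pairing of image with image-parse come from the text.

import tensorflow as tf

def load_pair(image_path, parse_path):
    """Read one person image and its parsing map, resize both from 192 x 256
    to 256 x 256, and scale pixel values to [-1, 1]."""
    def read(path):
        img = tf.io.decode_image(tf.io.read_file(path), channels=3, expand_animations=False)
        img = tf.image.resize(img, (256, 256))
        return tf.cast(img, tf.float32) / 127.5 - 1.0
    return read(image_path), read(parse_path)

def make_dataset(image_paths, parse_paths, batch_size=1):
    """Build a dataset of paired (person image, parsing map) tensors."""
    ds = tf.data.Dataset.from_tensor_slices((image_paths, parse_paths))
    ds = ds.map(load_pair, num_parallel_calls=tf.data.AUTOTUNE)
    return ds.shuffle(512).batch(batch_size).prefetch(tf.data.AUTOTUNE)

# Example with hypothetical folder names for the image and image-parse categories:
# train_ds = make_dataset(sorted(tf.io.gfile.glob("viton/train/image/*.jpg")),
#                         sorted(tf.io.gfile.glob("viton/train/image-parse/*.png")))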


Figure 4. Human parsing results using Pix2Pix


Training and Inference. The code was developed in TensorFlow and run on an Nvidia K80 GPU with 12 GB of RAM. For training, we utilize minibatch SGD using the Adam optimizer for both the generator and the discriminator, with a learning rate of 0.0002 and momentum parameter β1 = 0.5. For the generator, we use batch normalization at all layers, LeakyReLU as the activation function in the downsampling layers, and ReLU as the activation function in the upsampling layers. We use dropout, but it is applied only to the first three blocks of the upsampling layers. For the discriminator, we use batch normalization and LeakyReLU at all layers.
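The layer configuration and optimizer settings just described can be captured in a short sketch. Only the Adam hyperparameters, the batch normalization, the activation choices, and the dropout placement come from the text above; the 4 x 4 kernels, the stride-2 sampling, and the 0.5 dropout rate are assumptions carried over from [18].

import tensorflow as tf
from tensorflow.keras import layers

def down_block(filters):
    """Generator downsampling block: convolution, batch normalization, LeakyReLU."""
    return tf.keras.Sequential([
        layers.Conv2D(filters, 4, strides=2, padding="same", use_bias=False),
        layers.BatchNormalization(),
        layers.LeakyReLU(0.2),
    ])

def up_block(filters, dropout=False):
    """Generator upsampling block: transposed convolution, batch normalization,
    optional dropout (first three upsampling blocks only), ReLU."""
    block = tf.keras.Sequential([
        layers.Conv2DTranspose(filters, 4, strides=2, padding="same", use_bias=False),
        layers.BatchNormalization(),
    ])
    if dropout:
        block.add(layers.Dropout(0.5))   # rate assumed from [18]; the paper does not state it
    block.add(layers.ReLU())
    return block

# Adam for both networks with learning rate 0.0002 and momentum parameter beta_1 = 0.5.
generator_optimizer = tf.keras.optimizers.Adam(learning_rate=2e-4, beta_1=0.5)
discriminator_optimizer = tf.keras.optimizers.Adam(learning_rate=2e-4, beta_1=0.5)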
B. Evaluation Metrics

Pixel Accuracy. Pixel accuracy is a popular metric that reports the percentage of pixels in an image that were correctly classified.

Intersection over Union. The Intersection over Union (IoU) metric, also known as the Jaccard index, quantifies the percent overlap between the target mask and the prediction output. The IoU metric divides the number of pixels shared by the target and prediction masks by the total number of pixels present across both masks.

F1-score. The F1-score metric, also known as the Dice coefficient, follows the same concept as the IoU metric. The difference is that the F1-score takes the intersection of the target and prediction masks, multiplies it by two, and divides it by the total number of pixels in both masks.
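One possible implementation of the three metrics is sketched below. It assumes the color-coded parsing maps have first been converted to integer class labels, and it averages IoU and F1 over the classes present; the paper does not spell out its exact averaging protocol, so this is an illustrative convention rather than the authors' evaluation code.

import numpy as np

def parsing_metrics(pred, target, num_classes):
    """Pixel accuracy over the whole label map, plus IoU (Jaccard) and F1 (Dice)
    averaged over the parsing classes. Both inputs are integer label maps of
    identical shape."""
    accuracy = float((pred == target).mean())        # correctly classified pixels

    ious, f1s = [], []
    for c in range(num_classes):
        p, t = (pred == c), (target == c)
        intersection = np.logical_and(p, t).sum()    # pixels shared by both masks
        union = np.logical_or(p, t).sum()            # pixels present across both masks
        total = p.sum() + t.sum()
        if union == 0:                               # class absent from both maps: skip it
            continue
        ious.append(intersection / union)
        f1s.append(2 * intersection / total)
    return accuracy, float(np.mean(ious)), float(np.mean(f1s))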
C. Quantitative Analysis

We compare our proposed method with six state-of-the-art approaches [8], [9], [10], [12], [13], and [14], as shown in Table 1. Our method significantly outperforms three baselines in terms of average pixel accuracy: 34.82% over SS-JPPNet [12], 34.14% over DeepLabv3+ [13], and 19.18% over HHP [14]. Our method also significantly outperforms three baselines in terms of average F1-score: 22.42% over ATR [8], 6.66% over Co-CNN [9], and 2.68% over LG-LSTM [10]. Finally, our method significantly outperforms three baselines in terms of average IoU: 32.06% over SS-JPPNet [12], 31.99% over DeepLabv3+ [13], and 17.54% over HHP [14]. This demonstrates that our method performs very well as a human parsing model.
D. Qualitative Analysis

Figure 4 presents the visual results of our method. The level of color similarity between the target image and the predicted image for each body part is quite satisfactory. The disadvantage of this method is that the resulting image still has some blurry parts, especially on the sides of the body parts.

V. CONCLUSIONS

We solved the human parsing task with a Pix2Pix model consisting of a U-Net generator and a PatchGAN discriminator. Our model has an average accuracy of 89.76%, an average F1-score of 86.80%, and an average IoU of 76.79%, which outperforms several state-of-the-art human parsing models. The disadvantage of this model is that the resulting image contains a few blurry areas, particularly on the sides of the body parts. These satisfactory results allow our model to be used in upcoming image-based virtual try-on research.

REFERENCES


[1] N. Jetchev and U. Bergmann, "The Conditional Analogy GAN: Swapping Fashion Articles on People Images," in 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), Oct. 2017, pp. 2287–2292. doi: 10.1109/ICCVW.2017.269.
[2] X. Han, Z. Wu, Z. Wu, R. Yu, and L. S. Davis, "VITON: An Image-Based Virtual Try-on Network," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2018, pp. 7543–7552. doi: 10.1109/CVPR.2018.00787.
[3] B. Wang, H. Zheng, X. Liang, Y. Chen, L. Lin, and M. Yang, "Toward Characteristic-Preserving Image-Based Virtual Try-On Network," in European Conference on Computer Vision (ECCV), vol. 15, 2018, pp. 607–623. doi: 10.1007/978-3-030-01261-8_36.
[4] H. Yang, R. Zhang, X. Guo, W. Liu, W. Zuo, and P. Luo, "Towards Photo-Realistic Virtual Try-On by Adaptively Generating↔Preserving Image Content," in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2020, pp. 7847–7856. doi: 10.1109/CVPR42600.2020.00787.
[5] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2015, pp. 3431–3440. doi: 10.1109/CVPR.2015.7298965.
[6] K. He, G. Gkioxari, P. Dollar, and R. Girshick, "Mask R-CNN," in 2017 IEEE International Conference on Computer Vision (ICCV), Oct. 2017, pp. 2980–2988. doi: 10.1109/ICCV.2017.322.
[7] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollar, "Panoptic Segmentation," in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2019, pp. 9396–9405. doi: 10.1109/CVPR.2019.00963.
[8] X. Liang et al., "Deep Human Parsing with Active Template Regression," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 12, pp. 2402–2414, Dec. 2015, doi: 10.1109/TPAMI.2015.2408360.
[9] X. Liang et al., "Human Parsing with Contextualized Convolutional Neural Network," in 2015 IEEE International Conference on Computer Vision (ICCV), Dec. 2015, pp. 1386–1394. doi: 10.1109/ICCV.2015.163.
[10] X. Liang, X. Shen, D. Xiang, J. Feng, L. Lin, and S. Yan, "Semantic Object Parsing with Local-Global Long Short-Term Memory," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2016, pp. 3185–3193. doi: 10.1109/CVPR.2016.347.
[11] K. Gong, X. Liang, D. Zhang, X. Shen, and L. Lin, "Look into Person: Self-Supervised Structure-Sensitive Learning and a New Benchmark for Human Parsing," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017, pp. 6757–6765. doi: 10.1109/CVPR.2017.715.
[12] X. Liang, K. Gong, X. Shen, and L. Lin, "Look into Person: Joint Body Parsing & Pose Estimation Network and a New Benchmark," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 4, pp. 871–885, Apr. 2019, doi: 10.1109/TPAMI.2018.2820063.
[13] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, "Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation," in European Conference on Computer Vision (ECCV), vol. 15, 2018, pp. 833–851. doi: 10.1007/978-3-030-01234-2_49.
[14] W. Wang, H. Zhu, J. Dai, Y. Pang, J. Shen, and L. Shao, "Hierarchical Human Parsing with Typed Part-Relation Reasoning," in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2020, pp. 8926–8936. doi: 10.1109/CVPR42600.2020.00895.
[15] S. Choi, S. Park, M. Lee, and J. Choo, "VITON-HD: High-Resolution Virtual Try-On via Misalignment-Aware Normalization," in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2021, pp. 14126–14135. doi: 10.1109/CVPR46437.2021.01391.
[16] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional Networks for Biomedical Image Segmentation," in Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015, pp. 234–241. doi: 10.1007/978-3-319-24574-4_28.
[17] I. Goodfellow et al., "Generative Adversarial Networks," International Conference on Neural Information Processing Systems (NIPS), vol. 27, pp. 2672–2680, Oct. 2014, doi: 10.1145/3422622.
[18] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-Image Translation with Conditional Adversarial Networks," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017, pp. 5967–5976. doi: 10.1109/CVPR.2017.632.
[19] A. Radford, L. Metz, and S. Chintala, "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks," in International Conference on Learning Representations (ICLR), Oct. 2016, pp. 1–16.
[20] X. Wang and A. Gupta, "Generative Image Modeling Using Style and Structure Adversarial Networks," in European Conference on Computer Vision (ECCV), vol. 14, 2016, pp. 318–335. doi: 10.1007/978-3-319-46493-0_20.
[21] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros, "Generative Visual Manipulation on the Natural Image Manifold," in European Conference on Computer Vision (ECCV), vol. 14, 2016, pp. 597–613. doi: 10.1007/978-3-319-46454-1_36.
[22] D. Yoo, N. Kim, S. Park, A. S. Paek, and I. S. Kweon, "Pixel-Level Domain Transfer," in European Conference on Computer Vision (ECCV), vol. 14, 2016, pp. 517–532. doi: 10.1007/978-3-319-46484-8_31.
[23] M. Mirza and S. Osindero, "Conditional Generative Adversarial Nets," Computing Research Repository (CoRR), pp. 1–7, Nov. 2014.
[24] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, "Generative Adversarial Text to Image Synthesis," in International Conference on Machine Learning (ICML), May 2016, pp. 1060–1069.
[25] T. Xu et al., "AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2018, pp. 1316–1324. doi: 10.1109/CVPR.2018.00143.
[26] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks," in 2017 IEEE International Conference on Computer Vision (ICCV), Oct. 2017, pp. 2242–2251. doi: 10.1109/ICCV.2017.244.

