Human Parsing For Image-Based Virtual Try-On Using Pix2Pix
Abstract— Image-based virtual try-on is a method that can let people try on clothes virtually. One of the challenges in image-based virtual try-on is segmentation. The segmentation needed in a virtual try-on implementation is one that can divide humans into several objects based on their body parts, such as hair, face, neck, hands, upper body, and lower body. This type of segmentation is called human parsing. Several human parsing methods and datasets have achieved great results. Unfortunately, some limitations make these methods unsuitable for an image-based virtual try-on model. We propose human parsing using the Pix2Pix model with the VITON dataset. Our model yields an average accuracy of 89.76%, an average F1-score of 86.80%, and an average IoU of 76.79%. These satisfactory results allow our model to be used in upcoming image-based virtual try-on research.

Keywords—Human Parsing, Pix2Pix, Image Segmentation, Generative Adversarial Networks

I. INTRODUCTION

In recent years, online shopping trends for fashion products have been increasing. This is driving the growth of research on image-based virtual try-on. Image-based virtual try-on is a method that can let people try on clothes virtually. After Jetchev et al. [1] introduced CAGAN, various image-based virtual try-on methods have been proposed [2], [3], [4]. One of the challenges in image-based virtual try-on is segmentation. There are several types of image segmentation, such as semantic segmentation [5], instance segmentation [6], and panoptic segmentation [7], which can distinguish each object. Unfortunately, when applied to person image segmentation, a person is counted as a single object. Meanwhile, the segmentation needed in a virtual try-on implementation is one that can divide humans into several objects based on their body parts, such as hair, face, neck, hands, upper body, and lower body. This type of segmentation is called human parsing. Human parsing is a method to decompose human images into semantic body regions, which is an important component of image-based virtual try-on research.

Previously, convolutional neural networks (CNNs) have achieved great results in human parsing. Liang et al. [8] proposed Active Template Regression (ATR), which designed two separate convolutional neural networks to build the end-to-end relation between the input image and the structure outputs. The results obtained are an average accuracy rate of 91.11% and an average F1-score rate of 64.38%. Liang et al. [9] proposed the Contextualized Convolutional Neural Network (Co-CNN), which integrated the cross-layer context, global image-level context, within-super-pixel context, and cross-super-pixel neighborhood context into a unified network. The results obtained are an average accuracy rate of 96.02% and an average F1-score rate of 80.14%. Liang et al. [10] proposed Local-Global Long Short-Term Memory (LG-LSTM), which seamlessly incorporates short-distance and long-distance spatial dependencies into the feature learning over all pixel positions. The results obtained are an average accuracy rate of 96.85% and an average F1-score rate of 84.12%. However, the performance of those CNN-based approaches heavily relies on the availability of annotated images for training from the ATR dataset [8]. Unfortunately, the ATR dataset [8] does not come with the pose label annotations required in an image-based virtual try-on model, and this dataset is no longer published.

Inspired by the limitation of the ATR dataset, Gong et al. [11] introduced the LIP dataset, which is annotated with both human parsing and pose labels. There are various pieces of research in human parsing using the LIP dataset [11]. Liang et al. [12] proposed the Self-supervised Structure-sensitive Joint human Parsing and Pose estimation Network (SS-JPPNet), which imposed human pose structures on the parsing results without resorting to extra supervision. The results obtained are an average accuracy rate of 54.94% and an average IoU rate of 44.73%. Chen et al. [13] proposed DeepLabv3+, which extended DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results, especially along object boundaries. The results obtained are an average accuracy rate of 55.62% and an average IoU rate of 44.80%. Wang et al. [14] proposed Hierarchical Human Parsing (HHP), which addressed the inference over the loopy human structure. The results obtained are an average accuracy rate of 70.58% and an average IoU rate of 59.25%. These quantitative results are lower than those on the ATR dataset [8] because the LIP dataset [11] contains images of people appearing with challenging poses and viewpoints, heavy occlusions, various appearances, and in a wide range of resolutions. Furthermore, the background of the images in the LIP dataset is also more complex and diverse than in the ATR dataset [8]. For that reason, the LIP dataset [11] is not suitable for use in image-based virtual try-on models.

Recently, Choi et al. [15] proposed High-Resolution Virtual Try-On (VITON-HD), which shows outstanding results as an image-based virtual try-on model. VITON-HD [15] adopts the U-Net architecture [16] for its segmentation generator, which successfully generates segmentation maps. These results inspired us to use the U-Net architecture [16] as part of our human parsing model. Unfortunately, the VITON-HD dataset is not publicly available. Therefore, we use the most widely used dataset for image-based virtual try-on models, namely the VITON dataset [2]. To experiment on VITON, Han et al. [2] collected a dataset consisting of around 19,000 frontal-view women and top clothing image pairs. Due to some noisy
B. Network Architecture

The Pix2Pix model consists of a generator with a U-Net-based architecture [16] and a discriminator with a convolutional PatchGAN classifier architecture [18].

Figure 3. PatchGAN architecture used
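To make the discriminator concrete, the following is a minimal TensorFlow/Keras sketch of a PatchGAN-style discriminator. The filter progression (C64-C128-C256-C512) follows the original Pix2Pix design [18], while the input shapes, strides, initializer settings, and layer names are simplified, illustrative assumptions rather than our exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def patchgan_discriminator(input_shape=(256, 256, 3)):
    """Sketch of a PatchGAN-style discriminator (filter progression from Pix2Pix [18])."""
    init = tf.keras.initializers.RandomNormal(stddev=0.02)

    # Conditional input: the person image and the (real or generated) parsing map.
    person = layers.Input(shape=input_shape, name="person_image")
    parse = layers.Input(shape=input_shape, name="parsing_map")
    x = layers.Concatenate()([person, parse])

    # C64-C128-C256-C512: 4x4 convolutions with stride 2 and LeakyReLU activations;
    # batch normalization is skipped in the first block, as in Pix2Pix.
    for filters, use_norm in [(64, False), (128, True), (256, True), (512, True)]:
        x = layers.Conv2D(filters, 4, strides=2, padding="same",
                          kernel_initializer=init)(x)
        if use_norm:
            x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU(0.2)(x)

    # Final 1-channel map: each output value scores one image patch as real or fake.
    patch_scores = layers.Conv2D(1, 4, padding="same", kernel_initializer=init)(x)
    return tf.keras.Model([person, parse], patch_scores)
```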
U-Net is a CNN-based architecture designed for semantic segmentation tasks on biomedical images. The network is built on a fully convolutional network [5], which has been modified and extended to work with fewer training images and to produce more precise segmentation.

U-Net is made up of two paths: a contracting path and an expansive path. The convolutional layers in the contracting path downsample the image while extracting information. The expansive path is constructed from transposed convolutional layers that upsample the image. Figure 2 depicts the procedure that takes place in the U-Net architecture. It begins with the downsampling layers (blue), where every convolutional block extracts data and sends it into the following convolutional block, which extracts more data before it hits the bottleneck in the center (purple). When the process hits the bottleneck, it continues to the upsampling layers (orange), where every transposed convolutional block expands data from the preceding block while concatenating data from the related downsampling layer. Based on this data, the network can produce a more precise segmentation.
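As a sketch of this structure, the TensorFlow/Keras snippet below builds a small U-Net-style generator with strided-convolution downsampling blocks, transposed-convolution upsampling blocks, and skip connections. The depth, filter counts, and tanh output activation are assumptions chosen for brevity and do not necessarily match the exact configuration of our generator.

```python
import tensorflow as tf
from tensorflow.keras import layers

def downsample(filters, apply_norm=True):
    """Contracting-path block: a 4x4 strided convolution halves the spatial resolution."""
    block = tf.keras.Sequential()
    block.add(layers.Conv2D(filters, 4, strides=2, padding="same", use_bias=False))
    if apply_norm:
        block.add(layers.BatchNormalization())
    block.add(layers.LeakyReLU(0.2))
    return block

def upsample(filters):
    """Expansive-path block: a transposed convolution doubles the spatial resolution."""
    block = tf.keras.Sequential()
    block.add(layers.Conv2DTranspose(filters, 4, strides=2, padding="same", use_bias=False))
    block.add(layers.BatchNormalization())
    block.add(layers.ReLU())
    return block

def unet_generator(input_shape=(256, 256, 3), out_channels=3):
    """U-Net generator sketch: encoder features are concatenated back in the decoder."""
    inputs = layers.Input(shape=input_shape)
    down_stack = [downsample(64, apply_norm=False), downsample(128),
                  downsample(256), downsample(512)]
    up_stack = [upsample(256), upsample(128), upsample(64)]

    x, skips = inputs, []
    for down in down_stack:                               # contracting path
        x = down(x)
        skips.append(x)
    for up, skip in zip(up_stack, reversed(skips[:-1])):  # expansive path
        x = up(x)
        x = layers.Concatenate()([x, skip])               # skip connection
    outputs = layers.Conv2DTranspose(out_channels, 4, strides=2,
                                     padding="same", activation="tanh")(x)
    return tf.keras.Model(inputs, outputs)
```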
IV. EXPERIMENTS

A. Experiment Setup

Dataset. We used the VITON dataset [2], which contains 16,253 images divided into a training set of 14,221 images and a test set of 2,032 images at a 192 x 256 resolution. The dataset consists of four categories: cloth, cloth-mask, image, and image-parse. For the human parsing model, the only categories used are image and image-parse. Before the dataset is used in the model, there are several pre-processing steps. First, the images are resized to a 256 x 256 resolution. Then, each image is concatenated with its paired image-parse map. After these steps, the dataset is ready to use.
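These pre-processing steps can be expressed as a short TensorFlow input function. The sketch below assumes each person image is paired with its image-parse map by file path and normalizes pixel values to [-1, 1] (a common Pix2Pix convention); neither detail is stated explicitly above.

```python
import tensorflow as tf

def load_pair(image_path, parse_path):
    """Resize a person image and its image-parse map to 256x256 and concatenate them."""
    def read(path):
        raw = tf.io.read_file(path)
        img = tf.io.decode_image(raw, channels=3, expand_animations=False)
        img = tf.image.resize(img, [256, 256])      # step 1: resize to 256 x 256
        return img / 127.5 - 1.0                    # scale pixels to [-1, 1] (assumed)
    person = read(image_path)                       # "image" category
    parse = read(parse_path)                        # "image-parse" category
    return tf.concat([person, parse], axis=-1)      # step 2: concatenate the pair

# Illustrative usage with a tf.data pipeline (the file lists are hypothetical):
# dataset = tf.data.Dataset.from_tensor_slices((image_paths, parse_paths))
# dataset = dataset.map(load_pair).batch(1)
```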
Training and Inference. The code was developed in TensorFlow and ran on an Nvidia K80 GPU with 12 GB of RAM. For training, we use minibatch SGD with the Adam optimizer for both the generator and the discriminator, with a learning rate of 0.0002 and momentum parameter β1 = 0.5.
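In TensorFlow/Keras, this setting corresponds to the following optimizer setup; β2 is assumed to keep its Adam default of 0.999 since only β1 is reported.

```python
import tensorflow as tf

# Adam optimizers for the generator and the discriminator with the reported
# hyperparameters (learning rate 0.0002, beta_1 = 0.5); beta_2 is left at the
# Keras default of 0.999, which is an assumption.
generator_optimizer = tf.keras.optimizers.Adam(learning_rate=2e-4, beta_1=0.5)
discriminator_optimizer = tf.keras.optimizers.Adam(learning_rate=2e-4, beta_1=0.5)
```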