
Machine Learning with Applications 6 (2021) 100118


A federated approach for fine-grained classification of fashion apparel


Tejaswini Mallavarapu a,b, Luke Cranfill a, Eun Hye Kim a, Reza M. Parizi c, John Morris d, Junggab Son a,∗
a Information and Intelligent Security (IIS) Lab, Kennesaw State University, Marietta, GA 30060, USA
b Analytics and Data Science Institute, Kennesaw State University, Marietta, GA 30060, USA
c Decentralized Science Lab, Kennesaw State University, Marietta, GA 30060, USA
d Oracle, Retail Global Business Unit, Atlanta, GA, USA

ARTICLE INFO

Keywords: Apparel attributes; Apparel classification; Fine-grained classification; Human keypoints detection

ABSTRACT

As online retail services proliferate and are pervasive in modern lives, applications for classifying fashion apparel features from image data are becoming more indispensable. Online retailers, from leading companies to start-ups, can leverage such applications in order to increase profit margin and enhance the consumer experience. Many notable schemes have been proposed to classify fashion items; however, the majority of such schemes have focused upon classifying basic-level categories, such as T-shirts, pants, skirts, shoes, bags, and so forth. In contrast to most prior efforts, this paper aims to enable an in-depth classification of fashion item attributes within the same category. Beginning with a single dress, we seek to classify the type of dress hem, the hem length, and the sleeve length. The proposed scheme is comprised of three major stages: (a) localization of a target item from an input image using semantic segmentation, (b) detection of human key points (e.g., point of shoulder) using a pre-trained CNN and a bounding box, and (c) three-phase classification of the attributes using a combination of algorithmic approaches and deep neural networks. The experimental results demonstrate that the proposed scheme is highly effective, with all categories having an average precision above 93.02%, and outperforms existing Convolutional Neural Network (CNN)-based schemes.

1. Introduction

An application for classifying fashion apparel features in images (e.g., sleeve length, hem length, skirt style, and so forth) can be used by online retailers to help achieve a variety of goals. The merit of this proposed application arises from the fact that the current annotation or attributing of fashion apparel has not kept pace with the retailer's ability to effectively use detailed item attributes. Detailed attributes, beyond style and color, can be used for such tasks as price optimization and similar item searches. The former can contribute to an increased profit margin and the latter can contribute to an enhanced consumer experience. In any case, the key insight is that a host of useful features are embedded in the retailer's fashion apparel image assets. If one had a means of extracting those features, then one could use these extracted features to annotate fashion apparel items with a consistent and coherent set of detailed attributes. Ultimately, the case can be made that an application for classifying fashion apparel features in a retailer's image assets would contribute substantially to closing the gap between the retailer's aspirations and the current reality with regard to detailed fashion apparel attributes.

Several schemes have been proposed to classify fashion items. Liu et al. proposed a human detector based on multiple features extracted by histograms of oriented gradients, local binary patterns, and color (Liu et al., 2012). Eshwar et al. proposed an apparel classification using Convolutional Neural Networks (CNNs) (Eshwar et al., 2016). Their scheme could classify five classes of T-shirts, pants, footwear, saree, and women kurta, each of which shows a large discrepancy from the other classes. Iliukovich-Strakovskaia et al. proposed a fine-grained apparel categorization system with two-level classes (Iliukovich-Strakovskaia et al., 2016). It has 10 super-classes, such as high heels and slip-on shoes, each of which contains 14 classes. Seo et al. proposed a fashion image classification by applying a hierarchical classification structure, named Hierarchical CNN (H-CNN), to apparel images (Seo & Shin, 2019). They defined three levels of hierarchical classes to reduce the classification error of labeling. For every fashion item, their scheme outputs three hierarchical labels, e.g., 'Clothes'-'Tops'-'T-shirt'. These results give novelty and show possibilities in fashion apparel classification.


Much of the extant detection, localization, and classification work focuses upon basic-level categories, which for people provide maximum information with the least cognitive effort. Specifically, one replaces a host of features with a symbol (e.g., "dog" or "bird") that encompasses most, but not all, features one might encounter in a given situation. Such classification work is inherently built on the notion that people perceive the world around them as structured information rather than as arbitrary or unpredictable attributes. Our work, in contrast to most prior efforts, seeks not to distinguish between categories, but to distinguish between instances of the same category in a meaningful and consistent manner. Dresses, for example, have numerous subordinate features depending on their length (Koester & Bryant, 1991). Sleeve lengths categorize them into cap, short, elbow length, bracelet, long, and angel. Hem lengths categorize them into mini, above-the-knee, knee, below-the-knee, mid-calf, evening, and floor.

Moreover, we do not begin our analysis of an image with vague generalities concerning its content. Rather, we begin our analysis of an image with numerous priors. Specifically, we know a priori what the image we are analyzing depicts, e.g., a model wearing a dress against a neutral background—prior use of images in catalogs and on websites ensures that basic category information is available. Given that we know the instance category, we also know what to look for next, e.g., hem lengths, sleeve lengths, necklines, and so on. Thus, our problem begins with localizing a known target category and then classifying the localized category instance with respect to one or more features known to be present, such as hem length for dresses.

On the basis of these observations, we propose a federated approach that consists of both deep neural networks and an algorithmic approach. Specifically, our proposed scheme is constructed using SegNet (Badrinarayanan et al., 2017) to localize target categories and an algorithmic approach to classify localized category instances. The algorithmic approach leverages human key-point detection (Cao et al., 2021) and bounding box schemes in order to improve the classification performance.

We analyzed the performance of our model on classifying attributes of localized categories, such as hem length, sleeve length, and hem style, against three major CNN variant models, i.e., standard CNN, VGG16, and VGG19, using the metrics of precision, recall, and f1-score. As the results demonstrate, the average f1-score of our model for the hem length category is 97.12%. For the sleeve length category, the average f1-score of our model is 89.45%. Similarly, for the hem style category, our model achieved the highest f1-score of 85.04%.

The main contributions of this paper include:

• To the best of our knowledge, this is the first approach to classify fine-grained fashion apparel attributes. Specifically, a federated approach for effective classification of the attributes is proposed by combining deep neural networks and algorithms.
• The best ratios defining the attributes in an image are calculated based on measurements used in the fashion industry.
• Experimental results show that our approach outperforms CNN variants. The average precision of our approach is 93.03%, while those of the standard CNN, VGG16, and VGG19 are 65%, 60.57%, and 59.74%, respectively.

The remainder of this paper is organized as follows. Section 2 gives the related work. Section 3 provides essential background and building blocks underpinning the research. The proposed scheme is presented in Section 4, together with the evaluation results with respect to the effectiveness of the proposed scheme. Section 5 discusses practical implications, and Section 6 concludes this paper.

2. Related work

Image classification is a supervised learning problem that aims to identify objects or scenes in images. The advent of machine learning has helped propel this field forward, with the creation of CNNs being the leading factor. This section introduces related work in the fashion item detection literature and a number of key computer vision components.

2.1. Semantic segmentation

Fashion item classification is an area of research that uses image segmentation to great effect. Although many schemes have been developed and utilized, the Fully Convolutional Network (FCN) is one of the most popular algorithms (Shelhamer et al., 2014). As the name suggests, the model is fully convolutional, containing no fully connected layers. This allows the algorithm to receive image input of any size and circumvents difficulties such as pre- and post-processing complications.

A deep convolutional encoder–decoder architecture for image segmentation (SegNet) is worthy of remark (Badrinarayanan et al., 2017; Hu et al., 2008). SegNet consists of an encoder network, a decoder network, and a pixel-wise classification layer. The encoder network corresponds to the 13 convolutional layers in the VGG16 network introduced in Simonyan and Zisserman (2014). It performs convolution with a filter bank to produce a set of feature maps.

Other deep learning architectures have also been introduced for semantic segmentation, including U-Net (Ronneberger et al., 2015), the Region-based Convolutional Neural Network (R-CNN) family (Girshick, 2015; He et al., 2017, 2015; Ren et al., 2015; Shrivastava et al., 2014), and YOLO (Howard et al., 2017; Redmon & Farhadi, 2018). U-Net is similar to FCNs; however, it is modified in a way that gives better segmentation in medical images (Ronneberger et al., 2015). The main differences, when compared to FCNs, are that U-Net is symmetric and that the skip connections apply a concatenation operator between the down-sampling path and the up-sampling path. One of the main advantages of U-Net is that it is much faster than FCN and Mask R-CNN (He et al., 2017), with the segmentation of a 512 × 512 image taking less than 1 s on a GPU.

Faster R-CNN is an algorithm that is typically used for object detection (He et al., 2017). It contains two steps: a Region Proposal Network (RPN) that generates proposals for object bounding boxes, and Fast R-CNN, which extracts features using RoIPool from the object bounding boxes and performs classification (Ronneberger et al., 2015). Mask R-CNN is an extension of Faster R-CNN in which a third branch outputs the binary object mask. However, since a much finer spatial layout of the input is required for the additional mask output, Mask R-CNN uses an FCN, and Mask R-CNN produces more accurate outputs than FCN.

2.2. Pose estimation

Pose estimation or landmark detection focuses on detecting key features in an image, e.g., the eyes and nose of a human. The beginning of rapid expansion in this field can be traced back to when neural networks were first used for pose estimation in 2014 (Toshev & Szegedy, 2014). Following this, another influential work was published in 2015, a model that outputs the results as a heat map instead of a single point (Tompson et al., 2015). The practice of using heat maps as output for landmark detection has shaped many different models. Many architectures in human pose estimation, and particularly in fashion item detection, now use the multi-stage architecture first developed in 2016 (Wei et al., 2016). Other architectures have taken a different approach, using a self-correcting model based on Iterative Error Feedback, that is, errors are corrected iteratively instead of all at once (Carreira et al., 2016). In 2016, a new model used an "hourglass" architecture that performs repeated up-sampling and down-sampling, and at the time it was released it was the highest performing architecture in its field (Newell et al., 2016). Pose estimation was characterized by large and computationally expensive models until an archived work attempted to make the simplest architecture possible and performed competitively on the COCO dataset (Xiao et al., 2018a). As of 2019, the highest performing model was the High-Resolution Network (HRNet) (Sun et al., 2019). As the name implies, this network's novelty lies in the fact that it keeps the image at a high resolution throughout the process. These significant advances in human pose estimation over the years have helped push forward the field of fashion item detection as well.


2.3. Fashion apparel detection

Ye et al. proposed Finer-Net, a cascaded human parsing network with hierarchical granularity (Ye et al., 2018). Finer-Net consists of three stages, going from higher-level to lower-level features with each stage. The use of high-level to low-level features allows for a model robust to the occlusion commonly seen in fashion item detection. MH-Parser is a novel multi-human parsing model introduced with a new multi-human parsing (MHP) dataset by Li et al. (2017) and is one of the few models in the fashion detection field to implement a Generative Adversarial Network (GAN) for its detection. Additionally, when MH-Parser performs global parsing prediction, it uses a fully convolutional representation learner shared for global parsing and map prediction.

Supplementary to the FCN, and equally important, has been the use of human pose estimation and fashion landmarks. Extracting fashion features plays an important role when performing fashion apparel detection (Dong et al., 2014; Ge et al., 2019; Huang et al., 2019; Liu et al., 2016; Shen et al., 2014; Wang et al., 2018; Xiao et al., 2018b; Ye et al., 2018). Experimental results from various papers show that clothing categories and attributes can be classified by using fashion landmarks, key-points, and pose estimations. By utilizing these methods, the accuracy of detected fashion items can be increased (Ge et al., 2019; Ye et al., 2018).

One of the most influential works in this field was the first (to the best of our knowledge) to propose fashion landmarks around features such as the neckline in 2016 (Liu et al., 2016); additionally, the authors released a dataset annotated with these landmarks, which paved the way for many researchers to use them to improve their models. Since then, researchers have been seeking to improve fashion landmark detection with different models (Chen et al., 2019). Among the best models using landmark detection for fashion item detection is one based on assistance from high-level industry knowledge, in the form of grammars that guide the fashion landmarks and significantly improve performance (Wang et al., 2018). Additionally, it was shown in 2019 that a model using landmark detection to bolster accuracy outperformed all state-of-the-art methods on the DeepFashion dataset (Lee et al., 2019). Researchers recently attempted to improve landmark detection, which can overgeneralize, by using multi-layer Layout-Graph Reasoning, with each layer mapping to another set of features (one layer for body parts, one for coarse clothes features, one for finer features, etc.), and achieved impressive results, also contributing a new dataset with 200,000 images (Yu et al., 2019).

In an archived paper (Jia et al., 2018), some researchers took a new perspective on fashion item detection and sought to do fashion item detection for industry professionals instead of consumers. In contrast to the landmarked images, some researchers sought to use weakly annotated images with a multi-stage architecture to perform fashion item detection (Zhang et al., 2018). This work was similar to our own in that it reached into some of the finer details of fashion item detection, but it only used website keywords such as "sleeve" or "v-neck" instead of creating detailed categorical divisions among these finer-grained attributes. Additionally, this work exclusively used a CNN, while our work uses both a CNN and an algorithmic approach. In the same vein, some authors attempted to use keyword metadata that accompanied images to predict the main product in each image (Rubio et al., 2017). For a more complete set of works in the fashion field, we direct our readers to an archived survey done in 2020 that examines more than 200 works in the field (Cheng et al., 2020).

A recent archived work uses pose estimation to identify whether a picture is a full frontal photograph with the subject facing forward (Ravi et al., 2020). The scheme then uses fashion item detection to discover each fashion item in the image, i.e., shoes, shirt, shorts, and then recommends similar items for each of the detected fashion items. The categories for these fashion items are broader than those used in our study, comprised of classifications such as "dress" or "jeans", whereas our classification is far more fine-grained. In a work published in 2021, the authors used machine vision to analyze images from catwalks during the 2019 fashion week in order to predict new fashion trends (Zhao et al., 2021). Computer vision was used for semantic segmentation and then attribute detection of clothes such as color, style, and clothing combinations; this study also did not seek to identify fine-grained attributes. A very thorough survey on fashion analysis and computer vision was done recently by Gu et al. (2020). This survey explores low-level (such as landmark detection), middle-level (such as clothing attribute prediction), and high-level (such as fashion item recommendation) fashion item classification and the various tasks that fall within these categories. In this survey, even amongst the papers that seek to do fine-grained apparel detection, such as Chen et al. (2015) and Dong et al. (2017), none of the papers have such extensive categories, often stopping at the level of "pattern" or generic "long" or "short" sleeve length.

3. Building blocks

3.1. SegNet

SegNet is a convolutional neural network for pixel-wise semantic segmentation, originally developed at the University of Cambridge (Badrinarayanan et al., 2015) in 2015. The architecture, depicted in Fig. 1, consists of an encoder network and a decoder network and ends in a soft-max layer for pixel-wise classification. The SegNet encoder architecture is identical to the convolutional layers in VGG16 (Simonyan & Zisserman, 2014), but the fully connected layers are removed. This means SegNet is fully convolutional, allowing it to accept any input size and significantly reducing the number of parameters. In the encoder network, SegNet has 14.7M parameters compared to previous models which had 134M (Badrinarayanan et al., 2017). Consequently, the memory and computational power needed are significantly reduced compared to other models. At the time of its development, its speed and accuracy were unparalleled, and its novelty lay in the decoder technique for up-sampling.

In convolutional neural networks, there is often a pooling layer in the network. There are multiple types of pooling layers, but the most popular type is max pooling. The purpose of the pooling layer is to reduce the size of the feature map while retaining its principal attributes. In a max-pooling layer, a subsection of the feature map is looked at. In a 9 × 9 image, for instance, a max-pooling layer might look at a 3 × 3 section of the top left corner. The maximum value of that section then becomes the value for that area of the output feature map, and the 3 × 3 window then shifts one value to the right. This process is repeated until all the maximum values have been taken, producing a feature map reduced in size that contains only the maximum values of the previous feature map. SegNet is effective because it saves the indices of the maximum value in each layer while doing max pooling.

The purpose of the encoder network is to create a feature map that has translation invariance. Translation invariance is a property in computer vision that means an algorithm is robust to changes in the image. For example, if an object is moved in an image frame or inverted, an algorithm that has translation invariance can still correctly identify the object. Reducing the image in size with max pooling helps a network achieve this because the network will only learn the principal attributes of the image, but it loses spatial context. This can help create a robust network, but it also loses information, specifically the boundaries in an image. The boundaries between objects are essential for semantic segmentation, so this issue must be overcome to achieve high accuracy.

Each encoder layer in the SegNet has a corresponding identical decoder layer. The decoder up-samples its feature maps using the saved max-pooling indices from the corresponding encoder feature map. This produces sparse feature maps containing the most essential features, so the network is robust while preserving the boundaries saved from the encoder. These are then convolved with a kernel to produce dense feature maps. Once this up-sampling is complete, there is a soft-max classification layer for pixel-wise classification. Before SegNet, architectures had fully connected layers and had to learn up-sampling. Removing these two components made SegNet significantly more computationally efficient, which is of high concern considering that semantic segmentation holds a prominent place in devices such as self-driving cars.


Fig. 1. SegNet architecture.
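The index-preserving pooling and unpooling that SegNet relies on can be illustrated with a short sketch. PyTorch is used here purely for illustration (it is not the toolkit used in this paper), and the layer sizes are arbitrary:

import torch
import torch.nn as nn

# Encoder-side max pooling that also returns the argmax locations, and a
# decoder-side unpooling that places values back at those saved locations.
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)
densify = nn.Conv2d(64, 64, kernel_size=3, padding=1)   # decoder filter bank

feature_map = torch.randn(1, 64, 8, 8)           # an encoder feature map (arbitrary size)
pooled, indices = pool(feature_map)               # halved resolution plus saved indices
sparse = unpool(pooled, indices,                  # sparse map: non-zero only at saved indices
                output_size=feature_map.size())
dense = densify(sparse)                           # convolution turns it into a dense map

The point of the sketch is the pairing: every pooling step records where each maximum came from, and the matching decoder step restores values to exactly those positions, which is how boundary information survives the down-sampling.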

3.2. Bounding boxes

Bounding boxes are a very common tool for object detection in the field of computer vision. Bounding boxes are rectangles defined by the upper left corner and the lower right corner on an (x, y) coordinate plane. While neural networks can predict bounding box locations, machine learning was not used in this part of the project.

The (x, y) coordinates are identified by pixel values; this identification of target pixels was made simple by the semantic segmentation of the images. The final product of our SegNet is interpreted as binary for this portion of our approach: pixels are either dress or not dress. We then find the values of the top-left dress pixel and the bottom-right dress pixel. Once the indices are known, the OpenCV library (Bradski, 2000) is used to draw a rectangle around the desired portion of the image so that the top of the box is the top of the dress and the bottom of the box is the bottom of the dress.

3.3. Key point identification

The goal of key point identification is to identify points of interest on an object, in this case a human. Key points on humans are often the joints and facial features. In 2016, researchers at Carnegie Mellon University won the COCO Keypoints Challenge with their model (Cao et al., 2021); this is the model that we used in our scheme. Their architecture relies on a multi-stage, multi-branch design. The full architecture of their model can be seen in Fig. 2.

The preliminary stage, stage 0, generates a set of feature maps of a given image using the first 10 layers of VGG-19 (Simonyan & Zisserman, 2014). The set of feature maps is given as input to stage 1, the next stage of the architecture. In each stage 1 through t, there are two branches, which predict two different things.

Branch 1 predicts a confidence map (or heat map) for each body part (body parts are specified by the training dataset). For example, the confidence of the presence of the eye at an arbitrary pixel (x, y) is represented as a value between 0 and 1.

Branch 2 predicts a part affinity field (PAF). The PAF is an association between different body parts or limbs that is represented by a 2D vector. The PAF predicts direction in order to preserve orientation. For example, a PAF would predict an association between knee and ankle as a two-dimensional vector, with the vector pointing downward to the ankle. Accurate association of body parts is especially important in multi-person detection, where the model needs to correctly predict which body part belongs to which person.

These two branches simultaneously make their predictions based on the same input. The results of the two branches are then concatenated together, and that result is concatenated with the feature map output of stage 0. The resulting image is then fed into branch 1 and branch 2 of stage 2. This process repeats to increase accuracy for t stages, where t was set to 6. The confidence maps and affinity maps are then parsed by a greedy matching algorithm that matches body parts to one another to create the final pose estimation. The output is given in the form of an array with values {x_i, y_i, c_i}, where the index i represents the body part, x and y are the 2D coordinates, and c represents the confidence level scored from 0 to 1. The following list represents the selected body parts which are useful for the attribute classification: {1:"Neck", 2:"Right Shoulder", 3:"Right Elbow", 4:"Right Wrist", 5:"Left Shoulder", 6:"Left Elbow", 7:"Left Wrist", 8:"Right Hip", 9:"Right Knee", 10:"Right Ankle", 11:"Left Hip", 12:"Left Knee", 13:"Left Ankle", 25:"Background"}.

4. Proposed scheme

4.1. Overview

This section proposes a federated approach that aims to classify the fine-grained attributes of dresses. The term federated means that the proposed scheme consists of methods from two different subjects: deep neural networks and algorithms. Among the three stages of the proposed scheme, stages 1 and 2 are based on deep neural networks, while stage 3 has three algorithms that effectively detect the attributes.

The target attributes include "Hem Length", "Sleeve Length", and "Hem Style", each of which can be determined based on length information. The "Hem Length" will be classified into one of ten attributes: 'Floor Length', 'Evening', 'Lower Calf', 'Midcalf', 'Below Midcalf', 'Below Knee', 'Knee', 'Above Knee', 'Mini', and 'Micro'. The "Sleeve Length" will be classified into one of five attributes: 'Long', 'Bracelet', 'Elbow', 'Short', and 'Cap'. Finally, the "Hem Style" will be classified into one of four attributes: 'Aline', 'Straight', 'High-low', and 'Asymmetrical'. Tables 1 and 2 summarize these classes and their attributes, derived from Koester and Bryant (1991).

Fig. 3 illustrates an overview of the proposed scheme. It consists of three stages:

• Stage 1: Generating segmented images from input images using SegNet.
• Stage 2: Estimating human key joints (key points) and creating a bounding box over the dress region.
• Stage 3
  – Phase 1: Classifying the hem length.
  – Phase 2: Classifying the style of dress sleeves.
  – Phase 3: Classifying the hem style of the dress.

With all these stages, our proposed federated approach classifies localized categories and generates detailed descriptions of the dress. For example, the output will be "A-line floor length dress with cap sleeves".

Fig. 2. Key point identification architecture.
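The pose-estimation output described in Section 3.3 is an array of {x_i, y_i, c_i} triples indexed by body part. A minimal sketch of how such an output could be read is given below; the function and variable names (including the confidence cutoff) are illustrative assumptions, not taken from the authors' code:

# Body-part indices listed in Section 3.3 (1: neck ... 13: left ankle, 25: background).
KEYPOINT_INDEX = {
    "neck": 1, "right_shoulder": 2, "right_elbow": 3, "right_wrist": 4,
    "left_shoulder": 5, "left_elbow": 6, "left_wrist": 7,
    "right_hip": 8, "right_knee": 9, "right_ankle": 10,
    "left_hip": 11, "left_knee": 12, "left_ankle": 13,
}

def get_keypoint(keypoints, name, min_confidence=0.1):
    """Return the (x, y) coordinates of a named body part from a list of
    (x, y, c) triples, or None when the confidence c is too low."""
    x, y, c = keypoints[KEYPOINT_INDEX[name]]
    return (x, y) if c >= min_confidence else None

# Example usage on a hypothetical detection result:
# hip = get_keypoint(detected, "right_hip")
# ankle = get_keypoint(detected, "right_ankle")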

Fig. 3. Overview of the proposed architecture.
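Stage 2 of the scheme (Section 4.3 below) thresholds the segmented image and derives the dress bounding box from the minimum and maximum dress-pixel coordinates, which are then drawn with OpenCV as described in Section 3.2. A minimal NumPy/OpenCV sketch follows; the dress label value and the function name are assumptions for illustration only:

import cv2
import numpy as np

def dress_bounding_box(segmentation_map, dress_label=85):
    """Turn dress pixels 'on' and return the box as (x_min, y_min, x_max, y_max).
    The dress_label value is a placeholder; the real encoding depends on the
    trained segmentation model."""
    mask = (segmentation_map == dress_label).astype(np.uint8)
    ys, xs = np.nonzero(mask)                  # coordinates of all dress pixels
    if xs.size == 0:
        return None                            # no dress region detected
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

# Drawing the rectangle with OpenCV:
# x0, y0, x1, y1 = dress_bounding_box(seg_map)
# cv2.rectangle(image, (x0, y0), (x1, y1), color=(0, 255, 0), thickness=2)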

4.2. Stage 1: Generation of segmented images from SegNet

This stage uses a pre-trained SegNet model to separate the dress region from noise such as the background and skin. SegNet is a pixel-wise classification model which outputs segmented images. We tested various tools, including FCN, SegNet, U-Net, and R-CNN, and concluded that SegNet performs the best for our problem of extracting the dress area. Thus, we use SegNet as our preprocessing model. The detailed process is as follows. SegNet is first trained on the LIP dataset (Liang et al., 2019) and optimized until the loss converges. We repeated the experiments with different hyperparameters and selected the best model, i.e., the one with the highest performance on the validation dataset. The optimal SegNet model, pre-trained on the LIP dataset, takes as input either an RGB color image or a grayscale image from our dataset and outputs a segmentation map where each pixel is assigned a class label. The training phase of the SegNet model includes resizing input images to 320 x 320 pixels; proper initialization of the parameters is also required, as bad initialization can hinder the learning of the networks. Therefore, the robust weight initialization method proposed in He et al. (2015) is used to initialize the weights of the encoder and decoder networks, whereas biases are initialized with zero.

The hyper-parameters for SegNet are as follows. The learning rate is 0.01, the momentum is 0.9, and the mini-batch size is 128 with 50,000 epochs. All the parameters are trained with stochastic gradient descent (SGD) until the training loss converges. The training data is shuffled for each epoch to ensure that each image is used only once in an epoch. The objective function of this model is the categorical cross-entropy loss, which is computed on a mini-batch. A large variation in the number of pixels in each class requires computing the loss differently based on the true class. To get smooth segmentation, the median frequency class balancing method is applied. This method ensures that smaller classes in the training data have higher weights, whereas the larger classes have weights smaller than 1.

4.3. Stage 2: Keypoints estimation and bounding box

The segmented images obtained from stage 1 are further processed to estimate key points and build a bounding box. To get the bounding box, segmented images are subjected to thresholding, a type of image segmentation. In thresholding, a binary mask with simple black and white pixel values can be obtained by using a threshold value t. The gray-scale histogram of the image gives the t value; pixel values greater than t are turned "on", while pixel values less than t are turned "off". Pixels that are turned on belong to the dress region (ROI) and are covered in one color, while the remaining predicted labels that are turned off, including the background, are grouped into another color. Then the bounding box is constructed by computing the minimum and maximum (x, y) coordinates of the dress region.

The human pose estimation model followed in this paper is the one proposed by Cao et al. (2017). The architecture of the model is described in Section 3.3. The model is pre-trained on the COCO dataset and generates Confidence Maps and Part Affinity Maps, which are all concatenated. The COCO dataset has 18 key points covering body, foot, hand, and facial key points. This model takes the segmented images as input and outputs a four-dimensional matrix of which the first dimension describes the image ID. The second dimension indicates the indices of the key points, which include 18 keypoint Confidence Maps, 1 background map, and 38 Part Affinity Maps. The third and fourth dimensions represent the height and width of the output map, respectively. Once the indices of the key points are known, we locate the same indices on the corresponding feature maps generated in Stage 1. The resultant output map, which has both the bounding box coordinates and the keypoints, is used as input for all the remaining phases.

4.4. Stage 3: Dress attributes classification

In stage 3, the algorithmic approaches of our proposed scheme for classifying dress categories are described. Each phase deals with the classification of one category. The output map of stage 2 is passed through each phase to get the final output with the hem length, sleeve length, and hem style classification.

4.4.1. Phase 1: Hem length classification

In this phase, for a given input image, the proposed hem length classifier maps it to one of the ten hem length attributes.


Table 1
Classes and attributes of dataset - hem length. (The input example and segmented image columns of the original table contain sample photographs and are omitted here.)

Class: Hem Length
Attributes: Floor Length, Evening, Lower Calf, Below Midcalf, Midcalf, Below Knee, Knee, Above Knee, Mini, Micro

Table 2
Classes and attributes of dataset - sleeve length and hem style. (The input example and segmentation image columns of the original table contain sample photographs and are omitted here.)

Class: Sleeve Length
Attributes: Long, Bracelet, Elbow, Short, Cap

Class: Hem Style
Attributes: Aline, Straight, High-low, Asymmetrical

The classifier is developed by following four processes. First, a dataset that has at least 700 images per hem length attribute is randomly collected from several online retail stores and then manually labeled with the help of measurements used in the fashion industry (Koester & Bryant, 1991). Second, the dataset is split into training and testing datasets, and then the best threshold values defining the attributes are derived. Third, a classifier is developed based on the threshold values. Finally, the performance of the classifier is evaluated using the testing datasets.

Fig. 4. Distribution of Hem Length to Leg Length Ratios for 10 Hem Length Attributes.

Thresholds are determined by using two metrics: hem and leg lengths. The annotations of human key points and a bounding box help in computing the hem length and leg length. The hem length is determined as the distance between a hip keypoint and the end of the dress. Some dresses might have a waved end, which obstructs a precise detection of hem lengths. To overcome this obstacle, we draw the bounding box and then measure the distance from a hip to the bottom-most point of the dress. Also, the leg length is measured as the distance between a hip and an ankle keypoint. The ratio (H_ℓ) of hem length to leg length is computed for both training and testing data. Then, we find the minimum and maximum ratios (H_ℓ) for each attribute in the training data and use them as the lower and upper range of thresholds for the corresponding attribute, respectively.
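The ratio computation and range lookup just described can be sketched as follows. The ranges are those reported in Table 3 below, and the variable and function names are illustrative; this is a sketch rather than the authors' implementation:

# Per-attribute (minimum, maximum) ranges of the hem-length-to-leg-length
# ratio H_l, taken from Table 3.
HEM_RANGES = [
    ("Floor", 1.05, 1.19), ("Evening", 0.90, 1.05), ("Lower Calf", 0.75, 0.90),
    ("Below Midcalf", 0.70, 0.743), ("Midcalf", 0.651, 0.699),
    ("Below Knee", 0.55, 0.648), ("Knee", 0.451, 0.549),
    ("Above Knee", 0.375, 0.45), ("Mini", 0.301, 0.3691), ("Micro", 0.201, 0.299),
]

def hem_length_attribute(hip_y, ankle_y, bbox_bottom_y):
    """Classify hem length from vertical pixel positions (y grows downward)."""
    hem_length = bbox_bottom_y - hip_y      # hip keypoint to lowest dress pixel
    leg_length = ankle_y - hip_y            # hip keypoint to ankle keypoint
    ratio = hem_length / leg_length         # H_l
    for attribute, low, high in HEM_RANGES:
        if low <= ratio <= high:
            return attribute
    return None                             # ratio falls outside all trained ranges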


Table 3
Summary of Minimum, Maximum and Average of Hem Length to Leg Length Ratio for All Hem Length Attributes.

Attributes       Minimum   Maximum   Average
Floor            1.05      1.19      1.12
Evening          0.9       1.05      0.974
Lower Calf       0.75      0.90      0.824
Below Midcalf    0.7       0.743     0.722
Midcalf          0.651     0.699     0.675
Below Knee       0.55      0.648     0.599
Knee             0.451     0.549     0.5
Above Knee       0.375     0.45      0.413
Mini             0.301     0.3691    0.337
Micro            0.201     0.299     0.25

Table 4
Summary of Minimum, Maximum and Average of Width of Bounding Box for A-line and Straight Attributes.

Attributes   Minimum   Maximum   Average
A-line       83        319       201
Straight     44        116       78.5

Algorithm 1: Classifying Dress Sleeve.
Input: A segmented image with human key points (output of Stage 2)
Output: A classified result of sleeve length
1: Find an end point of a sleeve and set the value as E_s
2: Find a shoulder point and set the value as K_S
3: Find an elbow point and set the value as K_E
4: Find a wrist point and set the value as K_W
5: Compute a bisect point, K_SE = (K_S + K_E)/2 = (K_SE.x, K_SE.y)
6: Compute a bisect point, K_EW = (K_E + K_W)/2 = (K_EW.x, K_EW.y)
7: if E_s.y > K_EW.y + 5 px then
8:    return Long
9: else if K_EW.y - 5 px < E_s.y ≤ K_EW.y + 5 px then
10:   return Bracelet
11: else if K_E.y - 5 px < E_s.y ≤ K_E.y + 5 px then
12:   return Elbow
13: else if K_SE.y - 5 px < E_s.y ≤ K_SE.y + 5 px then
14:   return Short
15: else if E_s.y ≤ K_S.y + 5 px then
16:   return Cap
17: else
18:   return ⊥
19: end if
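A direct Python rendering of Algorithm 1 is sketched below. Points are (x, y) pixel tuples with y increasing downward, and the five-pixel tolerance follows the paper; this is an illustration, not the authors' code:

def classify_sleeve(E_s, K_S, K_E, K_W, tol=5):
    """Classify the sleeve length from the sleeve end point E_s and the
    shoulder, elbow, and wrist keypoints (Algorithm 1)."""
    K_SE = ((K_S[0] + K_E[0]) / 2, (K_S[1] + K_E[1]) / 2)   # shoulder-elbow midpoint
    K_EW = ((K_E[0] + K_W[0]) / 2, (K_E[1] + K_W[1]) / 2)   # elbow-wrist midpoint
    y = E_s[1]
    if y > K_EW[1] + tol:
        return "Long"
    if K_EW[1] - tol < y <= K_EW[1] + tol:
        return "Bracelet"
    if K_E[1] - tol < y <= K_E[1] + tol:
        return "Elbow"
    if K_SE[1] - tol < y <= K_SE[1] + tol:
        return "Short"
    if y <= K_S[1] + tol:
        return "Cap"
    return None   # the undefined case, shown as ⊥ in Algorithm 1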
The hem length attributes of the test data are predicted by comparing their ratios (H_ℓ) with the thresholds found from the training data. The distribution of the hem length to leg length ratios of the training data is illustrated in Fig. 4. The X-axis represents the ratios (H_ℓ) of hem length to leg length, while the Y-axis represents the hem length attributes. It depicts that there is no overlap of ratios (H_ℓ) between any two attributes, which results in a trivial error rate of classification. Also, the minimum, maximum, and average values of each class are given in Table 3. We rounded the optimal threshold values to the nearest hundredth in order to improve the performance of the model for a new dress image.

By considering these minimum and maximum ratios of each class as thresholds, we predict the hem length attributes for images in the testing dataset. The classifier predicts the given dress as a floor length dress if the H_ℓ value of the dress is greater than or equal to 1.05, as it extends over the ankle. If the H_ℓ value of the dress is less than 1.05 and greater than or equal to 0.9, we classify it as an evening dress, while a dress whose H_ℓ is between 0.75 and 0.9 is defined as lower calf. Likewise, a dress with H_ℓ between 0.7 and 0.74 falls into below midcalf, and a dress with H_ℓ ranging from 0.65 to 0.7, 0.55 to 0.65, or 0.45 to 0.55 is classified as midcalf, below knee, or knee, respectively. Above knee and mini hem lengths are assigned if the H_ℓ value ranges from 0.38 to 0.45 and from 0.3 to 0.38, respectively. A dress whose H_ℓ is less than 0.3 is classified as micro, as it ends near the upper thighs and closer to a hip key point.

It is worth noting that the training and testing data are used only to obtain thresholds. Users are not required to split data and compute thresholds, as they are fixed pre-computed values. The major advantages of our approach are that (a) the thresholds are quite consistent, so the approach performs solidly unless the key points and a bounding box are undetectable, (b) online retailers do not need to train an algorithm again, and (c) the classification process is just a simple comparison of ratios and is thus highly efficient.

4.4.2. Phase 2: Dress sleeve classification

In Phase 2, each image is classified into one of five types of sleeves. Unlike the dress length, sleeve lengths could not be defined with a precise number due to the variety of a model's poses. Thus, we empirically define as many sleeve length attributes as possible based on the key points. The key points on both arms are used in defining the type of sleeve. There are three key points we can leverage, namely shoulder, elbow, and wrist, which are labeled as K_S, K_E, and K_W, respectively. This phase then finds the endpoint of the sleeve by checking the change of pixel values from a shoulder to an elbow or a wrist. The endpoint, a key factor for the classification, is denoted as E_s. If E_s is not detectable on one arm, we check the other. To increase the number of classifiable sleeve attributes, we define two bisection points K_SE and K_EW, where K_SE is a point between K_S and K_E, and K_EW is a point between K_E and K_W. It is worth noting that each point has (x, y) coordinates on the image. We track the region where E_s is located and classify the sleeve type. For example, a sleeve that ends more than five pixels below the midpoint between the elbow and wrist key points (K_EW) is classified as a long sleeve. A sleeve that ends within five pixels above or below the K_EW point is defined as a bracelet sleeve, while a sleeve that ends within five pixels above or below the K_E key point is classified as elbow. A sleeve whose endpoint is within five pixels above or below K_SE is defined as short; otherwise, if it is found in the region near the shoulder key point, it is labeled as a cap sleeve. The five-pixel margin was also determined empirically based on comprehensive simulations. Algorithm 1 illustrates the process of sleeve length classification.

4.4.3. Phase 3: Dress Hem Style classification

In Phase 3, an input image is classified into one of four hem styles: Asymmetrical, High-low, Aline, or Straight, by leveraging the bounding box (the lowest point of the dress) together with the three leg key points on each leg, namely hip, knee, and ankle. Based on the key points, we first find the lowest point of the dress on the left and the right legs and denote them as E_L and E_R, respectively. Also, we set the end of the dress that meets the bounding box bottom as T. Each point has an (x, y)-coordinate on the image, e.g., E_L = (E_L.x, E_L.y).

Asymmetrical is a style of dress hem in which one side is longer than the other, while High-low is a style in which the front hem is shorter than the back. Thus, we use the condition |T.y − E_L.y| ≥ 5 px or |T.y − E_R.y| ≥ 5 px to detect either asymmetrical or high-low. If the hem lengths on the two legs differ by more than five pixels, the dress is classified as asymmetrical; otherwise, high-low. The five pixels are empirically determined in order to maximize the classification accuracy.

On the other hand, the a-line and straight hem styles are classified based on the width of the bounding box, denoted as H_W. The distribution of the H_W values of the a-line and straight hem style attributes is illustrated in Fig. 5.


Algorithm 2: Classifying Dress Hem Style.
Input: An image with human key points and a bounding box (output of Stage 2)
Output: A classified result of hem style
1: Find an end point of the dress from the left knee or ankle and set the value as E_L = (E_L.x, E_L.y)
2: Find an end point of the dress from the right knee or ankle and set the value as E_R = (E_R.x, E_R.y)
3: T = (T.x, T.y) is the bottom of the bounding box
4: if (|T.y − E_L.y| ≥ 5 px) || (|T.y − E_R.y| ≥ 5 px) then
5:    if |E_L.y − E_R.y| ≥ 5 px then
6:       return Asymmetrical
7:    else
8:       return High-low
9:    end if
10: else if width of bounding box ≥ 110 then
11:    return Aline
12: else if width of bounding box < 110 then
13:    return Straight
14: else
15:    return ⊥
16: end if

Fig. 5. Distribution of Width of the Bounding Box for A-line and Straight Attributes.

It is evident from Fig. 5 that most of the distribution of the straight hem attribute lies within a width of 60 to 110 pixels. The minimum, maximum, and average values of the width of the bounding box for the a-line and straight hem style attributes are determined and given in Table 4. We define the a-line hem style if the width of the bounding box is greater than 110 pixels, and the straight hem style otherwise.

Experimental results and analysis

We performed all the experiments in Python using the Keras and Tensorflow libraries on a Lambda Workstation with an Intel Core i9-7920X, 16 GB main memory, four Geforce RTX 2080 Ti GPUs, and a 4 TB Samsung SSD 860. In this section, the types of data sets, the algorithm settings, and the experimental results are discussed.

4.5. Dataset preparation

Image segmentation experiments using the SegNet model were carried out on 50,000 images from the Look into Person (LIP) dataset (Liang et al., 2019). Each image in the set is annotated pixel-wise with 19 semantic human part labels and 1 background label for human parsing. The training, validation, and test sets consist of 30,000, 10,000, and 10,000 images, respectively. The labels of this dataset include 6 body parts, such as the right and left sides of the arms and legs, and 13 clothes categories, like upper clothes, pants, dress, skirts, sunglasses, gloves, shoes, and socks. Each label is annotated with a different RGB color encoding. For instance, all the pixels belonging to the dress are encoded as [0,0,85], all pixels of the right leg are encoded as [255,134,255], etc.

Dress classification experiments were evaluated on dress images crawled from online retail shops. All images were gathered under the fair use doctrine. We manually labeled each image with three different attributes of dresses, including hem length, sleeve type, and hem style. For each subclass, around 700 images were collected. All the collected images must be full-body shot images with no occlusions. Images that have a rear view of the dress or no proper posture to detect human keypoints were removed from the dataset, as they prevent the successful identification of the different categories.

4.6. Algorithm settings

We evaluated the performance of our model on each dress category and compared it with several state-of-the-art classifiers, namely a Convolutional Neural Network (CNN) and the VGG16 and VGG19 networks.

The CNN architecture has three convolutional layers, two fully connected (FC) layers, and an output layer. Each convolutional layer has a filter size of 3 × 3, and max-pooling was performed on every 2 × 2-pixel block. Batch Normalization and Dropout are used as regularization factors. The output is then fed to a Soft-max layer for classification that helps to determine the classification probabilities used by the final classification layer. The CNN model is trained with the Adam optimizer with an adaptive learning rate.

Both the VGG16 and VGG19 networks are deep CNN models with five building blocks. The VGG16 has a total of 16 layers, while VGG19 has 19 layers. The first two blocks in both networks have two convolutional layers and 1 pooling layer, with 64 filters in the first block and 128 filters in the second block. The third and fourth blocks of the VGG16 network consist of 3 convolutional layers and 1 pooling layer each, and the last block has 3 convolutional layers, whereas the VGG19 network has 4 convolutional layers and 1 pooling layer each in the third and fourth blocks and 4 convolutional layers in the fifth block. The fifth block is followed by two fully connected layers of 4096 nodes each. Filters of size 3 × 3 with a stride of 1 are used in all convolutional layers. The third block has 256 filters, while the fourth and fifth blocks have 512 filters. For all the max-pooling layers, a 2 × 2 filter with a stride of 2 is used. Batch normalization is performed at each block for easier initialization and faster training. Both networks used the stochastic gradient descent optimizer.

4.7. Hyperparameter search

We performed a grid search with five-fold cross validation to tune the hyperparameters and determine the optimal training parameters for all three networks (CNN, VGG16, and VGG19), and selected the models with the lowest loss computed on the validation sets. To reduce the risk of overfitting and improve the performance of the classification process, hyperparameters such as the dropout rate and learning rate are selected for the grid search. The learning rate is evaluated on a logarithmic scale of [0.1, 0.01, 0.001, 0.0001], and the dropout rate is evaluated for [0.25, 0.5, 0.75]. Although grid search is a tedious process, leveraging parallel computational resources yields the optimal hyperparameters quickly. The grid search results show that the models exhibit the best performance when the learning rate and dropout rate are set to 0.001 and 0.5, respectively. The other hyperparameters, such as epoch size and batch size, are set to 500 epochs and 24, respectively, for all experiments. The nonlinear ReLU is used as the activation function in all layers but the output layer. In addition, to avoid memory issues with our GPU server, the original images are resized to 220 x 200 pixels.

Table 5
Classification report with the metrics precision (P), recall (R), and f1-score (f1).

Class          Attribute       Our Scheme              CNN                   VGG16                 VGG19
                               P      R      f1        P     R     f1       P     R     f1        P     R     f1
Hem Length     Floor Length    0.97   1      0.99      0.82  0.99  0.9      0.91  0.94  0.924     0.93  0.93  0.93
               Evening         0.98   0.93   0.95      0.5   0.47  0.485    0.5   0.6   0.545     0.47  0.6   0.527
               Lower Calf      1      0.98   0.99      0.38  0.15  0.215    0.53  0.45  0.489     0.52  0.55  0.535
               Below Midcalf   0.94   1      0.97      0.16  0.36  0.222    0.44  0.36  0.4       0.25  0.27  0.26
               Midcalf         0.89   0.95   0.92      0.45  0.5   0.474    0.41  0.43  0.42      0.35  0.43  0.39
               Below Knee      0.90   0.94   0.92      0.53  0.25  0.34     0.4   0.43  0.414     0.37  0.34  0.354
               Knee            0.93   0.91   0.92      0.78  0.52  0.624    0.5   0.53  0.515     0.66  0.5   0.569
               Above Knee      1      0.93   0.96      0.7   0.75  0.724    0.64  0.61  0.625     0.61  0.73  0.665
               Mini            0.97   1      0.99      0.51  0.47  0.49     0.46  0.45  0.455     0.49  0.35  0.408
               Micro           1      0.92   0.96      0.4   0.48  0.436    0.35  0.21  0.263     0.39  0.38  0.385
Sleeve Length  Long            0.915  0.89   0.902     0.92  0.79  0.85     0.72  0.82  0.767     0.86  0.78  0.8187
               Bracelet        0.845  0.899  0.871     0.74  0.93  0.824    0.76  0.72  0.739     0.82  0.72  0.767
               Elbow           0.926  0.895  0.91      0.89  0.8   0.842    0.87  0.62  0.724     0.77  0.8   0.785
               Short           0.869  0.67   0.757     0.82  0.62  0.706    0.75  0.54  0.628     0.77  0.8   0.785
               Cap             0.75   0.885  0.812     0.72  0.86  0.784    0.49  0.79  0.605     0.63  0.84  0.72
Hem Style      Aline           0.917  0.978  0.947     0.91  0.77  0.834    0.51  0.94  0.661     0.55  0.87  0.674
               Straight        0.981  0.952  0.967     0.97  0.78  0.864    0.77  0.7   0.733     0.71  0.79  0.748
               High-low        0.836  0.853  0.844     0.73  0.93  0.818    0.89  0.15  0.257     0.87  0.25  0.388
               Asymmetrical    0.89   0.76   0.82      0.45  0.75  0.562    0.61  0.23  0.334     0.33  0.02  0.038

P = precision, R = recall, f1 = f1-score.

4.8. Evaluation metrics

Our main goal is to create a model that classifies dress attributes with high computational efficiency and can be used in real applications. In order to evaluate the classification performance of the proposed algorithm, we computed several metrics, namely precision, recall, and f1-score, and compared the results with three variants of CNN-based models. Precision is defined as the percentage of correctly predicted images, while recall is defined as the percentage of predicted images a model correctly identified. The f1-score is defined as the weighted harmonic mean of precision and recall. The formal definitions of precision, recall, and f1-score are given below:

Precision = TP / (TP + FP)    (1)

Recall = TP / (TP + FN)    (2)

F1 Score = 2 × (Precision × Recall) / (Precision + Recall)    (3)

4.9. Results and discussion

To validate and evaluate the performance of the proposed method, we compare our method with different deep learning models on dress attribute prediction. Since CNN, VGG16, and VGG19 are capable of outputting a single classification result each time, we measured the performance of each phase in Stage 3 in order to provide same-level comparisons.

Table 5 summarizes the performance of the different methods on the hem length, sleeve length, and hem style classification tasks. The results illustrate that our scheme completely outperformed the conventional deep learning models in all classification tasks, regardless of the number of convolutional and fully connected (fc) layers. From the results, it is evident that our model has 100% precision, recall, and f1-score for the below knee, knee, and micro attributes, indicating that our model has much lower false positive and false negative rates than the other models. In the hem length classification task, our model achieved an average f1-score of 97%, whereas CNN, VGG16, and VGG19 showed poor performance of 49%, 50.5%, and 50.2%, respectively. It is interesting to note that, except for floor length, all these CNN and VGG models have poor performance on the remaining hem length attributes. In the sleeve length classification task, our scheme reported the highest f1-score for the elbow sleeve, whereas the CNN and VGG networks show the highest f1-score for the long sleeve. In hem style classification, all four models reported the highest f1-score for the straight hem attribute; however, our model has the highest score of 96.7%, while CNN, VGG16, and VGG19 have 86.4%, 73.3%, and 74.8%, respectively. The higher average f1-scores of CNN compared to the VGG16 and VGG19 networks infer that the inclusion of more convolutional and fc layers has no effect in improving the classification results for this dress attribute classification problem. This demonstrates that a CNN with a less complex architecture can extract attribute-sensitive features for enhancing the classification. However, the average precision, recall, and f1-score of our proposed scheme are higher than those of CNN. This indicates that the CNN model, which performed better than both VGG networks, is still clearly inferior to our model in all three categories.

Table 6
Computation time comparison of different models for 100 images in seconds.

Schemes   Class           Average execution time (s)
Ours      All             59.61
CNN       Sleeve Length   81.09
          Hem Style       86.03
          Dress Length    86.58
VGG16     Sleeve Length   227.298
          Hem Style       225.59
          Dress Length    227.20
VGG19     Sleeve Length   261.00
          Hem Style       259.41
          Dress Length    259.39

In addition, Table 6 provides the execution times of all models. The given execution times are the average of 10 executions for 100 images. The average run time shows that our approach needs much less time to predict, as it outputs three classification results in a single execution, while the other schemes must be executed three times to obtain the three results. Among all schemes, VGG19 was the worst in terms of execution time. This further proves that, for this dress attribute classification problem, adding more layers deteriorates the performance while increasing the computational time and complexity.

Furthermore, the plots in Fig. 6 emphasize our model's superiority in classification performance over the other models. Figs. 6(a) to 6(c) represent the performance on the hem length data. These plots depict that our proposed model has the highest precision and f1-score over the other models. The sleeve length plots are shown in Figs. 6(d) to 6(f). Though CNN has the highest recall on the bracelet sleeve, our model's average performance scores are far higher than the CNN and VGG scores. It is evident from the hem style plots in Figs. 6(g) to 6(i) that our proposed scheme achieved prominent classification results when compared against CNN, VGG16, and VGG19 in terms of the precision, recall, and f1-score metrics.

5. Discussion

In the fashion industry, among consumers, there is a very broad range of definitions of different "fashions" or classifications of fashion items.


Fig. 6. Experimental results.

When looking at clothing, its specific class can often be a subjective thing to both consumers and experts. When working with labeling fashion items, a business cannot have this same level of subjectivity. When creating categories for clothes online, businesses always distinguish categories based on their own definitions. For example, businesses will frequently have a long sleeve shirt tab and a short sleeve shirt tab. The length of the sleeves in each category is at the discretion of the business. Customers' expectations may or may not always align with the company's categories, but as categories get broader, the alignment increases. Additionally, having these fashion categories defined by companies for consumers is commonplace, and user expectations of categories frequently conform to, and are defined by, the retail websites on which they shop.

Our model sought to reach deeper than the broad categories that may be listed on a website, and to create multiple classes within one category of clothing. When analyzing fashion items to this depth, few customers have expectations about how, or even if, a retailer will segregate within one category of clothing. At this depth, the merit of fashion item annotation lies within recommendations, and less with allowing customers to access the fine-grained categories. Customers have the expectation that recommendations will be relevant to the item they are viewing, which would be fulfilled by our scheme. Therefore, we could conclude that customers' expectations would have a very minor impact on, or interaction with, the fine-grained nature of our scheme.

When shopping, there are two broad classes of consumers: there are consumers who want to find a specific item, and there are consumers who are simply browsing.


When shopping, there are two broad classes of consumers: there are consumers who want to find a specific item, and there are consumers who are simply browsing. Consumers in the first category rarely find exactly what they want with the first apparel item they click on. For these users, highly relevant recommendations have merit. The sooner a customer sees what they need on a retailer's website, the more likely that customer is to buy from that retailer rather than look somewhere else to fulfill their needs. For shoppers who are simply browsing, high-quality, fine-grained recommendations also have merit. The more items consumers see that are relevant to the style they like, the more likely they are to purchase an item. Showing these customers a more fine-grained recommendation, as opposed to recommending other "t-shirts", will increase the likelihood of purchase and increase time spent on the website. Improved recommendations for both categories of users will increase sales for a company. Beyond the scope of the company, they will also help the consumer, especially the first category of consumers. Consumers looking for a specific item need to find something quickly, and the more relevant the recommendations are to a customer's needs, the sooner they will find what they are looking for. Improved recommendations will streamline the shopping experience for customers looking for a specific item, which can often be an arduous task.
6. Conclusion

This paper proposed a novel algorithm for dress attribute classification. Our approach leverages key point estimation and bounding box construction on segmented images. The experimental analysis shows promising results for our model on the three localized categories against the CNN, VGG16, and VGG19 models. The benefits of our approach are three-fold. First, our model is computationally efficient and thus easily applicable by a broad range of online retailers, unlike CNN-based approaches whose time complexity depends on the level of architectural complexity. Second, our approach produces robust results even with small datasets, whereas CNN-based models require a huge amount of data to reduce overfitting. Third, the feature extractors are independent of one another; they can be refined or replaced as better approaches emerge without adversely affecting the performance of the other extractors, as illustrated in the sketch below. With that in mind, our comparison of algorithmic approaches to classification versus CNN and VGG was less about demonstrating the superiority of our approach and more about being certain that there was not a better way to implement a specific feature extractor. In a future study, we will extend the proposed scheme to classify other fashion items, e.g., pants, skirts, T-shirts, and so forth.
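As an illustration of this modularity, the sketch below shows how independent attribute extractors can be composed around a single segmented image. Every function name is hypothetical, and the snippet is meant only to convey the plug-in structure, not to reproduce our implementation.

from typing import Callable, Dict
import numpy as np

# A segmented dress image goes in; a single attribute label comes out.
Extractor = Callable[[np.ndarray], str]

def classify_dress(image: np.ndarray, extractors: Dict[str, Extractor]) -> Dict[str, str]:
    # Each extractor runs independently on the same segmented image, so any
    # one of them can be refined or replaced without touching the others.
    return {attribute: extract(image) for attribute, extract in extractors.items()}

# Hypothetical usage mixing rule-based extractors with a learned classifier:
# labels = classify_dress(segmented_image, {
#     "sleeve_length": keypoint_sleeve_length,
#     "hem_length": keypoint_hem_length,
#     "hem_style": hem_style_net.predict,
# })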
CRediT authorship contribution statement

Tejaswini Mallavarapu: Writing – original draft, Software, Validation, Investigation. Luke Cranfill: Writing – original draft, Data curation, Visualization. Eun Hye Kim: Data curation. Reza M. Parizi: Writing – review & editing. John Morris: Resources, Funding acquisition. Junggab Son: Supervision, Project Administration, Conceptualization, Methodology, Investigation, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was supported by the Oracle Retail Applied Research under Grant KSFP30CCSEDD.