Atlas: A Dataset and Benchmark For E-Commerce Clothing Product Categorization
[email protected]
2
Ericsson Research, Chennai, India
[email protected]
3
University of Colorado, Boulder, Colorado, USA
[email protected]
1 Introduction
With the Internet revolution, E-commerce has become a major platform for
selling products to customers. E-commerce stores host a collection of products
ranging from electronics to fashion apparel to groceries. A well-organized
E-commerce store lets customers navigate the website with ease and
locate the product they are looking for. Unlike a traditional retail store, where
a shopper can walk in and seek assistance, online retailers rely on their product
catalog or categorization to help shoppers find their desired product. Product
taxonomy is a tree structure with multiple top and intermediate levels, ending
2 V. Umaashankar et al.
2 Related Work
A clean and detailed product taxonomy offers several benefits to both the
E-commerce store and its customers. However, creating and maintaining a
taxonomy, or adapting an existing categorization standard, is not an easy task.
Still, most E-commerce stores want flexibility in the way they organize their
catalog and create their product taxonomy.
Initially, techniques from information retrieval and machine learning were
applied to the problem of product categorization. GoldenBullet [4] is a
software environment that automatically classifies products based on
their original descriptions and existing classification standards (such as
UNSPSC). It integrates classification algorithms such as the Vector Space Model
(VSM), K-nearest neighbor and Naive Bayes, along with some natural
language processing techniques to pre-process the data. [5] approached product
categorization as a hierarchical text classification task. They proposed two
approaches: building a separate classifier for each level in the hierarchy, and
building a flat classifier that directly predicts the leaf-level assignment of a
document. They used Support Vector Machine (SVM) classifiers to evaluate both
approaches. [11] presented a simple linear classifier based approach for product
categorization using mutual information and LDA based features. In general, the
computational complexity of some of these traditional machine learning
techniques is well beyond linear with respect to the number of training examples,
features, or classes. The scale of E-commerce product categorization requires
algorithms capable of processing a huge volume of training data in a reasonable
time, handling a large number of classes, and making fast real-time
predictions [18].

4 https://github.com/vumaasha/Atlas
The remarkable progress made in deep learning in recent years has
provided a better way to approach this problem. [3] conducted a detailed study
of Convolutional Neural Networks (CNNs) for the product categorization
task. They used the Amazon product dataset provided by [17] and text features
such as product titles, navigational breadcrumbs, and list price. [6] used
multiple deep Recurrent Neural Networks (RNNs) and generated features from the
text metadata. More recently, sequential model-based approaches have been
widely used for product categorization. [8] modeled product categorization as a
Sequence to Sequence learning problem: they used product titles, which are
sequences of words, as input and predicted the category path as a sequence of
category levels in the product taxonomy.
Due to the availability of large high-quality image datasets, the field of image
classification [12] has matured considerably in recent times. Noise and ambiguity
are common problems in textual product titles and descriptions. However, most
E-commerce products tend to have decent product images, which makes images
a natural choice for product categorization. [1], [2], [10] and
[16] applied computer vision techniques to fashion apparel categorization based
on product images. The closest to our work is [13], which uses a Seq
to Seq model with product titles as input to predict category paths, using an
LSTM Decoder and beam search for inference. We extend their work in this
paper by using product images instead of product titles as input for product
categorization. Similar to [13], we learn an Attention based Seq to Seq model.
3 Atlas Dataset
Rakuten made a product classification dataset publicly available in the Rakuten
Data Challenge [15]. However, this dataset contains only product titles, and
the levels in the taxonomy are represented by numerical IDs instead of plain
text. Real-world product taxonomy datasets are not publicly available. Also,
there is no widely adopted industry standard for defining product taxonomies.
In addition, factors like data size, category skewness, and noisy metadata
limit further research and practical implementation of large-scale product
categorization. This motivated us to develop a real-world dataset for product
categorization.
Fig. 1. Examples of (a) Zoomed (dirty) and (b) Normal (clean) images from our Atlas
dataset. The Zoomed images show close-ups of the apparel or cropped versions of the
image that make it difficult to recognize the product, whereas the Normal images show
figures with the entire product visible.
tative image of the product. Some images might display packaging, installation
instructions, etc. In the case of clothing, we found that many product listings
also included zoomed-in images that display intrinsic details such as the texture
of the fabric, brand labels, buttons, and pocket styles. Without the context of the
product listing, it would be hard even for a human to identify the corresponding
product. Including these zoomed-in images would drastically reduce the quality
of the dataset. Finding and removing these noisy images manually would take
considerable time and effort. We therefore modeled this as a binary classification
task (Zoomed vs. Normal images) and compared a Linear SVM with a simple 3-layer
CNN (Figure 2). We prepared the training data by visual inspection: we
segregated noisy and high-quality images into two different folders by looking
at the thumbnails of hundreds of product images at a time. Our
models were trained on 6005 normal images and 1054 zoomed images, and the
performance metrics on the test set are shown in Table 1. We used computer
vision based features such as contours and histogram of gradients as input for the
LinearSVM. We automated the process of filtering out the noisy images using
the CNN model due to its superior performance compared to the LinearSVM.
Table 1. Metrics for the models used to predict Zoomed vs. Normal images
         |            CNN             |            SVM
         | precision  recall  f-score | precision  recall  f-score
Normal   |   0.99      0.99    0.99   |   0.91      0.99    0.95
Zoomed   |   0.95      0.95    0.95   |   0.86      0.48    0.62
Average  |   0.98      0.98    0.98   |   0.91      0.91    0.90
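As a quick sanity check on Table 1, the per-class f-scores can be recomputed from the reported precision and recall, assuming the f-score is the usual harmonic mean (F1). The helper name `f_score` and the dictionary layout below are illustrative choices, not part of the authors' code.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the per-class rows of Table 1.
table_1 = {
    ("CNN", "Normal"): (0.99, 0.99),
    ("CNN", "Zoomed"): (0.95, 0.95),
    ("SVM", "Normal"): (0.91, 0.99),
    ("SVM", "Zoomed"): (0.86, 0.48),
}

# Round to two decimals, the precision used in the table.
f_scores = {key: round(f_score(p, r), 2) for key, (p, r) in table_1.items()}

for (model, label), score in f_scores.items():
    print(f"{model} {label}: f-score = {score:.2f}")
```

The recomputed values match the table, and they make the SVM's weakness visible: its recall of 0.48 on Zoomed images pulls the f-score for that class down to 0.62.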
8 https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Image-Captioning
Fig. 3. A sample of category paths predicted on the test dataset by our model. We can
observe how the Attention focuses on different sections of the image while generating
each category level. For example, the Attention focuses on the face to predict the
first category level, gender.
encoding produced by the Encoder will have the dimensions (batch size, 14, 14,
2048).
Recurrent Neural Networks (RNNs) are popular for sequential classification
tasks because they consider both the current input and what was learned from
previously received inputs when making a prediction. Plain RNNs tend to have only
short-term memory; Long Short-Term Memory (LSTM) networks add a memory cell
that lets them retain information over longer sequences. Our Decoder, shown in
Figure 4, is a stacked LSTM network combined with Attention.
The Attention Network shown in Figure 4 learns which part of the image to
focus on to predict the next level in the category path while performing the
sequence classification task. The Attention Network generates weights by
considering the relevance between the encoded image and the previous hidden state
or previous output of the Decoder. It consists of linear layers which transform
the encoded image and the previous Decoder's output to the same size. These
vectors are summed together and passed to another linear layer. This layer
calculates the values to be softmaxed and then passes the values to a ReLU layer.
A final softmax layer calculates the weights (alphas) of the pixels, which add up
to 1. If there are P pixels in our encoded image, then at each time step t,
\[ \sum_{p}^{P} \alpha_{p,t} = 1 \qquad (1) \]
We use a weighted average across all the pixels instead of a simple average so
that the important pixels are assigned greater weights.
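The normalization in Eq. (1) and the weighted average over pixels can be illustrated with a minimal pure-Python sketch. The function names, the toy feature vectors and the raw scores are made up for illustration and do not come from the authors' implementation; the only detail taken from the text is that the 14 x 14 encoding gives P = 196 pixels with 2048 features each.

```python
import math


def softmax(scores):
    """Turn raw attention scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weighted_encoding(pixel_features, scores):
    """Weighted average of per-pixel feature vectors using softmax weights."""
    alphas = softmax(scores)
    dim = len(pixel_features[0])
    context = [
        sum(alpha * pixel[d] for alpha, pixel in zip(alphas, pixel_features))
        for d in range(dim)
    ]
    return alphas, context


# Toy example: P = 4 pixels with 2 features each. In the setting described
# above there would be P = 14 * 14 = 196 pixels with 2048 features each.
pixel_features = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [0.5, 0.5]]
scores = [0.1, 0.2, 3.0, 0.1]  # hypothetical output of the scoring layer

alphas, context = attention_weighted_encoding(pixel_features, scores)
print("alphas:", [round(a, 3) for a in alphas])
print("context:", [round(c, 3) for c in context])
```

The third pixel has the largest score, so it receives the largest weight and dominates the resulting encoding, which is the behaviour a simple (unweighted) average cannot provide.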
The Decoder receives the encoded image from the Encoder and uses it to
initialize the hidden and cell states of the LSTM through two linear layers.
Two virtual category levels, <start> and <end>, which denote the beginning
and end of the sequence, are added to the category path. The Decoder LSTM
is trained with teacher forcing, proposed by [20]. The <start>
marker is considered to be the zeroth category level. The <start> marker
along with the encoded image is used to generate the first (top) level of the
category path. Subsequently, all other levels are predicted using the sequence
generated so far along with the Attention weights. An <end> marker
marks the end of a category path. The Decoder stops decoding the sequence
as soon as it generates the <end> marker. At each time step, the Decoder
computes the weights and the Attention weighted encoding from the Attention
Network using its previous hidden state. Another linear layer creates a
sigmoid-activated gate; the Attention weighted encoding is passed through
it, concatenated with the embedding of the previously generated category
level, and fed into the LSTM Decoder to generate the new hidden state. The next
level is predicted from the hidden state of the Decoder using a final softmax
layer. The softmax layer transforms the
hidden state into scores, which are stored for use in beam search
to select the ’k’ best levels.
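The decoding procedure described above (start from <start>, extend each partial category path one level at a time, keep the 'k' best candidates, stop at <end>) can be sketched as a small beam search. This is a simplified illustration under stated assumptions: `TOY_MODEL`, its category names and its probabilities are hypothetical stand-ins for the trained Decoder, which would compute the scores from the image and its hidden state.

```python
import math

START, END = "<start>", "<end>"

# Hypothetical next-level probabilities keyed by the previous level.
# All probabilities are non-zero, so math.log below is always defined.
TOY_MODEL = {
    START: {"Men": 0.6, "Women": 0.4},
    "Men": {"Western Wear": 0.7, "Ethnic Wear": 0.3},
    "Women": {"Western Wear": 0.5, "Ethnic Wear": 0.5},
    "Western Wear": {"Shirts": 0.8, "Jeans": 0.2},
    "Ethnic Wear": {"Kurtas": 0.9, "Sarees": 0.1},
    "Shirts": {END: 1.0},
    "Jeans": {END: 1.0},
    "Kurtas": {END: 1.0},
    "Sarees": {END: 1.0},
}


def toy_next_level_probs(path):
    """Stand-in for one Decoder step: look at the last generated level only."""
    return TOY_MODEL[path[-1]]


def beam_search(next_level_probs, beam_width=5, max_len=10):
    """Return the highest-scoring category path found by beam search."""
    beams = [([START], 0.0)]  # (path so far, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for path, score in beams:
            for level, prob in next_level_probs(path).items():
                candidates.append((path + [level], score + math.log(prob)))
        # Keep only the k best candidates.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for path, score in candidates[:beam_width]:
            if path[-1] == END:
                finished.append((path, score))  # complete, stops growing
            else:
                beams.append((path, score))
        if not beams:
            break
    best_path, _ = max(finished or beams, key=lambda item: item[1])
    # Strip the virtual <start> and <end> levels before returning.
    return [level for level in best_path if level not in (START, END)]


predicted = beam_search(toy_next_level_probs, beam_width=5)
print(" > ".join(predicted))
```

With a beam width of 1 the same function reduces to greedy decoding; a wider beam lets a path with a weaker first level win if its later levels score well.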
5 Training
5.1 Model hyperparameters
Zoomed Vs Normal LinearSVM Model. We trained the LinearSVM available
in Scikit-learn with C set to 0.0001 and class weight set to ’balanced’, using
hinge loss. The optimal C value was identified using a grid search.
Zoomed Vs Normal CNN Model. We trained for 10 epochs using Binary
CrossEntropy as the loss function and the RMSProp optimizer with the learning
rate set to 0.001, rho set to 0.9 and decay set to 0.0.
Resnet34 based Image Classification. We trained for 17 epochs using
Categorical CrossEntropy as the loss function and Leslie Smith’s one cycle policy [19]
for choosing the learning rate. We used early stopping to terminate the training
process when the decrease in validation loss is less than 0.001 for 3 consecutive
epochs.
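The early-stopping rule can be sketched as follows. This is one possible reading of the criterion, comparing each epoch with the one immediately before it; library callbacks often compare against the best loss seen so far instead. The function name `should_stop` and the loss values are hypothetical.

```python
def should_stop(val_losses, min_delta=0.001, patience=3):
    """Return True once the validation loss has decreased by less than
    `min_delta` for `patience` consecutive epochs."""
    if len(val_losses) <= patience:
        return False  # not enough history to judge yet
    recent = val_losses[-(patience + 1):]
    # Epoch-to-epoch improvement; a positive value means the loss dropped.
    improvements = [prev - cur for prev, cur in zip(recent, recent[1:])]
    return all(delta < min_delta for delta in improvements)


# Hypothetical validation losses, one value per epoch.
history = [0.90, 0.60, 0.45, 0.4495, 0.4492, 0.4490]

for epoch in range(1, len(history) + 1):
    if should_stop(history[:epoch]):
        print(f"stop after epoch {epoch}")
        break
```

In this toy history the loss still drops sharply up to epoch 3, then improves by less than 0.001 in each of the next three epochs, so training would stop after epoch 6.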
Attention based Seq to Seq Model. We trained our model on a GPU for 3
epochs with a batch size of 128 and a dropout rate of 0.5, after which the
validation accuracy stopped improving. We used Adam optimizers with learning rates
of 1e-4 and 4e-4 for the Encoder and Decoder respectively. We picked a beam width
of 5 based on our experiments. The regularization parameter for doubly stochastic
Attention was set to 1 and gradient clipping was set to an absolute value of 5.
The pre-trained model can be downloaded from the link in footnote 9.
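Two of the settings above, the doubly stochastic Attention regularizer and gradient clipping by absolute value, can be illustrated in a few lines. The sketch assumes the penalty takes the form introduced in Show, Attend and Tell [22], namely the regularization weight times the sum over pixels of (1 minus the sum over time steps of the pixel's attention weight) squared; some implementations average over pixels instead of summing. The function names and inputs are hypothetical.

```python
def doubly_stochastic_penalty(alphas, weight=1.0):
    """alphas[t][p] is the attention weight of pixel p at decoding step t.

    Penalize pixels whose weights, summed over all steps, differ from 1,
    which encourages the model to attend to every pixel at some point.
    """
    num_pixels = len(alphas[0])
    penalty = 0.0
    for p in range(num_pixels):
        total_over_steps = sum(step[p] for step in alphas)
        penalty += (1.0 - total_over_steps) ** 2
    return weight * penalty


def clip_gradients(grads, clip_value=5.0):
    """Clamp every gradient component to [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, g)) for g in grads]


# Two decoding steps over two pixels.
balanced = [[0.5, 0.5], [0.5, 0.5]]  # each pixel receives total weight 1
lopsided = [[1.0, 0.0], [1.0, 0.0]]  # the second pixel is never attended to

print(doubly_stochastic_penalty(balanced))  # no penalty
print(doubly_stochastic_penalty(lopsided))  # penalized
print(clip_gradients([-7.5, 0.3, 12.0]))
```

The penalty is added to the classification loss during training, and clipping is applied to the gradients before each optimizer step.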
5.2 Hardware
(1) Nvidia GeForce GTX 1080 Ti GPU, 11 GB RAM
(2) Intel(R) Xeon(R) Processor E5-2650 v4, 30M Cache, 2.20 GHz, 12 Cores, 24 Threads
(3) 250 GB RAM
(4) CentOS 7
6 Results
We evaluated the proposed model on our dataset of 186,150 clothing
images and their category paths. We split our dataset into train, validation and
test sets, similar to the splits used in the work by [9]. Stratified random sampling
was carried out on our dataset, with 65% of the data in the training set (119,155
images), 5% in the validation set (11,147 images) and 30% in the test set (55,848
images). The Resnet34 classification model and the Seq to Seq model trained on
our Atlas dataset achieved overall micro f-scores of 92% and 90% respectively.
A comparison of the f-scores of both benchmark models over the support size of
leaf categories is shown in Figure 5. Although the classification
model performs better than the Seq to Seq model, we believe the reason is
that we have only 52 categories at the moment. As the number of categories
increases, the structure in the taxonomy can be better leveraged by the Seq to
Seq model. In addition to predicting the category paths, the Seq to Seq model
also helps explain the reasoning behind its predictions through its Attention,
as shown in Figure 3.
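The split sizes quoted above can be checked with a few lines of arithmetic. The three counts add up to the reported 186,150 images, and the shares they imply come out at roughly 64%, 6% and 30%, slightly different from the nominal 65/5/30 figures. The variable names are illustrative.

```python
# Image counts per split, as reported in the text.
split_sizes = {"train": 119_155, "validation": 11_147, "test": 55_848}

total = sum(split_sizes.values())

# Share of each split in percent, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in split_sizes.items()}

print("total images:", total)
for name, share in shares.items():
    print(f"{name}: {share}%")
```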
[14] claim that using a Seq to Seq model for product categorization helps to
identify new category paths in the taxonomy. However, in our experiments, we
observed that not all of the new category paths generated by the Seq to
Seq model are valid. In our case, the model generated 5 new category
paths, shown in Table 2, of which we found only 2 to be valid.
Therefore, a manual inspection of newly created category paths is needed to
select the category paths that can be used to enrich the taxonomy.
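Finding the candidate paths for such a manual inspection amounts to a set difference between the predicted paths and the paths already present in the taxonomy. The sketch below assumes category paths are stored as sequences of level names; the function name and the example paths are hypothetical and are not taken from Table 2.

```python
def find_new_paths(predicted_paths, taxonomy_paths):
    """Return predicted category paths that are absent from the taxonomy,
    without duplicates and in order of first appearance."""
    known = {tuple(path) for path in taxonomy_paths}
    seen = set()
    new_paths = []
    for path in predicted_paths:
        key = tuple(path)
        if key not in known and key not in seen:
            seen.add(key)
            new_paths.append(key)
    return new_paths


# Hypothetical taxonomy and model predictions.
taxonomy_paths = [
    ["Men", "Western Wear", "Shirts"],
    ["Women", "Ethnic Wear", "Kurtas"],
]
predicted_paths = [
    ["Men", "Western Wear", "Shirts"],  # already in the taxonomy
    ["Men", "Ethnic Wear", "Kurtas"],   # new combination, needs review
    ["Men", "Ethnic Wear", "Kurtas"],   # duplicate prediction
    ["Women", "Western Wear", "Shirts"],  # new combination, needs review
]

candidates = find_new_paths(predicted_paths, taxonomy_paths)
for path in candidates:
    print(" > ".join(path))
```

The function only flags paths that are new; deciding whether a flagged path is valid still requires a human reviewer, as noted above.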
Table 2. Valid and invalid category paths created by Seq to Seq model
9 https://goo.gl/forms/C1824kjmbuVo7H6H3
Fig. 5. F-scores of our benchmark models over leaf level categories ordered by their
sample size. Note that the sample size on the x axis is shown on a log scale.
8 Acknowledgements
The first author, Venkatesh Umaashankar, worked extensively on the problem
of Product Categorization using text attributes during his tenure at Indix. He
thanks Krishna Sangeeth, Sriram Ramachandrasekaran, Anirudh Venkataraman,
Manoj Mahalingam, Rajesh Muppalla and Sridhar Venkatesh for their help and
support.
Bibliography
[1] Bossard, L., Dantone, M., Leistner, C., Wengert, C., Quack, T., Van Gool,
L.: Apparel classification with style. In: Asian conference on computer
vision. pp. 321–335. Springer (2012)
[2] Chen, H., Gallagher, A., Girod, B.: Describing clothing by semantic
attributes. In: European conference on computer vision. pp. 609–623. Springer
(2012)
[3] Das, P., Xia, Y., Levine, A., Di Fabbrizio, G., Datta, A.: Web-scale
language-independent cataloging of noisy product listings for e-commerce.
In: Proceedings of the 15th Conference of the European Chapter of the
Association for Computational Linguistics: Volume 1, Long Papers. vol. 1, pp.
969–979 (2017)
[4] Ding, Y., Korotkiy, M., Omelayenko, B., Kartseva, V., Zykov, V., Klein, M.,
Schulten, E., Fensel, D.: Goldenbullet: Automated classification of product
data in e-commerce. In: Proceedings of the 5th international conference on
business information systems (2002)
[5] Dumais, S., Chen, H.: Hierarchical classification of web content. In:
Proceedings of the 23rd annual international ACM SIGIR conference on Research
and development in information retrieval. pp. 256–263. ACM (2000)
[6] Ha, J.W., Pyo, H., Kim, J.: Large-scale item categorization in e-commerce
using multiple recurrent neural networks. In: Proceedings of the 22nd ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining.
pp. 107–115. ACM (2016)
[7] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image
recognition. In: Proceedings of the IEEE conference on computer vision and
pattern recognition. pp. 770–778 (2016)
[8] Hiramatsu, M., Wakabayashi, K.: Encoder-decoder neural networks for
taxonomy classification. In: eCOM@SIGIR. CEUR Workshop Proceedings,
vol. 2319. CEUR-WS.org (2018)
[9] Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating
image descriptions. In: Proceedings of the IEEE conference on computer
vision and pattern recognition. pp. 3128–3137 (2015)
[10] Kiapour, M.H., Yamaguchi, K., Berg, A.C., Berg, T.L.: Hipster wars:
Discovering elements of fashion styles. In: European conference on computer
vision. pp. 472–488. Springer (2014)
[11] Kozareva, Z.: Everyone likes shopping! multi-class product categorization
for e-commerce. In: Proceedings of the 2015 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language
Technologies. pp. 1329–1333 (2015)
[12] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with
deep convolutional neural networks. In: Advances in neural information
processing systems. pp. 1097–1105 (2012)
[13] Li, M.Y., Kok, S., Kok, S.: Unconstrained product categorization with
sequence-to-sequence models. In: eCOM@SIGIR. CEUR Workshop Proceedings,
vol. 2319. CEUR-WS.org (2018)
[14] Li, M.Y., Kok, S., Tan, L.: Don’t classify, translate: Multi-level
e-commerce product categorization via machine translation. arXiv preprint
arXiv:1812.05774 (2018)
[15] Lin, Y., Das, P., Datta, A.: Overview of the SIGIR 2018 ecom rakuten
data challenge. In: eCOM@SIGIR. CEUR Workshop Proceedings, vol. 2319.
CEUR-WS.org (2018)
[16] Liu, Z., Luo, P., Qiu, S., Wang, X., Tang, X.: Deepfashion: Powering robust
clothes recognition and retrieval with rich annotations. In: Proceedings of
the IEEE conference on computer vision and pattern recognition.
pp. 1096–1104 (2016)
[17] McAuley, J., Targett, C., Shi, Q., Van Den Hengel, A.: Image-based
recommendations on styles and substitutes. In: Proceedings of the 38th
International ACM SIGIR Conference on Research and Development in
Information Retrieval. pp. 43–52. ACM (2015)
[18] Shen, D., Ruvini, J.D., Sarwar, B.: Large-scale item categorization for
e-commerce. In: Proceedings of the 21st ACM international conference on
Information and knowledge management. pp. 595–604. ACM (2012)
[19] Smith, L.N.: Cyclical learning rates for training neural networks. In: 2017
IEEE Winter Conference on Applications of Computer Vision (WACV). pp.
464–472. IEEE (2017)
[20] Williams, R.J., Zipser, D.: A learning algorithm for continually running
fully recurrent neural networks. Neural computation 1(2), 270–280 (1989)
[21] Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., Macherey, W.,
Krikun, M., Cao, Y., Gao, Q., Macherey, K., et al.: Google’s neural
machine translation system: Bridging the gap between human and machine
translation. arXiv preprint arXiv:1609.08144 (2016)
[22] Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel,
R., Bengio, Y.: Show, attend and tell: Neural image caption generation
with visual attention. In: International conference on machine learning. pp.
2048–2057 (2015)