
Hindawi

Computational Intelligence and Neuroscience


Volume 2020, Article ID 8875910, 23 pages
https://doi.org/10.1155/2020/8875910

Review Article
Deep Learning for Retail Product Recognition: Challenges
and Techniques

Yuchen Wei, Son Tran, Shuxiang Xu, Byeong Kang, and Matthew Springer
Discipline of ICT, School of TED, University of Tasmania, Launceston, Tasmania, Australia

Correspondence should be addressed to Yuchen Wei; [email protected]

Received 15 July 2020; Revised 13 October 2020; Accepted 19 October 2020; Published 12 November 2020

Academic Editor: Massimo Panella

Copyright © 2020 Yuchen Wei et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Taking time to locate desired products and waiting for the checkout in a retail store are scenes we all encounter in our daily lives. Automatic product recognition has great significance for both economic and social progress because it is more reliable and time-saving than manual operation. Product recognition from images is a challenging task in the field of computer vision. It has received increasing attention owing to its broad application prospects, such as automatic checkout, stock tracking, planogram compliance, and assistance for the visually impaired. In recent years, deep learning has evolved rapidly, with tremendous achievements in image classification and object detection. This article presents a comprehensive literature review of recent research on deep learning-based retail product recognition. More specifically, the paper reviews the key challenges of deep learning for retail product recognition and discusses potential techniques that can be helpful for research on the topic. Next, we provide details of public datasets that can be used for deep learning. Finally, we summarize the current progress and point out new perspectives for research in related fields.

1. Introduction and Background

The intention of product recognition is to facilitate the management of retail products and improve consumers' shopping experience. At present, barcode [1] recognition is the most widely used technology, not only in research but also in industries where automatic identification of commodities is needed. By scanning the barcode mark on each product package, the management of products can be easily facilitated. Normally, almost every item on the market has its corresponding barcode. However, because the printing position of the barcode is not fixed, it often takes time to manually find the barcode and assist the machine in identifying it at the checkout counter. Based on a survey from Digimarc [2], 45% of customers complained that, sometimes, it was not convenient to use barcode scanning machines. RFID (radio frequency identification) [3] has been applied in business fields with the growth of computer technology to enhance the automation of product identification. This technology automatically transmits data and information using radio frequency signals. RFID tags are placed on each product. Each tag has its specific number corresponding to a specific product, and the product is identified by wireless signal communication. Unlike the barcode, RFID tag data are readable without the line-of-sight requirements of an optical scanner. RFID nevertheless has shortcomings. Identifying multiple products still has a high error rate because radio waves may be blocked or may interfere with each other. Also, RFID labels are expensive and difficult to recycle, resulting in higher sales costs and sustainability issues [4].

As retail is evolving at an accelerated rate, enterprises are increasingly focusing on how to use artificial intelligence technology to reshape the retail industry's ecology and integrate online and offline experiences [5]. Based on a study from Juniper Research, global spending by retailers on AI services will increase over 300%, from $3.6 billion in 2019 to $12 billion in 2023 [6]. That is to say, innovative retail in the future may be realized entirely by artificial intelligence technology. Also, with the improvement of living standards, supermarket staff and customers are confronted with an ever-growing number of retail products. In this scenario, a massive amount of human labour and a large share of the workload are required for recognising products so as to conduct goods management [7]. Furthermore, with the help of various electronic devices for photographing, image resources of products are growing rapidly every day. As such, for a tremendous amount of image data, how to effectively analyze and process them, as well as to identify and classify the products in supermarkets, has become a key research issue in the product recognition field. Product recognition refers to the use of technology, mainly based on computer vision methods, so that computers can replace the process of manually identifying and classifying products.

Implementing automatic product recognition in grocery stores through images has a significant impact on the retail industry. Firstly, it will benefit the planogram compliance of products on the shelf. For instance, product detection can identify which items are missing from the shelf and remind the store staff to replenish the products immediately. It is observed that when an optimized planogram is 100% matched, sales increase by 7.8% and profit by 8.1% [8]. Secondly, image-based commodity identification can be applied to automatic self-checkout systems to optimize the user experience of checkout operations. Global self-checkout (SCO) shipments steadily increased between 2014 and 2019, and growing numbers of SCOs have been installed to reduce retailers' costs and enhance customer experience [9, 10]. The research in [11, 12] demonstrates that customers' waiting time for checkout operations has a negative influence on their shopping satisfaction, which is to say that applying computer-vision-based product recognition in SCOs benefits both retailers and customers. Thirdly, product recognition technology can assist people who are visually impaired to shop independently, which is conducive to their social connectivity [13]. Traditional shopping methods usually require assistance from a sighted person because it can be difficult for a person who is visually impaired to identify products by their visual features (e.g., price, brand, and due date), making purchase decisions difficult [14].

In general, retail product recognition problems can be described as a challenging instance of image classification [15, 16] and object detection [17-19]. During the last decade, deep learning, especially in the domain of computer vision, has achieved tremendous success and has become the core solution for image classification and object detection. The primary difference between deep learning and traditional pattern recognition methods is that the former can directly learn features from image data rather than relying on manually designed features. Another reason for the strong ability of deep learning is that deeper layers can extract more precise features than traditional neural networks. These advantages enable deep learning methods to bring new ideas to important computer vision problems such as image segmentation and keypoint detection. Recently, a few attempts have been applied to the retail industry, with state-of-the-art results [20-22]. In the meanwhile, some automated retail stores have emerged, such as Amazon Go (https://www.amazon.com/b?ie=UTF8&node=16008589011) and Walmart's Intelligent Retail Lab (https://www.intelligentretaillab.com/), which indicates that there is interest in unmanned retail with deep learning.

Deep learning-based retail product recognition has increasingly attracted researchers, and plenty of work has been done in this field. However, it appears that there are very few reviews or surveys that summarize existing achievements and current progress. We collected over a hundred related publications through Google Scholar, IEEE Xplore, and Web of Science, as well as major conferences such as CVPR, ICCV, IJCAI, NIPS, and AAAI. As a result, only two formally published surveys [4, 23] came to light, both of which studied the detection of products on shelves in retail stores. The scenario of recognising products for self-checkout systems has been neglected in their surveys, although it is also a complex task that needs to be solved for the retail industry.

In the published article [23], the authors reviewed 24 papers and proposed a classification of product recognition systems. Nevertheless, deep learning methods are not mentioned in that paper. Another related survey is [4], in which the authors presented a brief study on computer vision-based product recognition in shelf images. However, that survey does not focus on the field of deep learning: most of the methods presented are based on hand-crafted features. Therefore, with the rising popularity and potential applications of deep learning in retail product recognition, a new comprehensive survey is needed for a better understanding of this research field.

In this paper, we present an extensive literature review of current studies on deep learning-based retail product recognition. Our detailed survey presents challenges, techniques, and open datasets for deep learning-based product recognition. It offers meaningful insights into advances in deep learning for retail product identification. It also serves as a guideline for researchers and engineers who have just started researching product recognition, so that they can quickly find the problems that need to be studied. In summary, this paper makes three contributions: (1) we provide a comprehensive literature review of the implementation of deep learning methods in product identification; (2) we present current problem-solving techniques organized around the complexities of retail product recognition; (3) we discuss the challenges and available resources and identify future research directions.

The rest of this article is structured as follows: Section 2 gives an overview of computer vision methods for product recognition. Section 3 presents the challenges in detecting grocery products in retail stores. Section 4 describes current techniques for solving the complex problems. Section 5 describes the publicly available datasets and analyzes their particular application scenarios. Finally, Section 6 draws the conclusion and provides directions for future studies.

2. Computer Vision Methods in Retail Product Recognition

2.1. Classic Methods. With computer vision's rapid growth, researchers have been drawn to product recognition using the technology. Product recognition is realized by extracting features from the image of the package. The composition of the product image recognition system is shown in Figure 1. (1) Image capture: collecting images from cameras and mobile phones. (2) Image preprocessing: reducing noise and removing redundant information to provide high-quality images for subsequent operations; it mainly includes image segmentation, transformation, and enhancement. (3) Feature extraction: the analysis and processing of image data to determine the invariant characteristics in the image. (4) Feature classification: after a certain image feature is mapped to the feature vector or space, a specific decision rule is applied to classify the low-dimensional feature so that the recognition result is accurate. (5) Output of recognition: the pretrained classifier is employed to predict the category of the retail product.

Figure 1: The flowchart of the product image recognition system (image capture, image preprocessing, feature extraction, feature classification, output of recognition).

The core of product recognition is whether accurate features can be extracted or not. SIFT [24, 25] and SURF [26, 27] are the best representatives of traditional feature extraction technology. In 1999, Lowe proposed SIFT, which pays greater attention to local information and builds an image pyramid to solve the problem of multiscale features. SIFT features have many advantages, such as rotation invariance, translation invariance, and scale invariance, and they were the most widely used hand-crafted features before deep learning. In 2006, building on the foundation of SIFT, researchers proposed SURF features to improve calculation speed. SIFT has been used as a feature extractor for product classification in [13], and the SURF algorithm has been applied in [28] to detect out-of-stock and misplaced products on shelves. However, because the features extracted by SIFT and SURF are hand-crafted, they cannot fully capture all of the relevant information. Thus, researchers are increasingly interested in deep learning with end-to-end training to extract effective features.

2.2. Deep Learning. Deep learning is often regarded as a subfield of machine learning. The vital objective of deep learning is to learn deep representations, i.e., to learn multilevel representations and abstractions from data [29]. Initially, the concept of deep learning (also known as deep structured learning) was proposed by authoritative scholars in the field of machine learning in 2006 [30]. Shortly afterwards, Hinton and Salakhutdinov presented the methods of unsupervised pretraining and fine-tuning to alleviate the vanishing gradient problem [31]. After that year, deep learning became a research hotspot. In 2007, a greedy layer-wise training strategy was proposed to optimize the initial weights of deep networks [32]. ReLU (rectified linear unit) was introduced in 2011 to preserve more information across multiple layers, which could restrain the vanishing gradient problem [33]. The dropout algorithm [34] was proposed in 2012 to prevent overfitting, and it helped improve deep network performance.

In the field of computer vision, deep neural networks have been exploited with the improvement of computing power from computer hardware, particularly thanks to the use of GPUs in image processing. Nowadays, the application of deep learning in retail product recognition primarily covers the following two elements. (1) Image classification: a fundamental task in computer vision, which seeks to divide different images into different categories; on some benchmarks, computers now classify images better than humans. (2) Object detection: detecting objects with rectangular boxes while categorising images. In the last few years, with the ongoing growth of deep learning, many scientists and developers have built and optimized deep learning frameworks that speed up training and inference, such as Caffe [35], TensorFlow [36], MXNet [37], and PyTorch [38], which are the most common frameworks and make the use of deep learning methods much easier.

2.2.1. Convolutional Neural Networks. The success of deep learning in computer vision profits from convolutional neural networks (CNNs), which were inspired by biological research on the cat's visual cortex [39]. LeCun et al. first proposed to employ convolutional neural networks to classify images in 1998 [40]. They conceived the LeNet convolutional neural network model, which had seven layers. After training on a dataset of 32 × 32 handwritten characters, this model was successfully applied to digit recognition on bank checks. The structure of the CNN and its training techniques have been experiencing strong advances since 2010, benefiting from the ImageNet Large-Scale Visual Recognition Challenge. Also, with the advance of computing power from GPUs, deep learning has undoubtedly become a phenomenon. After 2010, a series of network structures such as AlexNet [15], GoogLeNet [41], VGG [42], and ResNet [43] were devised for image classification based on LeNet [40]. Recently, the CNN has become able to classify 3D objects via the multiview CNN [44]. The multiview CNN has shown remarkable performance on image classification tasks by inputting multiple images to the networks [45]. In the age of big data, researchers can select large datasets to train complex network structures that output more accurate results. In conclusion, big data and deeper networks are the two key elements for the success of deep learning, and these two aspects accelerate each other.
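To make the classification workflow above concrete, the following minimal PyTorch/torchvision sketch shows the standard transfer-learning recipe used by most CNN classifiers surveyed in this article: an ImageNet-pretrained backbone is reused, and only its final layer is replaced to match the number of product classes. The dataset path and the class count are hypothetical placeholders rather than values from any cited paper.

```python
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

NUM_CLASSES = 200  # hypothetical number of product categories

# Standard ImageNet preprocessing for a pretrained backbone.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Folder-per-class layout, e.g., products/train/<class_name>/*.jpg (placeholder path).
train_set = datasets.ImageFolder("products/train", transform=preprocess)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

# Reuse an ImageNet-pretrained ResNet and swap the classification head.
model = models.resnet50(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:  # one epoch shown for brevity
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```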

2.2.2. Deep Learning for Object Detection. CNNs have been the major deep learning technique for object detection; therefore, all the deep learning models discussed in this paper are based on the CNN. In order to detect various objects, it is essential to conduct region extraction for the different objects before image classification. Before deep learning, the common region extraction method was the sliding window algorithm [46], a traditional method that identifies the object in each window by sliding over the image. The sliding window strategy is inefficient and requires a very large amount of computation. After deep learning was incorporated into this field, object detection techniques could be classified into two categories: two-stage models (region proposal-based) and one-stage models (regression/classification-based) [47]. A two-stage model requires a region proposal algorithm to find the possible locations of objects in an image; it takes advantage of textures, edges, and colours in the image to ensure a high recall rate while selecting relatively few windows (thousands or even hundreds). In the R-CNN algorithm [48], an unsupervised region proposal method, selective search [49], is introduced, combining the power of both exhaustive search and segmentation. Although this method improved computing speed, it still needs to run a CNN computation for every region proposal. Fast R-CNN [18] was then developed to reduce the repeated CNN computation. Ren et al. proposed the region proposal network (RPN) [50], which uses a deep network while sharing features with the classification network. The shared features not only avoid the time consumption caused by recalculation but also improve accuracy. The Faster R-CNN algorithm, based on the RPN, is presently the mainstream technique for object identification, but it does not satisfy real-time computing speed criteria. Compared with the two-stage method, the one-stage method computes faster because it skips the region proposal stage: objects' locations and categories are directly regressed from multiple positions of the image. YOLO [51] and SSD [52] are the most representative algorithms, greatly speeding up detection, although their accuracy is inferior to the two-stage method.
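As a usage illustration of the two-stage family, the sketch below runs torchvision's off-the-shelf Faster R-CNN (an RPN-based detector in the spirit of [50]) on a single image and keeps the confident detections. Note that this checkpoint is pretrained on COCO's 80 everyday categories, not on retail products, so for shelf or checkout imagery it would first be fine-tuned on product data; the image path is a placeholder.

```python
import torch
from torchvision import transforms
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from PIL import Image

# COCO-pretrained two-stage detector (RPN + classification head).
model = fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = Image.open("shelf.jpg").convert("RGB")  # placeholder image path
tensor = transforms.ToTensor()(image)

with torch.no_grad():
    # The model returns one dict per input image: boxes, labels, scores.
    output = model([tensor])[0]

keep = output["scores"] > 0.5  # confidence threshold
for box, label, score in zip(output["boxes"][keep],
                             output["labels"][keep],
                             output["scores"][keep]):
    x1, y1, x2, y2 = box.tolist()
    print(f"class {label.item()} at ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f}), "
          f"score {score:.2f}")
```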
2.2.3. Product Recognition Based on Deep Learning. Deep learning has driven rapid development in object detection research. In this work, we perceive product recognition as a particular research issue related to object detection. At present, computer vision has already achieved widespread use; however, its application to product image recognition is still far from perfect. A typical pipeline of image-based product recognition is shown in Figure 2, with product images from the RPC dataset [7]. In general, an object detector is used to acquire a set of bounding boxes as region proposals. Then, several single-product images are cropped from the original image, which contains multiple products. Finally, each cropped image is input into the classifier, making the recognition of the products an image classification task.

Figure 2: A typical pipeline of image-based product recognition (detector, cropped single-product images, fine-grained classification against a reference image database, final output with per-product predictions).

In the last few years, some large technology companies have applied deep learning methods for recognising retail products in order to set up unmanned stores. Amazon Go (https://www.amazon.com/b?ie=UTF8&node=16008589011) was the first unmanned retail store open to the general public, in 2018. There are dozens of CCTV cameras in the Amazon Go store, and by using deep learning methods, the cameras are able to detect the customers' behaviour and identify the products they are buying. Nevertheless, the recognition accuracy from images alone still leaves much to be desired. Hence, other technologies, including Bluetooth and weight sensors, are also employed to ensure that the retail products can be identified correctly. Shortly after the Amazon Go store, a new retail store called the Intelligent Retail Lab (IRL) (https://www.intelligentretaillab.com/) was designed by Walmart in 2019 to inspect the application of artificial intelligence in retail services. In IRL, deep learning is exploited with cameras to automatically detect out-of-stock products and alert staff members when to restock. Furthermore, a number of intelligent retail facilities, such as automatic vending machines and self-serve scales, have emerged recently. A Chinese company, DeepBlue Technology (https://en.deepblueai.com/), has developed automatic vending machines and self-checkout counters based on deep learning algorithms, which can accurately recognize commodities using cameras. Malong Technologies (https://www.malong.com/en/home.html) is another well-known business in China that aims to provide deep learning solutions for the retail industry. The facilities from Malong Technologies include AI Cabinets, which perform automatic product recognition using computer vision technology, and AI Fresh, which enables automatic identification of fresh products on a self-serve scale. However, all these deep learning-based facilities are still in their early stages and have not entered widespread deployment. More research and practical testing need to be done in this area.

Based on the above review of current studies, we suggest that deep learning is an advanced method, as well as a growing technique, for retail product recognition; however, more research is needed in this area.
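The detect-then-classify pipeline of Figure 2 can be sketched in a few lines of Python. Here `detector` is assumed to be any trained bounding-box model (for example, a Faster R-CNN fine-tuned as a class-agnostic product proposer), `classifier` a fine-grained product classifier, and `product_names` its label list; all three are placeholders for whatever models a real system uses.

```python
import torch
from torchvision import transforms
from PIL import Image

def recognize_products(image: Image.Image, detector, classifier, product_names):
    """Stage 1: propose boxes; stage 2: crop each product; stage 3: classify crops."""
    to_tensor = transforms.ToTensor()
    with torch.no_grad():
        detection = detector([to_tensor(image)])[0]  # class-agnostic proposals

    results = []
    for box, score in zip(detection["boxes"], detection["scores"]):
        if score < 0.5:
            continue
        x1, y1, x2, y2 = [int(v) for v in box.tolist()]
        crop = image.crop((x1, y1, x2, y2)).resize((224, 224))
        with torch.no_grad():
            logits = classifier(to_tensor(crop).unsqueeze(0))
        prob, idx = torch.softmax(logits, dim=1).max(dim=1)
        results.append((product_names[idx.item()], prob.item(), (x1, y1, x2, y2)))
    return results  # e.g., [("bottled water", 0.97, (12, 40, 88, 210)), ...]
```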

3. Challenges

As mentioned in the Introduction, the peculiarities of retail product recognition make it more complicated than common object detection, since there are some specific situations to consider. In this section, we generalize the challenges of retail product recognition and classify them into the four aspects described in the following.

3.1. Large-Scale Classification. The number of distinct products to be identified in a supermarket can be enormous, approximately several thousand for a medium-sized grocery store, which far exceeds the ordinary capability of object detectors [53].

Currently, YOLO [17, 51, 54], SSD [52], Faster R-CNN [50], and Mask R-CNN [55] are state-of-the-art object detection methods, which evaluate their algorithms on the PASCAL VOC [56] and MS COCO [57] datasets. However, PASCAL VOC only contains 20 classes of objects, and MS COCO contains photos of 80 object categories. This is to say that current object detectors are not appropriate for direct application to retail product recognition, due to their limitations with large-scale category sets. Figure 3 compares the results on the VOC 2012 (20 object categories) and COCO (80 object categories) test sets for different algorithms, including Faster R-CNN, SSD, and YOLOv2. We list only three object identification approaches to demonstrate that the precision of all detectors drops dramatically when the number of classes rises. More comparative results can be found in [47].

[Figure 3: Comparative results (mAP, %) on the VOC 2012 (20 classes) and COCO (80 classes) test sets for Faster R-CNN, SSD, and YOLOv2; every detector's mAP drops from the high 70s on VOC 2012 to the mid-40s on COCO.]

Additionally, regarding the data distribution of the VOC dataset, more than 70 percent of the images contain objects belonging to a single category, and more than 50 percent involve only one instance per image. On average, each picture contains 1.4 categories and 2.3 instances. The COCO dataset contains an average of 3.5 categories and 7.7 instances per image. In a practical grocery store scenario, by contrast, customers usually buy dozens of items from more than ten categories. Therefore, the data above illustrate that the recognition of retail products has its peculiarities compared with common object detection. As a result, how to settle this practical problem is still an open question.

3.2. Data Limitation. Deep learning-based approaches require a large amount of annotated data for training, which raises a remarkable challenge in circumstances where only a small number of examples are available [21]. Table 1 lists some open-source tools that can be used for image labelling. These tools are divided into two categories: bounding box and mask. The bounding box category includes tools that can label an object with a bounding box, while tools in the mask category are useful for image segmentation. These image annotation tools require manual labour to label every object in each image. Normally, there are at least tens of thousands of training images in a general object detection dataset, which indicates that creating a dataset with enough training data for deep learning is time-consuming work.

Table 1: Image labelling tools.

Category | Tool | Environment
Bounding box | labelImg (https://github.com/tzutalin/labelImg) | Python
Bounding box | BBox-Label-Tool (https://github.com/puzzledqs/BBox-Label-Tool) | Python
Bounding box | LabelBoundingBox (https://github.com/hjptriplebee/LabelBoundingBox) | Python
Bounding box | Yolo_mark (https://github.com/AlexeyAB/Yolo_mark) | Python
Bounding box | CVAT (https://github.com/opencv/cvat) | Python
Bounding box | RectLabel (https://rectlabel.com/) | Mac OS
Bounding box | VoTT (https://github.com/microsoft/VoTT) | Java/Python
Mask | labelme (https://github.com/wkentaro/labelme) | Python
Mask | Labelbox (https://github.com/Labelbox/Labelbox) | Java/Python

Furthermore, with regard to grocery product recognition in retail scenarios, the majority of the training data is acquired under ideal conditions instead of in practical environments [58]. As the sample in Figure 4 shows, training images are usually taken of the same single product from several different angles on a rotating platform, while testing images come from real conditions and contain multiple products per image with a complex background.

Figure 4: GroZi-120: samples of training images (a) and testing images (b).

Last but not least, most scholars aim to perfect datasets for common object detection, such as VOC 2012 and COCO, which contributes to the data limitation issue for product recognition. Figure 5 illustrates that, compared with common object datasets, retail product datasets have fewer images but more classes. Therefore, it is necessary to provide a larger dataset for training a deep learning model when we want that model to be able to recognize objects from various categories. Based on the above, we can conclude that data shortage is a real challenge for retail product recognition.

[Figure 5: Comparison between common object datasets and retail product datasets (number of categories versus number of images, log scale): retail product datasets such as GroZi-120, GroZi-3.2k, Grocery store, Freiburg grocery, and D2S have far fewer images than VOC 2012 and COCO but a comparable or larger number of categories.]

3.3. Intraclass Variation. Intraclass classification, also known as subcategory recognition, is a popular research topic in both industry and academia, aiming at distinguishing subordinate-level categories. Generally, identifying intraclass objects is a very challenging task for the following reasons: (1) objects from similar subordinate categories often have only minor differences in certain areas of their appearance; sometimes, this task is difficult even for humans. (2) Intraclass objects may present multiple appearance variations at different scales or from various viewpoints. (3) Environmental factors, such as lighting, backgrounds, and occlusions, may have a great impact on the identification of intraclass objects [59]. To solve this challenging problem, fine-grained object classification is required to identify subcategory object classes, which involves finding the subtle differences among visually similar subcategories.
At present, fine-grained object classification is mainly applied to distinguish different species of birds [60], dogs [61], and flowers [62], or different brands of cars [63]. Moreover, compared with datasets for common object classification, it is more difficult to acquire fine-grained image datasets, which require relevant professional knowledge to complete the image annotations.

Due to the visual similarity in shape, colour, text, and metric size between intraclass products, retail products are genuinely hard to identify [64]. It can be difficult for customers to tell the difference between two flavours of cookies of the same brand, so we can expect it to be complex for computers to classify these intraclass products. Figure 6(a) shows two products whose different flavours produce only minute differences in the colour and text on the package. Figure 6(b) shows visually similar products of different sizes. Additionally, up to now, there have been no specific fine-grained datasets for retail product recognition. Fine-grained classification methods usually require additional manually labelled information. Without enough annotation data, it is more demanding to use deep learning methods to identify similar products.

Figure 6: Intraclass products with different flavours (a) (honey flavour and chocolate flavour) and sizes (b) (110 g and 190 g).

3.4. Flexibility. In general, with the increasing number of new products appearing every day, grocery stores need to import new items regularly to attract customers. Moreover, the appearance of existing products changes frequently over time. For these reasons, a practical recognition system should be flexible, requiring no or minimal retraining whenever a new product/package is introduced [20]. However, convolutional neural networks always suffer from "catastrophic forgetting": they become unable to recognize previously learned objects when adapted to a new task [65].

Figure 7 illustrates that, after training a detector on a new class, banana, it may well forget the previous objects. The top detector is trained with a dataset that includes orange, so it can detect orange in the image. Then, introducing the new class, banana, we train the detector only with banana images rather than with all the classes jointly. Finally, the bottom detector is generated, which can recognize the new class, banana, in the image. Nevertheless, this bottom detector fails to localize orange because it has forgotten its original knowledge of orange.

Figure 7: An example of introducing a new class to an existing retail product detector.

Currently, top-performing image classification and object detection models have to be retrained completely when a new category is introduced. This poses a key issue, as collecting new training data and retraining networks can be time-consuming. Therefore, how to develop an object detector with long-term memory is a problem worthy of study.

4. Techniques

Concerning the four challenges proposed in Section 3, we refer to a considerable body of literature and summarize current techniques related to deep learning, aiming to provide references with which readers can quickly gain entrance to the field of deep learning-based product recognition. In this paper, we not only introduce approaches within the scope of deep learning but also present some related methods that can be combined with deep learning to improve recognition performance. Figure 8 illustrates which technique targets which challenge.

Figure 8: Techniques for challenges (CNN feature descriptors address large-scale classification; data augmentation addresses data limitation; fine-grained classification addresses intraclass variation; one-shot learning addresses flexibility).

4.1. CNN-Based Feature Descriptors. The key issue of image classification lies in the extraction of image features; using the extracted features, images can be categorized into different classes. For the challenge of large-scale classification described in Section 3, the traditional hand-crafted feature extraction methods, e.g., SIFT [24, 25] and SURF [26, 27], appear to have been overtaken by the convolutional neural network (CNN) [66] because of their limitations in exploring deep information in images. At the moment, the CNN is a promising technique with a strong ability to create embeddings for different classes of objects, and some researchers have attempted to use the CNN for feature extraction [48, 67-69]. Table 2 shows the related works with CNN-based feature descriptors for retail product recognition.

Table 2: CNN-based feature descriptors and relevant approaches where these descriptors are employed.

Feature descriptor | Approaches
Inception [70] | [71, 72]
GoogLeNet [41] | [67]
AlexNet [15] | [21, 53, 58, 67, 73]
VGG [42] | [20, 21, 71, 74-76]
CaffeNet [35] | [10, 67, 73, 77]
ResNet [43] | [22, 68, 71, 74, 78-80]

In [72], Inception V3 [81] has been used to implement image classification of eight different kinds of products on shelves. The drawback is that the prediction accuracy on images from real stores only reaches 87.5%, which needs to be improved. Geng et al. [74] employed VGG-16 as the feature descriptor to recognize product instances, achieving recognition for 857 classes of food products. In this work, VGG-16 is integrated with recurring features and attention maps to improve the performance of grocery product recognition in real-world application scenarios. The authors also implemented their method with ResNet; then, 102 grocery products from CAPG-GP (the dataset built in that paper) were successfully classified with an mAP of 0.75. Another notable work using ResNet is [22], which introduces a scale-aware network for generating product proposals in supermarket images. Although this method does not aim to predict product categories, it can accurately perform object proposal detection for products with different scale ratios in one image, a practical issue in supermarket scenarios. In [71], the authors considered three popular CNN models, VGG-16 [42], ResNet [43], and Inception [70], and performed extensive K-NN similarity search on the outputs of the three models. Their method was evaluated with three grocery product datasets, the largest of which contained 938 classes of food items. AlexNet was exploited in [53] to compute visual features of products, combining deep class embedding into a CRF (conditional random field) [82] formulation, which enables classifying products with a huge number of classes. The benchmark in that paper involved 24,024 images and 460,121 objects, with each object belonging to one of 972 different classes. The above methods can only be applied to a small retail store, as all of them recognize at most about 1,000 classes of products, while a stronger ability to classify more categories of items is required for medium-sized and large-sized retail stores.

Recent works have tried to realize large-scale classification; e.g., Tonioni et al. and Karlinsky et al. [20, 21] proposed approaches that can detect several thousand product classes. In [20], the backbone network for the feature descriptor is VGG, from which a global image embedding is obtained by computing MAC (maximum activations of convolutions) features [83]. This research deals with products belonging to 3,288 different classes of food products. Tonioni et al. obtained state-of-the-art precision and recall results of 57.07% PR and 36.02% mAP, respectively. In the work of Karlinsky et al. [21], the CNN feature descriptor is based on fine-tuning a variant of the VGG-F network [84], keeping the first 2-15 layers of VGG-F trained on ImageNet [85] unchanged. As a result, the authors presented a method that recognizes each product category out of a total of 3,235, with an mAP of 52.16%. Judging from the data in these two papers, the recognition accuracy, including precision and recall, still has considerable space for improvement before this technique can be deployed in the retail industry.

Lately, the popular object detector YOLO9000 [54] has introduced a method that can detect 9,000 object classes by using a revised Darknet [86]. Unfortunately, YOLO9000 was trained on millions of images, which is infeasible when training a product recognition model due to the high annotation costs. However, the success of YOLO9000 illustrates the potential of the CNN to achieve large-scale classification (thousands of classes). As for the problem of how to produce more data for training, we discuss it in the next section.
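A pattern shared by several of the works above ([20, 71]) is to use the CNN purely as an embedding function and to match products by feature similarity instead of a fixed softmax head, which is what makes thousands of classes tractable. The sketch below computes a MAC-style global descriptor (channel-wise maximum over the final convolutional map, following the idea of [83]) and classifies a query by K-NN search against reference embeddings; the reference tensors and labels are assumed to be prepared elsewhere.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Backbone without its classification head; the conv5 output is the feature map.
backbone = models.resnet50(pretrained=True)
feature_map = torch.nn.Sequential(*list(backbone.children())[:-2])
feature_map.eval()

def mac_descriptor(image_batch: torch.Tensor) -> torch.Tensor:
    """MAC: maximum activations of convolutions, followed by L2 normalisation."""
    with torch.no_grad():
        fmap = feature_map(image_batch)    # (N, C, H, W)
        desc = fmap.amax(dim=(2, 3))       # global max-pool over locations -> (N, C)
        return F.normalize(desc, dim=1)

# reference_images: (M, 3, 224, 224) tensor with one or more images per product;
# reference_labels: list of M class ids (both placeholders).
def knn_classify(query: torch.Tensor, reference_images, reference_labels, k=1):
    ref = mac_descriptor(reference_images)
    q = mac_descriptor(query.unsqueeze(0))
    similarity = q @ ref.T                 # cosine similarity of unit vectors
    topk = similarity.topk(k, dim=1).indices[0]
    return [reference_labels[i] for i in topk]
```

Introducing a new product class under this scheme only requires adding its reference embeddings, with no change to the network itself.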
4.2. Data Augmentation. It is common knowledge that deep learning methods require a large number of training examples; nevertheless, acquiring large sets of training examples is often tricky and expensive [87]. Data augmentation is a common technique used in deep network training to handle the shortage of training data [78]. This technique uses a small number of images to generate new synthetic images, aiming to artificially enlarge small datasets and reduce overfitting [15, 88]. In this paper, we divide the current mainstream approaches into two categories: common synthesis methods and generative models. The existing publications are listed in Table 3.

Table 3: Related works for data limitation in the field of retail product recognition.

Technique | Category | Existing works
Data augmentation | Common synthesis | [21, 22, 76, 79, 80, 89]
Data augmentation | Generative | [7, 71, 78]

4.2.1. Common Synthesis Methods. The common methods for image data augmentation generate new images through translations, rotations, mirror reflections, scaling, and the addition of random noise [15, 90, 91]. A significant early attempt can be found in the work of Merler et al. [89]: synthetic samples were created from images taken under ideal imaging conditions (referred to as in vitro) by applying randomly generated perspective distortions.

Occlusion of products is also a common phenomenon in real practice. In [22], the authors proposed a virtual supermarket dataset to let models learn in a virtual environment. In this dataset, the occlusion threshold is set to 0.9, which means a product occluded beyond this threshold will not be labelled as ground truth. UnrealCV [92] was employed to extract the ground truth of object masks from real-world images. Then, the extracted object masks were composited onto a background of shelves, and 5,000 high-quality synthetic images were rendered. Other aspects such as realism, randomness of placement, product overlap, object scales, lighting, and materials were also taken into account when constructing the synthetic dataset. By using the virtual supermarket dataset, the authors achieved identification of items in real-world datasets without fine-tuning. Recently, Yi et al. [79] simulated occlusion by overwriting a random region in the original image with either a black block or a random patch from another product. They then fine-tuned their Faster R-CNN detection model with in vitro (ideal conditions) and in situ (natural environments) data and obtained relatively high mAP and recall. The in situ data are divided into conveyor and shelf scenarios, where the authors obtained an mAP of 0.84 on the conveyor and 0.79 on the shelf, respectively. Some synthetic samples are shown in the first two rows of Figure 9. Unfortunately, comparative experiments between the proposed algorithm and other state-of-the-art algorithms are absent from that paper.

The work in [80] synthesizes new images containing multiple objects by combining and reorganizing atomic object masks. Ten thousand new images were assembled, each containing from one to fifteen objects at random. For each generated image, the lighting, the class of the object instances, the orientation, and the location in the image are randomly sampled. The last row of Figure 9 shows example synthetic images under three different lightings. By adding the 10,000 generated images to the training set, the AP on the test set improved to 79.9% and 72.5% for Mask R-CNN [55] and FCIS [93], respectively; by contrast, the AP is only 49.5% and 45.6% without the generated data.

To realize product recognition with a single example, the researchers in [21] generated large numbers of training images using geometric and photometric transformations based on a few available training examples. To facilitate image augmentation for computer vision tasks, albumentations is presented in [94] as a publicly available tool that enables a variety of image transformation operations. Recent work in [95] applied albumentations to a small training dataset and then trained a product detection model with the augmented dataset. The outcomes show that the model can attain reasonable detection accuracy with fewer images.

However, the common methods for generating new images have limitations in simulating the variety of conditions found in the real world. Generative models are introduced to prevent models from learning such conditions illogically.

Figure 9: The first two rows show examples of occlusion simulation in [79], and the third row demonstrates example images from [80] under three different lightings.
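For reference, the common synthesis operations described in this subsection (rotations, flips, colour jitter, perspective distortion in the spirit of [89], and a random occluding block similar to the black-block occlusion of [79]) can be composed in a few lines with torchvision; albumentations [94] exposes the same operations with a richer API. The copy count below is an arbitrary example value.

```python
import torch
from torchvision import transforms

# One in-vitro product photo in, many perturbed training samples out.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.2),
    transforms.RandomPerspective(distortion_scale=0.3, p=0.5),
    transforms.ToTensor(),
    # Overwrite a random region, imitating occlusion by neighbouring products.
    transforms.RandomErasing(p=0.5, scale=(0.02, 0.2), value=0),
])

def expand_dataset(pil_image, n_copies=20):
    """Generate n_copies augmented tensors from a single studio image."""
    return torch.stack([augment(pil_image) for _ in range(n_copies)])
```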

4.2.2. Generative Models. Nowadays, generative models include the variational autoencoder (VAE) [96] and generative adversarial networks (GANs) [97], which are gaining more and more attention due to their potential ability to synthesize in vitro images similar to those in realistic scenes for data augmentation [78]. Normally, generative models enrich the training dataset in two ways. One is generating new images containing objects that look similar to the real data; the synthetic images directly increase the number of training images for each category. The other approach is image-to-image translation, described as the problem of translating the picture style from a source domain to a target domain [98]. For example, if the target domain is defined as a practical scene in a retail store, this image transfer approach can make the training images more realistic. In Table 4, we list some state-of-the-art models based on the architectures of the VAE and the GAN for image generation and translation, respectively. The works displayed in the table show that GAN-based models are powerful both for producing new images and for image-to-image transfer. So far, approaches based on the VAE have been unable to achieve image translation tasks. The detailed research and application status of image synthesis with the VAE and the GAN is introduced in the following.

Table 4: Summary of models based on the structures of the VAE and the GAN for image synthesis.

Synthesis type | VAE | GANs
Image generation | VAE [96], cVAE [99], Attribute2Image [101], Multistage VAE [103] | GAN [97], CGAN [100], DCGAN [102], InfoGAN [104]
Image translation | — | Pix2Pix [105], CycleGAN [106], DualGAN [107], DiscoGAN [108], StarGAN [109], VAE-GAN [110]

The VAE has not yet been applied as an image creator in the domain of product recognition. The general framework of the VAE comprises an "encoder network" and a "decoder network"; after training the model, the "decoder network" can be used to generate realistic images. Here we present some successful cases of the VAE in other classification and detection fields for reference. In [101], a novel layered foreground-background generative model, trained end-to-end in a deep neural network using the VAE, is provided for generating realistic samples from visual attributes. This model was evaluated on the Labeled Faces in the Wild (LFW) dataset [111] and the Caltech-UCSD Birds-200-2011 (CUB) dataset [60], which contain natural images of faces and birds, respectively. The authors trained an attribute regressor to compare the differences between generated images and real data. Finally, their model achieved 16.71 mean squared error (MSE) and 0.9057 cosine similarity on the generated samples. Another noteworthy work is [112], where the authors used a conditional VAE to generate samples from given attributes for addressing zero-shot learning problems. They tested this method on four benchmark datasets, AwA [113], CUB [60], SUN [114], and ImageNet [85], and gained state-of-the-art results, particularly in a more realistic generalized setting. These successful application examples manifest that the VAE is a promising technique for data augmentation. With the increasing attention on product recognition, the VAE may well be applied in this field soon.

The GAN, proposed in 2014, has been achieving remarkable results in various research fields. The framework of the GAN consists of two models: a generator that produces fake images and a discriminator that estimates the probability that a sample is a real image rather than a fake one [97]. As a result, compared with common synthesis methods, the generator can be used to generate images that look more realistic.

With this advantage, scholars have demonstrated the great potential of using the GAN and its variants [104-107] to produce images for enlarging training sets. For example, in [115], the authors built a framework with structure-aware image-to-image translation networks, which could generate large-scale trainable data. After training with the synthetic dataset, the proposed detector delivered significant performance on night-time vehicle detection. In another work [116], a novel deep semantic hashing was presented, combined with a semisupervised GAN, to produce highly compelling data with intrinsic invariance and global coherence. This method achieved state-of-the-art results on the CIFAR-10 [117] and NUS-WIDE [118] datasets. A new image density model based on the PixelCNN architecture was established in [119], which could be used to generate images from diverse classes simply by conditioning on a one-hot encoding of the class. Zheng et al. employed the DCGAN to produce unlabeled images in [120] and then applied these new images to train a model for recognising fine-grained birds. This method attained an improvement of +0.6% over a powerful baseline [121]. In [122], CycleGAN was used to create 200,000 license plate images from 9,000 real pictures; the result demonstrated an increase of 7.5 percentage points in recognition precision over a strong benchmark trained only with real data. The evidence above indicates that GANs are powerful tools for generating realistic images that can be used for training deep neural networks. It is likely that, in the near future, the experience of the above methods can be borrowed to improve the performance of product recognition.

Although GANs have shown compelling results in the domains of general object classification and detection, there are very few works using GANs for product recognition. To the best of our knowledge, only three papers [7, 71, 78] attempt to exploit GANs to create new images in the field of product recognition. In the work of [7], the authors proposed a large-scale checkout dataset containing synthetic training images generated by CycleGAN [106]. Technically, they first synthesized images by placing object instances on a prepared background image; then, CycleGAN was employed to translate these images into the checkout image domain. By training with the combination of translated and original images, their product detector, a feature pyramid network (FPN) [123], attained 56.68% accuracy and 96.57% mAP on average. Figure 10 shows the CycleGAN translation effects. Based on the work of Wei et al. [7], Li et al. [78] conducted further research by selecting reliable checkout images with their proposed data priming network (DPNet). Their method achieved 80.51% checkout accuracy and 97.91% mAP. In [71], a GAN was deployed to produce realistic samples, as well as to play an adversarial game against the encoder network. However, the translated images in both [7, 78] only contain a simple background of flat colour. Considering the complex backgrounds of real checkout counters and goods shelves, how to generate retail product images in a more true-to-life setting is worthy of research.

Figure 10: Synthesized checkout images (left) and the corresponding images generated by CycleGAN (right) from [7].
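The adversarial game described above can be written down compactly. The sketch below shows one generic GAN update step in the spirit of [97]: the discriminator `D` learns to separate real product images from generated ones, while the generator `G` learns to fool it. Both network definitions are omitted placeholders (`G` maps noise vectors to images, `D` maps images to a single real/fake logit); image-to-image variants such as CycleGAN [106] replace the noise input with a source-domain image and add cycle-consistency terms.

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, real_images, opt_g, opt_d, z_dim=100):
    n = real_images.size(0)

    # --- discriminator update: push real images toward 1, fakes toward 0 ---
    z = torch.randn(n, z_dim)
    fake = G(z).detach()  # block gradients from flowing into G
    d_loss = (F.binary_cross_entropy_with_logits(D(real_images), torch.ones(n, 1)) +
              F.binary_cross_entropy_with_logits(D(fake), torch.zeros(n, 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # --- generator update: make D label freshly generated images as real ---
    z = torch.randn(n, z_dim)
    g_loss = F.binary_cross_entropy_with_logits(D(G(z)), torch.ones(n, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

    return d_loss.item(), g_loss.item()
```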

4.3. Fine-Grained Classification. Fine-grained classification is a challenging problem in computer vision which enables computers to recognize objects of subclass categories [124, 125]. Recently, many researchers and engineers have focused on the technique of fine-grained classification and applied it in a significant number of domains with remarkable achievements, e.g., animal breeds or species [126-131], plant species [62, 131-133], and artificial entities [129, 130, 134-136]. Fine-grained retail product recognition is a more challenging task than general object recognition due to intraclass variance and interclass similarity. Considering the specific complications of product recognition in terms of blur, lighting, deformation, orientation, and the alignment of products on shelves, we summarize the existing product fine-grained classification methods into two categories, i.e., fine feature representation and context awareness.

4.3.1. Fine Feature Representation. Fine feature representation refers to extracting fine features from a local part of the image to find discriminative information between visually similar products. Consequently, how to effectively detect foreground objects and find important local information has become the principal problem for fine-grained feature representation. According to the supervisory information used for training the models, fine feature representation methods can be divided into two categories: "strongly supervised fine feature representation" and "weakly supervised fine feature representation."

(1) Fine Feature Representation from Strongly Supervised Models. The strongly supervised methods require additional manually labelled information such as bounding boxes and part annotations. As mentioned in Section 3, the practical applicability of such methods is largely limited by the high acquisition cost of annotation information. The classical methods include part-based R-CNN [127] and pose-normalized CNN [137].

In [127], part-based R-CNN is established to identify fine-grained species of birds. This method uses R-CNN to extract features from whole objects (birds) and local areas (head, body, etc.). Then, for each region proposal, it computes scores using features from the object and each of its parts. Finally, by considering the scores of fine-grained features jointly, this method achieves state-of-the-art results on the widely used fine-grained benchmark, the Caltech-UCSD bird dataset [60].

Branson et al. presented the pose-normalized CNN in [137]; the fine-grained feature extraction process in this paper is as follows: (1) the DPM algorithm is used to detect the object location and its local areas; (2) the image is cropped according to the bounding boxes, and features are extracted from each cropped image; (3) based on the different parts, convolutional features are extracted from multiple layers of the CNN; (4) these features are fed into one-vs-all linear SVMs (support vector machines) [138] to learn weights. Eventually, the classification accuracy of their method reached 75.7% on the Caltech-UCSD bird dataset.

In the domain of retail product recognition, the work in [139] can, to some extent, be considered a solution for fine-grained classification. The researchers designed an algorithm called DiffNet that detects differing products between a pair of similar images. They labelled the differing products in each pair of images, with no need to annotate the unchanged objects. The consequence was that this algorithm achieved a desirable detection accuracy of 95.56% mAP. DiffNet could benefit the progress of product recognition, particularly for detecting changes in on-shelf products.

(2) Fine Feature Representation from Weakly Supervised Models. The weakly supervised techniques avoid the use of costly annotations such as bounding boxes and part information. Similar to the strongly supervised classification methods, the weakly supervised methods also require global and local features for fine-grained classification. Consequently, the principal task of a weakly supervised model is how to detect the parts of the object and extract fine-grained features.

The two-level attention [126] algorithm is the first attempt to perform fine-grained image classification without relying on part annotation information. This method is based on a simple intuition: extract features at the object level and then focus on the most discriminative parts for fine-grained classification. The constellation [140] algorithm was proposed by Simon and Rodner in 2015. It exploits features from the convolutional neural network to generate neural activation patterns that can be used to extract features from parts of the object. Another remarkable work is [141], where the authors proposed novel bilinear models containing two CNNs, A and B. The function of CNN A is to localize the object and its parts, while CNN B extracts features from the detected regions. These two networks coordinate with each other and obtain 84.1% accuracy on the Caltech-UCSD bird dataset.
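For reference, the core operation behind the bilinear models of [141] is compact enough to state directly: the feature maps of the two streams are combined by an outer product at each spatial location and pooled over locations, producing a descriptor of pairwise feature interactions that suits fine-grained discrimination. A minimal sketch, with the two backbone streams left as placeholders:

```python
import torch
import torch.nn.functional as F

def bilinear_pool(fmap_a: torch.Tensor, fmap_b: torch.Tensor) -> torch.Tensor:
    """Combine two CNN feature maps (N, C, H, W) via a location-wise outer product."""
    n, ca, h, w = fmap_a.shape
    cb = fmap_b.shape[1]
    a = fmap_a.reshape(n, ca, h * w)
    b = fmap_b.reshape(n, cb, h * w)
    x = torch.bmm(a, b.transpose(1, 2)) / (h * w)  # (N, Ca, Cb), pooled over locations
    x = x.reshape(n, ca * cb)
    x = torch.sign(x) * torch.sqrt(torch.abs(x) + 1e-10)  # signed square root
    return F.normalize(x, dim=1)                           # L2 normalisation

# With placeholder streams net_a and net_b, a fine-grained classifier is then
# logits = linear(bilinear_pool(net_a(images), net_b(images))).
```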

Regarding the fine-grained classification of retail products, researchers have begun to take advantage of fine feature representation to identify subclass products. In [21], a CNN was proposed for improving the fine-grained classification performance, combined with scored short-lists of possible classifications from a fast detection model. Specifically, a variable containing the product of the scores from a fast detection model and the corresponding CNN confidences is used for ranking the final result. In the research of [74], Geng et al. applied visual attention to fine-grained product classification tasks. Attention maps are employed to magnify the influence of the features and consequently to guide the CNN classifier to focus on fine discriminative details. Eventually, they compared their method with state-of-the-art approaches and obtained promising results. Based on the method of [142], George et al. performed fine-grained classification for products on a shelf in [143]. They extracted midlevel discriminative patches on product packaging and then employed SVM classifiers to differentiate visually similar product classes by analyzing the extracted patches. Their work shows the superior performance of using discriminative patches in fine-grained product classification. In the recent study from [144], a self-attention module is proposed for capturing the most informative parts in images. The authors compared the activation response of a position with the mean value of the features to locate the crucial parts of the fine-grained objects. The experimental results in [144] show that the fine-grained recognition performance is improved in cross-domain scenarios.
4.3.2. Context Awareness. Context is a statistical property of the world which provides critical cues to help us detect specific objects in retail stores [145], especially when the appearance of an object may not be sufficient for accurate categorization. Context information has been applied to improve performance in the domain of object detection [145–147] due to its ability to provide useful information about spatial and semantic relationships between objects.

With regard to the scenario in a supermarket, products are generally placed on shelves according to certain arrangement rules; e.g., intraclass products are more likely to appear adjacent to each other on the same shelf. Consequently, context can be considered as a reference for recognising similar products on shelves, jointly with deep features. Currently, few works in the literature take contextual information into account together with deep learning detectors for product recognition. In [53], a novel technique to learn deep contextual and visual features for the fine-grained classification of products on shelves is introduced. Technically, the authors proposed a CRF-based method [82] to learn the class embedding from a CNN with respect to its neighbours' visual features. In this paper, the product recognition problem is addressed not only based on visual appearance but also on relative locations. This method has been evaluated on a dataset that contains product images from retail stores, and it improves the recall to 87% with 91% precision. Another two papers also obtained prominent results by considering context; however, they did not use a deep learning-based feature descriptor. One is from [64], and it presents a context-aware hybrid classification system for fine-grained product recognition, which combines the relationships between the products on the shelf with image features extracted by SIFT methods. This method achieves an 11.4% improvement compared with the context-free method. In [148], the authors proposed a computer vision pipeline that detects missing or misplaced items by using a novel graph-based consistency check method. This method regards the product recognition problem as a subgraph isomorphism between the item packaging and the ideal locations.
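As a schematic illustration of how contextual cues can be fused with deep visual evidence, consider re-scoring a detector's class probabilities with a prior derived from the classes already predicted for neighbouring shelf positions. This is only a sketch of the general idea, not the CRF model of [53] or the hybrid system of [64]; the co-occurrence matrix, the blending weight alpha, and the function names are assumptions.

import numpy as np

def rescore_with_context(class_probs, neighbor_labels, cooccurrence, alpha=0.5):
    """Blend CNN class probabilities with a shelf co-occurrence prior.

    class_probs:     (C,) softmax scores for one detected product
    neighbor_labels: class indices predicted for adjacent shelf positions
    cooccurrence:    (C, C) row-normalised matrix of how often class i
                     appears next to class j on the same shelf
    alpha:           weight given to the visual evidence (assumed value)
    """
    context_prior = np.ones_like(class_probs) / len(class_probs)
    if neighbor_labels:
        # Average the co-occurrence rows of the neighbouring classes.
        context_prior = cooccurrence[neighbor_labels].mean(axis=0)
    fused = alpha * class_probs + (1.0 - alpha) * context_prior
    return fused / fused.sum()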
4.4. One-Shot Learning. One-shot learning is derived from distance metric learning [149], with the purpose of learning information about object categories from one or only a few training samples/images [87]. It is of great benefit for seamlessly handling new products/packages, as the only requirement is to introduce one or several images of the new item into the reference database with no or minimal retraining. The basic concept of how to classify objects with one-shot learning is shown in Figure 11. The points C1, C2, and C3 are the mean centres of feature embeddings from three different categories, respectively. Based on the feature embedding of X, the distance between X and the three points (C1, C2, and C3) can be calculated. Thus, X will be identified as belonging to the class with the shortest distance. Additionally, one-shot learning is also a powerful method to deal with the training data shortage, with the possibility of learning much information about a category from just one or a handful of images [87]. Considering the advantages of one-shot learning, a lot of literature has combined one-shot learning with the CNN for a variety of tasks, including image classification [150–153] and object detection [154, 155].

Figure 11: The prototype network of one-shot learning.

In [152], a novel metric was proposed, including colour-invariant features from intensity images with CNNs and colour components from a colour checker chart. The metric is then used by a one-shot metric learning approach to realize person identification. Vinyals et al. in [150] designed a matching network, which employs metric learning based on deep neural features. Their approach is tested on the ImageNet dataset and is able to recognize new items when introducing a few examples of a new item. Compared with the Inception classifier [41], it has increased the accuracy of one-shot classification on ImageNet from 87.6% to 93.2%. In the domain of object detection, the work in [155] combines distance metric learning with R-CNN and implements animal detection with few training examples. Video object segmentation was achieved in [154], where the authors adapted a pretrained CNN to retrieve a particular object instance, given a single annotated image, by fine-tuning on a segmentation example for the specific target object.

Two very recent papers have succeeded in bringing one-shot learning combined with deep features from CNNs to the specific domain of retail products. In [74], a framework integrating feature-based matching and one-shot learning with a coarse-to-fine strategy is introduced. This framework performs flexibly, allowing new product classes to be added without retraining existing classifiers. It has been evaluated on the GroZi-3.2k [13], GP-20 [58], and GP181 [148] datasets and attained 73.93%, 65.55%, and 85.79% mAP, respectively. Another work from [20] proposes a pipeline which pursues product recognition through a similarity search between the deep features of reference and query images. Their pipeline requires just one training image for each product class and handles new product packaging seamlessly.
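The nearest-centroid idea behind Figure 11 can be written down in a few lines. The sketch below is our own minimal illustration; the embedding network producing the vectors and the Euclidean distance are assumptions, not a specific published implementation.

import torch

def build_prototypes(support_embeddings):
    """support_embeddings: dict mapping class id -> (n_i, D) tensor holding
    the embeddings of the one or few reference images of that product."""
    return {c: e.mean(dim=0) for c, e in support_embeddings.items()}

def classify(query_embedding, prototypes):
    """Assign the query to the class whose mean centre (C1, C2, C3, ...)
    is closest in Euclidean distance, as sketched in Figure 11."""
    distances = {c: torch.dist(query_embedding, p) for c, p in prototypes.items()}
    return min(distances, key=distances.get)

Adding a new product then only requires computing one more centre from its reference images; no classifier retraining is involved, which is exactly the flexibility exploited by the retail pipelines of [74] and [20].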
In this section, we provided a comprehensive literature review to summarize the research status of the four techniques, which are powerful tools to deal with the challenging problems of product recognition. In the next section, we introduce the public datasets and present a comparative study on the performances of deep learning methods.
5. Dataset Resources

As mentioned earlier, deep learning always requires plenty of annotated images for training and testing, while it is often labour-intensive to obtain labelled images in real practice. In this section, we present public dataset resources, assisting researchers in testing their methods and comparing results based on the same datasets. According to the different application scenarios, we split the resources into two categories: on-shelf and checkout. Table 5 lists the detailed information of several available datasets, including the number of product categories, the number of instances in each image, and the number of images in the training and testing sets. The datasets are briefly introduced in the following.

5.1. On-Shelf Datasets. On-shelf datasets are benchmarks for testing methods proposed to recognize products on shelves, which shall benefit product management. Here, we present six available datasets.

5.1.1. GroZi-120. The GroZi-120 dataset [89] consists of 120 product categories, with images representing the same products under completely different conditions, together with their text annotations. The training set includes 676 in vitro images, captured under ideal conditions. A training image contains just one single instance, enabling this dataset to suit one-shot learning. The test set has 4,973 frames annotated with ground truth in situ, which are rack images obtained from natural environments with a variety of illuminations, sizes, and poses. It also has 29 videos with a total duration of 30 minutes, including every product presented in the training set. The in situ videos are recorded using a VGA-resolution MiniDV camcorder at 30 fps, and the in situ rack images are of low resolution. Samples of in vitro images and in situ rack images are shown in Figure 4.

5.1.2. GroZi-3.2k. GroZi-3.2k [13] is a dataset containing supermarket products, which can be used in fine-grained recognition. This dataset includes 8,350 training images collected from the web, belonging to 80 broad product categories. Training images are taken in ideal conditions with a white background, and most of them contain only one single instance per image. On the contrary, the testing set consists of 680 images captured from 5 real-life retail stores using a mobile phone, with ground truth annotations. The dataset is named GroZi-3.2k because all the products in the test images are from the 27 training classes of the "food" category, under which 3,235 training images are included. Examples of training and testing images are shown in Figure 12.

5.1.3. Freiburg Grocery Dataset. The Freiburg Grocery dataset [77] consists of 4,947 images of 25 grocery classes. The training images are taken at stores, apartments, and offices in Germany using four different phone cameras. Each training image has been downscaled to a size of 256 × 256 pixels, containing one or several instances of one category. Furthermore, an additional set includes 74 images collected in 37 cluttered scenes that can be used as a testing set. Each testing image is recorded by a Kinect v2 [156] camera at 1920 × 1080 pixels RGB, containing several products belonging to multiple classes. Figure 13 shows some examples of training images and testing images from this dataset.

5.1.4. Cigarette Dataset. The Cigarette dataset [157] comes with product images and shelf images from 40 retail stores, captured by four different types of cameras. The training set consists of 3,600 product images belonging to 10 cigarette classes. Each image in this set includes only one instance. The testing set is made of 354 shelf images, which contain approximately 13,000 products in total. Each product in the shelf images has been annotated with bounding boxes and cigarette categories using the Image Clipper utility. Figure 14 demonstrates the brand classes and an example of shelf images.

5.1.5. Grocery Store Dataset. The Grocery Store dataset [158] was developed to address natural image classification for assisting people who are visually impaired. This dataset consists of iconic images and natural images. The iconic images are downloaded from a grocery store website with product information, such as country of origin, weight, and nutrient values. On the contrary, the natural images are collected from 18 different grocery stores, recorded by a 16-megapixel phone camera at different distances and angles. This set, containing 5,125 images from 81 fine-grained classes, has been split into one training set and one test set randomly to reduce data bias. The training and test sets contain 2,640 and 2,485 images, respectively, and each image contains one or several instances of one product class. Figure 15 illustrates examples of iconic and natural images.
Table 5: Detailed information of several public datasets.

Scenario | Dataset | #product categories | Training set: #instances per image | Training set: #images | Test set: #instances per image | Test set: #images
On-shelf | GroZi-120 (http://grozi.calit2.net/grozi.html) | 120 | Multiple | 676 | Multiple | 4,973
On-shelf | GroZi-3.2k (https://sites.google.com/view/mariangeorge/datasets) | 27/80 | Single | 8,350 | Multiple | 3,235
On-shelf | Freiburg Grocery (https://github.com/PhilJd/freiburg_groceries_dataset) | 25 | Multiple (one class) | 4,947 | Multiple | 74
On-shelf | Cigarette (https://github.com/gulvarol/grocerydataset) | 10 | Single | 3,600 | Multiple | 354
On-shelf | Grocery Store (https://github.com/marcusklasson/GroceryStoreDataset) | 81 | Multiple (one class) | 2,640 | Multiple (one class) | 2,458
Checkout | D2S (https://www.mvtec.com/company/research/datasets/mvtec-d2s/) | 60 | Single | 4,380 | Multiple | 16,620
Checkout | RPC (https://rpc-dataset.github.io/) | 200/17 | Single | 53,739 | Multiple | 30,000

Figure 12: GroZi-3.2k: samples of training images (a) and testing images (b).

Figure 13: Freiburg Grocery: samples of training images (a) and testing images (b).
5.1.6. GP181 Dataset. The GP181 dataset [148] is a subset of the GroZi-3.2k dataset, with 183 and 73 images in the training and testing sets, respectively. Each training image includes a single instance of one product category. Images from the test set have been annotated with item-specific bounding boxes. This dataset can be found at http://vision.disi.unibo.it/index.php?option=com_content&view=article&id=111&catid=78.

Here, we present a comparison of the product recognition performance on GroZi-120, GroZi-3.2k, and its subset in Table 6. All the methods of the listed publications are based on deep learning. The performance was measured using recall, precision, and accuracy. Precision measures the percentage of correct predictions over the total number of predictions, while recall measures the percentage of correctly detected products over the total number of labelled products in the image [148]. Their mathematical definitions are precision = TP/(TP + FP), recall = TP/(TP + FN), and accuracy = (TP + TN)/(TP + FN + FP + TN), where TP, TN, FP, and FN refer to true positive, true negative, false positive, and false negative, respectively.
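In code form, these three definitions are direct translations of the formulas above; the following minimal sketch simply restates them for raw confusion-matrix counts (it assumes the denominators are nonzero):

def precision(tp, fp):
    # Fraction of predictions that are correct.
    return tp / (tp + fp)

def recall(tp, fn):
    # Fraction of labelled products that are detected.
    return tp / (tp + fn)

def accuracy(tp, tn, fp, fn):
    # Fraction of all decisions (positive and negative) that are correct.
    return (tp + tn) / (tp + fn + fp + tn)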
Figure 14: Cigarette dataset: samples of training images (a) and testing images (b).

Figure 15: Grocery Store dataset: samples of iconic images (a) and natural images (b).

Table 6: Recognition performance comparison of approaches based on deep learning on benchmark datasets.

Publications | GroZi-120 [89]: Precision (%) / Recall (%) / #product categories | GroZi-3.2k [13]: Precision (%) / Recall (%) / #product categories
[58] | 45.20 / 52.70 / 120 | 73.10 / 73.60 / 20
[20] | — | 73.50 / 82.68 / 181
[148] | — | 90.47 / 90.26 / 181
[71] | — | Accuracy: 85.30 / 181
[74] | 49.05 / 29.37 / 120 | 65.83 / 45.52 / 857 and 92.19 / 87.89 / 181
[21] | 49.80 / — / 120 | 52.16 / — / 27

5.2. Checkout Datasets. As mentioned in the Introduction, the scenario of recognising products for the self-checkout system is also a complex task that needs to be solved, which will benefit both retailers and customers. Since it is an emerging research area, this problem has not been extensively studied. There are two public datasets available for the checkout system.

5.2.1. D2S Dataset. The D2S dataset [80] is the first-ever benchmark to provide pixelwise annotations on the instance level, aiming to cover real-world applications of an automatic checkout, inventory, or warehouse system. It contains a total of 21,000 high-resolution images of groceries and daily products, such as fruits, vegetables, cereal packets, pasta, and bottles, from 60 categories. The images are taken in 700 different scenes under three different lightings and three additional backgrounds. The training set includes 4,380 images captured from different views, and each image involves one product of a single class. There are 3,600 and 13,020 images in the validation and test sets, respectively. Furthermore, 10,000 images in the validation and test sets are artificially synthesized and contain one to fifteen objects randomly picked from the training set. Samples of training images and test images are shown in Figure 16.

In the work of [80], the authors evaluated the performance of several state-of-the-art deep learning-based methods on the D2S dataset, including Mask R-CNN [55], FCIS [93], Faster R-CNN [50], and RetinaNet [159]. The results are summarized in Table 7. The evaluation metric is mean average precision (mAP) [57]. Specifically, mAP50 and mAP75 are calculated at the intersection-over-union (IoU) thresholds 0.50 and 0.75 over all product classes, respectively.
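The IoU threshold underlying mAP50 and mAP75 compares a predicted box with a ground-truth box; a detection only counts as a true positive if the overlap ratio exceeds the threshold. A minimal sketch of the computation, with boxes given as (x1, y1, x2, y2) corner coordinates (the representation is an assumption of the sketch):

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

A prediction with iou(...) >= 0.50 counts as correct under mAP50 but may still fail the stricter 0.75 threshold, which is why mAP75 rewards tighter localisation.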
Figure 16: D2S dataset: samples of training images (a) and testing images (b).

Table 7: Product detection benchmark results on the test set of the D2S dataset.

Approaches | mAP (%) | mAP50 (%) | mAP75 (%)
Mask R-CNN [55] | 78.3 | 89.8 | 84.9
FCIS [93] | 68.3 | 88.5 | 80.9
Faster R-CNN [50] | 78.0 | 90.3 | 84.8
RetinaNet [159] | 80.1 | 89.6 | 84.5
5.2.2. RPC Dataset. The RPC dataset [7] is developed to support research on addressing product recognition in real-world checkout scenarios. It consists of 83,739 images in total, including 53,739 single-product exemplar images for training and 30,000 checkout images for validation and testing. It has a hierarchical structure of 200 fine-grained product categories, which can be coarsely categorized into 17 metaclasses. Each training image is captured in controlled conditions with four cameras from different views. The checkout images are recorded with three clutter levels using a camera mounted on top, annotated with a bounding box and object category for each product. Figure 17 demonstrates some examples of training images and checkout images in the RPC dataset.

In [7], the feature pyramid network (FPN) [123] is adopted as the detector for recognising items on the RPC dataset, and reasonable results have been achieved. In addition, the authors also proposed an essential metric, checkout accuracy (cAcc), for the automatic checkout task. At first, CD_{i,k} is defined as the counting error for a particular category in a checkout image:

CD_{i,k} = |P_{i,k} − GT_{i,k}|,   (1)

where P_{i,k} and GT_{i,k} denote the predicted count and the ground-truth item number of the k-th class in the i-th image, respectively. Then, the error over all K product classes in the i-th image is defined as

CD_i = Σ_{k=1}^{K} CD_{i,k}.   (2)

Given N images from the RPC dataset, cAcc measures the mean accuracy rate of the correct predictions. Its mathematical definition is

cAcc = (Σ_{i=1}^{N} δ(CD_i, 0)) / N,   (3)

where δ(·) = 1 if CD_i = 0; otherwise, it equals 0. The value of cAcc ranges from 0 to 1.

Afterwards, based on the work of Wei et al. [7], the data priming network (DPNet) was developed in [78] to select reliable samples to promote the training process. Consequently, the performance of product recognition has been significantly boosted with DPNet. The comparative results of [7, 78] are listed in Table 8, where mmAP is the mean value over all 10 IoU thresholds (i.e., ranging from 0.50 to 0.95 with a uniform step size of 0.05) of all product classes [7].

Table 8: Comparative results on the RPC dataset.

Publications | cAcc (%) | mAP50 (%) | mmAP (%)
[7] | 56.68 | 96.57 | 73.83
[78] | 80.51 | 97.91 | 77.04
Figure 17: RPC dataset: samples of training images (a) and checkout images (b).
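Equations (1)–(3) translate directly into code. The sketch below is our own illustration of the metric's definition, with predictions and ground truth represented as per-class count dictionaries (that data layout is an assumption, not part of [7]):

def checkout_accuracy(predicted_counts, ground_truth_counts, num_classes):
    """predicted_counts / ground_truth_counts: lists of dicts, one dict per
    checkout image, mapping class id k -> item count (missing keys = 0)."""
    correct = 0
    for pred, gt in zip(predicted_counts, ground_truth_counts):
        # CD_i: summed absolute counting error over all K classes, eqs. (1)-(2).
        cd_i = sum(abs(pred.get(k, 0) - gt.get(k, 0)) for k in range(num_classes))
        # delta(CD_i, 0): an image counts as correct only if every class count matches.
        correct += int(cd_i == 0)
    return correct / len(ground_truth_counts)  # eq. (3)

Because a single miscounted item invalidates the whole image, cAcc is a much harsher measure than mAP, which explains the large gap between the two columns in Table 8.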

6. Research Directions and Conclusion

To the best of our knowledge, this paper is the first comprehensive literature review on deep learning approaches for retail product recognition. Based on the thorough investigation into the research of retail product recognition with deep learning, this section outlines several promising research directions for the future. Finally, we present a conclusion for the whole article.

6.1. Research Directions

6.1.1. Generating Product Images with Deep Neural Networks. In the previous introduction of dataset resources, the largest publicly available dataset only contained 200 product categories. Nevertheless, the number of different items to be recognized in a medium-sized grocery store can be approximately several thousand, far exceeding the category quantity of the existing datasets. Considering that the appearances of existing products frequently change over time, it is impossible to build a man-made dataset that includes the majority of daily products. Some works [7, 71, 78] have demonstrated the advantages of generative adversarial networks (GANs) for generating images that look realistic. Moreover, significant work in [102] has filled the gap between CNNs and GANs by proposing deep convolutional generative adversarial networks (DCGANs) that can create high-quality generated images. In this case, it is feasible to generate images with deep neural networks to enlarge the training dataset for retail product recognition. So, developing image generators with deep neural networks to simulate real-world scenes shall be a future research direction.
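As a flavour of what such a generator looks like, here is a minimal DCGAN-style generator in PyTorch. The layer sizes and the 64 × 64 output resolution are assumptions of this sketch; [102] describes the architectural guidelines (transposed convolutions, batch normalisation, ReLU/Tanh activations), not this exact code.

import torch.nn as nn

class DCGANGenerator(nn.Module):
    """Minimal DCGAN-style generator: latent noise vector -> 64x64 RGB image."""
    def __init__(self, latent_dim=100, feat=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, feat * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(feat * 8), nn.ReLU(True),             # 4x4
            nn.ConvTranspose2d(feat * 8, feat * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feat * 4), nn.ReLU(True),             # 8x8
            nn.ConvTranspose2d(feat * 4, feat * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feat * 2), nn.ReLU(True),             # 16x16
            nn.ConvTranspose2d(feat * 2, feat, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feat), nn.ReLU(True),                 # 32x32
            nn.ConvTranspose2d(feat, 3, 4, 2, 1, bias=False),
            nn.Tanh(),                                           # 64x64 RGB
        )

    def forward(self, z):
        # z: (batch, latent_dim, 1, 1) noise vector sampled from a normal prior
        return self.net(z)

Trained adversarially against a discriminator on real product photos, such a generator could, in principle, synthesize additional training views of packaged goods; whether the synthetic images actually improve a downstream detector remains an open empirical question.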

6.1.2. Graph Neural Networks with Deep Learning for Planogram Compliance Check. Graph neural networks (GNNs) [160] are a powerful tool for non-Euclidean data, which can represent the relationships between objects [161, 162]. Currently, GNNs have achieved great success in recommendation systems [163, 164], molecule identification [165], and paper citation analysis [166]. For an image that contains multiple objects, each object can be considered as a node, and GNNs have the ability to learn the location relationship between every two nodes. With regard to the scenarios in supermarkets, products are generally placed on shelves according to certain arrangement rules. In this case, GNNs can be used with deep learning to learn the position relationships between different products, which then assist in identifying missing or misplaced items for planogram compliance. In [148], the authors attempted to apply GNNs for consistency checks and achieved a remarkable result. Specifically, there are two relationship representations: one is the "observed planogram" generated from GNNs, and the other is the "reference planogram," the true representation. By comparing the observed planogram and the reference planogram, they obtained the result of the consistency check, which helps to correct false detections and missing detections.

6.1.3. Cross-Domain Retail Product Recognition with Transfer Learning. In object detection algorithms, a significant assumption is that the learning and test data are derived from the same feature space and the same distribution [167]; i.e., most object detectors require retraining with new data from random initialization when the distribution changes. In the real world, many different retail stores and supermarkets are selling diversified products. Moreover, the internal environment between different shops can vary. One model trained with data from a specific shop cannot simply be applied to a newly built store, which gives rise to the concept of cross-domain recognition. Cross-domain recognition is usually based on transfer learning [168], which assists the target domain in learning by using knowledge transferred from other domains. Transfer learning is capable of solving new problems easily by applying knowledge obtained previously. For a new task, researchers normally use a pretrained detector either as an initialization or as a fixed feature extractor and then fine-tune the weights of some layers in the network to realize cross-domain detection. Ordinarily, the majority of approaches in established papers employ models pretrained on ImageNet to implement product recognition [20, 74]. However, how to make a model adaptable to various shops still needs attention.

6.1.4. Joint Feature Learning from Text Information on Product Packaging. Intraclass product classification is a challenge since such products are visually similar. Sometimes, we human beings recognize similar products by reading the text on packaging when we are facing a lot of intraclass items. Thus, the text information on product packaging can be considered as a factor for classifying fine-grained products. Currently, joint feature learning (JFL) methods have shown their effectiveness in improving face recognition performance by stacking features extracted from different face regions [169]. For this reason, it is possible for the idea of JFL to be introduced to the field of retail product recognition, i.e., learning the product image features and package text features jointly to enhance the recognition performance. In [143], researchers tried to automatically recognize the text on each product packaging; unfortunately, the extracted text information in that paper is just used to search for products for users.
6.1.5. Incremental Learning with the CNN for Flexible Product Recognition. Deep learning methods always suffer from "catastrophic forgetting," especially convolutional neural networks; i.e., they are incapable of recognising some previously learned objects when adjusted to a new task [65]. Incremental learning is a powerful method that can deal with new data without retraining the whole model. Additionally, it enables deep neural networks to have a long-term memory. Shmelkov et al. and Guan et al. [65, 170] implemented incremental learning of object detection by proposing two detection networks: one is an existing network that has already been trained, and the other is trained for detecting new classes. In [171], the authors attempted to combine incremental learning with CNNs and compared various incremental teaching approaches for CNN-based architectures. Therefore, incremental learning will be helpful to make the recognition system flexible with no or minimal retraining whenever a fresh item is launched.

6.1.6. The Regression-Based Object Detection Methods for Retail Product Recognition. If we want to apply product recognition in industry, it requires real-time availability: consumers would like to check out immediately, and retailers shall receive real-time feedback when something is missing from the shelves. As we all know, deep learning is computationally expensive, and a large number of deep learning algorithms need GPUs for image processing. As mentioned in Section 2, there are two categories of object detection methods: region proposal-based and regression-based [47]. The regression-based methods can reduce the time expense by regressing the objects' locations and categories directly from image pixels [54]. Ordinarily, the regression-based methods perform better for real-time detection tasks than the methods based on region proposals. However, although the work in [51] achieves detection of general objects at a high rate of speed, it suffers from accuracy reduction. Therefore, how to improve the detection accuracy with the regression-based approach for retail product recognition is worth more research.

6.2. Conclusion. This paper addresses the broad area of product recognition technologies. Product recognition will become increasingly important in a world where cost margins are becoming increasingly tight and customers face increasing pressures on their available time. By summarising the literature in the field, we make research in this area more accessible to new researchers, allowing the field to progress. It is very important that this field addresses these four challenging problems: (1) large-scale classification; (2) data limitations; (3) intraclass variation; and (4) flexibility. We have identified several areas for further research: (1) generating data with deep neural networks; (2) graph neural networks with deep learning; (3) cross-domain recognition with transfer learning; (4) joint feature learning from text information on packaging; (5) incremental learning with the CNN; and (6) the regression-based object detection methods for retail product recognition.

In this article, we have presented an extensive review of recent research on deep learning-based retail product recognition, with more than one hundred references. We propose four challenging problems and provide corresponding techniques for addressing those challenges. We have also briefly described the publicly available datasets and listed their detailed information, respectively.

Overall, this paper provides a clear overview of the current research status in this field, and we hope that it encourages new researchers to join this field and complete extensive research in this area.

Conflicts of Interest

The authors declare no conflicts of interest.

Authors' Contributions

Y. W., S. T., and S. X. contributed to conceptualization. Y. W. contributed to writing and original draft preparation. S. T., S. X., B. K., and M. S. contributed to writing, reviewing, and editing. S. X. and B. K. supervised the study. B. K. was responsible for funding acquisition. All authors read and agreed to the published version of the manuscript.

Acknowledgments

The first author Y. W. was sponsored by the China Scholarship Council (CSC).

References

[1] T. Sriram, K. V. Rao, S. Biswas, and B. Ahmed, "Applications of barcode technology in automated storage and retrieval systems," in Proceedings of the 1996 22nd International Conference on Industrial Electronics, Control, and Instrumentation, vol. 1, pp. 641–646, Taipei, Taiwan, 1996.
[2] H. Poll, "Digimarc survey: 88 percent of U.S. adults want their retail checkout experience to be faster," 2015, https://www.digimarc.com/about/news-events/press-releases/2015/07/21/digimarc-survey-88-percent-of-u.s.-adults-want-their-retail-checkout-experience-to-be-faster.
[3] R. Want, "An introduction to RFID technology," IEEE Pervasive Computing, vol. 5, no. 1, pp. 25–33, 2006.
[4] B. Santra and D. P. Mukherjee, "A comprehensive survey on computer vision based approaches for automatic identification of products in retail store," Image and Vision Computing, vol. 86, pp. 45–63, 2019.
[5] D. Grewal, A. L. Roggeveen, and J. Nordfält, "The future of retailing," Journal of Retailing, vol. 93, no. 1, pp. 1–6, 2017.
[6] Hampshire, "AI spending by retailers to reach $12 billion by 2023, driven by the promise of improved margins," April 2019, https://www.juniperresearch.com/press/press-releases/ai-spending-by-retailers-reach-12-billion-2023.
[7] X. S. Wei, Q. Cui, L. Yang, P. Wang, and L. Liu, "RPC: a large-scale retail product checkout dataset," 2019, https://arxiv.org/abs/1901.07249.
[8] M. Shapiro, Executing the Best Planogram, Vol. 1, Professional Candy Buyer, Norwalk, CT, USA, 2009.
[9] F. D. Orel and A. Kara, "Supermarket self-checkout service quality, customer satisfaction, and loyalty: empirical evidence from an emerging market," Journal of Retailing and Consumer Services, vol. 21, pp. 118–129, 2014.
[10] B. F. Wu, W. J. Tseng, Y. S. Chen, S. J. Yao, and P. J. Chang, "An intelligent self-checkout system for smart retail," in Proceedings of the 2016 International Conference on System Science and Engineering (ICSSE), pp. 1–4, Puli, Taiwan, 2016.
[11] A. C. R. Van Riel, J. Semeijn, D. Ribbink, and Y. Bomert-Peters, "Waiting for service at the checkout: negative emotional responses, store image and overall satisfaction," Journal of Service Management, vol. 23, no. 2, pp. 144–169, 2012.
[12] F. Morimura and K. Nishioka, "Waiting in exit-stage operations: expectation for self-checkout systems and overall satisfaction," Journal of Marketing Channels, vol. 23, no. 4, pp. 241–254, 2016.
[13] M. George and C. Floerkemeier, "Recognizing products: a per-exemplar multi-label image classification approach," in Proceedings of the 2014 European Conference on Computer Vision, pp. 440–455, Zurich, Switzerland, 2014.
[14] D. López-de-Ipiña, T. Lorido, and U. López, "Indoor navigation and product recognition for blind people assisted shopping," in Proceedings of the 2011 International Workshop on Ambient Assisted Living, pp. 33–40, Torremolinos, Spain, 2011.
[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, pp. 1097–1105, Springer, Berlin, Germany, 2012.
[16] C. Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, "DSSD: deconvolutional single shot detector," 2017, https://arxiv.org/abs/1701.06659.
[17] J. Redmon and A. Farhadi, "Yolov3: an incremental improvement," 2018, https://arxiv.org/abs/1804.02767.
[18] R. Girshick, "Fast R-CNN," in Proceedings of the 2015 IEEE International Conference on Computer Vision, pp. 1440–1448, Santiago, Chile, 2015.
[19] Q. Zhao, T. Sheng, Y. Wang et al., "M2Det: a single-shot object detector based on multi-level feature pyramid network," 2018, https://arxiv.org/abs/1811.04533.
[20] A. Tonioni, E. Serro, and L. Di Stefano, "A deep learning pipeline for product recognition in store shelves," 2018, https://arxiv.org/abs/1810.01733.
[21] L. Karlinsky, J. Shtok, Y. Tzur, and A. Tzadok, "Fine-grained recognition of thousands of object categories with single-example training," in Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, pp. 4113–4122, Honolulu, HI, USA, 2017.
[22] S. Qiao, W. Shen, W. Qiu, C. Liu, and A. Yuille, "Scalenet: guiding object proposal generation in supermarkets and beyond," in Proceedings of the 2017 IEEE International Conference on Computer Vision, pp. 1791–1800, Venice, Italy, 2017.
[23] C. G. Melek, E. B. Sonmez, and S. Albayrak, "A survey of product recognition in shelf images," in Proceedings of the 2017 International Conference on Computer Science and Engineering (UBMK), pp. 145–150, Antalya, Turkey, 2017.
[24] D. G. Lowe, "Object recognition from local scale-invariant features," in Proceedings of the 7th International Conference on Computer Vision, Kerkyra, Greece, 1999.
[25] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[26] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: speeded up robust features," in Proceedings of the 2006 European Conference on Computer Vision, pp. 404–417, Graz, Austria, 2006.
[27] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "Speeded-up robust features (SURF)," Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346–359, 2008.
[28] R. Moorthy, S. Behera, S. Verma, S. Bhargave, and P. Ramanathan, "Applying image processing for detecting on-shelf availability and product positioning in retail stores," in Proceedings of the 3rd International Symposium on Women in Computing and Informatics, Kochi, India, 2015.
[29] S. Zhang, L. Yao, A. Sun, and Y. Tay, "Deep learning based recommender system: a survey and new perspectives," ACM Computing Surveys (CSUR), vol. 52, p. 5, 2019.
[30] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[31] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[32] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, "Greedy layer-wise training of deep networks," Advances in Neural Information Processing Systems, pp. 153–160, MIT Press, Cambridge, MA, USA, 2007.
[33] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, 2010.
[34] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," 2012, https://arxiv.org/abs/1207.0580.
[35] Y. Jia, E. Shelhamer, J. Donahue et al., "CAFFE: convolutional architecture for fast feature embedding," in Proceedings of the 22nd ACM International Conference on Multimedia, Glasgow, UK, 2014.
[36] M. Abadi, A. Agarwal, P. Barham et al., "TensorFlow: large-scale machine learning on heterogeneous distributed systems," 2016, https://arxiv.org/abs/1603.04467.
[37] T. Chen, M. Li, Y. Li et al., "MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems," 2015, https://arxiv.org/abs/1512.01274.
[38] A. Paszke, S. Gross, S. Chintala, and G. Chanan, "Pytorch: tensors and dynamic neural networks in python with strong GPU acceleration," 2017, https://pytorch.org/.
[39] D. H. Hubel and T. N. Wiesel, "Receptive fields, binocular interaction and functional architecture in the cat's visual cortex," The Journal of Physiology, vol. 160, no. 1, pp. 106–154, 1962.
[40] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, and others, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[41] C. Szegedy, W. Liu, Y. Jia et al., "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 2015.
[42] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014, https://arxiv.org/abs/1409.1556.
[43] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016.
[44] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller, "Multi-view convolutional neural networks for 3D shape recognition," in Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Chile, 2015.
[45] Z. Gao, Y. Li, and S. Wan, "Exploring deep learning for view-based 3D model retrieval," ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 16, no. 1, pp. 1–21, 2020.
[46] P. Viola and M. J. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.
[47] Z. Q. Zhao, P. Zheng, S. T. Xu, and X. Wu, "Object detection with deep learning: a review," IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 11, pp. 3212–3232, 2019.
[48] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 2014.
[49] J. R. R. Uijlings, K. E. A. Van De Sande, T. Gevers, and A. W. M. Smeulders, "Selective search for object recognition," International Journal of Computer Vision, vol. 104, no. 2, pp. 154–171, 2013.
[50] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: towards real-time object detection with region proposal networks," Advances in Neural Information Processing Systems, pp. 91–99, MIT Press, Cambridge, MA, USA, 2015.
[51] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: unified, real-time object detection," in Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016.
[52] W. Liu, D. Anguelov, D. Erhan et al., "SSD: single shot multibox detector," in Proceedings of the 2016 European Conference on Computer Vision, pp. 21–37, Amsterdam, Netherlands, 2016.
[53] E. Goldman and J. Goldberger, "Large-scale classification of structured objects using a CRF with deep class embedding," 2017, https://arxiv.org/pdf/1705.07420.
[54] J. Redmon and A. Farhadi, "YOLO9000: better, faster, stronger," in Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017.
[55] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Italy, 2017.
[56] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The pascal visual object classes (VOC) challenge," International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
[57] T.-Y. Lin, M. Maire, S. Belongie et al., "Microsoft coco: common objects in context," in Proceedings of the 2014 European Conference on Computer Vision, pp. 740–755, Zurich, Switzerland, 2014.
[58] A. Franco, D. Maltoni, and S. Papi, "Grocery product detection and recognition," Expert Systems with Applications, vol. 81, pp. 163–176, 2017.
[59] B. Zhao, J. Feng, X. Wu, and S. Yan, "A survey on deep learning-based fine-grained object classification and semantic segmentation," International Journal of Automation and Computing, vol. 14, no. 2, pp. 119–135, 2017.
[60] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, "The caltech-UCSD birds-200-2011 dataset," Technical report CNS-TR-2011-001, California Institute of Technology, Pasadena, CA, USA, 2011.
[61] A. Khosla, N. Jayadevaprakash, B. Yao, and F. F. Li, "Novel dataset for fine-grained image categorization: Stanford dogs," in Proceedings of the 2011 CVPR Workshop on Fine-Grained Visual Categorization (FGVC), Colorado Springs, CO, USA, June 2011.
[62] M. E. Nilsback and A. Zisserman, "Automated flower classification over a large number of classes," in Proceedings of the 2008 6th Indian Conference on Computer Vision, Graphics & Image Processing, IEEE, Bhubaneswar, India, 2008.
[63] J. Krause, M. Stark, J. Deng, and L. Fei-Fei, "3D object representations for fine-grained categorization," in Proceedings of the 2013 IEEE International Conference on Computer Vision Workshops, Sydney, Australia, 2013.
[64] I. Baz, E. Yoruk, and M. Cetin, "Context-aware hybrid classification system for fine-grained retail product recognition," in Proceedings of the 2016 IEEE 12th Image, Video, and Multidimensional Signal Processing Workshop (IVMSP), Bordeaux, France, 2016.
[65] K. Shmelkov, C. Schmid, and K. Alahari, "Incremental learning of object detectors without catastrophic forgetting," in Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Italy, 2017.
[66] L. Zheng, Y. Yang, and Q. Tian, "SIFT meets CNN: a decade survey of instance retrieval," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, pp. 1224–1244, 2017.
[67] D. Farren, Classifying Food Items by Image Using Convolutional Neural Networks, Stanford University, Stanford, CA, USA, 2017.
[68] L. Liu, B. Zhou, Z. Zou, S. C. Yeh, and L. Zheng, "A smart unstaffed retail shop based on artificial intelligence and IoT," in Proceedings of the 2018 IEEE 23rd International Workshop on Computer Aided Modeling and Design of Communication Links and Networks (CAMAD), Barcelona, Spain, 2018.
[69] L. Li, T.-T. Goh, and D. Jin, "How textual quality of online reviews affect classification performance: a case of deep learning sentiment analysis," Neural Computing and Applications, vol. 32, no. 9, pp. 4387–4415, 2020.
[70] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, "Inception-v4, inception-resnet and the impact of residual connections on learning," in Proceedings of the 31st AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 2017.
[71] A. Tonioni and L. Di Stefano, "Domain invariant hierarchical embedding for grocery products recognition," Computer Vision and Image Understanding, vol. 182, 2019.
[72] T. Chong, I. Bustan, and M. Wee, Deep Learning Approach to Planogram Compliance in Retail Stores, Stanford University, Stanford, CA, USA, 2016.
[73] J. Li, X. Wang, and H. Su, "Supermarket commodity identification using convolutional neural networks," in Proceedings of the 2016 2nd International Conference on Cloud Computing and Internet of Things (CCIOT), Dalian, China, 2016.
[74] W. Geng, F. Han, J. Lin et al., "Fine-grained grocery product recognition by one-shot learning," in Proceedings of the 2018 ACM Multimedia Conference, Seoul, Republic of Korea, 2018.
[75] A. De Biasio, "Retail shelf analytics through image processing and deep learning," Master thesis, University of Padua, Padua, Italy, 2019.
[76] S. Varadarajan and M. M. Srivastava, "Weakly supervised object localization on grocery shelves using simple FCN and synthetic dataset," 2018, https://arxiv.org/abs/1803.06813.
[77] P. Jund, N. Abdo, A. Eitel, and W. Burgard, "The freiburg groceries dataset," 2016, https://arxiv.org/abs/1611.05799.
[78] C. Li, D. Du, L. Zhang et al., "Data priming network for automatic check-out," 2019, https://arxiv.org/abs/1904.04978.
[79] W. Yi, Y. Sun, T. Ding, and S. He, "Detecting retail products in situ using CNN without human effort labeling," 2019, https://arxiv.org/abs/1904.09781.
[80] P. Follmann, T. Bottger, P. Hartinger, R. Konig, and M. Ulrich, "MVTec D2S: densely segmented supermarket dataset," in Proceedings of the 2018 European Conference on Computer Vision (ECCV), Munich, Germany, 2018.
[81] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016.
[82] J. Lafferty, A. McCallum, and F. C. Pereira, Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, University of Pennsylvania, Philadelphia, PA, USA, 2001.
[83] G. Tolias, R. Sicre, and H. Jégou, "Particular object retrieval with integral max-pooling of CNN activations," 2015, https://arxiv.org/abs/1511.05879.
[84] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, "Return of the devil in the details: delving deep into convolutional nets," 2014, https://arxiv.org/abs/1405.3531.
[85] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei, "Imagenet: a large-scale hierarchical image database," in Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Miami, FL, USA, 2009.
[86] J. Redmon, "Darknet: open source neural networks in C," 2013, http://pjreddie.com/darknet/.
[87] L. Fei-Fei, R. Fergus, and P. Perona, "One-shot learning of object categories," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, pp. 594–611, 2006.
[88] D. C. Cireşan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber, "High-performance neural networks for visual object classification," 2011, https://arxiv.org/abs/1102.0183.
[89] M. Merler, C. Galleguillos, and S. Belongie, "Recognizing groceries in situ using in vitro training data," in Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 2007.
[90] P. Y. Simard, D. Steinkraus, J. C. Platt, and others, "Best practices for convolutional neural networks applied to visual document analysis," in Proceedings of the 7th International Conference on Document Analysis and Recognition, Edinburgh, UK, 2003.
[91] D. Cireşan, U. Meier, and J. Schmidhuber, "Multi-column deep neural networks for image classification," 2012, https://arxiv.org/abs/1202.2745.
[92] W. Qiu and A. Yuille, "UnrealCV: connecting computer vision to unreal engine," in Proceedings of the European Conference on Computer Vision, Springer, Amsterdam, Netherlands, 2016.
[93] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei, "Fully convolutional instance-aware semantic segmentation," in Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 2017.
[94] A. Buslaev, A. Parinov, E. Khvedchenya, V. I. Iglovikov, and A. A. Kalinin, "Albumentations: fast and flexible image augmentations," 2018, https://arxiv.org/abs/1809.06839.
[95] S. Varadarajan, S. Kant, and M. M. Srivastava, "Benchmark for generic product detection: a strong baseline for dense object detection," 2019, https://arxiv.org/abs/1912.09476.
[96] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," 2013, https://arxiv.org/abs/1312.6114.
[97] I. Goodfellow, J. Pouget-Abadie, M. Mirza et al., "Generative adversarial nets," Advances in Neural Information Processing Systems, pp. 2672–2680, MIT Press, Cambridge, MA, USA, 2014.
[98] H. Huang, P. S. Yu, and C. Wang, "An introduction to image synthesis with generative adversarial nets," 2018, https://arxiv.org/abs/1803.04469.
[99] K. Sohn, H. Lee, and X. Yan, "Learning structured output representation using deep conditional generative models," Advances in Neural Information Processing Systems, pp. 3483–3491, MIT Press, Cambridge, MA, USA, 2015.
[100] M. Mirza and S. Osindero, "Conditional generative adversarial nets," 2014, https://arxiv.org/abs/1411.1784.
[101] X. Yan, J. Yang, K. Sohn, and H. Lee, "Attribute2Image: conditional image generation from visual attributes," in Proceedings of the 2016 European Conference on Computer Vision, pp. 776–791, Amsterdam, Netherlands, 2016.
[102] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," 2015, https://arxiv.org/abs/1511.06434.
[103] L. Cai, H. Gao, and S. Ji, "Multi-stage variational auto-encoders for coarse-to-fine image generation," in Proceedings of the 2019 SIAM International Conference on Data Mining, Calgary, Canada, 2019.
[104] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, "Infogan: interpretable representation learning by information maximizing generative adversarial nets," Advances in Neural Information Processing Systems, pp. 2172–2180, MIT Press, Cambridge, MA, USA, 2016.
[105] P. Isola, J. Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017.
[106] J. Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Italy, 2017.
[107] Z. Yi, H. Zhang, P. Tan, and M. Gong, "Dualgan: unsupervised dual learning for image-to-image translation," in Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Italy, 2017.
[108] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim, "Learning to discover cross-domain relations with generative adversarial networks," in Proceedings of the 34th International Conference on Machine Learning, vol. 70, Sydney, Australia, 2017.
[109] Y. Choi, M. Choi, M. Kim, J. W. Ha, S. Kim, and J. Choo, "Stargan: unified generative adversarial networks for multi-domain image-to-image translation," in Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018.
[110] M. Y. Liu, T. Breuel, and J. Kautz, "Unsupervised image-to-image translation networks," Advances in Neural Information Processing Systems, pp. 700–708, MIT Press, Cambridge, MA, USA, 2017.
[111] G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller, "Labeled faces in the wild: a database for studying face recognition in unconstrained environments," in Proceedings of the 2008 Workshop on Faces in "Real-Life" Images: Detection, Alignment, and Recognition, Marseille, France, 2008.
[112] A. Mishra, S. Krishna Reddy, A. Mittal, and H. A. Murthy, "A generative model for zero shot learning using conditional variational autoencoders," in Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 2018.
[113] C. H. Lampert, H. Nickisch, and S. Harmeling, "Attribute-based classification for zero-shot visual object categorization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, pp. 453–465, 2013.
[114] G. Patterson and J. Hays, "Sun attribute database: discovering, annotating, and recognizing scene attributes," in Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Providence, RI, USA, 2012.
[115] S. W. Huang, C. T. Lin, S. P. Chen, Y. Y. Wu, P. H. Hsu, and S. H. Lai, "AugGAN: cross domain adaptation with GAN-based data augmentation," in Proceedings of the 2018 European Conference on Computer Vision (ECCV), Munich, Germany, 2018.
[116] Z. Qiu, Y. Pan, T. Yao, and T. Mei, "Deep semantic hashing with generative adversarial networks," in Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, Tokyo, Japan, 2017.
[117] A. Krizhevsky, V. Nair, and G. Hinton, "The CIFAR-10 dataset," 2014, http://www.cs.toronto.edu/kriz/cifar.html.
[118] T. S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. T. Zheng, "NUS-WIDE: a real-world web image database from national university of Singapore," in Proceedings of the 2009 ACM Conference on Image and Video Retrieval (CIVR'09), New York, NY, USA, 2009.
[119] A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, and others, "Conditional image generation with pixelcnn decoders," Advances in Neural Information Processing Systems, pp. 4790–4798, MIT Press, Cambridge, MA, USA, 2016.
[120] Z. Zheng, L. Zheng, and Y. Yang, "Unlabeled samples generated by gan improve the person re-identification baseline in vitro," in Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Italy, 2017.
[121] X. Liu, J. Wang, S. Wen, E. Ding, and Y. Lin, "Localizing by describing: attribute-guided attention localization for fine-grained recognition," in Proceedings of the 31st AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 2017.
[122] X. Wang, Z. Man, M. You, and C. Shen, "Adversarial generation of training examples: applications to moving vehicle license plate recognition," 2017, https://arxiv.org/abs/1707.03124.
[123] T. Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 2017.
[124] B. Yao, A. Khosla, and L. Fei-Fei, "Combining randomization and discrimination for fine-grained image categorization," in Proceedings of CVPR 2011, IEEE, Providence, RI, USA, 2011.
[125] J. Deng, J. Krause, and L. Fei-Fei, "Fine-grained crowdsourcing for fine-grained recognition," in Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 2013.
[126] T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, and Z. Zhang, "The application of two-level attention models in deep convolutional neural network for fine-grained image classification," in Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 2015.
[127] N. Zhang, J. Donahue, R. Girshick, and T. Darrell, "Part-based R-CNNs for fine-grained category detection," in Proceedings of the 2014 European Conference on Computer Vision, Springer, Zurich, Switzerland, 2014.
[128] S. Kong and C. Fowlkes, "Low-rank bilinear pooling for fine-grained classification," in Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017.
[129] B. Zhao, X. Wu, J. Feng, Q. Peng, and S. Yan, "Diversified visual attention networks for fine-grained object classification," IEEE Transactions on Multimedia, vol. 19, no. 6, pp. 1245–1256, 2017.
[130] J. Krause, H. Jin, J. Yang, and L. Fei-Fei, "Fine-grained recognition without part annotations," in Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 2015.
[131] S. Reed, Z. Akata, H. Lee, and B. Schiele, "Learning deep representations of fine-grained visual descriptions," in Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016.
[132] A. Angelova, S. Zhu, and Y. Lin, "Image segmentation for large-scale subcategory flower recognition," in Proceedings of the 2013 IEEE Workshop on Applications of Computer Vision (WACV), IEEE, Tampa, FL, USA, 2013.
[133] Z. Ge, C. McCool, C. Sanderson, and P. Corke, Content Specific Feature Learning for Fine-Grained Plant Classification, Queensland University of Technology, Brisbane, Australia, 2015.
[134] L. Yang, P. Luo, C. Loy, and X. Tang, "A large-scale car dataset for fine-grained categorization and verification," in Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 2015.
[135] S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi, "Fine-grained visual classification of aircraft," 2013, https://arxiv.org/abs/1306.5151.
[136] J. Krause, T. Gebru, J. Deng, L. J. Li, and L. Fei-Fei, "Learning features and parts for fine-grained recognition," in Proceedings of the 2014 22nd International Conference on Pattern Recognition, IEEE, Stockholm, Sweden, 2014.
[137] S. Branson, G. Van Horn, S. Belongie, and P. Perona, "Bird species categorization using pose normalized deep convolutional nets," 2014, https://arxiv.org/abs/1406.2952.
[138] V. Vapnik, The Nature of Statistical Learning Theory, Springer Science & Business Media, Berlin, Germany, 2013.
[139] B. Hu, N. Zhou, Q. Zhou, X. Wang, and W. Liu, "DiffNet: a learning to compare deep network for product recognition," IEEE Access, vol. 8, pp. 19336–19344, 2020.
[140] M. Simon and E. Rodner, "Neural activation constellations: unsupervised part model discovery with convolutional networks," in Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Chile, 2015.
[141] T. Y. Lin, A. RoyChowdhury, and S. Maji, "Bilinear CNN models for fine-grained visual recognition," in Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Chile, 2015.
Computational Intelligence and Neuroscience 23

[142] S. Singh, A. Gupta, and A. A. Efros, “Unsupervised discovery Conference on Graphic and Image Processing (ICGIP 2014),
of mid-level discriminative patches,” in Proceedings of the Beijing, China, 2015.
2012 European Conference on Computer Vision, Springer, [158] M. Klasson, C. Zhang, and H. Kjellström, “A hierarchical
Firenze, Italy, 2012. grocery store image dataset with visual and semantic labels,”
[143] M. George, D. Mircic, G. Soros, C. Floerkemeier, and in Proceedings of the 2019 IEEE Winter Conference on Ap-
F. Mattern, “Fine-grained product class recognition for assisted plications of Computer Vision (WACV), IEEE, Waikoloa
shopping,” in Proceedings of the 2015 IEEE International Village, HI, USA, 2019.
Conference on Computer Vision Workshops, Santiago, Chile, [159] T. Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, “Focal
2015. loss for dense object detection,” in Proceedings of the 2017
[144] Y. Wang, R. Song, X. S. Wei, and L. Zhang, “An adversarial IEEE International Conference on Computer Vision (ICCV),
domain adaptation network for cross-domain fine-grained Venice, Italy, 2017.
recognition,” in Proceedings of the 2020 IEEE Winter Con- [160] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and
ference on Applications of Computer Vision, Aspen, CO, USA, G. Monfardini, “The graph neural network model,” IEEE
2020. Transactions on Neural Networks, vol. 20, pp. 61–80, 2008.
[145] R. Mottaghi, X. Chen, X. Liu et al., “The role of context for [161] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu, “A
object detection and semantic segmentation in the wild,” in comprehensive survey on graph neural networks,” 2019,
Proceedings of the 2014 IEEE Conference on Computer Vision https://arxiv.org/abs/1901.00596.
and Pattern Recognition, Columbus, OH, USA, 2014. [162] P. W. Battaglia, J. B. Hamrick, V. Bapst et al., “Relational
[146] S. K. Divvala, D. Hoiem, J. H. Hays, A. A. Efros, and inductive biases, deep learning, and graph networks,” 2018,
M. Hebert, “An empirical study of context in object de- https://arxiv.org/abs/1806.01261.
tection,” in Proceedings of the 2009 IEEE Conference on [163] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton,
Computer Vision and Pattern Recognition, IEEE, Miami, FL, and J. Leskovec, “Graph convolutional neural networks for
USA, 2009. web-scale recommender systems,” in Proceedings of the 24th
[147] A. Torralba, “Contextual priming for object detection,” In- ACM SIGKDD International Conference on Knowledge
ternational Journal of Computer Vision, vol. 53, no. 2, Discovery & Data Mining, ACM, London, UK, 2018.
pp. 169–191, 2003. [164] W. Fan, Y. Ma, Q. Li et al., “Graph neural networks for social
[148] A. Tonioni and L. Di Stefano, “Product recognition in store recommendation,” in Proceedings of the 2019 World Wide
shelves as a sub-graph isomorphism problem,” in Proceed- Web Conference, ACM, San Francisco, CA, USA, 2019.
ings of the 2017 International Conference on Image Analysis [165] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and
and Processing, Springer, Catania, Italy, 2017. G. E. Dahl, “Neural message passing for quantum chemis-
[149] E. P. Xing, M. I. Jordan, S. J. Russell, and A. Y. Ng, “Distance try,” in Proceedings of the 34th International Conference on
metric learning with application to clustering with side-in- Machine Learning, vol. 70, Sydney, Australia, 2017.
formation,” Advances in Neural Information Processing [166] T. N. Kipf and M. Welling, “Semi-supervised classification
Systems, pp. 521–528, MIT Press, Cambridge, MA, USA, with graph convolutional networks,” 2016, https://arxiv.org/
2003. abs/1609.02907.
[150] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, and others, [167] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE
“Matching networks for one shot learning,” Advances in Transactions on Knowledge and Data Engineering, vol. 22,
Neural Information Processing Systems, pp. 3630–3638, MIT pp. 1345–1359, 2009.
Press, Cambridge, MA, USA, 2016. [168] L. Torrey and J. Shavlik, “Transfer learning,” in Handbook of
[151] R. Keshari, M. Vatsa, R. Singh, and A. Noore, “Learning Research on Machine Learning Applications and Trends:
structure and strength of CNN filters for small sample size Algorithms, Methods, and Techniques, pp. 242–264, IGI
training,” in Proceedings of the 2018 IEEE Conference on Global, Philadelphia, PA, USA, 2010.
Computer Vision and Pattern Recognition, Salt Lake City, [169] J. Lu, V. E. Liong, G. Wang, and P. Moulin, “Joint feature
UT, USA, 2018. learning for face recognition,” IEEE Transactions on Infor-
[152] S. Bak and P. Carr, “One-shot metric learning for person re- mation Forensics and Security, vol. 10, no. 7, pp. 1371–1383,
identification,” in Proceedings of the 2017 IEEE Conference on 2015.
[170] L. Guan, Y. Wu, J. Zhao, and C. Ye, “Learn to detect objects
Computer Vision and Pattern Recognition, Honolulu, HI,
incrementally,” in Proceedings of the 2018 IEEE Intelligent
USA, 2017.
Vehicles Symposium (IV), Changshu, China, 2018.
[153] G. Koch, R. Zemel, and R. Salakhutdinov, “Siamese neural
[171] V. Lomonaco and D. Maltoni, “Comparing incremental
networks for one-shot image recognition,” in Proceedings of
learning strategies for convolutional neural networks,” in
the 2015 ICML Deep Learning Workshop, Lille, France, 2015.
Proceedings of the 2016 IAPR Workshop on Artificial Neural
[154] S. Caelles, K. K. Maninis, J. Pont-Tuset, L. Leal-Taixé,
Networks in Pattern Recognition, Springer, Ulm, Germany,
D. Cremers, and L. Van Gool, “One-shot video object seg-
2016.
mentation,” in Proceedings of the 2017 IEEE Conference on
Computer Vision and Pattern Recognition, Honolulu, HI,
USA, 2017.
[155] E. Schwartz, L. Karlinsky, J. Shtok et al., “RepMet: repre-
sentative-based metric learning for classification and one-
shot object detection,” 2018, https://arxiv.org/abs/1806.
04728.
[156] T. Wiedemeyer, “IAI Kinect2,” 2015, https://github.com/
code-iai/iai_kinect2.
[157] G. Varol and R. S. Kuzu, “Toward retail product recognition
on grocery shelves,” in Proceedings of the 6th International

You might also like