
Amazon Inventory Reconciliation Using AI

1st Pablo Rodriguez Bertorello, Computer Science, Stanford University, Palo Alto, California, [email protected]
2nd Sravan Sripada, Computer Science, Stanford University, Palo Alto, California, [email protected]
3rd Nutchapol Dendumrongsup, Computational & Mathematical Engineering, Stanford University, Palo Alto, California, [email protected]

Abstract—Inventory management is critical to Amazon's success. Thus, the need arises to apply artificial intelligence to assure the correctness of deliveries. The paper evaluates Convolutional Neural Networks for inventory reconciliation. The network detects the number of items being carried, relying only on a photo of the bin. Convolutional performance is evaluated against Support Vector Machines, both non-linear and linear. Our Convolutional method performed over 3x better than random, and over 75% better than our best Support Vector classifier.

Index Terms—Convolutional Neural Network, Deep Learning, Support Vector Machine, Radial Basis Function, Amazon Bin Image Dataset

I. INTRODUCTION

Amazon Fulfillment Centers are bustling hubs of innovation that allow Amazon to deliver millions of products to over 100 countries worldwide. These products are randomly placed in bins, which are carried by robots.

Occasionally, items are misplaced while being handled, resulting in a mismatch: the recorded bin inventory versus its actual content. The paper describes methods to predict the number of items in a bin, thus detecting any inventory variance. By correcting variance upon detection, Amazon will better serve its customers.

Specifically, the input to our model is a raw color photo of the products in a bin. To find the best solution to the inventory mismatch problem, Amazon published the Bin Image Dataset, which is detailed in Section II.

The output of a model is the bin's predicted quantity, the number of products in the image.

While we started with linear methods, the quest for model performance led us to non-linear algorithms, and ultimately to convolutional deep learning. Section III summarizes each algorithm applied. They include:

• Logistic Regression and Classification Trees, summarized in Section III-A and Section III-B
• Support Vector Machines: linear kernel, polynomial kernel, radial kernel. The algorithms are summarized in Section III-C
• Convolutional Neural Networks: ResNet, with a cross-entropy loss function and learning rate optimizations. The algorithm is summarized in Section III-D

Section IV summarizes performance results. With Convolutional Neural Networks we were able to achieve an overall accuracy exceeding 56%. This is over 60% better than with Support Vector Machines. For the latter, we attained accuracy of over 30%. We operated on a reduced dataset, for bins that contained up to 5 products, for a random baseline probability of 16.6%.

We are among the first to publish results on Amazon's Bin Image Dataset. Prior art by Eunbyung Park of the University of North Carolina at Chapel Hill [2] applied Deep Convolutional Classification (ResNet-34). It achieved 55.67% accuracy.

II. DATASET AND FEATURES

A. Input Data Set

The data set contains 535,234 images, which contain 459,476 different product SKUs, of different shapes and sizes.

Fig. 1: Sample Image

Each image/metadata tuple corresponds to a bin with products. The metadata includes the actual count of objects in the bin, which is used as the label to train our model.

We worked with a subset of 150K images, each containing up to five products. We split this as follows: 70% training, 20% validation, and 10% test.

B. Data Engineering

Before model training, images were normalized:
1) Re-sized to 224x224 pixels
2) Transformed for zero mean and unit variance
3) For convolutional models, the dataset was augmented with horizontal flips of every image
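The three steps above map onto a short torchvision pipeline. The sketch below is an illustration rather than the authors' exact code; the mean and standard deviation values are placeholders for statistics computed on the training split.

```python
# A minimal sketch of the normalization pipeline, assuming torchvision;
# the mean/std values are placeholders for training-split statistics.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((224, 224)),              # 1) re-size to 224x224
    transforms.RandomHorizontalFlip(p=0.5),     # 3) horizontal-flip augmentation (train only)
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5],  # 2) zero mean, unit variance
                         std=[0.5, 0.5, 0.5]),  #    (placeholder statistics)
])

val_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
```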



C. Feature Engineering: Blobs

We explored Blob feature extraction. Blobs are bright-on-dark or dark-on-bright regions in an image. All the bins in which items are placed look similar, so if items are present in a bin, the idea is to create features under the assumption that the items are relatively bright against the background. We used the Laplacian of Gaussian approach: it computes the Laplacian of the image at successively increasing standard deviations and stacks the results in a cube. Blobs are local maxima in this cube. Note the yellow circles in Figure 2.

Fig. 2: Blob Features
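The Laplacian-of-Gaussian detector is available in scikit-image. The sketch below runs it on a synthetic image with two bright patches standing in for a grayscale bin photo; the parameters are illustrative, not the values used in the experiments.

```python
# A sketch of Laplacian-of-Gaussian blob detection, assuming scikit-image;
# the synthetic image stands in for a grayscale bin photo.
import numpy as np
from skimage.feature import blob_log

image = np.zeros((200, 200))
image[60:80, 60:80] = 1.0        # two bright "items" on a dark background
image[120:150, 130:160] = 1.0

blobs = blob_log(image, min_sigma=3, max_sigma=30, num_sigma=10, threshold=0.1)
# Each row is (row, col, sigma); the blob radius is roughly sigma * sqrt(2).
print(len(blobs), "blobs found")
```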
D. Feature Engineering: Histogram of Oriented Gradients

In addition, Histogram of Oriented Gradients (HOG) features were explored, because objects in the images tend to have dominant orientations. The distributions of the directions of pixel gradients are used as features. Unfortunately, the tape wrapping the products in the bin skews the HOG diagram. We attempted to remedy this by pre-processing the images for edge detection. Still, HOG under-performed.

Fig. 3: HOG Features
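HOG descriptors are also available in scikit-image. The sketch below is illustrative: the sample image from skimage.data stands in for a bin photo, and the cell and block sizes are assumptions rather than the values used in the experiments.

```python
# A sketch of the HOG descriptor, assuming scikit-image; parameters illustrative.
from skimage import data, color
from skimage.feature import hog

image = color.rgb2gray(data.astronaut())     # stand-in for a grayscale bin photo
features, hog_image = hog(image,
                          orientations=9,            # gradient-direction bins
                          pixels_per_cell=(16, 16),
                          cells_per_block=(2, 2),
                          visualize=True)
# `features` is the vector fed to the SVM; `hog_image` is only for visual inspection.
```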
III. METHODS

A. Logistic Regression

The probability that each observation is classified to each class is defined by a logistic function, as follows:

p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}    (1)

The observation is assigned to the class for which it has the highest probability.

B. Classification Tree

In a classification tree, each observation belongs to the most commonly occurring class of training observations in the region to which it belongs. We use the Gini index, which measures total variance across the K classes, as the loss function. The Gini index is defined by

G = \sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk})    (2)

C. Support Vector Machines

The support vector machine (SVM) is an extension of the support vector classifier that results from enlarging the feature space in a specific way, using kernels. The feature space is enlarged in order to accommodate a non-linear boundary between the classes. The kernel approach enables an efficient computational approach for SVMs. We attempted several kernel types, as follows:

• Linear kernel

K(x_i, x_{i'}) = \sum_{j=1}^{P} x_{ij} x_{i'j}    (3)

• Polynomial kernel

K(x_i, x_{i'}) = \left(1 + \sum_{j=1}^{P} x_{ij} x_{i'j}\right)^d    (4)

• Radial kernel

K(x_i, x_{i'}) = \exp\left(-\gamma \sum_{j=1}^{P} (x_{ij} - x_{i'j})^2\right)    (5)
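For reference, the three kernels map directly onto scikit-learn's SVC. The sketch below is illustrative rather than the exact experimental setup; the random matrices stand in for the extracted image features and item-count labels.

```python
# A minimal sketch of the three kernel types with scikit-learn's SVC;
# the synthetic arrays stand in for extracted features and labels.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 50)), rng.integers(0, 6, 200)
X_val, y_val = rng.normal(size=(60, 50)), rng.integers(0, 6, 60)

kernels = {
    "linear": SVC(kernel="linear", C=1.0),                    # Eq. (3)
    "poly":   SVC(kernel="poly", degree=3, coef0=1, C=1.0),   # Eq. (4)
    "rbf":    SVC(kernel="rbf", gamma="scale", C=1.0),        # Eq. (5)
}
for name, clf in kernels.items():
    clf.fit(X_train, y_train)
    print(name, clf.score(X_val, y_val))
```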
D. Convolutional Neural Networks

ResNet [4] consists of a series of convolutional filters followed by a fully connected layer (refer to Fig 4). Deeper networks usually suffer from vanishing gradients, and ResNet attempts to solve this problem by stacking identity mappings.

1) Classifier and Loss Function: A softmax layer (σ) and the cross-entropy loss (CEL) function were used, since we are solving a multi-class classification problem:

CEL = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c})    (6)

\sigma_j = \frac{\exp(z_j)}{\sum_{k=1}^{K} \exp(z_k)}    (7)
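A minimal PyTorch sketch of Eq. (6) and (7) follows; note that nn.CrossEntropyLoss applies the softmax and the cross-entropy loss in one step, so the model only needs to output raw logits.

```python
# A small numerical sketch of Eq. (6) and (7), assuming PyTorch.
import torch
import torch.nn as nn

logits = torch.randn(4, 6)            # 4 images, 6 classes (0 to 5 items per bin)
labels = torch.tensor([0, 3, 5, 2])   # ground-truth item counts from the metadata

probs = torch.softmax(logits, dim=1)            # Eq. (7): class probabilities
loss = nn.CrossEntropyLoss()(logits, labels)    # Eq. (6), averaged over the batch
```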



2) Learning Rate Finder: The learning rate determines the step size of the update and is one of the key hyper-parameters in training a network. For some of our experiments, we set the learning rate based on an approach introduced in the paper "Cyclical Learning Rates for Training Neural Networks" by Leslie N. Smith [7]. Fig 4 depicts the approach for one of our experiments with ResNet34. Essentially, this algorithm runs one epoch on the network, increasing the learning rate every few iterations. Eventually, the learning rate producing the steepest reduction in validation loss is picked. This eliminates the need to rely on experimentation to find the best learning rate.

Fig. 4: Learning Rate Finder - Loss
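The range test can be sketched in a few lines of PyTorch. The function below illustrates the idea and is not necessarily the authors' implementation; model, criterion and train_loader are placeholders for the actual network, loss function and data pipeline.

```python
# A condensed sketch of the learning-rate range test described above, assuming PyTorch.
import torch

def lr_range_test(model, criterion, train_loader, lr_min=1e-6, lr_max=1.0):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr_min)
    growth = (lr_max / lr_min) ** (1.0 / max(1, len(train_loader) - 1))
    lr, history = lr_min, []
    for images, labels in train_loader:       # one epoch, LR grows every iteration
        for group in optimizer.param_groups:
            group["lr"] = lr
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        history.append((lr, loss.item()))
        lr *= growth
    return history   # pick the LR where the loss falls most steeply
```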
3) Optimization Algorithms:

a) Stochastic Gradient Descent (SGD) and Stochastic Gradient Descent with Restarts (SGDR): Training CNNs involves optimizing multimodal functions. SGDR resets the learning rate every few iterations so that gradient descent can pop out of a local optimum and find its way toward the global minimum, an idea shown to work effectively in the paper by Loshchilov et al. [5]. Fig 5 shows SGDR with cosine annealing.

Fig. 5: SGDR Cosine Annealing
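PyTorch ships a scheduler implementing warm restarts with cosine annealing. The sketch below shows the mechanics only; the tiny linear model and the restart period T_0 are illustrative stand-ins, not the paper's exact configuration.

```python
# A minimal sketch of SGDR (cosine annealing with warm restarts), assuming PyTorch;
# the tiny linear model stands in for the actual ResNet.
import torch

model = torch.nn.Linear(10, 6)                 # stand-in for the ResNet backbone
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=1, T_mult=2)                # restart every cycle, doubling its length

steps_per_epoch = 100
for epoch in range(4):
    for step in range(steps_per_epoch):
        # forward pass, loss.backward() and optimizer.step() would go here
        scheduler.step(epoch + step / steps_per_epoch)  # decays within a cycle, then resets
```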
b) Adam: Adam is an algorithm that is computationally efficient, has small memory requirements, and is well suited for training deeper ResNets on images, which are large in terms of data and parameters. In the paper by Kingma et al. [6], results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods.

IV. EXPERIMENTS

The performance of the best methods, as well as the rationale leading to identifying them, is outlined below. Given the nature of the data set, we expect the decision boundary to be highly non-linear.

A. Multi-Class Classification

Several multi-class classifiers were explored with raw pixel data. Table I shows the accuracy of Logistic Regression, Classification Tree, and Support Vector Machines. SVM performed best.

TABLE I: MULTI-CLASS CLASSIFIER ACCURACY ON RAW DATA

Object Quantity | Logistic Regression | Decision Tree | SVM
0       | 0.33 | 0.12 | 0.33
1       | 0.19 | 0.14 | 0.12
2       | 0.24 | 0.15 | 0.29
3       | 0.25 | 0.28 | 0.26
4       | 0.24 | 0.21 | 0.29
5       | 0.20 | 0.21 | 0.24
overall | 0.24 | 0.19 | 0.26

a. Decision Trees suffer severe over-fitting.

B. Feature Selection

Next, with the intention of arriving at a more useful description of an image than raw pixel data, a number of feature extraction algorithms were attempted. Table II shows the performance of Blob features, Histogram of Oriented Gradients (HOG), and Principal Component Analysis (PCA).

Given the large number of columns in a raw RGB image (150,528 in total), and the need to compress it to as few as 1,000 columns, it is important to identify the most effective solver. The following Principal Component Analysis solvers were tried: 'auto', 'full', 'arpack', and 'randomized'. The PCA solver resulting in the highest accuracy was 'randomized'.
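This compression step can be reproduced with scikit-learn's PCA and the 'randomized' solver. The sketch below uses a random matrix as a stand-in for the flattened images, which in the real data set have 150,528 columns each.

```python
# A sketch of compressing raw-pixel features to 1,000 principal components with
# the randomized solver, assuming scikit-learn; the random matrix is a stand-in
# for the flattened 224x224x3 images (150,528 columns each).
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(1500, 10_000).astype(np.float32)   # stand-in feature matrix
pca = PCA(n_components=1000, svd_solver="randomized", random_state=0)
X_reduced = pca.fit_transform(X)                       # shape: (1500, 1000)
```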
With respect to Histogram of Oriented Gradients (HOG), evaluation suggests that the images in the data set do not have enough dominant gradients. Thus, identifying products in a bin is difficult. In part, this may be due to Amazon's usage of tape to cover products in a bin. For many images, the tape occludes the products, causing a significant information loss.

TABLE II: ACCURACY OF FEATURE EXTRACTION METHODS, FOR SUPPORT VECTOR MACHINE CLASSIFIERS

Object Quantity | Blob | HOG  | PCA
0       | 0.04 | 0.33 | 0.71
1       | 0.17 | 0.12 | 0.24
2       | 0.22 | 0.29 | 0.32
3       | 0.53 | 0.26 | 0.48
4       | 0.22 | 0.29 | 0.28
5       | 0.02 | 0.24 | 0.12
overall | 0.26 | 0.26 | 0.32

a. Blob and HOG under-performed.

C. Support Vector Machines

The following SVMs were evaluated in Python, performing the corresponding parameter search:

1) svm.LinearSVC: with a linear kernel. Both 'l1' and 'l2' regularization were pursued, along with 'hinge' and 'squared_hinge' loss functions, 'ovr' and 'crammer_singer' multi-class strategies, and a C penalty in the range 1 to 1e4.
2) svm.NuSVC: similar to SVC, but uses a parameter to control the number of support vectors. Nu values ranging from 1e-6 to 1e-2 were tested.
3) svm.SVC: C-support vector classification. We tested a penalty range from 1e-9 to 1e12 for C, and from 1e-11 to 1e-10 for gamma.

Ten-fold cross-validation was used to find the best parameters against the validation set. Thus over-fitting and bias were traded off to achieve the greatest overall accuracy, while achieving greater than zero accuracy for every class. The highest accuracy was achieved by svm.SVC, with the following parameters: C=1e2, gamma=1e-7. The corresponding confusion matrix is in Figure 6.
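The kind of parameter search described above can be expressed with scikit-learn's GridSearchCV. This is a condensed illustration, not the full grid the authors swept; the synthetic arrays stand in for the PCA features and item-count labels.

```python
# A condensed sketch of the svm.SVC parameter search with ten-fold
# cross-validation, assuming scikit-learn; the grid is abbreviated.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = rng.normal(size=(300, 100)), rng.integers(0, 6, 300)   # synthetic stand-in

param_grid = {"C": [1e0, 1e2, 1e4], "gamma": [1e-9, 1e-8, 1e-7, 1e-6]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)
search.fit(X, y)
print(search.best_params_)   # the paper reports C=1e2, gamma=1e-7 as the winner
```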



Fig. 6: SVM SVC Confusion Matrix

D. Convolutional Neural Networks via Transfer Learning

We took ResNet18, ResNet34 and ResNet50 ConvNets that have been pretrained on ImageNet for all our deep learning experiments.
a) ResNet18 - SGD Learning Rate Step Decay: We initiated the learning process with a batch size of 128 and trained the full ResNet18 network using SGD with a weight decay of 1e-4 and a learning rate of 0.1, set to decay by a factor of 10 every 10 epochs. We noticed that the model starts to over-fit after 11 epochs. Accuracy is 49.38% and Root Mean Squared Error (RMSE) is 0.9905 on the validation set. Note that we are able to achieve over 94% accuracy on the training set. This indicates that the model is failing to generalize, hence we increased regularization by setting weight decay to 1e-3 for the next iteration. After training the model for 20 epochs, we are able to achieve an accuracy of 50.4% and an RMSE of 0.9810 on the dev set; the increase in regularization has prevented over-fitting by constraining the parameter weights. Experimenting further, by increasing regularization and decreasing the learning rate, the accuracy dropped. Fig 7 shows the training loss and Fig 8 the training/validation accuracy for the best performing ResNet18 model. Upon diving into per-class accuracy numbers for ResNet18, we noticed that the model performs poorly on images with 4 and 5 items. Our hypothesis was that images with more items need more complex features to predict well; in order to learn more complex features in a reasonable amount of time, we tried a deeper ResNet34 architecture.

Fig. 7: ResNet18 Training Loss

Fig. 8: ResNet18 Train/Val Accuracy
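The optimizer and schedule for this experiment can be written directly in PyTorch. The sketch below reflects the stated hyper-parameters (weight decay 1e-4, learning rate 0.1, decayed 10x every 10 epochs); the momentum value is an assumption, not taken from the paper.

```python
# A sketch of the ResNet18 fine-tuning setup described above, assuming
# PyTorch/torchvision; momentum=0.9 is an assumption.
import torch
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")          # ImageNet-pretrained
model.fc = torch.nn.Linear(model.fc.in_features, 6)       # 6 classes: 0..5 items

optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
# In the training loop: optimizer.step() after each batch, scheduler.step() once per epoch.
```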
b) ResNet34 - SGD Learning Rate Step Decay: Using the same learning process and the initial set of learning rates and regularization settings from ResNet18, we trained ResNet34 models with step annealing. Model iterations were performed by examining training and validation loss and accuracy. It was surprising to notice that the best performing ResNet34 model, obtained with a learning rate of 0.01 and a weight decay of 1e-2, improved accuracy by just 1.3% to 51.78%, while RMSE decreased to 0.9647. Further, no improvement was observed in the accuracy on images with 4 or 5 objects. Upon analyzing the images that the model failed to correctly predict, it was observed that some incorrect classifications were due to a large amount of noise and occlusion in the images. Refer to Fig x (it was difficult to visually determine the right count). However, some were due to the model not being able to learn features pertaining to the shapes associated with a combination of certain object categories in the data set. This motivated the experiments using SGDR to improve the performance of the network, based on research by Jeremy Howard [12]. Fig 9 and Fig 10 depict the training loss and train/val accuracy for the best performing ResNet34 model.

Fig. 9: ResNet34 Training Loss

Fig. 10: ResNet34 Train/Val Accuracy



c) ResNet34 - SGDR: We set the batch size to 128 and transformed the images to 64x64 with horizontal-flip data augmentation for training. The learning rate finder (introduced in the Methods section) was used to set the initial learning rate lr to 0.008. We trained the network by freezing all the layers except the final fully connected layer for 3 epochs, using SGDR with cosine annealing. Weights on the fully connected layer are very specific to the problem at hand, so weight initialization for the last layer is somewhat random, as the Amazon Bin Images are very different from ImageNet. Training the last layer reduces this randomness in the weights and can be performed quickly, as the other layers are frozen. The underlying theory is described in the paper by Yosinski et al. [8]. Next, training is performed on the entire network using SGDR with cosine annealing and differentiated learning rates for 3 epochs. Differentiated learning rates set a lower learning rate of lr/9 for the first few layers of the network, followed by lr/3 for the middle layers and lr for the fully connected layer. The premise is that the first few layers are generic to all images, while the last layers are more specific to the problem. Since ResNet was pretrained on ImageNet, we would not expect the weights of the earlier network to change much. Refer to Yosinski et al. [8] for more details.
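A PyTorch sketch of this staged fine-tuning is shown below. The split of ResNet34 into "early", "middle" and head parameter groups is our assumption about what "first few layers" and "middle layers" mean, and momentum=0.9 is likewise assumed; this illustrates the differentiated learning rates, not the authors' exact code.

```python
# A sketch of staged fine-tuning with differentiated learning rates (lr/9, lr/3, lr),
# assuming PyTorch/torchvision; the layer grouping is an assumption.
import torch
from torchvision import models

lr = 0.008
model = models.resnet34(weights="IMAGENET1K_V1")
model.fc = torch.nn.Linear(model.fc.in_features, 6)        # new head: 0..5 items

# Stage 1: freeze everything except the new fully connected head.
for p in model.parameters():
    p.requires_grad = False
for p in model.fc.parameters():
    p.requires_grad = True
head_optimizer = torch.optim.SGD(model.fc.parameters(), lr=lr, momentum=0.9)

# Stage 2: unfreeze and train the whole network with per-group learning rates.
for p in model.parameters():
    p.requires_grad = True
early = (list(model.conv1.parameters()) + list(model.bn1.parameters())
         + list(model.layer1.parameters()) + list(model.layer2.parameters()))
middle = list(model.layer3.parameters()) + list(model.layer4.parameters())
full_optimizer = torch.optim.SGD([
    {"params": early, "lr": lr / 9},                 # generic early layers
    {"params": middle, "lr": lr / 3},                # middle layers
    {"params": model.fc.parameters(), "lr": lr},     # problem-specific head
], momentum=0.9)
# Training loops (with SGDR cosine annealing on these optimizers) are omitted.
```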
The process above is repeated by setting the image size to 128x128 and finally to 224x224, training the full network until the model starts to over-fit. Re-sizing the images is useful to reduce over-fitting, since the same features are extracted from images of varying dimensions, which prevents over-fitting on the training data. After the final set of epochs on the 224x224 images, the accuracy was 52.78%. Using test-time augmentation, i.e. applying data augmentation to the test image, letting the model predict on all the augmented images, and taking the average of the predictions, the accuracy improved by 100 basis points to 53.8%.
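Test-time augmentation with horizontal flips can be sketched as below; `model` and `image` (a normalized 3x224x224 tensor) are placeholders, and the flip-only augmentation is an illustration consistent with the training augmentation described earlier.

```python
# A minimal sketch of test-time augmentation: predict on the original image and
# its horizontal flip, average the class probabilities, and take the argmax.
# Assumes PyTorch; `model` and `image` are placeholders.
import torch

def predict_with_tta(model, image):
    model.eval()
    batch = torch.stack([image, torch.flip(image, dims=[2])])  # original + flipped
    with torch.no_grad():
        probs = torch.softmax(model(batch), dim=1)
    return probs.mean(dim=0).argmax().item()   # averaged prediction: item count
```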
We have seen from the training and validation loss over the last few epochs that training on different-sized images is indeed preventing over-fitting.

Upon inspecting some of the images that were incorrectly classified, it was observed that the primary culprit was heavy noise and occlusion in the images, especially for misclassified images with fewer than 3 items. Given over 450,000 object categories, we expect the training and test sets to contain different object categories, which could affect the learning of the model.

d) ResNet34, ResNet50 with Adam Optimizer: We trained a ResNet34 model using the Adam optimizer, initially with a learning rate of 0.001 and 0 weight decay. Experimenting with regularization and the learning rate, the best performing model provided an accuracy of 50.6%. SGD performs better, although Adam converges much faster than SGD, taking considerably less time for each gradient descent update. So far, all the experiments involved working with 150K images split into 70% train, 20% val and 10% test. Given the computational efficiency of Adam, we were able to execute a ResNet34 model on the entire dataset of 324K images, with 80% train, 10% val and 10% test, a learning rate of 0.001 and 0 weight decay. We were able to achieve an accuracy of 56.2% on the validation set (Fig 11 depicts the confusion matrix). Repeating the experiment with ResNet50, a deeper architecture, did not improve the results. Table III compares the performance of various deep learning algorithms.

Fig. 11: ResNet34 Adam Confusion Matrix

TABLE III: ACCURACY OF CNN MODELS

Model             | Training Accuracy | Val Accuracy | Val RMSE
ResNet18 SGD      | 55.9 | 50.4 | 0.98
ResNet34 SGD      | 55.2 | 51.2 | 0.99
ResNet34 SGDR     | 57.8 | 53.8 | 0.94
ResNet34 Adam     | 50.6 | 51.8 | 0.99
ResNet34 Adam All | 62.3 | 56.2 | 0.90
ResNet50 Adam All | 61.2 | 55.2 | 0.91

a. "All" refers to training on all images.

V. CONCLUSIONS

Our Convolutional Neural Network model outperforms the SVM by 75% and achieves 3x the performance of random guessing. Our best performing model gained a 70 basis point improvement over the existing work [2] that was performed on a similar dataset. Overall accuracy on the test set is over 56% and overall RMSE is 0.90. Upon analyzing images that were incorrectly classified, it was observed that some images have such heavy noise and occlusion that it is not possible even for a human to count the number of items, which confounds the training data set.

VI. FUTURE WORK

First, manually cleaning the data to remove such images would enhance learning. Second, with the Adam optimizer, we have seen that performance increases with a larger dataset. We are intrigued by the possibility that photos be taken from different angles, and that metadata connect the content of a bin over time, so that a belief state may be tracked. Third, with over 450K product SKUs, images are bound to violate our assumption that they are drawn from a single distribution. Thus, forming ensemble models with different architectures and learning approaches may achieve higher accuracy. The project's repository is: https://github.com/OneNow/AI-Inventory-Reconciliation

VII. CONTRIBUTIONS

Pablo's primary contribution was on Support Vector Machines, Sravan's on Convolutional Neural Networks, and Nutchapol's across the board.



ACKNOWLEDGMENT

The authors are grateful to Professor Andrew Ng for his masterly teaching of machine learning.

AUTHORS

Pablo Rodriguez Bertorello leads Next Generation data engineering at Cadreon, a marketing technology platform company. Previously he was CTO of Airfox, which completed a successful Initial Coin Offering. He is the co-inventor of a cloud platform company acquired by Oracle, and the original designer of the data bus for Intel's Itanium processor. Pablo has been issued over a dozen patents.

Sravan Sripada works at Amazon. He is interested in applying artificial intelligence techniques to solve problems in retail, cloud computing and voice-controlled devices.

Nutchapol Dendumrongsup is a Master's student at the Institute for Computational and Mathematical Engineering and the Department of Energy Resources Engineering at Stanford. He is interested in the application of machine learning in the energy industry and in traditional reservoir simulation in the oil and gas industry.
REFERENCES

[1] Joe Flasher, Amazon Bin Image Dataset, Open Data.
[2] Eunbyung Park, silverbottlep/abid_challenge, GitHub, Jul. 20, 2017.
[3] Gregory Koch, Richard Zemel, Ruslan Salakhutdinov, ICML Deep Learning Workshop, 2015.
[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, "Deep Residual Learning for Image Recognition," arXiv:1512.03385, 10 Dec. 2015.
[5] Ilya Loshchilov, Frank Hutter, "SGDR: Stochastic Gradient Descent with Warm Restarts," arXiv:1608.03983, 3 May 2017.
[6] Diederik P. Kingma, Jimmy Ba, "Adam: A Method for Stochastic Optimization," arXiv:1412.6980v9, 30 Jan. 2017.
[7] Leslie N. Smith, "Cyclical Learning Rates for Training Neural Networks," arXiv:1506.01186, 4 Apr. 2017.
[8] Jason Yosinski, Jeff Clune, Yoshua Bengio, Hod Lipson, "How Transferable Are Features in Deep Neural Networks?," arXiv:1411.1792, 6 Nov. 2014.
[9] https://www.fast.ai/

