Lokad ReceiptStream
Lokad ReceiptStream
Long before ‘big data’ became the technology A compact binary representation
buzzword of 2012, retail networks have been
with about 5 bytes per receipt
among the pioneers in dealing with the large
amounts of data that is produced by their supply
line on average, that is 60% less
chain and their point of sale systems. Recognizing than GZip over TSV. It requires
the richness and immense value of their data, also 7x less CPU to write receipts,
heavy IT infrastructure investments have, in many and 11x less CPU to read receipts.
cases, been made.
However, to date, the limitations and cost of the required infrastructure has left the reality
far behind ambition and promise. This is particularly true for the richest retail data source,
which also dwarfs all others in size: receipts generated by point of sale systems. Collecting
and processing receipts of hundreds or even thousands of stores has remained a daunting,
and very expensive, task.
Each receipt contains the list of items purchased, including information such as day, time,
price, discounts etc. This data for example provides the basis for inventory and supply chain
optimization, on-shelf availability improvements, customer loyalty
marketing, consumer analysis and staffing optimization. In some
cases, retail networks are even selling this information to their
suppliers in multi-million yearly contracts.
1
See https://github.com/Lokad/lokad-receiptstream
www.lokad
2
When Lokad migrated toward the cloud in 2010, Windows Azure gave us access to an
unprecedented amount of processing power (roughly 100x more!). Interestingly however,
while we learned to ‘harness’ the suddenly abundant computing resources provided by the
cloud, a rather unexpected skill emerged in parallel: we also learned to be 100x more
efficient when processing retail data.
The combination of these two radical evolutions has made a reality what would have been
dismissed as science-fiction only a few years ago. By sharing some insights on big data for
retail, we hope to further fuel the advances in retail data exploitation.
1000 stores.
15 billion Euros of turnover.
1.5€ per item on average.
37 items purchased per receipt on average (i.e. 50€ shopping carts).
Those assumptions imply collecting, processing and persisting about 10 billion receipt lines
per year.
Here, we will assume that each receipt header hold the following information:
For the sake of simplicity, we have restricted the scope to a relatively narrow number of
fields. However, our experience indicates that those fields cover the overwhelming majority
of applications that could be envisioned while leveraging receipts.
www.lokad
3
Also, the methodology described below could be extended with extra fields without
fundamentally changing the primary claim: a smartphone should be enough.
1Ghz CPU
512MB of RAM
64GB of solid state drive2 with I/O at 30MB/s
Compared to nowadays high-end servers, the smartphone computing power is thin, and yet,
from an absolute perspective, it’s enormous: the text of the Wikipedia fits several times on
such a device.
In order to do that, we propose ReceiptStream a tiny piece of software, weighting less than
500 lines of code.
To perform those operations, we observe that retailers nowadays quasi exclusively rely on
SQL databases. However, we will compare ReceiptStream to two alternative approaches:
Indeed, while retail systems in production are primarily relying on SQL databases, GZIP-
compressed flat text files are surprisingly efficient, and consequently worth being
benchmarked against ReceiptStream.
In order to measure the respective performance of those approaches, we believe there are
three key metrics:
1. The number of bytes consumed on average per receipt line for persistence.
2. The number of receipts that can be written per second.
3. The number of receipts that can be read per second.
2
As of 2012-05-31, a 64GB MicroSD card from SanDisk is sold $64.99 on Amazon.com
www.lokad
4
With 37 lines per receipt in this study, receipt lines vastly outweigh receipt headers. Thus,
ReceiptStream is not even doing anything special for receipt headers: it relies on a plain
binary representation. Subtleties lie in the representation of receipt lines.
GTIN: 12 to 14 digits, however, there are hardly more than 100k distinct barcodes
being sold on any given day, even for the largest retail networks.
The purchased quantity, which frequently equals 1. Our measurements indicate that
for grocery retail, about 85% of the receipt lines have a quantity of 1.
Unit prices typically follow a price grid: 49cts is much more frequent than 51cts.
Hence, frequent prices can be leveraged for compression.
The optional discount follows a pattern quasi-identical to prices; however the most
frequent discount value is zero.
VAT rates have a very limited diversity (dozens at most).
Representing GTINs with 10+ bytes is a waste of storage, because the total number of
distinct GTINs in existence is only a tiny fraction of the potential number of combinations.
Thus, instead, ReceiptStream replaces GTINs with 7-bit encoded integers, called varint, i.e.
integers of varying size. The principle behind those varints is simple: the 8 th bit indicates if
the next byte should be read as part of the current integer. As a result, values between 0 and
127 need only a single byte to be represented.
Intuitively, the varint format assumes that small values are much more frequent than large
values. ReceiptStream leverages this property by making sure that, indeed, small values are
much more frequent than large ones.
www.lokad
5
In particular, the GTIN and the price are replaced by two indexes written as varints.
Naturally, this representation implies that ReceiptStream maintains two lookups3 mapping
between indexes toward the actual GTIN and price values.
Then, the 3rd block has a size of exactly 1 byte. It contains 2 Boolean flags indicating if
quantity is not one and if discount is not zero. If the quantity equals one then the 4 th block is
omitted. Similarly if the discount equals zero, then the 5 th block is omitted. This pattern
leaves room for 64 distinct VAT rates which is vastly sufficient in practice.
Quantities are represented by varints directly (quantities larger than 127 are extremely
infrequent), while VAT rates, and discounts get represented through indexes, and thus
involve 2 more lookups. In practice, the size of the GTIN lookup dwarfs the size of the 3
other lookups put together. Yet, the GTIN lookup remains below 50MB in memory.
The content of a receipt is represented by the concatenation of all the receipt lines. Before
writing down the first receipt line, ReceiptStream writes down a varint indicating how many
lines should be read as part of the current receipt.
A similar process needs to be applied to other fields; however for the sake of concision, we
only describe the GTIN case in the following: ReceiptStream treats prices, discounts and VAT
rates in an identical manner.
The naïve way to perform this index allocation consists in a 2-pass process over the data:
1. Read all receipts, and build the lookup by sorting GTINs by decreasing frequency.
2. Read all receipts again, and leverage the lookup established in (1) to encode.
However, any multi-pass algorithm would be rather impractical for production purposes.
ReceiptStream is, as the name suggest, an online algorithm that processes the receipts in a
strict streaming manner (1-pass only).
It is possible to keep a sorted dictionary of the GTINs (sorted against their respective
frequency) and to update this dictionary after reading each individual receipt. However, we
found out that such a process is relatively CPU intensive.
3
See http://en.wikipedia.org/wiki/Lookup_table
4
See http://en.wikipedia.org/wiki/Bubble_sort
www.lokad
6
a new GTIN of index A is observed, its observation count is increased of 1. If the observation
count of A is greater than the observation count of the GTIN of index B = A / 2, then, the two
GTINs swaps their indexes A and B.
Each update (of the table that counts observations) triggers at most a single swap.
The frequency of swaps decrease rapidly as the number of receipts increases.
Adding a new value to one of the lookup tables. By convention the value is added at
the end of the lookup. In practice, ReceiptStream uses dynamic arrays5 to keep this
process efficient.
Swapping two indexes in one of the lookup tables. Contrary to GZip, swaps are
explicit in the data stream. By following this pattern, at read-time, ReceiptStream
does not need to maintain any hash table6 or binary tree, but only plain lookups.
For the sake of concision, we will not give here the full detail of the header binary
representation, i.e. the section that contains all the lookup changes. We suggest referring to
the code on GitHub instead, as there are no clever tricks being used, only straightforward
binary formatting.
EXPERIMENTAL RESULTS
The table below has been obtained on an Intel Core i5 at 2.4 GHz using only a single CPU
processing over a million receipts (real world data).
www.lokad
7
We observe that:
Furthermore, with SQL, one needs to store indexes in memory. Here, we cannot avoid an
index on the receipt_lines table, which alone represent about 25% of the overall space being
consumed by the SQL database. Denormalizing the database is possible, but it further
increases storage inefficiencies.
In practice, processing receipts through SQL requires machines with a large amount of RAM,
and while your mileage may vary, we believe that typical I/O performance for such SQL
setups are about 100x times lower than those of ReceiptStream.
This conclusion is not specific of Microsoft SQL Server, but applies to all relational databases.
While it is possible to fine-tune the SQL schema to improve performance (ex: storing heavily
compressed blobs in a table), we believe that such tuning defeats the purpose of SQL in the
first place.
www.lokad
8
LOOKING AHEAD
While being capable of processing receipts of a very large retail network over a smartphone
might look as a tough challenge: it is not. In particular, ReceiptStream is not even really
optimized: our primary goal was to keep it extremely simple, not extremely efficient. With
less than 500 lines of code, ReceiptStream is a tiny piece of software. We believe that
achieving a further 2x compression factor while halving the CPU consumption is possible,
but maybe not worth the effort.
Then, it’s not because basic retail data processing can happen over a smartphone, that
massive cloud computing resources do not come handy. By migrating toward Windows
Azure back in 2009, at Lokad, we discovered that combining both lean data infrastructure
(with a typical 100x speed-up over SQL) and cloud auto-scaling capabilities (with a 100x
scale-out factor, available within minutes) was the best approach to get the most out big
data in retail.
We believe that processing all receipts of the largest retail networks is a non-issue. We
believe that vendors selling tools in 6 digit figures to do it are just doing it wrong and/or
doing a disservice to their clients.
ABOUT LOKAD
Lokad is a technology company focusing on big data analytics software for retail, wholesale
and eCommerce. Software solutions include inventory optimization, loyalty card data
analysis and out-of-shelf monitoring. The company is the winner of the 2010 Microsoft
Windows Azure Worldwide Partner of the Year Award.
For large retail networks, we also routinely perform big data consulting helping local teams
to get the most of their data while maintaining IT budget in control. Learn more about our
services at http://www.lokad.com/big-data-consulting.ashx
www.lokad