A Hardware Implementation of the Snappy Compression Algorithm
Kyle Kovacs
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission.
A Hardware Implementation of the Snappy Compression Algorithm
by
Kyle Kovacs
Master of Science
in the Graduate Division of the University of California, Berkeley
Committee in charge:
Spring 2019
The thesis of Kyle Kovacs, titled A Hardware Implementation of the Snappy Compression
Algorithm, is approved:
Copyright 2019
by
Kyle Kovacs
Abstract
In the exa-scale age of big data, file size reduction via compression is ever more important.
This work explores the possibility of using dedicated hardware to accelerate the same
general-purpose compression algorithm normally run at the warehouse-scale computer level.
A working prototype of the compression accelerator is designed and programmed, then
simulated to assess its speed and compression performance. Simulation results show that the
hardware accelerator is capable of compressing data up to 100 times faster than software, at
the cost of a slightly decreased compression ratio. The prototype also leaves room for future
performance improvements that could eliminate this compression-ratio gap.
Contents

Contents
List of Figures
List of Tables

1 Introduction
1.1 Background
1.2 This Project

2 Implementation
2.1 Context and Structure
2.2 Overview
2.3 Details
2.4 Integration
2.5 Dataflow

3 Results
3.1 Software Performance
3.2 Hardware Performance
3.3 Synthesis Results

4 Conclusion
4.1 Future Work

Bibliography
List of Figures

2.1 The RoCC instruction format. The three 5-bit fields correspond to the source registers and the destination register. The 1-bit fields act as switches to specify whether or not the register fields are to be used. The 7-bit fields cover the rest of the encoding space, acting as routing and operation selection information.

2.2 RoCC API code snippet. rocc.h provides the appropriate assembly macros. The three instructions, compress, uncompress, and setLength, are defined. Two functions, compress(source, uncompressed_length, dest) and uncompress(source, compressed_length, dest), provide the main library API for using the compressor. Each instruction includes a fence to preserve memory operation ordering.

2.3 Full system block diagram. The L2 cache is shown at the bottom connecting to both Scratchpad banks through a DMA controller. The banks are read by the Match Finder and written to by the Copy and Literal Emitters. The box labelled "Compressor" highlights the bounds of the main control logic that governs the compression algorithm flow.

2.4 Match Finder block diagram. The hash table keeps track of recently-seen data. Offsets from the beginning of the stream are stored in the offset column, 32-bit data chunks are stored in the data column, and the valid bits are cleared between compression operations. A match is considered to be found when valid data from the hash table matches the input data. The matching data is the input data, and the matched data is at the offset specified by the appropriate line in the hash table. The control logic for the Match Finder handles data flow in and out of the hash table.

2.5 Memory Read Aligner block diagram. Data comes out of the memory in 64-bit chunks, but needs to be read in byte-aligned 32-bit chunks. The Read Aligner uses simple caching to concatenate the memory data properly and output byte-aligned 32-bit data through a decoupled interface. When sequential addresses are provided, there is no additional delay introduced by the Read Aligner.

2.6 Top-level RocketChip Integration. The Rocket core is attached to the L2 data cache through a TileLink crossbar. The accelerator also connects to the crossbar and accesses the same L2 cache. Memory consistency is handled by the rest of the system, and the accelerator only needs to see the L2 cache in order to operate.

2.7 Accelerator dataflow. These six figures depict the flow of data in and out of the compressor. Input data is stored in the L2 cache at the location indicated by the input pointer, and the output is produced in the L2 cache at the location specified by the output pointer. The size of the Scratchpad banks is configurable, and it determines the sliding window length for the algorithm. A longer window allows for longer copy back-references, and a shorter window requires less hardware.

3.1 Cycle timing code. This code represents an example of running the Snappy software compression with timing. The result tells how many cycles it took the processor to compress the input data. The rdcycle instruction is used to obtain timing information, and with the -O3 compiler flag it compiles down to a single assembly instruction.

3.2 Compression results. Plotted are the measurements taken for compression ratio and efficiency (cycles per byte). The green lines denote the software implementation; the blue lines denote the hardware-accelerated implementation. Figures 3.2a, 3.2c, and 3.2e compare compression ratios, while Figures 3.2b, 3.2d, and 3.2f compare efficiency.
List of Tables

2.1 Types of Snappy substreams. Literal substreams can have anywhere between 1 and 5 bytes of tag data. For this implementation, the maximum literal length is 60, meaning that all literal tags are just one byte. There are three different formats for copy substreams; the choice of which copy tag format to use is made based on the offset value. Very large offsets can be encoded by up to four bytes, whereas small offsets are encoded by just eleven bits.

3.1 Software compression results. The software compression library was run on a simulated Rocket core using FireSim. For each input size, compression ratio and number of cycles were measured. Cycle counts were obtained via the RISC-V rdcycle instruction, and compression ratios were obtained via the library itself. For an input length N, the "Random" dataset consists of N random bytes, the "Real" dataset consists of the first N bytes of MTGCorpus, and the "Repeating" dataset consists of the letter "a" N times.

3.2 Hardware compression results. The same input was run through the hardware accelerator in simulation. Efficiency is higher by up to 200-fold, but compression ratio is lower by a factor of about 2 for compressible data. The software expands incompressible data rather badly, but the hardware does not pay that penalty.

3.3 Hardware compression with larger hash table. Here, the hash table size was increased to 2048 lines instead of 512. The compression ratios for real data are the only ones that change significantly, because random data is not compressible anyway and repeating data does not benefit from a large hash table. Efficiency in terms of number of cycles is not impacted.
Acknowledgments
I want to especially thank Yue Dai for working closely with me on writing the RTL for
the compression accelerator. This was made possible in part by her hard work.
I also greatly thank Adam Izraelevitz for being a mentor, friend, and role model to me
throughout my time as a graduate student at Berkeley.
Vighnesh Iyer and Paul Rigge provided a lot of needed assistance early on in the
course of the project, and I am thankful for their help.
The members of the ADEPT lab (in no particular order) Chick Markley, Alon Amid,
Nathan Pemberton, Brendan Sweeney, David Bruns-Smith, Jim Lawson, Howard
Mao, Edward Wang, Sagar Karandikar, David Biancolin, and Richard Lin were also
instrumental in assisting me with learning how to use various tools, software, quirks of Chisel,
and other odds and ends.
Thanks to all those listed here and others for supporting me.
Chapter 1
Introduction
1.1 Background
Why is Compression Important?
Data compression is important across a wide range of applications, especially as hundreds,
thousands, and millions of photos, videos, documents, server logs, and all manner of data
streams are produced, collected, broadcast, and stored across the billions of electronic devices
in the world today. Sizes and amounts of data are continuing to grow, and as they do, it
is ever more essential to find ways to increase our storage capacity. While disk technology
can help alleviate storage capacity needs, another option is to decrease the amount of data
that needs to be stored. This is where compression can play a major role in data-heavy
applications.
What is Compression?
There are two major classes of data compression: lossy and lossless. “Lossy” compression, as
the name implies, loses some of the original content through the compression process. This
means that the transformation is not reversible, and so the original data cannot be recovered.
Lossy compression is acceptable in a variety of domains (especially in images) because not
all the original data is necessary to convey meaning. Compression that preserves the original
data is called “lossless.” This means that after being compressed, the data can undergo an
inverse transformation and end up the same as the original bit-stream. Lossy compression
works by eliminating unnecessary information, like high-frequency noise in images, whereas
lossless compression works by changing the way data is represented. Lossless compression
will be the subject of the remainder of this paper.
There are many different lossless compression algorithms [11], and the field is well
developed. Some algorithms offer high compression ratios (i.e. the size of the output data is
very small when contrasted with the size of the input data) while others offer higher speed
(i.e. the amount of time the algorithm takes to run is relatively low). Different types of data
tend to be more or less compressible depending on the algorithm used in combination with
the data itself.
1.2 This Project
Project Context
Because this research was conducted within the ADEPT lab at UC Berkeley, it utilizes the
lab's preferred set of tools: Chisel, RocketChip, FireSim, RISC-V, and others.
These are described in more detail in the following sections.
Chapter 2
Implementation
2.1 Context and Structure
bits    31-25    24-20   19-15   14    13    12    11-7   6-0
field   funct7   rs2     rs1     xd    xs1   xs2   rd     opcode
length  7        5       5       1     1     1     5      7
Figure 2.1: The RoCC instruction format. The three 5-bit fields correspond to the source
registers and the destination register. The 1-bit fields act as switches to specify whether or
not the register fields are to be used. The 7-bit fields cover the rest of the encoding space,
acting as routing and operation selection information.
The xd, xs1, and xs2 bits specify whether or not the register fields are to be used. If either or both of the xs* bits are set, then the bits
in the appropriate architectural registers specified by the rs* fields will be passed over the
connection bus to the co-processor; likewise, if they are clear, then no value will be sent for
that source register. Similarly, if the xd bit is set, then the issuing core is meant to wait
for the co-processor to send a value back over the connection bus, at which point it will be
stored in the register specified by rd.
Instructions of this format can be hard-coded as assembly sequences in user-level C code
and run on a Rocket core with an appropriate co-processor attached. Modification of the
compiler-assembler RISC-V toolchain would allow normal, high-level C code to be translated
into these custom instructions, but that possibility will not be explored here.
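To make the mechanism concrete, the snippet below sketches how such an instruction can be issued from C. This is a hedged illustration rather than the project's actual rocc.h: the wrapper names and funct7 assignments (0 for setLength, 1 for compress) are assumptions, and the GNU assembler's .insn directive is used to assemble the custom-0 (opcode 0x0B) R-type encoding of Figure 2.1 with func3 = 0b011 (xd clear, xs1 and xs2 set).

#include <cstdint>

// Hypothetical wrapper in the spirit of rocc.h (funct7 values assumed).
// ".insn r opcode, func3, func7, rd, rs1, rs2" emits a raw R-type
// instruction; rd is x0 here because xd is clear and no result returns.
#define ROCC_INSTRUCTION_SS(funct7, rs1_val, rs2_val)                 \
  asm volatile(".insn r 0x0B, 0x3, %0, x0, %1, %2"                    \
               :: "i"(funct7), "r"(rs1_val), "r"(rs2_val) : "memory")

static inline void accel_compress(const void *src, uint64_t len, void *dst) {
  ROCC_INSTRUCTION_SS(0, len, 0);                          // setLength (assumed funct7 = 0)
  ROCC_INSTRUCTION_SS(1, (uintptr_t)src, (uintptr_t)dst);  // compress  (assumed funct7 = 1)
  asm volatile("fence" ::: "memory");   // preserve memory ordering, as Figure 2.2 notes
}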
Chisel
Chisel is a Scala-embedded hardware construction language developed at UC Berkeley.
Chisel code running in the Scala runtime environment produces Flexible Intermediate Repre-
sentation of Register Transfer Level (FIRRTL) code [9] through a process called elaboration.
Once FIRRTL is generated, it can be either passed to a simulation environment, such as
Treadle [14] or the FIRRTL Interpreter [6], or it can be used to emit Verilog code for other
simulation environments like VCS [15] or Verilator [16].

Figure 2.2: RoCC API code snippet. rocc.h provides the appropriate assembly
macros. The three instructions, compress, uncompress, and setLength, are defined. Two
functions, compress(source, uncompressed_length, dest) and uncompress(source,
compressed_length, dest), provide the main library API for using the compressor. Each
instruction includes a fence to preserve memory operation ordering.
Chisel provides hardware designers with powerful parameterization options and abstrac-
tion layers unavailable to programmers of classical hardware description languages like Ver-
ilog or VHDL. Chisel circuits are thought of as generators rather than instances of hardware,
viz. precise types, bitwidths, and connections are not known in a Chisel circuit until after
it has been elaborated. Chisel was chosen as the design language for this project because of
its flexibility and because it is tightly integrated with the RocketChip ecosystem.
2.2 Overview
Compression Algorithm
The compression algorithm chosen for this work is called Snappy [12]. Snappy compression
is a dictionary-in-data sliding-window compression algorithm, meaning that it does not store
its dictionary separate from the compressed data stream, but rather uses back-references to
indicate repeated data sequences. This algorithm was chosen because of its wide use and
portability. Google’s open-source C++ implementation of the algorithm [7] was used to
guide the development of the hardware platform.
Table 2.1: Types of Snappy substreams. Literal substreams can have anywhere between
1 and 5 bytes of tag data. For this implementation, the maximum literal length is 60,
meaning that all literal tags are just one byte. There are three different formats for copy
substreams; the choice of which copy tag format to use is made based on the offset value.
Very large offsets can be encoded by up to four bytes, whereas small offsets are encoded by
just eleven bits.
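As a concrete illustration of these formats, the sketch below (hypothetical helper names, following the publicly documented Snappy framing) emits the one-byte literal tag used by this implementation and the two-byte copy tag that carries an eleven-bit offset.

#include <cstddef>
#include <cstdint>

// One-byte literal tag: the low two bits 00 mark a literal, and the
// upper six bits hold length-1 (valid for lengths 1..60, per Table 2.1).
size_t emit_literal_tag(uint8_t *out, unsigned len) {
  out[0] = (uint8_t)((len - 1) << 2);
  return 1;  // the literal bytes themselves follow the tag
}

// Two-byte copy tag: the low two bits 01, bits 2-4 hold length-4
// (lengths 4..11), and the offset's top three bits share the tag byte,
// giving the eleven-bit offsets mentioned in the caption.
size_t emit_copy_tag_11bit(uint8_t *out, unsigned len, unsigned offset) {
  out[0] = (uint8_t)(0x01 | ((len - 4) << 2) | ((offset >> 8) << 5));
  out[1] = (uint8_t)(offset & 0xFF);
  return 2;
}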
Because a tag longer than one byte could result in cache line crossings and the need to shift
large amounts of already-copied data, it was simpler for this implementation to limit the
literal length to 60. The compression ratio will not suffer by much from this simplification,
as in the worst case it adds an overhead of just 1/60 over the shortest possible encoding.
That is, if the entire stream were comprised of literal substreams, then the longest literal tag
(five bytes) could be used; this would result in five tag bytes encoding up to 2^32 bytes of
literal data. With the maximum literal length set at 60, the resultant stream instead contains
one tag byte for every 60 bytes, which accounts for an expansion by a factor of 1/60. However,
this case is extremely unlikely, and long literals may never even show up, especially with
highly compressible data.
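Spelled out (a worked bound, not text from the original):

worst-case output size = N + ceil(N / 60) tag bytes
output / input <= 1 + 1/60 ≈ 1.017

versus N + 5 bytes if a single five-byte tag could cover the entire stream, so the simplification costs at most about 1.7 percent of output size.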
Decompressing the compressed stream is rather simple, assuming the decompressed data
is read serially. When a literal tag encoding a literal of length N is reached, then the next
N bytes simply need to be copied. When a copy tag encoding a copy of length N at offset O
is reached, then N bytes simply need to be copied from O bytes behind the tag’s location.
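A minimal sketch of these two rules (not Google's production decoder; the helper names are illustrative):

#include <cstddef>
#include <cstdint>
#include <cstring>

// Literal of length n: append the next n input bytes to the output.
void apply_literal(uint8_t *out, size_t *pos, const uint8_t *lit, size_t n) {
  std::memcpy(out + *pos, lit, n);
  *pos += n;
}

// Copy of length n at offset o: re-read n bytes from o bytes back in
// the output itself. Byte-by-byte on purpose: when n > o the regions
// overlap and the just-written bytes must be re-read.
void apply_copy(uint8_t *out, size_t *pos, size_t o, size_t n) {
  for (size_t i = 0; i < n; i++)
    out[*pos + i] = out[*pos - o + i];
  *pos += n;
}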
For a given input stream, the output of the Snappy compression algorithm is not unique.
A data stream is considered compressed if it conforms to the substream format discussed
above. There are many ways to compress a single input data stream via the Snappy
algorithm, all of which are valid compressed streams as long as they are decompressible via
the same decompression semantics.

Figure 2.3: Full system block diagram. The L2 cache is shown at the bottom connecting
to both Scratchpad banks through a DMA controller. The banks are read by the Match
Finder and written to by the Copy and Literal Emitters. The box labelled "Compressor"
highlights the bounds of the main control logic that governs the compression algorithm flow.
Full System
The diagram in Figure 2.3 illustrates how the basic blocks of the compressor work together.
The basic structure consists of a DMA controller, which fills up a Scratchpad read bank
with data from memory; a Match Finder, which scans through the data and locates pieces
of it that repeat; a Literal Emitter, which writes tag bytes and streamed data from the
Scratchpad read bank back into a Scratchpad write bank; and a Copy Emitter, which writes
tag bytes and copy information into the Scratchpad write bank.
The DMA controller automatically requests data from the L2 cache in order to keep the
read bank of the Scratchpad full so that the compression can run. It also interleaves write
requests to the L2 cache so that the Scratchpad write bank can continually be flushed.
The compressor also contains additional control logic (omitted from the diagram for clarity)
which coordinates data flow through the system by telling the blocks when to run and when
to wait. When the Match Finder detects a match, it is the Copy Emitter’s job to determine
the length of the matching data. The Match Finder cannot know where to look next before
the Copy Emitter determines how long the match was and whether or not more matches
can be emitted immediately. Because of this dependency, it is not possible to run both the
Match Finder and the Copy Emitter at the same time for a given compression job. Hence,
the compressor control logic must handle this information and pause the blocks depending
on the state of the system.
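One way to picture this coordination is as a small state machine in which the Match Finder and the Copy Emitter are never active at once. The phase names below are illustrative, not taken from the RTL:

// Illustrative phase arbiter: scanning and copy-length measurement
// alternate, and the write bank is flushed once the input runs out.
enum class Phase { FillRead, MatchScan, CopyScan, FlushWrite, Done };

Phase next_phase(Phase p, bool match_found, bool match_ended, bool input_done) {
  switch (p) {
    case Phase::FillRead:   return Phase::MatchScan;
    case Phase::MatchScan:  return match_found ? Phase::CopyScan
                                 : input_done  ? Phase::FlushWrite
                                               : Phase::MatchScan;
    case Phase::CopyScan:   return match_ended ? Phase::MatchScan : Phase::CopyScan;
    case Phase::FlushWrite: return input_done ? Phase::Done : Phase::FillRead;
    default:                return Phase::Done;
  }
}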
2.3 Details
Scratchpad
In order to work with data, the compressor needs to access main system memory. Co-
processors generated with RocketChip have a connection to the CPU L1 data cache by
default. However, accessing the L1 cache should be avoided for this application because
serially streaming all the input data through the L1 cache would cause far too much traffic
and hurt CPU performance. For this reason, main memory is accessed through a Scratchpad
connected to the L2 cache over a TileLink [5] bus (RocketChip’s main interconnect bus).
The Scratchpad consists of two banks, each of which can be read from or written to
independently. The read bank sends read requests to the L2 cache via a DMA controller,
and it fills up with the desired input data. Control logic inside the compressor keeps track
of which section of the input is currently held in the Scratchpad read bank, and signals the
DMA controller to overwrite old data when new data needs to be pulled in from memory.
The write bank is filled up by the output side of the compressor. Each time a literal or copy
is emitted, it is buffered in the Scratchpad write bank before the DMA controller flushes the
bank out to memory.
Figure 2.4: Match Finder block diagram. The hash table keeps track of recently-seen
data. Offsets from the beginning of the stream are stored in the offset column, 32-bit data
chunks are stored in the data column, and the valid bits are cleared between compression
operations. A match is considered to be found when valid data from the hash table matches
the input data. The matching data is the input data, and the matched data is at the offset
specified by the appropriate line in the hash table. The control logic for the Match Finder
handles data flow in and out of the hash table.
Match Finder
The Match Finder is responsible for locating a 4-byte match in the input data stream. A
starting address is provided to the Match Finder, and it produces pairs of addresses that point
to matching 4-byte sequences in the Scratchpad. Each time the input decoupled connection
fires, the Match Finder will produce a match some number of cycles later. Once the match
length is determined, other pieces of external control logic decide when to start the Match
Finder again.
To find a match, the Match Finder scans through the input, taking the hash of each
4-byte chunk and storing its offset from a global base pointer in a hash table. The hash
table has three columns. The first column holds the offset (between 0 and the maximum
window size) of the data, the second column holds the data itself (32-bit words), and the
third column holds a valid bit, which is set the first time data is entered into that hash table
row. All valid bits in the hash table are cleared when a new compression operation begins.
For each address scanned, the 4-byte chunk stored at that address will hash to some value,
H. Row H of the hash table is then populated with the data and appropriate offset, and its
valid bit is set. If the valid bit was already set, the control logic checks to see if the previous
data stored in the row H matches the new data. If it does, then a match has been found.
The new memory address and the offset from row H of the hash table are the pointers to
the two matching segments.
In addition to checking the valid bits from the hash table, the control logic also keeps
track of a few bits of state such as the current read pointer into the Scratchpad, and whether
a match is being sought or not. It also handles all of the ready and valid signals for the
decoupled inputs and outputs.
The hash function used here is simply the input bytes multiplied by the constant 0x1e35a7bd
and shifted right so that just enough bits remain to index the hash table (a shift of 32 minus
the log2 of the table size). This is the same function used in the open-source C++
implementation, and it is meant to minimize collisions.
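The following software model restates the scan loop and hash function just described (the actual design is Chisel RTL; the C++ here is only a sketch with illustrative names):

#include <cstdint>
#include <cstring>

struct HashRow { uint32_t offset; uint32_t data; bool valid; };

// Scan forward from pos, hashing each 4-byte chunk with the same
// constant as the open-source implementation and updating row h.
// Returns the position of a found match (its earlier occurrence goes
// to *match_offset), or -1 if the window ends first.
long find_match(HashRow *table, unsigned log2_size, const uint8_t *window,
                uint32_t pos, uint32_t end, uint32_t *match_offset) {
  for (; pos + 4 <= end; pos++) {
    uint32_t word;
    std::memcpy(&word, window + pos, 4);       // next 4-byte chunk
    uint32_t h = (word * 0x1e35a7bdu) >> (32 - log2_size);
    HashRow prev = table[h];
    table[h] = HashRow{pos, word, true};       // populate row h, set valid
    if (prev.valid && prev.data == word) {     // valid row holding same data
      *match_offset = prev.offset;
      return pos;
    }
  }
  return -1;                                   // no match before the window ends
}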
Figure 2.5: Memory Read Aligner block diagram. Data comes out of the memory in
64-bit chunks, but needs to be read in byte-aligned 32-bit chunks. The Read Aligner uses
simple caching to concatenate the memory data properly and output byte-aligned 32-bit
data through a decoupled interface. When sequential addresses are provided, there is no
additional delay introduced by the Read Aligner.
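A software model of that alignment step follows (the hardware caches the previously fetched 64-bit line rather than re-reading it; little-endian byte packing is assumed):

#include <cstddef>
#include <cstdint>

// Return the byte-aligned 32-bit word at byte address addr, given a
// memory organized as 64-bit (8-byte) lines.
uint32_t read_aligned_32(const uint64_t *mem, size_t addr) {
  size_t line = addr / 8, off = addr % 8;
  uint64_t lo = mem[line];
  if (off <= 4)                      // the word fits within one line
    return (uint32_t)(lo >> (8 * off));
  uint64_t hi = mem[line + 1];       // the word crosses a line boundary
  return (uint32_t)((lo >> (8 * off)) | (hi << (8 * (8 - off))));
}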
2.4 Integration
Figure 2.6 is a schematic of how the accelerator would connect to RocketChip. The shared
TileLink crossbar manages memory access and consistency such that it is abstracted away
from the compressor itself. The Rocket core and the compressor can then both access the
same L2 cache backed by the same physical memory.
2.5 Dataflow
When a compression job starts, an input and output pointer are specified by the RoCC
instruction. The DMA controller keeps track of these values, and produces and consumes
requests to and from the L2 cache to facilitate data transfer between the Scratchpad and
the cache itself. The Scratchpad’s two banks operate in concert to stream data byte-by-byte
through the compressor. Both banks are managed as circular buffers, meaning that they
evict old data when they run out of space. The first data loaded into each bank will be the
first data evicted from that bank. The Scratchpad manager keeps track of which range of
data is held in a bank at any given time in order to ensure data is valid when it is read.
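This bookkeeping amounts to a pair of monotonically increasing head and tail counters interpreted modulo the bank size. A minimal sketch (names are illustrative, not from the RTL):

#include <cstdint>

struct Bank {
  explicit Bank(uint64_t n) : lines(n) {}
  uint64_t head = 0, tail = 0;       // ever-increasing line counts
  const uint64_t lines;              // configurable bank size, in lines
  bool full() const { return tail - head == lines; }
  bool holds(uint64_t line) const { return line >= head && line < tail; }
  void fill()  { ++tail; }           // new data lands in slot tail % lines
  void evict() { ++head; }           // the oldest line is overwritten first
};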
Before compression can begin, the read bank must fill up with input data (see Figure
2.7a). As it fills up, its tail pointer advances until it reaches the same line as the head
pointer. At this point the read bank is full.
Once the read bank has data in it, the Match Finder can scan the data for matches.
Figure 2.6: Top-level RocketChip Integration. The Rocket core is attached to the L2
data cache through a TileLink crossbar. The accelerator also connects to the crossbar and
accesses the same L2 cache. Memory consistency is handled by the rest of the system, and
the accelerator only needs to see the L2 cache in order to operate.
(a) To start the compression, the read bank of the Scratchpad must first fill up with data
from the L2 cache, starting at the input pointer. Once the tail reaches the head, the read
bank is full of valid data.

(b) When there is valid data in the read bank, the Match Finder begins scanning the input
data for matches, populating the hash table while doing so. As bytes are scanned, they are
copied into the write bank of the Scratchpad. A hole is left at the start of the write bank to
hold the literal tag. Once a match is found, all data prior to the matching data is considered
a literal and the hole is filled in. If 60 bytes are scanned before a match is found, then a
literal of length 60 will be emitted by force.
Figure 2.7: Accelerator dataflow. These six figures depict the flow of data in and out of
the compressor. Input data is stored in the L2 cache at the location indicated by the input
pointer, and the output is produced in the L2 cache at the location specified by the output
pointer. The size of the Scratchpad banks is configurable, and it determines the sliding
window length for the algorithm. A longer window allows for longer copy back-references,
and a shorter window requires less hardware.
(c) The Match Finder continues to scan through the input data, crossing line boundaries as
it scans. The Read Aligner handles this boundary crossing. Meanwhile, unmatched data is
still being copied into the write bank.

(d) When a match is found, the Copy Emitter will scan through the data to determine the
length of the match. Nothing will be put into the write bank until the length is determined,
but the match pointers will advance until their data no longer matches. Note that the length
of the literal is now known, so the hole is filled in.
(e) Once the end of the match is detected, the appropriate copy tag will be emitted into the
write bank, and the Match Finder will continue looking for more matches. If the subsequent
data is unmatched, a literal will be produced. If the subsequent data matches, another match
will be produced.

(f) When the Match Finder gets to the end of the Scratchpad read bank, the Scratchpad
will evict the oldest line in the bank and replace it with new data from the L2 cache. The
maximum look-back distance for matches is therefore limited by the size of the Scratchpad.
matchB is shown wrapping around back to the beginning of the read bank. Meanwhile, the
write bank can begin to flush data out to the L2 cache at the location specified by the output
pointer.
The matchB pointer advances one byte every cycle, as depicted in Figure 2.7b, allowing the
Match Finder to check a 4-byte word composed of three old bytes and one new byte. That
4-byte word is hashed and the appropriate line of the hash table is updated.
According to the Snappy algorithm, when a match is found (in this case by the Match
Finder), all preceding data up until that pointed to by the seeking pointer (in this case
matchB) is considered a literal and is emitted as such. In order to do this effectively, the
data is actually emitted one Scratchpad line at a time as the input is being scanned. This is
done to keep intermediate buffer size and extra copy time to a minimum. The consequence
of this is that a hole needs to be left in the output at the beginning of the literal. This
hole will eventually contain the length tag for the literal, but until the length of the literal
is known, the value of that tag byte cannot be written. This is the reason for the imposed
60-byte literal length limit: without it, the literal data would have to be re-aligned to match
the literal tag length once that length became known. This represents a trade-off in which compression
ratio is slightly sacrificed in order to provide faster compression runtime. The hole is shown
in Figures 2.7b and 2.7c.
Upon finding a match, the literal length becomes known, and the hole that was left can be
filled in with the appropriate literal tag (Figure 2.7d). At that point, the control logic updates
a pointer to record that all data up to the last byte of the literal is theoretically safe to evict
from the read bank of the Scratchpad. However, this data should be kept in the
scratchpad as long as possible in order to allow for a longer look-back window for match
detection. The Match Finder cannot look back at data that has already been evicted from
the Scratchpad read bank, so it is beneficial to only evict and overwrite the data when space
is needed.
The Copy Emitter uses the match information (the matchB pointer that was used to scan
for matches and the matchA pointer that is calculated based on the offset column of the hash
table) to determine the length of the match. Both matchA and matchB advance in lockstep
until the end of the match is found, at which point the copy can be emitted into the write
bank. Figures 2.7d and 2.7e show the match pointers advancing. A copy is emitted in Figure
2.7e.
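The length scan itself is simple to state in software form (a sketch; matchA and matchB follow the text's naming):

#include <cstdint>

// Advance both match pointers in lockstep until the data diverges or
// the valid window ends; the emitted copy then covers len bytes at
// offset matchB - matchA.
uint32_t copy_length(const uint8_t *window, uint32_t matchA,
                     uint32_t matchB, uint32_t end) {
  uint32_t len = 0;
  while (matchB + len < end && window[matchA + len] == window[matchB + len])
    ++len;
  return len;
}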
Once a copy is emitted, the process can continue as before, with matching data producing
more copies and non-matching data producing more literals. As the write bank fills up, its
tail pointer advances. Once data has been written into the write bank, it can be flushed
out to the L2 cache. It is written to the L2 cache at the location specified by the RoCC
instruction output pointer. If the write bank becomes full, or if the input runs out, then
the write bank is forcibly flushed. Because there is only one port to the L2 cache, read and
write requests cannot be made on the same cycle. For this reason, the accelerator alternates
between filling the read bank and flushing the write bank. Figure 2.7f shows new data in
the read bank evicting old data, as well as data being flushed from the write bank out to
the L2 cache.
When the entire compression job is complete, the hash table will be marked invalid and
all the pointers will reset. Any remaining data in the write bank needs to be flushed out to
the L2 cache before the output is considered valid for the application to consume.
Chapter 3
Results
3.1 Software Performance
// Allocate space
std::string output;
unsigned long start, end;
asm volatile("rdcycle %0" : "=r"(start));        // cycle count before
snappy::Compress(input, input_length, &output);  // software Snappy call
asm volatile("rdcycle %0" : "=r"(end));          // cycle count after
Figure 3.1: Cycle timing code. This code represents an example of running the Snappy
software compression with timing. The result tells how many cycles it took the processor
to compress the input data. The rdcycle instruction is used to obtain timing information,
and with the -O3 compiler flag it compiles down to a single assembly instruction.
Table 3.1: Software compression results. The software compression library was run on
a simulated Rocket core using FireSim. For each input size, compression ratio and number
of cycles were measured. Cycle counts were obtained via the RISC-V rdcycle instruction,
and compression ratios were obtained via the library itself. For an input length N , the
“Random” dataset consists of N random bytes, the “Real” dataset consists of the first N
bytes of MTGCorpus, and the “Repeating” dataset consists of the letter “a” N times.
Table 3.2: Hardware compression results. The same input was run through the hardware
accelerator in simulation. Efficiency is higher by up to 200-fold, but compression ratio is
lower by a factor of about 2 for compressible data. The software expands incompressible
data rather badly, but the hardware does not pay that penalty.
For Random data, the compression ratio does not grow with the input size, because more
random data stays equally incompressible. However,
for Real and Repeating data, the compression ratio grows as the input grows. With these
datasets, more data means more opportunities to find matches and emit copy substreams.
For Repeating data, however, the ratio and efficiency both tend to level off once the input
becomes large enough. After a certain point, the algorithm is simply emitting maximum-
length copies again and again, utilizing only one line of the hash table and taking the same
amount of time to emit each substream.
Note that for Random data, the efficiency increases with the size of the input. This is
because of an optimization in the software implementation which begins to “realize” that
the data is incompressible and starts skipping further and further along.
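A hedged paraphrase of that heuristic (the open-source code grows its scan stride as misses accumulate; the exact arithmetic varies between versions):

#include <cstdint>

// Each miss bumps *skip, and the scan advances by *skip / 32 bytes, so
// the stride stays 1 for the first 32 misses and then grows steadily.
uint32_t next_scan_pos(uint32_t pos, uint32_t *skip) {
  uint32_t step = (*skip)++ >> 5;
  return pos + step;
}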
[Figure 3.2 comprises six plots: (a) Random Data Compression Ratio and (b) Random Data
Compression Efficiency; (c) Real Data Compression Ratio and (d) Real Data Compression
Efficiency; (e) Repeating Data Compression Ratio and (f) Repeating Data Compression
Efficiency. Ratio plots show Output Size / Input Size and efficiency plots show Cycles per
Byte, each against Input Size in Bytes, with Software and Hardware series; the real-data
plots also include a Hardware (big hash table) series.]
Figure 3.2: Compression results. Plotted are the measurements taken for compression
ratio and efficiency (cycles per byte). The green lines denote the software implementation;
the blue lines denote the hardware accelerated implementation. Figures 3.2a, 3.2c, and 3.2e
compare compression ratios, while Figures 3.2b, 3.2d, and 3.2f compare efficiency.
Real
Bytes Ratio Cycles/Byte
10 1.0309 4.110
20 2.0354 3.124
50 0.9804 4.640
100 0.9901 4.260
200 0.7692 7.000
500 1.1682 3.886
1000 1.2563 3.758
2000 1.3245 3.736
5000 1.6202 3.436
10000 0.8696 5.200
20000 1.8440 3.277
50000 1.7374 3.376
Table 3.3: Hardware compression with larger hash table. Here, the hash table size
was increased to 2048 lines instead of 512. The compression ratios for real data are the
only ones that change significantly, because random data is not compressible anyway and
repeating data does not benefit from a large hash table. Efficiency in terms of number of
cycles is not impacted.
Chapter 4
Conclusion
This work presented the design and implementation of a hardware accelerator capable of
compressing data in accordance with the Snappy data compression algorithm. Simulation
results show that this accelerator prototype can achieve much greater efficiency than an
equivalent software implementation. Although the achieved compression ratio is lower, the
prototype leaves clear room for further optimization.
4.1 Future Work
Code Optimization
The Chisel code was not heavily optimized: establishing functionality came first, and
optimization has not yet been investigated. For example, when the accelerator first starts,
there is a waiting period while the scratchpad completely fills before data begins to be
processed. With more careful inspection and tuning, the code can be made more efficient.
RocketChip Simulation
In this work, simulations of the stand-alone accelerator were run using the Chisel test frame-
work. It would be interesting to simulate the entire Rocket core and use FireSim to run
the hardware tests. This would allow for more interesting analysis. For example, a major
benefit of the hardware accelerator is that the CPU can do other useful work while the ac-
celerator runs. This concurrency could be exercised in a simulation environment that allows
for both hardware and software to run on a Rocket core with the accelerator attached as a
co-processor.
Bibliography
[1] “The RISC-V Instruction Set Manual, Volume I: User-Level ISA, Document Version
2.2”, Editors Andrew Waterman and Krste Asanović, RISC-V Foundation, May 2017.
[2] Krste Asanović et al. The Rocket Chip Generator. Tech. rep. UCB/EECS-2016-17.
EECS Department, University of California, Berkeley, Apr. 2016. url: http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-17.html.
[3] Jonathan Bachrach et al. “Chisel: Constructing Hardware in a Scala Embedded Lan-
guage”. In: Proceedings of the 49th Annual Design Automation Conference. DAC ’12.
San Francisco, California: ACM, 2012, pp. 1216–1225. isbn: 978-1-4503-1199-1. doi:
10.1145/2228360.2228584. url: http://doi.acm.org/10.1145/2228360.2228584.
[4] Chisel. https://chisel.eecs.berkeley.edu/. Accessed: 2019-5-6.
[5] Henry Cook. Productive design of extensible on-chip memory hierarchies. Tech. rep.
UCB/EECS-2016-89. EECS Department, University of California, Berkeley, May 2016.
url: https://people.eecs.berkeley.edu/~krste/papers/EECS-2016-89.pdf.
[6] FIRRTL Interpreter. https://github.com/freechipsproject/firrtl-interpreter.
Accessed: 2019-5-6.
[7] Google Snappy. https://github.com/google/snappy. Accessed: 2019-5-6.
[8] Hammer: Highly Agile Masks Made Effortlessly from RTL. https://github.com/ucb-bar/hammer. Accessed: 2019-8-6.
[9] Adam Izraelevitz et al. “Reusability is FIRRTL Ground: Hardware Construction Lan-
guages, Compiler Frameworks, and Transformations”. In: Proceedings of the 36th Inter-
national Conference on Computer-Aided Design. ICCAD ’17. Irvine, California: IEEE
Press, 2017, pp. 209–216. url: http://dl.acm.org/citation.cfm?id=3199700.3199728.
[10] Sagar Karandikar et al. “Firesim: FPGA-accelerated Cycle-exact Scale-out System
Simulation in the Public Cloud”. In: Proceedings of the 45th Annual International
Symposium on Computer Architecture. ISCA ’18. Los Angeles, California: IEEE Press,
2018, pp. 29–42. isbn: 978-1-5386-5984-7. doi: 10.1109/ISCA.2018.00014. url:
https://doi.org/10.1109/ISCA.2018.00014.
[11] David Salomon. Data Compression: The Complete Reference. Berlin, Heidelberg: Springer-
Verlag, 2006. isbn: 1846286026.