Linearly Compressed Pages

ABSTRACT
Data compression is a promising approach for meeting the increasing memory capacity demands expected in future systems. Unfortunately, existing compression algorithms do not translate well when directly applied to main memory because they require the memory controller to perform non-trivial computation to locate a cache line within a compressed memory page, thereby increasing access latency and degrading system performance. Prior proposals for addressing this performance degradation problem are either costly or energy inefficient.
By leveraging the key insight that all cache lines within a page should be compressed to the same size, this paper proposes a new approach to main memory compression—Linearly Compressed Pages (LCP)—that avoids the performance degradation problem without requiring costly or energy-inefficient hardware. We show that any compression algorithm can be adapted to fit the requirements of LCP, and we specifically adapt two previously-proposed compression algorithms to LCP: Frequent Pattern Compression and Base-Delta-Immediate Compression.
Evaluations using benchmarks from SPEC CPU2006 and five server benchmarks show that our approach can significantly increase the effective memory capacity (by 69% on average). In addition to the capacity gains, we evaluate the benefit of transferring consecutive compressed cache lines between the memory controller and main memory. Our new mechanism considerably reduces the memory bandwidth requirements of most of the evaluated benchmarks (by 24% on average), and improves overall performance (by 6.1%/13.9%/10.7% for single-/two-/four-core workloads on average) compared to a baseline system that does not employ main memory compression. LCP also decreases energy consumed by the main memory subsystem (by 9.5% on average over the best prior mechanism).

Categories and Subject Descriptors
B.3.1 [Semiconductor Memories]: Dynamic memory (DRAM); D.4.2 [Storage Management]: Main memory; E.4 [Coding and Information Theory]: Data compaction and compression

Keywords
Data Compression, Memory, Memory Bandwidth, Memory Capacity, Memory Controller, DRAM

1. INTRODUCTION
Main memory, commonly implemented using DRAM technology, is a critical resource in modern systems. To avoid the devastating performance loss resulting from frequent page faults, main memory capacity must be sufficiently provisioned to prevent the target workload's working set from overflowing into the orders-of-magnitude-slower backing store (e.g., hard disk or flash).
Unfortunately, the required minimum memory capacity is expected to increase in the future due to two major trends: (i) applications are generally becoming more data-intensive with increasing working set sizes, and (ii) with more cores integrated onto the same chip, more applications are running concurrently on the system, thereby increasing the aggregate working set size. Simply scaling up main memory capacity at a commensurate rate is unattractive for two reasons: (i) DRAM already constitutes a significant portion of the system's cost and power budget [19], and (ii) for signal integrity reasons, today's high frequency memory channels prevent many DRAM modules from being connected to the same channel [17], effectively limiting the maximum amount of DRAM in a system unless one resorts to expensive off-chip signaling buffers [6].
If its potential could be realized in practice, data compression would be a very attractive approach to effectively increase main memory capacity without requiring significant increases in cost or power, because a compressed piece of data can be stored in a smaller amount of physical memory. Further, such compression could be hidden from application (and most system¹) software by materializing the uncompressed data as it is brought into the processor cache. Building upon the observation that there is significant redundancy in in-memory data, previous work has proposed a variety of techniques for compressing data in caches [2, 3, 5, 12, 25, 37, 39] and in main memory [1, 7, 8, 10, 35].

¹ We assume that main memory compression is made visible to the memory management functions of the operating system (OS). In Section 2.3, we discuss the drawbacks of a design that makes main memory compression mostly transparent to the OS [1].

1.1 Shortcomings of Prior Approaches
A key stumbling block to making data compression practical is that decompression lies on the critical path of accessing any compressed data. Sophisticated compression algorithms, such as Lempel-Ziv and Huffman encoding [13, 40], typically achieve high compression ratios at the expense of large decompression latencies
that can significantly degrade performance. To counter this problem, prior work [3, 25, 39] on cache compression proposed specialized compression algorithms that exploit regular patterns present in in-memory data, and showed that such specialized algorithms have reasonable compression ratios compared to more complex algorithms while incurring much lower decompression latencies.
While promising, applying compression algorithms, sophisticated or simpler, to compress data stored in main memory requires first overcoming the following three challenges. First, main memory compression complicates memory management, because the operating system has to map fixed-size virtual pages to variable-size physical pages. Second, because modern processors employ on-chip caches with tags derived from the physical address to avoid aliasing between different cache lines (as physical addresses are unique, while virtual addresses are not), the cache tagging logic needs to be modified in light of memory compression to take the main memory address computation off the critical path of latency-critical L1 cache accesses. Third, in contrast with normal virtual-to-physical address translation, the physical page offset of a cache line is often different from the corresponding virtual page offset, because compressed physical cache lines are smaller than their corresponding virtual cache lines. In fact, the location of a compressed cache line in a physical page in main memory depends upon the sizes of the compressed cache lines that come before it in that same physical page. As a result, accessing a cache line within a compressed page in main memory requires an additional layer of address computation to compute the location of the cache line in main memory (which we will call the main memory address). This additional main memory address computation not only adds complexity and cost to the system, but it can also increase the latency of accessing main memory (e.g., it requires up to 22 integer addition operations in one prior design for main memory compression [10]), which in turn can degrade system performance.
While simple solutions exist for these first two challenges (as we describe later in Section 4), prior attempts to mitigate the performance degradation of the third challenge are either costly or inefficient [1, 10]. One approach (IBM MXT [1]) aims to reduce the number of main memory accesses, the cause of long-latency main memory address computation, by adding a large (32MB) uncompressed cache managed at the granularity at which blocks are compressed (1KB). If locality is present in the program, this approach can avoid the latency penalty of main memory address computations to access compressed data. Unfortunately, its benefit comes at a significant additional area and energy cost, and the approach is ineffective for accesses that miss in the large cache. A second approach [10] aims to hide the latency of main memory address computation by speculatively computing the main memory address of every last-level cache request in parallel with the cache access (i.e., before it is known whether or not the request needs to access main memory). While this approach can effectively reduce the performance impact of main memory address computation, it wastes a significant amount of energy (as we show in Section 7.3) because many accesses to the last-level cache do not result in an access to main memory.

1.2 Our Approach: Linearly Compressed Pages
We aim to build a main memory compression framework that neither incurs the latency penalty for memory accesses nor requires power-inefficient hardware. Our goals are: (i) having low complexity and low latency (especially when performing memory address computation for a cache line within a compressed page), (ii) being compatible with compression employed in on-chip caches (thereby minimizing the number of compressions/decompressions performed), and (iii) supporting compression algorithms with high compression ratios.
To this end, we propose a new approach to compress pages, which we call Linearly Compressed Pages (LCP). The key idea of LCP is to compress all of the cache lines within a given page to the same size. Doing so simplifies the computation of the physical address of the cache line, because the page offset is simply the product of the index of the cache line and the compressed cache line size (i.e., it can be calculated using a simple shift operation). Based on this idea, a target compressed cache line size is determined for each page. Cache lines that cannot be compressed to the target size for its page are called exceptions. All exceptions, along with the metadata required to locate them, are stored separately in the same compressed page. If a page requires more space in compressed form than in uncompressed form, then this page is not compressed. The page table indicates the form in which the page is stored.
The LCP framework can be used with any compression algorithm. We adapt two previously proposed compression algorithms (Frequent Pattern Compression (FPC) [2] and Base-Delta-Immediate Compression (BDI) [25]) to fit the requirements of LCP, and show that the resulting designs can significantly improve effective main memory capacity on a wide variety of workloads.
Note that, throughout this paper, we assume that compressed cache lines are decompressed before being placed in the processor caches. LCP may be combined with compressed cache designs by storing compressed lines in the higher-level caches (as in [2, 25]), but the techniques are largely orthogonal, and for clarity, we present an LCP design where only main memory is compressed.²

² We show the results from combining main memory and cache compression in our technical report [26].

An additional, potential benefit of compressing data in main memory, which has not been fully explored by prior work on main memory compression, is memory bandwidth reduction. When data are stored in compressed format in main memory, multiple consecutive compressed cache lines can be retrieved at the cost of accessing a single uncompressed cache line. Given the increasing demand on main memory bandwidth, such a mechanism can significantly reduce the memory bandwidth requirement of applications, especially those with high spatial locality. Prior works on bandwidth compression [27, 32, 36] assumed efficient variable-length off-chip data transfers that are hard to achieve with general-purpose DRAM (e.g., DDR3 [23]). We propose a mechanism that enables the memory controller to retrieve multiple consecutive cache lines with a single access to DRAM, with negligible additional cost. Evaluations show that our mechanism provides significant bandwidth savings, leading to improved system performance.
In summary, this paper makes the following contributions:
• We propose a new main memory compression framework—Linearly Compressed Pages (LCP)—that solves the problem of efficiently computing the physical address of a compressed cache line in main memory with much lower cost and complexity than prior proposals. We also demonstrate that any compression algorithm can be adapted to fit the requirements of LCP.
• We evaluate our design with two state-of-the-art compression algorithms (FPC [2] and BDI [25]), and observe that it can significantly increase the effective main memory capacity (by 69% on average).
• We evaluate the benefits of transferring compressed cache lines over the bus between DRAM and the memory controller
and observe that it can considerably reduce memory bandwidth consumption (24% on average), and improve overall performance by 6.1%/13.9%/10.7% for single-/two-/four-core workloads, relative to a system without main memory compression. LCP also decreases the energy consumed by the main memory subsystem (9.5% on average over the best prior mechanism).

2. BACKGROUND ON MAIN MEMORY COMPRESSION
Data compression is widely used in storage structures to increase the effective capacity and bandwidth without significantly increasing the system cost and power consumption. One primary downside of compression is that the compressed data must be decompressed before it can be used. Therefore, for latency-critical applications, using complex dictionary-based compression algorithms [40] significantly degrades performance due to their high decompression latencies. Thus, prior work on compression of in-memory data has proposed simpler algorithms with low decompression latencies and reasonably high compression ratios, as discussed next.

2.1 Compressing In-Memory Data
Several studies [2, 3, 25, 39] have shown that in-memory data has exploitable patterns that allow for simpler compression techniques. Frequent value compression (FVC) [39] is based on the observation that an application's working set is often dominated by a small set of values. FVC exploits this observation by encoding such frequently-occurring 4-byte values with fewer bits. Frequent pattern compression (FPC) [3] shows that a majority of words (4-byte elements) in memory fall under a few frequently occurring patterns. FPC compresses individual words within a cache line by encoding the frequently occurring patterns with fewer bits. Base-Delta-Immediate (BDI) compression [25] observes that, in many cases, words co-located in memory have small differences in their values.

[Figure: an uncompressed 4KB virtual page (64 x 64B cache lines) is mapped to a compressed physical page (1KB + α), leading to fragmentation of physical memory.]

Challenge 2: Physical Address Tag Computation. On-chip caches (including L1 caches) typically employ tags derived from the physical address of the cache line to avoid aliasing, and in such systems, every cache access requires the physical address of the corresponding cache line to be computed. Hence, because the main memory addresses of the compressed cache lines differ from the nominal physical addresses of those lines, care must be taken that the computation of the cache line tag does not lengthen the critical path of latency-critical L1 cache accesses.
Challenge 3: Cache Line Address Computation. When main memory is compressed, different cache lines within a page can be compressed to different sizes. The main memory address of a cache line is therefore dependent on the sizes of the compressed cache lines that come before it in the page. As a result, the processor (or the memory controller) must explicitly compute the location of a cache line within a compressed main memory page before accessing it (Figure 2), e.g., as in [10]. This computation not only increases complexity, but can also lengthen the critical path of accessing the cache line from both the main memory and the physically addressed cache. Note that systems that do not employ main memory compression do not suffer from this problem because the offset of a cache line within the physical page is the same as the offset of the cache line within the corresponding virtual page.

Figure 2: Cache Line Address Computation Challenge

As will be seen shortly, while prior research efforts have considered subsets of these challenges, this paper is the first design that provides a holistic solution to all three challenges, particularly Challenge 3, with low latency and low (hardware and software) complexity.
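To make the cost of Challenge 3 concrete, the sketch below (our own illustration, not code from the paper; the per-line sizes are hypothetical) computes a cache line's byte offset when lines within a page are compressed to different sizes: the offset of line i is the sum of the sizes of the i preceding compressed lines, a serialized chain of additions on the access path, which is why one prior design requires up to 22 additions [10].

```c
#include <stdint.h>
#include <stdio.h>

/* Without a fixed per-page size, locating line `index` in a compressed page
 * requires summing the sizes of all lines stored before it.  The sizes used
 * here are made up for illustration. */
static uint32_t offset_of_line(const uint16_t *comp_size, unsigned index)
{
    uint32_t offset = 0;
    for (unsigned i = 0; i < index; i++)
        offset += comp_size[i];        /* one addition per preceding line */
    return offset;
}

int main(void)
{
    uint16_t sizes[8] = { 24, 64, 8, 40, 16, 64, 32, 8 };  /* hypothetical */
    printf("byte offset of line 5: %u\n", offset_of_line(sizes, 5));
    return 0;
}
```

LCP removes this serialized summation by forcing every compressed line in a page to occupy a slot of the same size, as described next.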
In the IBM MXT design [1] (Pinnacle), the real address space is uncompressed and is twice the size of the actual available physical memory. The operating system maps virtual pages to same-size pages in the real address space, which addresses Challenge 1. On-chip caches are tagged using the real address (instead of the physical address, which is dependent on compressibility), which effectively solves Challenge 2. On a miss in the 32MB cache, Pinnacle maps the corresponding real address to the physical address of the compressed block in main memory, using a memory-resident mapping table managed by the memory controller. Following this, Pinnacle retrieves the compressed block from main memory, performs decompression and sends the data back to the processor. Clearly, the additional access to the memory-resident mapping table on every cache miss significantly increases the main memory access latency. In addition to this, Pinnacle's decompression latency, which is on the critical path of a memory access, is 64 processor cycles.
Ekman and Stenström [10] proposed a main memory compression design to address the drawbacks of MXT. In their design, the operating system maps the uncompressed virtual address space directly to a compressed physical address space. To compress pages, they use a variant of the Frequent Pattern Compression technique [2, 3], which has a much smaller decompression latency (5 cycles) than the Lempel-Ziv compression in Pinnacle (64 cycles). To avoid the long latency of a cache line's main memory address computation (Challenge 3), their design overlaps this computation with the last-level (L2) cache access. For this purpose, their design extends the page table entries to store the compressed sizes of all the lines within the page. This information is loaded into a hardware structure called the Block Size Table (BST). On an L1 cache miss, the BST is accessed in parallel with the L2 cache to compute the exact main memory address of the corresponding cache line. While the proposed mechanism reduces the latency penalty of accessing compressed blocks by overlapping main memory address computation with L2 cache access, the main memory address computation is performed on every L2 cache access (as opposed to only on L2 cache misses in LCP). This leads to significant wasted work and additional power consumption. Even though the BST has the same number of entries as the translation lookaside buffer (TLB), its size is at least twice that of the TLB [10]. This adds to the complexity and power consumption of the system significantly. To address Challenge 1, the operating system uses multiple pools of fixed-size physical pages. This reduces the complexity of managing physical pages at a fine granularity. Ekman and Stenström [10] do not address Challenge 2.
In summary, prior work on hardware-based main memory compression mitigates the performance degradation due to the main memory address computation problem (Challenge 3) by either adding large hardware structures that consume significant area and power [1] or by using techniques that require energy-inefficient hardware and lead to wasted energy [10].

3. LINEARLY COMPRESSED PAGES
In this section, we provide the basic idea and a brief overview of our proposal, Linearly Compressed Pages (LCP), which overcomes the aforementioned shortcomings of prior proposals. Further details will follow in Section 4.

3.1 LCP: Basic Idea
The main shortcoming of prior approaches to main memory compression is that different cache lines within a physical page can be compressed to different sizes based on the compression scheme. As a result, the location of a compressed cache line within a physical page depends on the sizes of all the compressed cache lines before it in the same page. This requires the memory controller to explicitly perform this complex calculation (or cache the mapping in a large, energy-inefficient structure) in order to access the line.
To address this shortcoming, we propose a new approach to compressing pages, called the Linearly Compressed Page (LCP). The key idea of LCP is to use a fixed size for compressed cache lines within a given page (alleviating the complex and long-latency main memory address calculation problem that arises due to variable-size cache lines), and yet still enable a page to be compressed even if not all cache lines within the page can be compressed to that fixed size (enabling high compression ratios).
Because all the cache lines within a given page are compressed to the same size, the location of a compressed cache line within the page is simply the product of the index of the cache line within the page and the size of the compressed cache line—essentially a linear scaling using the index of the cache line (hence the name Linearly Compressed Page). LCP greatly simplifies the task of computing a cache line's main memory address. For example, if all cache lines within a page are compressed to 16 bytes, the byte offset of the third cache line (index within the page is 2) from the start of the physical page is 16 × 2 = 32, if the line is compressed. This computation can be implemented as a simple shift operation.
Figure 3 shows the organization of an example Linearly Compressed Page, based on the ideas described above. In this example, we assume that a virtual page is 4KB, an uncompressed cache line is 64B, and the target compressed cache line size is 16B.

Figure 3: Organization of a Linearly Compressed Page

As shown in the figure, the LCP contains three distinct regions. The first region, the compressed data region, contains a 16-byte slot for each cache line in the virtual page. If a cache line is compressible, the corresponding slot stores the compressed version of the cache line. However, if the cache line is not compressible, the corresponding slot is assumed to contain invalid data. In our design, we refer to such an incompressible cache line as an "exception". The second region, metadata, contains all the necessary information to identify and locate the exceptions of a page. We provide more details on what exactly is stored in the metadata region in Section 4.2. The third region, the exception storage, is the place where all the exceptions of the LCP are stored in their uncompressed form. Our LCP design allows the exception storage to contain unused space. In other words, not all entries in the exception storage may store valid exceptions. As we will describe in Section 4, this enables the memory controller to use the unused space for storing future exceptions, and also simplifies the operating system page management mechanism.
Next, we will provide a brief overview of the main memory compression framework we build using LCP.

3.2 LCP Operation
Our LCP-based main memory compression framework consists of components that handle three key issues: (i) page compression, (ii) cache line reads from main memory, and (iii) cache line writebacks into main memory. Figure 4 shows the high-level design and operation.
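As a concrete illustration of the linear address computation and the three-region layout described in Section 3.1, the sketch below is our own (not the paper's code). It uses the example parameters from the text (4KB page, 64B lines, 16B target, 64B of metadata); the number of exception slots is an assumption made only for this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

enum {
    LINES_PER_PAGE   = 64,   /* 4KB / 64B                              */
    TARGET_LINE_SIZE = 16,   /* target compressed line size            */
    UNCOMP_LINE_SIZE = 64,
    METADATA_BYTES   = 64,   /* metadata region size for n = 64        */
    EXCEPTION_SLOTS  = 8     /* assumed for illustration only          */
};

/* The three regions of one Linearly Compressed Page, back to back. */
struct lcp_page {
    uint8_t data[LINES_PER_PAGE][TARGET_LINE_SIZE];      /* compressed data  */
    uint8_t metadata[METADATA_BYTES];                     /* exception info   */
    uint8_t exceptions[EXCEPTION_SLOTS][UNCOMP_LINE_SIZE];/* uncompressed     */
};

/* Byte offset of cache line `index` inside the compressed data region:
 * index * C*, a single shift when C* is a power of two. */
static uint32_t lcp_offset(uint32_t index)
{
    return index * TARGET_LINE_SIZE;
}

int main(void)
{
    assert(lcp_offset(2) == 32);   /* the 16 x 2 = 32 example from the text */
    printf("compressed data region: %d bytes\n",
           LINES_PER_PAGE * TARGET_LINE_SIZE);
    printf("total LCP size with %d exception slots: %zu bytes\n",
           EXCEPTION_SLOTS, sizeof(struct lcp_page));
    return 0;
}
```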
Figure 4: Memory request flow (core and TLB, last-level cache, memory controller with MD cache and compress/decompress logic, DRAM, and disk).

Page Compression. When a page is accessed for the first time from disk, the operating system (with the help of the memory controller) first determines whether the page is compressible using the compression algorithm employed by the framework (described in Section 4.7). If the page is compressible, the OS allocates a physical page of appropriate size and stores the compressed page (LCP) in the corresponding location. It also updates the relevant portions of the corresponding page table mapping to indicate (i) whether the page is compressed, and if so, (ii) the compression scheme used to compress the page (details in Section 4.1).
Cache Line Read. When the memory controller receives a read request for a cache line within an LCP, it must find and decompress the data. Multiple design solutions are possible to perform this task efficiently. A naïve way of reading a cache line from an LCP would require at least two accesses to the corresponding page in main memory. First, the memory controller accesses the metadata in the LCP to determine whether the cache line is stored in the compressed format. Second, based on the result, the controller either (i) accesses the cache line from the compressed data region and decompresses it, or (ii) accesses it uncompressed from the exception storage.
To avoid two accesses to main memory, we propose two optimizations that enable the controller to retrieve the cache line with the latency of just one main memory access in the common case. First, we add a small metadata (MD) cache to the memory controller that caches the metadata of the recently accessed LCPs—the controller avoids the first main memory access to the metadata in cases when the metadata is present in the MD cache. Second, in cases when the metadata is not present in the metadata cache, the controller speculatively assumes that the cache line is stored in the compressed format and first accesses the data corresponding to the cache line from the compressed data region. The controller then overlaps the latency of the cache line decompression with the access to the metadata of the LCP. In the common case, when the speculation is correct (i.e., the cache line is actually stored in the compressed format), this approach significantly reduces the latency of serving the read request. In the case of a misspeculation (uncommon case), the memory controller issues another request to retrieve the cache line from the exception storage.
Cache Line Writeback. If the memory controller receives a request for a cache line writeback, it then attempts to compress the cache line using the compression scheme associated with the corresponding LCP. Depending on the original state of the cache line (compressible or incompressible), there are four different possibilities: the cache line (1) was compressed and stays compressed, (2) was uncompressed and stays uncompressed, (3) was uncompressed but becomes compressed, and (4) was compressed but becomes uncompressed. In the first two cases, the memory controller simply overwrites the old data with the new data at the same location associated with the cache line. In case 3, the memory controller frees the exception storage slot for the cache line and writes the compressible data in the compressed data region of the LCP. (Section 4.2 provides more details on how the exception storage is managed.) In case 4, the memory controller checks whether there is enough space in the exception storage region to store the uncompressed cache line. If so, it stores the cache line in an available slot in the region. If there are no free exception storage slots in the exception storage region of the page, the memory controller traps to the operating system, which migrates the page to a new location (which can also involve page recompression). In both cases 3 and 4, the memory controller appropriately modifies the LCP metadata associated with the cache line's page.
Note that in the case of an LLC writeback to main memory (and assuming that TLB information is not available at the LLC), the cache tag entry is augmented with the same bits that are used to augment page table entries. Cache compression mechanisms, e.g., FPC [2] and BDI [25], already have the corresponding bits for encoding, so that the tag size overhead is minimal when main memory compression is used together with cache compression.

4. DETAILED DESIGN
In this section, we provide a detailed description of LCP, along with the changes to the memory controller, operating system and on-chip cache tagging logic. In the process, we explain how our proposed design addresses each of the three challenges (Section 2.2).

4.1 Page Table Entry Extension
To keep track of virtual pages that are stored in compressed format in main memory, the page table entries need to be extended to store information related to compression (Figure 5). In addition to the information already maintained in the page table entries (such as the base address for a corresponding physical page, p-base), each virtual page in the system is associated with the following pieces of metadata: (i) c-bit, a bit that indicates if the page is mapped to a compressed physical page (LCP), (ii) c-type, a field that indicates the compression scheme used to compress the page, (iii) c-size, a field that indicates the size of the LCP, and (iv) c-base, a p-base extension that enables LCPs to start at an address not aligned with the virtual page size. The number of bits required to store c-type, c-size and c-base depends on the exact implementation of the framework. In the implementation we evaluate, we assume 3 bits for c-type (allowing 8 possible different compression encodings), 2 bits for c-size (4 possible page sizes: 512B, 1KB, 2KB, 4KB), and 3 bits for c-base (at most eight 512B compressed pages can fit into a 4KB uncompressed slot). Note that existing systems usually have enough unused bits (up to 15 bits in Intel x86-64 systems [15]) in their PTE entries that can be used by LCP without increasing the PTE size.

Figure 5: Page table entry extension (p-base plus c-bit (1b), c-type (3b), c-size (2b), and c-base (3b)).

When a virtual page is compressed (the c-bit is set), all the compressible cache lines within the page are compressed to the same size, say C*. The value of C* is uniquely determined by the compression scheme used to compress the page, i.e., the c-type (Section 4.7 discusses determining the c-type for a page). We next describe the LCP organization in more detail.
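The sketch below (ours, not the paper's code) packs the four per-page fields of Section 4.1 into a 64-bit page table entry alongside p-base, using the bit widths given in the text. The exact bit positions and the 52-bit p-base width are assumptions made only for illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* Compression metadata added to each PTE (Section 4.1).  Field widths follow
 * the text; their placement within otherwise-unused PTE bits is assumed. */
struct lcp_pte {
    uint64_t p_base : 52;  /* base address of the physical page           */
    uint64_t c_bit  : 1;   /* page stored as an LCP?                      */
    uint64_t c_type : 3;   /* compression scheme (8 possible encodings)   */
    uint64_t c_size : 2;   /* LCP size: 512B, 1KB, 2KB, or 4KB            */
    uint64_t c_base : 3;   /* 512B-granularity offset within a 4KB slot   */
    uint64_t unused : 3;
};

static const unsigned lcp_size_bytes[4] = { 512, 1024, 2048, 4096 };

int main(void)
{
    struct lcp_pte pte = { .p_base = 0x12345, .c_bit = 1,
                           .c_type = 2, .c_size = 1, .c_base = 3 };
    printf("compressed=%u, scheme=%u, LCP size=%uB, starts at +%u x 512B\n",
           (unsigned)pte.c_bit, (unsigned)pte.c_type,
           lcp_size_bytes[pte.c_size], (unsigned)pte.c_base);
    return 0;
}
```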
4.2 LCP Organization
We will discuss each of an LCP's three regions in turn. We begin by defining the following symbols: V is the virtual page size of the system (e.g., 4KB); C is the uncompressed cache line size (e.g., 64B); n = V/C is the number of cache lines per virtual page (e.g., 64); and M is the size of LCP's metadata region. In addition, on a per-page basis, we define P to be the compressed physical page size; C* to be the compressed cache line size; and n_avail to be the number of slots available for exceptions.

Figure 6: Physical memory layout with the LCP framework (the page table maps virtual pages to compressed physical pages of 512B, 1KB, 2KB, or 4KB; c-base locates an LCP within a larger slot, e.g., PA1 + 512).

4.2.1 Compressed Data Region
The compressed data region is a contiguous array of n slots each of size C*. Each one of the n cache lines in the virtual page is mapped to one of the slots, irrespective of whether the cache line is compressible or not. Therefore, the size of the compressed data region is nC*. This organization simplifies the computation required to determine the main memory address for the compressed slot corresponding to a cache line. More specifically, the address of the compressed slot for the ith cache line can be computed as p-base + m-size × c-base + (i − 1)C*, where the first two terms correspond to the start of the LCP (m-size equals the minimum page size, 512B in our implementation) and the third indicates the offset within the LCP of the ith compressed slot (see Figure 6). Thus, computing the main memory address of a compressed cache line requires one multiplication (can be implemented as a shift) and two additions independent of i (fixed latency). This computation requires a lower latency and simpler hardware than prior approaches (e.g., up to 22 additions in the design proposed in [10]), thereby efficiently addressing Challenge 3 (cache line address computation).

4.2.2 Metadata Region
The metadata region stores, for each of the n cache lines in the page, a bit (e-bit) indicating whether the line is stored as an exception and an index (e-index, ⌈log₂ n⌉ bits) giving the line's location in the exception storage; it also contains a bit vector with one valid bit (v-bit) per exception slot that is used to manage the exception storage. The size of the metadata region is therefore M = n(1 + ⌈log₂ n⌉) + n bits. Since n is fixed for the entire system, the size of the metadata region (M) is the same for all compressed pages (64B in our implementation).

4.2.3 Exception Storage Region
The third region, the exception storage, is the place where all incompressible cache lines of the page are stored. If a cache line is present in the location e-index in the exception storage, its main memory address can be computed as: p-base + m-size × c-base + nC* + M + e-index × C. The number of slots available in the exception storage (n_avail) is dictated by the size of the compressed physical page allocated by the operating system for the corresponding LCP. The following equation expresses the relation between the physical page size (P), the compressed cache line size (C*) that is determined by c-type, and the number of available slots in the exception storage (n_avail):

    n_avail = ⌊(P − (nC* + M)) / C⌋    (1)

As mentioned before, the metadata region contains a bit vector that is used to manage the exception storage. When the memory controller assigns an exception slot to an incompressible cache line, it sets the corresponding bit in the bit vector to indicate that the slot is no longer free. If the cache line later becomes compressible and no longer requires the exception slot, the memory controller resets the corresponding bit in the bit vector. In the next section, we describe the operating system memory management policy that determines the physical page size (P) allocated for an LCP, and hence, the number of available exception slots (n_avail).
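The following sketch (our own illustration, with symbol names taken from Section 4.2) evaluates the two address formulas and Equation (1); the p-base and c-base values in main() are arbitrary examples.

```c
#include <stdint.h>
#include <stdio.h>

/* Symbols from Section 4.2: n lines per page, C uncompressed line size,
 * M metadata bytes, m-size minimum page size.  C* and P are per-page. */
enum { N_LINES = 64, C_UNCOMP = 64, M_BYTES = 64, M_SIZE = 512 };

/* Address of the compressed slot of the i-th line (i is 1-based, as in the
 * text): p-base + m-size * c-base + (i - 1) * C*. */
static uint64_t slot_addr(uint64_t p_base, uint64_t c_base,
                          unsigned i, unsigned c_star)
{
    return p_base + (uint64_t)M_SIZE * c_base + (uint64_t)(i - 1) * c_star;
}

/* Address of an exception stored at e-index:
 * p-base + m-size * c-base + n * C* + M + e-index * C. */
static uint64_t exception_addr(uint64_t p_base, uint64_t c_base,
                               unsigned c_star, unsigned e_index)
{
    return p_base + (uint64_t)M_SIZE * c_base
         + (uint64_t)N_LINES * c_star + M_BYTES
         + (uint64_t)e_index * C_UNCOMP;
}

/* Equation (1): exception slots that fit in a physical page of P bytes
 * (assumes P >= n * C* + M). */
static unsigned n_avail(unsigned P, unsigned c_star)
{
    return (P - (N_LINES * c_star + M_BYTES)) / C_UNCOMP;
}

int main(void)
{
    unsigned c_star = 16;                       /* 16B target size */
    printf("slot of line 3:        0x%llx\n",
           (unsigned long long)slot_addr(0x100000, 2, 3, c_star));
    printf("exception at index 0:  0x%llx\n",
           (unsigned long long)exception_addr(0x100000, 2, c_star, 0));
    printf("n_avail for a 2KB LCP: %u\n", n_avail(2048, c_star));
    return 0;
}
```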
In a system that employs main memory compression, using the physical (main memory) address to tag cache lines puts the main memory address computation on the critical path of L1 cache access (Challenge 2). To address this challenge, we modify the cache tagging logic to use the tuple <physical page base address, cache line index within the page> for tagging cache lines. This tuple maps to a unique cache line in the system, and hence avoids aliasing problems without requiring the exact main memory address to be computed. The additional index bits are stored within the cache line tag.

4.5 Changes to the Memory Controller
In addition to the changes to the memory controller operation described in Section 3.2, our LCP-based framework requires two hardware structures to be added to the memory controller: (i) a small metadata cache to accelerate main memory lookups in LCP, and (ii) compression/decompression hardware to perform the compression and decompression of cache lines.

4.5.1 Metadata Cache
As described in Section 3.2, a small metadata cache in the memory controller enables our approach, in the common case, to retrieve a compressed cache block in a single main memory access. This cache stores the metadata region of recently accessed LCPs so that the metadata for subsequent accesses to such recently-accessed LCPs can be retrieved directly from the cache. In our study, we find that a small 512-entry metadata cache (32KB⁴) can service 88% of the metadata accesses on average across all our workloads. Some applications have a lower hit rate, especially sjeng and astar [29]. An analysis of these applications reveals that their memory accesses exhibit very low locality. As a result, we also observed a low TLB hit rate for these applications. Because TLB misses are costlier than MD cache misses (the former requires multiple memory accesses), the low MD cache hit rate does not lead to significant performance degradation for these applications.

⁴ We evaluated the sensitivity of performance to MD cache size and find that 32KB is the smallest size that enables our design to avoid most of the performance loss due to additional metadata accesses.

We expect the MD cache power to be much lower than the power consumed by other on-chip structures (e.g., L1 caches), because the MD cache is accessed much less frequently (hits in any on-chip cache do not lead to an access to the MD cache).

4.5.2 Compression/Decompression Hardware
Depending on the compression scheme employed with our LCP-based framework, the memory controller should be equipped with the hardware necessary to compress and decompress cache lines using the corresponding scheme. Although our framework does not impose any restrictions on the nature of the compression algorithm, it is desirable to have compression schemes that have low complexity and decompression latency, e.g., Frequent Pattern Compression (FPC) [2] and Base-Delta-Immediate Compression (BDI) [25]. In Section 4.7, we provide more details on how to adapt any compression algorithm to fit the requirements of LCP and also the specific changes we made to FPC and BDI as case studies of compression algorithms that we adapted to the LCP framework.

4.6 Handling Page Overflows
As described in Section 3.2, when a cache line is written back to main memory, the cache line may switch from being compressible to being incompressible. When this happens, the memory controller should explicitly find a slot in the exception storage for the uncompressed cache line. However, it is possible that all the slots in the exception storage are already used by other exceptions in the LCP. We call this scenario a page overflow. A page overflow increases the size of the LCP and leads to one of two scenarios: (i) the LCP still requires a physical page size that is smaller than the uncompressed virtual page size (type-1 page overflow), and (ii) the LCP now requires a physical page size that is larger than the uncompressed virtual page size (type-2 page overflow).
Type-1 page overflow simply requires the operating system to migrate the LCP to a physical page of larger size (without recompression). The OS first allocates a new page and copies the data from the old location to the new location. It then modifies the mapping for the virtual page to point to the new location. While in transition, the page is locked, so any memory request to this page is delayed. In our evaluations, we stall the application for 20,000 cycles⁵ when a type-1 overflow occurs; we also find that (on average) type-1 overflows happen less than once per two million instructions. We vary this latency between 10,000–100,000 cycles and observe that the benefits of our framework (e.g., bandwidth compression) far outweigh the overhead due to type-1 overflows.

⁵ To fetch a 4KB page, we need to access 64 cache lines (64 bytes each). In the worst case, this will lead to 64 accesses to main memory, most of which are likely to be DRAM row-buffer hits. Since a row-buffer hit takes 7.5ns, the total time to fetch the page is 495ns. On the other hand, the latency penalty of two context switches (into the OS and out of the OS) is around 4µs [20]. Overall, a type-1 overflow takes around 4.5µs. For a 4.4GHz or slower processor, this is less than 20,000 cycles.

In a type-2 page overflow, the size of the LCP exceeds the uncompressed virtual page size. Therefore, the OS attempts to recompress the page, possibly using a different encoding (c-type). Depending on whether the page is compressible or not, the OS allocates a new physical page to fit the LCP or the uncompressed page, and migrates the data to the new location. The OS also appropriately modifies the c-bit, c-type and the c-base in the corresponding page table entry. Clearly, a type-2 overflow requires more work from the OS than a type-1 overflow. However, we expect page overflows of type-2 to occur rarely. In fact, we never observed a type-2 overflow in our evaluations.
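A minimal sketch of the overflow-handling decision just described (our own illustration, not from the paper; the sizes in main() are made up, and the size classes follow Section 4.1).

```c
#include <stdio.h>

enum { VIRTUAL_PAGE_BYTES = 4096 };

/* Writeback of a line that became incompressible: if a free exception slot
 * exists, there is no overflow; otherwise the OS migrates the page.  A
 * type-1 overflow keeps the LCP below the uncompressed page size; a type-2
 * overflow does not, and triggers recompression or an uncompressed page. */
static const char *on_incompressible_writeback(unsigned free_exception_slots,
                                               unsigned grown_lcp_bytes)
{
    if (free_exception_slots > 0)
        return "store line in a free exception slot, update metadata";
    if (grown_lcp_bytes < VIRTUAL_PAGE_BYTES)
        return "type-1 overflow: OS migrates LCP to a larger physical page";
    return "type-2 overflow: OS recompresses (new c-type) or stores page uncompressed";
}

int main(void)
{
    printf("%s\n", on_incompressible_writeback(2, 1200));
    printf("%s\n", on_incompressible_writeback(0, 1536));
    printf("%s\n", on_incompressible_writeback(0, 4160));
    return 0;
}
```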
4.6.1 Avoiding Recursive Page Faults
There are two types of pages that require special consideration: (i) pages that keep internal OS data structures, e.g., pages containing information required to handle page faults, and (ii) shared data pages that have more than one page table entry (PTE) mapping to the same physical page. Compressing pages of the first type can potentially lead to recursive page fault handling. The problem can be avoided if the OS sets a special do not compress bit, e.g., as a part of the page compression encoding, so that the memory controller does not compress these pages. The second type of pages (shared pages) requires consistency across multiple page table entries, such that when one PTE's compression information changes, the second entry is updated as well. There are two possible solutions to this problem. First, as with the first type of pages, these pages can be marked as do not compress. Second, the OS could maintain consistency of the shared PTEs by performing multiple synchronous PTE updates (with accompanying TLB shootdowns). While the second solution can potentially lead to better average compressibility, the first solution (used in our implementation) is simpler and requires minimal changes inside the OS.
Another situation that can potentially lead to a recursive fault is the eviction of dirty cache lines from the LLC to DRAM due to some page overflow handling that leads to another overflow. In order to solve this problem, we assume that the memory controller has a small dedicated portion of the main memory that is used as a scratchpad to store cache lines needed to perform page overflow handling. Dirty cache lines that are evicted from the LLC to DRAM due to OS overflow handling are stored in this buffer space. The OS is responsible for minimizing the memory footprint of the overflow handler. Note that this situation is expected to be very rare in practice, because even a single overflow is infrequent.

4.7 Compression Algorithms
Our LCP-based main memory compression framework can be employed with any compression algorithm. In this section, we describe how to adapt a generic compression algorithm to fit the requirements of the LCP framework. Subsequently, we describe how to adapt the two compression algorithms used in our evaluation.

4.7.1 Adapting a Compression Algorithm to Fit LCP
Every compression scheme is associated with a compression function, fc, and a decompression function, fd. To compress a virtual page into the corresponding LCP using the compression scheme, the memory controller carries out three steps. In the first step, the controller compresses every cache line in the page using fc and feeds the sizes of each compressed cache line to the second step. In the second step, the controller computes the total compressed page size (compressed data + metadata + exceptions, using the formulas from Section 4.2) for each of a fixed set of target compressed cache line sizes and selects a target compressed cache line size C* that minimizes the overall LCP size. In the third and final step, the memory controller classifies any cache line whose compressed size is less than or equal to the target size as compressible and all other cache lines as incompressible (exceptions). The memory controller uses this classification to generate the corresponding LCP based on the organization described in Section 3.1.
To decompress a compressed cache line of the page, the memory controller reads the fixed-target-sized compressed data and feeds it to the hardware implementation of function fd.

4.7.2 FPC and BDI Compression Algorithms
Although any compression algorithm can be employed with our framework using the approach described above, it is desirable to use compression algorithms that have low-complexity hardware implementations and low decompression latency, so that the overall complexity and latency of the design are minimized. For this reason, we adapt to fit our LCP framework two state-of-the-art compression algorithms that achieve such design points in the context of compressing in-cache data: (i) Frequent Pattern Compression [2], and (ii) Base-Delta-Immediate Compression [25].
Frequent Pattern Compression (FPC) is based on the observation that a majority of the words accessed by applications fall under a small set of frequently occurring patterns [3]. FPC compresses each cache line one word at a time. Therefore, the final compressed size of a cache line is dependent on the individual words within the cache line. To minimize the time to perform the compression search procedure described in Section 4.7.1, we limit the search to four different target cache line sizes: 16B, 21B, 32B and 44B (similar to the fixed sizes used in [10]).
Base-Delta-Immediate (BDI) Compression is based on the observation that in most cases, words co-located in memory have small differences in their values, a property referred to as low dynamic range [25]. BDI encodes cache lines with such low dynamic range using a base value and an array of differences (∆s) of words within the cache line from either the base value or from zero. The size of the final compressed cache line depends only on the size of the base and the size of the ∆s. To employ BDI within our framework, the memory controller attempts to compress a page with different versions of the Base-Delta encoding as described by Pekhimenko et al. [25] and then chooses the combination that minimizes the final compressed page size (according to the search procedure in Section 4.7.1).

5. LCP OPTIMIZATIONS
In this section, we describe two simple optimizations to our proposed LCP-based framework: (i) memory bandwidth reduction via compressed cache lines, and (ii) exploiting zero pages and cache lines for higher bandwidth utilization.

5.1 Enabling Memory Bandwidth Reduction
One potential benefit of main memory compression that has not been examined in detail by prior work on memory compression is bandwidth reduction.⁶ When cache lines are stored in compressed format in main memory, multiple consecutive compressed cache lines can be retrieved at the cost of retrieving a single uncompressed cache line. For example, when cache lines of a page are compressed to 1/4 their original size, four compressed cache lines can be retrieved at the cost of a single uncompressed cache line access. This can significantly reduce the bandwidth requirements of applications, especially those with good spatial locality. We propose two mechanisms that exploit this idea.

⁶ Prior work [11, 27, 32, 36] looked at the possibility of using compression for bandwidth reduction between the memory controller and DRAM. While significant reduction in bandwidth consumption is reported, prior work achieves this reduction either at the cost of increased memory access latency [11, 32, 36], as they have to both compress and decompress a cache line for every request, or based on a specialized main memory design [27], e.g., GDDR3 [16].

In the first mechanism, when the memory controller needs to access a cache line in the compressed data region of an LCP, it obtains the data from multiple consecutive compressed slots (which add up to the size of an uncompressed cache line). However, some of the cache lines that are retrieved in this manner may not be valid. To determine if an additionally-fetched cache line is valid or not, the memory controller consults the metadata corresponding to the LCP. If a cache line is not valid, then the corresponding data is not decompressed. Otherwise, the cache line is decompressed and then stored in the cache.
The second mechanism is an improvement over the first mechanism, where the memory controller additionally predicts whether the additionally-fetched cache lines are useful for the application. For this purpose, the memory controller uses hints from a multi-stride prefetcher [14]. In this mechanism, if the stride prefetcher suggests that an additionally-fetched cache line is part of a useful stream, then the memory controller stores that cache line in the cache. This approach has the potential to prevent cache lines that are not useful from polluting the cache. Section 7.5 shows the effect of this approach on both performance and bandwidth consumption.
Note that prior work [11, 27, 32, 36] assumed that when a cache line is compressed, only the compressed amount of data can be transferred over the DRAM bus, thereby freeing the bus for future accesses. Unfortunately, modern DRAM chips are optimized for full cache block accesses [38], so they would need to be modified to support such smaller granularity transfers. Our proposal does not require modifications to DRAM itself or the use of specialized DRAM such as GDDR3 [16].
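The first mechanism of Section 5.1 can be sketched as follows (our own illustration, not the paper's code): one access of uncompressed-line size covers 64/C* consecutive compressed slots, and each covered slot is used only if the LCP metadata marks it valid. The validity pattern in main() is made up for the example.

```c
#include <stdbool.h>
#include <stdio.h>

enum { UNCOMP_LINE = 64 };

/* One 64B DRAM access starting at line `first` covers 64/C* consecutive
 * compressed slots.  Each covered slot is decompressed and inserted into
 * the cache only if the LCP metadata marks it valid (i.e., the line is not
 * stored as an exception). */
static void fetch_burst(unsigned first, unsigned c_star,
                        const bool *valid, unsigned lines_in_page)
{
    unsigned covered = UNCOMP_LINE / c_star;      /* e.g., 4 when C* = 16B */
    for (unsigned i = first; i < first + covered && i < lines_in_page; i++) {
        if (valid[i])
            printf("line %u: decompress and insert into cache\n", i);
        else
            printf("line %u: slot invalid (exception), discard\n", i);
    }
}

int main(void)
{
    bool valid[64];
    for (unsigned i = 0; i < 64; i++)
        valid[i] = (i != 9);                      /* line 9 is an exception */
    fetch_burst(8, 16, valid, 64);                /* covers lines 8..11 */
    return 0;
}
```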
5.2 Zero Pages and Zero Cache Lines
Prior work [2, 9, 10, 25, 37] observed that in-memory data contains a significant number of zeros at two granularities: all-zero pages and all-zero cache lines. Because this pattern is quite common, we propose two changes to the LCP framework to more efficiently compress such occurrences of zeros. First, one value of the page compression encoding (e.g., c-type of 0) is reserved to indicate that the entire page is zero. When accessing data from a page with c-type = 0, the processor can avoid any LLC or DRAM access by simply zeroing out the allocated cache line in the L1 cache. Second, to compress all-zero cache lines more efficiently, we can add another bit per cache line to the first part of the LCP metadata. This bit, which we call the z-bit, indicates if the corresponding cache line is zero. Using this approach, the memory controller does not require any main memory access to retrieve a cache line with the z-bit set (assuming a metadata cache hit).

6. METHODOLOGY
Our evaluations use an in-house, event-driven 32-bit x86 simulator whose front-end is based on Simics [22]. All configurations have private L1 caches and shared L2 caches. Major simulation parameters are provided in Table 1. We use benchmarks from the SPEC CPU2006 suite [29], four TPC-H/TPC-C queries [33], and an Apache web server. All results are collected by running a representative portion (based on PinPoints [24]) of the benchmarks for 1 billion instructions. We build our energy model based on McPAT [21], CACTI [31], C-Pack [5], and the Synopsys Design Compiler with a 65nm library (to evaluate the energy of compression/decompression with BDI and address calculation in [10]).

Table 1: Major Parameters of the Simulated Systems.
  Processor: 1–4 cores, 4GHz, x86 in-order
  L1-D cache: 32KB, 64B cache line, 2-way, 1-cycle latency
  L2 cache: 2MB, 64B cache line, 16-way, 20-cycle latency
  Main memory: 2GB, 4 banks, 8KB row buffers, 1 memory channel, DDR3-1066 [23]
  LCP design: type-1 overflow penalty of 20,000 cycles

Metrics. We measure the performance of our benchmarks using IPC (instructions per cycle) and effective compression ratio (effective DRAM size increase, e.g., a compression ratio of 1.5 for 2GB DRAM means that the compression scheme achieves the size benefits of a 3GB DRAM). For multi-programmed workloads we use the weighted speedup [28] performance metric: ∑i (IPCi_shared / IPCi_alone). For bandwidth consumption we use BPKI (bytes transferred over the memory bus per thousand instructions [30]).
Parameters of the Evaluated Schemes. As reported in the respective previous works, we used a decompression latency of 5 cycles for FPC and 1 cycle for BDI.

7. RESULTS
In our experiments for both single-core and multi-core systems, we compare five different designs that employ different main memory compression strategies (frameworks) and different compression algorithms: (i) the Baseline system with no compression, (ii) robust main memory compression (RMC-FPC) [10], (iii) and (iv) the LCP framework with the FPC and BDI compression algorithms (LCP-FPC and LCP-BDI), and (v) MXT [1]. Note that it is fundamentally possible to build an RMC-BDI design as well, but we found that it leads to either low energy efficiency (due to an increase in the BST metadata table entry size [10] with many more encodings in BDI) or low compression ratio (when the number of encodings is artificially decreased). Hence, for brevity, we exclude this potential design from our experiments.
In addition, we evaluate two hypothetical designs: Zero Page Compression (ZPC) and Lempel-Ziv (LZ)⁷ to show some practical upper bounds on main memory compression. Table 2 summarizes all the designs.

⁷ Our implementation of LZ performs compression at 4KB page granularity and serves as an idealized upper bound for the in-memory compression ratio. In contrast, MXT employs Lempel-Ziv at 1KB granularity.

Table 2: List of evaluated designs.
  Baseline: no framework, no compression algorithm
  RMC-FPC: RMC framework [10], FPC algorithm [2]
  LCP-FPC: LCP framework, FPC algorithm [2]
  LCP-BDI: LCP framework, BDI algorithm [25]
  MXT: MXT framework [1], Lempel-Ziv algorithm [40]
  ZPC: no framework, Zero Page Compression
  LZ: no framework, Lempel-Ziv algorithm [40]

7.1 Effect on DRAM Capacity
Figure 8 compares the compression ratio of all the designs described in Table 2. We draw two major conclusions. First, as expected, MXT, which employs the complex LZ algorithm, has the highest average compression ratio (2.30) of all practical designs and performs closely to our idealized LZ implementation (2.60). At the same time, LCP-BDI provides a reasonably high compression ratio (1.62 on average), outperforming RMC-FPC (1.59) and LCP-FPC (1.52). (Note that LCP could be used with both BDI and FPC algorithms together, and the average compression ratio in this case is as high as 1.69.)
Second, while the average compression ratio of ZPC is relatively low (1.29), it greatly improves the effective memory capacity for a number of applications (e.g., GemsFDTD, zeusmp, and cactusADM). This justifies our design decision of handling zero pages at the TLB-entry level. We conclude that our LCP framework achieves the goal of high compression ratio.

Figure 8: Main memory compression ratio (ZPC, RMC-FPC, LCP-FPC, LCP-BDI, MXT, and LZ across all evaluated benchmarks).

7.1.1 Distribution of Compressed Pages
The primary reason why applications have different compression ratios is the redundancy difference in their data. This leads to the situation where every application has its own distribution of compressed pages with different sizes (0B, 512B, 1KB, 2KB, 4KB). Figure 9 shows these distributions for the applications in our study when using the LCP-BDI design. As we can see, the percentage of memory pages of every size in fact significantly varies between the applications, leading to different compression ratios (shown in Figure 8). For example, cactusADM has a high compression ratio due to many 0B and 512B pages (there is a significant number of zero cache lines in its data), while astar and h264ref get most of their compression with 2KB pages due to cache lines with low dynamic range [25].

7.1.2 Compression Ratio over Time
To estimate the efficiency of LCP-based compression over time, we conduct an experiment where we measure the compression ratios of our applications every 100 million instructions (for a total period of 5 billion instructions). The key observation we make is that the compression ratio for most of the applications is stable over time (the difference between the highest and the lowest ratio is within 10%). Figure 10 shows all notable outliers from this observation: astar, cactusADM, h264ref, and zeusmp. Even for these applications, the compression ratio stays relatively constant for a long
Figure 9 shows the distribution of compressed pages of different sizes (0B, 512B, 1KB, 2KB, 4KB) for the applications in our study when using the LCP-BDI design. As we can see, the percentage of memory pages of every size in fact significantly varies between the applications, leading to different compression ratios (shown in Figure 8). For example, cactusADM has a high compression ratio due to many 0B and 512B pages (there is a significant number of zero cache lines in its data), while astar and h264ref get most of their compression with 2KB pages due to cache lines with low dynamic range [25].

Figure 9: Fraction of pages of each compressed size (0B, 512B, 1KB, 2KB, and uncompressed 4KB) with LCP-BDI.

7.1.2 Compression Ratio over Time

To estimate the efficiency of LCP-based compression over time, we conduct an experiment where we measure the compression ratios of our applications every 100 million instructions (for a total period of 5 billion instructions). The key observation we make is that the compression ratio for most of the applications is stable over time (the difference between the highest and the lowest ratio is within 10%). Figure 10 shows all notable outliers from this observation: astar, cactusADM, h264ref, and zeusmp. Even for these applications, the compression ratio stays relatively constant for a long period of time, although there are some noticeable fluctuations in compression ratio (e.g., for astar at around 4 billion instructions, for cactusADM at around 500M instructions). We attribute this behavior to a phase change within an application that sometimes leads to changes in the application's data. Fortunately, these cases are infrequent and do not have a noticeable effect on the application's performance (as we describe in Section 7.2). We conclude that the capacity benefits provided by the LCP-based frameworks are usually stable over long periods of time.

Figure 10: Compression ratio over time with LCP-BDI (sampled every 100 million instructions; curves for astar, cactusADM, h264ref, and zeusmp).
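The bookkeeping behind this measurement is simple; the sketch below is our own illustration with placeholder names, sampling the compression ratio once per 100-million-instruction window from the compressed sizes of the resident pages.

```python
# Illustrative sketch of the periodic compression-ratio measurement.
# compressed_sizes: one entry per resident physical page, giving its size
# after compression (e.g., 0, 512, 1024, 2048, or 4096 bytes under LCP-BDI).

PAGE_SIZE = 4096           # uncompressed page size in bytes
WINDOW = 100_000_000       # retired instructions between samples

def compression_ratio(compressed_sizes):
    """Uncompressed footprint divided by compressed footprint."""
    uncompressed = len(compressed_sizes) * PAGE_SIZE
    compressed = sum(compressed_sizes)
    return uncompressed / max(compressed, 1)   # guard against an all-zero set

def maybe_sample(retired_instructions, compressed_sizes, samples):
    """Append one compression-ratio sample every WINDOW retired instructions."""
    if retired_instructions % WINDOW == 0:
        samples.append(compression_ratio(compressed_sizes))
```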
7.2 Effect on Performance

Main memory compression can improve performance in two major ways: (i) reduced memory bandwidth requirements, which can enable less contention on the main memory bus, an increasingly important bottleneck in systems, and (ii) reduced memory footprint, which can reduce long-latency disk accesses. We evaluate the performance improvement due to memory bandwidth reduction (including our optimizations for compressing zero values described in Section 5.2) in Sections 7.2.1 and 7.2.2. We also evaluate the decrease in page faults in Section 7.2.3.

7.2.1 Single-Core Results

Figure 11 shows the performance of single-core workloads using three key evaluated designs (RMC-FPC, LCP-FPC, and LCP-BDI) normalized to the Baseline. Compared against an uncompressed system (Baseline), the LCP-based designs (LCP-BDI and LCP-FPC) improve performance by 6.1%/5.2% and also outperform RMC-FPC.8 We conclude that our LCP framework is effective in improving performance by compressing main memory.

8 Note that in order to provide a fair comparison, we enhanced the RMC-FPC approach with the same optimizations we did for LCP, e.g., bandwidth compression. The original RMC-FPC design reported an average degradation in performance [10].

Figure 11: Performance comparison (IPC) of different compressed designs for the single-core system. (Bars: RMC-FPC, LCP-FPC, LCP-BDI; y-axis: normalized IPC.)

Note that LCP-FPC outperforms RMC-FPC on average despite having a slightly lower compression ratio, mostly due to LCP's lower metadata access overhead (on a metadata miss, RMC-FPC must retrieve its BST entry from a separate main memory location, while the LCP-based framework performs two accesses to the same main memory page that can be pipelined). This is especially noticeable in several applications, e.g., astar, milc, and xalancbmk that have low metadata table (BST) hit rates (LCP can also degrade performance for these applications). We conclude that our LCP framework is more effective in improving performance than RMC [10].
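The metadata-overhead difference noted above ultimately stems from how a cache line's location in main memory is computed. The sketch below is our own simplified illustration: it ignores LCP's metadata and exception regions and uses illustrative names, contrasting the linear computation enabled by a single per-page compressed line size with the per-line lookup that a variable-size layout requires.

```python
# Simplified illustration only; not the exact hardware layout or logic.

def lcp_line_address(page_base, line_index, compressed_line_size):
    # With one compressed size per page, the location of any line is a
    # simple base + index * size computation (no per-line table walk).
    return page_base + line_index * compressed_line_size

def variable_size_line_address(page_base, line_index, per_line_sizes):
    # With per-line compressed sizes, the controller must first obtain the
    # sizes of all preceding lines (e.g., from a block-size/metadata table)
    # and add them up before it can issue the access.
    return page_base + sum(per_line_sizes[:line_index])
```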
7.2.2 Multi-Core Results

When the system has a single core, the memory bandwidth pressure may not be large enough to take full advantage of the bandwidth benefits of main memory compression. However, in a multi-core system where multiple applications are running concurrently, savings in bandwidth (reduced number of memory bus transfers) may significantly increase the overall system performance.

To study this effect, we conducted experiments using 100 randomly generated multiprogrammed mixes of applications (for both 2-core and 4-core workloads). Our results show that the bandwidth benefits of memory compression are indeed more pronounced for multi-core workloads. Using our LCP-based design, LCP-BDI, the average performance improvement (normalized to the performance of the Baseline system without compression) is 13.9% for 2-core workloads and 10.7% for 4-core workloads. We summarize our multi-core performance results in Figure 12a.
We also vary the last-level cache size (1MB – 16MB) for both single core and multi-core systems across all evaluated workloads. We find that LCP-based designs outperform the Baseline across all evaluated systems (average performance improvement for single-core varies from 5.1% to 13.4%), even when the L2 cache size of the system is as large as 16MB.

Figure 12: Performance (with 2 GB DRAM) and number of page faults (varying DRAM size) using LCP-BDI. (a) Average performance improvement (weighted speedup) vs. number of cores (1, 2, 4). (b) Number of page faults vs. DRAM size (256MB–1GB), normalized to Baseline with 256MB.

7.2.3 Effect on the Number of Page Faults

Modern systems are usually designed such that concurrently-running applications have enough main memory to avoid most of the potential capacity page faults. At the same time, if the applications' total working set size exceeds the main memory capacity, the increased number of page faults can significantly affect performance. To study the effect of the LCP-based framework (LCP-BDI) on the number of page faults, we evaluate twenty randomly generated 16-core multiprogrammed mixes of applications from our benchmark set. We also vary the main memory capacity from 256MB to 1GB (larger memories usually lead to almost no page faults for these workload simulations). Our results (Figure 12b) show that the LCP-based framework (LCP-BDI) can decrease the number of page faults by 21% on average (for 1GB DRAM) when compared with the Baseline design with no compression. We conclude that the LCP-based framework can significantly decrease the number of page faults, and hence improve system performance beyond the benefits it provides due to reduced bandwidth.

7.3 Effect on Bus Bandwidth and Memory Subsystem Energy

When DRAM pages are compressed, the traffic between the LLC and DRAM can be reduced. This can have two positive effects: (i) reduction in the average latency of memory accesses, which can lead to improvement in the overall system performance, and (ii) decrease in the bus energy consumption due to the decrease in the number of transfers.

Figure 13 shows the reduction in main memory bandwidth between LLC and DRAM (in terms of bytes per kilo-instruction, normalized to the Baseline system with no compression) using different compression designs. The key observation we make from this figure is that there is a strong correlation between bandwidth compression and performance improvement (Figure 11). Applications that show a significant reduction in bandwidth consumption (e.g., GemsFDTD, cactusADM, soplex, zeusmp, leslie3d, and the four tpc queries) also see large performance improvements. There are some noticeable exceptions to this observation, e.g., h264ref, wrf and bzip2. Although the memory bus traffic is compressible in these applications, main memory bandwidth is not the bottleneck for their performance.

Figure 13: Effect of different main memory compression schemes on memory bandwidth. (Bars: RMC-FPC, LCP-FPC, LCP-BDI; y-axis: BPKI normalized to Baseline.)

Figure 14 shows the reduction in memory subsystem energy of three systems that employ main memory compression—RMC-FPC, LCP-FPC, and LCP-BDI—normalized to the energy of Baseline. The memory subsystem energy includes the static and dynamic energy consumed by caches, TLBs, memory transfers, and DRAM, plus the energy of additional components due to main memory compression: BST [10], MD cache, address calculation, and compressor/decompressor units. Two observations are in order.

Figure 14: Effect of different main memory compression schemes on memory subsystem energy. (Bars: RMC-FPC, LCP-FPC, LCP-BDI; y-axis: energy normalized to Baseline.)

First, our LCP-based designs (LCP-BDI and LCP-FPC) improve the memory subsystem energy by 5.2% / 3.4% on average over the Baseline design with no compression, and by 11.3% / 9.5% over the state-of-the-art design (RMC-FPC) based on [10]. This is especially noticeable for bandwidth-limited applications, e.g., zeusmp and cactusADM. We conclude that our framework for main memory compression enables significant energy savings, mostly due to the decrease in bandwidth consumption.

Second, RMC-FPC consumes significantly more energy than Baseline (6.1% more energy on average, as high as 21.7% for dealII). The primary reason for this energy consumption increase is the physical address calculation that RMC-FPC speculatively performs on every L1 cache miss (to avoid increasing the memory latency due to complex address calculations). The second reason is the frequent (every L1 miss) accesses to the BST table (described in Section 2) that holds the address calculation information.

Note that other factors, e.g., compression/decompression energy overheads or different compression ratios, are not the reasons for this energy consumption increase. LCP-FPC uses the same compression algorithm as RMC-FPC (and even has a slightly lower compression ratio), but does not increase energy consumption—in fact, LCP-FPC improves the energy consumption due to its decrease in consumed bandwidth. We conclude that our LCP-based framework is a more energy-efficient main memory compression framework than previously proposed designs such as RMC-FPC.
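For reference, the comparison in Figure 14 is an accounting of per-component energies of this kind; in the sketch below the component names mirror the list above, but the per-event energy values are placeholders rather than the numbers used in our evaluation.

```python
# Simplified sketch of memory-subsystem energy accounting (placeholder
# per-event energies in nanojoules; the real values depend on the modeled
# hardware and are not reproduced here).

PER_EVENT_NJ = {
    "cache_access": 0.5,
    "tlb_access": 0.05,
    "bus_transfer": 5.0,
    "dram_access": 20.0,
    # components added by main memory compression:
    "bst_access": 1.0,            # RMC-FPC metadata table
    "md_cache_access": 0.2,       # LCP metadata cache
    "address_calculation": 0.1,
    "compression_decompression": 2.0,
}

def memory_subsystem_energy(event_counts, static_power_watts, runtime_seconds):
    """Dynamic (per-event) energy plus static energy, in nanojoules."""
    dynamic = sum(PER_EVENT_NJ[event] * count
                  for event, count in event_counts.items())
    static = static_power_watts * runtime_seconds * 1e9   # W * s -> nJ
    return dynamic + static
```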
7.4 Analysis of LCP Parameters

7.4.1 Analysis of Page Overflows

As described in Section 4.6, page overflows can stall an application for a considerable duration. As we mentioned in that section, we did not encounter any type-2 overflows (the more severe type) in our simulations. Figure 15 shows the number of type-1 overflows per instruction. The y-axis uses a log-scale as the number of overflows per instruction is very small. As the figure shows, on average, less than one type-1 overflow occurs every one million instructions.
Although such overflows are more frequent for some applications (e.g., soplex and the three tpch queries), our evaluations show that this does not degrade performance in spite of adding a 20,000 cycle penalty for each type-1 page overflow.9 In fact, these applications gain significant performance from our LCP design. The main reason for this is that the performance benefits of bandwidth reduction far outweigh the performance degradation due to type-1 overflows. We conclude that page overflows do not prevent the proposed LCP framework from providing good overall performance.
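As a rough back-of-the-envelope bound (our own arithmetic, not a number reported in the evaluation), one type-1 overflow per million instructions with a 20,000-cycle penalty adds at most

\[
\frac{20{,}000~\text{cycles}}{1{,}000{,}000~\text{instructions}} = 0.02~\text{cycles per instruction},
\]

which is small relative to typical per-instruction execution times and is consistent with the negligible performance impact observed above.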
Figure 15: Type-1 page overflows for different applications. (y-axis: type-1 overflows per instruction, log scale.)
7.4.2 Number of Exceptions

The number of exceptions (uncompressed cache lines) in the LCP framework is critical for two reasons. First, it determines the size of the physical page required to store the LCP. The higher the number of exceptions, the larger the required physical page size. Second, it can affect an application's performance as exceptions require three main memory accesses on an MD cache miss (Section 3.2). We studied the average number of exceptions (across all compressed pages) for each application. Figure 16 shows the results of these studies.

The number of exceptions varies from as low as 0.02/page for GemsFDTD to as high as 29.2/page in milc (17.3/page on average). The average number of exceptions has a visible impact on the compression ratio of applications (Figure 8). An application with a high compression ratio also has relatively few exceptions per page. Note that we do not restrict the number of exceptions in an LCP. As long as an LCP fits into a physical page not larger than the uncompressed page size (i.e., 4KB in our system), it will be stored in compressed form irrespective of how large the number of exceptions is. This is why applications like milc have a large number of exceptions per page. We note that better performance is potentially achievable by either statically or dynamically limiting the number of exceptions per page—a complete evaluation of the design space is a part of our future work.
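To make the relationship between exceptions and physical page size concrete, the sketch below estimates the storage needed for a page; the layout constants are placeholders rather than the exact LCP layout, but they capture why a page with many exceptions can still fit within 4KB when its lines compress well.

```python
# Illustrative check of whether a page can be kept in compressed (LCP) form.
# The layout constants below are placeholders and do not reproduce the exact
# LCP layout; the point is only that the required physical page size grows
# with the number of exceptions.

UNCOMPRESSED_PAGE = 4096        # bytes
LINES_PER_PAGE = 64             # 64-byte cache lines per 4KB page
EXCEPTION_SLOT = 64             # an exception stores the line uncompressed
METADATA_BYTES = 64             # placeholder for per-page metadata

def lcp_footprint(compressed_line_size, num_exceptions):
    """Bytes needed to store one page in LCP form (simplified)."""
    return (LINES_PER_PAGE * compressed_line_size
            + METADATA_BYTES
            + num_exceptions * EXCEPTION_SLOT)

def stored_compressed(compressed_line_size, num_exceptions):
    """A page stays compressed as long as it fits in an uncompressed page."""
    return lcp_footprint(compressed_line_size, num_exceptions) <= UNCOMPRESSED_PAGE

print(stored_compressed(16, 29))   # True: many exceptions fit if lines compress well
print(stored_compressed(32, 40))   # False: larger lines tolerate far fewer exceptions
```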
Figure 16: Average number of exceptions per compressed page for each application.

7.5 Comparison to Stride Prefetching

Because the LCP-based designs can bring multiple compressed cache lines into the LLC with a single memory transfer, we compare our LCP-based design to a system that employs a stride prefetcher implemented as described in [14]. Figures 17 and 18 compare the performance and bandwidth consumption of three systems: (i) one that employs stride prefetching, (ii) one that employs LCP-BDI, and (iii) one that employs LCP-BDI along with hints from a prefetcher to avoid cache pollution due to bandwidth compression (Section 5.1). Two conclusions are in order.

First, our LCP-based designs (second and third bars) are competitive with the more general stride prefetcher for all but a few applications. A stride prefetcher can sometimes increase the memory bandwidth consumption of an application; LCP, on the other hand, obtains the benefits of prefetching without increasing (in fact, while significantly reducing) memory bandwidth consumption.

Second, the effect of using prefetcher hints to avoid cache pollution is not significant. The reason for this is that our systems employ a large, highly-associative LLC (2MB 16-way) which is less susceptible to cache pollution. Evicting the LRU lines from such a cache has little effect on performance, but we did observe the benefits of this mechanism on multi-core systems with shared caches (up to 5% performance improvement for some two-core workload mixes—not shown).

Figure 17: Performance comparison with stride prefetching, and using prefetcher hints with the LCP framework. (Bars: Stride Prefetching, LCP-BDI, LCP-BDI + Prefetching hints; y-axis: normalized IPC.)

Figure 18: Bandwidth comparison with stride prefetching. (Same three systems; y-axis: normalized BPKI.)

8. CONCLUSION

Data compression is a promising technique to increase the effective main memory capacity without significantly increasing cost or power. In this paper, we presented Linearly Compressed Pages (LCP), a new framework for low-complexity, low-latency main memory compression.
We evaluated the LCP-based framework using two state-of-the-art compression algorithms (Frequent Pattern Compression and Base-Delta-Immediate Compression) and showed that it can significantly increase effective memory capacity (by 69%) and reduce page fault rate (by 23%). We showed that storing compressed data in main memory can also enable the memory controller to reduce memory bandwidth consumption (by 24%), leading to significant performance and energy improvements on a wide variety of single-core and multi-core systems with different cache sizes. Based on our results, we conclude that the proposed LCP-based framework provides an effective approach for designing low-complexity and low-latency compressed main memory.

Acknowledgments

Many thanks to Brian Hirano, Kayvon Fatahalian, David Hansquine and Karin Strauss for their feedback during various stages of this project. We thank the anonymous reviewers and our shepherd Andreas Moshovos for their feedback. We acknowledge members of the SAFARI and LBA groups for their feedback and for the stimulating research environment they provide. We acknowledge the support of AMD, IBM, Intel, Oracle, Samsung and Microsoft. This research was partially supported by NSF (CCF-0953246, CCF-1147397, CCF-1212962), the Intel University Research Office Memory Hierarchy Program, the Intel Science and Technology Center for Cloud Computing, the Semiconductor Research Corporation and a Microsoft Research Fellowship.

References

[1] B. Abali et al. Memory Expansion Technology (MXT): Software Support and Performance. IBM J. Res. Dev., 2001.
[2] A. R. Alameldeen and D. A. Wood. Adaptive Cache Compression for High-Performance Processors. In ISCA-31, 2004.
[3] A. R. Alameldeen and D. A. Wood. Frequent Pattern Compression: A Significance-Based Compression Scheme for L2 Caches. Tech. Rep., 2004.
[4] E. D. Berger. Memory Management for High-Performance Applications. PhD thesis, 2002.
[5] X. Chen et al. C-Pack: A High-Performance Microprocessor Cache Compression Algorithm. IEEE Transactions on VLSI Systems, 2010.
[6] E. Cooper-Balis, P. Rosenfeld, and B. Jacob. Buffer-On-Board Memory Systems. In ISCA, 2012.
[7] R. S. de Castro, A. P. do Lago, and D. Da Silva. Adaptive Compressed Caching: Design and Implementation. In SBAC-PAD, 2003.
[8] F. Douglis. The Compression Cache: Using On-line Compression to Extend Physical Memory. In Winter USENIX Conference, 1993.
[9] J. Dusser et al. Zero-Content Augmented Caches. In ICS, 2009.
[10] M. Ekman and P. Stenström. A Robust Main-Memory Compression Scheme. In ISCA-32, 2005.
[11] M. Farrens and A. Park. Dynamic Base Register Caching: A Technique for Reducing Address Bus Width. In ISCA, 1991.
[12] E. G. Hallnor and S. K. Reinhardt. A Unified Compressed Memory Hierarchy. In HPCA-11, 2005.
[13] D. Huffman. A Method for the Construction of Minimum-Redundancy Codes. IRE, 1952.
[14] S. Iacobovici et al. Effective Stream-Based and Execution-Based Data Prefetching. In ICS, 2004.
[15] Intel Corporation. Intel 64 and IA-32 Architectures Software Developer's Manual, 2013.
[16] JEDEC. GDDR3 Specific SGRAM Functions, JESD21-C, 2012.
[17] U. Kang et al. 8Gb 3D DDR3 DRAM Using Through-Silicon-Via Technology. In ISSCC, 2009.
[18] S. F. Kaplan. Compressed Caching and Modern Virtual Memory Simulation. PhD thesis, 1999.
[19] C. Lefurgy et al. Energy Management for Commercial Servers. In IEEE Computer, 2003.
[20] C. Li, C. Ding, and K. Shen. Quantifying the Cost of Context Switch. In ExpCS, 2007.
[21] S. Li et al. McPAT: An Integrated Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures. In MICRO-42, 2009.
[22] P. S. Magnusson et al. Simics: A Full System Simulation Platform. IEEE Computer, 2002.
[23] Micron. 2Gb: x4, x8, x16, DDR3 SDRAM, 2012.
[24] H. Patil et al. Pinpointing Representative Portions of Large Intel Itanium Programs with Dynamic Instrumentation. In MICRO-37, 2004.
[25] G. Pekhimenko et al. Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches. In PACT, 2012.
[26] G. Pekhimenko et al. Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency. SAFARI Technical Report No. 2012-002, 2012.
[27] V. Sathish, M. J. Schulte, and N. S. Kim. Lossless and Lossy Memory I/O Link Compression for Improving Performance of GPGPU Workloads. In PACT, 2012.
[28] A. Snavely and D. M. Tullsen. Symbiotic Jobscheduling for a Simultaneous Multithreaded Processor. In ASPLOS-9, 2000.
[29] SPEC CPU2006. http://www.spec.org/.
[30] S. Srinath et al. Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers. In HPCA-13, 2007.
[31] S. Thoziyoor, N. Muralimanohar, J. H. Ahn, and N. P. Jouppi. CACTI 5.1. Technical Report HPL-2008-20, HP Laboratories, 2008.
[32] M. Thuresson et al. Memory-Link Compression Schemes: A Value Locality Perspective. IEEE TC, 2008.
[33] Transaction Processing Performance Council. http://www.tpc.org/.
[34] R. B. Tremaine et al. Pinnacle: IBM MXT in a Memory Controller Chip. IEEE Micro, 2001.
[35] P. R. Wilson, S. F. Kaplan, and Y. Smaragdakis. The Case for Compressed Caching in Virtual Memory Systems. In USENIX Annual Technical Conference, 1999.
[36] J. Yang, R. Gupta, and C. Zhang. Frequent Value Encoding for Low Power Data Buses. ACM TODAES, 2004.
[37] J. Yang, Y. Zhang, and R. Gupta. Frequent Value Compression in Data Caches. In MICRO-33, 2000.
[38] D. H. Yoon, M. K. Jeong, M. Sullivan, and M. Erez. The Dynamic Granularity Memory System. In ISCA, 2012.
[39] Y. Zhang, J. Yang, and R. Gupta. Frequent Value Locality and Value-Centric Data Cache Design. In ASPLOS-9, 2000.
[40] J. Ziv and A. Lempel. A Universal Algorithm for Sequential Data Compression. IEEE TIT, 1977.