NDC: Analyzing the Impact of 3D-Stacked Memory+Logic Devices on MapReduce Workloads
Seth H Pugsley1, Jeffrey Jestes1, Huihui Zhang1, Rajeev Balasubramonian1, Vijayalakshmi Srinivasan2,
Alper Buyuktosunoglu2, Al Davis1, and Feifei Li1
1 University of Utah, {pugsley, jestes, huihui, rajeev, ald, lifeifei}@cs.utah.edu
2 IBM T.J. Watson Research Center, {viji, alperb}@us.ibm.com
2.1. Mapper

Map. The Mapper applies the Map function to all records in the input split, typically producing key-value pairs as output from this stage. This is a linear scan of the input split, so this phase is highly bandwidth intensive. The computational complexity varies across workloads.

Sort. The Mapper next sorts the set of key-value pairs by their keys, with an in-place quick sort.

Combine. The Combine phase is applied to the local output of the Mapper, and can be viewed as a local Reduce function. This phase involves a linear scan through the sorted output data, applying the Reduce function to each key (with its associated set of values) in the output set.

Partition. The Mapper's final action is to divide its sorted-and-combined output into a number of partitions equal to the number of Reducers in the system. Each Reducer gets part of the output from each Mapper. This is done by another linear scan of the output, copying each item into its correct partition.
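To make the four Mapper steps concrete, the sketch below shows a minimal word-count-style Mapper in C (the language our workloads are written in; see Section 6.3). The record format, fixed-size keys, buffer sizes, and the emit/partition helpers are illustrative assumptions, not the code used in our evaluation.

```c
/* Illustrative sketch of the Mapper pipeline in Section 2.1:
 * Map scan -> in-place sort -> Combine -> Partition.
 * Word-count style; all sizes and helpers are hypothetical. */
#include <stdlib.h>
#include <string.h>

#define KEY_LEN      32
#define MAX_PAIRS    (1 << 20)
#define NUM_REDUCERS 16

typedef struct { char key[KEY_LEN]; long value; } kv_t;

static kv_t out[MAX_PAIRS];   /* Mapper output buffer (illustrative size) */
static int  n_out;            /* number of key-value pairs emitted so far */

/* Map: linear scan of the input split, emitting (word, 1) pairs. */
static void map_split(char *split)
{
    for (char *w = strtok(split, " \t\n"); w && n_out < MAX_PAIRS;
         w = strtok(NULL, " \t\n")) {
        strncpy(out[n_out].key, w, KEY_LEN - 1);
        out[n_out].key[KEY_LEN - 1] = '\0';
        out[n_out].value = 1;
        n_out++;
    }
}

static int cmp_key(const void *a, const void *b)
{
    return strcmp(((const kv_t *)a)->key, ((const kv_t *)b)->key);
}

/* Combine: one pass over the sorted pairs, applying a local Reduce
 * (here, summing counts) to each run of identical keys. */
static int combine(void)
{
    int w = 0;
    for (int r = 0; r < n_out; r++) {
        if (w > 0 && strcmp(out[w - 1].key, out[r].key) == 0)
            out[w - 1].value += out[r].value;
        else
            out[w++] = out[r];
    }
    return n_out = w;
}

/* Partition: copy each pair into the bucket of the Reducer that owns it. */
static unsigned hash_key(const char *k)
{
    unsigned h = 5381;
    while (*k) h = h * 33 + (unsigned char)*k++;
    return h;
}

static void partition(void (*send)(int reducer, const kv_t *kv))
{
    for (int i = 0; i < n_out; i++)
        send(hash_key(out[i].key) % NUM_REDUCERS, &out[i]);
}

void run_mapper(char *split, void (*send)(int, const kv_t *))
{
    n_out = 0;
    map_split(split);                          /* Map       */
    qsort(out, n_out, sizeof(kv_t), cmp_key);  /* Sort      */
    combine();                                 /* Combine   */
    partition(send);                           /* Partition */
}
```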
2.2. Reducer

Shuffle and Sort. The Reducer's first job is to gather all of its input from the various Mapper output partitions, which are scattered throughout the system, into a single, sorted input set. This is done by a merge of all the already-sorted partitions into a single input set (a merge sort).

Reduce. Finally, the Reducer applies the Reduce function to all of the keys (with their associated sets of values) in its sorted input set. This involves a linear scan of its input, applying the Reduce function to each item.
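Continuing the sketch above, the Reducer's two steps can be expressed as a k-way merge of the already-sorted partitions it receives, followed by a linear Reduce scan. Again, the data structures and limits below are illustrative assumptions rather than our evaluation code.

```c
/* Illustrative sketch of the Reducer in Section 2.2: merge the sorted
 * partitions received from the Mappers, then apply Reduce to each key.
 * kv_t is the same hypothetical pair type used in the Mapper sketch. */
#include <string.h>

typedef struct { char key[32]; long value; } kv_t;

/* Shuffle and Sort: k-way merge of num_parts already-sorted partitions
 * (parts[i] holds len[i] pairs) into the single sorted array 'merged'. */
static int merge_partitions(const kv_t *const *parts, const int *len,
                            int num_parts, kv_t *merged)
{
    int pos[64] = {0};   /* read cursor per partition; assumes <= 64 Mappers */
    int n = 0;

    for (;;) {
        int best = -1;
        for (int p = 0; p < num_parts; p++) {
            if (pos[p] == len[p])
                continue;
            if (best < 0 ||
                strcmp(parts[p][pos[p]].key, parts[best][pos[best]].key) < 0)
                best = p;
        }
        if (best < 0)                /* every partition exhausted */
            return n;
        merged[n++] = parts[best][pos[best]++];
    }
}

/* Reduce: linear scan of the merged input, invoking the user's Reduce
 * function once per run of identical keys. */
static void reduce_scan(const kv_t *in, int n,
                        void (*reduce)(const char *key,
                                       const long *vals, int count))
{
    long vals[1024];                 /* illustrative per-key value buffer */
    for (int i = 0; i < n; ) {
        int count = 0;
        const char *key = in[i].key;
        while (i < n && strcmp(in[i].key, key) == 0 && count < 1024)
            vals[count++] = in[i++].value;
        reduce(key, vals, count);    /* keys with >1024 values would be
                                        split in this simplified sketch */
    }
}
```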
2.3. Computational Requirements

Mappers and Reducers have different computational and bandwidth needs. The Map phase is largely bandwidth constrained, and consumes the bulk of the execution time for our workloads. It would therefore be beneficial to execute the Mapper on processors that have high levels of memory bandwidth, and not necessarily high single-thread performance.

3. Memory System Background

3.1. Moving from DDR3 to HMC

In a conventional memory system, a memory controller on the processor is connected to dual in-line memory modules (DIMMs) via an off-chip electrical DDR3 memory channel (bus). Modern processors have as many as four memory controllers and four DDR3 memory channels [4]. Processor pin counts have neared scaling limits [35]. Efforts to continually boost processor pin bandwidth lead to higher power consumption and limit per-pin memory capacity, so it is hard to simultaneously support higher memory capacity and higher memory bandwidth.

Recently, Micron has announced the imminent release of its Hybrid Memory Cube (HMC) [51]. The HMC uses 3D die-stacking to implement multiple DRAM dies and an interface logic chip in the same package. TSVs are used to ship data from the DRAM dies to the logic chip. The logic chip implements high-speed signaling circuits so it can interface with a processor chip through fast, narrow links.

3.2. Analyzing an HMC-Based Design

In Table 1, we provide a comparison between DDR3, DDR4, and HMC-style baseline designs in terms of power, bandwidth, and pin count. This comparison is based on data and assumptions provided by Micron [27, 36].

HMC is optimized for high-bandwidth operation and targets workloads that are bandwidth-limited. HMC has better bandwidth-per-pin and bandwidth-per-watt characteristics than either DDR3 or DDR4. We will later show that the MapReduce applications we consider here are indeed bandwidth-limited, and will therefore run best on systems that maximize bandwidth for a given pin and power budget.

4. Related Work

2D Processing-in-Memory: Between 1995 and 2005, multiple research teams built 2D PIM designs and prototypes (e.g., [32, 50, 38, 48]) and confirmed that there was potential for great speedup in certain application classes, such as media [29, 38], irregular computations [15, 32], link discovery algorithms [10], query processing [32, 47, 38], etc.

None of this prior work has exploited 3D stacking. While a few have examined database workloads, none have leveraged the MapReduce framework to design the application and to map tasks automatically to memory partitions. MapReduce is unique because the Map phase exhibits locality and embarrassing parallelism, while the Reduce phase requires high-bandwidth random memory access. We show that NDC with a 3D-stacked logic+memory device is a perfect fit because it can handle both phases efficiently. We also argue that dynamic activation of cores and SerDes links is beneficial because each phase uses a different set of cores and interconnects. This compelling case for NDC is made possible by the convergence of emerging technology (3D stacking), workloads (big-data analytics), and mature programming models (MapReduce).
3D Stacking: A number of recent papers have employed 3D stacking of various memory chips on a processor chip (e.g., [44, 45, 46, 22, 59, 61, 28, 40]) to reduce memory latencies. Even a stack of 4 DRAM chips can only offer a maximum capacity of 2 GB today. Hence, in the high-performance domain, such memory chips typically serve as a cache [37] and must be backed up by a traditional main memory system. Loh [44] describes various design strategies if the memory chips were to be used as main memory. Kim et al. [40] and Fick et al. [28] build proof-of-concept 3D-stacked devices that have 64 cores on the bottom die and small SRAM caches on the top die. These works do not explore the use of similar future devices for big-data processing. None of this prior work aggregates several 3D-stacked devices on a single board to cost-effectively execute big-data workloads. The 3D-MAPS prototype has measured the bandwidth and power for some kernels, including the histogram benchmark that resembles the data access pattern of some Map phases [40]. Industrial 3D memory prototypes and products include those from Samsung [55, 54], Elpida [23, 24], Tezzaron [60], and Micron [3, 51]. Many of these employ a logic controller at the bottom of the stack with undisclosed functionality. Tezzaron plans to use the bottom die for self-test and soft/hard error tolerance [60]. Micron has announced an interest in incorporating more sophisticated functionality on the bottom die [9].

Custom Architectures for Big-Data Processing: Some papers have argued that cost and energy efficiency are optimized for cloud workloads by using many "wimpy" processors and replacing disk access with Flash or DRAM access [42, 11, 49, 16]. Chang et al. [52, 17] postulate the Nanostore idea, where a 3D stack of non-volatile memory is bonded to a CPU. They evaluate specific design points that benefit from fast NVM access (relative to SSD/HDD) and a shallow memory hierarchy. Lim et al. [43] customize the core and NIC to optimize Memcached execution. Guo et al. [31, 30] design associative TCAM accelerators that help reduce data movement costs in applications that require key-value pair retrieval. The design relies on custom memory chips and emerging resistive cells. Phoenix [53] is a programming API and runtime that implements MapReduce for shared-memory systems. The Mars framework does the same for GPUs [33]. DeKruijf and Sankaralingam evaluate MapReduce efficiency on the Cell Processor [21]. A recent IBM paper describes how graph processing applications can be efficiently executed on a Blue Gene/Q platform [19].

5. Near Data Computing Architecture

5.1. High Performance Baseline

A Micron study [36] shows that energy per bit for HMC access is measured at 10.48 pJ, of which 3.7 pJ is in the DRAM layers and 6.78 pJ is in the logic layer. If we assume an HMC device with four links that operate at their peak bandwidth of 160 GB/s, the HMC and its links would consume a total of 13.4 W. About 43% of this power is in the SerDes circuits used for high-speed signaling [36, 56]. In short, relative to DDR3/DDR4 devices, the HMC design is paying a steep power penalty for its superior bandwidth. Also note that SerDes links cannot be easily powered down because of their long wake-up times. So the HMC will dissipate at least 6 W even when idle.

We begin by considering a server where a CPU is attached to 4 HMC devices with 8 total links. Each HMC has a capacity of 4 GB (8 DRAM layers, each with 4 Gb capacity). This system has a memory bandwidth of 320 GB/s (40 GB/s per link) and a total memory capacity of 16 GB. Depending on the application, the memory capacity wall may be encountered before the memory bandwidth wall.
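As a back-of-the-envelope check (no new data, just the numbers above): each HMC provides 8 layers x 4 Gb = 32 Gb = 4 GB, so the four HMCs provide 4 x 4 GB = 16 GB of capacity, and the eight links at 40 GB/s each provide 8 x 40 GB/s = 320 GB/s of aggregate link bandwidth.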
Memory capacity on the board can be increased by using a few links on an HMC to connect to other HMCs. In this paper, we restrict ourselves to a daisy-chain topology to construct an HMC network. Daisy chains are simple and have been used in other memory organizations, such as the FB-DIMM. We assume that the processor uses its eight links to connect to four HMCs (two links per HMC), and each HMC connects two of its links to the next HMC in the chain (as seen in Figure 1b). While daisy chaining increases the latency and power overhead for every memory access, it is a more power-efficient approach than increasing the number of system boards.

For power-efficient execution of embarrassingly parallel workloads like MapReduce, it is best to use as large a number of low energy-per-instruction (EPI) cores as possible. This will maximize the number of instructions that are executed per joule, and will also maximize the number of instructions executed per unit time, within a given power budget. According to the analysis of Azizi et al. [13], at low performance levels, the lowest EPI is provided by a single-issue in-order core. This is also consistent with data on ARM processor specification sheets. We therefore assume an in-order core similar to the ARM Cortex A5 [1].

Parameters for a Cortex A5-like core, and a CMP built out of many such cores, can be found in Table 2. Considering that large server chips can be over 400 mm2 in size, we assume that 512 such cores are accommodated on a server chip (leaving enough room for interconnect, memory controllers, etc.). To construct this table, we calculated the power consumed by on-chip wires to support the chip's off-chip bandwidth (not including the overheads for intermediate routers) [39], the total power consumed by the on-chip network [41], and the power used by the last-level caches and memory controllers [34]. This is a total power rating similar to that of other commercial high-end processors [4].

Energy Efficient / ND Core
  Process: 32 nm
  Power: 80 mW
  Frequency: 1 GHz
  Core Type: single-issue in-order
  Caches: 32 KB I and D
  Area (incl. caches): 0.51 mm2
EE Core Chip Multiprocessor
  Core Count: 512
  Core Power: 41.0 W
  NOC Power: 36.0 W
  LLC and IMC: 20.0 W
  Total CMP Power: 97.0 W

Table 2: Energy Efficient Core (EECore) and baseline system.
Figure 1: The Near Data Computing Architecture.

The processor can support a peak total throughput of 512 BIPS and 160 GB/s external read memory bandwidth, i.e., a peak bandwidth of 0.32 read bytes/instruction can be sustained. On such a processor, if the application is compute-bound, then we can build a simpler memory system with DDR3 or DDR4. Our characterization of MapReduce applications shows that the applications are indeed memory-bound. The read bandwidth requirements of our applications range from 0.47 bytes/instruction to 5.71 bytes/instruction. So the HMC-style memory system is required.
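The quoted figure is simply the ratio of the two peaks above: 160 GB/s divided by 512 billion instructions per second gives roughly 0.31 read bytes per instruction (rounded to 0.32 in the text), which falls well short of the 0.47-5.71 bytes/instruction that our applications demand.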
We have designed a baseline server that is optimized for in-memory MapReduce workloads. However, this design pays a significant price for data movement: (i) since bandwidth is vital, high-speed SerDes circuits are required at the transmitter and receiver; (ii) since memory capacity is vital to many workloads, daisy-chained devices are required, increasing the number of SerDes hops to reach the memory device; and (iii) since all the computations are aggregated on large processor chips, large on-chip networks have to be navigated to reach the few high-speed memory channels on the chip.

5.2. NDC Hardware

We next show that a more effective approach to handle MapReduce workloads is to move the computation to the 3D-stacked devices themselves. We refer to this as Near Data Computing to differentiate it from the processing-in-memory projects that placed logic and DRAM on the same chip and therefore had difficulty with commercial adoption.

While the concept of NDC will be beneficial to any memory bandwidth-bound workload that exhibits locality and high parallelism, we use MapReduce as our evaluation platform in this study. Similar to the baseline, a central host processor with many EECores is connected to many daisy-chained memory devices augmented with simple cores. The Map phases of MapReduce workloads exhibit high data locality and can be executed on the memory device; the Reduce phase also exhibits high data locality, but it is still executed on the central host processor chip because it requires random access to data. For random data accesses, average hop count is minimized if the requests originate in a central location, i.e., at the host processor. NDC improves performance by reducing memory latency and by overcoming the bandwidth wall. We further show that the proposed design can reduce power by disabling expensive SerDes circuits on the memory device and by powering down the cores that are inactive in each phase. Additionally, the NDC architecture scales more elegantly as more cores and memory are added, favorably impacting cost.

3D NDC Package. As with an HMC package, we assume that the NDC package contains eight 4 Gb DRAM dies stacked on top of a single logic layer. The logic layer has all the interface circuitry required to communicate with other devices, as in the HMC. In addition, we introduce 16 simple processor cores (Near-Data Cores, or NDCores).

3D Vertical Memory Slice. In an HMC design, 32 banks are used per DRAM die, each with capacity 16 MB (when assuming a 4 Gb DRAM chip). When 8 DRAM dies are stacked on top of each other, 16 banks align vertically to comprise one 3D vertical memory slice, with capacity 256 MB, as seen in Figure 1a. Note that a vertical memory slice (referred to as a "vault" in HMC literature) has 2 banks per die. Each 3D vertical memory slice is connected to an NDCore below on the logic layer by Through-Silicon Vias (TSVs). Each NDCore operates exclusively on 256 MB of data, stored in the 16 banks directly above it. NDCores have low-latency, high-bandwidth access to their 3D slice of memory. In the first-generation HMC, there are 1866 TSVs, of which 512 are used for data transfers at 2 Gb/s each [36].
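Spelling out the arithmetic in this paragraph: each vault spans 2 banks/die x 8 dies = 16 banks, i.e., 16 x 16 MB = 256 MB, and the 16 vaults together account for the full 8 x 4 Gb = 4 GB device. Likewise, the 512 data TSVs at 2 Gb/s each supply 1024 Gb/s, or 128 GB/s, of internal data bandwidth per device.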
NDCores. Based on our analysis earlier, we continue to use low-EPI cores to execute the embarrassingly parallel Map phase. We again assume an in-order core similar to the ARM Cortex A5 [1]. Each core runs at a frequency of 1 GHz and consumes 80 mW, including instruction and data caches. We are thus adding only 1.28 W of total power to the package (and will shortly offset this with other optimizations). Given the spatial locality in the Map phase, we assume a prefetch mechanism that fetches five consecutive cache lines on a cache miss. We also apply this prefetching optimization to all baseline systems tested, not just NDC, and it helps the baseline systems more than the NDC system, due to their higher-latency memory access.

Host CPUs and 3D NDC Packages. Because the host processor socket has random access to the entire memory space, we substitute the Shuffle phase with a Reduce phase that introduces a new level of indirection for data access. When the Reduce phase touches an object, it is fetched from the appropriate NDC device (the device where the object was produced by a Mapper). This is a departure from the typical Map, Shuffle, and Reduce pattern of MapReduce workloads, but minimizes data movement when executing on a central host CPU. The Reduce tasks are therefore executed on the host processor socket and its 512 EECores, with many random data fetches from all NDC devices. NDC and both baselines follow this model for executing the Reduce phase.

Having full-fledged traditional processor sockets on the board allows the system to default to the baseline system in case the application is not helped by NDC. The NDCores can remain simple as they are never expected to handle OS functionality or address data beyond their vault. The overall system architecture therefore resembles the optimized HMC baseline we constructed in Section 5.1. Each board has two CPU sockets. Each CPU socket has 512 low-EPI cores (EECores). Each socket has eight high-speed links that connect to four NDC daisy-chains. Thus, every host CPU core has efficient (and conventional) access to the board's entire memory space, as required by the Reduce function.

Power Optimizations. Given the two distinct phases of MapReduce workloads, the cores running the Map and Reduce phases will never be active at the same time. If we assume that the cores can be power-gated during their inactive phases, the overall power consumption can be kept in check.

Further, we maintain power-neutrality within the NDC package. This ensures that we are not aggravating thermal constraints in the 3D package. In the HMC package, about 5.7 W can be attributed to the SerDes circuits used for external communication. HMC devices are expected to integrate 4-8 external links, and we have argued before that all of these links are required in an optimal baseline. However, in an NDC architecture, external bandwidth is not as vital because it is only required in the relatively short Reduce phase. To save power, we therefore permanently disable 2 of the 4 links on the HMC package. This 2.85 W reduction in SerDes power offsets the 1.28 W power increase from the 16 NDCores.

The cores incur a small area overhead. Each core occupies 0.51 mm2 in 32 nm technology. So the 16 cores only incur a 7.6% area overhead, which could also be offset if some HMC links were outright removed rather than just being disabled.

Regardless of whether power-gating is employed, we expect that the overall system will consume less energy per workload task. This is because the energy for data movement has been greatly reduced. The new design consumes lower power than the baseline by disabling half the SerDes circuits. Faster execution times will also reduce the energy for constant components (clock distribution, leakage, etc.).

5.3. NDC Software

User Programmability. Programming for NDC is similar to the programming process for MapReduce on commodity clusters. The user supplies Map and Reduce functions. Behind the scenes, the MapReduce runtime coordinates and spawns the appropriate tasks.

Data Layout. Each 3D vertical memory slice has 256 MB total capacity, and each NDCore has access to one slice of data. For our workloads, we populate an NDCore's 256 MB of space with a single 128 MB database split, 64 MB of output buffer space, and 64 MB reserved for code and stack space, as demanded by the application and runtime. Each of these three regions is treated as a large superpage. The first two superpages can be accessed by their NDCore and by the central host processor. The third superpage can only be accessed by the NDCore. The logical data layout for one database split is shown in Figure 1c.
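The per-vault layout just described can be pictured as a fixed partitioning of each NDCore's 256 MB slice. The constants in the sketch below come from this paragraph, while the struct and symbol names are illustrative assumptions, not part of the actual runtime.

```c
/* Illustrative sketch of one NDCore's 256 MB vertical-slice layout
 * (Section 5.3, Figure 1c). Sizes come from the text; names are
 * hypothetical. */
#include <stdint.h>

#define MB              (1ULL << 20)
#define SLICE_SIZE      (256 * MB)   /* one 3D vertical memory slice       */
#define SPLIT_SIZE      (128 * MB)   /* input database split (host + NDC)  */
#define OUTBUF_SIZE     ( 64 * MB)   /* Mapper output buffer (host + NDC)  */
#define CODE_STACK_SIZE ( 64 * MB)   /* code and stack (NDCore only)       */

/* Each region is mapped as one large superpage. */
typedef struct {
    uint64_t split_base;       /* offset 0 within the slice   */
    uint64_t outbuf_base;      /* offset 128 MB               */
    uint64_t code_stack_base;  /* offset 192 MB               */
} vault_layout_t;

static vault_layout_t layout_for_vault(uint64_t vault_base)
{
    vault_layout_t l = {
        .split_base      = vault_base,
        .outbuf_base     = vault_base + SPLIT_SIZE,
        .code_stack_base = vault_base + SPLIT_SIZE + OUTBUF_SIZE,
    };
    return l;
}
```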
MapReduce Runtime. Runtime software is required to orchestrate the actions of the Mappers and Reducers. Portions of the MapReduce runtime execute on the host CPU cores and portions execute on the NDCores, providing functionality very similar to what might be provided by Hadoop. The MapReduce runtime can serve as a lightweight OS for an NDCore, ensuring that code and data do not exceed their space allocations, and possibly re-starting a Mapper on a host CPU core if there is an unserviceable exception or overflow.

6. Evaluation

6.1. Evaluated Systems

In this work, we compare an NDC-based system to two baseline systems. The first system uses a traditional out-of-order (OoO) multi-core CPU, and the other uses a large number of energy-efficient cores (EECores). Both of these processor types are used in a 2-socket system connected to 256 GB of HMC memory capacity, which fits 1024 128 MB database splits. All evaluated systems are summarized in Table 3.

Out-of-Order System
  CPU configuration: 2x 8 cores, 3.3 GHz
  Core parameters: 4-wide out-of-order, 128-entry ROB
  L1 Caches: 32 KB I and D, 4 cycle
  L2 Cache: 256 KB, 10 cycle
  L3 Cache: 2 MB, 20 cycle
  NDC Cores: —
EECore System
  CPU configuration: 2x 512 cores, 1 GHz
  Core parameters: single-issue in-order
  L1 Caches: 32 KB I and D, 1 cycle
  NDC Cores: —
NDC System
  CPU configuration: 2x 512 cores, 1 GHz
  Core parameters: single-issue in-order
  L1 Caches: 32 KB I and D, 1 cycle
  NDC Cores: 1024

Table 3: System parameters.

6.1.1. OoO System On this system, both the Map and Reduce phases of MapReduce run on the high-performance CPU cores on the two host sockets. Each of the 16 OoO cores must sequentially process 64 of the 1024 input splits to complete the Map phase. As a baseline, we assume perfect performance scaling for more cores, and ignore any contention for shared resources, other than memory bandwidth, to paint this system configuration in the best light possible.

6.1.2. EECore System Each of the 1024 EECores must process only one of the 1024 input splits in a MapReduce workload. Although the frequency of each EECore is much lower than that of an OoO core, and the IPC of each EECore is lower than that of an OoO core, the EECore system still has the advantage of massive parallelism, and we show in our results that this is a net win for the EECore system by a large margin.
6.1.3. NDCore System We assume NDCores of the same type, power, and frequency as the EECores. The only difference in their performance is the way they connect to memory. EECores must share a link to the system of connected HMCs, but each NDCore has a direct link to its dedicated memory, with very high bandwidth and lower latency. This means NDCores will have higher performance than EECores.

In order to remain power neutral compared to the EECore system, each HMC device in the NDC system has half of its 4 data links disabled, and therefore can deliver only half the bandwidth to the host CPU, negatively impacting Reduce performance.

6.2. Workloads

We evaluate the Map and Reduce phases of 5 different MapReduce workloads, namely Group-By Aggregation (GroupBy), Range Aggregation (RangeAgg), Equi-Join Aggregation (EquiJoin), Word Count Frequency (WordCount), and Sequence Count Frequency (SequenceCount). GroupBy and EquiJoin both involve a sort, a combine, and a partition in their Map phase, in addition to the Map scan, but the RangeAgg workload is simply a high-bandwidth Map scan through the 64 MB database split. These first three workloads use 50 GB of the 1998 World Cup website log [12]. WordCount and SequenceCount each find the frequency of words or sequences of words in large HTML files, and as input we use 50 GB of Wikipedia HTML data [7]. These last two workloads are more computationally intensive than the others because they involve text parsing and not just integer compares when sorting data.

6.3. Methodology

We use a multi-stage CPU and memory simulation infrastructure to simulate both CPU and DRAM systems in detail.

To simulate the CPU cores (OoO, EE, and NDC), we use the Simics full system simulator [8]. To simulate the DRAM, we use the USIMM DRAM simulator [18], which has been modified to model an HMC architecture. We assume that the DRAM core latency (Activate + Precharge + ColumnRead) is 40 ns. Our simulations model a single Map or Reduce thread at a time, and we assume that throughput scales linearly as more cores are used. While NDCores have direct access to DRAM banks, EECores must navigate the memory controller and SerDes links on their way to the HMC device. Since these links are shared by 512 cores, it is important to correctly model contention at the memory controller. A 512-core Simics simulation is not tractable, so we use a trace-based version of the USIMM simulator. This stand-alone trace-based simulation models contention when the memory system is fed memory requests from 512 Mappers or 512 Reducers. These contention estimates are then fed into the detailed single-thread Simics simulation.

We wrote the code for the Mappers and Reducers of our five workloads in C, and then compiled them using GCC version 3.4.2 for the simulated architecture. The instruction mix of these workloads is strictly integer-based. For each workload, we have also added 1 ms execution time overheads for beginning a new Map phase, transitioning between Map and Reduce phases, and for completing a job after the Reduce phase. This conservatively models the MapReduce runtime overheads and the cost of cache flushes between phases.

We evaluate the power and energy consumed by our systems taking into account workload execution times, memory bandwidth, and processor core activity rates. We calculate power for the memory system as the sum of the power used by each logic layer in each HMC (including SerDes links), the DRAM array background power, and the power used to access the DRAM arrays for reads and writes. We assume that the four SerDes links consume a total of 5.78 W per HMC, and the remainder of the logic layer consumes 2.89 W [56]. Total maximum DRAM array power per HMC is assumed to be 4.7 W for 8 DRAM dies [36]. We approximate background DRAM array power at 10% of this maximum value [6], or 0.47 W, and the remaining DRAM power is dependent on DRAM activity. Energy is consumed in the arrays on each access at the rate of an additional 3.7 pJ/bit (note that the HMC implements narrow rows and a close-page policy [36]). For data that is moved to the processor socket, we add 4.7 pJ/bit to navigate the global wires between the memory controller and the core [39]. This is a conservative estimate because it ignores intermediate routing elements, and favors the EECore baseline. For the core power estimates, we assume that 25% of the 80 mW core peak power can be attributed to leakage (20 mW). The dynamic power for the core varies linearly between 30 mW and 60 mW, based on IPC (since many circuits are switching even during stall cycles).
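The power model just described can be summarized in a few lines of C. All constants are the ones given in this subsection; the functions are a simplified restatement of the model under our own naming, not our actual evaluation scripts.

```c
/* Simplified restatement of the Section 6.3 power model. Constants are
 * from the text; the interface and names are illustrative. */

#define SERDES_W_PER_HMC   5.78   /* four SerDes links, total per HMC       */
#define LOGIC_OTHER_W      2.89   /* remainder of the logic layer           */
#define DRAM_MAX_W         4.70   /* peak DRAM array power per HMC (bound;
                                     not enforced in this sketch)           */
#define DRAM_BACKGROUND_W  0.47   /* ~10% of the peak array power           */
#define PJ_PER_BIT_ARRAY   3.7    /* DRAM array access energy               */
#define PJ_PER_BIT_GLOBAL  4.7    /* host global wires, controller to core  */
#define CORE_LEAK_W        0.020  /* 25% of the 80 mW core peak             */
#define CORE_DYN_MIN_W     0.030  /* dynamic power at zero IPC              */
#define CORE_DYN_MAX_W     0.060  /* dynamic power at peak IPC              */

/* Memory-system power attributed to one HMC, given its read+write
 * bandwidth in GB/s and the fraction of that data shipped to the host
 * socket (the 4.7 pJ/bit is spent in the host's on-chip wires). */
static double memory_system_power_w(int active_serdes_links,
                                    double gbytes_per_sec,
                                    double fraction_to_host)
{
    double bits_per_sec = gbytes_per_sec * 8e9;
    double serdes = SERDES_W_PER_HMC * active_serdes_links / 4.0;
    double array  = DRAM_BACKGROUND_W
                  + bits_per_sec * PJ_PER_BIT_ARRAY * 1e-12;
    double move   = bits_per_sec * fraction_to_host
                  * PJ_PER_BIT_GLOBAL * 1e-12;
    return serdes + LOGIC_OTHER_W + array + move;
}

/* Core power: fixed leakage plus dynamic power that scales linearly
 * with IPC relative to the core's peak IPC (50 mW to 80 mW total). */
static double core_power_w(double ipc, double peak_ipc)
{
    double util = (peak_ipc > 0.0) ? ipc / peak_ipc : 0.0;
    return CORE_LEAK_W + CORE_DYN_MIN_W
         + (CORE_DYN_MAX_W - CORE_DYN_MIN_W) * util;
}
```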
Figure 2: Execution times of a single Mapper task, measured in absolute time (top), and normalized to EE execution time (bottom).

Figure 3: Execution times of all Mapper tasks, measured in absolute time (top), and normalized to EE execution time (bottom).

7. Performance Results

7.1. Individual Mapper Performance

We first examine the performance of a single thread working on a single input split in each architecture. Figure 2 shows the execution latency of a single Mapper for each workload.

We show both normalized and absolute execution times to show the scale of each of these workloads. When executing on an EECore, a RangeAgg Mapper task takes on the order of milliseconds to complete, GroupBy and EquiJoin take on the order of seconds to complete, and WordCount and SequenceCount take on the order of minutes to complete.

RangeAgg, GroupBy, and EquiJoin have lower compute requirements than WordCount and SequenceCount, so in these workloads, because of its memory latency advantage, an NDCore is able to nearly match the performance of an OoO core. The EECore system falls behind in executing a single Mapper task compared to both OoO cores and NDCores, because its HMC link bandwidth is maxed out for some workloads, as seen in Section 7.3.

7.2. Map Phase Performance

Map phase execution continues until all Mapper tasks have been completed. In the case of the EE and NDC systems, the number of Mapper tasks and processor cores is equal, so all Mapper tasks are executed in parallel, and the duration of the Map phase is equal to the time it takes to execute one Mapper task. In the case of the OoO system, Mapper tasks outnumber processor cores 64-to-1, so each OoO processor must sequentially execute 64 Mapper tasks.

Because of this, the single-threaded performance advantage of the OoO cores becomes irrelevant, and both the EE and NDC systems are able to outperform the OoO system by a wide margin. As seen in Figure 3, compared to the OoO system, the EE system reduces Map phase execution times from 69.4% (RangeAgg) up to 89.8% (WordCount). The NDC system improves upon the EE system by further reducing execution times from 23.7% (WordCount) up to 93.2% (RangeAgg).
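Put compactly, with t_map denoting the single-Mapper latency from Figure 2: the EE and NDC systems finish the Map phase in T_Map = t_map, because all 1024 Mapper tasks run in parallel on 1024 cores, whereas the OoO system needs T_Map = 64 x t_map, because each of its 16 cores executes 64 Mapper tasks back to back.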
7.3. Bandwidth

The NDC system is able to improve upon the performance of the OoO and EE systems because it is not constrained by HMC link bandwidth during the Map phase. Figure 4 shows the read and write bandwidth for each 2-socket system, as well as a bar representing the maximum HMC link bandwidth, which sets an upper bound for the performance of the OoO and EE systems.

The OoO system is unable to ever come close to saturating the available bandwidth of an HMC-based memory system. The EE system is able to effectively use the large amounts of available bandwidth, but because the bandwidth is a limited resource, it puts a cap on the performance potential of the EE system. The NDC system is not constrained by HMC link bandwidth, and is able to use an effective bandwidth many times that of the other systems. While the two baseline systems are limited to a maximum read bandwidth of 320 GB/s, the NDC system has a maxi-
7.4. MapReduce Performance
7.6. HMC Power Consumption and Thermal Analysis

Figure 8: Heatmap of the logic layer in the NDC system (best viewed in color).

In addition to a system-level evaluation of energy consumption, we also consider the power consumption of an individual HMC device. In the EE system, the HMC device is comprised of a logic layer, including 4 SerDes links, and 8 vertically stacked DRAM dies. An NDC HMC also has a logic layer and 8 DRAM dies, but it only uses 2 SerDes links and also includes 16 NDC cores. As with the energy consumption evaluation, we consider core and DRAM activity levels in determining HMC device power. Figure 7 shows the contribution of HMC power from the logic layer, the DRAM arrays, and the NDC cores, if present. The baseline HMCs do not have any NDC cores, so they see no power contribution from that source, but they do have twice the number of SerDes links, which are the single largest consumer of power in the HMC device.

The NDC design saves some power by trading 2 SerDes links for 16 NDCores. However, we also see an increase in DRAM array power in NDC. In the EECore baseline, host processor pin bandwidth is shared between all HMCs in the chain, and no one HMC device is able to realize its full bandwidth potential. This leads to a low power contribution coming from DRAM array activity, because each HMC device can contribute on average only 1/8th the bandwidth supported by the SerDes links. The NDC architecture, on the other hand, is able to keep the DRAM arrays busier by utilizing the available TSV bandwidth. Overall, the NDC HMC device consumes up to 16.7% lower power than the baseline HMC device.

We also evaluated the baseline HMC and NDC floorplans with HotSpot 5.0 [2], using default configuration parameters, an ambient temperature of 45°C inside the system case, and a heat spreader of thickness 0.25 mm. We assumed that each DRAM layer dissipates 0.59 W, spread uniformly across its area. The logic layer's 8.67 W is distributed across various units based on HMC's power breakdown and floorplan reported by Sandhu [56]. We assumed that all 4 SerDes links were active. For each NDCore, we assumed that 80% of its 80 mW power is dissipated in 20% of its area to model a potential hotspot within the NDCore. Our analysis showed a negligible increase in device peak temperature from adding NDCores. This is shown by the logic layer heatmap in Figure 8; the SerDes units have much higher power densities than the NDCores, so they continue to represent the hottest units on the logic chip. We carried out a detailed sensitivity study and observed that the NDCores emerge as hotspots only if they consume over 200 mW each. The DRAM layers exceed 85°C (requiring faster refresh) only if the heat spreader is thinner than 0.1 mm.
8. Conclusions

This paper argues that the concept of Near-Data Computing is worth re-visiting in light of various technological trends. We argue that the MapReduce framework is a good fit for NDC architectures. We present a high-level description of the NDC hardware and accompanying software architecture, which presents the programmer with a MapReduce-style programming model. We first construct an optimized baseline that uses daisy-chains of HMC devices and many energy-efficient cores on a traditional processor socket. This baseline pays a steep price for data movement. The move to NDC reduces the data movement cost and helps overcome the bandwidth wall. This helps reduce overall workload execution time by 12.3% to 93.2%. We also employ power-gating for cores and disable SerDes links in the NDC design. This ensures that the HMC devices consume less power than the baseline and further bring down the energy consumption. Further, we expect that NDC performance, power, energy, and cost will continue to improve as the daisy chains are made deeper.

Acknowledgments

We thank the anonymous reviewers for their many useful suggestions. This work was supported in part by NSF grant CNS-1302663 and IBM Research.

References

[1] "Cortex-A5 Processor," http://www.arm.com/products/processors/cortex-a/cortex-a5.php.
[2] "HotSpot 5.0," http://lava.cs.virginia.edu/HotSpot/index.htm.
[3] "Hybrid Memory Cube, Micron Technologies," http://www.micron.com/innovations/hmc.html.
[4] "Intel Xeon Processor E5-4650 Specifications," http://ark.intel.com/products/64622/.
[5] "Memcached: A Distributed Memory Object Caching System," http://memcached.org.
[6] "Micron System Power Calculator," http://www.micron.com/products/support/power-calc.
[7] "PUMA Benchmarks and dataset downloads," http://web.ics.purdue.edu/~fahmad/benchmarks/datasets.htm.
[8] "Wind River Simics Full System Simulator," http://www.windriver.com/products/simics/.
[9] "Open-Silicon and Micron Align to Deliver Next-Generation Memory Technology," 2011, http://www.open-silicon.com/news-events/press-releases/open-silicon-and-micron-align-to-deliver-next-generation-memory-technology.html.
[10] J. Adibi, T. Barrett, S. Bhatt, H. Chalupsky, J. Chame, and M. Hall, "Processing-in-Memory Technology for Knowledge Discovery Algorithms," in Proceedings of DaMoN Workshop, 2006.
[11] D. Andersen, J. Franklin, M. Kaminsky, A. Phanishayee, L. Tan, and V. Vasudevan, "FAWN: A Fast Array of Wimpy Nodes," in Proceedings of SOSP, 2009.
[12] M. Arlitt and T. Jin, "1998 World Cup Web Site Access Logs," http://www.acm.org/sigcomm/ITA/, August 1998.
[13] O. Azizi, A. Mahesri, B. Lee, S. Patel, and M. Horowitz, "Energy-Performance Tradeoffs in Processor Architecture and Circuit Design: A Marginal Cost Analysis," in Proceedings of ISCA, 2010.
[14] BerkeleyDB, "Berkeley DB: high-performance embedded database for key/value data," http://www.oracle.com/technetwork/products/berkeleydb/overview/index.html.
[15] J. Brockman, S. Thoziyoor, S. Kuntz, and P. Kogge, "A Low Cost, Multithreaded Processing-in-Memory System," in Proceedings of WMPI, 2004.
[16] A. Caulfield, L. Grupp, and S. Swanson, "Gordon: Using Flash Memory to Build Fast, Power-efficient Clusters for Data-Intensive Applications," in Proceedings of ASPLOS, 2009.
[17] J. Chang, P. Ranganathan, D. Roberts, T. Mudge, M. Shah, and K. Lim, "A Limits Study of the Benefits from Nanostore-based Future Data Centric System Architectures," in Proceedings of Computing Frontiers, 2012.
[18] N. Chatterjee, R. Balasubramonian, M. Shevgoor, S. Pugsley, A. Udipi, A. Shafiee, K. Sudan, M. Awasthi, and Z. Chishti, "USIMM: the Utah SImulated Memory Module," University of Utah, Tech. Rep. UUCS-12-002, 2012.
[19] F. Checconi, F. Petrini, J. Willcock, A. Lumsdaine, A. Choudhury, and Y. Sabharwal, "Breaking the Speed and Scalability Barriers for Graph Exploration on Distributed-memory Machines," in Proceedings of SC, 2012.
[20] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in Proceedings of OSDI, 2004.
[21] M. deKruijf and K. Sankaralingam, "MapReduce for the Cell B.E. Architecture," IBM Journal of Research and Development, vol. 53(5), 2009.
[22] X. Dong, Y. Xie, N. Muralimanohar, and N. Jouppi, "Simple but Effective Heterogeneous Main Memory with On-Chip Memory Controller Support," in Proceedings of SC, 2010.
[23] Elpida Memory Inc., "News Release: Elpida Completes Development of Cu-TSV (Through Silicon Via) Multi-Layer 8-Gigabit DRAM," http://www.elpida.com/pdfs/pr/2009-08-27e.pdf, 2009.
[24] ——, "News Release: Elpida, PTI, and UMC Partner on 3D IC Integration Development for Advanced Technologies Including 28nm," http://www.elpida.com/en/news/2011/05-30.html, 2011.
[25] F. Färber, S. K. Cha, J. Primsch, C. Bornhövd, S. Sigg, and W. Lehner, "SAP HANA Database: Data Management for Modern Business Applications," SIGMOD Record, vol. 40, no. 4, pp. 45–51, 2011.
[26] F. Färber, N. May, W. Lehner, P. Große, I. Müller, H. Rauhe, and J. Dees, "The SAP HANA Database – An Architecture Overview," IEEE Data Eng. Bull., vol. 35, no. 1, pp. 28–33, 2012.
[27] T. Farrell, "HMC Overview: A Revolutionary Approach to System Memory," 2012, exhibit at Supercomputing.
[28] D. Fick et al., "Centip3De: A 3930 DMIPS/W Configurable Near-Threshold 3D Stacked System with 64 ARM Cortex-M3 Cores," in Proceedings of ISSCC, 2012.
[29] J. Gebis, S. Williams, C. Kozyrakis, and D. Patterson, "VIRAM-1: A Media-Oriented Vector Processor with Embedded DRAM," in Proceedings of DAC, 2004.
[30] Q. Guo, X. Guo, Y. Bai, and E. Ipek, "A Resistive TCAM Accelerator for Data-Intensive Computing," in Proceedings of MICRO, 2011.
[31] Q. Guo, X. Guo, R. Patel, E. Ipek, and E. Friedman, "AC-DIMM: Associative Computing with STT-MRAM," in Proceedings of ISCA, 2013.
[32] M. Hall, P. Kogge, J. Koller, P. Diniz, J. Chame, J. Draper, J. LaCoss, J. Granacki, J. Brockman, A. Srivastava, W. Athas, V. Freeh, J. Shin, and J. Park, "Mapping Irregular Applications to DIVA, a PIM-based Data-Intensive Architecture," in Proceedings of SC, 1999.
[33] B. He, W. Fang, Q. Luo, N. Govindaraju, and T. Wang, "Mars: A MapReduce Framework on Graphics Processors," in Proceedings of PACT, 2008.
[34] J. Howard et al., "A 48-Core IA-32 Message-Passing Processor with DVFS in 45nm CMOS," in Proceedings of ISSCC, 2010.
[35] ITRS, "International Technology Roadmap for Semiconductors, 2009 Edition."
[36] J. Jeddeloh and B. Keeth, "Hybrid Memory Cube – New DRAM Architecture Increases Density and Performance," in Symposium on VLSI Technology, 2012.
[37] X. Jiang, N. Madan, L. Zhao, M. Upton, R. Iyer, S. Makineni, D. Newell, Y. Solihin, and R. Balasubramonian, "CHOP: Adaptive Filter-Based DRAM Caching for CMP Server Platforms," in Proceedings of HPCA, 2010.
[38] Y. Kang, M. Huang, S. Yoo, Z. Ge, D. Keen, V. Lam, P. Pattnaik, and J. Torrellas, "FlexRAM: Toward an Advanced Intelligent Memory System," in Proceedings of ICCD, 1999.
[39] S. Keckler, "Life After Dennard and How I Learned to Love the Picojoule," Keynote at MICRO, 2011.
[40] D. Kim et al., "3D-MAPS: 3D Massively Parallel Processor with Stacked Memory," in Proceedings of ISSCC, 2012.
[41] P. Kundu, "On-Die Interconnects for Next Generation CMPs," in Workshop on On- and Off-Chip Interconnection Networks for Multicore Systems (OCIN), 2006.
[42] K. Lim et al., "Understanding and Designing New Server Architectures for Emerging Warehouse-Computing Environments," in Proceedings of ISCA, 2008.
[43] K. Lim, D. Meisner, A. Saidi, P. Ranganathan, and T. Wenisch, "Thin Servers with Smart Pipes: Designing Accelerators for Memcached," in Proceedings of ISCA, 2013.
[44] G. Loh, “3D-Stacked Memory Architectures for Multi-Core Processors,”
in Proceedings of ISCA, 2008.
[45] G. Loi, B. Agrawal, N. Srivastava, S. Lin, T. Sherwood, and K. Banerjee,
“A Thermally-Aware Performance Analysis of Vertically Integrated (3-D)
Processor-Memory Hierarchy,” in Proceedings of DAC-43, June 2006.
[46] N. Madan, L. Zhao, N. Muralimanohar, A. N. Udipi, R. Balasubramonian,
R. Iyer, S. Makineni, and D. Newell, “Optimizing Communication and Ca-
pacity in a 3D Stacked Reconfigurable Cache Hierarchy,” in Proceedings
of HPCA, 2009.
[47] R. Murphy, P. Kogge, and A. Rodrigues, “The Characterization of Data In-
tensive Memory Workloads on Distributed PIM Systems,” in Proceedings
of Workshop on Intelligent Memory Systems, 2000.
[48] M. Oskin, F. Chong, and T. Sherwood, “Active Pages: A Model of Com-
putation for Intelligent Memory,” in Proceedings of ISCA, 1998.
[49] J. Ousterhout, P. Agrawal, D. Erickson, C. Kozyrakis, J. Leverich,
D. Mazieres, S. Mitra, A. Narayanan, G. Parulkar, M. Rosenblum, S. Rum-
ble, E. Stratmann, and R. Stutsman, “The Case for RAMClouds: Scalable
High-Performance Storage Entirely in DRAM,” SIGOPS Operating Sys-
tems Review, vol. 43(4), 2009.
[50] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton,
C. Kozyrakis, R. Thomas, and C. Yelick, “A Case for Intelligent DRAM:
IRAM,” IEEE Micro, vol. 17(2), April 1997.
[51] T. Pawlowski, “Hybrid Memory Cube (HMC),” in HotChips, 2011.
[52] P. Ranganathan, “From Microprocessors to Nanostores: Rethinking Data-
Centric Systems,” IEEE Computer, Jan 2011.
[53] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis,
“Evaluating MapReduce for Multi-Core and Multiprocessor Systems,” in
Proceedings of HPCA, 2007.
[54] Samsung, “Samsung to Release 3D Memory Modules with 50% Greater
Density,” 2010, http://www.computerworld.com/s/article/9200278/
Samsung_to_release_3D_memory_modules_with_50_greater_density.
[55] Samsung Electronics Corporation, “Samsung Electronics Develops
World’s First Eight-Die Multi-Chip Package for Multimedia Cell Phones,”
2005, (Press release from http://www.samsung.com).
[56] G. Sandhu, “DRAM Scaling and Bandwidth Challenges,” in NSF Work-
shop on Emerging Technologies for Interconnects (WETI), 2012.
[57] SAP, “In-Memory Computing: SAP HANA,” http://www.sap.com/
solutions/technology/in-memory-computing-platform.
[58] SAS, “SAS In-Memory Analytics,” http://www.sas.com/software/
high-performance-analytics/in-memory-analytics/.
[59] G. Sun, X. Dong, Y. Xie, J. Li, and Y. Chen, “A Novel Architecture of
the 3D Stacked MRAM L2 Cache for CMPs,” in Proceedings of HPCA,
2009.
[60] Tezzaron Semiconductor, “3D Stacked DRAM/Bi-STAR Overview,”
2011, http://www.tezzaron.com/memory/Overview_3D_DRAM.htm.
[61] D. H. Woo et al., “An Optimized 3D-Stacked Memory Architecture by
Exploiting Excessive, High-Density TSV Bandwidth,” in Proceedings of
HPCA, 2010.
[62] M. Zaharia, M. Chowdhury, M. Franklin, S. Shenker, and I. Stoica,
“Spark: Cluster Computing with Working Sets,” in Proceedings of Hot-
Cloud, 2010.