

Reconfigurable Dataflow Graphs For Processing-In-Memory

Charles F. Shelor
Computer Science and Engineering
University of North Texas
Denton, Texas, USA
[email protected]

Krishna M. Kavi
Computer Science and Engineering
University of North Texas
Denton, Texas, USA
[email protected]

ABSTRACT

In order to meet the ever-increasing speed differences between processor clocks and memory access times, there has been an interest in moving computation closer to memory. Near data processing or processing-in-memory is particularly suited for very high bandwidth memories such as the 3D-DRAMs. There are different ideas proposed for PIMs, including simple in-order processors, GPUs, specialized ASICs and reconfigurable designs. In our case, we use Coarse-Grained Reconfigurable Logic to build dataflow graphs for computational kernels as the PIM. We show that our approach can achieve significant speedups and save energy consumed by computations. We evaluated our designs using several processing technologies for building the coarse-grained logic units. The DFPIM concept showed good performance improvement and excellent energy efficiency for the streaming benchmarks that were analyzed. The DFPIM in a 28 nm process with an implementation in each of 16 vaults of a 3D-DRAM logic layer showed an average speed-up of 7.2 over that using 32 cores of an Intel Xeon server system. The server processor required 368 times more energy to execute the benchmarks than the DFPIM implementation.

CCS CONCEPTS

• Computer systems organization ~ Data flow architectures

KEYWORDS

Dataflow Architectures, Coarse Grained Reconfigurable Logic, Processing in Memory, 3D-Stacked Memories

ACM Reference format:

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

ICDCN '19, January 4-7, 2019, Navi Mumbai, India
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6094-4/19/01...$15.00
https://doi.org/10.1145/3288599.3288605

1 Introduction

One of the major problems facing today's computer systems is the disparate speeds between processor instruction cycles and memory access times. This is not a new issue, as this timing mismatch was recognized long ago and is known as the memory wall [42]. Advances in processor clock rates and architectures have outpaced improvements in memory bandwidth and access delay. Processors are running in the 2 to 4 GHz range, while DRAM latency is 41.25 ns for the fastest DDR4 device from Micron [31]. The cache hierarchy in computer systems has been used to reduce the effects of the memory wall by providing fast access to data items on subsequent accesses. The use of multilevel memory caches, prefetching of data, multiple memory channels, and wide data paths mitigates the memory access delay, but there are still times when the processor must wait 83 to 166 clocks for the requested data to arrive. Increasing computer system performance through multicore processors increases the pressure on the memory system as more memory requests are being generated. Conditions will occur where one or more cores must wait until an existing memory access completes before beginning their own memory accesses. With every fifth instruction [17] being a data request, the memory access delay and imperfect caching lead to high-end servers being idle three out of four clocks [21]. Energy consumption of computer systems has also been an increasing issue in recent years.

Advances in silicon technology have dramatically decreased the energy per computation for the processor core. However, the energy for memory accesses is increasing to achieve the improved bandwidth and latency needed to match processor performance [34, 35]. The memory system is an increasingly significant fraction of the computing system energy use [26]. A 64-bit external memory access requires approximately 100 times the energy of a double precision computation [20, 9, 25].

Energy is particularly important to both high-performance applications and emerging Big Data and Deep Learning applications. For Exascale systems, the goals include a memory bandwidth of 4 TB/s at each node for 100,000 nodes with a maximum power budget of 20 MW [41]. Aggressive assumptions about memory technology improvements show that 70% of that power budget will be needed for memory accesses [43].

Demand for higher performance computer systems has pushed processor architectures to longer pipelines with multiple issue, out-of-order capabilities and larger memory caches to supply data.

These high-performance microarchitectural features require an energy overhead that reduces the energy efficiency of the processor. A 4-issue core has six integer ALUs, two load-store units, two floating point units, and one integer multiply-divide unit. Only 26% of the energy is used by functional units that generate algorithmic results. The remaining energy is consumed by the cache hierarchy, the network on the chip, instruction scheduling, register renaming and other logic needed for out-of-order execution.

Our architecture addresses both execution performance and energy consumption. Execution performance is improved by moving computations closer to memory (that is, Processing in Memory) and replacing traditional instruction pipelines with a reconfigurable graph describing a computation. Energy savings result from the elimination of instruction fetch/decode/issue cycles and cache memories, and from using lower clock frequencies.

The rest of the paper is organized as follows. Section 2 describes the technologies that enabled our work. Section 3 provides an overview of our dataflow processing-in-memory (DFPIM) architecture. Section 4 provides details of our experimental setup. Section 5 contains results of our evaluation and discussions. Section 6 includes research that is closely related to ours and Section 7 contains conclusions of our study.

2. Enabling Technologies

Our architecture is enabled by a (hybrid) dataflow model of computation, coarse-grained reconfigurable logic and 3D-stacked memories with room for processing-in-memory logic.

2.1. Dataflow model represents a computation as a graph where the nodes represent operations on inputs received via the incoming edges, and results are sent to other nodes via outgoing edges [3, 4, 5, 23, 24]. In our system, we deviate from the pure dataflow model. We use load units to bring input data from DRAM memory into local buffers. There are delay operations in the dataflow graphs to balance and synchronize all paths in the graph, eliminating the need for additional inputs to trigger when data is consumed. The dataflow graph is 'executed' only when all graph inputs for the next computation are available (not just inputs to nodes in the input layer). This pipelined execution also handles loop carried dependencies and simplifies memory ordering issues. Programmable state machines are used to implement looping structures within the dataflow graphs to increase graph execution independence from a host processor or controller. Figure 1 shows a dataflow graph representation of FFT. A detailed description of the operations is omitted due to space limitations.

[Figure 1: An Example Dataflow Graph for FFT. The graph connects floating-point multiply, add and subtract units with select units and read/write address generators to implement the FFT butterfly; the graphic is not reproduced here.]
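To make the firing discipline concrete, the following is a minimal C sketch of the synchronized, pipelined execution model described above. The token type and node functions are illustrative inventions for exposition, not the DFPIM hardware interface.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch (not the paper's implementation): a two-input
 * node in the pipelined dataflow model fires only when both operand
 * registers hold valid tokens; a 'delay' node is a pass-through used
 * to keep parallel paths of the graph the same length. */
typedef struct {
    uint32_t value;
    bool     valid;
} token_t;

/* Fire an add node: consumes both inputs, produces one output.
 * If either input is missing, the whole pipeline stage stalls. */
static bool fire_add(token_t *a, token_t *b, token_t *out) {
    if (!a->valid || !b->valid)
        return false;
    out->value = a->value + b->value;
    out->valid = true;
    a->valid = b->valid = false;   /* tokens consumed */
    return true;
}

/* A delay node balances a path that bypasses one pipeline stage. */
static void fire_delay(token_t *in, token_t *out) {
    *out = *in;
    in->valid = false;
}
```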

We use load units to bring input data from DRAM memory into A
B
C
local buffers. There are delay operations in the dataflow graphs to D

* ALU + ALU pass ALU - ALU


balance and synchronize all paths in the graph, eliminating the A*B B+D A C-D

need for additional inputs to trigger when data is consumed. The


dataflow graph is ’executed’ only when all graph inputs for the pass + pass *
ALU ALU ALU ALU
next computation are available (not just inputs to nodes in the input A*B A+B+D A (B+D)*(C-D)

layer). This pipelined execution also handles loop carried


dependencies and simplifies memory ordering issues. pass ALU * ALU * ALU nop ALU
A*B * (A+B+D)
Programmable state machines are used to implement looping A*B A * (B+D) * (C-D)

structures within the dataflow graphs to increase graph execution


independence from a host processor or controller. Figure 1 shows * ALU pass ALU pass ALU nop ALU

a dataflow graph representation of FFT. Detailed description of the


W X Y Z
operations is omitted due to space limitations. (A*B) * (A*B) A*B * (A+B+D) A * (B+D) * (C-D)
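A configuration in this model can be pictured as a routing table: each functional unit names its operation and the crossbar ports feeding its two inputs. The sketch below, with invented type and field names, shows one plausible encoding for the Figure 2 column that computes (A*B)*(A+B+D); it is an illustration of the idea, not the actual configuration format.

```c
/* Invented encoding of a CGRL configuration: ports 0..3 of the
 * crossbar carry the primary inputs A, B, C, D, and functional
 * unit i drives port 4 + i. */
enum fu_op { FU_ADD, FU_SUB, FU_MUL, FU_PASS, FU_NOP };

typedef struct {
    enum fu_op op;   /* operation performed by the functional unit */
    int src0, src1;  /* crossbar ports routed to the unit's inputs */
} fu_config_t;

/* One column of Figure 2: X = (A*B) * (A+B+D). */
static const fu_config_t x_column[4] = {
    { FU_MUL, 0, 1 },   /* unit 0: A * B           -> port 4 */
    { FU_ADD, 1, 3 },   /* unit 1: B + D           -> port 5 */
    { FU_ADD, 0, 5 },   /* unit 2: A + (B + D)     -> port 6 */
    { FU_MUL, 4, 6 },   /* unit 3: (A*B) * (A+B+D) = X       */
};
/* Loading a different table reconfigures the fabric to a new graph. */
```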
2.3. Processing in Memory (PIM) using 3D DRAM. One approach to mitigating the memory wall for applications that do not work well with caches is moving the processing of the data closer to the data itself [36, 43]. The advent of 3D-stacked DRAMs, which include a logic layer, makes this Near Data Computing (NDC) or Processing-in-Memory (PIM) practical. The close physical proximity of the stacked layers combined with the low capacitance of the TSV interconnect [30] provides a faster and lower power communication path than the standard memory controller to DRAM DIMM path through sockets and PCB traces. The multiple independent channels and high-speed serial links provide 256 GB/s for HBM [22, 31] and 160 GB/s for HMC [19, 32, 35].

3. Dataflow Processing In Memory (DFPIM)

DFPIM uses a hybrid dataflow technology to extract parallelism and pipelining for high performance computation in streaming data applications. The dataflow logic is configured into the application solution graph by using CGRL comprised of functional blocks and connectivity elements. The CGRL is implemented as PIM on the logic layer within a 3D stacked DRAM. Figure 3 shows a high-level architecture of the proposed dataflow PIM.

[Figure 3: DFPIM Architecture. A host processor with its cores, cache hierarchy, memory controllers, PCIe peripherals and commodity DIMMs connects over high-speed serial links (HSL) to accelerated memory modules (AMM). Each AMM is a 3D DRAM stack whose logic layer contains, per vault, a memory controller (MC), CGRL and scratch pad memory (SPM), plus a shared micro-controller (uCTRL); the graphic is not reproduced here.]

The left section represents the host computer for the DFPIM elements. This is a standard server or workstation computer system. The only feature that is not standard on current systems is a high-speed serial interface for connecting the accelerated memory modules. Processor manufacturers are incorporating these links in new products to take advantage of the higher bandwidth and lower energy of 3D stacked DRAM devices [7, 8].

There can be multiple accelerated memory modules (AMM) as shown in the figure. The center section shows an accelerated memory module expanded into the logic layer base and a stack of DRAMs, including a representation of the sixteen independent vertical vaults. The logic layer base contains one memory controller and one DFPIM instance for each vault. A microcontroller is included on the logic layer to assist DFPIM configuration and to minimize the communication between the host and DFPIM for optimum performance.

The right section shows an expanded view of the logic layer for one vault. The memory controller accesses the memory stack vault directly above the vault controller [2, 36, 43]. The memory controller communicates with the high-speed link for data transfers with the host. The DFPIM instance has load and store units within the CGRL that access the DRAM stack through the memory controller and buffer input data for the dataflow graphs. The DFPIM instance consists of the CGRL logic and the scratch pad memories that are local to the CGRL functional units for implementing the dataflow graphs of the applications. There is also a link to the DFPIM microcontroller that is used to configure the CGRL and to initialize and store data from the scratch pad memories as needed.

3.1. DFPIM Operation

DFPIM operation can be divided into four phases. Initiation is performed by the host processor. Configuration is executed by the DFPIM micro-controller. Computation is executed by the DFPIM logic until the input data is exhausted. An update phase is conducted by the micro-controller for storing results.

The host computer initiates a DFPIM operation when a command that uses the DFPIM is executed. As an example, a user could enter a command via the keyboard. The operating system reads an executable file that contains the machine instructions that implement the given command. When the DFPIM is to be involved, there is a data segment within the file that must be transferred to the DFPIM. This is very similar to executing a command implemented in OpenCL or CUDA that involves a graphics processor. The code to be executed by the graphics processor is copied from a data segment of the host executable to the graphics processor to be used as the instructions to execute. The data segment directed to the DFPIM is copied through the high-speed link to an address dedicated to this purpose.

The DFPIM logic accepts input data and generates results until it runs out of input data. If there is no input data ready for a particular clock cycle, the entire logic network waits for the data to become available. This is needed to ensure data stays synchronized through the computational sequence. If an exception condition is encountered, it can be posted for detection after the computation has completed, or it can be passed to the micro-controller, which will terminate processing and notify the host processor that the operation has failed.

The update phase uses the micro-controller to download any results that are contained in scratchpad memories. The host processor is then notified that the requested operation has completed. In some cases the results might be transferred to the host processor; in other cases it might just be an acknowledgement that the operation completed and the results are available at the requested location.

All DFPIM operations are based on physical address offsets since the DFPIM resides inside a physical memory. Any indirect or pointer accesses within the application must either be based on physical addresses, or the application data must have been allocated as a large, contiguous segment with all pointers simply offsets within the segment. DFPIM applications are limited to the memory within the 3D-stacked component that contains the DFPIM logic. A communication network on the logic layer allows DFPIM elements to access data in other vaults, but this is likely to introduce delays.
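The following C sketch summarizes the four phases from the host's point of view. The paper defines the phases but not a host API, so the command layout and every function name below are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical host-side view of the four DFPIM phases; all names
 * are invented for illustration.  The stubs narrate what each phase
 * would do in a real system. */
typedef struct {
    const char *graph_xml;  /* CGRL configuration from the LLVM backend  */
    uint64_t    in_offset;  /* physical offset of the input stream       */
    uint64_t    in_bytes;   /* length of the input stream                */
    uint64_t    out_offset; /* physical offset where results are written */
} dfpim_cmd_t;

static void initiate(const dfpim_cmd_t *c)  /* phase 1: host             */
{ (void)c; puts("host: copy data segment over the high-speed link"); }

static void configure(const dfpim_cmd_t *c) /* phase 2: micro-controller */
{ printf("uCtrl: program CGRL and load units from %s\n", c->graph_xml); }

static void compute(const dfpim_cmd_t *c)   /* phase 3: CGRL dataflow    */
{ printf("CGRL: stream %llu bytes through the graph\n",
         (unsigned long long)c->in_bytes); }

static void update(const dfpim_cmd_t *c)    /* phase 4: micro-controller */
{ (void)c; puts("uCtrl: drain scratch pads, notify host of completion"); }

int main(void) {
    /* 500 frames of 4096 complex samples, 8 bytes each (the FFT case) */
    dfpim_cmd_t cmd = { "fft.xml", 0, 4096u * 8u * 500u, 0x1000000u };
    initiate(&cmd); configure(&cmd); compute(&cmd); update(&cmd);
    return 0;
}
```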
3.2. DFPIM Layout

Figure 4 shows a possible floor plan for a DFPIM implementation. The DFPIM components use 50% of a 68 mm² stacked DRAM logic layer (the other 50% is set aside for memory controllers and TSVs). The illustration is drawn to scale using logic synthesis estimates for each block for a 28 nm process technology. An ARM core is used as the microcontroller for DFPIM.

[Figure 4: An Example PIM Layout. Sixteen 1.4 mm x 1.4 mm vault tiles, each containing a memory controller (MC), integer (I), multiplier (M2), divider (D) and floating-point (F) units, and 128 KB and 256 KB scratch pad memories (M), together with external interface (XIF) blocks and the Arm micro-controller; the graphic is not reproduced here.]

In this layout the memory controllers are located in the lower left corner of each memory vault of an HMC-like 3D stacked memory. The DFPIM logic is above and to the right of each memory controller. The logic units include integer units I (2 load units, 1 store unit, 20 ALUs, 4 multiply units, some specialized units, and two FIFOs), floating point units F (32 single precision adders and multipliers, 10 double precision adders and multipliers) and a small local memory M. The interconnection bus is a 16 x 32 crossbar which can be segmented into smaller buses. This layout is only one example configuration.

4. Experimental Setup

In this paper we compare the execution times and energy consumed by DFPIM with a host system with two 14-core Intel Xeon E5-2683v3 processors running at 2 GHz. The Intel Performance Counter Monitor tools package [39] was used to monitor the power consumed by the CPUs during benchmark execution. We carefully isolated the execution time and energy consumed only for the benchmark kernels for a fair comparison with DFPIM. The execution and energy values for our DFPIM components (functional units and ARM core) are estimated using very detailed logic synthesis values using TSMC libraries for 7 nm and 16 nm FinFET and 28 nm planar CMOS technologies, obtained through ARM Limited Artisan physical IP [27]. We evaluated eleven clock rates in six variants of 7 nm libraries, eight clock rates in twelve variants of 16 nm libraries and seven clock rates in four variants of 28 nm libraries. From these 190 different synthesis runs for each DFPIM component, we selected the best configuration (clock rate and library) that results in optimal energy-delay values. Table 1 shows the selected libraries and clock frequencies for the three criteria of minimum energy, minimum energy-delay product, and the average of those two. We use the libraries and clock rates from the average column, as a balance of energy and performance is desired.

Table 1: Optimal Silicon Libraries and Clock Rates

        Energy             Energy * Delay      E, E*D Ave
07nm    svt-c8, 1.0 GHz    ulvt-c8, 2.0 GHz    lvt-c8, 1.8 GHz
16nm    lvt-c16, 0.8 GHz   ilvt-c16, 1.5 GHz   ilvt-c16, 1.0 GHz
28nm    svt-c35, 0.6 GHz   svt-c30, 1.1 GHz    svt-c30, 1.1 GHz
4.1. Dataflow Graph Generation

We developed a backend to the LLVM compiler [28, 29] to generate dataflow graphs. The portion of the C program that is targeted for execution on DFPIM is first identified. Using the LLVM intermediate representation for the identified kernel code, a dataflow graph is generated. For our purpose the output is represented in XML, describing the various functional units used by the graph and the connections between these units that form the graph. This XML is used by our DFPIM simulator for producing the execution results presented in Section 5.
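As an illustration of this flow, consider a toy kernel (not one of the paper's benchmarks) and one plausible shape for the emitted graph description. The XML tag and attribute names in the comment are invented, since the schema is not published.

```c
#include <stdint.h>

/* A toy kernel of the kind the backend consumes (illustrative only). */
void scale_add(const int32_t *x, int32_t *y, long n, int32_t a, int32_t b) {
    for (long i = 0; i < n; i++)
        y[i] = a * x[i] + b;   /* one load, one multiply, one add, one store */
}

/* Plausible emitted description (hypothetical schema):
 *   <graph kernel="scale_add">
 *     <unit id="ld0" type="load"/>
 *     <unit id="m0"  type="int_mul"/>
 *     <unit id="a0"  type="int_alu" op="add"/>
 *     <unit id="st0" type="store"/>
 *     <edge from="ld0" to="m0"/>
 *     <edge from="m0"  to="a0"/>
 *     <edge from="a0"  to="st0"/>
 *   </graph>
 */
```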
4.2. DFPIM Simulator

Our simulator takes the input (in XML) generated by LLVM for each benchmark kernel, configures the functional units to represent the dataflow graph described by the LLVM output, and executes the graph with inputs transmitted by the host processor. For our purpose we assume that both the host and DFPIM use the same address space and thus DFPIM will access the data from the shared 3D DRAM memory. Both host and DFPIM rely on physical addresses. The load units contained within DFPIM buffer inputs for use by the computational functional units, and the store units copy results back to memory.

The execution delays and energy consumed by the various DFPIM logic components are based on the values obtained by our logic synthesis as described above in Section 4.

5. Results

In this section we describe the results of our experiments comparing the execution times and energy consumed by DFPIM with a host system as described in the previous section.

5.1. Benchmarks

We selected representative benchmarks from a wide variety of application domains. The map-reduce benchmarks from HiBench [18], the map-reduce benchmarks from PUMA [1], the Rodinia benchmarks [6], SPEC benchmarks [38], and MiBench benchmarks [16] were reviewed. We selected benchmarks that had significant differences in their suitability for dataflow implementation. The seven benchmarks used in this paper are histogram, word occurrence count, fast Fourier transform, breadth first search, string match, linear regression, and SHA256. The SHA256 benchmark had three versions implemented in DFPIM, for a total of nine analyses. We now describe these benchmarks.

Histogram. The histogram benchmark inputs an RGB image and generates a histogram of values for the red (R), green (G), and blue (B) components of the pixels. The benchmark code isolates the 8-bit color values from a 24-bit input with shift and mask operations. The pixel component values are used as addresses into three arrays (scratch pad memories in DFPIM); the current count is read, incremented, and stored back. The input file contained 468,750,000 RGB pixels.
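A C sketch of such a histogram kernel follows (illustrative, not the benchmark's exact source). The three counter updates at the bottom are what DFPIM maps onto three scratch pad memories updated in parallel in a single clock.

```c
#include <stdint.h>

/* Unpack each 24-bit pixel with shifts and masks and update three
 * independent counter arrays; on DFPIM each array lives in its own
 * scratch pad, so all three read-modify-writes happen concurrently. */
void histogram(const uint8_t *img, long npixels,
               uint32_t r[256], uint32_t g[256], uint32_t b[256]) {
    for (long i = 0; i < npixels; i++) {
        uint32_t pix = (uint32_t)img[3*i]     << 16 |
                       (uint32_t)img[3*i + 1] << 8  |
                       (uint32_t)img[3*i + 2];
        r[(pix >> 16) & 0xFF]++;   /* red   scratch pad */
        g[(pix >> 8)  & 0xFF]++;   /* green scratch pad */
        b[ pix        & 0xFF]++;   /* blue  scratch pad */
    }
}
```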
Word Count. The word occurrence count benchmark is based on tasks used in web indexing and searches. The first part of the benchmark inputs a character stream and isolates it into words. The second part of the benchmark creates a hash for the word and looks for the word in a hash table. If the word is found in the table, its occurrence count is incremented; otherwise it is added to the table with a count of 1. The server implementation of the benchmark serially finds a word, then processes the word, then looks for the next word. The DFPIM implementation has two sections. The first section processes the input looking for words. When a word is found it is put into a FIFO. The second section pulls a word from the FIFO, processes the word and then pulls the next word. The DFPIM uses word-wide comparison for word matching. The two sections work independently and are synchronized through the FIFO. The benchmark input was 94,858,002 bytes in length.
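The following sequential C sketch approximates the two sections (the table size, word-length limit and hash function are illustrative choices). On DFPIM the two loops run concurrently and are coupled only through the FIFO.

```c
#include <ctype.h>
#include <string.h>

#define TBL 65536                 /* illustrative table size            */
static char words[TBL][24];
static int  counts[TBL];

static unsigned hash(const char *w) {   /* simple illustrative hash    */
    unsigned h = 5381;
    while (*w) h = 33 * h + (unsigned char)*w++;
    return h % TBL;
}

void word_count(const char *text) {
    char buf[24];
    int  len = 0;
    for (;; text++) {
        /* Section 1: isolate words (words over 23 chars are split;
         * fine for a sketch, not for production). */
        if (isalpha((unsigned char)*text) && len < 23) {
            buf[len++] = (char)tolower((unsigned char)*text);
            continue;
        }
        if (len > 0) {            /* word -> "FIFO" -> section 2        */
            buf[len] = '\0';
            unsigned h = hash(buf);
            while (counts[h] && strcmp(words[h], buf) != 0)
                h = (h + 1) % TBL;          /* linear probing          */
            if (!counts[h]) strcpy(words[h], buf);
            counts[h]++;
            len = 0;
        }
        if (*text == '\0') break;
    }
}
```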
Fast Fourier Transform. The FFT benchmark processes a frame of time sampled data into a frame of frequency bins. The number of samples in the input frame is a power of 2, designated as N. The butterfly implementation of the algorithm is a triple nested loop where the outer loop is repeated log2(N) times. The middle loop iterates based on powers of 2 from log2(N)-1 to 1, while the inner loop iterates based on powers of 2 from 1 to log2(N)-1. The code within the inner loop is executed log2(N) * N/2 times. The FFT algorithm does benefit from caching, as each data sample is accessed log2(N) times during a frame processing. However, a program using an FFT is likely to process many frames in sequentially ordered streaming. The outer loop includes two sine and two cosine operations. As the DFPIM does not have sine and cosine blocks defined, the micro-controller must intervene and perform these operations. Alternately, they could be precomputed and stored in a scratch pad, eliminating the micro-controller involvement. This analysis was based on a frame size of 4096 samples and processing 500 frames of data. FFT is an example of a benchmark that is not very well suited for pure dataflow implementation.
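For reference, a standard radix-2 decimation-in-frequency implementation with the triple-nested loop structure described above might look as follows. This is an illustrative sketch, not the benchmark source; the text's variant hoists the sine/cosine work into the outer loop (via a recurrence), while here the twiddle factors are computed inline for clarity.

```c
#include <math.h>
#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Radix-2 DIF FFT sketch: log2(n) passes, n/2 butterflies per pass,
 * so the inner body runs log2(N) * N/2 times, as stated in the text.
 * The sin/cos calls are the operations DFPIM delegates to the
 * micro-controller or to a precomputed scratch pad table. */
void fft(double *re, double *im, int n) {            /* n = 2^m        */
    for (int half = n / 2; half >= 1; half /= 2) {   /* log2(n) passes */
        double ang = -M_PI / half;
        for (int j = 0; j < half; j++) {             /* butterfly groups */
            double tr = cos(j * ang), ti = sin(j * ang);
            for (int k = j; k < n; k += 2 * half) {  /* inner butterflies */
                int    l  = k + half;
                double xr = re[k] - re[l], xi = im[k] - im[l];
                re[k] += re[l];  im[k] += im[l];
                re[l] = xr * tr - xi * ti;           /* 4 mul, 2 add:   */
                im[l] = xr * ti + xi * tr;           /* Figure 1 datapath */
            }
        }
    }
    /* bit-reversal reordering of the outputs omitted for brevity */
}
```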
Breadth First Search. The breadth first search benchmark is neither streaming nor cache friendly. It searches through a tree, resulting in a random memory access pattern. The only advantage of the DFPIM is its faster access time to memory. The input file contains a tree with one million nodes.

String Match. The string match benchmark searches a text file for a list of keys. Whenever a match is found, its location in the text file is saved. The algorithm first locates the end of the current word and then compares the word to each of the keys. The pointer to the word is stored in the results block when a match occurs. This analysis searched a 502 MiB file while searching for four keys.
Linear Regression. The linear regression benchmark takes a file of points and accumulates 5 information components: x-coordinate value, x-coordinate squared, y-coordinate value, y-coordinate squared, and x-coordinate times y-coordinate. The five accumulated values are returned when the end of the input data is reached. This benchmark was evaluated with a 670 MiB file.
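A C sketch of the accumulation loop follows (illustrative). The five updates have no dependences on one another, which is what lets DFPIM perform all of them in parallel each clock.

```c
#include <stdint.h>

/* Each input item is an (x, y) byte pair; the five accumulators are
 * independent and can be updated concurrently. */
void linreg(const uint8_t *pts, long n, uint64_t acc[5]) {
    for (long i = 0; i < n; i++) {
        uint64_t x = pts[2 * i];
        uint64_t y = pts[2 * i + 1];
        acc[0] += x;          /* sum of x    */
        acc[1] += x * x;      /* sum of x^2  */
        acc[2] += y;          /* sum of y    */
        acc[3] += y * y;      /* sum of y^2  */
        acc[4] += x * y;      /* sum of x*y  */
    }
}
```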
SHA256. The SHA256 benchmark is a cryptographic application that creates a digest of a message that can later be used to guarantee the message has not been modified. A large sequence of rotation, logical and arithmetic operations is performed on the input data to generate the 32-byte message digest. Each round of the algorithm requires 6 rotate, 3 logical AND, 6 logical XOR, 2 logical OR, and 7 addition operations, for a total of 24 operations per round. Sixty-four rounds are performed on each 64-byte input block for a total of 1536 operations per block. Three DFPIM implementations of the SHA256 benchmark are used in this evaluation. The first is a straightforward implementation where an integer ALU is used for each of the operations. One input stream is processed at a time. The second implementation creates three new DFPIM components implementing macros of the processing round. This implementation is designated as SHAmac. The 24 integer ALUs per round are reduced to three special components and two integer ALUs. The reduced component count decreases power and energy while maintaining the same performance. The algorithm loops the result of each round to the start of the next round. Since the round takes three clocks to pass through the pipeline, each component is idle two-thirds of the time. The third implementation interleaves three different input streams to achieve three times the throughput of the first two implementations. This version is designated SHAmac3. The pipeline is fully utilized, resulting in higher energy to obtain the better performance. A 50 MiB file is processed by the SHA256 benchmark in this evaluation. It should be noted that the two alternatives described here are not available with the server implementation of SHA (since we cannot modify the functional units of the server).
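For reference, one round in the standard FIPS 180-4 formulation, from which operation counts of this kind are derived (a sketch, not the benchmark source). The dependence of the new a and e values on t1 is the loop-carried feedback that produces the three-clock pipeline latency discussed above.

```c
#include <stdint.h>

#define ROTR(x, n) (((x) >> (n)) | ((x) << (32 - (n))))

/* One of the 64 SHA-256 rounds per 64-byte block; s[0..7] holds the
 * working variables a..h, kt is the round constant, wt the schedule word. */
static void sha256_round(uint32_t s[8], uint32_t kt, uint32_t wt) {
    uint32_t S1  = ROTR(s[4], 6) ^ ROTR(s[4], 11) ^ ROTR(s[4], 25);
    uint32_t ch  = (s[4] & s[5]) ^ (~s[4] & s[6]);
    uint32_t t1  = s[7] + S1 + ch + kt + wt;
    uint32_t S0  = ROTR(s[0], 2) ^ ROTR(s[0], 13) ^ ROTR(s[0], 22);
    uint32_t maj = (s[0] & s[1]) ^ (s[0] & s[2]) ^ (s[1] & s[2]);
    uint32_t t2  = S0 + maj;
    s[7] = s[6]; s[6] = s[5]; s[5] = s[4];
    s[4] = s[3] + t1;          /* new e depends on t1: loop-carried */
    s[3] = s[2]; s[2] = s[1]; s[1] = s[0];
    s[0] = t1 + t2;            /* new a depends on t1: loop-carried */
}
```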
5.2. Server Benchmark Results

The results of running the benchmarks on the server processor are shown in Table 2. The first eight rows of the table provide the measured data from the benchmark execution. The last five rows of the table are calculated values derived from the measured data. All three variants of the SHA256 benchmark are shown in this table, even though the server data is the same for the three variants since there is only one implementation of SHA on the server. This keeps the four result tables consistent.

______________________________________________________________________________________________________
Table 2: Server Benchmark Results
Hist Word FFT BFS Str M Lin R SHA256 SHAmac SHAmac3
Base Clocks / Item 45.08 18.57 348.85 82.08 12.28 25.08 2160.19 2160.19 2160.19
Clocks (M) 299.30 76.70 776.64 164.64 262.74 372.49 131.07 131.07 131.07
Freq (clk/usec) 1997 1997 1997 1997 1997 1997 1997 1997 1997
Items / Proc (K) 4882 2964 2048 1000 16454 10986 51 51 51
Item Size (bytes) 3 1 8 8 1 2 64 64 64
CPU Power (W) 126.17 125.86 125.62 125.35 125.73 124.81 126.34 126.34 126.34
Mem Power (W) 2.27 2.07 1.96 2.09 2.02 2.02 2.02 2.02 2.02
Kernel percent 99.99 99.99 98.88 99.25 99.99 99.99 95.81 95.81 95.81
Execution Time (S) 0.1499 0.0384 0.3889 0.0824 0.1316 0.1865 0.0656 0.0656 0.0656
Server Energy (J) 19.250 4.913 49.617 10.507 16.808 23.657 8.425 8.425 8.425
Bandwidth (MB/S) 3127.6 2469.8 1348.1 3105.1 4002.1 3769.6 1604.5 1604.5 1604.5
Clocks per item 61.30 25.87 379.22 164.64 15.97 33.91 2549.05 2549.05 2549.05
Congestion Factor 0.36 0.39 0.09 1.01 0.30 0.35 0.18 0.18 0.18

______________________________________________________________________________________________________

The Base Clocks / Item is the number of processor clocks needed to complete the benchmark while running on a single processor, divided by the number of benchmark items processed. This baseline is compared to the clocks per item measured when the server is running 32 instances of the benchmark simultaneously, to show how well the benchmark scales. The Clocks (M) row is the total number of processor clocks to complete the benchmark, in millions of clock ticks. The Freq (clk/usec) entry is the actual operating frequency of the processor as reported by the hardware during the benchmark. This is measured to ensure the operating system did not change the frequency of the processors during operation. The Items / Proc (K) measurement is the number of thousands of benchmark items processed on each of the 32 instances of the benchmark executed on the server. The Item Size (bytes) is the size of the benchmark item. The histogram benchmark processes pixels that are composed of a red, a green, and a blue component, for a total of three bytes per pixel. The word count benchmark processes characters as input for a 1 byte size. The FFT benchmark operates on complex numbers with floating-point real and imaginary values for a total of 8 bytes. Breadth first search operates on pointers with a size of 8 bytes. String match processes characters as input for an item size of 1 byte. The linear regression processes data points with an x component and a y component that are both bytes. The SHA256 algorithm processes a 64-byte block of data as an item.

The CPU Power (W) row contains the measured CPU power in Watts for the benchmark. The Mem Power (W) row provides the measured memory power in Watts. There is very little variation in power for the different benchmarks. The power measurements showed more correlation to the number of cores that were active than to the type of activity being performed.

The Kernel percent measurement indicates the percentage of benchmark clocks that were used for execution of the benchmark kernel section.

The Execution Time (S) is derived from the total clocks executed divided by the clock frequency, and is expressed in seconds.

The Server Energy (J) is computed by adding the CPU power and memory power and multiplying the sum by the execution time. This expresses the energy in Joules.

The Bandwidth (MB/S) is the number of benchmark items times the size of each item, divided by the execution time. It does not include any instruction accesses, incidental cache hits, or memory accesses from algorithmic overhead such as loop indexing or address calculations.

The Clocks per item metric is the number of processor clock cycles divided by the number of benchmark items of a benchmark instance running on each of the 32 processors. As the actual number of clock cycles required per item is given in the base clocks per item, any additional cycles must be attributed to congestion in accessing memory or other resources. The Congestion Factor row expresses this congestion as a fraction of the base clocks per item. There is a noticeable trend for higher memory bandwidths to have higher congestion factors.
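As a worked check of how the derived rows follow from the measured ones (a reconstruction from the table values, not text from the original), take the histogram column of Table 2:

    Execution Time   = Clocks / Freq
                     = 299.30e6 / 1.997e9 ≈ 0.1499 s
    Server Energy    = (CPU Power + Mem Power) * Time
                     = (126.17 + 2.27) * 0.1499 ≈ 19.25 J
    Bandwidth        = 32 * Items/Proc * Item Size / Time
                     = 32 * 4,882,000 * 3 bytes / 0.1499 s ≈ 3128 MB/s
    Congestion Factor = (Clocks per item - Base Clocks/Item) / Base Clocks/Item
                     = (61.30 - 45.08) / 45.08 ≈ 0.36

All four agree with the Hist column of Table 2 within rounding.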
5.3. DFPIM Benchmark Results

Table 3 displays the measurements for the DFPIM benchmarks. We analyzed three different silicon technologies as described in Section 4. This resulted in a separate measurement for each technology. The units used in this table are consistent with the units in Table 2, allowing the numbers to be directly compared. The data for the server processor was collected with 32 active cores. The DFPIM has 16 vaults and each DFPIM executes an instance of the benchmark (the server system thus represents twice as many "cores" executing twice as many copies of the benchmark kernel when compared to the DFPIM implementation). Therefore, the Items / Vault (K) row will be double the number of items per process used for the server processor data.

The DF Power.. (W) rows give the power of the DFPIM components, which is equivalent to the CPU power of the server data (we show the values for the 28, 16 and 7 nm versions). The Mem Power.. (W) rows give the power of the memory accesses within the stacked DRAM. The Clocks per item measurement does not have three components, as the underlying silicon technology does not impact the dataflow pipeline organization, resulting in the same number of clocks for each technology. The lower clocks per item in the DFPIM results compared to the server processor results show the benefits of the dataflow parallelism and pipelining. The power values show the benefits of low power silicon libraries and of not pushing the technology to its performance limits (running at a lower frequency than the maximum possible).

The advantages of using specialized functional units for the SHAmac version of the SHA256 benchmark can be seen in the DF Power values. There is no difference in timing, as the benchmark round still requires three clocks. Interleaving three benchmark instances in the SHAmac3 version does show a factor of three timing improvement. This version shows an increase in power, as each component is active every cycle while components are only active for 1 in 3 clocks in the other two SHA256 versions.

______________________________________________________________________________________________________
Table 3: DFPIM Benchmark Results
Hist Word FFT BFS Str M Lin R SHA256 SHAmac SHAmac3
Clocks (M) 9.77 6.02 49.27 57.56 33.57 21.97 19.75 19.75 6.58
Freq28 (clk/usec) 1100 1100 1100 1100 1100 1100 1100 1100 1100
Freq16 (clk/usec) 1300 1300 1300 1300 1300 1300 1300 1300 1300
Freq07 (clk/usec) 1800 1800 1800 1800 1800 1800 1800 1800 1800
Items / Vault (K) 9765 5928 4096 2000 32909 21972 102 102 102
Item Size (bytes) 3 1 8 8 1 2 64 64 64
DF Power28 (W) 0.2000 4.6165 1.7990 0.2885 0.5598 0.6358 0.2899 0.0972 0.3803
DF Power16 (W) 0.1313 2.8157 1.1809 0.1845 0.3587 0.4120 0.1840 0.0634 0.2442
DF Power07 (W) 0.0779 1.8954 0.7514 0.1132 0.2198 0.2690 0.1063 0.0376 0.1556
Mem Power28 (W) 2.0329 0.9828 0.8165 0.6148 0.9807 1.5119 0.6437 0.6437 0.9909
Mem Power16 (W) 2.3170 1.0760 0.8795 0.6411 1.0736 1.7014 0.6752 0.6752 1.0857
Mem Power07 (W) 3.0274 1.3091 1.0369 0.7070 1.3058 2.1750 0.7542 0.7542 1.3224
Exec Time28 (S) 0.0089 0.0055 0.0448 0.0523 0.0305 0.0200 0.0180 0.0180 0.0060
Exec Time16 (S) 0.0075 0.0046 0.0379 0.0443 0.0258 0.0169 0.0152 0.0152 0.0051
Exec Time07 (S) 0.0054 0.0033 0.0274 0.0320 0.0186 0.0122 0.0110 0.0110 0.0037
DF Energy28 (J) 0.0261 0.0336 0.1484 0.0815 0.0631 0.0537 0.0160 0.0142 0.0123
DF Energy16 (J) 0.0221 0.0198 0.0960 0.0569 0.0466 0.0423 0.0109 0.0101 0.0091
DF Energy07 (J) 0.0185 0.0115 0.0571 0.0352 0.0327 0.0327 0.0067 0.0063 0.0065
BW28 (MB/S) 52800.0 17322.8 11704.8 4892.5 17254.9 35200.0 5866.6 5866.6 17599.4
BW16 (MB/S) 62400.0 20472.4 13832.9 5782.0 20392.2 41600.0 6933.2 6933.2 20799.2
BW07 (MB/S) 86400.0 28346.4 19153.2 8005.8 28235.3 57600.0 9599.9 9599.9 28798.9
Clocks per item 1.00 1.02 12.03 28.78 1.02 1.00 192.00 192.00 64.00

______________________________________________________________________________________________________
5.4. Server to DFPIM 28 nm Comparison

The Intel Xeon E5-2683v3 processor used in this evaluation is implemented in a 22 nm FinFET silicon process. It is being compared to the DFPIM in a 28 nm planar process. This gives the server processor a moderate technology advantage. The lower production volume of a PIM logic layer compared to a server processor would favor the lower development and production cost of the 28 nm planar technology, making this a reasonable comparison. The benefits of the two smaller FinFET technologies are shown in Table 5 for applications needing additional performance while maintaining low energy.

The Speedup 28 row shows the execution time on the server compared to that on DFPIM using the 28 nm technology. The large histogram speedup is a result of computing the red, green, and blue pixel components in parallel. The DFPIM ability for single clock-cycle read-modify-write of the scratch pad memories is another factor contributing to the large histogram speedup. The word occurrence count benchmark speedup results from separate, independent character and word processing sections and the DFPIM capability to perform a full word comparison in a single clock. The FFT speedup is achieved by parallelism and pipelining to achieve 14 algorithm operations per clock cycle. The breadth first search benchmark has only a marginal speedup due to its unstructured and limited parallelism. The string match benchmark speedup results primarily from processing the four keys in parallel and independent character and word processing sections. The linear regression benchmark performs all five updates in parallel with the three multiplications pipelined, for eight operations per clock. The standard SHA256 algorithm provides an average of eight operations per clock, but is limited to a three-clock pipeline latency due to data dependencies. Interleaving three benchmark instances increases the speedup for SHAmac3.

Likewise, E-ratio 28 shows the energy consumed on the server compared to the energy consumed on DFPIM using the 28 nm technology. The energy ratio of 334.4 for the FFT indicates that performing an FFT on a server processor requires 334 times the energy of the DFPIM implementation. The speedup is then multiplied with the energy ratio to get a ratio of the energy-delay products of the benchmark implementations. The table also includes the memory bandwidths taken directly from Table 3.
________________________________________________________________________________________________________
Table 4: Server vs 28nm DFPIM comparisons

Hist Word FFT BFS Str M Lin R SHA256 SHAmac SHAmac3


Speedup 28 16.9 7.0 7.9 1.6 4.3 9.3 3.2 3.2 7.5
Server energy (J) 19.250 4.913 49.617 10.507 16.808 23.657 8.425 8.425 8.425
E-ratio 28 737.3 146.3 334.4 128.9 266.5 440.5 525.6 592.3 686.3
S * E-ratio 28 12425 1025 2646 201 1149 4109 1667 1878 5158
Server BW 3127.6 2469.8 1348.1 3105.1 4002.1 3769.6 1604.5 1604.5 1604.5
DFPIM 28 BW 52800.0 17322.8 11704.8 4892.5 17254.9 35200.0 5866.6 5866.6 17599.4

_____________________________________________________________________________________________________

5.5. DFPIM 28 to DFPIM 16 nm and 7 nm Comparison

The 28 nm planar technology has been in production since 2009. The 16 nm technology began production in 2013 and the 7 nm technology began production in 2017. The newer technologies offer both performance and energy efficiency benefits. These benefits are quantified in Table 5. The first section of the table uses all three DFPIM time values from Table 3. The Speedup 16 (Speedup 7) row is the DFPIM 28 execution time divided by the DFPIM 16 (DFPIM 7) execution time. Likewise, E-ratio 16 and E-ratio 7 show the energy comparisons of the 28, 16 and 7 nm technologies.
_____________________________________________________________________________________________________
Table 5. Comparing 28nm, 16nm and 7nm DFPIM results

Hist Word FFT BFS Str M Lin R SHA256 SHAmac SHAmac3


Speedup 16 1.2 1.2 1.2 1.2 1.2 1.2 1.2 1.2 1.2
Speedup 7 1.6 1.6 1.6 1.6 1.6 1.6 1.6 1.6 1.6
E-ratio 16 1.18 1.69 1.55 1.43 1.35 1.27 1.46 1.41 1.34
E-ratio 7 1.41 2.92 2.60 2.31 1.93 1.64 2.39 2.25 1.90
S * E-ratio 16 1.40 2.00 1.83 1.69 1.60 1.50 1.73 1.67 1.59
S * E-ratio 7 2.31 4.78 4.25 3.79 3.15 2.69 3.91 3.68 3.10

______________________________________________________________________________________________________

The speedup, energy ratio, and speedup * energy product ratio factors are multiplicative with the values for the server to DFPIM 28 nm comparison shown in Table 4. Thus, the average speedup of 7.2 for server to DFPIM 28 nm becomes 11.5 (7.2 * 1.6) for the server to DFPIM 7 nm speedup. The energy efficiency of 368 for server to DFPIM 28 nm becomes 810 for server to DFPIM 7 nm.

6. Related Works

We only include works that rely on dataflow-like processing. There are too many proposals for PIM or Near Data Processing that use conventional processing architectures or GPUs.

The Near DRAM Accelerator (NDA) [10] utilizes a dataflow network of functional devices to reduce energy by 46% and increase performance by an average of 1.67 times. The NDA does not include sequencing functional units or scratch pad memories, which DFPIM has shown to be necessary for improved performance in some benchmarks. The NDA connects each accelerator to a single DRAM die rather than the 3D-DRAM stack used by DFPIM. This results in a higher accelerator-to-memory cost ratio, as a single DFPIM can support 4 or 8 DRAM dies.

Gan uses a reconfigurable dataflow architecture [11] to implement stencil operations for atmospheric modeling. The FPGA implemented system achieved a speedup of 18 compared to a server processor. The server processor used 427 Watts, while the FPGA hardware added 523 Watts to achieve the speedup. The overall power efficiency was 8.3. The power required is not suitable for a PIM application, but the performance gain showed the effectiveness of a dataflow implementation.

The Heterogeneous Reconfigurable Logic (HRL) near data processing [12] uses CGRL functional units and bus-based routing as well as dedicated memory load and store units. This paper illustrates the area, performance, and energy advantages of mixed granularity systems such as HRL and DFPIM. The HRL system requires 8 memory stacks to achieve an average 2.5 speedup, while DFPIM gets a 7.2 speedup with a single memory stack. Part of this is attributable to the difference between the 45 nm process of HRL and the 28 nm process of DFPIM. DFPIM uses a flexible, partitioned bus rather than the mesh network of HRL, which may allow more efficient implementation of some dataflow graphs. HRL does not have the programmable state machine for sequencing and depends on the host for looping.

The DySER system integrates dataflow graph processing into the pipeline of a processor, essentially transforming the dataflow graph into a processor instruction [13, 14]. The CPU instruction fetch and single memory access per instruction greatly limit the performance of DySER. Harmonic mean speedup ranged from 1.3 on SPECint benchmarks to 3.8 on GPU benchmarks. Being integrated into the processor pipeline restricts the parallelism and pipelining that a full dataflow construct can provide.

The bundled execution of recurring traces (BERET) research implements basic blocks as a dataflow subgraph in a coprocessor [15]. Each subgraph is executed through the CPU coprocessor interface. A set of eight subgraphs was selected through trace analysis to be implemented. The system resulted in a 19% performance improvement and a 35% energy savings. A coprocessor implementation of a standard eight subgraphs limits the capability of a full dataflow approach.

Single Graph Multiple Flows (SGMF) [40] uses a dynamic dataflow paradigm and CGRL to compare with an Nvidia Fermi streaming multiprocessor. The applications for SGMF are compute intensive applications, so it is not suitable as a PIM. However, the advantages of using dataflow with CGRL are shown in this paper with an average speedup of 2.2 and energy efficiency of 2.6.

The Wave Computing dataflow based neural net accelerator [33] uses an array of small processors to execute basic block instructions. Each processor is assigned a basic block and accepts data from its predecessors and provides data to its successors. The processor contains a 256-entry instruction RAM and a 1 KB data RAM. The network routing forms the dataflow graph. The current implementation of a Wave compute appliance consists of four data processing units per board, multiple boards per chassis and multiple chassis. This is not suitable for a low power PIM implementation.

7. Conclusions

In this paper we described a processing-in-memory accelerator based on the dataflow computing model, and we showed that our system can be used for distributed applications such as Big Data analytics.

We used careful and extensive logic syntheses to obtain execution and energy values for our DFPIM components using 28 nm, 16 nm and 7 nm technologies. We developed a backend to LLVM to generate dataflow graphs from C code kernels identified for PIM processing. These graphs are then used by our simulator, which executes the graph with inputs and generates results. We have verified the correctness of execution by our simulator by comparing against the results generated from an execution on a host processor that uses Intel Xeon cores.

We compared the performance and energy values of DFPIM implementations with those obtained from our baseline host consisting of two 14-core Intel Xeon processors for a variety of benchmarks. We evaluated three different versions of DFPIM using 28 nm, 16 nm and 7 nm technologies. We compared the 28 nm planar version of DFPIM with the host (which uses 22 nm FinFET technology).

The DFPIM concept showed good performance improvement and excellent energy efficiency for the streaming benchmarks that were analyzed. The DFPIM in a 28 nm process with a DFPIM core in each of 16 vaults showed an average speedup of 7.2 over 32 cores in the server system. The server processor required 368 times more energy to execute the benchmarks than the DFPIM implementation. These values result from the parallelism and pipelining available in the DFPIM architecture and the use of low power libraries in the silicon process.

Better performance and higher energy efficiency are possible by using the more recently available 16 nm and 7 nm silicon technologies. The 16 nm technology provides a modest speedup of 1.2 with an energy efficiency improvement of 1.4 compared to the 28 nm. The 7 nm technology provides a speedup of 1.6 with an energy efficiency of 2.2 compared to the 28 nm.

The 7 nm DFPIM implementation has an average speed-up of 11.5 with an energy efficiency ratio of 810 when compared to the server processor system.

Acknowledgements. This research is supported in part by the NSF Net-centric Industry/University Cooperative Research Center and its industrial memberships.

8. References

[1] F. Ahmad, S. Lee, M. Thottethodi, and T. N. Vijaykumar, PUMA: Purdue MapReduce benchmarks suite, Tech. report, Purdue University, 2012.
[2] B. Akin, F. Franchetti, and J. C. Hoe, Data reorganization in memory using 3D-stacked DRAM, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), June 2015, pp. 131-143.
[3] Arvind, Data flow languages and architectures, Proceedings of the 8th Annual Symposium on Computer Architecture (ISCA '81), IEEE Computer Society Press, 1981.
[4] Arvind and D. E. Culler, Dataflow architectures, Annual Review of Computer Science Vol. 1, Annual Reviews Inc., Palo Alto, CA, USA, 1986, pp. 225-253.
[5] Arvind and R. S. Nikhil, Executing a program on the MIT tagged-token dataflow architecture, IEEE Transactions on Computers 39 (1990), no. 3, 300-318.
[6] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer and K. Skadron, Rodinia: A benchmark suite for heterogeneous computing, 2009 IEEE International Symposium on Workload Characterization (IISWC), Oct 2009, pp. 44-54.
[7] Xilinx Corp, UltraScale Architecture and Product Data Sheet: Overview, Jan 2018.
[8] Nvidia Corporation, Nvidia TESLA V100 GPU Architecture, 2017.
[9] Elpida Corp, Elpida begins sample shipments of DDR3 SDRAM (x32) based on TSV stacking technology, 2011.

[10] A. Farmahini-Farahani, J. H. Ahn, K. Morrow, and N. S. Kim, NDA: Near-DRAM acceleration architecture leveraging commodity DRAM devices and standard memory modules, 2015 IEEE International Symposium on High Performance Computer Architecture (HPCA), March 2015, pp. 283-295.
[11] L. Gan, H. Fu, W. Luk, C. Yang, W. Xue, and G. Yang, Solving mesoscale atmospheric dynamics using a reconfigurable dataflow architecture, IEEE Micro 37 (2017), no. 4, 40-50.
[12] M. Gao and C. Kozyrakis, HRL: Efficient and flexible reconfigurable logic for near-data processing, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), March 2016, pp. 126-137.
[13] V. Govindaraju, C. H. Ho, and K. Sankaralingam, Dynamically specialized datapaths for energy efficient computing, 2011 IEEE 17th International Symposium on High Performance Computer Architecture, Feb 2011, pp. 503-514.
[14] V. Govindaraju, C.-H. Ho, T. Nowatzki, J. Chhugani, N. Satish, K. Sankaralingam, and C. Kim, DySER: Unifying functionality and parallelism specialization for energy-efficient computing, IEEE Micro 32 (2012), no. 5, 38-51.
[15] S. Gupta, S. Feng, A. Ansari, S. Mahlke, and D. August, Bundled execution of recurring traces for energy-efficient general purpose processing, Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-44), ACM, New York, NY, USA, 2011, pp. 12-23.
[16] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown, MiBench: A free, commercially representative embedded benchmark suite, 2001 IEEE International Workshop on Workload Characterization (WWC-4), Dec 2001, pp. 3-14.
[17] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 5th ed., Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2011.
[18] S. Huang, J. Huang, J. Dai, T. Xie, and B. Huang, The HiBench benchmark suite: Characterization of the MapReduce-based data analysis, 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW), March 2010, pp. 41-51.
[19] Hybrid Memory Cube Consortium, Hybrid memory cube specification 2.1, 2014.
[20] International Technology Roadmap for Semiconductors, ITRS interconnect working group, 2012 update, 2012.
[21] J. Jeddeloh and B. Keeth, Hybrid memory cube new DRAM architecture increases density and performance, 2012 Symposium on VLSI Technology (VLSIT), June 2012, pp. 87-88.
[22] JEDEC Solid State Technology Association, JESD235A High Bandwidth Memory (HBM) DRAM, 2015.
[23] K. M. Kavi, R. Giorgi, and J. Arul, Scheduled dataflow: execution paradigm, architecture, and performance evaluation, IEEE Transactions on Computers 50 (2001), no. 8, 834-846.
[24] K. M. Kavi, C. Shelor, and D. Pace, Concurrency, synchronization, and speculation - the dataflow way, Advances in Computers 96 (2015), 47-104.
[25] S. W. Keckler, W. J. Dally, B. Khailany, M. Garland, and D. Glasco, GPUs and the future of parallel computing, IEEE Micro 31 (2011), no. 5, 7-17.
[26] C. Lefurgy, K. Rajamani, F. Rawson, W. Felter, M. Kistler, and T. W. Keller, Energy management for commercial servers, Computer 36 (2003), no. 12, 39-48.
[27] Arm Limited, Arm Physical IP, 2017.
[28] LLVM Project, The LLVM compiler infrastructure, 2018.
[29] LLVM Project, Writing an LLVM Pass, 2018.
[30] G. H. Loh, 3D-stacked memory architectures for multi-core processors, 2008 35th International Symposium on Computer Architecture (ISCA), June 2008, pp. 453-464.
[31] Micron Technology, 16Gb: x16 TwinDie single rank DDR4 SDRAM datasheet, Nov 2014.
[32] Micron Technology, HMC high-performance memory brochure, Jun 2016.
[33] C. Nicol, A Dataflow Processing Chip for Training Deep Neural Networks, Hot Chips: A Symposium on High Performance Chips, August 2017.
[34] D. A. Patterson, Latency lags bandwidth, Commun. ACM 47 (2004), no. 10, 71-75.
[35] J. Thomas Pawlowski, Hybrid memory cube (HMC), 2011.
[36] M. Scrbak, M. Islam, K. M. Kavi, M. Ignatowski, and N. Jayasena, Processing-in-Memory: Exploring the Design Space, Architecture of Computing Systems (ARCS 2015), Lecture Notes in Computer Science, vol. 9017, Springer International Publishing, 2015, pp. 43-54.
[37] SK hynix, DRAM HBM products, 2018.
[38] SPEC Benchmarks, https://www.spec.org/benchmarks.html.
[39] P. Fay, T. Willhalm, R. Dementiev, Intel Performance Counter Monitor - A Better Way to Measure CPU Utilization, January 2017.
[40] D. Voitsechov and Y. Etsion, Single-Graph Multiple Flows: Energy Efficient Design Alternative for GPGPUs, Proceedings of the 41st Annual International Symposium on Computer Architecture (ISCA '14), IEEE Press, Piscataway, NJ, USA, 2014, pp. 205-216.
[41] A. White, Exascale Challenges: Applications, Technologies, and Co-design, From Petascale to Exascale: R&D Challenges for HPC Simulation Environments, March 2011.
[42] W. A. Wulf and S. A. McKee, Hitting the Memory Wall: Implications of the Obvious, SIGARCH Comput. Archit. News 23 (1995), no. 1, 20-24.
[43] D. Zhang, N. Jayasena, A. Lyashevsky, J. L. Greathouse, L. Xu, and M. Ignatowski, TOP-PIM: Throughput-oriented Programmable Processing in Memory, Proceedings of the 23rd International Symposium on High-performance Parallel and Distributed Computing (HPDC '14), ACM, New York, NY, USA, 2014, pp. 85-98.
