0% found this document useful (0 votes)
30 views14 pages

A Graph Neural Network Accelerator

The document describes GRIP, a graph neural network accelerator architecture designed for low-latency inference. GRIP splits GNN inference into edge- and vertex-centric phases that are implemented with specialized hardware units. It aims to improve performance over CPUs and GPUs by addressing challenges like irregular memory accesses and limited data reuse that are common in GNN inference.

Uploaded by

guantongpeng
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
30 views14 pages

A Graph Neural Network Accelerator

The document describes GRIP, a graph neural network accelerator architecture designed for low-latency inference. GRIP splits GNN inference into edge- and vertex-centric phases that are implemented with specialized hardware units. It aims to improve performance over CPUs and GPUs by addressing challenges like irregular memory accesses and limited data reuse that are common in GNN inference.

Uploaded by

guantongpeng
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 14

GRIP: A Graph Neural Network Accelerator

Architecture
Kevin Kiningham∗ , Christopher R, Philip Levis
Stanford University
[email protected]

Abstract—We present GRIP, a graph neural network accelera- often sparse and irregular structure of the input graph. This
tor architecture designed for low-latency inference. Accelerating results in many random memory accesses and limited data
GNNs is challenging because they combine two distinct types of reuse, but also requires relatively little computation.
computation: arithmetic-intensive vertex-centric operations and
arXiv:2007.13828v2 [cs.AR] 30 Jul 2020

memory-intensive edge-centric operations. GRIP splits GNN infer- The combination of these two types of computation makes
ence into a fixed set of edge- and vertex-centric execution phases GNN inference inefficient on existing architectures. As a result,
that can be implemented in hardware. We then specialize each GNNs have much higher inference latency than other neural
unit for the unique computational structure found in each phase. networks, limiting them to applications where inference can
For vertex-centric phases, GRIP uses a high performance matrix be pre-computed offline [46]. Most DNN accelerators (e.g.
multiply engine coupled with a dedicated memory subsystem for
weights to improve reuse. For edge-centric phases, GRIP use the TPU [25]) are optimized for dense, regular computation,
multiple parallel prefetch and reduction engines to alleviate the making edge operations hard to implement efficiently [3]. Graph
irregularity in memory accesses. Finally, GRIP supports several analytics accelerators (e.g Graphicionado [20]) are designed
GNN optimizations, including a novel optimization called vertex- for workloads that require little computation per-vertex and
tiling which increases the reuse of weight data. have difficulty exploiting data reuse in vertex-centric operations.
We evaluate GRIP by performing synthesis and place and
route for a 28 nm implementation capable of executing inference Prior work has demonstrated inference on CPUs and GPUs
for several widely-used GNN models (GCN, GraphSAGE, G- is limited by architectural issues, such as cache and memory
GCN, and GIN). Across several benchmark graphs, it reduces bandwidth bottlenecks [19], [45].
99th percentile latency by a geometric mean of 17× and 23× This paper proposes GRIP (GRaph Inference Processor), an
compared to a CPU and GPU baseline, respectively, while accelerator architecture designed for low-latency GNN infer-
drawing only 5 W.
Index Terms—Deep Learning; Hardware Acceleration; ence. GRIP’s programming model is inspired by GReTA [27],
Algorithm-Hardware co-Design; ASIC; a decomposition of GNN inference into a fixed set of edge-
and vertex-centric phases. GRIP implements each phase with
I. I NTRODUCTION separate specialized on-chip memory and execution units. For
example, GRIP alleviates irregularity in the edge-accumulate
Traditional deep neural networks (DNNs) rely on regularly
phase by using multiple parallel prefetch engines to load data.
structured inputs (e.g. vectors, images, or sequences) making
This allows GRIP to support a broader class of GNNs than
them difficult to use in domains where data is naturally
prior work, including emerging models that perform complex
irregular (e.g. user connections on social media). Graph neural
computation per-edge. Finally, GRIP includes hardware support
networks (GNNs) tackle this limitation by extending DNNs to
for several optimizations: caching partitions of feature data,
allow arbitrarily structured graph-valued inputs, where feature
1 inter-phase pipelining, and preloading weights between layers.
vectors are associated with the edges and vertices of a graph .
We also introduce a novel GNN optimization called vertex-
GNNs have found significant success in a range of practical
tiling that substantially improves latency by increasing the
tasks, including surfacing related content on social media [46],
reuse of weight values during inference.
recommending meals on delivery platforms [24], and improving
circuit testability for EDA [32]. A. Contributions
GNNs combine two distinct types of operations [17], [31]: This paper makes the following contributions:
vertex-centric, which are associated with graph vertices, and
1) GRIP, an accelerator architecture for low-latency GNN
edge-centric, which are associated with edges. Vertex-centric
inference. GRIP is efficient across a wide range of models
operations are computationally regular and primarily consist
and has numerous hardware optimizations to improve
of multiplying vertex feature vectors by large weight matrices.
inference latency.
These weights are shared across all vertices, leading to
2) A novel optimization for GNN inference called vertex-
significant opportunities for data reuse. Edge-centric operations
tiling, which improves performance by increasing reuse
are similar to those found in graph analytics (e.g. neighborhood
of weights.
reduction [42]). Their computational structure depends on the
3) A detailed description of a 28 nm implementation of
1 Following the convention in prior work [21], for clarity we call a GNN’s GRIP capable of executing four representative GNNs
input a graph and the GNN itself a network. (GCN, GraphSAGE, G-GCN, and GIN). Evaluated across
B Algorithm 1 Message Passing Layer Forward Pass
A
C Input: Graph G = (V, E); Vertex and edge features hv ,
D h(u,v)
F
E Output: Updated vertex features zv
(a) Input graph. 1: for (u, v) in E do
2: mu,v ← Send(hv , hu , h(u,v) )
B Layer 2
A C D E
3: for v in V do
4: av ← Aggregate({mu,v | u ∈ N (v)})
A B D E F Layer 1 5: zv ← Update(hv , av )
(b) Nodeflow.

Message Passing Layer which messages are considered, typically using a fixed
A size random sample.
A
B • Update combines each vertex’s current value with the
Shared C

D Weights MPL B output of aggregation to produce an updated vector zv .


D
E
E By iteratively applying K of these layers, the final state for
F
1. Send 2. Aggregate 3. Update each vertex captures information about the structure of its
K-hop neighborhood.
(c) Inference dataflow.
Pooling and Readout. Two other layer types are also used
in some GNNs. Pooling defines a method of coarsening a graph
Fig. 1: An example of performing GCN inference on vertex B by combining the features of clusters of vertices. Readout is a
with two layers. The nodeflow (b) describes the propagation special case of pooling which uses a single graph-wide cluster
of features during inference (c). to produce a representation for an entire graph (e.g. for graph
classification). In this paper, we treat both layers as slightly
modified versions of message-passing, where edges connect
several benchmark graphs, our implementation reduces vertices to clusters rather than other vertices.
99th percentile latency by a geometric mean of 17× and Nodeflow. A nodeflow [22] is a bipartite data structure
23× compared to an Intel Xeon CPU and Nvidia P100 that describes how features are propagated during message-
GPU baseline, respectively. passing. It is typically generated during a preprocessing step
before inference, but can also be created on-demand (e.g. for
II. BACKGROUND AND M OTIVATION dynamic graphs). The nodeflow is most useful when performing
A. Graph Neural Networks inference on a subset of the graph since it makes it easy to
determine which edges and vertices are required to update
GNNs [4], [43] are a class of DNN that operate on graph- a specific vertex. It can also be used to separate sampling
valued data. Unlike traditional DNNs, GNNs directly take from inference by precomputing the neighborhood function
advantage of graph structure during learning and inference. and encoding the result directly in the nodeflow.
For example, consider the task of classifying web-pages by In this paper, we denote the nodeflow for a particular layer
topic. A pure content approach (e.g. a classic recurrent neural as the three-tuple (U, V, E), where U is the set of vertices read
network) considers only features derived from a page’s content. during inference, V is the set of vertices updated, and E is a
However, a significant amount of information is stored in the set of edges connecting vertices in U to V . Fig. 1 shows an
structure of links between pages. By modeling these links as example of using the nodeflow to compute inference in a two
a graph, a GNN can natively leverage both page content and layer GNN.
link structure. GNN-based methods have achieved state of
GCN. We use the Graph Convolutional Network [28] (GCN)
the art performance on a diverse set of graph-related tasks,
as a concrete running example of a GNN model. GCN uses
including link prediction [50], vertex classification [46], and
multiple message-passing layers with the following send,
clustering [47].
aggregate, and update operations
Message-Passing Layer. Modern GNNs are typically com-
posed of multiple message-passing layers [17], shown in Alg. 1. mu,v ← hu
The layer takes as input a graph G consisting of a set of vertices
av ← mean({mu,v | u ∈ N (v)})
V and edges E. Each vertex and edge is assigned a feature
vector hv and h(u,v) respectively. Computation is split into zv ← ReLU(W av )
three operations:
where W is a trainable weight matrix. We can rewrite this to
• Send computes a message vector mu,v for each edge. use sparse-dense matrix multiplication (SpMM)
• Aggregate reduces incoming messages for each vertex to
a vector av . The neighborhood function N (v) determines Z ← ReLU(ÂHW ) (1)

2
where  is a sparse matrix derived from the nodeflow and H data rather than loading on demand during execution. Taken
and Z are dense matrices formed from the set of input and together, this gives a significant opportunity for improving
output features respectively. This allows GCN inference to be GNN inference performance.
implemented using operations from highly optimized sparse
matrix libraries, such as Intel MKL [23] or cuSPARSE [35]. III. R ELATED W ORK
DNN Accelerators. A significant number of custom neural
B. Performance Challenges of GNNs
network accelerates have been developed, mostly focused on
To demonstrate the performance challenges of GNNs in dense operations [8], [9], [10], [13], [14], [15], [16], [25],
practice, we implement 2-layer GCN using the SpMM form [30], [41], [49]. However, edge-centric operations are difficult
in Eq. 1. Our implementation uses Tensorflow compiled with to implement efficiently on these architectures [3], which are
Intel MLK run on a single socket of an Intel Xeon E5-2690v4. much more computationally irregular than traditional DNNs.
In Fig. 2, we plot measured performance verses arithmetic GRIP natively supports edge-centric operations by using a
intensity for each vertex in the Pokec dataset. Arithmetic graph-processing based programming model (Sec. IV) and by
intensity depends on the number of unique neighbors that a combination of specialized memory for edge accesses and
must be read during inference, which is determined by the software techniques. In Sec. VIII-F we estimate GRIP to be
local graph structure of each vertex. 2.4× faster than a comparable TPU-like accelerator modified
specifically to improve GNN inference performance.
GCN Accelerators. HyGCN [45] and GraphACT [48]
are two accelerators designed to for graph convolutional
networks, a subclass of graph neural networks. Like GRIP,
these accelerators use separate edge- and vertex-centric units for
GNN computation. GRIP builds on these designs by handling
a much more general set of GNNs that includes models that
use computation associated with edges. This is important for
Fig. 2: CPU performance of GCN inference for vertices in many emerging state-of-the-art GNNs, such as Graph Attention
the Pokec [29] dataset. Bottlenecks in cache bandwidth result Networks [40]. Additionally, GRIP’s support for vertex-tiling
in a significant gap between measured performance and the reduces the amount of weight bandwidth required during vertex-
roofline upper bound. oriented operations. In Sec. VIII-F we estimate this improves
performance by 4.5× compared to HyGCN.
In this dataset, inference performance is theoretically bottle- Graph Analytics Accelerators. Specialized accelerators
necked by off-chip memory bandwidth for all vertices. However, have also been proposed for graph analytics workloads [34],
there is a significant gap between the theoretical upper bound [36], [37]. However, these workloads require relatively little
and the actual measured performance at higher levels of computation per-vertex and typically use scalars rather than
arithmetic intensity. Profiling shows the primary bottleneck large feature vectors. Thus, the computation and memory access
is last level cache bandwidth, a result consistent with prior patterns are very different. In Sec. VIII-F, we estimate GRIP
analysis of GPU performance [19]. In our experiment, the to be 8.1× faster than the approach of Graphicionado [20].
highest arithmetic intensities occur when a vertex appears GNN Optimizations. Many optimizations have been pro-
in multiple neighborhoods and its feature vector can be posed to improve GNN performance. Common techniques
reused. However, if multiple cores are reading or writing a include scheduling computation to reducing the impact of
vertex in parallel, this also results in higher utilization of sparsity [3], [11], [31], improved sampling [7], or eliminating
cache bandwidth. Additionally, features must compete with redundant computation [48]. These techniques are compatible
large weight values that also occupy the cache and consume with GRIP and can be used for additional performance.
bandwidth during the vertex-centric Update operation.
Opportunities for Acceleration. While GNN inference per- IV. P ROGRAMMING M ODEL
formance may be limited on existing hardware, the difficulties GRIP’s programming model is based on GReTA [27], a
described in this section can be overcome with a custom graph-processing abstraction specialized for implementing
architecture. In particular, we propose using separate specialized GNNs. GReTA decomposes GNN layers into four stateless user-
memory and execution units for each edge- and vertex-centric defined functions (UDFs): gather, reduce, transform,
operation. To specialize for vertex-centric operations, we use a and activate. GRIP invokes each UDF in one of three
dedicated high performance matrix-multiplication unit. Weights execution phases: edge-accumulate, vertex-accumulate, and
are stored on-chip in dedicated memory with a level of caching vertex-update. GRIP also allows programs to be composed by
to improve reuse. For edge-centric operations, we prefetch data using the result of one program as the features or accumulator
for multiple edges in parallel and specialize the on-chip feature in another. This flexibility allows a wide range of GNN models
memory to enable fast gather and reduction operations. Finally, to be implemented.
since the nodeflow is known statically, we can also improve off- Data Model. UDFs are restricted in the types of data they
chip access efficiency by scheduling bulk transfers of feature can access to in order to simplify hardware implementation.

3
Algorithm 2 GRIP Program Execution Semantics Edge-Accumulate Vertex-Accumulate Vertex-Update

Input: Layer nodeflow (U, V, E); Vertex data hu and hv ;


Edge data h(u,v) ; Accumulators ev and av ; Weights and
biases W (a) (b)
Output: Updated vertex data zv
1: /* Edge-Accumulate Phase */ Fig. 3: Modifying the GCN Send operation (Eq. 2) requires
2: for (u, v) in E do splitting the layer into two sequential programs (a) and (b).
3: ev = reduce(ev , gather(hu , hv , h(u,v) )) The dashed box in (a) indicates a phase with no computation.
4: /* Vertex-Accumulate Phase */
5: for v in V do
6: av = transform(av , ev , W )
7: /* Vertex-Update Phase */
(a) GraphSAGE [21]
8: for v in V do
9: zv = activate(av )

GRIP programs use four types of data: (1) A nodeflow N F = (b) GIN [44]
(U, V, E) encodes computational structure by defining the
vertices and edges to read and update. (2) Feature vectors hu ,
hv , and h(u,v) associated with nodeflow input vertices, output
vertices, and edges respectively. (3) A set of constant layer (c) G-GCN [2], [5], [33]
weights W . (4) Edge-accumulator ev and vertex-accumulator
av associated with each output vertex.
Fig. 4: GRIP implementation of several GNN models. Plus-
Execution Semantics. UDFs are executed in three phases:
boxes indicate the output of one program is used as the edge
1) Edge-accumulate iterates over nodeflow edges and in- or vertex-accumulator of another. Phases with no associated
vokes gather and reduce. Gather reads features computation are omitted.
associated with an edge to produce a message value.
Reduce accumulates messages sharing an output vertex
into ev . This results in a single value per output vertex. implement this layer by splitting it into two GRIP programs as
2) Vertex-accumulate iterates over each output vertex and shown in Fig. 3. Note that splitting a layer may result in each
combines ev with the previous accumulator state av program iterating over a different nodeflow. For example, the
using transform. Transform is the only UDF program in Fig. 3a iterates over an identity nodeflow where
with access to layer weights and is usually the most all vertices are only self-connected. In Fig. 4, we demonstrate
computationally expensive operation in a layer (e.g. the flexibility of this approach by showing the implementation
matrix multiplication). of a variety of different GNN models.
3) Vertex-update again iterates over each output vertex
and applies activate to av . The activate UDF V. T HE GRIP A RCHITECTURE
typically implements the non-linear operations required GRIP is an accelerator architecture for low-latency GNN
in a layer (e.g. the activation function). This produces a inference. Rather than designing around a specific GNN, GRIP
final updated value for each vertex zv0 . allows users to customize the architecture by implementing
Alg. 2 shows the full execution semantics of a GRIP program. four processing elements (PEs) corresponding to each UDF
A. Layer Implementation of GReTA. This allows GRIP to be used to accelerate a wide
variety of models. In this section, we describe an overview of
The decomposition used by GRIP is expressive enough to
GRIP and the microarchitecture of each execution unit. A high
allow implementing a wide variety of GNNs. Implementing a
level overview of GRIP is shown in Fig. 5.
particular layer is typically straightforward since each phase
naturally maps to the operations of the massage-passing layer A. Overview
introduced in Sec. II. However, some complex models may
Control. GRIP is controlled by a host system that sends
require a layer to be split into multiple programs. This is
commands to execute different operations or transfer data. The
especially true for models that require significant computation
control unit dequeues each command in-order and issues them
per-edge. For example, consider the following modified GCN
asynchronously to individual execution units or the memory
Send operation
controller. Additionally, almost all buffers use double-buffering
mu,v ← W0 hu (2)
to allow overlapping the execution of different operations with
This cannot be mapped directly to gather and reduce moving data between buffers or loading from off-chip. A barrier
since they do not have access to the layer weights. Instead, we command is used to enforce dependencies by preventing new

4
Accumulator
Status Control Tile
Host
Global Weight Buffer Weight Sequencer Control
Cmd. FIFO Unit Buffer
Custom PE

SRAM
Nodeflow Edge Unit Edge Vertex Vertex Update
Memory Control Buffer Accum. Unit Accum. Unit
Off-Chip

Edges

Bank
Pre- Gather Reduce Transform Activate
Bank Bank
DRAM

Features fetch PE PE PE PE

Crossbar


Edges
Bank

Pre- Gather Reduce Transform Activate


Bank Bank
Features fetch PE PE PE PE

Fig. 5: High-level overview of GRIP.

commands from being issued until all previous commands have scale by constant; reduce to be element-wise sum, max, or
completed. Each command also updates a global status register mean; transform to be matrix multiplication followed by
on completion, which can be queried by the host to monitor element-wise sum; and activate to be either ReLU or a
execution. LUT operation which we describe in Sec. V-D. While these
Execution Units. GRIP has three core execution units: the cover most GNN models we investigated, expanding the set
edge unit, the vertex unit, and the update unit. The edge unit of supported operations may be required for other GNNs. We
performs the edge-accumulate phase by iterating over the edges leave exploring other possible implementations for future work.
of the nodeflow, which is stored in the nodeflow buffer. The Memory Controller. The memory controller is responsible
edge unit then reads the associated features, executes gather, for moving data on- and off-chip. Instead of each execution
and finally accumulates the result into the edge accumulator unit issuing requests to the memory controller directly, the
using reduce. host is required to statically schedule memory transfers before
The vertex unit performs the vertex-accumulate phase by execution. This is possible since the set of features required
iterating over the output vertices corresponding to the accu- for inference can be easily determined from the nodeflow. This
mulated edge values. It then executes transform, accumulating also prevents individual units from stalling on external memory
the result into the vertex accumulator. The vertex unit also accesses, but requires scheduling commands such that loading
reads weight values from the tile buffer, which caches tiles of data fully overlaps with execution (Sec. VI-A).
weight values from the global weight buffer. To synchronize the
tile buffer and the vertex unit, the weight sequencer controls B. Edge Unit
iterating over the tiles as described in Section VI-B. Since
weight values are shared across all nodeflow output vertices, Prefetch Lane Reduce Lane
Crossbar

the global weight buffer is only required to be loaded once at Dequeue Iterate Read Read
Gather
Read
Reduce
Write
Source Outgoing Source Dest. Edge Edge
PE PE
the beginning of a GRIP program. Vertex ID Edges Feature Feature Accum. Accum.
P0 P1 P2 R0 R1 R2 R3 R4
Finally, the update unit performs the vertex-update phase by
reading the accumulated values for each vertex and passing
the values to the activate PE. The result is written to the Fig. 6: The edge unit pipeline is split into source-oriented
nodeflow buffer as an updated feature, or to the edge or vertex (P0-P2) and destination-oriented (R0-R4) sections called lanes
accumulator. This allows efficiently passing values between that can be independently replicated. Stage R0 is only used
different GRIP programs when they are executed in sequence. for models that require reading source features.
PE Implementation. GRIP allows users to customize four
PEs corresponding to the UDFs introduced in Sec. IV. These can The edge unit pipeline is split into two distinct halves (Fig. 6).
be implemented in multiple ways depending on the user’s needs. Stages P0-P2 implement prefetch, which iterates over the edges
For example, they could be implemented using a reconfigurable of the nodeflow and reads the features corresponding to the
fabric (e.g. an FPGA) for maximum flexibility. Alternatively, source vertex. The result is passed to reduce (stages R0-R4),
they could be implemented using a model specific circuit to which reads the corresponding destination feature and then
optimize for area or performance. applies gather and reduce, accumulating the result into
Our implementation uses a programmable ALU based the edge accumulator. GRIP allows optionally disabling stage
approach. Since most common GNNs only require a small R0 since most models do not require reading source features.
number of operations in practice, this allows us to support Parallelization. The edge-accumulator value for each output
a range of models on the same hardware while remaining vertex can be computed independently. This means there is
reasonably efficient in practice. Specifically, we allow gather a significant amount of parallelism that can be exploited to
to be identity (e.g. hu or hv ), element-wise sum, product, or improve the performance of the edge unit. A simple method to

5
parallelize execution is to duplicate the elements of the edge A V1 V2
A Time
B A B C D E
unit into N identical copies. Each copy can then be assigned B
A V1 LD
C U1 1,1
C B U1 LD EA
a subset of output vertices to process in parallel (e.g. by a D C U2 LD EA
VA VU
D U2 D U3 LD EA
random hash of the vertex ID). However, since the nodeflow E
E V2 LD
F E U3 F
3,2 U2 LD EA
VA VU
buffer is read every cycle by each lane, this approach requires
adding 2N ports to the nodeflow buffer. (a) (b) (c)
Instead, GRIP duplicates prefetch and reduce into N and
M copies called lanes. Each lane is statically assigned a Fig. 7: An example of a nodeflow (a) and corresponding
partition of input vertices (for prefetch) or output vertices partitions (b) which are processed column-wise. GRIP also
(for reduce). Similarly, edges are assigned to a prefetch lane pipelines transferring feature data with execution (c).
based on the edge’s source vertex. During execution, each
prefetch lane iterates over its assigned edges and reads each
edge’s corresponding feature. It then sends the feature data as a separate table with 33 and 9 entries, respectively. Both
through an N × M crossbar to the reduce lane assigned to the cover overlapping ranges of input: the first level from −2a to
destination vertex. This design restricts each lane to accessing 2a , and the second level from −2b to 2b , where a and b are
only its assigned subset of features and edges, allowing GRIP user configurable values. The LUT entries lineally partition
to partition the nodeflow buffer into N + M separate SRAMs. the range, e.g. entry 0 of level 1 corresponds to −2a , entry 1
As a result, GRIP scales to a much larger number of lanes than corresponds to −2a + 2a+1 /32, etc. To perform an activation
the simpler design. Additionally, our implementation of GRIP computation, the input is first converted to a 16-bit fixed point
extends this scheme to include off-chip memory by storing representation with 4-bits of integer precision. Each level is
feature data pre-partitioned and setting the number of prefetch then checked in series to see if the input falls in its range.
lanes equal to the number of DRAM channels. If so, the closest two LUT values are linearly interpolated
to produce an output. If the values overflow the range for
C. Vertex Unit both levels, the input is either clamped to the closest value
The vertex unit implements the vertex-accumulate phase by in the second level, or a user configured linear function is
iterating over the output vertices and applying transform. used. Additionally, the overflow behavior can be configured for
Our implementation restricts transform to a matrix multipli- both positive and negative inputs independently, allowing the
cation, which we implement using a 16 × 32 weight stationary implementation of non-symmetric activation functions. This
PE array [9]. Each PE contains a 16-bit multiplier, as well as simple approximation covers a large number of activation
a local double buffered weight register. The PE array is broken functions, including sigmoid, which is required for models
into two 16 × 16 blocks. Blocks can be configured to use one such as G-GCN.
of two modes: cooperative, where both blocks operate on the
VI. O PTIMIZATIONS
same vertex, or parallel, where blocks operate on different
vertices in parallel. Parallel mode broadcasts weight values to GRIP implements two major GNN optimizations: execution
both blocks, allowing for slightly lower energy consumption partitioning and vertex-tiling. Execution partitioning describes
at the expense of higher latency when there is only a single a method to split a GRIP program to operate on partitions of
output vertex. a nodeflow, reducing the amount of on-chip memory required.
Unlike many other neural network accelerators, GRIP does GRIP supports pipelining operations on different partitions,
not use a systolic array structure. Instead, GRIP broadcasts improving performance. Vertex-tiling improves the locality of
inputs across the rows of the array and accumulates results weights the vertex-accumulate phase, reducing the memory
down columns using a reduction tree. The entire operation is bandwidth required by the vertex unit. Collectively, these
pipelined to allow multiple matrix operations to occur without optimizations reduce inference latency for GRIP by a significant
stalling, even as weights are transferred in and out of the array. factor.
This results in a significant savings in latency for a single
matrix-vector operations; instead of requiring 16 + 32 = 48 A. Execution Partitioning
cycles, our implementation requires just six (three to distribute A common GNN optimization is to split the graph into
values, one for multiplication, and two for reduction). This also partitions that can be computed on separately [27], [31], [45].
eliminates the buffers required for input skewing in a systolic This reduces the peak amount of on-chip memory required to
design. compute inference since only a portion of the graph must be
loaded at once. GRIP supports a similar optimization we refer
D. Update Unit to as execution partitioning, shown in Fig. 7. First, the user
The update unit iterates over each vector in the vertex partitions the nodeflow offline by splitting the input and output
accumulator and applies activate. Our activate PE allows vertices into fixed chunks of size N and M . Likewise, the
two possible operations: element-wise ReLU and a two-level edges are partitioned into blocks of size N × M , where block
configurable lookup-table (LUT) that can be used to approxi- N Fi,j stores the edges connecting input vertices in chunk Ui to
mate many activation functions. Each LUT level is implemented output vertices in chunk Vj . During inference, GRIP executes

6
edge-accumulate for each partition in a column, skipping blocks the tile buffer and the matrix unit by a factor of 1/m. Thus, by
that are empty. Then, GRIP executes the vertex-accumulate and tuning f and m we can trade-off the required bandwidth with
vertex-update phases once, updating values in the corresponding the amount of storage required for tiles and edge-accumulate
partition of output vertices. This ensures every incoming edge values.
for each output vertex is processed before vertex-accumulate.
Another advantage of execution partitioning is that operations VII. E XPERIMENTAL M ETHODOLOGY
can be pipelined between partitions. GRIP implements two Datasets. Table I describes the properties of the datasets
kinds of pipelining related to partitioning. First, GRIP pipelines chosen for evaluation. Datasets were selected from previous
loading data from off-chip with the edge-accumulate phase. evaluations of GNNs [21], the SNAP project [29], and the UF
This allows overlapping execution with bulk loading feature sparse matrix collection [12]. Included datasets were designed
data for an entire partition. If enough space is available in to be similar to the workloads used by GNNs, as well as provide
the nodeflow buffer, GRIP also optionally caches partition a range of connectivity. We prepossessed each dataset using the
feature data loaded during the processing of the first column same procedure outlined by the authors of GraphSAGE [21].
to avoid reload data while processing later columns. Second, The column “2-Hop” denotes the median number of unique
transferring weights from the global buffer can be pipelined vertices within the 2-hop neighborhood of a vertex picked
with processing an entire column. GRIP performs inter-layer uniformly at random from the graph, taking into account the
pipelining by loading the weights of the next layer while sampling procedure.
processing the last column, and preloads the tile buffer before
processing the first column. TABLE I: Datasets used for evaluation.

B. Vertex-Tiling Dataset Nodes Edges 2-Hop

The bandwidth required to load layer weights can be a Youtube (YT) 1,134,890 2,987,624 25
Livejournal (LJ) 3,997,962 34,681,189 65
significant bottleneck. For example, consider a GCN layer with Pokec (PO) 1,632,803 30,622,564 167
a feature size of 256. Since our implementation of transform Reddit (RD) 232,383 47,396,905 239
cannot hold the entire 1 MB weight matrix locally, new weight
values must be loaded every cycle. At an operating frequency Models. We implemented four GNNs which cover a broad
of 1 GHz, this requires a maximum of 2 TB/s of tile buffer range of different model types: GCN [28], the max variant of
bandwidth, which we found difficult to implement physically. GraphSage [21], GIN [44], and G-GCN [2], [5], [33]. For our
While this could be resolved by increasing the number of neighborhood function, we use the same sampling proceedure
weights stored within the multiplier array, this increases energy as described by the authors of GraphSage. Specifically, we
usage and lacks flexibility since a model with a larger feature deterministically map a given vertex to a fixed-sized, uniform
size would still run into the same limitation. sample of its neighbors. For all models, we use two layers with
sample sizes 25 and 10 for the first and second layer, respec-
Features Edge Accum. Tile Buffer Vertex Accum.
tively. Samples between layers are independent. Additionally,
A
m we use a feature size of 602 (the feature size of the Reddit
=
B
C
D f
M
X o
dataset), a hidden dimension of 512, and an output dimension
E of 256 for all layers.
F
F O Baseline. Our CPU baseline was a dual socket server
containing two, 14-core 2.60 GHz Intel Xeon E5-2690 v4
Fig. 8: Vertex-tiling allows materializing a small tile of edge CPUs, each with four channels of DDR4-2400 memory. We
accumulator values (m × f ) instead of the full M × F matrix. restricted our experiments to a single socket to adhere to
This reduces the memory bandwidth required since a tile of Tensorflow performance guidelines [18] and to avoid latency
weight values can be reused across m vertices. variation resulting from NUMA. In this configuration, we
measured a sustained 1.084 TFlop/s in a matrix multiply
GRIP’s approach is to instead use an optimization we call benchmark (93% of 1.164 TFlop/s theoretical peak) and
vertex-tiling. The key insight of vertex-tiling is that in almost 64.5 GiB/s of off-chip memory bandwidth (84% of 76.8 GiB/s
all cases transform is affine, which allows us to perform theoretical peak).
an optimization similar to tiling matrix multiplication. Fig. 8 We implemented both the baseline and our optimized
shows a graphical representation of this strategy. Here, edge- inference algorithm in Tensorflow v2.0 [1] with eager mode
accumulate produces f elements for m output vertices. This disabled and compiled with the Intel Math Kernel Library [23].
requires storing f ×m elements in the edge accumulator instead To discount the overhead of the Tensorflow library for each
of the full F × M matrix. Then, we run vertex-accumulate model, we measured the time to evaluate an equivalent model
(in this case matrix multiplication), which loads each f × o with all tensor dimensions set to zero and subtract it from the
tile from the tile buffer. We then repeat this process, first for latency measurement. We also perform a warm-up inference
all vertex tiles and then for all weight slices, maximizing the before all measurements to allow Tensorflow to compile and
locality of the weights. This reduces the bandwidth between optimize the network.

7
TABLE II: Architectural characteristics of baseline and GRIP. TABLE III: 99%-ile inference latency for GRIP, CPU, and
GPU
CPU GRIP
1.164 TOP/s 1.088 TOP/s CPU GPU
Compute
@ 2.6 GHz @ 1.0 GHz
Model Dataset GRIP µs × µs ×
L1D: 14 × 32 KiB Nodeflow: 4 × 20 KiB
On-chip youtube 15.4 309.2 (20.1) 1082.4 (70.5)
L2: 14 × 256 KiB Tile: 2 × 64 KiB
memory livejournal 15.8 466.8 (29.5) 1313.6 (83.1)
LLC: 35 MiB Weight: 2 MiB GCN
pokec 16.0 477.1 (29.8) 1085.6 (67.7)
Off-chip 4× DDR4-2400 4× DDR4-2400 reddit 16.3 407.1 (25.0) 813.2 (50.0)
memory 76.8 GiB/s 76.8 GiB/s
youtube 134.1 2315.9 (17.3) 1332.5 ( 9.9)
Total Area 306.18 mm2 11.27 mm2
livejournal 146.3 2493.2 (17.0) 1837.6 (12.6)
G-GCN
Power 135 W 4.9 W pokec 146.7 2637.9 (18.0) 1409.2 ( 9.6)
reddit 147.0 2864.2 (19.5) 1133.9 ( 7.7)
youtube 113.7 1545.1 (13.6) 1309.0 (11.5)
ASIC Synthesis: We implemented GRIP in SystemVerilog, GS
livejournal 124.4 1947.4 (15.7) 2193.8 (17.6)
choosing the architectural parameters to have similar compute pokec 124.9 2075.7 (16.6) 1759.1 (14.1)
reddit 125.3 2099.0 (16.8) 1252.8 (10.0)
and memory bandwidth as our CPU baseline (Table II).
The implementation uses 16-bit fixed point, which maintains youtube 30.5 344.7 (11.3) 1387.6 (45.5)
suitable inference accuracy in the models we evaluate. We then livejournal 30.9 416.1 (13.5) 1221.5 (39.5)
GIN
pokec 31.1 340.7 (10.9) 855.5 (27.5)
performed synthesis and place and route in a 28 nm CMOS reddit 31.4 354.8 (11.3) 1009.4 (32.2)
process, targeting a 1 GHz operating frequency and worst case
PVT corner. The critical path of GRIP was determined to
be 0.93 ns, inside the weight SRAMs. Power estimates of
each unit was performed by generating activity factors from a
cycle accurate simulation of our implementation and applying
them to our synthesized design. We used Cacti v6.5 [39] to computation during the Update step of the message-passing
estimate the area and power of the SRAM memories. We also layer. For example, GIN’s Update uses a two-layer MLP that
integrated Ramulator [26] into our simulator to estimate DRAM requires roughly double the computation of GCN’s single
timings and produce a command trace. These traces were fed matrix multiplication. However, the additional computation
to DRAMPower [6] to estimate DRAM power. results in similar overall CPU inference latency since our
implementation is largely bottlenecked by non-computational
VIII. E VALUATION factors (Sec. II-B). This results in GRIP achieving a smaller
GRIP aims to accelerate GNN inference for a wide range of performance improvement of 10.9-13.5× compared to an
models, specifically targeting low latency. We evaluate this by improvement of more than 13.6× for all other models.
measuring overall inference latency for four different models
and compare to a CPU and GPU baseline (Sec. VIII-A). To Performance vs. GPU. Practical deployments of online
better understand GRIP’s performance, we then breakdown GNN inference most often use CPUs due to the large memory
the contribution of each architectual feature (Sec. VIII-B) requirements for graph features and low utilization at small
and how the overall speedup changes as we modify both batch sizes. However, for completeness we also benchmark
architectural (Sec. VIII-C) and model parameters (Sec. VIII-D). GRIP against an Nvidia P100 GPU implementation for each
We also measure the impact of each GNN optimization we model. GRIP’s speedup on GPU ranges from 83.1× (Live-
implemented (Sec. VIII-E). Finally, we compare GRIP to journal, GCN) to 7.7× (Reddit, G-GCN) with a geometric
alternative approaches (Sec. VIII-F) and present a breakdown mean of 23.4×. For models with relatively low overall latency
of energy consumption during inference (Sec. VIII-G). (GCN, GIN) we see a significantly higher speedup than
with our CPU implementation. This is largely due to the
A. Overall Performance overhead of transferring embeddings from host to GPU memory
To evaluate GRIP’s overall performance, we measured the (roughly 200-500 µs, depending on the neighborhood size)
total end-to-end execution time (latency) to compute inference which comprises a large portion of the overall execution time
with a variety of models and datasets. Table III shows GRIP’s for models like GCN (25%-50% of total latency). GRIP does
inference latency and speedup versus our CPU and GPU not incur this penalty since features and weights are already
implementation. We use 99th percentile latency for consistency stored in device DRAM and do not have to be transferred
with prior evaluations of inference performance [38]. from the host. On models with a higher total execution time
Performance vs. CPU. Compared to our CPU implementa- (e.g. G-GCN), GRIP still achieves a significant speedup due
tion, GRIP achieves a latency improvement of between 29.8× to low GPU utilization. With a batch size of 1, there is not
(GCN, Pokec) and 10.9× (GIN, Pokec) with a geometric mean sufficient computation during each layer to fully utilize the
of 17.0× across all datasets and models. GRIP tends to give a computational resources of the GPU and overhead of launching
smaller speedup on models that perform a larger portion of their each kernel tends to dominate.

8
Baseline 1.0 Baseline 1.0 1.5 1.5 1.5 1.5
+ Split Mem. 3.0 Graphici. 2.4

Speedup
+ Edg. Unit 10.3 HyGCN 4.4 1 1 1 1
+ Vert. Unit 19.2 TPU+ 11.3
+ Other 19.5 GRIP 19.5 0.5 0.5 0.5 0.5
0 5 10 15 20 0 4 8 12 16 20 1 2 4 8 16 32 128 512 8 32 128 ¼ ½ 1 2 4
Speedup Speedup # DRAM Chan. Weight GiB/s XBar Width Mat. TOP/s
(a) Speedup breakdown for each (b) Estimated speedups of (a) (b) (c) (d)
component of GRIP versus baseline. prior work versus baseline and Fig. 10: Impact of scaling architectural parameters. Dashed
GRIP. vertical line indicates our implementation’s parameters. In
Fig. 10a the number of edge unit lanes is kept equal to the
Fig. 9: Breakdown of performance improvements. number of channels.

B. Breakdown of Performance due to increased crossbar bandwidth after adjusting the number
of fetch and gather units (1.14×), the majority of the speedup
In this subsection, we breakdown the performance impact of is due to allowing loading data, edge-accumulate, and vertex-
each architectural feature of GRIP. Specifically, we modify our accumulate phases to overlap by using a dedicated unit for
cycle-accurate simulator to match the bottlenecks exhibited by each phase (2.97×). Third, we enable the vertex unit and revert
our CPU implementation and then progressively remove each to using a single 16 × 32 matrix multiply unit, resulting in
modification to measure the impact of different units. As a an additional 1.87× speedup. This is due to increased overall
performance benchmark, we use the geometric mean speedup TOP/s (1.63×) and using a single unit rather than multiple
of GCN for the largest neighborhood in each dataset. units, which allows units to not be wasted when the overall
Baseline Configuration. Our baseline configuration em- number of output vertices is small (1.15×). Finally, separating
ulates each core being assigned independent vertices and and pipelining the update unit produces a small speedup of
performing all GReTA phases, with weights and partition 1.02×.
data being first loaded into L3 cache and intermediate values
accumulated directly in L2. This results in the following C. Architectural Parameters
simulator modifications. First, we modify our vertex unit to Here, we discuss the impact of several high level architectural
use 14, 8 × 2 matrix multiply units, with each unit assigned parameters on inference performance.
independent vertices within a partition. This emulates the Number of DRAM Channels. The number of DRAM
effect of each CPU core using two 8-element SIMD units. channels determines the overall memory bandwidth available
Second, we increase the number of fetch and gather units to to transfer data on- and off-chip. In Fig. 10a, we observe
14 and the crossbar width to 32 bytes, matching the number of that GRIP’s performance is strongly related to the number of
cores and L2 cache bandwidth, respectively. We also disable channels until around 8 channels (∼150 GiB/s). This indicates
pipelining between the edge and vertex units to emulate a that GRIP’s performance is primarily limited by off-chip
single core performing both functions. Third, we merge the memory bandwidth.
weight and nodeflow buffers into a single SRAM and limit Weight Bandwidth. The weight bandwidth determines how
the maximum read bandwidth to 16 bytes per cycle per fetch many values can be read from the global weight buffer each
unit, matching the bandwidth of the L3 cache. Finally, we cycle. If this is set too low, loading weight values can become
disable pipelining between the vertex and update unit to a bottleneck during vertex-accumulate. We observe this effect
model both operations being performed by the same core. in Fig. 10b below 128 GiB/s, which corresponds to loading
This configuration overestimates the performance of the CPU 64 weight values each cycle.
in practice since it models ideal performance and no additional Crossbar Port Width. The crossbar port width determines
computation required for auxiliary operations, such as indexing the number of elements accumulated by each gather unit in a
calculations. In particular, with a 2.6 GHz clock and an element single cycle. In our experiments, the average number of edges
width of 4-bytes, our model is 2.07× faster than the measured per vertex is fairly small (sampled to be less than 25). Since
CPU latency. edge-accumulate typically takes much less time than vertex-
Breakdown. In Fig. 9a, we show the impact of different accumulate or loading data from DRAM, increasing the width
units in GRIP by progressively removing each modification has a limited impact on performance (Fig. 10c). However, it
from our baseline in reverse order. First, we split the weight is preferable to over-allocate the crossbar width in order to
and nodeflow memories into separate SRAMs. This results in ensure high performance even on dense nodeflows.
a 2.8× speedup due to removing contention between fetching Matrix Multiply TOP/s. The total number of TOP/s GRIP
features and weights from the same SRAM (2.0×), as well can achieve is determined primarily by the size of the matrix
doubling the bandwidth available to load weight values into multiply unit. In Fig. 10d we see that performance is strongly
the vertex unit (1.4×). Second, we add the edge unit, resulting related to the size of this unit, until reaching around 2 TOP/s at
in an additional improvement of 3.4×. While this is partially which point GRIP is limited by memory bandwidth. Thus, our

9
100 100 35
% Total Time Input 1.8 99th

% Total Time
Output median

Normalized Latency
MatMul 75 75 min 30

Gather
50 50 1.6

Speedup
25 25
25 1.4
0 0 20
1 32 64 128 256 512 1K 2K 4K 1 2 4 8 16 32 64 128 1.2
Dimension Size # Sampled Edges 1.0
15
(a) Impact of feature dimensions. (b) Impact of sampling. 50 100 150 200 250 50 100 150 200 250
Neighborhood Size Neighborhood Size
(a) (b)
Fig. 11: The impact of scaling different GCN parameters on
the balance of time spent in each operation. Scaling the output
feature size increases the amount of time spent performing Fig. 12: Impact of different neighborhood sizes on latency
matrix multiplication, while increasing the number of edges for the GCN model. GRIP’s latency linearly increases with
decreases it. neighborhood size due to more computation being required for
inference. The speedup is roughly constant until a neighborhood
size of about 95, at which point intermediate values no longer
implementation of GRIP would see a relatively small benefit fit into the cache of a single CPU core.
from a substantially larger matrix unit (1.14× for a 4× larger
unit).
size on performance, we plot GRIP’s minimum, median, and
D. Model Parameters 99th percentile inference latency for GCN across different
A key aspect GRIP’s design is balancing the performance neighborhoods of the LiveJournal dataset in Fig. 12a. The
between GReTA’s edge and vertex-centric phases. Here, we result is a strong linear relationship between the neighborhood
evaluate how this balance changes as the parameters of the size and latency across the entire distribution. Each vertex added
GNN model are altered. to the neighborhood results in a roughly constant increase in
Feature Dimensions. In Fig. 11a, we evaluate how varying the amount of work during inference. Additionally, we observe
the number of the input and output features impacts the percent that as the neighborhood size increases, the median latency
of time spent in matrix multiplication. The proportion is initially moves closer to the 99th percentile. This is the result of larger
low (∼8%) for small features and increases linearly until neighborhoods being more likely to be densely connected,
32 features. This is due to the fact that when the feature leading to a larger number of reductions that must be computed.
size is smaller than the native width of the DRAM interface, In Fig. 12, we evaluate the latency speedup compared to
DRAM bandwidth is poorly utilized due to many random the CPU baseline across different neighborhood sizes. Below
accesses. In our implementation, we use two dual-channel a neighborhood size of 95, we see a roughly constant speedup
DRAM controllers, which each have an interface of 64 2- of between 12× and 18×. For these neighborhood sizes, all
byte elements. Above this point, the proportion of time spent intermediate values fit into the L1 and L2 cache of a single
performing vertex-accumulation stays flat, reflecting the fact CPU core. After this point, some feature values must be stored
that each additional feature results in a constant amount of in the L3 cache and inference performance becomes limited
additional computation during inference. However, this analysis by the cache bandwidth (Sec. II-B).
does not hold for the output features, which can be increased
E. Optimizations
without needing to increase the number of values loaded from
DRAM. We see that increasing the output feature size always
increases the percent of time performing vertex-accumulate. Baseline 1.0 256 1.1 3.7 5.2 7.4 6.6
Thus, models with large output feature sizes are likely to be 128 1.1 3.7 5.3 8.0 7.3
Tile Size

+ Cache 1.3
limited by compute rather than memory.
+ Pipeline 1.7 64 1.1 3.8 5.4 8.0 7.3
Sampled Edges. Another important model parameter is
+ Weight 2.5 32 0.6 2.1 3.1 5.2 4.9
the number of sampled edges per output vertex. In Fig. 11b, Preload
we evaluate how the number of edges impacts the percent 0 1 2 3 1 4 8 12 16
Speedup # Vertices
of total time spent performing edge-accumulate. For less
(a) Impact of pipelining (b) Speedup of vertex-tiling
than 8 edges per vertex, GRIP’s performance is mostly
limited by computation and overhead related to accessing
data from DRAM. Above this threshold, the memory and Fig. 13: Impact of partitioning and tiling optimizations.
crossbar bandwidth becomes a bottleneck, and GRIP spends
an increasing portion of execution loading data. In this subsection, we evaluate the impact of each optimiza-
Neighborhood Size. The neighborhood size heavily impacts tion used by GRIP.
GRIP’s overall latency and is influenced by local graph Partitioning and pipelining. In Fig. 13a we show the cu-
structure. To demonstrate the impact of the neighborhood mulative speedups of each optimization enabled by partitioning.

10
We compare to an unoptimized baseline, where feature values vectors. This allows GRIP to use a roughly 10, 000× smaller
are loaded from off-chip on demand and no pipelining exists buffer (1.5 KiB) while achieving comparable performance.
between stages. First, by caching feature data on-chip, GRIP We demonstrate these limitations by modifying our simulator
achieves a 1.3× speedup. This is due to the decreased memory to emulate the HyGCN approach. Specifically, we set the
traffic required to reload data between partition columns number of gather and fetch units to 1 and the crossbar width
and by the improved throughput from bulk loading data for to 256 to match the number of SIMD lanes. We disable all
an entire partition. Second, pipelining operations between tiling and force feature vectors to be fully accumulated before
different partitions results in an additional 1.3× speedup due vertex-accumulate. We then set all other parameters to be the
to overlapping execution with memory transfers. Finally, we same as GRIP, including the same partitioning used in our
can also pipeline the transfer of weights from the global weight evaluation of GRIP.
buffer into the vertex-update unit. This increases the overall This configuration results in a speedup of 4.4× the baseline,
speed-up to a total of 2.5×. shown in Fig. 9b. However, it performs 4.5× slower than
Vertex-Tiling. In Fig. 13b, we show the speedup compared GRIP due to limits in the available on-chip memory bandwidth
to no tiling as we alter the two tiling parameters M (the number for weights. Incorporating vertex tiling would allow for a
of vertices in a tile) and F (the number of elements per vertex). much smaller edge accumulate buffer and reduce the required
We see that performance generally reaches a maximum around bandwidth by increasing the reuse of the weights.
F = 64 elements. Above F = 64, increasing F causes the Modified TPU. The TPU [25] is a DNN accelerator designed
vertex unit to stall more often while waiting for a tile to be around a large 2-D systolic array. Unfortunately, GNNs are
produced by the edge unit. The performance degradation is difficult to implement efficiently for the TPU due to a lack of
not linear because the time taken to accumulate a tile depends support for edge-oriented operations [3]. Instead, we compare
on the connectedness of the nodeflow. We also see degraded GRIP to a modified version of the TPU architecture that
performance below F = 64. This is because F features addresses this limitation by incorporating features from GRIP.
are loaded from memory for each vertex. As F decreases, We refer to this modified design as the TPU+.
more random DRAM accesses are required to load features, Specifically, the TPU+ has an additional unit similar to
degrading DRAM throughput. Increasing M also increases GRIP’s edge-unit between the TPU’s unified buffer and the
performance until around 12 vertices. The maximum number systolic data setup. This allows the TPU+ to natively support
of output vertices in our model is 11. Increasing M beyond the GReTA programming model by mapping edge-accumulate
this only adds additional latency associated with processing onto the new edge-unit, vertex-accumulate onto TPU’s systolic
empty dummy vertices. array, and vertex-update onto the activation pipeline. This
F. Comparisons to Prior Work

Several other approaches have been proposed to accelerate neural networks and graph algorithms. Here, we analyze the bottlenecks present in each approach and compare performance with GRIP.

HyGCN. HyGCN [45] is an accelerator designed for graph convolutional networks, a subset of GNNs that do not have computation associated with edges. HyGCN and GRIP take a similar approach of using separate units for edge- and vertex-centric operations. However, GRIP addresses two major bottlenecks present in the HyGCN design.

First, HyGCN uses 32 8-lane SIMD units to perform edge-oriented operations, but can only issue a single edge at a time. This means the throughput of edge operations will be limited when the number of features is smaller than the total number of SIMD lanes. In contrast, GRIP allows for multiple edges to be issued in parallel.
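The effect of single-edge issue can be approximated with a quick back-of-the-envelope utilization estimate. The figures below are our own illustration based only on the lane counts stated above, not numbers reported by the HyGCN authors.

    # Rough lane-utilization estimate for a design with 32 SIMD units x 8
    # lanes = 256 lanes that issues one edge at a time. Feature sizes are
    # illustrative.
    def edge_lane_utilization(feature_size, total_lanes=32 * 8):
        # A single edge can occupy at most `feature_size` lanes per issue.
        return min(feature_size, total_lanes) / total_lanes

    for f in (32, 64, 128, 256):
        print(f"features={f:4d}  utilization={edge_lane_utilization(f):.0%}")
    # features=  32  utilization=12%
    # features=  64  utilization=25%
    # features= 128  utilization=50%
    # features= 256  utilization=100%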
Second, HyGCN requires an entire feature vector to be computed and stored before performing vertex-oriented operations. In order to process multiple vertices in parallel, this requires a large buffer to store accumulated values (16 MB in the HyGCN implementation). The size of this buffer is also reported by the HyGCN authors to have a significant impact on their overall performance (1.3–4× worse performance for a 16× smaller buffer). In contrast, GRIP uses vertex-tiling to only store a small number of elements from multiple feature vectors.

We estimate the impact of these bottlenecks by modifying our cycle-accurate model: we reduce the number of gather and fetch units to 1 and the crossbar width to 256 to match the number of SIMD lanes. We disable all tiling and force feature vectors to be fully accumulated before vertex-accumulate. We then set all other parameters to be the same as GRIP, including the same partitioning used in our evaluation of GRIP. This configuration results in a speedup of 4.4× over the baseline, shown in Fig. 9b. However, it performs 4.5× slower than GRIP due to limits in the available on-chip memory bandwidth for weights. Incorporating vertex-tiling would allow for a much smaller edge-accumulate buffer and reduce the required bandwidth by increasing the reuse of the weights.

Modified TPU. The TPU [25] is a DNN accelerator designed around a large 2-D systolic array. Unfortunately, GNNs are difficult to implement efficiently for the TPU due to a lack of support for edge-oriented operations [3]. Instead, we compare GRIP to a modified version of the TPU architecture that addresses this limitation by incorporating features from GRIP. We refer to this modified design as the TPU+.

Specifically, the TPU+ has an additional unit similar to GRIP's edge-unit between the TPU's unified buffer and the systolic data setup. This allows the TPU+ to natively support the GReTA programming model by mapping edge-accumulate onto the new edge-unit, vertex-accumulate onto the TPU's systolic array, and vertex-update onto the activation pipeline. This design also supports both the execution partitioning and vertex-tiling optimizations described in Sec. VI.
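For reference, the sketch below is a software analogue of this mapping: one GCN-style layer is decomposed into the three GReTA phases, each of which corresponds to a TPU+ unit named above (edge-accumulate on the new edge unit, vertex-accumulate on the systolic array, vertex-update on the activation pipeline). The nodeflow encoding, shapes, and function names are assumptions made for illustration only.

    # Illustrative GReTA-style decomposition of one GCN-like layer.
    import numpy as np

    def edge_accumulate(features, edges, num_dst):
        # Irregular, memory-bound phase: sum source features into each
        # destination vertex (maps onto the added edge unit).
        acc = np.zeros((num_dst, features.shape[1]), dtype=features.dtype)
        for src, dst in edges:
            acc[dst] += features[src]
        return acc

    def vertex_accumulate(acc, weights):
        # Dense, regular matmul (maps onto the systolic array).
        return acc @ weights

    def vertex_update(z, bias):
        # Element-wise bias + ReLU (maps onto the activation pipeline).
        return np.maximum(z + bias, 0.0)

    # Toy example: 5 source vertices, 3 destination vertices, 4 edges.
    feats = np.ones((5, 8), dtype=np.float32)
    edges = [(0, 0), (1, 0), (2, 1), (4, 2)]
    w = np.full((8, 4), 0.5, dtype=np.float32)
    b = np.zeros(4, dtype=np.float32)
    out = vertex_update(vertex_accumulate(edge_accumulate(feats, edges, 3), w), b)
    print(out.shape)  # (3, 4)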
We estimate the performance of the TPU+ by modifying our cycle-accurate model to use a single fetch and gather unit. We also replace the vertex-unit with an identically sized 16 × 32 systolic array. As in the original TPU design, weights are stored off-chip and the dedicated weight bandwidth is limited to 30 GiB/s. All other parameters remain unchanged compared to our evaluation of GRIP, including the use of 4× DDR4-2400 for off-chip memory and the same partitioning and vertex-tiling optimizations.

This configuration achieves an 11.3× speedup (Fig. 9b) compared to our baseline in Sec. VIII-B. The main bottleneck in this approach is the limited bandwidth dedicated to weights. Moving weights on-chip as in GRIP results in a 1.72× speedup. Higher-performance memory for weights (e.g., HBM as used by later versions of the TPU) could also address this bottleneck. However, we leave a fuller exploration for future work.

Graphicionado. Graphicionado [20] is an accelerator architecture designed for graph analytics. Like GRIP, Graphicionado allows several units to be specialized for a particular algorithm, such as GCN inference. However, it is designed for algorithms that use a small amount of state per vertex. As a result, it suffers from two bottlenecks. First, like HyGCN, it cannot perform vertex-tiling since it requires full feature vectors to be accumulated. This results in a bottleneck similar to HyGCN, since weight data cannot be easily reused between different vertices. Second, each lane has an independent vertex unit instead of using a single shared unit, increasing the required weight bandwidth by an amount proportional to the number of lanes.

We estimate the impact of these bottlenecks by modifying our simulator: we disable tiling and split the vertex unit into two lanes that share a single tile buffer port. We also use the same partitioning scheme used for GRIP. This configuration results in a small speedup of 2.4× over the baseline, shown in Fig. 9b. However, this is 8.1× slower than GRIP due to significant bottlenecks in weight bandwidth.
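A simple estimate illustrates why per-lane vertex units inflate the required weight bandwidth: without a shared vertex unit (or vertex-tiling), every lane must stream its own copy of the layer weights. All values below are hypothetical and are not measurements of Graphicionado or GRIP.

    # Rough model of weight traffic scaling with the number of independent
    # vertex-unit lanes. Matrix size and layer time are made-up examples.
    def weight_bandwidth_gib_s(lanes, weight_bytes, layer_time_s):
        # Each lane re-reads the full weight matrix once per layer.
        return lanes * weight_bytes / layer_time_s / 2**30

    weight_bytes = 256 * 256 * 2    # a hypothetical 256x256 fp16 weight matrix
    layer_time = 50e-6              # a hypothetical 50 us layer execution time
    for lanes in (1, 2, 8):
        bw = weight_bandwidth_gib_s(lanes, weight_bytes, layer_time)
        print(f"lanes={lanes}: ~{bw:.1f} GiB/s of weight traffic")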
G. Energy

Table IV shows the power consumption for each of GRIP's core top-level modules during GCN inference. The single most energy-intensive operation during inference is loading embeddings from DRAM, which consumes more than the rest of the accelerator combined (53.7%). This is because both the number of vertices and the feature size are largest at the input of the GCN, leading to more data being loaded in the first layer. Additionally, GRIP optimizes for latency with four high-performance DRAM channels, requiring a large amount of energy per transfer. The rest of the energy is mostly used by loading weights from the global weight and nodeflow buffers. Both are fairly large, leading to a high energy cost per read and write. In total, GRIP uses just 4.9 W, a significant improvement over the 135 W TDP of the baseline CPU.
TABLE IV: Breakdown of power for GCN inference.

                        Module        mW      (%)
    Execution Units     Edge           4.1     0.1
                        Vertex       656.6    12.6
                        Update         0.4   < 0.1
    SRAM                Weight      1476.7    28.3
                        Nodeflow     269.5     5.1
    DRAM                -           2794.7    53.7
    Total                           4932.4     100
IX. CONCLUSION

GNNs represent a promising new method in machine learning to learn directly from graph-structured data. However, the computational costs of GNNs represent a significant barrier for deployment in many applications, especially in the scenario of online inference.

This paper presents GRIP, an accelerator architecture designed for low-latency GNN inference. GRIP splits GNN operations into a series of edge- and vertex-centric phases. Each phase is implemented independently in hardware, allowing for specialization of both the memory subsystem and execution units to improve performance. Additionally, GRIP has hardware support for several optimizations that further reduce latency, including pipelining operations between nodeflow partitions and vertex-tiling. We then implement GRIP as a 28 nm ASIC capable of executing a range of different GNNs. On a variety of real graphs, our implementation improves 99th-percentile latency by a geometric mean of 17× and 23× compared to a CPU and GPU baseline, respectively, while drawing only 5 W.
REFERENCES

[1] M. Abadi et al., "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015. Software available from http://tensorflow.org/.
[2] M. Allamanis, M. Brockschmidt, and M. Khademi, "Learning to represent programs with graphs," in ICLR, 2018.
[3] M. Balog, B. van Merriënboer, S. Moitra, Y. Li, and D. Tarlow, "Fast training of sparse graph neural networks on dense hardware," arXiv preprint arXiv:1906.11786, 2019.
[4] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner et al., "Relational inductive biases, deep learning, and graph networks," arXiv preprint arXiv:1806.01261, 2018.
[5] X. Bresson and T. Laurent, "Residual gated graph convnets," arXiv preprint arXiv:1711.07553, 2017.
[6] K. Chandrasekar, C. Weis, Y. Li, S. Goossens, M. Jung, O. Naji, B. Akesson, N. Wehn, and K. Goossens, "DRAMPower: Open-source DRAM power & energy estimation tool," http://www.drampower.info, 2012.
[7] J. Chen, T. Ma, and C. Xiao, "FastGCN: Fast learning with graph convolutional networks via importance sampling," in ICLR, 2018.
[8] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, "DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning," in ASPLOS, 2014, pp. 269–284.
[9] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2017.
[10] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, "DaDianNao: A machine-learning supercomputer," in MICRO, 2014, pp. 609–622.
[11] W.-L. Chiang, X. Liu, S. Si, Y. Li, S. Bengio, and C.-J. Hsieh, "Cluster-GCN: An efficient algorithm for training deep and large graph convolutional networks," in KDD, 2019, pp. 257–266.
[12] T. A. Davis and Y. Hu, "The University of Florida sparse matrix collection," ACM Transactions on Mathematical Software, vol. 38, no. 1, p. 1, 2011.
[13] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, "ShiDianNao: Shifting vision processing closer to the sensor," in ISCA, 2015, pp. 92–104.
[14] C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culurciello, and Y. LeCun, "NeuFlow: A runtime reconfigurable dataflow processor for vision," in CVPR Workshops, 2011, pp. 109–116.
[15] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, "TETRIS: Scalable and efficient neural network acceleration with 3D memory," in ASPLOS, 2017, pp. 751–764.
[16] M. Gao, X. Yang, J. Pu, M. Horowitz, and C. Kozyrakis, "TANGRAM: Optimized coarse-grained dataflow for scalable NN accelerators," in ASPLOS, 2019, pp. 807–820.
[17] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, "Neural message passing for quantum chemistry," in ICML, 2017, pp. 1263–1272.
[18] N. Greeneltch and J. X, "Maximize TensorFlow performance on CPU: Considerations and recommendations for inference workloads," https://software.intel.com/en-us/articles/maximize-tensorflow-performance-on-cpu-considerations-and-recommendations-for-inference, 2019.
[19] Gunrock Developers, "Hive workflow report for GraphSage GPU implementation," https://gunrock.github.io/docs/hive/hive_graphSage.html, accessed 2020-02-20.
[20] T. J. Ham, L. Wu, N. Sundaram, N. Satish, and M. Martonosi, "Graphicionado: A high-performance and energy-efficient accelerator for graph analytics," in MICRO, 2016, pp. 1–13.
[21] W. Hamilton, Z. Ying, and J. Leskovec, "Inductive representation learning on large graphs," in Advances in Neural Information Processing Systems, 2017, pp. 1024–1034.
[22] Z. Huang, D. Zheng, Q. Gan, J. Zhou, and Z. Zhang, "Nodeflow and sampling," https://doc.dgl.ai/tutorials/models/5_giant_graph/1_sampling_mx.html#nodeflow, 2019, accessed 2020-01-01.
[23] Intel Corporation, Intel Math Kernel Library Reference Manual. Santa Clara, USA: Intel Corporation, 2019.
[24] A. Jain, I. Liu, A. Sarda, and P. Molino, "Food discovery with Uber Eats: Using graph learning to power recommendations," https://eng.uber.com/uber-eats-graph-learning/, accessed 2020-02-20.
[25] N. P. Jouppi et al., "In-datacenter performance analysis of a tensor processing unit," in ISCA, 2017, pp. 1–12.
[26] Y. Kim, W. Yang, and O. Mutlu, "Ramulator: A fast and extensible DRAM simulator," IEEE Computer Architecture Letters, vol. 15, no. 1, pp. 45–49, 2016.
[27] K. Kiningham, P. Levis, and C. Re, "GReTA: Hardware optimized graph processing for GNNs," in Proceedings of the Workshop on Resource-Constrained Machine Learning (ReCoML), 2020.
[28] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," in ICLR, 2017.
[29] J. Leskovec and A. Krevl, "SNAP Datasets: Stanford large network dataset collection," http://snap.stanford.edu/data, Jun. 2014.
[30] W. Lu, G. Yan, J. Li, S. Gong, Y. Han, and X. Li, "FlexFlow: A flexible dataflow accelerator architecture for convolutional neural networks," in HPCA, 2017, pp. 553–564.
[31] L. Ma, Z. Yang, Y. Miao, J. Xue, M. Wu, L. Zhou, and Y. Dai, "NeuGraph: Parallel deep neural network computation on large graphs," in USENIX ATC, 2019, pp. 443–458.
[32] Y. Ma, H. Ren, B. Khailany, H. Sikka, L. Luo, K. Natarajan, and B. Yu, "High performance graph convolutional networks with applications in testability analysis," in DAC, 2019, pp. 1–6.
[33] D. Marcheggiani and I. Titov, "Encoding sentences with graph convolutional networks for semantic role labeling," in EMNLP, 2017, pp. 1507–1516.
[34] E. Nurvitadhi, G. Weisz, Y. Wang, S. Hurkat, M. Nguyen, J. C. Hoe, J. F. Martínez, and C. Guestrin, "GraphGen: An FPGA framework for vertex-centric graph computation," in FCCM, 2014, pp. 25–28.
[35] NVIDIA Corporation, cuSPARSE Library. NVIDIA Corporation, 2019.
[36] T. Oguntebi and K. Olukotun, "GraphOps: A dataflow library for graph analytics acceleration," in FPGA, 2016, pp. 111–117.
[37] M. M. Ozdal, S. Yesil, T. Kim, A. Ayupov, J. Greth, S. Burns, and O. Ozturk, "Graph analytics accelerators for cognitive systems," IEEE Micro, vol. 37, no. 1, pp. 42–51, 2017.
[38] V. J. Reddi et al., "MLPerf inference benchmark," arXiv preprint arXiv:1911.02549, 2019.
[39] P. Shivakumar and N. P. Jouppi, "CACTI 3.0: An integrated cache timing, power, and area model," Compaq Computer Corporation, Tech. Rep., 2001.
[40] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, "Graph attention networks," in ICLR, 2018.
[41] S. Venkataramani, A. Ranjan, S. Banerjee, D. Das, S. Avancha, A. Jagannathan, A. Durg, D. Nagaraj, B. Kaul, P. Dubey, and A. Raghunathan, "ScaleDeep: A scalable compute architecture for learning and evaluating deep networks," in ISCA, 2017, pp. 13–26.
[42] Y. Wang, Y. Pan, A. Davidson, Y. Wu, C. Yang, L. Wang, M. Osama, C. Yuan, W. Liu, A. T. Riffel et al., "Gunrock: GPU graph analytics," ACM Transactions on Parallel Computing, vol. 4, no. 1, pp. 1–49, 2017.
[43] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu, "A comprehensive survey on graph neural networks," arXiv preprint arXiv:1901.00596, 2019.
[44] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, "How powerful are graph neural networks?" in ICLR, 2019.
[45] M. Yan, L. Deng, X. Hu, L. Liang, Y. Feng, X. Ye, Z. Zhang, D. Fan, and Y. Xie, "HyGCN: A GCN accelerator with hybrid architecture," in HPCA, 2020.
[46] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec, "Graph convolutional neural networks for web-scale recommender systems," in KDD, 2018, pp. 974–983.
[47] Z. Ying, J. You, C. Morris, X. Ren, W. Hamilton, and J. Leskovec, "Hierarchical graph representation learning with differentiable pooling," in Advances in Neural Information Processing Systems, 2018, pp. 4800–4810.
[48] H. Zeng and V. Prasanna, "GraphACT: Accelerating GCN training on CPU-FPGA heterogeneous platforms," in FPGA, 2020, pp. 255–265.
[49] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing FPGA-based accelerator design for deep convolutional neural networks," in FPGA, 2015, pp. 161–170.
[50] M. Zhang and Y. Chen, "Link prediction based on graph neural networks," in Advances in Neural Information Processing Systems, 2018, pp. 5165–5175.
