
Efficient and Effective Sparse LSTM on FPGA with Bank-Balanced Sparsity
Shijie Cao* (Harbin Institute of Technology), Chen Zhang (Microsoft Research), Zhuliang Yao* (Tsinghua University),
Wencong Xiao* (Beihang University), Lanshun Nie (Harbin Institute of Technology), Dechen Zhan (Harbin Institute of Technology),
Yunxin Liu (Microsoft Research), Ming Wu (Microsoft Research), Lintao Zhang (Microsoft Research)

ABSTRACT

Neural networks based on Long Short-Term Memory (LSTM) are widely deployed in latency-sensitive language and speech applications. To speed up LSTM inference, previous research proposes weight pruning techniques to reduce computational cost. Unfortunately, irregular computation and memory accesses in unrestricted sparse LSTM limit the realizable parallelism, especially when implemented on FPGA. To address this issue, some researchers propose block-based sparsity patterns to increase the regularity of sparse weight matrices, but these approaches suffer from deteriorated prediction accuracy.

This work presents Bank-Balanced Sparsity (BBS), a novel sparsity pattern that can maintain model accuracy at a high sparsity level while still enabling an efficient FPGA implementation. BBS partitions each weight matrix row into banks for parallel computing, and adopts fine-grained pruning inside each bank to maintain model accuracy. We develop a 3-step software-hardware co-optimization approach to apply BBS in real FPGA hardware. First, we propose a bank-balanced pruning method to induce the BBS pattern on weight matrices. Then we introduce a decoding-free sparse matrix format, Compressed Sparse Banks (CSB), that transparently exposes inter-bank parallelism in BBS to hardware. Finally, we design an FPGA accelerator that takes advantage of BBS to eliminate irregular computation and memory accesses. Implemented on an Intel Arria 10 FPGA, the BBS accelerator achieves 750.9 GOPS on sparse LSTM networks with a batch size of 1. Compared to state-of-the-art FPGA accelerators for LSTM with different compression techniques, the BBS accelerator achieves 2.3~3.7x improvement on energy efficiency and 7.0~34.4x reduction on latency with negligible loss of model accuracy.

KEYWORDS
FPGA; Deep Neural Networks; LSTM; Weight Pruning; Inference; Bank-Balanced Sparsity

ACM Reference Format:
Shijie Cao, Chen Zhang, Zhuliang Yao, Wencong Xiao, Lanshun Nie, Dechen Zhan, Yunxin Liu, Ming Wu, and Lintao Zhang. 2019. Efficient and Effective Sparse LSTM on FPGA with Bank-Balanced Sparsity. In The 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '19), February 24-26, 2019, Seaside, CA, USA. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3289602.3293898

* Contribution during internship at Microsoft Research.

FPGA '19, February 24-26, 2019, Seaside, CA, USA
© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-6137-8/19/02. https://doi.org/10.1145/3289602.3293898

1 INTRODUCTION

Neural networks based on Long Short-Term Memory (LSTM) have been widely used in interactive and latency-sensitive applications such as machine translation, speech recognition and speech synthesis [13, 20, 24]. The size and computational cost of these LSTM models continue to grow in order to achieve better model accuracy. However, the stringent requirement on computational resources makes it challenging to achieve low inference latency for large LSTMs. The most time-consuming part of LSTM inference is matrix-vector multiplication (MxV). As the size of the LSTM network grows, the MxV cost grows quadratically, thus significantly increasing the inference cost.

Weight pruning is a model compression technique that reduces overall memory and computational costs. Early works [8, 10] discover that removing LSTM weights below a small threshold has negligible impact on model accuracy. By clamping a significant portion of the weights to 0, weight pruning converts dense weight matrices to unstructured sparse matrices, thus reducing the computation and memory required to carry out inference.

After pruning, the most significant part of LSTM inference changes from dense MxV to sparse matrix-vector multiplication (SpMxV). Though requiring less computation, the irregularity of SpMxV limits the maximum performance and energy efficiency achievable on hardware accelerators [17, 19, 27]. Unstructured sparse matrices cannot efficiently utilize the underlying hardware resources for three reasons: 1) the unbalanced distribution of non-zero weights might cause workload skew among processing elements (PEs); 2) concurrent irregular memory accesses to a dense vector lead to memory access conflicts, which could stall the parallel execution; and 3) sparse matrix representations such as compressed sparse row (CSR) use indexes to track non-zero values, which require decoding before computation.

To address these issues, further works [17, 19] suggest using coarser-grained weight pruning methods to induce more structured sparsity patterns for better hardware acceleration. Coarse-grained pruning methods prune weights at the granularity of blocks. From the hardware perspective, blocks of non-zero weights can enable contiguous memory accesses and better utilize parallel computation resources. Unfortunately, it becomes challenging to maintain the same model accuracy when block sparsity is applied. Block sparsity constrains the locality of the non-zero weights, and important weights could be mistakenly pruned, resulting in model accuracy loss. Furthermore, the block size (i.e., pruning granularity) is application-sensitive, making it another hyper-parameter to tune. Existing work often needs to search a range of block sizes to find a trade-off between model accuracy and hardware efficiency [13, 17].

This work presents Bank-Balanced Sparsity (BBS), a novel sparsity pattern for pruning LSTM. Bank-balanced pruning splits each weight matrix row into multiple equal-sized banks, and applies fine-grained pruning to each bank independently to obtain identical sparsity among banks. BBS preserves the unstructured distribution of non-zero weights inside each bank, thus maintaining higher model accuracy than block sparsity. Experimental results in Section 6 demonstrate that BBS achieves almost the same model accuracy as unstructured sparsity and significantly outperforms block sparsity when pruning weights at the same sparsity level.

Importantly, BBS is also amenable to FPGA acceleration because it inherently provides a balanced matrix partitioning for parallel computing. We design an FPGA accelerator that takes advantage of BBS to eliminate the computational overheads of unstructured sparsity. Specifically: 1) our accelerator utilizes the intrinsic bank-balanced property of BBS to achieve high parallelism in SpMxV with guaranteed load balance; 2) our accelerator supports concurrent random access requests to vector elements without conflicts in SpMxV by adopting banked scratchpad memory to buffer vectors; 3) to avoid the decoding overheads of sparse matrix formats, we introduce a novel format for BBS matrices that is decoding-free in our FPGA accelerator. Notably, the BBS accelerator is highly efficient even for inference with a batch size of 1, by exploiting fine-grained parallelism from a single sample, which is challenging for unstructured sparsity.

Overall, this paper makes the following contributions:

(1) We propose Bank-Balanced Sparsity, a novel sparsity pattern that can both maintain model accuracy and enable an efficient FPGA accelerator implementation.
(2) We design an FPGA-based accelerator for BBS that eliminates load imbalance, irregular memory accesses and decoding overheads, and achieves good efficiency for LSTM inference even at a batch size of 1.
(3) Implemented on an Intel Arria 10 FPGA, the BBS accelerator achieves 750.9 GOPS on large LSTMs without batching. Compared to state-of-the-art LSTM FPGA accelerators, we achieve 2.3~3.7x improvement on energy efficiency and 7.0~34.4x reduction on latency with negligible loss of model accuracy.

2 BACKGROUND

2.1 Long Short-Term Memory

LSTM is one of the most successful cells used in Recurrent Neural Networks (RNNs) [11]. An LSTM network computes a mapping from an input sequence X = (x_1, ..., x_T) to an output sequence Y = (y_1, ..., y_T) by applying the following equations iteratively from t = 1 to T:

i_t = σ(W_ix x_t + W_ir y_{t-1} + W_ic c_{t-1} + b_i)    (1)
f_t = σ(W_fx x_t + W_fr y_{t-1} + W_fc c_{t-1} + b_f)    (2)
g_t = σ(W_cx x_t + W_cr y_{t-1} + b_c)                   (3)
c_t = f_t ⊙ c_{t-1} + g_t ⊙ i_t                          (4)
o_t = σ(W_ox x_t + W_or y_{t-1} + W_oc c_{t-1} + b_o)    (5)
m_t = o_t ⊙ h(c_t)                                       (6)
y_t = W_ym m_t                                           (7)

where the W terms denote weight matrices and the b terms denote bias vectors. The symbols i, f, o and c are respectively the input gate, forget gate, output gate and cell activation (long-term memory). The ⊙ operator denotes element-wise multiplication, and the + operator denotes element-wise addition. σ is the logistic activation function and h is the hyperbolic tangent (Tanh) activation function.

Among all operators in LSTM, matrix-vector multiplication (MxV) is the most memory-intensive and computation-intensive operator. The dimensions of x_t, y_t and c_t are often the same, say D. Therefore, the number of weights is 12 × D². In each step of the inference calculation, the number of operations in MxV is 24 × D², while the number of operations in the element-wise operators (EWOP) is only 9 × D. As a consequence, accelerating MxV is the key to low-latency LSTM inference.
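As a point of reference, the following NumPy sketch (our illustration, not code from the paper) evaluates Equations (1)-(7) for one time step, assuming x_t, y_t and c_t all have dimension D and all 12 weight matrices are dense D × D. The twelve matrix-vector products account for the 24 × D² MxV operations, while the element-wise operations contribute only O(D) work.

```python
import numpy as np

def lstm_step(x_t, y_prev, c_prev, W, b):
    """One LSTM inference step following Eq. (1)-(7).
    W is a dict of the 12 D x D weight matrices, b a dict of the 4 bias vectors."""
    sigma = lambda z: 1.0 / (1.0 + np.exp(-z))                                  # logistic activation
    i_t = sigma(W["ix"] @ x_t + W["ir"] @ y_prev + W["ic"] @ c_prev + b["i"])   # (1) input gate
    f_t = sigma(W["fx"] @ x_t + W["fr"] @ y_prev + W["fc"] @ c_prev + b["f"])   # (2) forget gate
    g_t = sigma(W["cx"] @ x_t + W["cr"] @ y_prev + b["c"])                      # (3) cell input
    c_t = f_t * c_prev + g_t * i_t                                              # (4) cell state
    o_t = sigma(W["ox"] @ x_t + W["or"] @ y_prev + W["oc"] @ c_prev + b["o"])   # (5) output gate
    m_t = o_t * np.tanh(c_t)                                                    # (6) hidden output
    y_t = W["ym"] @ m_t                                                         # (7) projection
    return y_t, c_t
```

Each of the twelve D × D matrix-vector products costs 2 × D² multiply-accumulate operations, which is why pruning the weight matrices, rather than the element-wise operators, is the lever that matters for latency.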
2.2 Weight Pruning

It is widely observed that deep neural networks (DNNs) have a lot of redundancy in their weights. Pruning away (forcing to zero) a proper number of unimportant weights does not affect model accuracy. Moreover, weight pruning reduces the model size and computational complexity, enabling energy-efficient hardware acceleration. Deep Compression [9, 10] provides a threshold-based weight pruning technique: it prunes away small weights whose absolute values are less than a predefined threshold and retrains the remaining weights. Pruning and retraining are applied iteratively to generate the sparse DNN model.
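One iteration of this threshold-based scheme can be sketched as follows (a minimal NumPy illustration under our own assumptions, not the released Deep Compression code):

```python
import numpy as np

def magnitude_prune(W, sparsity):
    """Zero out the smallest-magnitude entries of W so that `sparsity` fraction is zero."""
    threshold = np.quantile(np.abs(W), sparsity)  # global magnitude threshold
    mask = np.abs(W) >= threshold                 # keep weights at or above the threshold
    return W * mask, mask

# Typical usage: alternate pruning with retraining, gradually raising the target sparsity.
# for target in (0.5, 0.7, 0.8, 0.9):
#     W, mask = magnitude_prune(W, target)
#     # retrain, keeping the pruned positions (where mask is False) clamped to zero
```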
As mentioned in the introduction, unrestricted pruning of weight matrices is unfriendly to hardware acceleration. Further work [17, 19] proposes coarse-grained pruning methods that prune blocks of weights. They pick the maximum magnitude or the average magnitude of the weights within a block as the representative of the entire block; if the representative magnitude is less than a predefined threshold, the entire block is pruned. However, the pruning granularity affects hardware efficiency as well as model accuracy, and deep neural network designers struggle to balance the two.
3 BANK-BALANCED SPARSITY

Our proposed sparsity pattern, Bank-Balanced Sparsity (BBS), achieves both high model accuracy and high hardware efficiency. In this section, we first describe the BBS pattern and the motivation for designing it. Then, we present the detailed bank-balanced pruning algorithm that induces BBS on LSTM weight matrices. Finally, we analyze the pruning effectiveness of BBS in terms of achievable accuracy and sparsity. The efficient hardware acceleration design for BBS will be introduced in the next section.

3.1 Bank-Balanced Sparsity Pattern

For matrices represented in BBS, each matrix row is split into multiple equal-sized banks (i.e., sub-rows), and each bank has the same number of non-zero values. Figure 1 illustrates BBS with an example and compares it with unstructured sparsity and block sparsity. In this example, three sparse matrices with different sparsity patterns are all pruned from the dense example weight matrix in Figure 1(a) with a sparsity ratio of 50%. Fine-grained pruning globally sorts the weights and prunes the smallest 50% of weights, leading to an unstructured sparse matrix (Figure 1(b)); coarse-grained pruning induces a block sparse matrix (Figure 1(c)) by setting the block size to 2x2 and taking the block average as the block representative; our bank-balanced pruning induces a bank-balanced sparse matrix (Figure 1(d)) by splitting each matrix row into 2 equal-sized banks and applying fine-grained pruning inside each bank independently.

[Figure 1: Comparing BBS with unstructured sparsity and block sparsity by pruning a dense matrix with a sparsity ratio of 50%. (a) Original dense matrix; (b) unstructured sparse matrix by global pruning; (c) block sparse matrix by pruning 2x2 blocks according to the block average; (d) bank-balanced sparse matrix by local pruning inside each 1x4 bank.]

We designed the BBS sparsity pattern with both hardware efficiency and model accuracy in mind. In general, partitioning a weight matrix into multiple sub-matrices is mandatory for parallel computing. In BBS, each matrix row is split into multiple banks with the same size and the same sparsity. This bank-balanced partitioning enables an efficient SpMxV design that exploits both inter-row parallelism and intra-row parallelism (i.e., inter-bank parallelism) with guaranteed load balance and no vector access conflicts. The detailed SpMxV design for BBS is described in Section 4.1. In addition, since BBS applies fine-grained pruning within each bank independently, the relatively large weights in each bank, which contribute more to model accuracy, can be preserved.

Another potential design for a sparsity pattern would be to split weight matrices into 2-D blocks, as in block sparsity, and apply fine-grained pruning within each 2-D block. Larger weights within each block can be preserved in this scheme as well. However, after pruning, each 2-D block is still an unstructured sparse matrix, and it remains challenging to design an efficient hardware accelerator architecture due to the irregularity of the sparse sub-matrices. For example, parallelizing SpMxV across 2-D blocks leads to concurrent irregular vector accesses.

3.2 Bank-Balanced Pruning Algorithm

To induce BBS on LSTM weight matrices, we adopt a bank-balanced pruning method that prunes each bank independently with the same threshold percentage to obtain the same sparsity ratio among banks.

Algorithm 1 Bank-Balanced Pruning Algorithm
Input: the matrix to be pruned, M; the number of banks per row, BankNum; the expected sparsity, Sparsity.
Output: the pruned matrix, Mp.
1:  for each row Mi ∈ M.rows do
2:      divide the row Mi into BankNum banks;
3:      for each bank ∈ Mi do
4:          sort the elements in bank by absolute value;
5:          calculate the bank-internal threshold T in line with Sparsity;
6:          for each element ∈ bank do
7:              prune element if |element| < T;
8:          end for
9:      end for
10: end for
11: return the pruned matrix, Mp;

Like previous pruning methods, we apply the bank-balanced pruning method iteratively to a pre-trained network, and fine-tune the network after each pruning iteration to restore the model accuracy. Algorithm 1 details the bank-balanced pruning method that induces BBS on LSTM weight matrices. In each pruning iteration, bank-balanced pruning first partitions each matrix row into multiple equal-sized banks and sorts the weights within each bank by their absolute values. The importance of a weight is represented by the rank of its absolute value inside its bank, and in each iteration the percentage of weights with the smallest absolute values is pruned. We slowly increase the pruning percentage from 0% to the target sparsity, while the rate of increase decreases with each pruning iteration. During pruning, if the model accuracy drops significantly and cannot be recovered via fine-tuning, we withdraw that pruning iteration and stop the pruning procedure.
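One pruning pass of Algorithm 1 can be written compactly in NumPy (our sketch; the per-bank threshold is expressed as keeping the top-k magnitudes in each bank, which is equivalent to thresholding the sorted bank):

```python
import numpy as np

def bank_balanced_prune(W, bank_num, sparsity):
    """Prune each bank of every row independently so that all banks share the same sparsity."""
    rows, cols = W.shape
    assert cols % bank_num == 0, "row length must be divisible by the number of banks"
    bank_size = cols // bank_num
    keep = max(1, int(round(bank_size * (1.0 - sparsity))))   # non-zeros kept per bank
    Wp = np.zeros_like(W)
    for r in range(rows):
        for b in range(bank_num):
            lo, hi = b * bank_size, (b + 1) * bank_size
            bank = W[r, lo:hi]
            top = np.argsort(np.abs(bank))[-keep:]            # largest-magnitude positions in this bank
            Wp[r, lo + top] = bank[top]
    return Wp
```

In the full procedure this function would be called repeatedly with a gradually increasing sparsity target, with fine-tuning of the surviving weights between calls.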
[Figure 2: Weight map visualization after pruning with (a) unstructured sparsity, (b) BBS, and (c) block sparsity (sparsity ratio = 90%). These weight maps are 64 × 64 sub-matrices of the whole 1500 × 1500 matrix.]
3.3 Analysis of Our Pruning Method

Intuitively, a pruning method should remove only smaller weights and preserve the larger weights that contribute more to model accuracy. Fine-grained pruning clamps weights of small magnitude to zero and preserves large weights, thus maintaining model accuracy. Bank-balanced pruning applies fine-grained pruning inside each bank independently, so the large weights inside each bank can be preserved. In contrast, coarse-grained pruning prunes blocks of weights, which constrains the locality of the preserved non-zero weights; some important weights can therefore be mistakenly pruned while some unimportant weights are preserved instead. For the example in Figure 1, the bank-balanced sparse matrix in (d) preserves similar large weights as the unstructured sparse matrix in (b), but the block sparse matrix in (c) removes some large weights (e.g., 0.4 and 0.5) while preserving some small weights (e.g., 0.1 and -0.1).

To verify the pruning effectiveness of BBS and compare it with unstructured sparsity and block sparsity, we analyze and visualize the weight matrices produced by the corresponding pruning methods on a real LSTM model [28]. The hidden size of this LSTM model is 1500. Table 1 shows the percentage of the largest weights that are preserved under the various sparsity patterns. We show the results for Wix, Wfx, Wcx and Wox; the other weight matrices have similar results. In this analysis, the sparsity ratios are all 90%, the bank size of BBS is 32, and the block size of block sparsity is 4 × 4. Unstructured sparsity by fine-grained pruning naturally preserves 100% of the largest weights because it globally prunes the weights with the smallest magnitudes. BBS preserves more than 80% of the largest weights by fine-grained pruning inside each bank, while block sparsity preserves less than half (or even only a quarter) of the largest weights.

Table 1: Percentages of the largest weights that are preserved in various sparsity patterns (sparsity ratio = 90%).

Weight Matrices   Unstructured Sparsity   BBS      Block Sparsity
Wix               100.00%                 91.30%   42.76%
Wfx               100.00%                 81.39%   24.26%
Wcx               100.00%                 84.45%   24.24%
Wox               100.00%                 85.62%   22.97%

Figure 2 visualizes these three kinds of sparse weight matrices for a 64 × 64 sub-matrix randomly selected from the whole 1500 × 1500 Wix. Grey grids indicate non-zero parameters, and the grey level indicates the magnitude of the absolute value. For the second matrix, represented in BBS, each row has two banks (left and right of the dashed line), and each bank has 3 non-zero weights. We can see that the weight map of BBS is very similar to the weight map of unstructured sparsity, whereas the weight map of block sparsity is quite different because of its locality constraint.

In terms of achievable sparsity and accuracy, experimental results on two typical data sets [7, 18] demonstrate that BBS has almost the same effectiveness as unstructured sparsity and outperforms block sparsity, as described in Section 6.2.
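The metric reported in Table 1 can be reproduced with a short script (our sketch; names and structure are ours): given the dense matrix and a pruning mask, it measures what fraction of the top-k largest-magnitude weights survive pruning.

```python
import numpy as np

def largest_weights_preserved(W_dense, mask, sparsity=0.9):
    """Fraction of the top (1 - sparsity) largest-magnitude weights that `mask` keeps."""
    k = int(round(W_dense.size * (1.0 - sparsity)))       # number of weights a pruner may keep
    top_k = np.argsort(np.abs(W_dense), axis=None)[-k:]   # flat indices of the k largest magnitudes
    kept = mask.flatten()[top_k]                          # which of them survived pruning
    return kept.mean()
```

Comparing the masks produced by global magnitude pruning, bank-balanced pruning and block pruning on the same matrix reproduces the qualitative gap in Table 1: global pruning scores 100% by construction, bank-balanced pruning stays high, and block pruning drops sharply.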
4 SPARSE MATRIX COMPUTATION AND FORMAT FOR BBS

As mentioned, the irregularity of unstructured sparsity is not hardware friendly due to unbalanced computation, irregular memory accesses and decoding overheads. In contrast, the intrinsic bank-balanced property of BBS enables effective hardware designs that address these issues. For BBS, we introduce a highly parallel SpMxV design with guaranteed load balance and no vector access conflicts, together with an associated decoding-free sparse matrix format for this SpMxV design.

4.1 Highly Parallel SpMxV Design

SpMxV consists of multiple dot product operations, one between each sparse matrix row and the dense vector. The standard practice of using multiple PEs to parallelize dot products across matrix rows reduces computation time, but the irregular memory access patterns of unstructured sparse matrices restrict further parallelism within a dot product.

In addition to inter-row parallelism, BBS enables an efficient SpMxV design that exploits intra-row parallelism (i.e., inter-bank parallelism) through the bank-balanced partitioning. Figure 3 illustrates how to exploit inter-bank parallelism when computing the dot product of two vectors (a BBS matrix row and the dense vector). The multiplications for the non-zero elements inside each bank are performed serially, while the multiplications in different banks are performed in parallel. In this example, the sparse matrix row is divided into 4 banks, shown in different colors; the size of each bank is 3 and the sparsity is 1/3. The multiplied dense vector is divided into 4 banks accordingly. Our design computes the dot product of the two vectors by accumulating dot products of sub-vectors whose sizes all equal the number of banks (N). Each bank of the sparse matrix row provides one non-zero element to form one sub-vector (e.g., (A, C, E, G)), while dense vector elements are fetched based on the indices of the non-zero values to form another sub-vector (e.g., (v0, v3, v7, v9)). To compute the dot product of two sub-vectors, N pair-wise multiplications are executed in parallel. The dot products of successive sub-vectors are computed sequentially and accumulated to obtain the dot product of the complete vectors.

[Figure 3: Exploiting inter-bank parallelism in the dot product computation of one BBS matrix row and the dense vector.]

The bank-balanced property of BBS eliminates load imbalance and irregular memory accesses. In BBS matrices, every row (and every bank) has the same number of elements, which automatically guarantees load balance across rows and banks in SpMxV. When calculating a partial dot product, BBS ensures that one and only one element is accessed in each bank. Therefore, storing each vector bank in an independently accessible block RAM can supply vector elements simultaneously, with high bandwidth and without memory access conflicts. The detailed FPGA implementation is presented in Section 5.
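The following sketch emulates this banked dot product in software (our illustration of the dataflow in Figure 3, not the hardware implementation): in each step it takes one non-zero per bank plus its bank-internal index, gathers the matching dense-vector elements, and accumulates the N-wide partial product.

```python
import numpy as np

def banked_dot(bank_values, bank_indices, vector_banks):
    """Dot product of one BBS row with a dense vector.
    bank_values[b][j]  : j-th non-zero in bank b (all banks hold the same count).
    bank_indices[b][j] : its position inside bank b (the BRAM address in hardware).
    vector_banks[b]    : the slice of the dense vector covered by bank b."""
    n_banks = len(bank_values)
    nnz_per_bank = len(bank_values[0])
    total = 0.0
    for j in range(nnz_per_bank):                  # serial steps within each bank
        weights = np.array([bank_values[b][j] for b in range(n_banks)])
        elems = np.array([vector_banks[b][bank_indices[b][j]] for b in range(n_banks)])
        total += np.dot(weights, elems)            # N pair-wise multiplies, done in parallel in hardware
    return total
```

Because every bank contributes exactly one element per step, the gather touches each vector bank exactly once per step; this is the property that lets the FPGA back each bank with its own BRAM and avoid access conflicts.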
4.2 Decoding-Free Sparse Matrix Format

Various sparse matrix formats have been proposed to reduce the memory footprint of sparse matrices. However, existing formats introduce decoding overheads when performing sparse matrix multiplications. For an FPGA implementation, decoding sparse formats consumes hardware resources and adds latency. To eliminate decoding overheads, we introduce a sparse matrix format called Compressed Sparse Banks (CSB) that is specifically designed for BBS.

Compressed Sparse Row (CSR) is a commonly used sparse matrix format [1]. We use CSR as a representative encoding of existing formats for explanation and comparison. Figure 4(a) shows a bank-balanced sparse matrix represented in dense format, and Figure 4(b) shows its corresponding CSR encoding. CSR incurs two types of overhead for the SpMxV operation. First, the CSR format encodes all non-zero elements in row-major order, so rearranging the non-zero elements is unavoidable in order to exploit inter-bank parallelism in SpMxV. Second, the CSR format stores column indices and row pointers to track the location of each non-zero value, so address calculation is required to fetch vector elements. Other encoding formats, such as CSC and COO, have similar limitations [1].

[Figure 4: The comparison between CSB and CSR. (a) Original densely represented matrix; (b) the CSR representation (values, column indices, row pointers); (c) the CSB representation (values rearranged to expose inter-bank parallelism, plus bank-internal indices that serve directly as physical BRAM addresses).]

The proposed CSB format takes advantage of the balanced property of BBS and eliminates the need for decoding. Figure 4(c) shows the CSB representation of the corresponding matrix. The CSB encoding uses two arrays to represent a bank-balanced sparse matrix. In the first array (values), all non-zero values are arranged in row-major order; inside each row, the first non-zero elements of each bank (e.g., (A, C, E, G)) are listed first, then the second elements of each bank, and so on. The purpose of this data rearrangement is to explicitly expose inter-bank parallelism, so that every successive group of N elements in CSB can be fetched and computed upon directly in parallel. The second array (indices) lists the bank-internal indices of the non-zero values, i.e., their column indices modulo the bank size K. When each of the N vector banks is stored in a separate BRAM block on the FPGA, the bank-internal indices can be used directly as physical addresses to fetch the N corresponding vector elements from the BRAM blocks.
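A sketch of the CSB encoding (our code, assuming the bank-balanced matrix is given densely with exact zeros in the pruned positions):

```python
import numpy as np

def encode_csb(W, bank_num):
    """Encode a bank-balanced sparse matrix into CSB: interleaved values and bank-internal indices."""
    rows, cols = W.shape
    bank_size = cols // bank_num
    values, indices = [], []
    for r in range(rows):
        banks = [W[r, b * bank_size:(b + 1) * bank_size] for b in range(bank_num)]
        nz = [np.nonzero(bank)[0] for bank in banks]       # bank-internal positions of non-zeros
        nnz_per_bank = len(nz[0])                          # identical for every bank by construction
        for j in range(nnz_per_bank):                      # j-th non-zero of every bank, side by side
            for b in range(bank_num):
                values.append(banks[b][nz[b][j]])
                indices.append(int(nz[b][j]))              # column index modulo bank size
    return np.array(values), np.array(indices)
```

Every consecutive group of bank_num entries in values can be consumed in one cycle, and the matching entries of indices are exactly the per-BRAM read addresses, so no index arithmetic or format decoding is needed at run time.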
5 LSTM ACCELERATOR

In this section, we introduce the BBS accelerator, an FPGA-based accelerator for LSTM networks with bank-balanced pruning. The BBS accelerator is implemented as an accelerator on the PCIe I/O bus to serve LSTM inference requests from the host server. Our design specifically accelerates LSTM networks at a batch size of one to reduce inference latency, devoting the on-chip resources to exploiting as much parallelism as possible from one single sample.

5.1 Overall Architecture

[Figure 5: Overall architecture.]

Figure 5 shows the overall architecture of the BBS accelerator, which consists of a sparse matrix-vector multiplication unit (SpMxV Unit), an element-wise vector operation unit (EWOP Unit), a direct memory access module (DMA) for load/store operations, on-chip memories for matrices and vectors (Matrix Memory and Vector Memory), and a central controller. Before hardware acceleration, the host server uses the bank-balanced pruning method to prune the weight matrices and represents the sparse matrices in our proposed Compressed Sparse Banks (CSB) format; a lightweight compiler then generates instructions for the hardware accelerator to accomplish the computation of LSTM. The controller receives and stores instructions from the host server in the instruction buffer and dispatches them to their corresponding modules for execution.

The two important types of instructions are load/store instructions and computational instructions:

Load/Store Instructions. Load/store instructions are executed in the DMA module to transfer weight matrices and input/output vectors. A load instruction reads data (model weights and inputs) from host memory or off-chip DRAM into on-chip memories. A store instruction writes data (outputs) from on-chip memories to host memory or off-chip DRAM. In practice, weight pruning can often reduce the model size enough to fit in on-chip memories; for serving real-time LSTM with low latency, the default mode is to rely completely on on-chip memories. For large models that cannot fit into on-chip memories even with compression, the BBS accelerator uses load/store instructions to read/write weight matrices from/to off-chip DRAM.

Computational Instructions. As introduced in Section 2, all operations in sparse LSTM fall into two categories: SpMxV and EWOP (including addition, multiplication and three kinds of activations). Therefore, we design two kinds of computational instruction, the SpMxV instruction and the EWOP instruction, to fulfill the LSTM computation. The SpMxV instruction is executed in the SpMxV unit to read the required matrix and vector from on-chip memories, compute dot products for the matrix rows, and finally write the result vector back to the vector memory. The EWOP instruction is executed in the EWOP unit to read the required vector(s) from the vector memory and write the resulting vector of the element-wise addition/multiplication/activation back to the vector memory.

5.2 SpMxV Unit

The SpMxV unit implements the highly parallel design described in Section 4.1. It consists of M parallel processing elements (PEs) that compute dot products of distinct matrix rows and the dense vector concurrently to exploit inter-row parallelism, while each PE is designed to exploit intra-row (i.e., inter-bank) parallelism within a single dot product operation.

In the center of Figure 5, we show the detailed architecture of a PE. Each PE contains a private vector buffer (PVB) to buffer the dense vector being multiplied, because vector elements are randomly accessed multiple times across the matrix rows in SpMxV. The PE computes the dot product of two vectors by accumulating dot products of sub-vectors. This computation includes 5 steps: (1) the PE reads N matrix row elements from the matrix memory and N vector elements, addressed by the sparse indices, from the private vector buffer; (2) N multipliers operate simultaneously to obtain N scalar products; (3) an N-input adder tree sums the N scalar products to calculate the partial dot product; (4) an additional accumulator obtains the complete dot product; (5) the dot product result is written back to the global vector memory. The PE is fully pipelined so that one operation can be processed per clock cycle.

With M PEs and N multipliers per PE, this PE array achieves M × N parallelism for a single SpMxV operation.
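As a rough first-order model (our own estimate, ignoring pipeline fill, vector-transfer time and other overheads), the cycle count of one SpMxV on this array is governed by how many rows each PE handles and how many non-zeros each bank holds; the sketch below assumes the number of banks matches the number of multipliers N, so the array is fully utilized.

```python
import math

def spmxv_cycles(rows, cols, bank_num, sparsity, M=64):
    """First-order cycle estimate for one SpMxV on an M-PE array with bank_num multipliers per PE.
    Each cycle a PE consumes one non-zero from every bank of its current row."""
    bank_size = cols // bank_num
    nnz_per_bank = math.ceil(bank_size * (1.0 - sparsity))  # non-zeros per bank = cycles per row
    row_groups = math.ceil(rows / M)                        # rows are distributed across the M PEs
    return row_groups * nnz_per_bank
```

Under this model, higher sparsity or more banks directly shortens the serial per-row loop; with fewer banks than multipliers, part of the multiplier array idles, which is consistent with the utilization trend reported later in Table 5.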
5.3 Private Vector Buffer

In each SpMxV PE, N weight elements can be accessed simultaneously in one clock cycle because the non-zero values have already been rearranged by the CSB encoding format and stored contiguously in the matrix memory. However, to access the dense vector elements, the PVB needs to support N concurrent random memory accesses. Each BRAM in an FPGA provides only two read and/or write ports, so a single BRAM buffering the dense vector cannot supply N elements from random addresses concurrently. Multi-pumping [25] and vector replication [14] are two alternative solutions. Multi-pumping supplies N elements by running the PEs at an N-times lower frequency than the BRAM, which decreases the clock rate significantly. Vector replication provides more ports by creating replicas of the entire vector; although simple to implement, it is difficult to scale due to the limited on-chip storage resources in FPGAs and the generally large input/output/state vectors in LSTMs.

In order to support random vector accesses at high bandwidth without replicas inside a PE, we adopt the banking approach to buffer vectors [5]. In this approach, the multiplied vector is also split into banks according to the bank partitioning of the matrix rows in BBS. As shown in Figure 6, N banks of vector elements are stored in N independently accessible BRAMs. Therefore, the PVB can provide N elements simultaneously given N bank-internal indices (i.e., physical addresses for each BRAM). Weight matrices in LSTMs usually have the same size, so we use a unified N in pruning and configure N as the number of BRAMs in the PVB. For LSTMs that have weight matrices of different sizes, different Ns are selected in pruning to find an optimal sparsity, and the largest N is configured as the number of BRAMs in the PVB.

[Figure 6: Banked private vector buffer.]

In some studies, banking is adopted to support random memory accesses at high memory bandwidth [5, 31]. However, due to the irregularity of data accesses, banked memory cannot handle imbalanced workloads across banks or concurrent access requests to the same BRAM; addressing these issues requires additional logic and clock cycles [5, 31]. The biggest difference of our banked private vector buffer is that balanced memory access requests and the absence of memory access conflicts are automatically guaranteed by the intrinsic bank-balanced property of BBS: the SpMxV PE accesses one and only one element in each BRAM per cycle.

Before a SpMxV operation, the vector to be multiplied needs to be duplicated in each PE's private vector buffer to exploit inter-row parallelism. This brings new challenges. First, broadcasting vector elements to the various PEs leads to high fan-out and thus a low achievable clock frequency; we use a systolic array structure to achieve a high clock frequency, similar to [22]. Second is the additional access latency; we double-buffer the private vector buffer to pipeline data transfer and computation.

5.4 EWOP Unit

The EWOP unit performs various element-wise operations on vectors based on the instruction opcode. Vector addition and multiplication generate one result vector by reading two source vectors. Activation functions read one source vector and apply a non-linear function to it to generate one result vector. The EWOP unit contains M operators working in parallel for each kind of operation to reduce latency.

5.5 Controller

In the computation flow of LSTM, some SpMxV operations and EWOP operations among different gates can be performed simultaneously. The software compiler analyzes the dependencies and annotates the instructions with them. The controller parallelizes instructions according to the dependencies indicated by the software compiler. When the SpMxV unit or the EWOP unit is idle (which means an instruction has finished), the controller checks whether the next instruction depends on the instruction being executed on the other unit. If not, the controller dispatches the next instruction to the idle unit, so that the SpMxV unit and EWOP unit can work simultaneously.
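The dispatch policy can be paraphrased as a small simulation (our sketch of the idea only; the instruction encoding, fields and timing model are hypothetical, not the accelerator's actual interface):

```python
def dispatch(instructions):
    """Greedy dual-unit dispatch: issue the next instruction in program order to its target
    unit when that unit is idle and its dependencies have completed.
    Each instruction is a dict: {"id", "unit" ('spmxv' or 'ewop'), "deps", "cycles"}."""
    busy = {"spmxv": None, "ewop": None}   # [instruction, remaining cycles] per unit
    done, pc, time = set(), 0, 0
    while pc < len(instructions) or any(busy.values()):
        if pc < len(instructions):
            ins = instructions[pc]
            deps_ok = all(d in done for d in ins["deps"])
            if busy[ins["unit"]] is None and deps_ok:
                busy[ins["unit"]] = [ins, ins["cycles"]]   # issue to the idle unit
                pc += 1
        time += 1                                          # advance one cycle on both units
        for unit, slot in busy.items():
            if slot is not None:
                slot[1] -= 1
                if slot[1] == 0:
                    done.add(slot[0]["id"])
                    busy[unit] = None
    return time
```

Independent SpMxV and EWOP instructions overlap on the two units, while a dependent instruction simply waits until its producer finishes, which mirrors the behaviour described above.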
6 EVALUATION

Our evaluation centers around two aspects: the model accuracy of BBS and the hardware efficiency of the BBS accelerator.

6.1 Experimental Setup

We implemented the BBS accelerator in SystemVerilog, synthesized it with Quartus Prime 17.1, and evaluated it on a custom FPGA PCIe card with an Intel Arria 10 FPGA [3]. The FPGA has 4 GB of DDR3-1600 DRAM external memory. The host CPU is an Intel Xeon E5-2650 processor, which is only responsible for data pre-processing and result collection. The FPGA communicates with the host CPU through a PCIe Gen3 x8 bus, which supports up to 16 GB/s of bidirectional bandwidth.

We evaluate the system with an LSTM language model on the PTB dataset [18] and an LSTM speech recognition model on the TIMIT dataset [7]. The PTB dataset is widely used in Natural Language Processing (NLP) research. It consists of 929k training words, 73k validation words and 82k test words, and has 10k words in its vocabulary. We adopt the LSTM model in [28], which achieves very good quality on the PTB dataset: the small model has 200 hidden units per layer, the medium one 650, and the large one 1,500. The TIMIT corpus is designed to provide speech data for acoustic-phonetic studies. It contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. For the LSTM speech recognition model, we set the input size to 153, the hidden size to 1024, and the projection size to 512, consistent with previous studies [8, 21].

6.2 BBS Model Accuracy

6.2.1 Comparison with Unstructured Sparsity and Block Sparsity. We first evaluate the model accuracy of BBS and compare it with unstructured sparsity and block sparsity. Figure 7 and Figure 8 show the sparsity-accuracy trade-off results of the various sparsity patterns on the PTB and TIMIT data sets, respectively. We use 64 banks in BBS and 4 × 4 blocks in block sparsity. For the LSTM language model experiments, we use the large model with a hidden size of 1,500. Perplexity is a metric that quantifies language model quality (lower is better). As shown in Figure 7, the perplexity curve of BBS is very close to the perplexity curve of unstructured sparsity. Both unstructured sparsity and BBS preserve the perplexity until 80% of the weights are pruned away; these two patterns even achieve slightly better model accuracy than the dense baseline at around 60% sparsity. The perplexity of block sparsity starts to increase significantly at 40% sparsity. Experiments on the LSTM speech recognition model show similar results (Figure 8): BBS and unstructured sparsity can achieve 90% sparsity without accuracy loss, while block sparsity can only achieve 70% sparsity. These experimental results demonstrate that BBS has almost the same effectiveness as unstructured sparsity and outperforms block sparsity in terms of achievable accuracy or sparsity during pruning.

[Figure 7: Sparsity-Perplexity trade-off of various sparsity patterns on the PTB dataset (perplexity: lower is better).]

[Figure 8: Sparsity-Phone Error Rate trade-off of various sparsity patterns on the TIMIT dataset (phone error rate: lower is better).]

6.2.2 Sensitivity to Bank Size. We further explore the accuracy sensitivity of BBS to the bank size. As a comparison, we also explore the accuracy sensitivity of block sparsity to the block size. Table 2 shows the model accuracy at varying block/bank sizes for the large LSTM language model. BBS achieves almost the same model accuracy regardless of the bank size. For block sparsity, however, increasing the block size adversely affects model accuracy.

Table 2: Perplexity sensitivity to the block size in block sparsity and the bank size in BBS.

                                      Perplexity at sparsity
Model            Configuration        60%     70%     80%
Block Sparsity   block size: 4×4      80.6    83.2    88.1
Block Sparsity   block size: 8×8      82.4    86.4    95.2
Block Sparsity   block size: 16×16    83.7    88.3    99.5
BBS              bank size: 25        78.3    78.6    79.4
BBS              bank size: 50        78.4    78.7    79.2
BBS              bank size: 100       78.4    78.6    79.2

6.2.3 Quantization on Pruned Model. Quantization can achieve a higher compression rate and better hardware efficiency for deep learning models by reducing the number of bits that represent a weight [9, 15]. In this work, we study the accuracy sensitivity of BBS to the number of quantization bits. We apply linear quantization to LSTM models after bank-balanced pruning with 16-bit, 8-bit and 4-bit fixed-point formats; both weights and activations are quantized. Table 3 shows the effects of quantization with different bit widths on the large LSTM language model after bank-balanced pruning. The perplexity is 78.8 for the original dense model and increases slightly to 79.2 after pruning away 80% of the weights with BBS. 16-bit quantization of the pruned model maintains the same perplexity, while more aggressive quantization deteriorates perplexity.

Table 3: Language model perplexity after quantization with different bit widths.

Quantization Scheme      Perplexity
float-32 dense model     78.8
float-32 BBS model       79.2
fixed-16 BBS model       79.2
fixed-8 BBS model        79.8
fixed-4 BBS model        143.1
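The linear quantization used here maps each value to the nearest step of a uniform fixed-point grid; a minimal sketch (ours, assuming a simple symmetric per-tensor scale) is:

```python
import numpy as np

def linear_quantize(x, num_bits):
    """Symmetric linear quantization of an array to `num_bits` fixed-point levels."""
    qmax = 2 ** (num_bits - 1) - 1                         # e.g. 127 for 8-bit signed values
    max_abs = np.max(np.abs(x))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)      # integer codes
    return q * scale                                       # dequantized values for accuracy evaluation
```

Applying this to both weights and activations at 16 bits leaves the perplexity in Table 3 unchanged, while 4-bit quantization is clearly too coarse for this model.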
6.3 BBS Accelerator Efficiency

6.3.1 Resource Utilization, Clock Rate and Power Consumption. Table 4 shows the resource utilization, clock rate and power consumption of our BBS accelerator. The reported results are based on post-fit results from Quartus Prime 17.1. The operator width (i.e., data precision) is 16 bits, since 16-bit precision is accurate enough to maintain model accuracy. The BBS accelerator is configured with M = 64 and N = 64; the accelerator thus contains 64 PEs in the SpMxV unit, and each PE has 64 multipliers executing in parallel. The Intel Arria 10 FPGA contains 1518 DSPs, which can be implemented as 3036 multipliers. The accelerator fully utilizes the DSPs for multipliers and uses additional ALMs for the remaining multipliers. We use M20Ks for the matrix memory and MLABs for the private vector buffer, because the latter consists of relatively small memories that require independently accessible ports.

Table 4: Resource utilization, clock rate and power consumption.

ALMs         M20Ks        DSPs          Clock Rate (MHz)   Power (Watt)
289k (68%)   2509 (92%)   1518 (100%)   200                19.1
6.3.2 Latency and Throughput. Our accelerator is highly efficient even with a batch size of 1, so we measure the latency of the BBS accelerator without batching and calculate the corresponding throughput. For the small, medium and large LSTM language models on the PTB data set, we also use three different numbers of banks (16, 32, 64) to prune the models; pruning away 80% of the weights has no effect on model accuracy. Table 5 shows the latency of one LSTM layer and its corresponding throughput. The achievable performance increases as the model scale or the number of banks increases, because of higher utilization of the underlying PEs. For the large model with 1,500 hidden units and 64 banks in the matrix partitioning, our accelerator takes 4.8us to finish a whole LSTM layer, corresponding to 750.9 GOPS at a batch size of one.

Table 5: Latency and throughput results of running LSTM language networks of various scales and various numbers of banks.

LSTM hidden size   Num of Banks   Latency (us)   Throughput (GOPS)
200 (small)        16             1.7            37.3
200 (small)        32             1.4            43.4
200 (small)        64             1.3            47.4
650 (medium)       16             4.3            158.8
650 (medium)       32             2.8            238.0
650 (medium)       64             2.1            318.5
1500 (large)       16             13.9           257.7
1500 (large)       32             7.8            458.5
1500 (large)       64             4.8            750.9

6.3.3 Comparison with State-of-the-art LSTM Accelerators. We compare the performance of our BBS accelerator with three state-of-the-art LSTM accelerators on FPGA: ESE [8], C-LSTM [21] and DeltaRNN [6]. These three studies adopt different optimization techniques to reduce computation requirements. ESE [8] uses weight-pruning-based compression and improves inference efficiency by batching multiple samples, but lacks optimizations of irregular memory accesses that would reduce latency for a single-batch request. C-LSTM [21] represents weight matrices with block-circulant matrices and proposes an accelerator with an FFT-based computing kernel. DeltaRNN [6] uses the delta network algorithm to reduce MxV operations and the corresponding weight fetches by skipping dispensable neuron activation changes below a threshold.

Table 6 shows the comparison results. We apply BBS to the same LSTM model on the TIMIT dataset as ESE and C-LSTM adopt, and use the accuracy and performance numbers of ESE, C-LSTM and DeltaRNN reported in their papers. The performance numbers of DeltaRNN are based on GRU, which is an optimistic estimate because GRU is simpler than LSTM. With the same model on the same data set, BBS achieves a compression rate and model accuracy comparable to ESE and C-LSTM, while our BBS accelerator achieves 2.3x and 3.7x improvements on energy efficiency and 34.4x and 7.0x speedups on latency (or throughput at a batch size of one) compared to ESE and C-LSTM, respectively. The reason why the BBS accelerator achieves better single-batch performance than ESE is that it enables the extra dimension of parallelism and addresses the low-memory-bandwidth issue caused by irregular memory accesses in SpMxV.

7 RELATED WORK

Network Compression. Network compression can reduce the memory and computation requirements of a neural network, increase its inference speed and save energy [9]. Compression algorithms mainly include pruning [10], sparsity-inducing regularization [23] and quantization [15]. Building on the original sparsity methods, further studies propose structured sparsity methods that add constraints on the locality of non-zero weights [17, 19, 26]. Structured sparsity is more amenable to hardware acceleration than unstructured sparsity.

DNN Accelerators. Hardware acceleration of DNNs has received significant attention from both industry and academia [4, 12, 16, 29]. Owing to the widely adopted pruning-based compression techniques, many accelerators for sparse neural networks have been proposed [8, 21, 30]. These works explore specialized sparse matrix multiplication modules that operate directly on sparse neural networks. Although these accelerators achieve higher performance than general processors, the irregular computation and memory accesses in sparse neural networks still restrict the maximum parallelism achievable on customized accelerators.

SpMxV Accelerators. SpMxV is the most computation-intensive and memory-intensive part of LSTM inference. Many FPGA and GPU accelerators for SpMxV have been proposed [2, 5]. However, SpMxV is hard to optimize due to its irregular memory access characteristics. By contrast, neural network pruning methods offer a restricted freedom to define the sparsity structure (e.g., hardware-friendly sparsity) in weight matrices. BBS is a structured sparsity pattern that increases hardware efficiency while incurring negligible loss of model accuracy.

8 CONCLUSION

This paper proposes a novel sparsity pattern, BBS (Bank-Balanced Sparsity), that achieves both high model accuracy when pruning LSTMs and high hardware efficiency on FPGA. Our insight in designing BBS is to partition weight matrix rows into banks for parallel computing and to adopt fine-grained pruning inside each bank to maintain model accuracy. Evaluated on speech recognition and language model tasks, BBS achieves the same model accuracy as purely unstructured sparsity at various sparsity levels. Our BBS accelerator on FPGA takes advantage of the intrinsic bank-balanced property of BBS, achieving high efficiency even for a batch size of 1. Compared to state-of-the-art FPGA accelerators for LSTM with different compression techniques, the BBS accelerator achieves 2.3~3.7x improvement on energy efficiency and 7.0~34.4x reduction on latency with negligible loss of model accuracy.

9 ACKNOWLEDGEMENTS

We would like to thank Ningyi Xu, Wenqiang Wang, Bojie Li and Yun Wang for all the technical discussions and valuable suggestions on improving this paper. We thank the anonymous reviewers for their insightful feedback and comments. Shijie Cao was partly supported by the National Nature Science Foundation of China (No. 61772159).
Table 6: Speedup comparison with state-of-the-art LSTM accelerators.

                                          ESE [8]    C-LSTM [21]   DeltaRNN [6]   Ours
Platform                                  XCKU060    Virtex-7      XC7Z100        Arria 10 GX1150
Frequency (MHz)                           200        200           125            200
Sparsity (%)                              88.7       87.5          -              87.5
Quantization                              fixed-12   fixed-16      fixed-16       fixed-16
Accuracy Degradation                      0.30%      0.32%         -              0.25%
Throughput (GOPS)                         282.2      131.1         192.0          304.1
Power (W)                                 41.0       22.0          7.3            19.1
Energy Efficiency (GOPS/W)                6.9        6.0           26.3           15.9
Latency (us)                              82.7       16.7          -              2.4
Throughput at batch 1 (GOPS)              8.8        43.7          192.0          304.1
Effective Throughput at batch 1 (GOPS)    79.2       349.6         1198.0         2432.8

REFERENCES
[1] 2018. Sparse Matrix Formats. https://docs.scipy.org/doc/scipy/reference/sparse.html. (2018).
[2] Nathan Bell and Michael Garland. 2008. Efficient sparse matrix-vector multiplication on CUDA. Technical Report NVR-2008-004, Nvidia Corporation.
[3] Adrian M Caulfield, Eric S Chung, Andrew Putnam, Hari Angepat, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, Joo-Young Kim, and others. 2016. A cloud-scale acceleration architecture. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Press, 7.
[4] Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, Logan Adams, Mahdi Ghandi, and others. 2018. A Configurable Cloud-Scale DNN Processor for Real-Time AI. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE.
[5] Jeremy Fowers, Kalin Ovtcharov, Karin Strauss, Eric S Chung, and Greg Stitt. 2014. A high memory bandwidth FPGA accelerator for sparse matrix-vector multiplication. In Field-Programmable Custom Computing Machines (FCCM), 2014 IEEE 22nd Annual International Symposium on. IEEE, 36–43.
[6] Chang Gao, Daniel Neil, Enea Ceolini, Shih-Chii Liu, and Tobi Delbruck. 2018. DeltaRNN: A Power-efficient Recurrent Neural Network Accelerator. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 21–30.
[7] John S Garofolo, Lori F Lamel, William M Fisher, Jonathan G Fiscus, and David S Pallett. 1993. DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon technical report n 93 (1993).
[8] Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang, and others. 2017. ESE: Efficient speech recognition engine with sparse LSTM on FPGA. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 75–84.
[9] Song Han, Huizi Mao, and William J Dally. 2015. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149 (2015).
[10] Song Han, Jeff Pool, John Tran, and William Dally. 2015. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems. 1135–1143.
[11] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[12] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, and others. 2017. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture. ACM, 1–12.
[13] Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. 2018. Efficient Neural Audio Synthesis. arXiv preprint arXiv:1802.08435 (2018).
[14] Charles Eric LaForest, Ming G Liu, Emma Rae Rapati, and J Gregory Steffan. 2012. Multi-ported memories for FPGAs via XOR. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays. ACM, 209–218.
[15] Darryl Lin, Sachin Talathi, and Sreekanth Annapureddy. 2016. Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning. 2849–2858.
[16] Shaoli Liu, Zidong Du, Jinhua Tao, Dong Han, Tao Luo, Yuan Xie, Yunji Chen, and Tianshi Chen. 2016. Cambricon: An instruction set architecture for neural networks. In ACM SIGARCH Computer Architecture News, Vol. 44. IEEE Press, 393–405.
[17] Huizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, and William J Dally. 2017. Exploring the regularity of sparse structure in convolutional neural networks. arXiv preprint arXiv:1705.08922 (2017).
[18] Mitchell Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz, and Ann Taylor. 1999. Treebank-3 LDC99T42. CD-ROM. Philadelphia, Penn.: Linguistic Data Consortium (1999).
[19] Sharan Narang, Eric Undersander, and Gregory Diamos. 2017. Block-Sparse Recurrent Neural Networks. arXiv preprint arXiv:1711.02782 (2017).
[20] Haşim Sak, Andrew Senior, and Françoise Beaufays. 2014. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Fifteenth Annual Conference of the International Speech Communication Association.
[21] Shuo Wang, Zhe Li, Caiwen Ding, Bo Yuan, Qinru Qiu, Yanzhi Wang, and Yun Liang. 2018. C-LSTM: Enabling Efficient LSTM using Structured Compression Techniques on FPGAs. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 11–20.
[22] Xuechao Wei, Cody Hao Yu, Peng Zhang, Youxiang Chen, Yuxin Wang, Han Hu, Yun Liang, and Jason Cong. 2017. Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs. In Design Automation Conference (DAC), 2017 54th ACM/EDAC/IEEE. IEEE, 1–6.
[23] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. 2016. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems. 2074–2082.
[24] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, and others. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016).
[25] Hasan Erdem Yantir, Salih Bayar, and Arda Yurdakul. 2013. Efficient implementations of multi-pumped multi-port register files in FPGAs. In Digital System Design (DSD), 2013 Euromicro Conference on. IEEE, 185–192.
[26] Zhuliang Yao, Shijie Cao, and Wencong Xiao. 2018. Balanced Sparsity for Efficient DNN Inference on GPU. arXiv preprint arXiv:1811.00206 (2018).
[27] Jiecao Yu, Andrew Lukefahr, David Palframan, Ganesh Dasika, Reetuparna Das, and Scott Mahlke. 2017. Scalpel: Customizing DNN pruning to the underlying hardware parallelism. In Proceedings of the 44th Annual International Symposium on Computer Architecture. ACM, 548–560.
[28] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329 (2014).
[29] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. 2015. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 161–170.
[30] Shijin Zhang, Zidong Du, Lei Zhang, Huiying Lan, Shaoli Liu, Ling Li, Qi Guo, Tianshi Chen, and Yunji Chen. 2016. Cambricon-X: An accelerator for sparse neural networks. In Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on. IEEE, 1–12.
[31] Shijie Zhou, Rajgopal Kannan, Yu Min, and Viktor K Prasanna. 2018. FASTCF: FPGA-based Accelerator for STochastic-Gradient-Descent-based Collaborative Filtering. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 259–268.
