
2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines

FP-DNN: An Automated Framework for Mapping Deep Neural Networks onto FPGAs with RTL-HLS Hybrid Templates
Yijin Guan1,3*, Hao Liang2,3*, Ningyi Xu3, Wenqiang Wang3, Shaoshuai Shi3,
Xi Chen3, Guangyu Sun1,5, Wei Zhang2 and Jason Cong4,5,1†
1 Center for Energy-Efficient Computing and Applications, Peking University, Beijing, China
2 Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, China
3 Microsoft Research Asia, Beijing, China
4 Computer Science Department, University of California, Los Angeles, USA
5 PKU/UCLA Joint Research Institute in Science and Engineering

*Yijin Guan and Hao Liang contributed equally to this work.
†In addition to being a faculty member at UCLA, Jason Cong is also a co-director of the PKU/UCLA Joint Research Institute and a visiting chair professor of Peking University.
This research was performed while Yijin Guan, Hao Liang, Shaoshuai Shi and Xi Chen were interns at Microsoft Research Asia.

Abstract—DNNs (Deep Neural Networks) have demonstrated great success in numerous applications such as image classification, speech recognition, video analysis, etc. However, DNNs are much more computation-intensive and memory-intensive than previous shallow models. Thus, it is challenging to deploy DNNs in both large-scale data centers and real-time embedded systems. Considering performance, flexibility, and energy efficiency, FPGA-based accelerators for DNNs are a promising solution. Unfortunately, conventional accelerator design flows make it difficult for FPGA developers to keep up with the fast pace of innovations in DNNs.

To overcome this problem, we propose FP-DNN (Field Programmable DNN), an end-to-end framework that takes TensorFlow-described DNNs as input, and automatically generates the hardware implementations on FPGA boards with RTL-HLS hybrid templates. FP-DNN performs model inference of DNNs with our high-performance computation engine and carefully-designed communication optimization strategies. We implement CNNs, LSTM-RNNs, and Residual Nets with FP-DNN, and experimental results show the great performance and flexibility provided by the proposed FP-DNN framework.

I. INTRODUCTION

DNNs have brought profound and revolutionary changes to the realm of artificial intelligence, and have achieved great improvements in many domains such as computer vision [22] [13] [15], speech recognition [9], natural language processing [20], etc. Inspired by the impressive breakthroughs achieved by DNNs, many researchers in both academia and industry are eager to solve their problems with powerful DNNs. With their model accuracy close to or even better than that of humans, DNNs are widely deployed at scale in data centers, as well as in embedded systems like mobile phones and robots.

DNNs are well known to be computation-intensive and memory-intensive because of their deep topological structures, complicated neural connections, and massive data to process. Due to these characteristics, it is challenging to achieve high performance and good energy efficiency when mapping DNNs onto generic computing systems. To solve this problem, many hardware accelerators for DNN inference have been investigated recently. Among these designs, FPGA-based accelerators have gained great popularity because of their outstanding flexibility, performance, and energy efficiency.

Unfortunately, hand-coded FPGA-based accelerators face both productivity and programmability challenges for mapping DNNs in real applications. On the one hand, the design and optimization of FPGA-based accelerators require much experience and expertise. It may cost a professional hardware developer several weeks to map a DNN model onto FPGAs, even with the help of high-level synthesis tools. For DNN designers, there are no programming interfaces or libraries (like cuBLAS and cuDNN on NVIDIA GPUs) to easily map their models onto FPGAs. On the other hand, prior work on FPGA-based accelerators for DNNs focused on accelerating certain types of layers [28] or certain models [23] [19]. Since DNNs evolve rapidly, various model structures and optimization techniques are emerging so fast that re-designing an FPGA-based accelerator for every new model or technique is quite inefficient.

According to the analysis above, there is a strong demand for an easy-to-use framework that can automatically map DNNs onto FPGAs. In this paper, we propose FP-DNN, which takes symbolic descriptions (in TensorFlow) of DNNs as input, and outputs implementations of the corresponding FPGA-based accelerators for model inference. We implement accelerators with RTL-HLS hybrid templates, and convert model inference into general-purpose computations like matrix multiplication. Several optimization kernels are developed and invoked to ensure the functionality, performance, and energy efficiency of the accelerator. The entire compilation procedure is end-to-end and automated, which makes it possible for all DNN researchers and users to use FPGAs as a powerful device to perform model inference.

We make the following contributions in this paper:

• We build a framework that automatically maps DNNs onto FPGAs for model inference. Compared with previous accelerator work, this automated framework can save design time significantly.

• We divide the operations involved in model inference into a computation-intensive part and a layer-specific part. We implement a high-performance matrix multiplication kernel for the computation-intensive part, and carefully optimize communication bandwidth for the layer-specific part. FP-DNN automatically generates the hardware implementation with RTL-HLS hybrid templates.

• Our framework can support almost all types of DNNs, and we implement several DNNs (CNNs, LSTM-RNNs, and Residual Nets) as case studies. FPGA-based accelerators generated by this framework can achieve good performance and energy efficiency. To the best of our knowledge, this is the first work to implement ResNet-152 on an FPGA. Such a design demonstrates the flexibility, scalability, and productivity of our FP-DNN framework.

The rest of this paper is organized as follows: Section II reviews related work on DNNs and FPGA-based automated frameworks. Section III describes the architecture of our proposed FP-DNN framework. Then, the hardware implementation details are provided in Section IV. In Section V, we show the experimental setup and results of our case studies. At last, Section VI concludes this paper.

II. RELATED WORK

A. Deep Neural Networks

DNNs have evolved into a big community, and many interesting and powerful models have been proposed. They have achieved great success in computer vision, speech recognition, scene analysis, etc. Typically, DNNs can be divided into several categories. By topological structure, we can divide these models into Artificial Neural Networks, Recurrent Neural Networks, Residual Nets, etc. All these models are comprised of several neural layers, so by type of layers, there are convolutional layers, LSTM layers, fully-connected layers, recurrent layers, pooling layers, activation layers, etc. A single DNN can choose any topological structure mentioned above, and it may include several types of layers in its configuration. This results in a huge design space of possible model structures.

Currently, many open-source frameworks have been released for DNN research: TensorFlow [5], Caffe [14], Theano [6], Torch [4], CNTK [3], etc. TensorFlow is one of the most popular DNN frameworks. It constructs DNNs in a Python/C++ front-end as a data flow graph with an operation library, and performs computation on the graph. TensorFlow supports various types of DNNs (ANN/CNN/RNN/...) as well as other scientific computation. We appreciate the concepts of tensor and data flow graph in TensorFlow, and choose TensorFlow as the high-level description for DNNs in FP-DNN.

B. FPGA-based Automated Frameworks

Accelerating the inference phase of DNNs on FPGAs has been a hot research topic, and many automation tools and frameworks have been proposed. Among these designs, [16], [21], [27], and [25] are four representatives.

In [16], Mahajan et al. proposed TABLA, a template-based framework for accelerating statistical machine learning. They focus on accelerating the training phase by automatically generating the corresponding accelerators for stochastic gradient descent with Verilog-based templates. [21] proposed DNNWEAVER, a framework that automatically generates a synthesizable accelerator for a given (DNN, FPGA) pair, and they generate accelerators using hand-optimized (RTL-based) design templates. In [27], Zhang et al. proposed Caffeine, a hardware/software co-designed library to accelerate convolutional neural networks on FPGAs, and they propose to accelerate convolutional layers and fully-connected layers with a uniformed representation. [25] proposed DeepBurning, an automation tool to generate FPGA-based accelerators for NN models. DeepBurning compiles DNNs described in a Caffe-like script and generates the corresponding RTL-level accelerator under user-specified constraints.

III. FRAMEWORK

A. Overview

The overall FP-DNN framework is shown in Figure 1. The model description, usually in the protobuf format generated by TensorFlow, is fed into our Symbolic Compiler. The compiler generates a C++ program and an FPGA programming bitstream, which are executed by the Host and the Device respectively for model inference. Inside the Symbolic Compiler, Model Mapper analyzes the model description and extracts the topological structure and operations of the target model. After optimizations and parameterization for the hardware implementation, Model Mapper outputs the hardware kernel schedule and kernel configuration to the code generators. Software Generator uses the kernel schedule to generate the host code in C++. The host code is compiled by a commercial C++ compiler to generate host programs. With the kernel configuration, Hardware Generator generates the device code by instantiating RTL-HLS hybrid templates. Commercial synthesis tools compile this hardware code to produce the programming file for the final hardware implementation. The whole FP-DNN framework works in an "end-to-end" manner: from software-based model descriptions to FPGA-based model inference implementations. This procedure is done automatically without any human intervention.

B. Model Mapper

Model Mapper analyzes the model description to map the model onto the hardware platform, and it generates the schedule and configuration for the hardware kernels. Figure 2 shows an example of the working flow of Model Mapper. Figure 2a shows an example Python code snippet in TensorFlow describing a CNN model that contains three convolution layers. The pooling and activation layers are omitted for simplicity. The corresponding Data Flow Graph generated and executed by TensorFlow is shown in Figure 2b, where computation and data are shown as operating nodes and tensors respectively.

Fig. 1: FP-DNN Framework

Fig. 2: Working Flow of Model Mapper

The Model Mapper uses the model description to extract information about the model structure and the configuration of each layer. Although storing model parameters and intermediate results in on-chip BRAM can significantly improve performance, we do not have enough on-chip BRAMs on a single FPGA to store all of them for modern DNNs. As a result, we have to allocate data buffers in off-chip DDR memory for storing intermediate activations and model parameters. Then, Model Mapper generates an Execution Graph, shown in Figure 2c, which shows how the model inference would ideally be performed on hardware.

However, this Execution Graph cannot be mapped onto FPGAs directly due to the limitations on computation resources and DRAM storage of modern FPGAs. So we propose to adopt resource reuse strategies to allocate hardware resources reasonably. To reuse computation resources, Model Mapper allocates only one hardware kernel, which performs model inference layer by layer. Thus, in the example shown in Figure 2, only one convolution kernel is allocated. For storage resource reuse, Model Mapper allocates several physical buffers in DRAM as a memory pool, and we aim to minimize the number of physical buffers in the final implementation. We formulate the data buffer reuse problem as a graph coloring problem.

During model inference, each data buffer has a range of time during which its contents must be kept intact. Thus, any two data buffers whose life spans intersect cannot be placed in the same physical buffer. We construct an interval graph in which each vertex represents a data buffer. For any two data buffers whose life spans intersect, we connect their vertices with an edge. We need to color this graph with the minimum number of distinct colors so that no adjacent nodes are assigned the same color, which indicates that the data buffers with the same color can be assigned to the same physical buffer. The coloring problem for an interval graph can be solved optimally by the left-edge algorithm in polynomial time. An algorithm description for applying the left-edge algorithm in Model Mapper for physical buffer allocation is shown in Algorithm 1.

Algorithm 1: Physical Buffer Allocation Algorithm
Input: Initial Data Buffer Graph (G) and Data Buffers (V)
Output: Physical Buffer Allocations
Denote the left-edge and right-edge of the interval corresponding to data buffer v_i's life span as l_i and r_i respectively;
Sort V in ascending order of left-edge to get V';
# of physical buffers = 1;
while not all v in V' have been allocated do
    R = 0;
    while ∃ v_i in V' with l_i > R do
        v_x = first v in V' with l_x > R;
        R = r_x;
        allocate v_x to the current physical buffer;
        V' = V' - v_x;
    # of physical buffers += 1;
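To make the allocation pass concrete, the following is a minimal C++ sketch of the left-edge strategy in Algorithm 1; the interval representation and names are illustrative, not FP-DNN's actual data structures:

```cpp
#include <algorithm>
#include <vector>

struct DataBuffer {
    int left;       // first step in which the buffer's contents are live
    int right;      // last step in which the buffer's contents are live
    int phys = -1;  // assigned physical buffer id (-1 = not yet allocated)
};

// Left-edge allocation: packs non-overlapping life spans into the same
// physical buffer and returns the number of physical buffers used.
int AllocatePhysicalBuffers(std::vector<DataBuffer>& buffers) {
    std::vector<DataBuffer*> order;
    for (auto& b : buffers) order.push_back(&b);
    // Sort by the left edge of each life span, as in Algorithm 1.
    std::sort(order.begin(), order.end(),
              [](const DataBuffer* a, const DataBuffer* b) { return a->left < b->left; });

    int num_phys = 0;
    std::size_t allocated = 0;
    while (allocated < order.size()) {
        int R = -1;  // right edge of the last interval packed into this physical buffer
        for (auto* b : order) {
            if (b->phys < 0 && b->left > R) {  // does not overlap what is already packed
                b->phys = num_phys;
                R = b->right;
                ++allocated;
            }
        }
        ++num_phys;  // open the next physical buffer
    }
    return num_phys;
}
```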
With carefully designed resource allocation strategies, Model Mapper outputs the kernel schedule and kernel configuration, which are shown in Figure 2d. In this example, only one convolution kernel is allocated for computation, and two physical buffers are allocated for intermediate data storage.

C. SW Generator and HW Generator

SW Generator takes the kernel schedule and generates the C++ code for the Host, which is in charge of kernel execution scheduling, model initialization, data buffer management, etc. SW Generator instantiates a host code template with key parameters extracted by Model Mapper, such as the number of kernels, the number of physical buffers, and the kernel execution order. This host code is written in C++, and can be compiled by any commercial C++ compiler.
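As an illustration of what an instantiated host template might look like, here is a hedged C++ sketch; the `KernelStep` structure and the runtime hooks are hypothetical placeholders rather than FP-DNN's actual generated code:

```cpp
#include <cstddef>
#include <vector>

// One entry of the kernel schedule emitted by Model Mapper (hypothetical layout).
struct KernelStep {
    int kernel_id;      // which hardware kernel to invoke (only one MM kernel in practice)
    int input_buffer;   // physical DRAM buffer holding this layer's inputs
    int output_buffer;  // physical DRAM buffer receiving this layer's outputs
    int config_offset;  // offset of this layer's entry in the configuration file
};

// Placeholder runtime hooks standing in for the PCI-e host runtime.
void CopyModelParametersToDevice(const void*, std::size_t) { /* stub: DMA over PCI-e */ }
void LaunchKernel(const KernelStep&) { /* stub: enqueue the device kernel */ }
void WaitForKernel(int) { /* stub: block until the kernel finishes */ }

void RunInference(const std::vector<KernelStep>& schedule,
                  const void* params, std::size_t param_bytes) {
    // Model initialization: weights and the configuration file go to device DRAM once.
    CopyModelParametersToDevice(params, param_bytes);

    // Execute the model layer by layer, following the schedule from Model Mapper.
    for (const KernelStep& step : schedule) {
        LaunchKernel(step);             // Data Arranger + MM process one layer
        WaitForKernel(step.kernel_id);  // outputs land in step.output_buffer
    }
}
```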

HW Generator is in fact a library of RTL-HLS hybrid templates for various types of layers. We use RTL-HLS hybrid templates instead of pure RTL templates or pure HLS templates for the following reasons. Compared with HLS, RTL designs usually utilize resources more efficiently, but it is well known that RTL design is quite hard and time-consuming. HLS tools receive designs programmed in high-level programming languages (C, C++, OpenCL, etc.), then compile them into FPGA programming files. HLS designs have a better abstraction for external modules or interfaces (like off-chip DRAM), which makes it easier and faster to implement complex control logic. However, current HLS designs cannot explore as much fine-grained optimization as RTL designs.

To fully utilize the advantages of both design approaches, we take an RTL-HLS hybrid approach for template design: we use RTL for designing a high-performance computation engine, and we use the OpenCL-based HLS framework to implement the control logic for the RTL part. With the kernel configuration generated by Model Mapper, HW Generator instantiates the corresponding optimized kernel template to generate the hardware code for the Device. The RTL part of these kernel templates is written in Verilog, and the HLS part is written in OpenCL-based HLS. The generated hardware code can be compiled by commercial synthesis tools for FPGA implementation. The library of kernel templates can be further extended when new types of model layers emerge.

We focus on PCI-e based systems because of their popularity in data center computing systems. Data communication between Host and Device is accomplished through a PCI-e slot, and this slot is also used to power on and program the FPGA. Inside the FPGA, hardware kernels are compiled and invoked by the Host to perform computation.

IV. IMPLEMENTATION

The great complexity and variability of DNN structures bring big challenges to generating hardware for each of them individually. It is well known that DNNs are always constructed by stacking layers. These layers share a similar structure in the computation-intensive part, which can always be expressed as or converted into matrix multiplication. As a result, we divide the operations involved in each layer into a computation-intensive part and a layer-specific part, and implement the computing architecture shown in Figure 3 on the FPGA board.

For the computation-intensive part, we use a layer-independent matrix multiplication kernel (MM) to perform the calculations. For the layer-specific part, we use a Data Arranger to perform data communication with DRAM directly, and it communicates with MM through on-chip channels. We store the model configuration file in DRAM. This configuration file includes information on the model's topological structure, layer specifications, etc. During model inference, Data Arranger accesses this configuration file and parses it to schedule data accessing and kernel execution. In the following subsections, we will present the computation-intensive part and the layer-specific part respectively; then we will introduce the communication optimizations in detail.

Fig. 3: Overall Architecture of Accelerator

A. Computation-Intensive Part

Considering code efficiency and hardware performance, we implemented MM using Verilog. Accelerating matrix multiplication has been a classical problem in the FPGA community, and massive optimizations have been adopted. To better exploit data locality, and to make the limited DRAM bandwidth of modern FPGA boards match the computing power of MM, we take advantage of a tiling strategy to perform matrix multiplication. To ensure the multiplication is correctly performed in a tiling manner, we pad the input matrices with zeros if any dimension is not divisible by its tiling size.

MM takes in two tiles of the input matrices and performs the tiled multiplication vector by vector. All the input data are fed into multipliers simultaneously, then the intermediate results are summed up through a reduction tree to minimize the computing latency. Besides, we use double buffers for the input tiles, and these buffers operate in a ping-pong manner to overlap data communication with computation, which significantly improves the throughput of MM.
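To make the tiling and padding scheme concrete, here is a minimal C++ software model of a tiled matrix multiplication; the tile size T and the row-major layout are illustrative assumptions, and the real engine is a Verilog pipeline with a reduction tree and ping-pong tile buffers rather than software loops:

```cpp
#include <vector>

// C (M x N) = A (M x K) * B (K x N), processed tile by tile.
// Dimensions are padded up to multiples of T with zeros, mirroring the zero
// padding applied before tiles are streamed to MM.
constexpr int T = 64;  // illustrative tile size

static int PadUp(int x) { return ((x + T - 1) / T) * T; }

std::vector<float> TiledMatMul(const std::vector<float>& A,
                               const std::vector<float>& B,
                               int M, int K, int N) {
    const int Mp = PadUp(M), Kp = PadUp(K), Np = PadUp(N);
    std::vector<float> Ap(Mp * Kp, 0.0f), Bp(Kp * Np, 0.0f), Cp(Mp * Np, 0.0f);
    for (int i = 0; i < M; ++i)
        for (int k = 0; k < K; ++k) Ap[i * Kp + k] = A[i * K + k];
    for (int k = 0; k < K; ++k)
        for (int j = 0; j < N; ++j) Bp[k * Np + j] = B[k * N + j];

    // Each (ti, tj, tk) iteration corresponds to fetching one tile of A and one
    // tile of B into on-chip buffers and accumulating their product.
    for (int ti = 0; ti < Mp; ti += T)
        for (int tj = 0; tj < Np; tj += T)
            for (int tk = 0; tk < Kp; tk += T)
                for (int i = ti; i < ti + T; ++i)
                    for (int j = tj; j < tj + T; ++j) {
                        float acc = 0.0f;  // summed by a reduction tree in hardware
                        for (int k = tk; k < tk + T; ++k)
                            acc += Ap[i * Kp + k] * Bp[k * Np + j];
                        Cp[i * Np + j] += acc;
                    }

    // Strip the padding from the result.
    std::vector<float> C(M * N);
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) C[i * N + j] = Cp[i * Np + j];
    return C;
}
```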
B. Layer-specific Part

DNNs are constructed from many different types of layers, and the computation and data accessing patterns vary among these layers, so the strategies for converting them into matrix multiplication are also quite different. In the following subsections, we provide details about the operations performed in typical layers, what the computation-intensive part is, and how the MM kernel is reused.

1) Convolutional Layers: Convolutional layers are overwhelmingly popular in applications like image recognition, object detection, object classification, etc. Suppose we have N_{in} input channels and N_{out} output channels. The size of each convolution kernel is K × K, and the sliding stride is set to S. The computation during the inference phase can be summarized as Equation 1 (bias adding is omitted for simplicity).

out[x][y][z] = \sum_{i=1}^{N_{in}} \sum_{j=1}^{K} \sum_{k=1}^{K} \left( in[i][y \times S + j][z \times S + k] \times W[x][i][j][k] \right)    (1)
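A direct C++ rendering of Equation 1 (zero-based indices, channel-major arrays; a reference loop nest rather than the hardware implementation):

```cpp
// out: [Nout][Hout][Wout], in: [Nin][Hin][Win], Wt: [Nout][Nin][K][K]
// Equation 1 with zero-based indices; bias is omitted as in the text.
void ConvLayerReference(const float* in, const float* Wt, float* out,
                        int Nin, int Hin, int Win,
                        int Nout, int Hout, int Wout, int K, int S) {
    for (int x = 0; x < Nout; ++x)              // output channel
        for (int y = 0; y < Hout; ++y)          // output row
            for (int z = 0; z < Wout; ++z) {    // output column
                float acc = 0.0f;
                for (int i = 0; i < Nin; ++i)          // input channel
                    for (int j = 0; j < K; ++j)        // kernel row
                        for (int k = 0; k < K; ++k)    // kernel column
                            acc += in[(i * Hin + y * S + j) * Win + (z * S + k)] *
                                   Wt[((x * Nin + i) * K + j) * K + k];
                out[(x * Hout + y) * Wout + z] = acc;
            }
}
```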

To perform the computation in convolutional layers with the MM kernel, we need to convert the convolution operations into matrix multiplication.

Firstly, we need to turn the input features from a 3-D array into a 2-D array that we can treat as a matrix. To get a single feature in an output channel, we need to convolve a 3-D cube of input features (also known as a patch) with the corresponding convolution kernels. So we take each of these input patches and flatten it into a single row of the input matrix. This operation is known as Im2col (image to column), and it is widely applied in prior CPU and GPU studies [7]. With the input features in matrix form, we can do a similar conversion for the convolution kernels by flattening the corresponding 3-D cubes into single columns of the kernel matrix. According to the rules of matrix multiplication, each output channel is then serialized into a column of the output matrix.
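A compact C++ sketch of the Im2col conversion described above (one patch per row; the exact ordering inside a row is an illustrative choice):

```cpp
#include <cstddef>
#include <vector>

// Flatten every K x K patch of an [Nin][Hin][Win] input into one row of the
// input matrix. The resulting matrix has (Hout * Wout) rows and (Nin * K * K)
// columns, so the convolution becomes input_matrix x kernel_matrix.
std::vector<float> Im2col(const std::vector<float>& in,
                          int Nin, int Hin, int Win, int K, int S) {
    const int Hout = (Hin - K) / S + 1;
    const int Wout = (Win - K) / S + 1;
    std::vector<float> mat(static_cast<std::size_t>(Hout) * Wout * Nin * K * K);

    std::size_t idx = 0;
    for (int y = 0; y < Hout; ++y)
        for (int z = 0; z < Wout; ++z)           // one output position -> one row
            for (int i = 0; i < Nin; ++i)        // channels of the patch
                for (int j = 0; j < K; ++j)      // patch rows
                    for (int k = 0; k < K; ++k)  // patch columns
                        mat[idx++] = in[(i * Hin + y * S + j) * Win + (z * S + k)];
    return mat;
}
```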
2) LSTM Layers: In recent years, Long Short-Term Memory (LSTM) has gained great success in Recurrent Neural Network (RNN) design. Numerous variants of the LSTM structure have been proposed, while [10] finds that all these variants show little difference in model accuracy. In FP-DNN, we implement the LSTM cell used in [26], which is also supported in TensorFlow. The input of an LSTM layer is the combination of the input vector at the current time-step (in_t) and the hidden layer vector at the previous time-step (h_{t-1}). The LSTM layer multiplies the input with different weight matrices to get the output vectors of four gates: the input gate (I_t), forget gate (F_t), output gate (O_t), and cell gate (\tilde{C}_t). These gate vectors then generate the final output vector of the LSTM layer, together with the cell memory of the previous time-step (C_{t-1}), through element-wise operations (element-wise addition, multiplication, and activation). The computation performed in LSTM layers is shown in Equation 2 to Equation 6, where sig() represents the sigmoid function.

I_t[x] = sig\left( \sum_{i=1}^{N_{in}} in_t[i] \times W_{ini}[x][i] + \sum_{i=1}^{N_h} h_{t-1}[i] \times W_{hi}[x][i] + B_i[x] \right)    (2)

F_t[x] = sig\left( \sum_{i=1}^{N_{in}} in_t[i] \times W_{inf}[x][i] + \sum_{i=1}^{N_h} h_{t-1}[i] \times W_{hf}[x][i] + B_f[x] \right)    (3)

\tilde{C}_t[x] = tanh\left( \sum_{i=1}^{N_{in}} in_t[i] \times W_{inc}[x][i] + \sum_{i=1}^{N_h} h_{t-1}[i] \times W_{hc}[x][i] + B_{\tilde{c}}[x] \right)    (4)

O_t[x] = sig\left( \sum_{i=1}^{N_{in}} in_t[i] \times W_{ino}[x][i] + \sum_{i=1}^{N_h} h_{t-1}[i] \times W_{ho}[x][i] + B_o[x] \right)    (5)

h_t[x] = O_t[x] \times tanh\left( F_t[x] \times C_{t-1}[x] + I_t[x] \times \tilde{C}_t[x] \right)    (6)

From the equations above, we can see that the computation-intensive part of LSTM layers is matrix-to-vector multiplication. Considering a vector as a matrix (with one dimension of length 1), we can map LSTM layer inference to MM.
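Since the four gate pre-activations in Equations 2-5 share the same inputs, one way to realize them with a single matrix product (mathematically equivalent to the equations, though the exact packing FP-DNN uses is not spelled out in the text) is to concatenate in_t with h_{t-1} and stack the four weight matrices, as in this C++ sketch:

```cpp
#include <cmath>
#include <vector>

// Hypothetical packing: W_stacked is (4*Nh) x (Nin+Nh), rows ordered as
// [input gate; forget gate; cell gate; output gate], and x = [in_t ; h_{t-1}].
// One matrix-to-vector product then yields all four gate vectors at once.
std::vector<float> LstmGates(const std::vector<float>& W_stacked,
                             const std::vector<float>& bias,   // length 4*Nh
                             const std::vector<float>& in_t,   // length Nin
                             const std::vector<float>& h_prev, // length Nh
                             int Nin, int Nh) {
    std::vector<float> x(in_t);
    x.insert(x.end(), h_prev.begin(), h_prev.end());  // x = [in_t ; h_{t-1}]

    std::vector<float> pre(4 * Nh, 0.0f);
    for (int r = 0; r < 4 * Nh; ++r) {                // matrix-to-vector product,
        float acc = bias[r];                          // i.e. MM with a width-1 matrix
        for (int c = 0; c < Nin + Nh; ++c)
            acc += W_stacked[r * (Nin + Nh) + c] * x[c];
        pre[r] = acc;
    }
    // Gate activations: sigmoid for I, F, O and tanh for the cell gate (Eqs. 2-5).
    for (int r = 0; r < 4 * Nh; ++r) {
        const bool cell_gate = (r >= 2 * Nh && r < 3 * Nh);
        pre[r] = cell_gate ? std::tanh(pre[r]) : 1.0f / (1.0f + std::exp(-pre[r]));
    }
    return pre;
}
```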
3) Fully-Connected Layers: A fully-connected layer outputs a vector (out) from an input vector (in) and a weight matrix (W). Fully-connected layers are widely deployed in ANNs and as classifiers in DNNs. As a result, we design a uniform template for all these layers. The inference phase of fully-connected layers can be summarized as Equation 7. The computation-intensive part is matrix-to-vector multiplication. Thus, we can re-use the MM kernel to perform it.

out[x] = \sum_{i=1}^{N_{in}} in[i] \times W[x][i] + B[x]    (7)

4) Other Layers: The recurrent layer inside simple RNNs (not using LSTM) is actually constructed by adding a recurrent connection to a fully-connected layer, so the computation-intensive part of both layers is the same. Thus, we can also map recurrent layers to MM as we do for fully-connected layers. Activation layers are always element-wise functions applied to the features, and typical activation functions include tanh(), sigmoid(), and ReLU(). Thus, before the layer outputs are offloaded to DRAM, we apply the activation functions directly instead of mapping them to MM.

Pooling layers extract input features through a sliding window, and choose the average or maximum of this window as the output. So there is little computation involved in pooling layers, and there is no need to map them to MM. Before outputs are offloaded to DRAM, the pooling operations are applied.

C. Communication Optimization

Since our computation kernels need to communicate with off-chip DRAM for inputs and outputs, the achieved bandwidth is also an important factor to be considered in system design, especially on bandwidth-limited platforms like FPGAs. Previous studies [28] [27] showed that the effective DRAM bandwidth can be raised by increasing the DRAM burst length. In our on-board test, discontinuous access to DRAM results in limited burst length, which degrades the achieved bandwidth to ∼1GB/s, while continuous access to DRAM improves the achieved bandwidth to ∼8GB/s. To prevent I/O from becoming a serious bottleneck of the overall performance, we propose several methods to optimize the effective DRAM bandwidth for different layers.

1) Convolutional Layers: For communication optimizations in convolutional layers, we use Figure 4 as a simplified example to illustrate the problems and our solutions. In this example, we set the number of input channels to 8, and each channel has 3×3 elements, so we get 72 input elements in total. The size of the convolution kernel is 2×2, and the sliding stride is 1. According to the Im2col operations introduced in Section IV.B.1, we can convert the input features into an Input Matrix, and we divide this matrix into 4×2 equal tiles. In Figure 4, we show three different layout schemes for comparison: Im2col, Row-major, and Channel-major. For each scheme, we show its DRAM layout and the DRAM accessing pattern for the first tile.

Fig. 4: Layout Optimization

Im2col: As Figure 4a shows, we can store the entire Input Matrix in DRAM by flattening each tile, which ensures continuous accessing. However, it is obvious that this scheme stores the whole Input Matrix (128 elements) in DRAM, which requires data duplication for adjacent sliding windows. The data duplication brings great overhead on memory footprint, which should be avoided. Besides, to offload the outputs of this convolutional layer as inputs for the following layers, extra operations for data reorganizing and duplication are needed.

Row-major: A straightforward way to avoid data duplication is the following: for each channel of the Input Features, we can store the elements in a row-major manner. We show this scheme in Figure 4b.
So this scheme stores 72 elements in DRAM in total, which means there is no data duplication. But we can find that it takes two DRAM bursts to fetch the first tile, which indicates discontinuous DRAM accessing, and this discontinuous accessing pattern will degrade the effective bandwidth. Similar to Im2col, offloading the outputs still requires extra operations.

Channel-major: Different from Row-major, Channel-major stores the Input Features in a channel-major manner. Thus, the Input Matrix needs to be reorganized correspondingly: each row (input patch) is also flattened in a channel-major manner. The contents of the reorganized first tile are also shown in Figure 4c. So there are in total 72 elements stored in DRAM for the Input Features without any data duplication, and DRAM is also accessed continuously for fetching input elements. Furthermore, in this scheme, the outputs are also generated in a channel-major manner, which means that no extra operations for data reorganizing or duplication are needed.

With the comparison above, we choose the Channel-major scheme to optimize the communication of convolutional layers. Along with the Channel-major scheme, the weight matrix also needs to be adjusted accordingly, but the overhead brought by this can be ignored since weights are pre-trained and these adjustments can be applied before model deployment.
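As we read the Channel-major scheme, the channel index varies fastest in DRAM, so the values of all input channels at one spatial location sit next to each other; the following C++ sketch of that addressing is an assumption on our part, since the text only defines the layout through Figure 4:

```cpp
// Channel-major (channel index fastest): all C channel values of a spatial
// location (y, x) are stored contiguously, so an Im2col row that walks the
// patch position by position reads DRAM in long continuous runs.
inline int ChannelMajorOffset(int y, int x, int c, int W, int C) {
    return (y * W + x) * C + c;
}

// Reorder a per-channel row-major feature map (the "Row-major" layout of
// Figure 4b) into the channel-major layout of Figure 4c.
void ToChannelMajor(const float* row_major, float* channel_major,
                    int C, int H, int W) {
    for (int c = 0; c < C; ++c)
        for (int y = 0; y < H; ++y)
            for (int x = 0; x < W; ++x)
                channel_major[ChannelMajorOffset(y, x, c, W, C)] =
                    row_major[(c * H + y) * W + x];
}
```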
2) LSTM Layers & Fully-Connected Layers: According to the descriptions in Section IV.B.2 and Section IV.B.3, the computation-intensive part of LSTM layers and fully-connected layers mainly consists of matrix-to-vector multiplication. Unfortunately, matrix-to-vector multiplication is inefficient in terms of data locality, because every weight element fetched from DRAM is used only once for a single inference. Thus, most of the inference time is spent on data communication. This indicates that performing model inference directly with the MM kernel would bring much performance loss. To perform these computations with MM efficiently, we propose to batch input vectors together. In this batching way, every element of the weight matrices is reused, and we actually convert matrix-to-vector multiplication into matrix multiplication, which can be efficiently accomplished by the MM.

3) Other Layers: The computation-intensive part of recurrent layers is matrix-to-vector multiplication, which is the same as for LSTM layers and fully-connected layers. So we apply a similar batching scheme to optimize DRAM communication for them. Other layers like pooling layers and activation layers do not need much data communication with DRAM, so no communication optimization is applied.

D. Data Quantization

Note that numerous prior works [11] [12] have shown that the accuracy of DNNs is robust enough to a decrease in data precision. Many previous works on accelerating DNN inference [23] [19] used fixed-point parameters in their designs for performance improvement and resource saving, and this optimization is also called data quantization. So in our implementation, we support implementing fixed-point versions of the target model. Designers using our FP-DNN framework can specify the fixed-point precision by simply using the "-fixed point" compilation option in our Symbolic Compiler. In practice, data quantization is done off-line, and the accuracy loss brought by data quantization should be estimated and tested by the users of FP-DNN in advance.
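For illustration, here is a generic 16-bit fixed-point (Q-format) conversion of the kind an off-line quantization step could use; the text does not specify FP-DNN's exact fixed16 format, so the fractional bit-width here is an assumption:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Convert a float to 16-bit fixed point with `frac_bits` fractional bits
// (e.g. Q8.8 when frac_bits = 8), saturating at the representable range.
int16_t FloatToFixed16(float v, int frac_bits) {
    const float scaled = std::round(v * static_cast<float>(1 << frac_bits));
    const float clamped = std::max(-32768.0f, std::min(32767.0f, scaled));
    return static_cast<int16_t>(clamped);
}

float Fixed16ToFloat(int16_t v, int frac_bits) {
    return static_cast<float>(v) / static_cast<float>(1 << frac_bits);
}
```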
V. EVALUATION

A. Experimental Setup

In FP-DNN, the Symbolic Compiler is written in C++ and OpenCL. The HLS code is synthesized by the Altera OpenCL Offline Compiler (AOC) [1] (v16.0). The HLS-synthesized RTL code is combined with the hand-written RTL code and then fed to Quartus 16.0. The code running on the host is written in C++ and compiled with Visual Studio 2013.

For the FPGA platform, we use the Catapult [18] system with Altera Stratix-V GSMD5 FPGAs integrated. We use the PikesPeak version of Catapult in our experiments, which has a 4GB DDR3 DRAM as the external memory. The FPGA logic clock frequency is 150MHz, and the run-time power of the FPGA board is about 25W. This FPGA board is plugged into a PCI-e Gen2 x8 slot of a host computer.

For performance comparison, we use TensorFlow (r0.9) to run model inference on both CPU and GPU. We use a server that includes 2 processors for the CPU implementation; each processor is an 8-core Xeon [email protected] with a 40MB L3 cache, and the thermal design power (TDP) is 95W. The GPU is an NVIDIA GeForce GTX TITAN X, which has 3072 CUDA cores and 12GB of GDDR5 memory. Its run-time power is about 250W. Both the CPU and GPU implementations run with the batch size set to 256.

B. FPGA Resource Utilization

The resource utilization of our MM implementations is shown in Table II.

TABLE I: CNN Performance Comparison with Prior Work

                    [23]             [19]            [27]             Our Imp.
FPGA chip           Stratix-V GSD8   Zynq XC7Z045    Virtex-7 690T    Stratix-V GSMD5
Frequency           120 MHz          150 MHz         150 MHz          150 MHz
Precision           fixed8-16        fixed16         fixed16          fixed16
DSP Utilization     727/1963         780/900         2833/3600        1036/1590
Overall GOP/S       117.8            137.0           354.0            364.4

TABLE II: Resource Utilization of MM

Precision    float32         fixed16
Logic        164100 (95%)    42349 (25%)
BRAM         1343 (67%)      919 (46%)
DSP          264 (17%)       1036 (65%)

Among all the utilized resources, the Catapult Shell (responsible for peripheral interfaces and memory management) and the matrix multiplication module are written in Verilog, and they take most of the resources. Our Data Arranger, implemented in OpenCL, is very efficient and only takes 2% of the logic, 2% of the BRAM, and a negligible number (8) of DSPs.

C. MM Performance

MM is the major building block of our FPGA-based model inference computation, so we first compare the performance of our MM kernel with other state-of-the-art implementations. We show the performance (in GOP/S, giga operations per second) of our implementations, Intel MKL [24], and the Altera example design [2] for matrix multiplication in Figure 5. Our implementations and the Altera example design run on the same FPGA board, and Intel MKL runs on the CPU with all 16 physical cores fully occupied.

Fig. 5: MM Performance Comparison ((a) Square Matrix Multiplication; (b) Square Matrix to Rectangle Matrix Multiplication)

We first evaluate square matrix multiplication in Figure 5a. Among the three implementations on FPGA, our fixed16 implementation achieves the highest performance, and its advantage over the other two implementations accumulates as the matrix size grows. When compared with the Intel MKL implementation, our fixed16 implementation runs faster when the matrix is small, and MKL only performs better than our implementation when the matrix size grows over 4096.

This observation is further confirmed by a wider exploration of square matrix to rectangle matrix multiplication in Figure 5b. MKL performs nicely when both matrices are large enough in all dimensions, but if the rectangle matrix is very long or very wide, our implementation clearly outperforms MKL, which is usually the case for fully-connected layers and for convolutional layers after Im2col operations. From another perspective, the MKL performance is achieved only when all physical cores are fully occupied, which could hinder other tasks from being executed in time.

D. DNN Performance

To show the performance of our FP-DNN framework on a complete model, we compare our CNN implementations with previous accelerators, as shown in Table I. We implement VGG-19 [22], which has 16 convolution layers, 3 fully-connected layers, and 5 max pooling layers. Since state-of-the-art designs use fixed-point numbers in their implementations, we compare our fixed-point version with them for a fair comparison of performance and resource utilization. The works in [23], [19], and [27] all take the HLS approach (OpenCL-based in [23], C/C++-based in [19] and [27]) for FPGA design. [23] uses an existing matrix multiplication kernel (the Altera example design) to perform convolution. [19] designed a customized convolution kernel for convolutional layers. [27] designed a uniformed convolution kernel for both convolutional layers and fully-connected layers. Different from them, our FP-DNN performs convolution with a generalized MM kernel, which is designed and optimized under certain hardware constraints. From Table I, we can conclude that the implementations generated by our FP-DNN framework achieve state-of-the-art performance even when compared with hand-coded accelerators that are optimized for certain models. Furthermore, taking the DSP utilization into consideration and comparing our design with the other implementations, it is clear that our designs perform much better and use hardware resources more efficiently.

E. Cross-Platform Comparison

To show the great performance and energy efficiency provided by our FP-DNN framework, we compare our implementations with those on CPU and GPU in Table III. We use TensorFlow (r0.9) to run the CPU and GPU implementations.

TABLE III: Performance Comparison on Different Platforms

Model: VGG-19 [22]
Platform      CPU       GPU       FPGA      FPGA
Precision     float32   float32   float32   fixed16
Accuracy      89.99%    89.99%    89.99%    89.9%
GOP/S         119       1704      81        364.36
GOP/J         0.63      6.82      3.24      14.57

Model: LSTM-LM [26]
Platform      CPU       GPU       FPGA      FPGA
Precision     float32   float32   float32   fixed16
Perplexity    78.42     78.42     78.42     78.42
GOP/S         103       1828      86        315.85
GOP/J         0.54      7.31      3.44      12.63

Model: Res-152 [13]
Platform      CPU       GPU       FPGA      FPGA
Precision     float32   float32   float32   fixed16
Accuracy      93.84%    93.84%    93.84%    93.83%
GOP/S         119       1661      73        226.47
GOP/J         0.63      6.60      2.92      9.06

We implement several DNNs as benchmarks: VGG-19 [22] (CNN), LSTM-LM [26] (LSTM-RNN), and Res-152 [13] (Residual Net). Performance is evaluated in GOP/S, and energy efficiency is evaluated in GOP/J (giga operations per joule).
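As a sanity check on the units, GOP/J in Table III appears to be GOP/S divided by the platform's run-time power quoted in Section V-A; for example, the fixed16 FPGA implementation of VGG-19 gives 364.36 GOP/S / 25 W ≈ 14.57 GOP/J, and the GPU gives 1704 GOP/S / 250 W ≈ 6.82 GOP/J.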
We applied data quantization strategies to all three models, and compared the model accuracy between 32-bit floating-point (float32) and 16-bit fixed-point (fixed16) in Table III. We report the top-5 accuracy of VGG-19 and Res-152 on the ImageNet dataset [8]; higher accuracy indicates that the model performs better on the image recognition task. The perplexity of LSTM-LM on the PTB dataset [17] is used to evaluate that model; the lower the perplexity, the better the model performs on the language modeling task. We observe that fixed16 implementations are sufficient for all the networks.

We also compare the implementations generated by FP-DNN with the other implementations in terms of performance. When we use full-precision (float32) data, the implementation generated by FP-DNN is slower than the implementations on CPU. When the data precision is lowered to fixed16, FP-DNN implementations are faster than CPU implementations by about 1.9x∼3.06x. We observe that FP-DNN cannot compete with GPU implementations in performance. Regarding energy efficiency, FP-DNN implementations are always better than CPU implementations for all models and precisions, and FP-DNN can easily beat GPU implementations when the data precision is lowered to fixed16.

VI. CONCLUSIONS

In this paper, we propose FP-DNN, a framework that automatically maps DNNs onto FPGAs to accelerate model inference. FP-DNN analyzes model descriptions to perform model mapping and code generation, then implements model inference with a high-performance computation engine and carefully-designed communication optimization strategies. Our case studies show the great performance and effectiveness achieved by FP-DNN.

VII. ACKNOWLEDGEMENT

This work is supported in part by NSF China (No.61572045) and Microsoft Research Asia (No.FY16-RES-THEME-037). We also would like to thank Sixiao Zhu for many inspiring discussions.

REFERENCES

[1] Altera AOCL. https://www.altera.com/.
[2] Altera OpenCL Example Design. https://www.altera.com/support/support-resources/design-examples/design-software/opencl.html.
[3] CNTK. https://www.cntk.ai/.
[4] Torch. http://torch.ch/.
[5] Martin Abadi, Ashish Agarwal, et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.
[6] Frédéric Bastien, Pascal Lamblin, et al. Theano: New features and speed improvements. Deep Learning and Unsupervised Feature Learning, NIPS 2012 Workshop.
[7] Sharan Chetlur, Cliff Woolley, et al. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759.
[8] Jia Deng, Wei Dong, et al. ImageNet: A large-scale hierarchical image database. 2009.
[9] Alex Graves, Abdel-rahman Mohamed, et al. Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.
[10] Klaus Greff, Rupesh Kumar Srivastava, et al. LSTM: A search space odyssey. arXiv preprint arXiv:1503.04069, 2015.
[11] Song Han, Xingyu Liu, et al. EIE: Efficient inference engine on compressed deep neural network. In Proceedings of the 43rd International Symposium on Computer Architecture, pages 243-254. IEEE Press, 2016.
[12] Song Han, Huizi Mao, et al. Deep Compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. CoRR, abs/1510.00149, 2015.
[13] Kaiming He, Xiangyu Zhang, et al. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[14] Yangqing Jia, Evan Shelhamer, et al. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, 2014.
[15] Haoxiang Li, Zhe Lin, et al. A convolutional neural network cascade for face detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[16] Divya Mahajan, Jongse Park, et al. TABLA: A unified template-based framework for accelerating statistical machine learning. In IEEE International Symposium on High Performance Computer Architecture, 2016.
[17] Mitchell P. Marcus et al. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 1993.
[18] Andrew Putnam, Adrian M. Caulfield, et al. A reconfigurable fabric for accelerating large-scale datacenter services. In ACM/IEEE 41st International Symposium on Computer Architecture (ISCA), 2014.
[19] Jiantao Qiu, Jie Wang, et al. Going deeper with embedded FPGA platform for convolutional neural network. In ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2016.
[20] Ruhi Sarikaya, Geoffrey E. Hinton, et al. Application of deep belief networks for natural language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2014.
[21] Hardik Sharma, Jongse Park, et al. From high-level deep neural models to FPGAs. In IEEE/ACM International Symposium on Microarchitecture, 2016.
[22] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[23] Naveen Suda, Vikas Chandra, et al. Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks. In ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2016.
[24] Endong Wang, Qing Zhang, et al. Intel Math Kernel Library. In High-Performance Computing on the Intel Xeon Phi. Springer, 2014.
[25] Ying Wang, Jie Xu, et al. DeepBurning: Automatic generation of FPGA-based learning accelerators for the neural network family. In IEEE/ACM Proceedings of the Design Automation Conference, 2016.
[26] Wojciech Zaremba, Ilya Sutskever, et al. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.
[27] Chen Zhang, Zhenman Fang, et al. Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks. In International Conference on Computer-Aided Design, 2016.
[28] Chen Zhang, Peng Li, et al. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2015.

