FP-DNN: An Automated Framework for Mapping Deep Neural Networks onto FPGAs with RTL-HLS Hybrid Templates
Abstract—DNNs (Deep Neural Networks) have demonstrated great success in numerous applications such as image classification, speech recognition, video analysis, etc. However, DNNs are much more computation-intensive and memory-intensive than previous shallow models. Thus, it is challenging to deploy DNNs in both large-scale data centers and real-time embedded systems. Considering performance, flexibility, and energy efficiency, FPGA-based accelerators for DNNs are a promising solution. Unfortunately, conventional accelerator design flows make it difficult for FPGA developers to keep up with the fast pace of innovations in DNNs.

To overcome this problem, we propose FP-DNN (Field Programmable DNN), an end-to-end framework that takes TensorFlow-described DNNs as input and automatically generates the hardware implementations on FPGA boards with RTL-HLS hybrid templates. FP-DNN performs model inference of DNNs with our high-performance computation engine and carefully designed communication optimization strategies. We implement CNNs, LSTM-RNNs, and Residual Nets with FP-DNN, and experimental results show the great performance and flexibility provided by the proposed FP-DNN framework.

*Yijin Guan and Hao Liang contributed equally to this work.
†In addition to being a faculty member at UCLA, Jason Cong is also a co-director of the PKU/UCLA Joint Research Institute and a visiting chair professor of Peking University.
This research was performed while Yijin Guan, Hao Liang, Shaoshuai Shi and Xi Chen were interns at Microsoft Research Asia.

I. INTRODUCTION

DNNs have brought profound and revolutionary changes to the realm of artificial intelligence, and have achieved great improvements in many domains such as computer vision [22] [13] [15], speech recognition [9], natural language processing [20], etc. Inspired by the impressive breakthroughs achieved by DNNs, many researchers in both academia and industry are eager to solve their problems with powerful DNNs. With model accuracy close to or even better than that of humans, DNNs are widely deployed at scale in data centers, as well as in embedded systems like mobile phones and robots.

DNNs are well known to be computation-intensive and memory-intensive because of their deep topological structures, complicated neural connections, and the massive amounts of data they process. Due to these characteristics, it is challenging to achieve high performance and good energy efficiency when mapping DNNs onto generic computing systems. To solve this problem, many hardware accelerators for DNN inference have been investigated recently. Among these designs, FPGA-based accelerators have gained great popularity because of their outstanding flexibility, performance, and energy efficiency.

Unfortunately, hand-coded FPGA-based accelerators face both productivity and programmability challenges when mapping DNNs in real applications. On the one hand, the design and optimization of FPGA-based accelerators require much experience and expertise. It may cost a professional hardware developer several weeks to map a DNN model onto FPGAs, even with the help of high-level synthesis tools. For DNN designers, there are no programming interfaces or libraries (like cuBLAS and cuDNN for NVIDIA GPUs) to easily map their models onto FPGAs. On the other hand, prior work on FPGA-based accelerators for DNNs focused on accelerating certain types of layers [28] or certain models [23] [19]. Since DNNs evolve rapidly, various model structures and optimization techniques are emerging so fast that re-designing an FPGA-based accelerator for every new model or technique is quite inefficient.

According to the analysis above, there is a strong demand for an easy-to-use framework that can automatically map DNNs onto FPGAs. In this paper, we propose FP-DNN, which takes symbolic descriptions (in TensorFlow) of DNNs as input, and outputs implementations of the corresponding FPGA-based accelerators for model inference. We implement the accelerators with RTL-HLS hybrid templates, and convert model inference into general-purpose computations like matrix multiplication. Several optimized kernels are developed and invoked to ensure the functionality, performance, and energy efficiency of the accelerator. The entire compilation procedure is end-to-end and automated, which makes it possible for all DNN researchers and users to use FPGAs as a powerful device to perform model inference.

We make the following contributions in this paper:
Fig. 2: Working Flow of Model Mapper
HW Generator is in fact a library of RTL-HLS hybrid templates for various types of layers. We use RTL-HLS hybrid templates instead of pure RTL templates or pure HLS templates for the following reasons. Compared with HLS, RTL designs usually utilize resources more efficiently, but it is well known that RTL design is quite hard and time-consuming. HLS tools take designs written in high-level programming languages (C, C++, OpenCL, etc.) and compile them into FPGA programming files. HLS designs have a better abstraction for external modules or interfaces (like off-chip DRAM), which makes it easier and faster to implement complex control logic. However, current HLS designs cannot exploit as much fine-grained optimization as RTL designs.

To fully utilize the advantages of both design approaches, we take an RTL-HLS hybrid approach to template design: we use RTL to build a high-performance computation engine, and we use an OpenCL-based HLS framework to implement the control logic around the RTL part. With the kernel configuration generated by Model Mapper, HW Generator instantiates the corresponding optimized kernel template to generate the hardware code for Device. The RTL part of these kernel templates is written in Verilog, and the HLS part is written in OpenCL-based HLS. The generated hardware code can be compiled by commercial synthesis tools for FPGA implementation. The library of kernel templates can be further extended when new types of model layers emerge.

We focus on PCI-e based systems because of their popularity in data center computing systems. Data communication between Host and Device is accomplished through a PCI-e slot, which is also used to power on and program the FPGA. Inside the FPGA, hardware kernels are compiled and invoked by Host to perform the computation.

IV. IMPLEMENTATION

The great complexity and variability of DNN structures make it challenging to generate hardware for each of them individually. It is well known that DNNs are constructed by stacking layers, and these layers share a similar structure in their computation-intensive part, which can always be expressed as or converted into matrix multiplication. As a result, we divide the operations involved in each layer into a computation-intensive part and a layer-specific part, and implement the computing architecture shown in Figure 3 on the FPGA board.

Fig. 3: Overall Architecture of Accelerator

For the computation-intensive part, we use a layer-independent matrix multiplication kernel (MM) to perform the calculations. For the layer-specific part, we use a Data Arranger to perform data communication with DRAM directly; it communicates with MM through on-chip channels. We store the model configuration file in DRAM. This configuration file includes information about the model topological structure, layer specifications, etc. During model inference, Data Arranger accesses this configuration file and parses it to schedule data accesses and kernel execution. In the following subsections, we present the computation-intensive part and the layer-specific part respectively, and then introduce the communication optimizations in detail.

A. Computation-Intensive Part

Considering code efficiency and hardware performance, we implemented MM in Verilog. Accelerating matrix multiplication has been a classical problem in the FPGA community, and many optimizations have been adopted. To better exploit data locality, and to make the limited DRAM bandwidth of a modern FPGA board match the computing power of MM, we adopt a tiling strategy for the matrix multiplication. To ensure that the multiplication is performed correctly in a tiled manner, we pad the input matrices with zeros whenever a dimension is not divisible by its tiling size.

MM takes in two tiles of the input matrices and performs the tiled multiplication vector by vector. All the input data are fed into the multipliers simultaneously, and the intermediate results are summed up through a reduction tree to minimize the computing latency. Besides, we use double buffers for the input tiles, and these buffers operate in a ping-pong manner to overlap data communication with computation, which significantly improves the throughput of MM.
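As a concrete illustration of the tiling, zero-padding, and accumulation just described, the following C++ sketch gives a functional model of the tiled computation. It is only a host-side reference, not the Verilog MM engine: the tile size T, the row-major Matrix type, and the loop structure are assumptions of this sketch, and the real engine hides the load of the next tile pair behind the current computation with its ping-pong buffers.

```cpp
#include <cstddef>
#include <vector>

// Functional C++ model of the tiled matrix multiplication performed by MM.
// T is an illustrative tile size; the hardware streams tile vectors into a
// multiplier array followed by a reduction tree.
constexpr std::size_t T = 64;

using Matrix = std::vector<std::vector<float>>;   // row-major, illustrative

// Pad a matrix with zeros so both dimensions become multiples of T,
// mirroring the zero-padding applied before tiled execution.
static Matrix pad_to_tiles(const Matrix& m) {
    const std::size_t rows = (m.size() + T - 1) / T * T;
    const std::size_t cols = (m[0].size() + T - 1) / T * T;
    Matrix p(rows, std::vector<float>(cols, 0.0f));
    for (std::size_t i = 0; i < m.size(); ++i)
        for (std::size_t j = 0; j < m[0].size(); ++j)
            p[i][j] = m[i][j];
    return p;
}

// C = A x B computed tile pair by tile pair; the caller crops C back to the
// unpadded size. The (ti, tj, tk) step is where a double-buffered design
// would prefetch the next pair of T x T tiles while this pair is processed.
Matrix tiled_matmul(const Matrix& A_in, const Matrix& B_in) {
    Matrix A = pad_to_tiles(A_in), B = pad_to_tiles(B_in);
    const std::size_t M = A.size(), K = A[0].size(), N = B[0].size();
    Matrix C(M, std::vector<float>(N, 0.0f));
    for (std::size_t ti = 0; ti < M; ti += T)
        for (std::size_t tj = 0; tj < N; tj += T)
            for (std::size_t tk = 0; tk < K; tk += T)
                for (std::size_t i = ti; i < ti + T; ++i)
                    for (std::size_t j = tj; j < tj + T; ++j) {
                        float acc = 0.0f;                 // reduction over the tile's K range
                        for (std::size_t k = tk; k < tk + T; ++k)
                            acc += A[i][k] * B[k][j];
                        C[i][j] += acc;
                    }
    return C;
}
```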
B. Layer-specific Part

DNNs are constructed from many different types of layers, and the computation and data access patterns vary among these layers, so the strategies for converting them into matrix multiplication are also quite different. In the following subsections, we provide details about the operations performed in typical layers, what the computation-intensive part is, and how the MM kernel is reused.

1) Convolutional Layers: Convolutional layers are overwhelmingly popular in applications like image recognition, object detection, object classification, etc. Suppose we have N_in input channels and N_out output channels, the size of each convolution kernel is K × K, and the sliding stride is set to S. The computation during the inference phase can be summarized as Equation 1 (bias adding is omitted for simplicity).

out[x][y][z] = \sum_{i=1}^{N_{in}} \sum_{j=1}^{K} \sum_{k=1}^{K} \left( in[i][y \times S + j][z \times S + k] \times W[x][i][j][k] \right) \quad (1)
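For reference, Equation 1 maps directly to the following C++ loop nest (bias omitted, as in the equation). The flat row-major buffers and 0-based indexing are assumptions of this sketch, not FP-DNN's on-chip data layout.

```cpp
#include <vector>

// Direct evaluation of Equation 1: out[x][y][z] accumulates
// in[i][y*S + j][z*S + k] * W[x][i][j][k] over all input channels i and the
// K x K kernel window (0-based indices here, unlike the 1-based equation).
void conv_layer(const std::vector<float>& in,   // Nin x Hin x Win
                const std::vector<float>& W,    // Nout x Nin x K x K
                std::vector<float>& out,        // Nout x Hout x Wout
                int Nin, int Hin, int Win,
                int Nout, int Hout, int Wout, int K, int S) {
    for (int x = 0; x < Nout; ++x)
        for (int y = 0; y < Hout; ++y)
            for (int z = 0; z < Wout; ++z) {
                float acc = 0.0f;
                for (int i = 0; i < Nin; ++i)
                    for (int j = 0; j < K; ++j)
                        for (int k = 0; k < K; ++k)
                            acc += in[(i * Hin + y * S + j) * Win + (z * S + k)] *
                                   W[((x * Nin + i) * K + j) * K + k];
                out[(x * Hout + y) * Wout + z] = acc;
            }
}
```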
To perform the computation of convolutional layers with the MM kernel, we need to convert the convolution operations into matrix multiplication. Firstly, we need to turn the input features from a 3-D array into a 2-D array that we can treat as a matrix. To get a single feature in an output channel, we need to convolve a 3-D cube of input features (also known as a patch) with the corresponding convolution kernels. So we take each of these input patches and flatten it into a single row of the input matrix. This operation is known as Im2col (image to column), and it is widely applied in prior CPU and GPU studies [7]. With the input features in matrix form, we can do a similar conversion for the convolution kernels by flattening the corresponding 3-D cubes into single columns of the kernel matrix. According to the rules of matrix multiplication, each output channel is then serialized into a column of the output matrix.
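A minimal sketch of the Im2col conversion just described: every sliding-window patch becomes one row of the input matrix, so a single matrix multiplication against the flattened kernel matrix (N_in*K*K rows, N_out columns) yields the output matrix with one output channel per column. The flat row-major layout and 0-based indexing are assumptions of this sketch, not FP-DNN's on-chip implementation.

```cpp
#include <cstddef>
#include <vector>

// Im2col: flatten each K x K x Nin input patch into one row of 'cols'.
// The result has (Hout*Wout) rows and (Nin*K*K) columns, so multiplying it
// by the flattened kernel matrix (Nin*K*K x Nout) produces the output
// matrix with one output channel per column.
std::vector<float> im2col(const std::vector<float>& in,   // Nin x Hin x Win
                          int Nin, int Hin, int Win,
                          int K, int S, int Hout, int Wout) {
    std::vector<float> cols(static_cast<std::size_t>(Hout) * Wout * Nin * K * K);
    std::size_t r = 0;                       // current row of the input matrix
    for (int y = 0; y < Hout; ++y)
        for (int z = 0; z < Wout; ++z) {     // one sliding-window position = one row
            std::size_t c = 0;
            for (int i = 0; i < Nin; ++i)
                for (int j = 0; j < K; ++j)
                    for (int k = 0; k < K; ++k)
                        cols[r * static_cast<std::size_t>(Nin) * K * K + c++] =
                            in[(i * Hin + y * S + j) * Win + (z * S + k)];
            ++r;
        }
    return cols;
}
```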
2) LSTM Layers: In recent years, Long Short-Term Memory (LSTM) has gained great success in Recurrent Neural Network (RNN) design. Numerous variants of the LSTM structure have been proposed, while [10] finds that all these variants show little difference in model accuracy. In FP-DNN, we implement the LSTM cell used in [26], which is also supported in TensorFlow. The input of an LSTM layer is the combination of the input vector at the current time-step (in_t) and the hidden-layer vector at the previous time-step (h_{t-1}). The LSTM layer multiplies this input with different weight matrices to get the output vectors of four gates: the input gate (I_t), forget gate (F_t), output gate (O_t), and cell gate (C̃_t). These gate vectors then generate the final output vector of the LSTM layer, together with the cell memory of the previous time-step (C_{t-1}), through element-wise operations (element-wise addition, multiplication, and activation). The computation performed in LSTM layers is summarized in Equation 2 to Equation 6, where sig() represents the sigmoid function.

I_t[x] = \mathrm{sig}\left( \sum_{i=1}^{N_{in}} in_t[i] \times W_{in,i}[x][i] + \sum_{i=1}^{N_h} h_{t-1}[i] \times W_{h,i}[x][i] + B_i[x] \right) \quad (2)

F_t[x] = \mathrm{sig}\left( \sum_{i=1}^{N_{in}} in_t[i] \times W_{in,f}[x][i] + \sum_{i=1}^{N_h} h_{t-1}[i] \times W_{h,f}[x][i] + B_f[x] \right) \quad (3)

\tilde{C}_t[x] = \tanh\left( \sum_{i=1}^{N_{in}} in_t[i] \times W_{in,c}[x][i] + \sum_{i=1}^{N_h} h_{t-1}[i] \times W_{h,c}[x][i] + B_{\tilde{c}}[x] \right) \quad (4)

O_t[x] = \mathrm{sig}\left( \sum_{i=1}^{N_{in}} in_t[i] \times W_{in,o}[x][i] + \sum_{i=1}^{N_h} h_{t-1}[i] \times W_{h,o}[x][i] + B_o[x] \right) \quad (5)

h_t[x] = O_t[x] \times \tanh\left( F_t[x] \times C_{t-1}[x] + I_t[x] \times \tilde{C}_t[x] \right) \quad (6)

From the equations above, we can see that the computation-intensive part of LSTM layers is matrix-to-vector multiplication. Considering a vector as a matrix whose length along one dimension is 1, we can map LSTM layer inference to MM.
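The mapping of one LSTM time-step onto matrix operations can be sketched as follows: each of Equations 2-5 is a single matrix-to-vector product over the concatenated vector [in_t, h_{t-1}] (with W_{in,*} and W_{h,*} packed side by side, plus the bias), which is exactly the shape of work handed to MM, while Equation 6 remains element-wise. The layouts and helper names below are assumptions of this functional C++ sketch, not the hardware kernel.

```cpp
#include <cmath>
#include <vector>

// Functional sketch of Equations 2-6 for one LSTM time-step.
// Each gate is one matrix-to-vector product over xh = [in_t, h_{t-1}],
// with the per-gate weight matrix packing W_in,* and W_h,* side by side.
// The element-wise tail (Equation 6) stays outside MM.
static float sig(float v) { return 1.0f / (1.0f + std::exp(-v)); }

// y = W * xh + B, with W stored row-major as Nh rows of (Nin + Nh) weights.
static std::vector<float> matvec(const std::vector<float>& W,
                                 const std::vector<float>& xh,
                                 const std::vector<float>& B, int Nh) {
    const int cols = static_cast<int>(xh.size());
    std::vector<float> y(Nh);
    for (int r = 0; r < Nh; ++r) {
        float acc = B[r];
        for (int c = 0; c < cols; ++c) acc += W[r * cols + c] * xh[c];
        y[r] = acc;
    }
    return y;
}

// One time-step: returns h_t and updates the cell memory C in place.
std::vector<float> lstm_step(const std::vector<float>& xh,          // [in_t, h_{t-1}]
                             const std::vector<float>& Wi, const std::vector<float>& Bi,
                             const std::vector<float>& Wf, const std::vector<float>& Bf,
                             const std::vector<float>& Wc, const std::vector<float>& Bc,
                             const std::vector<float>& Wo, const std::vector<float>& Bo,
                             std::vector<float>& C, int Nh) {
    std::vector<float> gi = matvec(Wi, xh, Bi, Nh);   // Equation 2 (pre-activation)
    std::vector<float> gf = matvec(Wf, xh, Bf, Nh);   // Equation 3
    std::vector<float> gc = matvec(Wc, xh, Bc, Nh);   // Equation 4
    std::vector<float> go = matvec(Wo, xh, Bo, Nh);   // Equation 5
    std::vector<float> h(Nh);
    for (int x = 0; x < Nh; ++x) {
        float It = sig(gi[x]), Ft = sig(gf[x]), Ot = sig(go[x]);
        float Ct_tilde = std::tanh(gc[x]);
        float c = Ft * C[x] + It * Ct_tilde;          // cell update used inside Equation 6
        C[x] = c;                                     // carry cell memory to the next step
        h[x] = Ot * std::tanh(c);                     // Equation 6
    }
    return h;
}
```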
3) Fully-Connected Layers: A fully-connected layer outputs a vector (out) from an input vector (in) and a weight matrix (W). Fully-connected layers are also widely deployed in ANNs and as classifiers in DNNs. As a result, we design a uniform template for all these layers. The inference phase of fully-connected layers can be summarized as Equation 7. The computation-intensive part is matrix-to-vector multiplication, so we can reuse the MM kernel to perform it.

out[x] = \sum_{i=1}^{N_{in}} in[i] \times W[x][i] + B[x] \quad (7)

4) Other Layers: The recurrent layer inside simple RNNs (those not using LSTM) is constructed by adding recurrent connections to a fully-connected layer, so the computation-intensive part of both layers is the same. Thus, we can also map recurrent layers to MM as we do for fully-connected layers.

Activation layers are element-wise functions applied to the features, and typical activation functions include tanh(), sigmoid(), and ReLU(). Thus, before the layer outputs are offloaded to DRAM, we apply the activation functions directly instead of mapping them to MM.

Pooling layers extract input features through a sliding window and choose the average or maximum of this window as output. There is little computation involved in pooling layers, so there is no need to map them to MM; the pooling operations are applied before the outputs are offloaded to DRAM.

C. Communication Optimization

Since our computation kernels need to communicate with off-chip DRAM for inputs and outputs, the achieved bandwidth is also an important factor to be considered in system design, especially on bandwidth-limited platforms like FPGAs. Previous studies [28] [27] showed that the effective DRAM bandwidth can be raised by increasing the DRAM burst length. In our on-board tests, discontinuous access to DRAM results in a limited burst length, which degrades the achieved bandwidth to ∼1GB/s, while continuous access to DRAM improves the achieved bandwidth to ∼8GB/s. To prevent I/O from becoming a serious bottleneck for the overall performance, we propose several methods to optimize the effective DRAM bandwidth for different layers.

1) Convolutional Layers: For the communication optimizations in convolutional layers, we use Figure 4 as a simplified example to illustrate the problems and our solutions. In this example, we set the number of input channels to 8, and each channel has 3×3 elements, so we get 72 input elements in total. The size of the convolution kernel is 2×2, and the sliding stride is 1. According to the Im2col operation introduced in Section IV.B.1, we can convert the input features into an Input Matrix, and we divide this matrix into 4×2 equal tiles. In Figure 4, we show three different layout schemes for comparison: Im2col, Row-major, and Channel-major. For each scheme, we show its DRAM layout and the DRAM accessing pattern for the first tile.

Im2col: As Figure 4a shows, we can store the entire Input Matrix in DRAM by flattening each tile, which ensures continuous accessing. However, it is obvious that this scheme stores the whole Input Matrix (128 elements) in DRAM, which requires data duplication for adjacent sliding windows. This duplication brings a large memory-footprint overhead, which should be avoided. Besides, to offload the outputs of this convolutional layer as inputs for the following layers, extra operations for data reorganization and duplication are needed.
Fig. 4: Layout Optimization

Row-major: A straightforward way to avoid data duplication is to store the elements of each channel of the Input Features in a row-major manner. We show this scheme in Figure 4b. This scheme stores 72 elements in DRAM in total, which means there is no data duplication. But we can find that it takes two DRAM bursts to fetch the first tile, which indicates discontinuous DRAM accessing, and this discontinuous accessing pattern degrades the effective bandwidth. Similar to Im2col, offloading the outputs still requires extra operations.

Channel-major: Different from Row-major, Channel-major stores the Input Features in a channel-major manner. Thus, the Input Matrix needs to be reorganized correspondingly: each row (input patch) is also flattened in a channel-major manner. The contents of the reorganized first tile are shown in Figure 4c. There are in total 72 elements stored in DRAM for the Input Features without any data duplication, and DRAM is also accessed continuously when fetching input elements. Furthermore, in this scheme the outputs are also generated in a channel-major manner, which means that no extra operations for data reorganization or duplication are needed.

With the comparison above, we choose the Channel-major scheme to optimize the communication of convolutional layers. Along with the Channel-major scheme, the weight matrix also needs to be adjusted accordingly, but the overhead of this adjustment can be ignored, since the weights are pre-trained and the adjustment can be applied before model deployment.
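To make the layout difference concrete, the sketch below contrasts the per-channel row-major layout with one plausible reading of the Channel-major scheme, in which the N_in channel values of each pixel sit next to each other and every input patch is flattened in the same channel-first order. The exact arrangement drawn in Figure 4 is not reproduced here, so the index helpers and the gather routine are assumptions of this illustration, not the Data Arranger's actual addressing logic.

```cpp
#include <cstddef>

// Two candidate DRAM layouts for an Nin x H x W input feature map.
// Per-channel row-major: element (c, y, x) -> c*H*W + y*W + x
// Channel-major:         element (c, y, x) -> (y*W + x)*Nin + c
// The second keeps the Nin channel values of one pixel adjacent, which is
// how the Channel-major scheme is interpreted in this sketch.
inline std::size_t idx_row_major(int c, int y, int x, int H, int W) {
    return static_cast<std::size_t>(c) * H * W + static_cast<std::size_t>(y) * W + x;
}
inline std::size_t idx_channel_major(int c, int y, int x, int Nin, int W) {
    return (static_cast<std::size_t>(y) * W + x) * Nin + c;
}

// Gather one K x K input patch (one row of the Input Matrix, flattened in
// the same channel-first order) for output position (oy, ox). With the
// channel-major layout, each (j, k) step reads a contiguous run of Nin
// elements, instead of the scattered accesses of the per-channel layout.
void gather_patch_channel_major(const float* in, float* row,
                                int oy, int ox, int Nin, int W, int K, int S) {
    std::size_t c = 0;
    for (int j = 0; j < K; ++j)
        for (int k = 0; k < K; ++k)
            for (int ch = 0; ch < Nin; ++ch)        // contiguous in DRAM
                row[c++] = in[idx_channel_major(ch, oy * S + j, ox * S + k, Nin, W)];
}
```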
2) LSTM Layers & Fully-Connected Layers: According to the algorithm descriptions in Section IV.B.2 and Section IV.B.3, the computation-intensive part of LSTM layers and fully-connected layers mainly consists of matrix-to-vector multiplication. Unfortunately, matrix-to-vector multiplication is inefficient in terms of data locality, because every weight element fetched from DRAM is used only once for a single inference. Thus, most of the inference time is spent on data communication, which indicates that performing model inference directly with the MM kernel would bring a large performance loss. To perform these computations with MM efficiently, we propose to batch input vectors together. In this batched form, every element of the weight matrices is reused, and we actually convert matrix-to-vector multiplication into matrix multiplication, which can be efficiently accomplished by MM.
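The batching idea amounts to laying the input vectors side by side as the columns of a single matrix and issuing one matrix multiplication, so each weight element fetched from DRAM is reused across the whole batch. The following host-side C++ sketch illustrates this; the row-major weight layout and the packing order are assumptions of the example, not the exact on-chip data path.

```cpp
#include <cstddef>
#include <vector>

// Batch B input vectors (each of length Nin) into one Nin x B matrix X and
// compute Y = W * X with a single matrix multiplication. Every element of
// W (Nout x Nin, row-major) is then fetched once and reused across all B
// inferences, instead of once per inference as in matrix-to-vector mode.
std::vector<float> batched_fc(const std::vector<float>& W, int Nout, int Nin,
                              const std::vector<std::vector<float>>& inputs) {
    const int B = static_cast<int>(inputs.size());
    std::vector<float> X(static_cast<std::size_t>(Nin) * B);
    for (int b = 0; b < B; ++b)                // pack each input vector as a column of X
        for (int i = 0; i < Nin; ++i)
            X[static_cast<std::size_t>(i) * B + b] = inputs[b][i];

    std::vector<float> Y(static_cast<std::size_t>(Nout) * B, 0.0f);   // Y is Nout x B
    for (int r = 0; r < Nout; ++r)
        for (int i = 0; i < Nin; ++i) {
            const float w = W[static_cast<std::size_t>(r) * Nin + i];  // fetched once ...
            for (int b = 0; b < B; ++b)                                // ... reused B times
                Y[static_cast<std::size_t>(r) * B + b] += w * X[static_cast<std::size_t>(i) * B + b];
        }
    return Y;
}
```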
3) Other Layers: The computation-intensive part of recurrent layers is matrix-to-vector multiplication, the same as for LSTM layers and fully-connected layers, so we apply a similar batching scheme to optimize their DRAM communication. Other layers, like pooling layers and activation layers, do not need much data communication with DRAM, so no communication optimization is applied to them.

D. Data Quantization

Numerous prior works [11] [12] have shown that the accuracy of DNNs is robust to a decrease in data precision. Many previous works on accelerating DNN inference [23] [19] used fixed-point parameters in their designs to improve performance and save resources; this optimization is also called data quantization. In our implementation, we therefore support generating fixed-point versions of the target model. Designers using our FP-DNN framework can specify the fixed-point precision by simply using the “-fixed point” compilation option in our Symbolic Compiler. In practice, data quantization is done off-line, and the accuracy loss brought by data quantization should be estimated and tested by the users of FP-DNN in advance.
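As an illustration of what such an off-line quantization step produces, the sketch below converts float32 parameters into a 16-bit fixed-point representation with a chosen number of fractional bits, together with a matching dequantization helper for accuracy checks. The round-to-nearest and saturation policy and the per-tensor fractional width are assumptions of this example, not the exact behavior of the “-fixed point” option.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Convert float32 values to 16-bit fixed-point with 'frac_bits' fractional
// bits (value ~= q / 2^frac_bits), saturating at the int16 range.
std::vector<int16_t> quantize_fixed16(const std::vector<float>& w, int frac_bits) {
    const float scale = std::ldexp(1.0f, frac_bits);          // 2^frac_bits
    std::vector<int16_t> q(w.size());
    for (std::size_t i = 0; i < w.size(); ++i) {
        long v = std::lround(w[i] * scale);                   // round to nearest
        v = std::max(-32768L, std::min(32767L, v));           // saturate to int16
        q[i] = static_cast<int16_t>(v);
    }
    return q;
}

// Dequantize for checking the accuracy loss on the host before deployment.
inline float dequantize_fixed16(int16_t q, int frac_bits) {
    return static_cast<float>(q) / std::ldexp(1.0f, frac_bits);
}
```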
V. EVALUATION

A. Experimental Setup

In FP-DNN, Symbolic Compiler is written in C++ and OpenCL. The HLS code is synthesized by the Altera OpenCL Offline Compiler (AOC) [1] (v16.0). The HLS-synthesized RTL code is combined with hand-written RTL code and then fed to Quartus 16.0. The code running on the host is written in C++ and compiled with Visual Studio 2013.

For the FPGA platform, we use the Catapult [18] system with Altera Stratix-V GSMD5 FPGAs integrated. We use the PikesPeak version of Catapult in our experiments, which has a 4GB DDR3 DRAM as the external memory. The FPGA logic clock frequency is 150MHz, and the run-time power of the FPGA board is about 25W. This FPGA board is plugged into a PCI-e Gen2 x8 slot of a host computer.

For performance comparison, we use TensorFlow (r0.9) to run model inference on both CPU and GPU. We use a server that includes 2 processors for the CPU implementation; each processor is an 8-core Xeon with a 40MB L3 cache, and the thermal design power (TDP) is 95W. The GPU is an NVIDIA GeForce GTX TITAN X, which has 3072 CUDA cores and 12GB of GDDR5 memory. Its run-time power is about 250W. Both the CPU and GPU implementations run with the batch size set to 256.

B. FPGA Resource Utilization

The resource utilization of our MM implementations is shown in Table II. Among all the utilized resources, the Catapult Shell (responsible for peripheral interfaces and memory
TABLE I: CNN Performance Comparison with Prior Work

                   [23]              [19]            [27]            Our Imp.
FPGA chip          Stratix-V GSD8    Zynq XC7Z045    Virtex-7 690T   Stratix-V GSMD5
Frequency          120 MHz           150 MHz         150 MHz         150 MHz
Precision          fixed8-16         fixed16         fixed16         fixed16
DSP Utilization    727/1963          780/900         2833/3600       1036/1590
Overall GOP/S      117.8             137.0           354.0           364.4
TABLE III: Performance Comparison on Different Platforms

Model: VGG-19 [22]
Platform      CPU       GPU       FPGA      FPGA
Precision     float32   float32   float32   fixed16
Accuracy      89.99%    89.99%    89.99%    89.9%
GOP/S         119       1704      81        364.36
GOP/J         0.63      6.82      3.24      14.57

Model: LSTM-LM [26]
Platform      CPU       GPU       FPGA      FPGA
Precision     float32   float32   float32   fixed16
Perplexity    78.42     78.42     78.42     78.42
GOP/S         103       1828      86        315.85
GOP/J         0.54      7.31      3.44      12.63

Model: Res-152 [13]
Platform      CPU       GPU       FPGA      FPGA
Precision     float32   float32   float32   fixed16
Accuracy      93.84%    93.84%    93.84%    93.83%
GOP/S         119       1661      73        226.47
GOP/J         0.63      6.60      2.92      9.06

We evaluate FP-DNN with three models: VGG-19 [22] (CNN), LSTM-LM [26] (LSTM-RNN), and Res-152 [13] (Residual Net). Performance is evaluated in GOP/S, and energy efficiency is evaluated in GOP/J (giga operations per joule).

We applied data quantization strategies to all three models, and compared the model accuracy between 32-bit floating-point (float32) and 16-bit fixed-point (fixed16) in Table III. We report the top-5 accuracy of VGG-19 and Res-152 on the ImageNet dataset [8]; higher accuracy indicates that the model performs better in the image recognition task. The perplexity of LSTM-LM on the PTB dataset [17] is used to evaluate that model; the lower the perplexity, the better the model performs in the language modeling task. We observe that fixed16 implementations are sufficient for all the networks.

We also compare the performance of the implementations generated by FP-DNN with the other implementations. When we use full-precision (float32) data, the implementation generated by FP-DNN is slower than the CPU implementations. When the data precision is lowered to fixed16, FP-DNN implementations are faster than CPU implementations by about 1.9x∼3.06x. We observe that FP-DNN cannot compete with GPU implementations in performance. Regarding energy efficiency, however, FP-DNN implementations are always better than CPU implementations for all models and precisions, and FP-DNN easily beats GPU implementations when the data precision is lowered to fixed16.

VI. CONCLUSIONS

In this paper, we propose FP-DNN, a framework that automatically maps DNNs onto FPGAs to accelerate model inference. FP-DNN analyzes model descriptions to perform model mapping and code generation, and then implements model inference with a high-performance computation engine and carefully designed communication optimization strategies. Our case studies show the great performance and effectiveness achieved by FP-DNN.

VII. ACKNOWLEDGEMENT

This work is supported in part by NSF China (No.61572045) and Microsoft Research Asia (No.FY16-RES-THEME-037). We would also like to thank Sixiao Zhu for many inspiring discussions.

REFERENCES

[1] Altera AOCL. https://www.altera.com/.
[2] Altera OpenCL Example Design. https://www.altera.com/support/support-resources/design-examples/design-software/opencl.html.
[3] CNTK. https://www.cntk.ai/.
[4] Torch. http://torch.ch/.
[5] Martín Abadi, Ashish Agarwal, et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, 2015.
[6] Frédéric Bastien, Pascal Lamblin, et al. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning, NIPS 2012 Workshop.
[7] Sharan Chetlur, Cliff Woolley, et al. cuDNN: Efficient Primitives for Deep Learning. arXiv preprint arXiv:1410.0759.
[8] Jia Deng, Wei Dong, et al. ImageNet: A large-scale hierarchical image database. 2009.
[9] Alex Graves, Abdel-rahman Mohamed, et al. Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.
[10] Klaus Greff, Rupesh Kumar Srivastava, et al. LSTM: A search space odyssey. arXiv preprint arXiv:1503.04069, 2015.
[11] Song Han, Xingyu Liu, et al. EIE: Efficient inference engine on compressed deep neural network. In Proceedings of the 43rd International Symposium on Computer Architecture, pages 243–254. IEEE Press, 2016.
[12] Song Han, Huizi Mao, et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. CoRR, abs/1510.00149, 2015.
[13] Kaiming He, Xiangyu Zhang, et al. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[14] Yangqing Jia, Evan Shelhamer, et al. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, 2014.
[15] Haoxiang Li, Zhe Lin, et al. A Convolutional Neural Network Cascade for Face Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[16] Divya Mahajan, Jongse Park, et al. TABLA: A unified template-based framework for accelerating statistical machine learning. In IEEE International Symposium on High Performance Computer Architecture, 2016.
[17] Mitchell P. Marcus et al. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 1993.
[18] Andrew Putnam, Adrian M. Caulfield, et al. A reconfigurable fabric for accelerating large-scale datacenter services. In ACM/IEEE 41st International Symposium on Computer Architecture (ISCA), 2014.
[19] Jiantao Qiu, Jie Wang, et al. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network. In ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2016.
[20] Ruhi Sarikaya, Geoffrey E. Hinton, et al. Application of deep belief networks for natural language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2014.
[21] Hardik Sharma, Jongse Park, et al. From high-level deep neural models to FPGAs. In IEEE/ACM International Symposium on Microarchitecture, 2016.
[22] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[23] Naveen Suda, Vikas Chandra, et al. Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks. In ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2016.
[24] Endong Wang, Qing Zhang, et al. Intel Math Kernel Library. In High-Performance Computing on the Intel Xeon Phi. Springer, 2014.
[25] Ying Wang, Jie Xu, et al. DeepBurning: Automatic generation of FPGA-based learning accelerators for the neural network family. In IEEE/ACM Proceedings of Design Automation Conference, 2016.
[26] Wojciech Zaremba, Ilya Sutskever, et al. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.
[27] Chen Zhang, Zhenman Fang, et al. Caffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks. In International Conference on Computer Aided Design, 2016.
[28] Chen Zhang, Peng Li, et al. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks. In ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2015.