Research on OpenCL Optimization for FPGA Deep Learning Application

Citation: Zhang S, Wu Y, Men C, He H, Liang K (2019) Research on OpenCL optimization for FPGA deep learning application. PLoS ONE 14(10): e0222984. https://doi.org/10.1371/journal.pone.0222984

Abstract

In recent years, with the development of computer science, deep learning is held as competent enough to solve the problem of inference and learning in high dimensional space. Therefore, it has received unprecedented attention from both academia and the business community. Compared with CPU/GPU, FPGA has attracted much attention for its high energy efficiency, short development cycle and reconfigurability in the aspect of deep learning algorithms. However, because of the limited research on OpenCL optimization of deep learning algorithms on FPGA, OpenCL tools and models applied to CPU/GPU cannot be directly used on FPGA. This makes it difficult for software programmers to use FPGA when implementing deep learning algorithms with rewarding performance. To solve this problem, this paper proposes an OpenCL computational model based on an FPGA template architecture to optimize the time-consuming convolution layer in deep learning. The comparison between the program applying the computational model and the corresponding optimization program provided by Xilinx indicates that the former is 8-40 times higher than the latter in terms of performance.
In the reasoning stage of a convolutional neural network, the Microsoft team uses an FPGA (Stratix V D5) to achieve an acceleration performance of 134 images processed per second, with a power consumption of only 25 watts. If the superior FPGA (Arria 10 GX1150) is used, this acceleration performance is expected to reach 233 images processed per second, while the power consumption remains basically unchanged. For a high-performance GPU implementation (Caffe + cuDNN), the acceleration performance is 500-824 images processed per second, and the power consumption is 235 watts [5]. It means that FPGA has better energy efficiency compared with GPU [6] [7] [8] [9]. Unlike GPU and ASIC with fixed hardware architectures,
FPGA is reconfigurable hardware, which means developers can connect the logical blocks
within the FPGA through programmable connections to achieve their desired function
[10]. This programmability enables developers to adjust their hardware design at any time
according to the deep learning algorithm. However, hardware acceleration design based on FPGA requires software developers to have a certain amount of hardware expertise, which is a high threshold for them. In recent years, the FPGA programming environment has been greatly improved. Developers without corresponding hardware expertise can now develop FPGAs with high-level programming languages such as C, C++ and OpenCL. This to some extent reduces the difficulty of FPGA development, shortens the FPGA development cycle and provides convenience for researchers and developers [11]. In order to reduce the difficulty of FPGA development, the key technologies in the automated high-level synthesis tool chain have been studied. These studies can be classified from different perspectives. From the perspective of the input language used by the user, they can be divided into C language and C-like language. Research that uses C/C++ as the input language [12] [13] [14] falls into two categories when implementing automated generation of FPGA hardware architectures. One category builds a complete automated generation tool chain: the process of generating the hardware architecture is completely controllable, but the disadvantage is the insufficient universality of the tools [14]. The other uses the current mainstream high-level synthesis tool chain for hardware generation [12] [13], but it requires in-depth study of the automated code generator: the user's C/C++ code is translated to generate the input language supported by the commercial tool chain, so the main question in this category is how to map one high-level language to another high-level language (such as OpenCL). Another kind of research work, which directly uses a C-like language (such as OpenCL) as the input language, focuses on different architectures of the CNN accelerator [15] [16] [17]. However, because the same program function can be implemented with different OpenCL code, the hardware architecture generated by the automated tool optimization differs. To implement efficient hardware circuits, developers need to repeatedly try various optimization configuration combinations; even a push-button automated tool requires iterative optimization. The key innovation of this paper is a computational model that helps software engineers rationally design parameters without additional third-party tools, quickly reduce the iterative rewriting of OpenCL code, and generate efficient hardware based on a deeply pipelined loop architecture.
$$y^{(k)}_{ij} = \max_{s,t}\left(a^{(k)}_{(l_1 i + s)(l_2 j + t)}\right) \qquad (2)$$
In the formula, l1 and l2 represent the size of the pooling kernel. For the pooling layer and the convolution layer, we tend to pool after convolution and put an activation function after the convolution. The activation function is a simple non-linear operation, which improves the ability of non-linear characterization. With the process of convolution-activation-pooling, a CNN can obtain more robust features.
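To make Eq (2) concrete, the following C sketch applies max pooling with an l1 × l2 window to one channel of a feature map. It is an illustration only; the row-major layout and the function name max_pool2d are assumptions, not part of the original implementation.

#include <float.h>

/* Max pooling of one H x W channel with an l1 x l2 window and matching stride.
 * 'in' and 'out' are row-major; 'out' must hold (H/l1) x (W/l2) elements. */
static void max_pool2d(const float *in, float *out, int H, int W, int l1, int l2)
{
    for (int i = 0; i < H / l1; ++i) {
        for (int j = 0; j < W / l2; ++j) {
            float m = -FLT_MAX;
            for (int s = 0; s < l1; ++s)
                for (int t = 0; t < l2; ++t) {
                    float v = in[(l1 * i + s) * W + (l2 * j + t)];
                    if (v > m) m = v;   /* y_ij = max over the window, as in Eq (2) */
                }
            out[i * (W / l2) + j] = m;
        }
    }
}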
One problem that deep learning convolutional neural networks bring is that the convolution layer consumes a lot of memory [18], especially in the training process, because back-propagation needs all the intermediate values of the forward pass. If the size of the input image is H × W and the filter size is m × n, the convolution can be expressed as Eq (3):
$$z^{(k)}_{ij} = \sum_{s=0}^{m-1}\sum_{t=0}^{n-1} w^{(k)}_{st}\, x_{(i+s)(j+t)} \qquad (3)$$
In the equation, w is the weight of the kernel. However, the equation above is not enough when multiple convolution layers are considered. Thus, a parameter is added to the kernel. The modified equation is Eq (4):
$$z^{(k)}_{ij} = \sum_{c}\sum_{s=0}^{m-1}\sum_{t=0}^{n-1} w^{(k,c)}_{st}\, x^{(c)}_{(i+s)(j+t)} \qquad (4)$$
In the equation, c represents the image channel. If the number of kernels is k and the number of channels is c, the convolution output size is (M − m + 1) × (N − n + 1) according to the above equation. Assuming that the size of the convolution kernel is 5×5 and 200 feature maps with the size of 150×100 need to be output, then if the input has three channels, the whole process requires 225 million floating-point multiplications. This process involves a large number of multiply-add calculations, which requires a reasonable computational model to improve the performance of the system. However, in the actual optimization, we should not only consider the optimization of computation, but also whether the storage resources on the FPGA chip can deliver the data needed for the multiply-add calculations at one time [19]. Assuming that ThroughputRate is the throughput of the system, it is affected by two aspects: computation and memory access. The relationship between system throughput and these two aspects guides the choice of the template parameters described below.
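As a quick check of the figure above, the multiplication count follows directly from Eq (4) and the stated dimensions: each of the 200 output feature maps has 150 × 100 pixels, and every output pixel requires a 5 × 5 multiplication window over each of the 3 input channels:

$$200 \times (150 \times 100) \times 3 \times (5 \times 5) = 2.25 \times 10^{8}\ \text{multiplications.}$$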
Template parameters
According to the parameterized optimization architecture diagram, the following parameters need to be calculated and determined:
• Determine the amount of parameter data Nd, where Ne is the number of elements in the vector data type:

$$N_d = K_n \times N_e \qquad (6)$$

In the equation, Kn is a positive integer, and num_bit is the number of data bits corresponding to the data type.
• The theoretical value of the total number of data transfers is Nt, and the average amount of data per transfer is Kt:

$$N_t = \sum_{i=1}^{n}\left\lceil \frac{N_i}{Bu_i} \right\rceil \qquad (8)$$

$$K_t = \frac{\sum_{i=1}^{n} N_i B_i}{N_t \times 8 \times 1000} \qquad (9)$$

In the equations, Ni is the total number of data transfers for each variable, Bi is the number of data bits, and Bui is the burst length of the read/write.
• The number of DSPs needed for the calculation is dsp_need:

$$dsp\_need = N_{unroll} \times K \qquad (10)$$

In the equation, N_unroll is the degree of loop unrolling in the unrolling scheme, and K is the number of DSPs used in each loop iteration. This value can be obtained from the resource usage report; the number of DSPs required for each multiplication operation can be obtained from the related development board documents, from which K can also be calculated.
• Calculate the number of sub-parts of the array data memory, p_num:

$$p\_num = \begin{cases} \min(num,\ num\_max), & \text{loop (cyclic) partition} \\ \min(num\_d / addr\_i,\ num\_max), & \text{block partition} \end{cases} \qquad (11)$$

In the equation, num_d is the total number of data elements in the array, num is the number of consecutive data addresses for each calculation, num_max is the upper limit of array partitioning supported by the compiler, and addr_i is the address interval between adjacent data elements.
• The loop boundary Lb:

$$L_b = \begin{cases} x, & \text{the loop boundary is variable} \\ c, & \text{the loop boundary is constant} \end{cases} \qquad (12)$$

In the equation, B and D are the percentages of BRAM resources and DSP resources consumed by a single cell, respectively, and Nk is the number of cells restricted by the compiler.
• The number of memory ports for storing data is num_port, the theoretical parallelism of the data calculation is v_cal, and the maximum parallelism of data reading/writing is v_data. These three parameters can be obtained from the program execution information.
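To illustrate how these template parameters are evaluated in practice, the following C sketch computes Eqs (6) and (8)-(11) for one hypothetical configuration. The transfer counts are taken from Table 4; the vector width, data bits, burst length, DSP figures and compiler limit are assumptions for illustration, not values reported by the toolchain.

#include <stdio.h>

int main(void)
{
    /* Eq (6): amount of data Nd as a multiple Kn of the vector width Ne (assumed values). */
    int Ne = 16, Kn = 75;
    long Nd = (long)Kn * Ne;

    /* Eqs (8)-(9): theoretical transfer count Nt and average data per transfer Kt (KB).   */
    long N[2]  = {408969216L, 186624L};   /* transfers per variable, from Table 4          */
    int  B[2]  = {32, 32};                /* data bits per element (assumed)               */
    int  Bu[2] = {16, 16};                /* burst length of the burst read/write model    */
    long Nt = 0; double bits = 0.0;
    for (int i = 0; i < 2; ++i) {
        Nt   += (N[i] + Bu[i] - 1) / Bu[i];   /* ceil(Ni / Bui)                            */
        bits += (double)N[i] * B[i];
    }
    double Kt = bits / (Nt * 8.0 * 1000.0);   /* sum(Ni*Bi) / (Nt*8*1000)                  */

    /* Eq (10): DSPs needed for an unrolling scheme (assumed N_unroll and K).              */
    int dsp_need = 25 * 5, dsp_total = 3600;

    /* Eq (11): sub-parts of the array memory for a cyclic (loop) partition.               */
    int num = 25, num_max = 1024;             /* num_max: assumed compiler limit           */
    int p_num = num < num_max ? num : num_max;

    printf("Nd=%ld Nt=%ld Kt=%.3f KB dsp_need=%d (of %d) p_num=%d\n",
           Nd, Nt, Kt, dsp_need, dsp_total, p_num);
    return 0;
}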
Computational model
1. The relevant parameters are obtained from the hardware architecture template diagram and the parameter calculation equations above. The specific OpenCL optimization techniques and related parameters are then selected by the following algorithm steps.
(1) Judge whether the address stored in the off-chip global memory for the corresponding parameter is continuous. If so, enter (2); if not, data vectorization optimization is not carried out.
(2) Judge whether the data quantity contained in the parameter is suitable for data vectorization. Ne represents the number of elements in the vector data types supported by the OpenCL compiler, Ne ∈ {2, 3, 4, 8, 16}. Traverse the values of Ne. If there exists Kn ∈ N+ such that Nd and Ne satisfy Eq (6), the value of Kn is recorded. After the traversal is completed, all recorded values of Kn are combined into a set called L. If the set L is not empty, enter (3). If the set is empty, data vectorization optimization is not carried out.
(3) The minimum value in the set is recorded as Lmin. The data are grouped in ascending order of size, and the number of groups is Lmin. The number of data elements in each group is Nd/Lmin. After completing the grouping, each group of data is used as a whole to replace the original data in the kernel. If they can be substituted equivalently without affecting the correct execution of the program, enter (4). Otherwise, remove Lmin from the set L and repeat (3).
(4) Carry out data vectorization optimization. Nd/Lmin represents the number of elements
contained in vector type data.
2. Configure the number of data ports and the bit width. The bit width of a data port is usually related to the data type transmitted through the port. At present, the OpenCL compiler supports bit widths of 32, 64, 128, 256 and 512 bits. If the number of bits of the corresponding data type num_bit ∈ {32, 64, 128, 256, 512}, the port bit width is set to num_bit; otherwise, keep the default setting. By default, the OpenCL compiler automatically configures the bit width of the data port according to the actual situation.
3. Assume that there are n global variables involved in the data transfer (read); the total number of data transfers (read) per variable is N1, N2, ..., Nn; the data bits are B1, B2, ..., Bn; and the burst lengths of the data reads and writes are Bu1, Bu2, ..., Bun. The burst length of the burst read/write model is usually 16, and the length of the non-burst read/write model is 1. According to the program execution report, it is judged whether a new optimization should be carried out. The steps are as follows:
(1) Record the total number of data transfers (read/write) in the program execution report and the average amount of data (read/write) per transfer. The values are Nr, Nw, Kr and Kw respectively.
(2) According to Eqs (8) and (9), calculate the total number of data transfers (read/write) and the average amount of data (read/write) per transfer after memory optimization. The values obtained are Ntr, Ntw, Ktr and Ktw respectively.
(3) If Ntr is less than Nr (or Ktr is greater than Kr), or Ntw is less than Nw (or Ktw is greater than Kw), and the difference is large, it is necessary to adjust the optimization; otherwise, there is no need to re-optimize.
4. The set A contains all the loops in the nested loop to be analyzed. Loop unrolling and array partition optimization are carried out according to the following process (a kernel sketch combining these optimizations is given after this list):
(1) The analysis starts from the innermost loop in set A. If the loop under analysis is already the outermost loop, or its order cannot be exchanged with the innermost loop, record the innermost loop and all loops that can exchange order with it as set B, remove the elements of set B from set A, and enter (2); otherwise, analyze the next outer loop.
(2) According to Eq (10), the number of DSPs dsp_need required for the scheme is calculated. Then compare dsp_need with the total number of on-chip DSPs, dsp_total. If dsp_need < dsp_total, and an array partition that matches the computational parallelism of the scheme can be realized, enter (3); otherwise, analyze the next scheme.
(3) In this scheme, if all loops in set B are fully unrolled and set A is not empty, enter (1) and calculate the degree of parallelism; otherwise, enter (4).
(4) Optimization is carried out according to the loop unrolling scheme and the corresponding array partition scheme. Analyze whether an array partition that satisfies the computational parallelism in (2) can be achieved. Next, analyze the data after loop unrolling and group the data, so that arrays stored in the same FPGA on-chip memory are placed in one group. Analyze each group of data sequentially, and select the corresponding analysis method according to its storage form on the FPGA: one-dimensional array or multi-dimensional array. If all arrays can be partitioned to satisfy the computational parallelism, an efficient array partition can be made for the loop unrolling scheme; otherwise, an effective array partition cannot be carried out.
(5) The analysis considers two cases: one-dimensional arrays and multi-dimensional arrays. The steps for one-dimensional array analysis are as follows:
(a) Analyze the address characteristics of the data involved in each calculation after loop unrolling. If the addresses are continuous, apply cyclic partitioning to the array and enter (c). If the addresses are not continuous but the interval is uniform, apply block partitioning to the array and enter (c). If the data addresses meet neither of the above conditions, enter (b). The number of sub-parts of the memory storing the array data is calculated as in Eq (11).
(b) If num_d < num_reg, the array is partitioned completely and enter (c); otherwise it cannot be effectively partitioned. num_reg is the total number of FPGA on-chip registers available.
(c) Verify whether the parallelism of data reading/writing after the array partition satisfies the parallelism of computation in the loop unrolling scheme. If it is satisfied, the array partition is effective and the array partition scheme is recorded; otherwise, the array partition is invalid.
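To show how the techniques selected by the model appear in source code, here is a minimal OpenCL C kernel sketch that combines data vectorization, full unrolling of a 5 × 5 multiply-accumulate, and complete array partitioning. The attribute spellings (xcl_array_partition, opencl_unroll_hint) follow the Xilinx SDx OpenCL documentation, and the buffer layout (25 window values padded to 32 floats per pixel) is an assumption made for illustration; it is not the exact kernel used in the experiments.

/* One output pixel per trip: two 512-bit (float16) reads, then 25 parallel MACs. */
__kernel void conv5x5_pixel(__global const float16 *restrict image_vec,   /* window values, padded to 32 floats  */
                            __global const float16 *restrict weight_vec,  /* kernel weights, padded to 32 floats */
                            __global float *restrict out,
                            int n_pixels)
{
    /* Complete partition turns each element into its own register,
       so all 25 reads of the unrolled loop can happen in one cycle. */
    float win[32] __attribute__((xcl_array_partition(complete, 1)));
    float wgt[32] __attribute__((xcl_array_partition(complete, 1)));

    for (int p = 0; p < n_pixels; ++p) {
        /* Data vectorization: wide bursts instead of 32-bit transfers. */
        vstore16(image_vec[2 * p],      0, win);
        vstore16(image_vec[2 * p + 1],  1, win);
        vstore16(weight_vec[2 * p],     0, wgt);
        vstore16(weight_vec[2 * p + 1], 1, wgt);

        float sum = 0.0f;
        __attribute__((opencl_unroll_hint(25)))      /* fully unroll the 5x5 window */
        for (int k = 0; k < 25; ++k)
            sum += win[k] * wgt[k];

        out[p] = sum;
    }
}

In terms of the model, the float16 accesses correspond to step 1 (vectorization) and step 2 (wide data ports), while the partition attribute supplies the read parallelism that step 4 requires for the fully unrolled loop.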
Experimental section
The compiler tool used in this experiment is the Xilinx SDx tool, and the FPGA development board is the ADM-PCIE-7V3 board produced by Alpha Data. Linux is the execution environment of the Host terminal. The specific environment of this experiment is shown in Table 1, and the specific configuration of the ADM-PCIE-7V3 board is given in Table 2.
Optimization example
In this section, this paper first introduces the OpenCL example of a convolution layer on FPGA. Based on this example, the computational model is applied to the convolution layer, and the application of the computational model is explained in detail. Finally, the results of the optimized program execution are given.
Convolution layer OpenCL example on FPGA. In a convolutional neural network model, the operation of each convolution layer is the same: a convolution operation. The difference lies in the data and its scale, so the optimization methods and ideas are basically the same for different convolution layers. Accordingly, this section focuses on the example of a single convolution layer in a convolutional neural network. The example program given in this section is an ordinary convolution layer program without any optimization, whose parameters are shown in Table 3.
The number of convolution kernel channels is 48, and the number of convolution kernels is 256. The convolution layer is mainly implemented in the OpenCL kernel program, while the Host is mainly responsible for configuring the environment required by the kernel program, calling the kernel program, and transferring data with the kernel. The specific pseudo-code implementation is shown in Algorithm 1.
Algorithm 1: Realization of pseudo code in convolution layer
Input:  image, input feature map data; weights, input weight data
Output: out, output feature map data
1  async_work_group_copy(local_image, image, i_channel*ISize*ISize, 0);
2  async_work_group_copy(local_weight, weights, o_channel*i_channel*WSize*WSize, 0);
3  index ← 0;
4  outputLoop: for o_num ← 0 to o_channel do
5    outYAxis: for o_y ← 0 to OSize do
6      outXAxis: for o_x ← 0 to OSize do
7        sum ← 0;
8        convInchan: for conv_num ← 0 to i_channel do
9          convILoop: for conv_y ← 0 to WSize do
10           convJLoop: for conv_x ← 0 to WSize do
11             x_padding ← o_x*Stride + conv_x - Padding;
12             y_padding ← o_y*Stride + conv_y - Padding;
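The extract of Algorithm 1 above stops at the padding computation; the rest of its body is essentially a bounds-checked multiply-accumulate followed by writing the output pixel. The following OpenCL C kernel is a hedged sketch of the complete loop nest. The row-major buffer layouts and the zero-padding convention are inferred from the loop structure, and the local-memory staging done by async_work_group_copy in Algorithm 1 is omitted for brevity; it is not the authors' exact code.

/* Naive (unoptimized) convolution layer corresponding to Algorithm 1. */
__kernel void conv_layer(__global const float *restrict image,   /* [i_channel][ISize][ISize]            */
                         __global const float *restrict weights, /* [o_channel][i_channel][WSize][WSize] */
                         __global float *restrict out,           /* [o_channel][OSize][OSize]            */
                         int i_channel, int o_channel,
                         int ISize, int OSize, int WSize,
                         int Stride, int Padding)
{
    int index = 0;
    for (int o_num = 0; o_num < o_channel; ++o_num)
        for (int o_y = 0; o_y < OSize; ++o_y)
            for (int o_x = 0; o_x < OSize; ++o_x) {
                float sum = 0.0f;
                for (int conv_num = 0; conv_num < i_channel; ++conv_num)
                    for (int conv_y = 0; conv_y < WSize; ++conv_y)
                        for (int conv_x = 0; conv_x < WSize; ++conv_x) {
                            int x_padding = o_x * Stride + conv_x - Padding;
                            int y_padding = o_y * Stride + conv_y - Padding;
                            /* Zero padding: taps outside the input map contribute nothing. */
                            if (x_padding >= 0 && x_padding < ISize &&
                                y_padding >= 0 && y_padding < ISize)
                                sum += image[(conv_num * ISize + y_padding) * ISize + x_padding]
                                     * weights[((o_num * i_channel + conv_num) * WSize + conv_y) * WSize + conv_x];
                        }
                out[index++] = sum;   /* one pixel of the output feature map */
            }
}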
Table 4. Related information of data transfer in the convolution layer basic program.

Data transfer between kernels and global memory:
  Read:  Number of Transfers 408,969,216; Transfer Rate 63.049 MB/s; Avg Bandwidth Utilization 0.547%; Avg Size 0.004 KB; Avg Time 27.240 ns
  Write: Number of Transfers 186,624; Transfer Rate 0.029 MB/s; Avg Bandwidth Utilization 2.4975E-4%; Avg Size 0.004 KB; Avg Time 15.000 ns

https://doi.org/10.1371/journal.pone.0222984.t004
According to the fourth step of the computational model, dsp_need is calculated to be 251 by the parameter calculation Eq (10). The total number of on-chip DSPs is 3600. The number of DSPs needed is less than the total number, so set B is fully unrolled. Fig 2 shows the use of the optimization instruction and the equivalent code after unrolling: graph (a) shows the equivalent code when the unrolling factor is 2, and graph (b) shows the equivalent code when the unrolling factor is the default.
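As a plain-C illustration of what Fig 2 describes (the loop body below is an assumed stand-in, not the figure's exact code): an unrolling factor of 2 rewrites the loop so that two iterations are issued per trip, whereas the default behaviour fully unrolls the loop into straight-line code.

/* Illustrative only: a 24-element dot product after partial unrolling. */
float dot24_unroll2(const float *a, const float *b)
{
    float sum = 0.0f;
    /* Unrolling factor 2 (cf. Fig 2(a)): two multiply-adds per loop trip. */
    for (int k = 0; k < 24; k += 2) {
        sum += a[k]     * b[k];
        sum += a[k + 1] * b[k + 1];
    }
    /* With the default (full) unroll (cf. Fig 2(b)), the compiler instead emits
       all 24 multiply-adds as straight-line code and removes the loop entirely. */
    return sum;
}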
Since the size of the convolution kernel is 5×5, the theoretical parallelism of the computation is 25. However, the input feature map data and the convolution kernel data involved in the calculation are stored locally in the form of one-dimensional arrays. Without array partition optimization, the OpenCL compiler assigns at most two ports to each array. That is, the degree of read parallelism is 2, which is much less than the degree of parallelism of the calculation, so the arrays need array partitioning.
According to the fifth step of the computational model, the p_num of cyclic partitioning and block partitioning are calculated by Eq (11). Since the total amount of data num_d in the array is less than the number num_reg of available on-chip registers, the array is completely partitioned. In the process of loop unrolling and array partitioning optimization, the last three loop levels of the convolution layer implementation (the calculation of a single pixel of the output feature map) are optimized and the corresponding array partitioning is carried out. Meanwhile, for the convenience of optimization, this section divides the last three loop levels into two three-level loops according to the convolution multiplication and addition. Because the two three-level loops are consistent in architecture, the corresponding optimization strategies are nearly identical.
Table 7. Performance comparison of the different optimization programs in the convolution layer OpenCL.

                      Original Convolution   Optimization Program Provided   Example Optimization   Speedup
                      Layer Program          by Xilinx Company               Program                Ratio
Execution Time (ms)   1142.26                291.977                         9.76                   29x

https://doi.org/10.1371/journal.pone.0222984.t007
When each computing unit transmits data with the off-chip global memory, this division of the kernel program is based on the number of channels of the output feature map.
According to the eighth step of the computational model, loop unrolling and array partition optimization of the last three loop levels of the convolution operation are carried out. The degree of parallelism of the optimized convolution multiplication is 480, which is two fifths of that of the ideally optimized convolution multiplication. Multiple-compute-unit optimization of the outermost loop of the first three levels is carried out, and the parallelism of the output feature map pixel calculation after optimization is 6, which is much less than the ideal parallelism of the output feature map pixel calculation. Loop pipelining is applied to the inner loop. Finally, by comparing the values of the additions, it is found that there is no need to repartition the array.
Optimization performance analysis of the program. According to the proposed computational model, the example code is optimized; the result compared with the latest Xilinx optimization program [17] is shown in Table 6. From the runtime of each cell and the whole kernel, it is found that these four cells basically execute in parallel.
The final optimization result of this example program is shown in Table 7. The final execution time of the example program is 9.76 milliseconds after optimization. Moreover, this paper also tests the performance of the convolution layer optimization program provided by Xilinx, as summarized in Table 5, where it can be seen that the final performance of the program is 29 times higher than that of the optimization program provided by the Xilinx company.
The final optimization results of this experiment are compared with the CPU implementation [20], as indicated in Table 8. The 1-thread entry in the table corresponds to single-thread execution, and the 16-thread entry corresponds to 16-thread execution. -O3 indicates that the compiler optimization level is -O3.
From Table 8, it can be seen that the performance of the optimized convolution on FPGA is 9.76 times higher than that of the single-thread CPU and 2.8 times higher than that of the 16-thread CPU. It is also indicated that the energy consumption of the convolution program optimized by the proposed computational model and implemented on FPGA is significantly lower than that of the CPU.
Table 10. Performance comparison of different convolution scale optimization programs with the Xilinx company.

Convolution scale   Execution time (ms)
                    Optimization program provided by Xilinx   Program optimized by the computational model
Layer1              12.4                                      1.4
Layer2              48.81                                     3.2
Layer3              21.5                                      2.3
Layer4              20.4                                      2.6
Layer5              52                                        4.6
Layer6              54.5                                      5.6
Layer7              216.2                                     8.1
Layer8              508.4                                     12.67

https://doi.org/10.1371/journal.pone.0222984.t010
The code used in this paper is openly accessible for other researchers to study and to explore new acceleration methods for deep neural networks. It can be found at the following link: https://github.com/PoetryAndWine/FPGA_CNN_Acceleration.
Conclusion
This paper proposes a computational model based on OpenCL, which enables the transformation of the OpenCL model on GPU/CPU to FPGA. This computational model is used to help software programmers without fundamental hardware knowledge quickly implement deep learning algorithms with high performance on FPGA. In terms of performance, the computational model not only reduces the cost of data interaction, but also improves the efficiency of data calculation. In terms of adaptability, the computational model is flexible and suitable for convolution layers of different sizes. The results of applying the proposed computational model to convolution layers of different scales show that its performance is 8-40 times higher than that of the corresponding optimization program provided by the Xilinx company.
Supporting information
S1 File. Convolution layer optimization code and the performance data. (Xilinx and this
paper).
(ZIP)
Acknowledgments
Our work is supported by the National Key Research and Development Program (2016YFB1000400), the Harbin Outstanding Young Talents Fund (2017RAYXJ016), and the Natural Science Foundation of Heilongjiang Province (F2018008).
Author Contributions
Formal analysis: Shuo Zhang, Hongtao He.
Funding acquisition: Yanxia Wu.
Methodology: Chaoguang Men.
Project administration: Yanxia Wu.
Resources: Kai Liang.
Supervision: Yanxia Wu, Chaoguang Men.
Writing – original draft: Shuo Zhang, Kai Liang.
Writing – review & editing: Shuo Zhang, Hongtao He.
References
1. Yu Q, Wang C, Ma X, Li X, Zhou X. A Deep Learning Prediction Process Accelerator Based FPGA. IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing. IEEE, 2015:585-594.
2. Lecun Y, Bengio Y, Hinton G. Deep learning. Nature, 2015, 521(7553):436. https://doi.org/10.1038/
nature14539 PMID: 26017442
3. Véstias M, Duarte RP, de Sousa JT, Neto H. Parallel dot-products for deep learning on FPGA. Field
Programmable Logic and Applications (FPL), 2017 27th International Conference on. IEEE, 2017: 1-4.
4. Zhu J, Qian Z, Tsui CY. LRADNN: High-throughput and energy-efficient Deep Neural Network accelera-
tor using Low Rank Approximation. Design Automation Conference. IEEE, 2016:581-586.
5. Lacey G, Taylor GW, Areibi S. Deep Learning on FPGAs: Past, Present, and Future. arXiv: Distributed,
Parallel, and Cluster Computing. 2016
6. Chen DT, Singh DP. Fractal video compression in OpenCL: An evaluation of CPUs, GPUs, and FPGAs
as acceleration platforms. Asia and south pacific design automation conference. 2013:297-304
7. Zhang C, Sun G, Fang Z, Zhou P, Pan P, Cong J. Caffeine: towards uniformed representation and
acceleration for deep convolutional neural networks. International conference on computer aided
design. 2016.
8. Nurvitadhi E, Sim J, Sheffield D, Mishra A, Krishnan S, Marr D. Accelerating recurrent neural networks
in analytics servers: Comparison of FPGA, CPU, GPU, and ASIC. Field programmable logic and appli-
cations. 2016:1-4.
9. Ouyang J, Lin S, Qi W, Wang Y, Yu B, Jiang S. SDA: Software-defined accelerator for large-scale DNN
systems. Hot Chips 26 Symposium. IEEE. 2016:1-23
10. Nurvitadhi E, Sim J, Sheffield D, Mishra A, Krishnan S, Marr D. Accelerating recurrent neural networks in analytics servers: Comparison of FPGA, CPU, GPU, and ASIC. International Conference on Field Programmable Logic and Applications. IEEE, 2016:1-4.
11. Stone JE, Gohara D, Shi G. OpenCL: A Parallel Programming Standard for Heterogeneous Computing
Systems. Computing in Science & Engineering, 2010, 12(3):66–73. https://doi.org/10.1109/MCSE.
2010.69
12. Wei X, Yu C, Zhang P, Chen Y, Wang Y, Hu H, et al. Automated Systolic Array Architecture Synthesis
for High Throughput CNN Inference on FPGAs. The 54th Annual Design Automation Conference 2017.
ACM, 2017.
13. Abdelouahab K, Pelcat M, Serot J, Bourrasset C, Quinton JC, Berry F. Hardware Automated Dataflow Deployment of CNNs. arXiv:1705.04543v3, 2017.
14. Huang Q, Lian R, Canis A, Choi J, Xi R, Brown S, et al. The Effect of Compiler Optimizations on High-
Level Synthesis for FPGAs. IEEE International Symposium on Field-programmable Custom Computing
Machines. IEEE, 2013.
15. Abdelfattah MS, Hagiescu A, Singh D. Gzip on a chip: high performance lossless data compression on
FPGAs using OpenCL. Proceedings of the International Workshop on OpenCL 2013 & 2014.
16. Farabet C, Martini B, Akselrod P, Talay S, LeCun Y, Culurciello E. Hardware accelerated convolutional
neural networks for synthetic vision systems. IEEE International Symposium on Circuits & Systems.
IEEE, 2010.
17. Qiu J, Wang J, Yao S, Guo K, Li B, Zhou E, et al. Going Deeper with Embedded FPGA Platform for Con-
volutional Neural Network. Proceedings of the 2016 ACM/SIGDA International Symposium on Field-
Programmable Gate Arrays. ACM, 2016.
18. Ko BS, Kim HG, Oh KJ, Choi HJ. Controlled dropout: A different approach to using dropout on deep
neural network. IEEE International Conference on Big Data and Smart Computing. IEEE, 2017:358-
362.
19. Suda N, Chandra V, Dasika G, Mohanty A, Ma Y, Vrudhula S, et al. Throughput-Optimized OpenCL-
based FPGA Accelerator for Large-Scale Convolutional Neural Networks. Proceedings of the 2016
ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 2016:16-25.
20. Zhang C, Li P, Sun G, Guan Y, Xiao B, Cong J. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks. ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2015:161-170.
21. Czajkowski TS, Aydonat U, Denisenko D, Freeman J, Kinsner M, Neto D, et al. From OpenCL to high-
performance hardware on FPGAs. Field Programmable Logic and Applications (FPL), 2012 22nd Inter-
national Conference on. IEEE, 2012: 531-534.
22. Luo L, Wu Y, Qiao F, Yang Y, Wei Q, Zhou X, et al. Design of FPGA-Based Accelerator for Convolu-
tional Neural Network under Heterogeneous Computing Framework with OpenCL. International Journal
of Reconfigurable Computing, 2018, 2018:1–10. https://doi.org/10.1155/2018/1785892
23. Tapiador R, Riosnavarro A, Linaresbarranco A, Kim M, Kadetotad D, Seo J. Comprehensive Evaluation
of OpenCL-based Convolutional Neural Network Accelerators in Xilinx and Altera FPGAs. Robotic and
Technology of Computers Lab report. 2016.