
An Efficient Reconfigurable Hardware Accelerator

for Convolutional Neural Networks


Anaam Ansari†∗, Kiran Gunnam‡∗, Tokunbo Ogunfunmi§∗
†[email protected], ‡[email protected], §[email protected]
∗Department of Electrical Engineering
Santa Clara University, Santa Clara, California 95053

Abstract—Convolutional Neural Networks (CNN) have proven to be very effective in image and speech recognition. The increasing usage of such applications in mobile devices and data centers has led researchers to explore application-specific hardware accelerators for CNN. However, most of these approaches are limited to a specific network such as AlexNet. We propose a reconfigurable technique that can be extended to a network-agnostic architecture that supports various networks. The technique described in this paper aims at developing a reconfigurable accelerator that uses basic processing elements (PE) as building blocks of its computational engine. In our design, we control the configuration of each layer using a switching control logic and a Benes network. In addition to potentially supporting all the various CNN architectures, our computation engine design achieves a 94% improvement in convolutional layer execution time for AlexNet compared to the state-of-the-art architecture that only supports AlexNet.

Index Terms—Convolutional Neural Network, reconfigurable accelerator, computational engine, efficient accelerator, Benes network, AlexNet, processing element, switching control logic, network-agnostic

I. INTRODUCTION

Convolutional Neural Networks, modeled after the optic nerve, emulate the behavior of an animal's visual cortex. They are a type of multilayer neural network trained with a back-propagation algorithm. Images have the property of spatial correlation, which convolutional neural networks exploit for image recognition and classification. Owing to this correlation, CNNs are able to shed the full connectivity of a regular neural network and be locally connected instead [1]. Thus, convolutional neural networks are best suited for applications involving image and video processing.

A. CNN complexity

A convolutional layer is the core building block of a CNN. It performs dot products using its processing elements (commonly termed neurons), which have adaptable weights and biases [2]. The convolution operation essentially performs dot products between the filters and local regions of the input. The forward pass of a CNN involves filtering the input, which produces a 2-dimensional activation map. This activation map is the response of the filters as they are locally focused and moved across the input payload. As the forward pass through the layers executes, the filters adaptively update their weights in order to better detect geometric patterns and visual features.

B. CNN Accelerator

Every convolutional layer within the network requires a large amount of computation to be carried out. There are several challenges in budgeting the resources available on an FPGA and in working within the power limitations that come with it. Thus, CNN accelerators are designed to iteratively perform the function of each layer.

II. RELATED WORK

There are many implementations that adopt some degree of reconfigurability in their design. The architecture proposed in [3] features a reconfigurable 2-dimensional grid of processing elements using on-chip memory to perform secondary operations; each processing element in the computation engine has independent off-chip memory. We see a different approach in [4], where the authors limit external memory accesses by introducing a memory hierarchy. They use local scratchpads and global buffers to do so in an energy-efficient manner, achieving energy efficiency as a result of reducing external data accesses. The design in [5] uses a variable-sized convolutional layer processor to distribute computational resources to compute each layer. In [6], the authors determine the optimal implementation parameters for each convolutional layer of AlexNet [7]. FPGA optimization techniques such as loop unrolling, loop tiling, and transformation have been employed to push the accelerator to achieve efficiency. The analysis done by the authors in [6] delineates that variable loop dimensions warrant different implementation variants. The work in [6] explores the communication-to-computation ratio and the computation roofline space for each layer to determine the computational performance with optimal unroll factors < Tm, Tn >, where Tm and Tn are the tile sizes of the outputs and inputs of the computation engine, respectively, as seen in Table I. These unroll factors are variable in nature, and variable implementation parameters are very difficult to implement in hardware. Their solution circumvents this challenge by choosing uniform unroll factors of < 64, 7 >. Using a uniform implementation parameter to design the computation engine, however, causes a degradation in performance. In our design, we address this degradation by choosing variable implementation parameters that can be customized for each layer of the CNN, and we simulate the HDL design. Since our design is reconfigurable, it can be extended to any CNN such as GoogleNet or Microsoft's ResidualNet. However, in this paper we focus on AlexNet.

978-1-5386-1823-3/17/$31.00 ©2017 IEEE 1337 Asilomar 2017

Table I: Optimal unroll factors for AlexNet [6]

Layer                   Tm   Tn
1                       48   3
2                       20   24
3                       96   5
4                       95   5
5                       32   15
uniform unroll factor   64   7

III. EXTENDED SUMMARY

Our design makes use of the variable implementation parameters needed to implement each layer of the CNN. We intend to transform the fixed computational engine into an array of reconfigurable processing elements (P E). The processing elements in the array that are required to execute a layer will be activated with the help of a Benes network switch and some switching control logic. This reduces the cost of operation in terms of execution clock cycles and makes for a faster computational engine than one implemented with uniform parameters. Thus, convolutional layers will execute in a lower number of clock cycles than they would with a fixed unroll factor for all layers.

A. Architecture Overview

Figure 1 describes our proposed architecture. The novel features of our architecture are as follows:
• The computational engine will be an array of P E units. This will act as a bank of processing elements, ready to be used for the execution of a convolutional layer. These processing elements are held together by a switching layer, which facilitates the interconnections in the P E engine.
• The input to the computational engine will be managed by a Benes network.
• The requisite network of processing elements will be managed by a switching control logic.

Figure 1: Architecture Overview

1) Processing Element: A processing element is the denominational unit of our computational engine. The design of our processing element is hierarchical in nature.

The atomic unit of our design is P E 2, described in Figure 2. It has 4 inputs, one output, and an enable input. The function of the P E 2 block is to perform dot products of the inputs in[1] and in[2] with the adaptable filter weights w[1] and w[2]. The unit is activated only if the enable pin is set high. We use P E 2 blocks to construct the P E 4 and P E 8 blocks.

Figure 2: Building block processing element P E 2

The P E 4 block is made up of three P E 2 blocks, as described in Figure 3. It has 6 inputs and 3 outputs, which can also be used as the individual outputs of the constituent P E 2 blocks. The P E 4 block has 4 enable inputs that are distributed among the composite P E 2 blocks and the complete unit, so we have the choice of using the block as a whole or as individual P E 2 blocks.

We use 2 P E 4 blocks and one P E 2 block to construct the P E 8 block; in other words, 7 P E 2 blocks make up the P E 8 unit. We treat P E 8 as the universal processing element module of our design. It has 14 inputs and 7 outputs, as described in Figure 4. The seven outputs are the outputs of the 7 composite P E 2 blocks that make up the P E 8 block. There are 10 enable inputs to this block, distributed among its 7 P E 2 blocks. These enable lines let us invoke the individual sub-blocks of P E 8 in their individual capacity as P E 4, P E 2, or as the complete unit, P E 8. The design of the new processing element facilitates the selective use of its sub-blocks.
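The hierarchical composition described above can be sketched in software. This is an illustrative model of ours, not the authors' HDL: the function names are hypothetical, weights are passed separately from data, we adopt the "independent P E 2 lanes" reading (which matches the stated data-input/output counts of 6/3 for P E 4 and 14/7 for P E 8), and we model only the per-P E 2 enables, omitting the group-level enable lines.

```python
def pe2(data, weights, enable=True):
    # P E 2 atomic unit: dot product of two data inputs with two
    # adaptable filter weights; contributes nothing unless enabled.
    if not enable:
        return 0
    return data[0] * weights[0] + data[1] * weights[1]

def pe4(data, weights, enables=(True, True, True)):
    # P E 4: three P E 2 blocks; 6 data inputs, 3 individually usable outputs.
    return [pe2(data[2*k:2*k+2], weights[2*k:2*k+2], enables[k])
            for k in range(3)]

def pe8(data, weights, enables=(True,) * 7):
    # P E 8: seven P E 2 blocks (two P E 4 plus one P E 2);
    # 14 data inputs, 7 outputs, one per constituent P E 2.
    return [pe2(data[2*k:2*k+2], weights[2*k:2*k+2], enables[k])
            for k in range(7)]

# A P E 8 with every lane enabled produces 7 dot-product outputs:
print(pe8(list(range(14)), [1] * 14))  # [1, 5, 9, 13, 17, 21, 25]
```

Disabling a subset of the enable lines leaves the corresponding lanes idle, which is how the switching control logic carves a P E 8 into P E 4 or P E 2 capacity.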
2) Benes Switch: We use a Benes network [8, 9] to manage the input to the reconfigurable computational engine, as shown in Figure 5. A Benes network that receives N inputs has 2 log2(N) − 1 stages of interconnected switches, and there are exactly N/2 2 × 2 crossbar switches per stage, as shown in Figure 5b. Every switch in the Benes network performs one of two functions based on the control bit b: (i) the input is translated to the output as it was received if the bit is b, and (ii) the two inputs are swapped and translated to the output if the bit received is b′, as shown in Figure 5a [8]. In our design we use a 32 × 32 Benes network to parse a certain number of pixels of data from the 256 × 256 sized image. It provides a fast solution with a minimum critical path delay.
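The stage and switch counts above are easy to check numerically. The following is a minimal sketch of ours (not the paper's implementation) of the 2 × 2 crossbar behavior and the Benes network dimensions; here the control bit is taken as 0 for pass-through and 1 for swap.

```python
import math

def crossbar_2x2(a, b, control_bit):
    # Pass-through for one value of the control bit, swap for the other.
    return (a, b) if control_bit == 0 else (b, a)

def benes_stages(n):
    # An N-input Benes network has 2*log2(N) - 1 stages of switches,
    # each stage holding exactly N/2 crossbar switches.
    assert n >= 2 and n & (n - 1) == 0, "N must be a power of two"
    return 2 * int(math.log2(n)) - 1

# The 32x32 network used in this design:
print(benes_stages(32))  # 9 stages
print(32 // 2)           # 16 switches per stage
```

For the 32 × 32 network this gives 9 stages of 16 switches, i.e., 144 crossbars in total.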
(a) 2 × 2 crossbar

(b) N × N Benes network

Figure 5: Benes Network

Figure 3: Building block processing element P E 4

Figure 4: Building block processing element P E 8

3) Switching Control Logic: The primary role of the switching control logic is to select the processing elements that may be needed to carry out the computation of a convolutional layer. It can invoke the P E 8, P E 4, or P E 2 processing elements in the best combination required to execute a convolutional layer on our computation engine. It accepts the inputs Tm and Tn and outputs control bits for the switching layer that controls the P E engine and for the input to the Benes network.

Figure 6: Control Signals for Switching Layer and Processing Engine

IV. FPGA AND HARDWARE IMPLEMENTATION

A. Layer 1

Layer 1 operates with unroll parameters < 48, 3 >, where Tn = 3 is the input unroll factor and Tm = 48 is the output unroll factor. For this operation, we let 3 inputs and their weights at a time into the Benes network. For this layer, consider one P E 8, from which we use the two composite P E 4 units to process the inputs and weights. Thus, each P E 8 produces 2 outputs, as shown in Figure 7. As a result, we require 24 P E 8 units in stage 1 and none in stage 2. The PE engine simulation with this configuration completes in 21682 clock cycles.

B. Layer 2

The unroll factors for this layer are < 20, 24 >. For this layer, we need the Benes network to send 24 pixels' worth of information. We need 3 complete P E 8 blocks along with a P E 4 to compute a result, as seen in Figure 8. The PE simulation completes in 2724 clock cycles. We require 60 P E 8 units in stage 1 and 10 P E 8 units in stage 2.

C. Layer 3

Layer 3 has unroll factors < 96, 5 >. Since Tn is 5, we need to send 5 pixels' worth of information at a time. The arrangement requires a P E 4 from the P E 8 in stage 1 and then a P E 2 from the next-stage P E 8 unit to complete the operation. Thus, we require 48 P E 8 units in stage 1 and 24 P E 8 units in stage 2, as seen in Figure 9. The PE simulation completes in 13008 clock cycles.
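The stage-1 counts in the layer walkthrough follow from the unroll factors. As a back-of-envelope check (our arithmetic; the outputs-per-P E 8 figures are taken from the text, where Layers 1 and 3 each draw a P E 4 per output, i.e., 2 outputs per P E 8):

```python
def stage1_pe8(tm, outputs_per_pe8):
    # Number of stage-1 P E 8 units needed to cover Tm unrolled outputs,
    # given how many outputs each P E 8 contributes in that configuration.
    return tm // outputs_per_pe8

print(stage1_pe8(48, 2))  # Layer 1: 24 units, as reported
print(stage1_pe8(96, 2))  # Layer 3: 48 units, as reported
```

The stage-2 counts depend on how partial sums are combined and do not follow from this simple ratio alone.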
Figure 7: Configuration Setup for Layer 1

Figure 8: Configuration Setup for Layer 2

Figure 9: Configuration Setup for Layer 3 and Layer 4

D. Layer 4

Layer 4 is similar to Layer 3, with unroll factors < 95, 5 >. The configuration used for Layer 3 in Figure 9 can be reused for Layer 4. Thus, the number of clock cycles to execute this layer is the same as for Layer 3, which is 13008 clock cycles.

E. Layer 5

Layer 5 has unroll factors < 32, 15 >. We need to send 15 pixels' worth of information at a time. The arrangement requires that we use two P E 8 units in stage 1 and a P E 2 in stage 2, as seen in Figure 10. Thus, we require 64 P E 8 units in stage 1 and 16 P E 8 units in stage 2.

V. SIMULATION RESULTS

According to our configuration setup, we determined the number of P E 8 units required for each stage, as given in Table II.

Table II: Processing elements (P E 8) required for each stage

Layer          Stage 1   Stage 2
1              24        0
2              60        10
3              48        24
4              48        24
5              64        16
design choice  64        24

The total number of clock cycles our design takes to parse and compute the entire input payload of a 256 × 256 image is given in Table III. The total time taken for this design is 54490 clock cycles. As we can see, our design does better than the design with the uniform unroll factor < 64, 7 >, which takes 1008246 clock cycles [6]. Thus the P E engine provides us with a 94% improvement.
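The totals quoted above can be cross-checked against the per-layer cycle counts reported in Table III:

```python
# Per-layer PE-engine cycle counts from Table III.
cycles = {1: 21682, 2: 2724, 3: 13008, 4: 13008, 5: 4068}
total = sum(cycles.values())
print(total)  # 54490 -- matches the reported total

# Comparison against the uniform <64, 7> design of [6].
uniform = 1008246
reduction = 1 - total / uniform
print(f"{reduction:.1%}")  # ~94.6%, consistent with the claimed 94% improvement
```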
Figure 10: Configuration Setup for Layer 5

Table III: PE simulation clock cycles

Layer   Cycles for PE engine
1       21682
2       2724
3       13008
4       13008
5       4068
total   54490

The Fmax of our design, i.e., the maximum frequency at which it can operate, is 100 MHz.

VI. CONCLUSION

Our work proposes a hardware accelerator architecture with a reconfigurable computational engine. The reconfigurable computational engine aims at achieving fast performance in terms of the minimum number of execution clock cycles needed to execute the CNN. The design of our architecture is truly run-time reconfigurable and can potentially support multiple advanced CNNs such as AlexNet, GoogleNet, and Microsoft's ResidualNet.

VII. FUTURE WORK

We need to look at additional techniques, such as pipelining, that may be compatible with our design. In the future, we need to implement this design on an FPGA platform and benchmark it against a GPU implementation. We also need to extend this design into a fully network-agnostic architecture, which would require designing a P E engine that accommodates layers of all known dimensions of all existing networks.

REFERENCES

[1] Y. LeCun et al., "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998. DOI: 10.1109/5.726791.
[2] A. Karpathy, "Convolutional Neural Networks (CNNs/ConvNets)," 2016. URL: http://cs231n.github.io/.
[3] S. Cadambi et al., "A Programmable Parallel Accelerator for Learning and Classification," in Proc. 19th International Conference on Parallel Architectures and Compilation Techniques (PACT '10), Vienna, Austria: ACM, 2010, pp. 273–284. DOI: 10.1145/1854273.1854309.
[4] Y. H. Chen, J. Emer, and V. Sze, "Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks," in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), June 2016, pp. 367–379. DOI: 10.1109/ISCA.2016.40.
[5] Y. Shen, M. Ferdman, and P. Milder, "Maximizing CNN Accelerator Efficiency Through Resource Partitioning," CoRR abs/1607.00064, 2016. URL: http://arxiv.org/abs/1607.00064.
[6] C. Zhang et al., "Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks," in Proc. 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '15), Monterey, California, USA: ACM, 2015, pp. 161–170. DOI: 10.1145/2684746.2689060.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in Advances in Neural Information Processing Systems 25, Curran Associates, Inc., 2012, pp. 1097–1105.
[8] D. Nassimi and S. Sahni, "A Self-Routing Benes Network and Parallel Permutation Algorithms," IEEE Transactions on Computers, vol. C-30, no. 5, pp. 332–340, May 1981. DOI: 10.1109/TC.1981.1675791.
[9] K. K. Gunnam et al., "VLSI Architectures for Layered Decoding for Irregular LDPC Codes of WiMax," in 2007 IEEE International Conference on Communications, June 2007, pp. 4542–4547. DOI: 10.1109/ICC.2007.750.
