An Efficient Reconfigurable Hardware Accelerator For CNN
Figure: 2 × 2 crossbar
Figure 7: Configuration Setup for Layer 1
Figure 9: Configuration Setup for Layer 3 and Layer 4
E. Layer 5
Layer 5 has unroll factors <32, 15>, so we need to send 15 pixels' worth of information at a time. This arrangement uses two PE8 units in stage 1 and a PE2 unit in stage 2, as seen in Figure 10. In total, we therefore require 64 PE8 units in stage 1 and 16 PE2 units in stage 2.
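The stage sizing above follows from a simple ceiling division over the reduction tree. A minimal sketch under that assumption (the helper `reduction_units` is our own illustration, not part of the paper's design):

```python
import math

def reduction_units(n_inputs, pe_width):
    """Number of pe_width-input PE units needed to reduce n_inputs partial sums."""
    return math.ceil(n_inputs / pe_width)

# Layer 5 unroll factors are <32, 15>: 15 partial sums per output pixel.
stage1_per_pixel = reduction_units(15, 8)                # two PE8 units per pixel
stage2_per_pixel = reduction_units(stage1_per_pixel, 2)  # one PE2 combines them

# With 32-way unrolling, stage 1 needs 32 * 2 = 64 PE8 units in total.
total_stage1 = 32 * stage1_per_pixel
```

The same ceiling-division rule reproduces the stage-1 count of 64 PE8 units quoted in the text.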
V. SIMULATION RESULTS
According to our configuration setup, we determined the number of PE8 units required for each stage, as given in Table II.
Figure 10: Configuration Setup for Layer 5
Table III: PE simulation clock cycles
Layer   Cycles for PE engine
1       21682
2       2724
3       13008
4       13008
5       4068
Total   54490
94% improvement. The Fmax of our design is 100 MHz, where Fmax is the maximum frequency at which the design can operate.
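The total in Table III and the corresponding wall-clock time at Fmax can be checked with a few lines (variable names are illustrative):

```python
# Per-layer PE-engine cycle counts from Table III
cycles = {1: 21682, 2: 2724, 3: 13008, 4: 13008, 5: 4068}

total_cycles = sum(cycles.values())        # 54490, matching the table's total row
fmax_hz = 100e6                            # Fmax = 100 MHz
runtime_us = total_cycles / fmax_hz * 1e6  # execution time in microseconds
```

At 100 MHz, the 54490-cycle total corresponds to roughly 545 microseconds per inference.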
VI. CONCLUSION

Our work proposes a hardware accelerator architecture built around a reconfigurable computational engine. The reconfigurable engine aims to achieve fast performance in terms of the minimum number of clock cycles needed to execute the CNN. Our architecture is truly run-time reconfigurable and can potentially support multiple advanced CNNs such as AlexNet, GoogLeNet, and Microsoft's ResNet.
VII. FUTURE WORK

We need to explore additional techniques, such as pipelining, that may be compatible with our design. In the future, we plan to implement this design on an FPGA platform and benchmark it against a GPU implementation. We also aim to extend the design toward a potentially network-agnostic architecture, which would require a PE engine that accommodates the layer dimensions of all existing networks.