Accelerating Deep Convolutional Neural Networks Using Number Theoretic Transform
Abstract— Modern deep convolutional neural networks (CNNs) suffer from high computational complexity due to excessive convolution operations. Recently, fast convolution algorithms such as the fast Fourier transform (FFT) and the Winograd transform have gained attention to address this problem. They reduce the number of multiplications required in the convolution operation by replacing it with element-wise multiplication in the transform domain. However, fast convolution-based CNN accelerators have three major concerns: expensive domain transform, large memory overhead, and limited flexibility in kernel size. In this paper, we present a novel CNN accelerator based on the number theoretic transform (NTT), which overcomes these limitations. We propose low-cost NTT and inverse-NTT converters that use only adders and shifters for the on-chip domain transform, which solves the inflated bandwidth problem and enables more parallel computations in the accelerator. We also propose an accelerator architecture that includes multiple tile engines with optimized data flow and mapping. Finally, we implement the proposed NTT-based CNN accelerator on the Xilinx Alveo U50 FPGA and evaluate it for popular deep CNN models. As a result, the proposed accelerator achieves 2859.5, 990.3, and 805.6 GOPS throughput for VGG-16, GoogLeNet, and Darknet-19, respectively. It outperforms the existing fast convolution-based CNN accelerators by up to 9.6×.

Index Terms— Convolutional neural networks (CNNs), fast convolution, field programmable gate array (FPGA), hardware accelerator, number theoretic transform (NTT).

I. INTRODUCTION

CONVOLUTIONAL neural networks (CNNs) are widely used in many vision-based applications such as image classification, object detection and segmentation in self-driving cars, and image analysis in healthcare [1], [2], [3]. As these applications require high accuracy, CNN models have become large and diversified, with different layer types [4], [5], [6]. This requirement poses computational challenges in running them, mainly caused by excessive three-dimensional (3D) convolution operations with various kernel sizes.

Many CNN accelerators have been proposed to address this computational complexity using both software and hardware optimization. Several works [7], [8], [9] focus on CNN model quantization to make the computation lighter. Other works [10], [11], [12], [13], [14], [15], [16] focus on specialized hardware architectures and logic design to accelerate the computation itself. Another noticeable approach in CNN accelerators is the use of fast convolution algorithms, such as the fast Fourier transform (FFT) [17], [18], [19], [20], [21], [22], [23], [24], [25], the Winograd transform [26], [27], [28], [29], [30], [31], [32], [33], [34], [35], [36], and the Fermat number transform [37], [38]. These works have shown that fast convolution algorithms can accelerate CNN computation by replacing the convolution operation with an element-wise multiplication through a domain transform. However, there are mainly three issues in this approach that hinder the potential performance gain. First, the domain transform overhead is high, so it often becomes the bottleneck of the hardware accelerator. Second, the domain transform commonly uses higher-precision or complex numbers, requiring a higher memory footprint and bandwidth. The pre-computation approach that performs the domain transform offline has been proposed to address the transform overhead, but the inflated data size causes an off-chip bandwidth problem. Third, some domain transforms are less flexible in changing the kernel size.

In this paper, we propose a novel and high-throughput CNN accelerator based on the number theoretic transform (NTT) to address the above challenges. NTT is another type of domain transform based solely on integer numbers, which inherits the fast convolution algorithm's property. NTT computation generally requires expensive modulo operations that hinder accelerating CNN computation. However, we find that specific parameter selection under certain conditions can turn the expensive modulo operation into simpler operations such as shifts and additions. By leveraging NTT's computational benefit with refined parameter selection, we design a low-cost domain converter that enables more on-chip domain transform. Combined with processing element (PE) arrays and efficient memory access based on data tiling, we demonstrate that NTT-based convolution is a promising alternative for accelerating modern deep CNN models. The main contributions of this paper are summarized as follows.

Manuscript received 13 July 2022; revised 12 September 2022 and 5 October 2022; accepted 6 October 2022. Date of publication 25 October 2022; date of current version 25 January 2023. This work was supported in part by the Information Technology Research Center (ITRC) support program, supervised by the Institute of Information and Communications Technology Planning and Evaluation (IITP), under Grant IITP-2020-0-01847, and in part by the Super Computer Development Leading Program of the National Research Foundation (NRF), both under the Ministry of Science and ICT (MSIT), South Korea, under Grant 2021M3H6A1017683. This article was recommended by Associate Editor J. Di. (Corresponding author: Joo-Young Kim.)

The authors are with the School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, South Korea (e-mail: [email protected]; [email protected]).

Color versions of one or more figures in this article are available at https://doi.org/10.1109/TCSI.2022.3214528.

Digital Object Identifier 10.1109/TCSI.2022.3214528
\[
H(k_1, k_2) \equiv \sum_{n_1=0}^{N-1} \sum_{n_2=0}^{N-1} h(n_1, n_2)\, g_N^{\,n_1 k_1 + n_2 k_2} \pmod{q}, \quad 0 \le k_1, k_2 \le N-1 \tag{2}
\]

\[
h(n_1, n_2) \equiv N^{-2} \sum_{k_1=0}^{N-1} \sum_{k_2=0}^{N-1} H(k_1, k_2)\, g_N^{-(n_1 k_1 + n_2 k_2)} \pmod{q}, \quad 0 \le n_1, n_2 \le N-1 \tag{3}
\]

B. Optimization for Low-Cost NTT

In general, the NTT and INTT are computationally expensive because they involve modular multiplications. However, it is possible to obtain a low-cost transform by identifying the right NTT parameters. We use the following insights to choose NTT parameters that enable a low-cost NTT and INTT.

• We select the N-th root of unity g_N as a power of two so that the multiplications can be converted into shift operations. This setup is possible as long as the condition g_N^N ≡ 1 (mod q) holds under the Galois field GF(q) and q − 1 is divisible by N. Table II shows several N, q, and g_N parameters that meet these conditions.

• We replace the modulo operation with a subtraction by setting the q value to a power of two plus one, i.e., q = 2^r + 1. Since an arbitrary number A can be decomposed as A_H × 2^r + A_L, the modulo q of A becomes A_L − A_H, as the modulo q of 2^r is −1.

C. NTT-Based Convolution With Overlap-and-Add (OaA)

The high-level description of NTT-based convolution is the same as that of fast convolution. It transforms the input and weight data using the NTT and performs the element-wise MAC in the input channel direction. It then transforms the results back to the spatial domain. However, one critical problem is that the input size is usually much larger than the kernel size in the convolutional layers of CNN models (e.g., 224 × 224 vs. 3 × 3). We would need to use a large-point NTT on both data to handle this, but doing so causes an exponential data explosion when the small-sized weight is converted into a large-point NTT. To address this problem, we divide the large-sized input into smaller tiles and perform the convolution iteratively using the overlap-and-add (OaA) method [40]. It has been widely used for fast convolution-based CNNs, such as FFT-based and Winograd-based CNNs. The OaA method divides a large convolution into multiple convolutions on short segments and overlaps the computation results from each part to make the final results the same as in the large convolution case.

Algorithm 1 describes the detailed computational steps of the NTT-based convolution with overlap-and-add (OaA) for a convolutional layer. First, it divides the I × I input feature map into tiles with a fixed size of L × L. It repeats this tiling for all input channels. T(i, j, k) refers to the i-, j-, and k-th tile in the width, height, and channel directions, respectively. On the weight side, it converts all the K × K sized weights in the same output channel to the NTT domain using an N × N-point NTT (line 4). k and z are the indices for the input channel and output channel, respectively. Then, it performs an element-wise multiplication between the transformed input tile T̂(i, j, k) and the transformed weight tile Ŵ(k, z), and accumulates the partial products to P̂(i, j) for all the input channels (line 9). The accumulated result is converted back to the spatial domain P(i, j) using the INTT; it has a size of N × N output pixels. Then, the OaA operation is applied to this result with the outputs of the previous adjacent tiles to get the final convolution results (line 11). Figure 2 illustrates the OaA operation in neighboring tiles. The K − 1 pixels from the previous output tiles (i.e., P(i − 1, j − 1), P(i − 1, j), and P(i, j − 1)) are overlapped and added to the current output tile, P(i, j). Then, the final convolution output is extracted from the L × L pixels of the overlapped result, as shown in the figure. The algorithm iterates the above until it covers all the tiles in the width and height directions.
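To make Sections III-B and III-C concrete, the following minimal Python sketch implements (2) and (3) directly and verifies that element-wise multiplication in the NTT domain reproduces a tile's linear convolution when K + L − 1 ≤ N. The parameters q = 2^16 + 1, g_N = 16, and the small value ranges are illustrative choices that satisfy the conditions above, not necessarily the paper's Table II entries; value ranges must keep every result below q to avoid modular wraparound.

```python
import random

# Assumed illustrative parameters: q = 2^16 + 1, N = 8, g = 2^4 = 16,
# so that g^N = 2^32 ≡ 1 (mod q) and N divides q - 1.
q, r, N, g = 2**16 + 1, 16, 8, 16
L, K = 6, 3                           # tile and kernel sizes, K + L - 1 <= N

def mod_q(a):
    # Low-cost reduction for q = 2^r + 1: with A = A_H * 2^r + A_L,
    # A mod q = (A_L - A_H) mod q, since 2^r ≡ -1 (mod q).
    return ((a & (2**r - 1)) - (a >> r)) % q

def ntt2d(x, w=g):
    # Direct form of Eq. (2); powers of w are powers of two, i.e., shifts.
    return [[mod_q(sum(x[n1][n2] * pow(w, n1 * k1 + n2 * k2, q)
                       for n1 in range(N) for n2 in range(N)))
             for k2 in range(N)] for k1 in range(N)]

def intt2d(X):
    # Eq. (3): forward transform with g^-1, then scale by N^-2 mod q.
    s = pow(N * N, -1, q)
    return [[s * v % q for v in row] for row in ntt2d(X, pow(g, -1, q))]

def pad(a):
    # Zero-pad a tile or kernel to the N x N transform size.
    return [[a[i][j] if i < len(a) and j < len(a[0]) else 0
             for j in range(N)] for i in range(N)]

random.seed(0)
tile = [[random.randrange(16) for _ in range(L)] for _ in range(L)]
kern = [[random.randrange(4) for _ in range(K)] for _ in range(K)]

# Element-wise multiplication in the NTT domain, then a single INTT.
T, W = ntt2d(pad(tile)), ntt2d(pad(kern))
P = intt2d([[mod_q(T[i][j] * W[i][j]) for j in range(N)] for i in range(N)])

# Reference: direct linear convolution; exact because K + L - 1 <= N and
# the chosen value ranges keep every result below q.
ref = [[sum(tile[a][b] * kern[m - a][n - b]
            for a in range(L) for b in range(L)
            if 0 <= m - a < K and 0 <= n - b < K)
        for n in range(N)] for m in range(N)]
assert P == ref
```

Tiling a full feature map then reduces to applying this per-tile kernel repeatedly and adding the K − 1 overlapping border pixels of adjacent output tiles, as Algorithm 1 describes.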
the overlap-and-add operation to produce the final convolution results. For efficiency, the input transformation unit is shared by the PE arrays within the same tile engine. Furthermore, the output transformation unit can also be shared among all the PE arrays because of the linearity of the INTT. The summation in the spatial domain can be moved into the transform domain, as shown in (4), so that the inverse domain transform is performed only once.

\[
\sum_{C_{in}} \mathcal{F}^{-1}\big(\mathcal{F}(I) \odot \mathcal{F}(W)\big) = \mathcal{F}^{-1}\Big(\sum_{C_{in}} \mathcal{F}(I) \odot \mathcal{F}(W)\Big) \tag{4}
\]
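The identity in (4) is easy to check numerically. The sketch below uses a one-dimensional transform for brevity (the linearity argument is identical in 2D) with toy parameters q = 257, N = 8, and g = 4, assumed here only because g^N ≡ 1 (mod q); it confirms that accumulating per-channel products in the transform domain and applying a single inverse transform matches summing the per-channel inverse transforms.

```python
import random

# Toy parameters assumed for illustration: q = 2^8 + 1 = 257, N = 8, g = 4,
# so that g^N = 4^8 = 2^16 ≡ 1 (mod q).
q, N, g = 257, 8, 4

def ntt(x, w=g):
    # 1D N-point NTT (1D analog of Eq. (2)).
    return [sum(x[n] * pow(w, n * k, q) for n in range(N)) % q
            for k in range(N)]

def intt(X):
    # 1D inverse transform: forward pass with g^-1, scaled by N^-1 mod q.
    s = pow(N, -1, q)
    return [s * v % q for v in ntt(X, pow(g, -1, q))]

random.seed(1)
C_in = 4
ins = [[random.randrange(q) for _ in range(N)] for _ in range(C_in)]
wts = [[random.randrange(q) for _ in range(N)] for _ in range(C_in)]

# Left-hand side of (4): one inverse transform per input channel, then sum.
per_channel = [intt([a * b % q for a, b in zip(ntt(i), ntt(w))])
               for i, w in zip(ins, wts)]
lhs = [sum(col) % q for col in zip(*per_channel)]

# Right-hand side of (4): accumulate in the transform domain, single INTT.
acc = [0] * N
for i, w in zip(ins, wts):
    acc = [(a + x * y) % q for a, x, y in zip(acc, ntt(i), ntt(w))]
rhs = intt(acc)

assert lhs == rhs
```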
C. NTT/INTT Using Vector-Radix (VR)

To efficiently implement the transform unit (NTT/INTT), we leverage the vector-radix (VR) method widely used in the FFT algorithm [41]. Unlike the row-column algorithm, which first transforms the rows using a one-dimensional transform and then transforms the columns after a matrix transposition, the vector-radix algorithm performs the transformation directly on the 2D input without requiring a matrix transposition. The algorithm recursively divides a large-point N × N transform into successive half-point transforms until it reaches the basic module called the VR unit. The VR unit is a basic 2 × 2-point computation unit that intertwines the results from the lower-level NTT/INTT nodes into the higher-level NTT/INTT nodes. Thanks to the NTT parameters chosen in Section III-B, the calculation of the VR unit becomes much simpler than that of the FFT, using only addition and shift operations.

Fig. 4. 8 × 8-point NTT and INTT using VR units.

Figure 4 illustrates the computation process of the 8 × 8-point NTT and INTT using the VR algorithm. It shows the whole network of connections between the input points (x/Y for NTT/INTT) and the output points (X/y for NTT/INTT) on each VR unit. The NTT computation proceeds in the forward direction, while the INTT computation happens in the opposite direction. The domain transform computation is composed of three stages for the 8 × 8-point NTT/INTT. The calculation requires 16 VR units in each stage, so a total of 48 VR units are needed to complete the NTT/INTT. As a result, the 8 × 8 input pixels are converted into 8 × 8 output pixels through this multi-stage module.
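A behavioral Python model of this recursion is sketched below. It reuses the same assumed parameters as before (q = 2^16 + 1, N = 8, g = 16), and `ntt2d_vr` is a hypothetical reference model rather than the hardware network: each recursion level combines four half-point sub-transforms with twiddle factors g^(a·k1 + b·k2), which are powers of two and therefore shifts in hardware.

```python
import random

q, N, g = 2**16 + 1, 8, 16        # assumed parameters with g^N ≡ 1 (mod q)

def ntt2d_direct(x, w):
    # Reference: direct evaluation of the 2D transform.
    n = len(x)
    return [[sum(x[n1][n2] * pow(w, n1 * k1 + n2 * k2, q)
                 for n1 in range(n) for n2 in range(n)) % q
             for k2 in range(n)] for k1 in range(n)]

def ntt2d_vr(x, w):
    # Radix-2x2 decimation in time: split the input into four decimated
    # (n/2 x n/2) blocks, transform each with the squared root w^2, then
    # combine with twiddles w^(a*k1 + b*k2) -- powers of two, i.e., shifts.
    n = len(x)
    if n == 1:
        return [[x[0][0] % q]]
    h = n // 2
    sub = {(a, b): ntt2d_vr([[x[2*m1 + a][2*m2 + b] for m2 in range(h)]
                             for m1 in range(h)], w * w % q)
           for a in (0, 1) for b in (0, 1)}
    return [[sum(pow(w, a*k1 + b*k2, q) * sub[a, b][k1 % h][k2 % h]
                 for a in (0, 1) for b in (0, 1)) % q
             for k2 in range(n)] for k1 in range(n)]

random.seed(2)
x = [[random.randrange(q) for _ in range(N)] for _ in range(N)]
assert ntt2d_vr(x, g) == ntt2d_direct(x, g)   # 3 combine levels for N = 8
```

Each four-way combine step corresponds to one VR unit, and the recursion depth is log2 8 = 3 for an 8 × 8 transform, matching the three stages of 16 VR units described above.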
D. Convolution Type Configurability

We fix the size of the PE array to N × N for efficient N × N-point NTT computation. Nonetheless, we can still adjust the input tile size L and the weight kernel size K. The correctness of the convolution operation is guaranteed if the parameters meet the condition K + L − 1 ≤ N. Therefore, we can choose a preferred L value according to the kernel size K. The OaA unit also supports this notion of configurable parameters: it performs the additions in the overlapping K − 1 pixels with the adjacent output tiles accordingly. If the accelerator uses an 8 × 8-point NTT (i.e., N = 8), it can support kernel sizes smaller than the NTT size. To be specific, the input tile and kernel size pairs {L, K} are {8, 1}, {6, 3}, {4, 5}, and {2, 7} for the 1 × 1, 3 × 3, 5 × 5, and 7 × 7 kernel sizes, respectively. If we need a larger kernel, such as an 11 × 11 convolution, we can use a 16 × 16-point NTT (N = 16) that can be implemented using four 8 × 8-point NTT modules. In summary, the accelerator can support various kernel sizes with a proper setting of the N, L, and K parameters in the hardware configuration, as listed in Table III.

TABLE III
PARAMETER SETTING FOR VARIOUS KERNEL SIZES
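Under the K + L − 1 ≤ N rule, the tile size for a given kernel follows directly. A minimal sketch (the helper name `select_tile` is hypothetical) reproduces the pairs listed above:

```python
# Hypothetical helper mirroring Table III's parameter selection: pick the
# largest tile size L such that K + L - 1 <= N for a given kernel size K.
def select_tile(K, N=8):
    L = N - K + 1
    if L < 1:
        raise ValueError(f"kernel {K}x{K} needs a larger NTT than {N}x{N}")
    return L

assert [select_tile(K) for K in (1, 3, 5, 7)] == [8, 6, 4, 2]
# An 11x11 kernel exceeds N = 8, so a 16x16-point NTT is used instead,
# built from four 8x8-point modules.
assert select_tile(11, N=16) == 6
```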
E. Vector Unit

The VU consists of ALUs, pooling units, and a reshape buffer. It is designed to perform post-convolution operations such as scaling, bias addition, and non-linear operations (e.g., ReLU, ReLU6, and Bounded ReLU). The VU directly receives the convolution results from the NCU and applies the post-processing operations in the ALUs. It can bypass the pooling (2 × 2 or 3 × 3) if it is not needed. The reshape buffer reshapes the final results into tiles before storing them back in the memory.

V. DATA FLOW AND MAPPING

A. Input Streaming Dataflow

Figure 5 shows the proposed dataflow and data mapping on the PE arrays in the tile engines. We adopt an input streaming dataflow to maximize the throughput. It streams the input tiles while keeping the weights and output partial sums on the chip as long as possible. The transformed weights are reused in the transformed weight buffers, while the partial sums are accumulated in the PE arrays. Based on the tile-based processing, the proposed dataflow supports two types of parallelism: 1) weight tile parallelism, which computes multiple weight tiles with the same input tile in parallel (i.e., broadcasting the input tile to the PE arrays in the same row), and 2) input tile parallelism, which computes multiple input tiles with the same weight tile in parallel (i.e., loading the weight tile into the PE arrays in the same column).

The numbers in the figure indicate the order of the computation loops for the convolution between the input and the weights.
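The following is a rough behavioral sketch of this schedule, not the actual RTL. The mapping of the grid dimensions to input tile rows and weight tile columns, and the helper names, are illustrative assumptions (the R = 4, Z = 8 values echo the configuration reported in the FPGA implementation results).

```python
R, Z = 4, 8   # assumed PE-array grid: R rows of input tiles, Z weight columns

def stream_layer(in_tiles, w_tiles, pe_mac):
    """Schedule NTT-domain tiles onto an R x Z grid of PE arrays.

    in_tiles[(i, j, k)]: transformed input tile (spatial i, j, in-channel k)
    w_tiles[(k, z)]:     transformed weight tile (in-channel k, out-channel z)
    pe_mac(ij, z, t, w): element-wise MAC in one PE array, accumulating the
                         partial sum for output tile (ij, z) on chip
    """
    spatial = sorted({(i, j) for (i, j, _) in in_tiles})
    in_ch = sorted({k for (_, _, k) in in_tiles})
    out_ch = sorted({z for (_, z) in w_tiles})
    for s0 in range(0, len(spatial), R):        # R input tiles -> grid rows
        for z0 in range(0, len(out_ch), Z):     # Z weight tiles -> grid cols
            for k in in_ch:                     # stream input channels
                for ij in spatial[s0:s0 + R]:        # broadcast along a row
                    for z in out_ch[z0:z0 + Z]:      # weight reuse in a column
                        pe_mac(ij, z, in_tiles[ij + (k,)], w_tiles[(k, z)])

# Tiny smoke test with unit tiles: 2x2 spatial tiles, 3 in- and 4 out-channels.
acc = {}
ins = {(i, j, k): 1 for i in range(2) for j in range(2) for k in range(3)}
wts = {(k, z): 1 for k in range(3) for z in range(4)}
stream_layer(ins, wts,
             lambda ij, z, t, w: acc.update({(ij, z): acc.get((ij, z), 0) + t * w}))
assert len(acc) == 16 and all(v == 3 for v in acc.values())
```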
the on-the-fly domain transform and the pre-computation method. Table V summarizes the weight memory size requirement for each method when the NTT size is 8 × 8. The pre-computation requires 275, 68, and 348 MB of memory for storing the transformed weights in the external memory, which is 18.7×, 11.3×, and 17.6× higher than the on-the-fly method for VGG-16, GoogLeNet, and Darknet-19, respectively. This result suggests that the on-the-fly transform is preferable in most cases, as the memory size requirement otherwise increases severely; the pre-computation method would put significant pressure on the system's external memory bandwidth.

TABLE V
WEIGHT MEMORY SIZE BY TRANSFORM METHOD

C. FPGA Implementation Result

Figure 7 shows the final layout of the NTT-based CNN accelerator implemented on the Xilinx Alveo U50 FPGA. We successfully integrate 32 8 × 8 PE arrays with the configuration of R = 4 and Z = 8, using all the dies in the FPGA, called super logic regions (SLR0 and SLR1). We carefully place the four tile engines on the two SLRs with manual floorplanning to optimize the routing delay. In addition, we minimize the number of die-crossing signals and use enough pipelining registers for a high operating frequency. As a result, we achieve a 200 MHz operating frequency while utilizing 69% of the DSP, 69% of the LUT, 81% of the BRAM, and 48% of the URAM resources. Table VI summarizes the detailed FPGA resource utilization.

TABLE VI
FPGA RESOURCE USAGE ON XILINX ALVEO U50

D. Performance and Power Efficiency

For the performance evaluation, we measure the actual execution time of the FPGA device for running the target CNN layers and calculate the throughput as follows: we first calculate the total number of operations required for the target workload in the spatial domain and then divide it by the measured execution time. Figure 8 shows the performance and power efficiency of the proposed accelerator for the 22 layer types of the target CNN models. With the 32 8 × 8 PE arrays, the accelerator achieves an average of 2859.5, 990.3, and 805.6 giga operations per second (GOPS) throughput for VGG-16, GoogLeNet, and Darknet-19, respectively. The average power of the FPGA device is measured at 26 W, resulting in a power efficiency of 110.0, 38.1, and 31.0 GOPS per Watt (GOPS/W) for VGG-16, GoogLeNet, and Darknet-19, respectively.
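For reference, this throughput accounting can be reproduced as below. The layer shape is VGG-16's first 3 × 3 convolutional layer and the timing value is a placeholder; both are purely illustrative, and the operation count is the standard spatial-domain multiply-accumulate count, independent of the NTT-domain implementation.

```python
def conv_ops(H_out, W_out, C_out, C_in, K):
    # Spatial-domain operation count: one multiply + one add per MAC.
    return 2 * H_out * W_out * C_out * C_in * K * K

# Illustrative example: 224x224x64 output, 3 input channels, 3x3 kernel.
ops = conv_ops(224, 224, 64, 3, 3)
measured_seconds = 0.001          # placeholder for the measured FPGA time
gops = ops / measured_seconds / 1e9
print(f"{gops:.1f} GOPS")
```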
In this evaluation, we observe three key points. First, the proposed accelerator achieves high performance when the input size is large. This trend is clearly seen in the results of VGG-16, whose input size decreases from 224 × 224 at Conv1 to 14 × 14 at Conv5, while the kernel size is fixed to 3 × 3. It shows good performance until the input size reaches 56 × 56 at Conv3, but the performance drops quickly as the input size gets smaller than that point. This is because the proposed accelerator uses a tiling scheme on the input image. As the size of the PE array is fixed to 8 × 8, its utilization drops for the partially populated tiles from the input edges. Therefore, this utilization problem directly impacts the performance if the input size is not large enough compared to the array size. Second, the proposed accelerator performs strongly with 3 × 3 and 5 × 5 kernels but not as well with 1 × 1 and 7 × 7 kernels. This can be understood by calculating the number of equivalent operations involved in a single NTT-domain block convolution. Since the PE array performs the parallel convolution of L × L input points with a K × K sized kernel every cycle, the equivalent number of multiplications is L × L × K × K. Applying this calculation to our parameter settings, we get 64, 324, 400, and 196 multiplications for the 1 × 1, 3 × 3, 5 × 5, and 7 × 7 kernels, respectively. We can confirm that the performance of the Inception layers suffers as they involve many 1 × 1 kernels inside. However, it is noteworthy that the accelerator still achieves throughput above or close to 1000 GOPS on the Inception layers, even with 1 × 1 kernels and small input sizes such as 28 × 28 and 14 × 14. Third, the proposed accelerator is less beneficial if the layer contains striding. Because the NTT-based convolution basically replaces the convolutions on neighboring points with element-wise multiplications, it cannot fully leverage this benefit if the layer skips input points. In fact, its performance is reduced to one-fourth if the stride is 2, as shown in GoogLeNet's Conv1 layer. However, striding is used very limitedly in modern CNN models.

E. Comparison With Other CNN Accelerators

TABLE VII
COMPARISON WITH STATE-OF-THE-ART ACCELERATORS

Table VII shows the comparison of our work against state-of-the-art CNN accelerators on FPGA with various methods (i.e., spatial-based, FFT-based, and Winograd-based). With the optimized architecture for compute-efficient NTT-based convolution, our work achieves the best performance among the accelerators. It shows an average of 2.9× and 10.2× speedup over the spatial domain accelerators [14] and [15], respectively. This is due to the reduction of required operations in NTT-based convolution over spatial domain convolution. It achieves a 4.3× to 9.6× speedup over the FFT-based accelerators [19], [31]. The low-cost domain transform enables more domain transform computation on chip without causing the external bandwidth problem seen in the FFT-based accelerators. The proposed NTT-based accelerator also achieves moderate performance gains, a 1.1× to 7.0× speedup, over the Winograd-based accelerators [28], [31], [33], [34]. It also achieves comparable performance to the Winograd-based accelerator [35], which used a flexible and large transform length. However, a large transform length in Winograd-based CNNs can cause accuracy degradation, which is not evaluated in that work. Although not all previous accelerators reported their power efficiency, our work shows up to 7.0× higher power efficiency for the available data. Our work achieves higher power efficiency compared to the spatial domain and FFT-based accelerators, while it achieves comparable efficiency to the Winograd-based ones. The Winograd-based accelerators in [33] and [34] also utilized a low-cost domain transform, but only for a specific and small transform configuration, e.g., F(2 × 2, 3 × 3). That transform option is too limited, and its cost becomes significantly more expensive for larger transform lengths. For the numerical accuracy of the NTT-based convolution, we compare the NTT-based convolution result with the original convolution result from the spatial domain. We confirm that there are no errors between the two results. It means that the NTT-based convolution does not incur any numerical error, so the inference accuracy solely depends on the CNN model. In regard to this matter, we use an INT8 VGG-16 model trained on the ImageNet dataset and use the same dataset for evaluation. The inference accuracy is 72.32% and 90.97% for Top-1 and Top-5, respectively. The Winograd-based accelerators in [33] and [34] also reported that there is no error in Winograd-based convolution. However, it is only applicable for small transform lengths that require only a divide-by-two during the data transformation. For larger transform lengths, larger precision is required to keep the error small during the Winograd transform. In addition, our accelerator is applicable to many CNN models with various kernel sizes and layer shapes by modifying the tiling parameters
and data mapping, avoiding the complex control switching overhead of [15] and the filter decomposition inefficiency of the Winograd-based accelerators [28] and [34].

VIII. CONCLUSION

In this work, we present a fast convolution CNN accelerator using the number theoretic transform (NTT). A fast convolution algorithm has an advantage over spatial domain convolution by reducing the number of operations. However, the expensive domain transform and the memory overhead limit its potential for accelerating CNN computation. By refining the parameter selection of the NTT, we utilize a low-cost domain transform that uses only simple addition and shift operations. We propose a multi-engine architecture for tile-based NTT processing and optimize the dataflow and mapping to exploit maximal parallelism. The proposed accelerator achieves 2859.5, 990.3, and 805.6 GOPS throughput and 110.0, 38.1, and 31.0 GOPS/W power efficiency for VGG-16, GoogLeNet, and Darknet-19, respectively. It outperforms the existing fast convolution-based CNN accelerators by up to 9.6×. This result proves that NTT is a promising alternative for fast convolution-based CNNs. For future work, we will extend this method to support various types of convolution operations, such as convolution with a stride larger than 1, transposed convolution, and dilated convolution. We can also extend this NTT-based method using a 3D NTT algorithm to accelerate 3D convolution layers. In addition, exploiting neural network optimization methods, including sparsity and quantization in the NTT domain computation, will further improve the performance of this method. We also plan to use the proposed accelerator in a self-driving car system to enable real-time object detection.

ACKNOWLEDGMENT

Prasetiyo would like to thank the Hyundai Motor Chung Mong-Koo Global Scholarship for the scholarship support.

REFERENCES

[1] L. Chen, S. Li, Q. Bai, J. Yang, S. Jiang, and Y. Miao, "Review of image classification algorithms based on convolutional neural networks," Remote Sens., vol. 13, no. 22, p. 4712, Nov. 2021.
[2] S. S. A. Zaidi, M. S. Ansari, A. Aslam, N. Kanwal, M. Asghar, and B. Lee, "A survey of modern deep learning based object detection models," Digit. Signal Process., vol. 126, Jun. 2022, Art. no. 103514.
[3] H. Yu, L. T. Yang, Q. Zhang, D. Armstrong, and M. J. Deen, "Convolutional neural networks for medical image analysis: State-of-the-art, comparisons, improvement and perspectives," Neurocomputing, vol. 444, pp. 92–110, Jul. 2021.
[4] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. 3rd Int. Conf. Learn. Represent. (ICLR), San Diego, CA, USA, Y. Bengio and Y. LeCun, Eds., 2015, pp. 1–14.
[5] C. Szegedy et al., "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1–9.
[6] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 7263–7271.
[7] B. Jacob et al., "Quantization and training of neural networks for efficient integer-arithmetic-only inference," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 2704–2713.
[8] J. Qiu et al., "Going deeper with embedded FPGA platform for convolutional neural network," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2016, pp. 26–35.
[9] S. Han et al., "EIE: Efficient inference engine on compressed deep neural network," ACM SIGARCH Comput. Archit. News, vol. 44, no. 3, pp. 243–254, 2016.
[10] Y. Ma, Y. Cao, S. Vrudhula, and J.-S. Seo, "Optimizing the convolution operation to accelerate deep neural networks on FPGA," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 7, pp. 1354–1367, Jul. 2018.
[11] Y.-H. Chen, J. Emer, and V. Sze, "Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks," ACM SIGARCH Comput. Archit. News, vol. 44, no. 3, pp. 367–379, Jun. 2016.
[12] Y. Ma, Y. Cao, S. Vrudhula, and J.-S. Seo, "Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2017, pp. 45–54.
[13] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing FPGA-based accelerator design for deep convolutional neural networks," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2015, pp. 161–170.
[14] W. Huang et al., "FPGA-based high-throughput CNN hardware accelerator with high computing resource utilization ratio," IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 8, pp. 4069–4083, Aug. 2022.
[15] Y. Yu, T. Zhao, K. Wang, and L. He, "Light-OPU: An FPGA-based overlay processor for lightweight convolutional neural networks," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2020, pp. 122–132.
[16] X. Wu, Y. Ma, M. Wang, and Z. Wang, "A flexible and efficient FPGA accelerator for various large-scale and lightweight CNNs," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 69, no. 3, pp. 1185–1198, Mar. 2022.
[17] C. Zhang and V. Prasanna, "Frequency domain acceleration of convolutional neural networks on CPU-FPGA shared memory system," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2017, pp. 35–44.
[18] T. Abtahi, C. Shea, A. Kulkarni, and T. Mohsenin, "Accelerating convolutional neural network with FFT on embedded hardware," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 9, pp. 1737–1749, Sep. 2018.
[19] H. Zeng, R. Chen, C. Zhang, and V. Prasanna, "A framework for generating high throughput CNN implementations on FPGAs," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2018, pp. 117–126.
[20] W. Sun, H. Zeng, Y.-H.-E. Yang, and V. Prasanna, "Throughput-optimized frequency domain CNN with fixed-point quantization on FPGA," in Proc. Int. Conf. ReConFigurable Comput. FPGAs (ReConFig), Dec. 2018, pp. 1–8.
[21] C. Ding, S. Wang, N. Liu, K. Xu, Y. Wang, and Y. Liang, "REQ-YOLO: A resource-aware, efficient quantization framework for object detection on FPGAs," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2019, pp. 33–42.
[22] H. Zeng, C. Zhang, and V. Prasanna, "Fast generation of high throughput customized deep learning accelerators on FPGAs," in Proc. Int. Conf. ReConFigurable Comput. FPGAs (ReConFig), Dec. 2017, pp. 1–8.
[23] Y. Niu, R. Kannan, A. Srivastava, and V. Prasanna, "Reuse kernels or activations: A flexible dataflow for low-latency spectral CNN acceleration," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2020, pp. 266–276.
[24] Y. He, J. Yue, Y. Liu, and H. Yang, "Block-circulant neural network accelerator featuring fine-grained frequency-domain quantization and reconfigurable FFT modules," in Proc. 26th Asia South Pacific Design Autom. Conf., Jan. 2021, pp. 813–818.
[25] C. Fang, L. He, H. Wang, J. Wei, and Z. Wang, "Accelerating 3D convolutional neural networks using 3D fast Fourier transform," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2021, pp. 1–5.
[26] A. Lavin and S. Gray, "Fast algorithms for convolutional neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 4013–4021.
[27] C. Yang, Y. Wang, X. Wang, and L. Geng, "WRA: A 2.2-to-6.3 TOPS highly unified dynamically reconfigurable accelerator using a novel Winograd decomposition algorithm for convolutional neural networks," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 66, no. 9, pp. 3480–3493, Sep. 2019.
[28] S. Kala, B. R. Jose, J. Mathew, and S. Nalesh, "High-performance CNN accelerator on FPGA using unified Winograd-GEMM architecture," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 12, pp. 2816–2828, Dec. 2019.
[29] L. Lu and Y. Liang, "SpWA: An efficient sparse Winograd convolutional neural networks accelerator on FPGAs," in Proc. 55th ACM/ESDA/IEEE Design Autom. Conf. (DAC), Jun. 2018, pp. 1–6.
[30] A. Xygkis, D. Soudris, L. Papadopoulos, S. Yous, and D. Moloney, "Efficient Winograd-based convolution kernel implementation on edge devices," in Proc. 55th ACM/ESDA/IEEE Design Autom. Conf. (DAC), Jun. 2018, pp. 1–6.
[31] Y. Liang, L. Lu, Q. Xiao, and S. Yan, "Evaluating fast algorithms for convolutional neural networks on FPGAs," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 39, no. 4, pp. 857–870, Apr. 2020.
[32] C. Yang, Y. Wang, X. Wang, and L. Geng, "A stride-based convolution decomposition method to stretch CNN acceleration algorithms for efficient and flexible hardware implementation," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 9, pp. 3007–3020, Sep. 2020.
[33] J. Yepez and S.-B. Ko, "Stride 2 1-D, 2-D, and 3-D Winograd for convolutional neural networks," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 28, no. 4, pp. 853–863, Apr. 2020.
[34] H. Deng, J. Wang, H. Ye, S. Xiao, X. Meng, and Z. Yu, "3D-VNPU: A flexible accelerator for 2D/3D CNNs on FPGA," in Proc. IEEE 29th Annu. Int. Symp. Field-Program. Custom Comput. Mach. (FCCM), May 2021, pp. 181–185.
[35] X. Liu, Y. Chen, C. Hao, A. Dhar, and D. Chen, "WinoCNN: Kernel sharing Winograd systolic array for efficient convolutional neural network acceleration on FPGAs," in Proc. IEEE 32nd Int. Conf. Appl.-Specific Syst., Archit. Processors (ASAP), Jul. 2021, pp. 258–265.
[36] D. Wu, X. Fan, W. Cao, and L. Wang, "SWM: A high-performance sparse-Winograd matrix multiplication CNN accelerator," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 29, no. 5, pp. 936–949, May 2021.
[37] W. Xu, X. You, and C. Zhang, "Using Fermat number transform to accelerate convolutional neural network," in Proc. IEEE 12th Int. Conf. ASIC (ASICON), Oct. 2017, pp. 1033–1036.
[38] W. Xu, Z. Zhang, X. You, and C. Zhang, "Reconfigurable and low-complexity accelerator for convolutional and generative networks over finite fields," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 39, no. 12, pp. 4894–4907, Dec. 2020.
[39] B. Barabasz, A. Anderson, K. M. Soodhalter, and D. Gregg, "Error analysis and improving the accuracy of Winograd convolution for deep neural networks," ACM Trans. Math. Softw., vol. 46, no. 4, pp. 1–33, Dec. 2020.
[40] H. J. Nussbaumer, "The fast Fourier transform," in Fast Fourier Transform and Convolution Algorithms. Cham, Switzerland: Springer, 1981, pp. 211–240.
[41] K. R. Rao, D. N. Kim, and J. J. Hwang, Fast Fourier Transform: Algorithms and Applications. Springer, 2010, pp. 184–193.
[42] S. Jain, A. Gural, M. Wu, and C. Dick, "Trained quantization thresholds for accurate and efficient fixed-point inference of deep neural networks," Proc. Mach. Learn. Syst., vol. 2, pp. 112–128, Mar. 2020.

Prasetiyo (Graduate Student Member, IEEE) received the B.S. degree in electrical engineering from the Institut Teknologi Bandung (ITB), Bandung, Indonesia, in 2015, and the M.S. degree in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 2020, where he is currently pursuing the Ph.D. degree. His current research interests include computer architecture, FPGA, and domain-specific accelerators.

Seongmin Hong (Graduate Student Member, IEEE) received the B.S. and M.S. degrees in electronic and electrical engineering from Hongik University, Seoul, South Korea, in 2016 and 2018, respectively. He is currently pursuing the Ph.D. degree with the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea. His current research interests include computer architecture and FPGA-based accelerators for machine learning.

Yashael Faith Arthanto (Graduate Student Member, IEEE) received the B.S. degree in electrical engineering from the Institut Teknologi Bandung (ITB), Bandung, Indonesia, in 2019. He is currently pursuing the M.S. degree with the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea. His research interests include hardware architecture, hardware accelerators for AI, and multi-FPGA infrastructures.

Joo-Young Kim (Senior Member, IEEE) received the B.S., M.S., and Ph.D. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 2005, 2007, and 2010, respectively. He is currently an Assistant Professor with the School of Electrical Engineering, KAIST. He is also the Director of the AI Semiconductor Systems Research Center. Before joining KAIST, he was a Senior Hardware Engineering Lead at Microsoft Azure, Redmond, WA, USA, working on hardware acceleration for its hyper-scale big data analytics platform named Azure Data Lake. He was also one of the initial members of the Catapult project at Microsoft Research, Redmond, where he deployed a fabric of field-programmable gate arrays (FPGAs) in datacenters to accelerate critical cloud services, such as machine learning, data storage, and networking. His research interests span various aspects of hardware design, including VLSI design, computer architecture, FPGA, domain-specific accelerators, hardware/software co-design, and agile hardware development. He was a recipient of the 2016 IEEE Micro Top Picks Award, the 2014 IEEE Micro Top Picks Award, the 2010 DAC/ISSCC Student Design Contest Award, the 2008 DAC/ISSCC Student Design Contest Award, and the 2006 A-SSCC Student Design Contest Award. He serves as an Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS.