
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 70, NO. 1, JANUARY 2023

Accelerating Deep Convolutional Neural Networks Using Number Theoretic Transform

Prasetiyo, Graduate Student Member, IEEE, Seongmin Hong, Graduate Student Member, IEEE, Yashael Faith Arthanto, Graduate Student Member, IEEE, and Joo-Young Kim, Senior Member, IEEE

Abstract—Modern deep convolutional neural networks (CNNs) suffer from high computational complexity due to excessive convolution operations. Recently, fast convolution algorithms such as the fast Fourier transform (FFT) and the Winograd transform have gained attention to address this problem. They reduce the number of multiplications required in the convolution operation by replacing it with element-wise multiplication in the transform domain. However, fast convolution-based CNN accelerators have three major concerns: expensive domain transform, large memory overhead, and limited flexibility in kernel size. In this paper, we present a novel CNN accelerator based on the number theoretic transform (NTT), which overcomes these limitations. We propose low-cost NTT and inverse-NTT converters that use only adders and shifters for on-chip domain transform, which solves the inflated bandwidth problem and enables more parallel computations in the accelerator. We also propose an accelerator architecture that includes multiple tile engines with an optimized data flow and mapping. Finally, we implement the proposed NTT-based CNN accelerator on the Xilinx Alveo U50 FPGA and evaluate it on popular deep CNN models. As a result, the proposed accelerator achieves 2859.5, 990.3, and 805.6 GOPS throughput for VGG-16, GoogLeNet, and Darknet-19, respectively. It outperforms existing fast convolution-based CNN accelerators by up to 9.6×.

Index Terms—Convolutional neural networks (CNNs), fast convolution, field programmable gate array (FPGA), hardware accelerator, number theoretic transform (NTT).

Manuscript received 13 July 2022; revised 12 September 2022 and 5 October 2022; accepted 6 October 2022. Date of publication 25 October 2022; date of current version 25 January 2023. This work was supported in part by the Information Technology Research Center (ITRC) support program, supervised by the Institute of Information and Communications Technology Planning and Evaluation (IITP), under Grant IITP-2020-0-01847; and in part by the Super Computer Development Leading Program of the National Research Foundation (NRF), both under the Ministry of Science and ICT (MSIT), South Korea, under Grant 2021M3H6A1017683. This article was recommended by Associate Editor J. Di. (Corresponding author: Joo-Young Kim.)
The authors are with the School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, South Korea (e-mail: [email protected]; [email protected]).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TCSI.2022.3214528.
Digital Object Identifier 10.1109/TCSI.2022.3214528

I. INTRODUCTION

Convolutional neural networks (CNNs) are widely used in many vision-based applications such as image classification, object detection and segmentation in self-driving cars, and image analysis in healthcare [1], [2], [3]. As the applications require high accuracy, CNN models have become large and diversified with different layer types [4], [5], [6]. This requirement poses computational challenges in running them, mainly caused by excessive three-dimensional (3D) convolution operations with various kernel sizes.

Many CNN accelerators have been proposed to address this computational complexity using both software and hardware optimization. Several works [7], [8], [9] focus on CNN model quantization to make the computation lighter. Other works [10], [11], [12], [13], [14], [15], [16] focus on specialized hardware architecture and logic design to accelerate the computation itself. Another noticeable approach in CNN accelerators is the use of fast convolution algorithms, such as the fast Fourier transform (FFT) [17], [18], [19], [20], [21], [22], [23], [24], [25], the Winograd transform [26], [27], [28], [29], [30], [31], [32], [33], [34], [35], [36], and the Fermat number transform [37], [38]. These works have shown that fast convolution algorithms can accelerate CNN computation by replacing the convolution operation with element-wise multiplication through a domain transform. However, there are three main issues in this approach that hinder the potential performance gain. First, the domain transform overhead is high, so it often becomes the bottleneck of the hardware accelerator. Second, the domain transform commonly uses higher-precision or complex numbers, requiring a higher memory footprint and bandwidth. A pre-computation approach that performs the domain transform offline has been proposed to address the transform overhead, but the inflated data size causes an off-chip bandwidth problem. Third, some domain transforms are less flexible in changing the kernel size.

In this paper, we propose a novel and high-throughput CNN accelerator based on the number theoretic transform (NTT) to address the above challenges. NTT is another type of domain transform based solely on integer numbers, which inherits the fast convolution algorithm's property. NTT computation generally requires expensive modulo operations, which hinders accelerating CNN computation. However, we find that specific parameter selection under certain conditions can turn the expensive modulo operation into simpler operations such as shift and addition. By leveraging NTT's computational benefit with refined parameter selection, we design a low-cost domain converter that enables more on-chip domain transform. Combined with processing element (PE) arrays and efficient memory access based on data tiling, we demonstrate that NTT-based convolution is a promising alternative for accelerating modern deep CNN models. The main contributions of this paper are summarized as follows.

Fig. 1. Illustration of the fast convolution method for a CNN layer.

• We propose an NTT-based CNN accelerator that overcomes the limitations of the existing fast convolution-based accelerators, demonstrating up to 9.6× performance gain.
• We propose a multi-engine architecture that convolves the input and weight data in the NTT domain based on a tiled data flow. We also provide an efficient data mapping for maximizing parallelism.
• We design a low-cost NTT and inverse NTT (INTT) converter using only adders and shifters for on-chip domain transform. It addresses the transform overhead problem and enables more parallel computations in the accelerator.
• We identify hardware-amenable NTT parameters for an efficient compute path design and computation with various kernel sizes.
• We implement the proposed accelerator on the Xilinx Alveo U50 FPGA, utilizing high bandwidth memory, and evaluate it on popular deep CNN models.

The remainder of the paper is organized as follows. Section II overviews fast convolution-based CNN and existing CNN accelerators. Section III describes the NTT parameter optimization and its application to CNN. In Section IV, a novel accelerator architecture based on NTT is proposed with the component design. Section V describes the proposed data flow and hardware mapping. The performance model and scalability analysis of the accelerator are formulated in Section VI. Section VII shows the FPGA implementation and evaluation results. Finally, Section VIII concludes the paper.

II. BACKGROUND AND RELATED WORK

A. Fast Convolution Algorithm

A convolutional layer in a CNN involves computation between the input feature maps (input tensor I) and the multiple weights (weight tensor W) to produce output feature maps (output tensor O). The input tensor has a size of I × I × Cin, the weight tensor has a size of K × K × Cin × Cout, and the output tensor has a size of O × O × Cout. A spatial CNN refers to the conventional CNN that computes the convolution in the spatial domain. In contrast, fast convolution converts the sliding-window operation of the weight convolution over the input feature maps in the spatial domain into an element-wise multiply-and-accumulate (MAC) operation in a transform domain, as shown in Figure 1. Based on the convolution theorem, the fast convolution operation using a domain transform can be described as below.

  O = Σ_{Cin} I ∗ W = Σ_{Cin} F⁻¹(F(I) ⊙ F(W))   (1)

The main benefit of using a fast convolution algorithm in CNN is reducing the total number of multiplications. However, there are some important considerations for the efficient implementation of this method.

• Domain transform cost: The domain transform operator F(·) and its inverse operator F⁻¹(·) are the most significant factor for the effectiveness of this approach. Since the amount of potential parallel computation depends on how much data is available in the transform domain, a low-cost and high-throughput domain transform is essential. Otherwise, the domain transform can become the bottleneck.
• Memory requirement: The size inflation in domain transform is challenging because the transformed data, F(I) and F(W), have a larger size of (I + K − 1) × (I + K − 1) than their original sizes. In particular, the weight typically has a small kernel size (K × K) with many channels (Cin × Cout), so its size notably increases after the transform.
• Operation complexity: The complexity of the transform-domain operation (⊙) depends on the data type in the transform domain. This operation represents element-wise multiplications (Hadamard product) in the transform domain. In general, it requires high-bit-precision multiplication to maintain the correctness of the computation result.
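As a concrete illustration of equation (1) for a single input channel, the following Python sketch compares a direct 2D convolution against an FFT-based fast convolution. It is illustrative only and not part of the proposed design; the sizes I = 4 and K = 3 are arbitrary example values, and NumPy's FFT stands in for the generic operator F.

import numpy as np

# Single-channel illustration of Eq. (1): convolution in the spatial domain
# equals element-wise multiplication in the transform domain (here, FFT).
I, K = 4, 3                      # example input and kernel sizes
N = I + K - 1                    # transform length for a full (linear) convolution
x = np.random.randint(0, 5, (I, I))
w = np.random.randint(0, 5, (K, K))

# Direct full 2D convolution in the spatial domain.
direct = np.zeros((N, N))
for u in range(N):
    for v in range(N):
        for p in range(K):
            for s in range(K):
                if 0 <= u - p < I and 0 <= v - s < I:
                    direct[u, v] += x[u - p, v - s] * w[p, s]

# Fast convolution: transform, multiply element-wise, inverse transform.
fast = np.fft.ifft2(np.fft.fft2(x, (N, N)) * np.fft.fft2(w, (N, N))).real

assert np.allclose(direct, fast)

For a multi-channel layer, the same element-wise products are simply accumulated over Cin before (or after) the inverse transform, which is exactly the summation in (1).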
B. Spatial CNN Accelerators

Many CNN accelerators operate in the spatial domain. Chen et al. [11] proposed an energy-efficient array architecture and dataflow that maximize data reuse in the convolution operation. Ma et al. [12] and Zhang et al. [13] utilized loop optimization techniques such as loop unrolling, tiling, and interchanging to optimize the convolution operation. Other works, including Qiu et al. [8] and Han et al. [9], proposed model quantization and pruning methods along with dedicated hardware accelerators. Huang et al. [14] used a row-level pipelined streaming strategy and an optimized hardware mapping mechanism to improve resource utilization. In another work, Wu et al. [16] proposed a flexible and reconfigurable accelerator for different types of convolution, such as depthwise convolution, transposed convolution, and dilated convolution. However, all of these works focus only on spatial domain optimization without adopting any domain transform to accelerate the convolution operations.


C. FFT-Based CNN Accelerators

Among domain transforms, FFT has been widely used for CNN. Abtahi et al. [18] show that FFT-based CNN is beneficial in hardware over spatial-based CNN, mainly because of the reduction in multiplications. However, the FFT's transform cost is considerably high because it involves complex number operations. As the transform overhead becomes a limiting factor in FFT-based CNN accelerators, Zhang et al. [17] and Zeng et al. [19] pre-computed the weight transform and stored it in external memory. However, the pre-computation method causes another critical issue on the external memory bandwidth because the size of the transformed data is significantly increased. Sun et al. [20], Niu et al. [23], and He et al. [24] explored data quantization and sparsity in the frequency domain, but this method causes numerical instability and accuracy degradation. Fang et al. [25] utilized a three-dimensional FFT to accelerate CNN computation.

TABLE I. Comparison of Convolution Algorithms: Spatial-Based vs. FFT-Based vs. Winograd-Based vs. NTT-Based.

D. Winograd-Based CNN Accelerators

Recently, Winograd-based CNN acceleration has gained more popularity than the FFT-based method due to its lower transform cost overhead. However, the cost of the Winograd transform increases with the input size and filter length. In fact, only the transform function F(2 × 2, 3 × 3), which operates on a 4 × 4 input tile and a 3 × 3 weight tile, has a low enough cost for practical usage. The transform function for a larger tile requires division and floating-point calculation to avoid precision loss. Due to this precision loss, the Winograd method suffers from numerical instability [39]. Yang et al. [27] and Kala et al. [28] utilized a kernel decomposition scheme that decomposes a large kernel into several 3 × 3 kernels. However, it can be inefficient for kernel sizes not divisible by 3 × 3. In other works, Yang et al. [32] explored non-stride-one Winograd convolution using filter decomposition. Yepez et al. [33] and Deng et al. [34] utilized the 3D Winograd algorithm to accelerate 3D convolution. Wu et al. [36] utilized sparsity in Winograd-based convolution to accelerate CNN inference.

E. Motivation

Table I shows a comparison among various convolution algorithms (i.e., spatial-based, FFT-based, Winograd-based, and NTT-based). To specify their computational cost, we assume the case of a two-dimensional (2D) convolution between an L × L input and a K × K weight that produces an M × M output. The transform length N is the parameter for the domain transform size, given by N = L + K − 1. The parameter α represents the scaling factor in multiplication cost, which normalizes the cost in the transform domain by the cost in the spatial domain. Likewise, the parameter β represents the scaling factor in data size between the spatial and transform domains.

First, the domain transform cost represents the overhead of the fast algorithms needed for transforming the data from the spatial to the transform domain, and vice versa. The cost of the FFT's domain transform can be expressed in the order of 3N² log₂ N, because it requires a complex-number multiplier, which needs at least three multiplications. In Winograd, as the domain transform involves matrix multiplications with two transformation matrices (i.e., A W Aᵀ), the cost can be calculated in the order of 2(M + K − 1)². Second, the operation cost represents the computational cost in terms of the number of required multiplications for the entire 2D input data. Since each input data point requires a K × K sized convolution operation, the overall operation cost becomes L²K² in the spatial domain. On the other hand, the operation cost for a fast convolution algorithm is the number of element-wise multiplications in the transform domain, which is in the order of 3N² for FFT-based methods due to the complex number operations and in the order of (K + M − 1)² for Winograd-based methods. This operation cost is typically smaller than the domain transform cost. Therefore, the computation bottleneck typically lies in the domain transform. Third, the memory requirement means the required memory size for the convolution operations, i.e., the size of input and weight data in either the spatial or the transform domain, which depends on the transform length and the data type in the transform domain (e.g., a complex number in FFT).

This work has three main motivations. First, it is to accelerate CNN computation by using NTT, a fast convolution algorithm that can asymptotically reduce the total number of operations for convolution by K² times compared with the spatial domain. Second, it is to solve the high transform and memory overhead of the commonly used fast convolution algorithms such as the FFT-based and Winograd-based ones. The main advantage of NTT-based convolution over the other algorithms is that it can achieve a low-cost domain transform, which is the primary factor determining the efficiency of a fast convolution algorithm. The key enabler is that the parameters in NTT can be chosen such that the transform can be performed using only shift and addition operations without any expensive multiplications. The NTT's integer-only characteristic also promotes simpler hardware and a smaller memory footprint. Third, it is to solve the accuracy and numerical stability issue in FFT-based and Winograd-based convolutions. Both suffer from an accuracy loss due to the round-off error introduced during domain transform. On the other hand, NTT-based convolution involves only integer computation, which means the computation can be performed exactly without incurring accuracy or precision loss.


III. NTT-BASED CONVOLUTION

A. Number Theoretic Transform (NTT)

NTT is a generalization of the FFT whose computation is defined on integers followed by modulo q. The N-th root of unity (g_N) is an integer in NTT, while it takes the complex form e^(−j2π/N) in FFT. Due to the complex number representation, FFT suffers from accuracy loss caused by round-off error in the twiddle factors. NTT does not incur accuracy degradation during the domain transform computation. Since the computation involves only integer numbers and integer operands, we can consider NTT a function of integer-to-integer mapping. If we let h(n1, n2) be an element of the N × N data in the spatial domain and H(k1, k2) an element of the N × N data in the NTT domain, the NTT can be calculated using equation (2) [40]. The equation for the INTT is written in (3).

  H(k1, k2) ≡ Σ_{n1=0}^{N−1} Σ_{n2=0}^{N−1} h(n1, n2) · g_N^(n1·k1 + n2·k2)  (mod q),  for 0 ≤ k1, k2 ≤ N − 1   (2)

  h(n1, n2) ≡ N^(−2) · Σ_{k1=0}^{N−1} Σ_{k2=0}^{N−1} H(k1, k2) · g_N^(−(n1·k1 + n2·k2))  (mod q),  for 0 ≤ n1, n2 ≤ N − 1   (3)
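For reference, a minimal Python sketch of (2) and (3) is given below. It is an illustration rather than the hardware implementation, and it assumes the example parameters N = 8, q = 2^20 + 1, and g_N = 2^5, which satisfy g_N^N ≡ 1 (mod q).

N, q, g = 8, 2**20 + 1, 2**5     # example parameters; g**N % q == 1 holds

def ntt2d(h):
    # Forward 2D NTT of an N x N integer array h, directly following (2).
    return [[sum(h[n1][n2] * pow(g, n1 * k1 + n2 * k2, q)
                 for n1 in range(N) for n2 in range(N)) % q
             for k2 in range(N)] for k1 in range(N)]

def intt2d(H):
    # Inverse 2D NTT of an N x N array H, directly following (3).
    n_inv2 = pow(N, -2, q)       # N^(-2) mod q (modular inverse, Python 3.8+)
    return [[n_inv2 * sum(H[k1][k2] * pow(g, -(n1 * k1 + n2 * k2), q)
                          for k1 in range(N) for k2 in range(N)) % q
             for n2 in range(N)] for n1 in range(N)]

A round-trip check such as intt2d(ntt2d(x)) == x, for any N × N integer array x with entries in [0, q), confirms that the transform pair is consistent.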
B. Optimization for Low-Cost NTT

In general, NTT and INTT are computationally expensive because they involve modular multiplications. However, it is possible to obtain a low-cost transform by identifying the right NTT parameters. We use the following insights to choose NTT parameters that enable a low-cost NTT and INTT.

TABLE II. NTT Parameter Configuration.

• We select the N-th root of unity g_N as a power of two so that the multiplications can be converted into shift operations. This setup is possible as long as the condition g_N^N ≡ 1 (mod q) holds under the Galois field GF(q) and q − 1 is divisible by N. Table II shows several N, q, and g_N parameters that meet the above conditions.
• We replace the modulo operation with a subtraction by setting the q value to a power of two plus one, i.e., q = 2^r + 1. Since an arbitrary number A can be decomposed as A_H × 2^r + A_L, the modulo q of A becomes A_L − A_H, because the modulo q of 2^r is −1.
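The following sketch illustrates how the two insights above remove true multipliers and dividers; it is our own illustration in Python, with r = 20 and g_N = 2^5 assumed as example values consistent with the implemented design. Reduction modulo q = 2^r + 1 uses only a mask, a shift, and a subtraction, and multiplication by a power of g_N becomes a shift plus an optional negation.

r = 20
q = (1 << r) + 1                 # q = 2^r + 1

def reduce_mod_q(a):
    # Reduce 0 <= a <= (q - 1)^2 modulo q without division:
    # write a = a_H * 2^r + a_L; since 2^r is congruent to -1 (mod q),
    # a is congruent to a_L - a_H.
    while a >= q:
        a = (a & ((1 << r) - 1)) - (a >> r)
    return a + q if a < 0 else a

def mul_by_twiddle(x, e):
    # Multiply a residue x (0 <= x < q) by g_N^e with g_N = 2^5:
    # 2^(d*r + s) is congruent to (-1)^d * 2^s (mod q), so the whole operation
    # is a shift, one fold, and an optional negation; no multiplier is needed.
    s = (5 * e) % (2 * r)        # exponent of two, folded into [0, 2r)
    d, s = divmod(s, r)
    y = reduce_mod_q(x << s)
    return (q - y) % q if d else y

The same fold (A_L − A_H) is applied in the PE arrays when accumulating partial sums, which is the modulo operation based on subtraction referred to in Section IV-B.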
C. NTT-Based Convolution With Overlap-and-Add (OaA)

The high-level description of NTT-based convolution is the same as that of fast convolution. It transforms the input and weight data using NTT and performs the element-wise MAC in the input channel direction. It then transforms the results back to the spatial domain. However, one critical problem is that the input size is usually much larger than the kernel size in the convolutional layers of CNN models (e.g., 224 × 224 vs. 3 × 3). We would need a large-point NTT on both data to handle this, but that causes an exponential data explosion when the small-sized weight is converted into a large-point NTT. To address this problem, we divide the large-sized input into smaller tiles and perform convolution iteratively using the overlap-and-add (OaA) method [40]. It has been widely used for fast convolution-based CNN, such as in FFT-based and Winograd-based CNN. The OaA method divides a large convolution into multiple convolutions on short segments and overlaps the computation results from each part so that the final results are the same as in the large convolution case.

Algorithm 1: NTT-Based Convolution With OaA
  Input: Input (I × I × Cin), Weight (K × K × Cin × Cout)
  Output: Output ((I + K − 1) × (I + K − 1) × Cout)
  1: Tile the input maps into tiles T(i, j, k) of size L × L (i, j = 1, 2, ..., ⌈I/L⌉; k = 1, 2, ..., Cin)
  2: for z ← 1 to Cout do
  3:   for k ← 1 to Cin do
  4:     Ŵ(k, z) ← NTT(W(k, z))
  5:   for (i, j) ← (1, 1) to (⌈I/L⌉, ⌈I/L⌉) do
  6:     P̂(i, j) ← 0
  7:     for k ← 1 to Cin do
  8:       T̂(i, j, k) ← NTT(T(i, j, k))
  9:       P̂(i, j) ← P̂(i, j) + T̂(i, j, k) ⊙ Ŵ(k, z)
  10:    P(i, j) ← INTT(P̂(i, j))
  11:    O(i, j, z) ← OaA(P(i, j), P(i − 1, j), P(i, j − 1), P(i − 1, j − 1))
  12: return O

Fig. 2. Overlap-and-Add (OaA).

Algorithm 1 describes the detailed computational steps of the NTT-based convolution with overlap-and-add (OaA) for a convolutional layer. First, it divides the I × I input feature map into tiles with a fixed size of L × L and repeats this tiling for all input channels. T(i, j, k) refers to the i-, j-, and k-th tile in the width, height, and channel direction, respectively. On the weight side, it converts all the K × K sized weights in the same output channel to the NTT domain using an N × N-point NTT (line 4), where k and z are the indices of the input and output channel, respectively. Then, it performs element-wise multiplication between the transformed input tile T̂(i, j, k) and the transformed weight tile Ŵ(k, z), and accumulates the partial products into P̂(i, j) over all the input channels (line 9). The accumulated result is converted back to the spatial domain P(i, j) using INTT, producing N × N output pixels. Then, the OaA operation is applied to this result together with the outputs of the previous adjacent tiles to obtain the final convolution results (line 11). Figure 2 illustrates the OaA operation on neighboring tiles. The K − 1 pixels from the previous output tiles (i.e., P(i − 1, j − 1), P(i − 1, j), and P(i, j − 1)) are overlapped and added to the current output tile P(i, j). Then, the final convolution output is extracted from the L × L pixels of the overlapped result, as shown in the figure. The algorithm iterates the above until it covers all the tiles in the width and height directions. Finally, it repeats the whole computation for all the output channels to complete the convolutional layer.
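The sketch below demonstrates the tiling and overlap-and-add steps of Algorithm 1 for a single input and output channel (one z, one k), reusing N, q, ntt2d, and intt2d from the earlier sketch. It is illustrative only; it assumes the N = 8 configuration with L = 6 and K = 3, an input size I that is a multiple of L, and non-negative values small enough that the true convolution result stays below q, so no modular wrap-around occurs.

import numpy as np

L, K = 6, 3                      # tile and kernel sizes for the N = 8 setting

def ntt_oaa_conv2d(image, kernel):
    # Single-channel tiled convolution following Algorithm 1.
    I = image.shape[0]
    out = np.zeros((I + K - 1, I + K - 1), dtype=np.int64)
    W = np.zeros((N, N), dtype=np.int64)
    W[:K, :K] = kernel                           # zero-pad the K x K kernel to N x N
    W_hat = np.array(ntt2d(W.tolist()))
    for i in range(0, I, L):
        for j in range(0, I, L):
            T = np.zeros((N, N), dtype=np.int64)
            T[:L, :L] = image[i:i + L, j:j + L]  # zero-pad the L x L input tile
            T_hat = np.array(ntt2d(T.tolist()))
            P = np.array(intt2d(((T_hat * W_hat) % q).tolist()))
            out[i:i + N, j:j + N] += P           # overlap-and-add with neighbor tiles
    return out % q

For small non-negative test data, the result matches a direct full 2D convolution computed in the spatial domain. With the signed 8-bit data used in the actual accelerator, an additional signed-to-residue mapping would be needed, which is omitted here for brevity.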


IV. ACCELERATOR ARCHITECTURE

A. Architecture Overview

Figure 3 shows the overall architecture of the proposed NTT-based CNN accelerator. It consists of a PCIe controller, a scheduler, a multi-channel direct memory access (DMA) engine, an interconnect, weight/input/output buffers, the NTT convolution unit (NCU), and the vector unit (VU). The DMA consists of 11 AXI channels to handle data loading from the HBM in parallel. The proposed architecture works based on the NTT-based convolution described in Algorithm 1. It processes the computation between the input and weight tile by tile. The NCU is the main compute unit for NTT conversion and transform-domain computation, and it is further discussed in the next section. First, the input tile is transformed to the NTT domain using the input transform unit in each tile engine, while the weight tile is transformed to the NTT domain using the weight transform unit, and the result is stored in the transformed weight buffer for data reuse. Further description of the transform unit can be found in Section IV-C. Next, the PE array performs the transform-domain operation, which is element-wise multiplication and addition between the transformed input and the transformed weight. After accumulating the result over the input channels, the result is transformed back to the spatial domain for the overlap-and-add operation in the OaA unit. The VU directly performs post-processing operations on the convolution results.

Fig. 3. Proposed NTT-based CNN accelerator architecture.

To run an application, the host initially sends the hardware and CNN model configuration registers to the scheduler via the PCIe controller. The hardware configuration registers store the number of tile engines in the NCU (R), the number of PE arrays in a tile engine (Z), and the PE array size (N), which is equivalent to the NTT size. The CNN model configuration registers include the input size (I), input tile size (L), kernel size (K), the number of input/output channels (Cin/Cout), and the memory base addresses for the weight/input/output data. The scheduler instructs the DMA to fetch and distribute the required data to the NCU via the interconnect. It also generates the control signals for the NCU and VU to process the scheduled computation. The original input and weight data are initially stored in the external HBM, and all data transformations are performed entirely on chip. The accelerator adopts an input streaming dataflow and maximizes weight and output reuse. In addition, it can flexibly support various convolution kernel sizes.

B. NTT Convolution Unit

The NCU is the main computational block in the accelerator, responsible for performing convolution in the NTT domain. It consists of a controller, Z weight transform units, R tile engines, and R OaA units. The controller schedules the entire data loading and computation among the tile engines. The weight transform units convert the pre-fetched weight tiles to the NTT domain. These transformed weight tiles are temporarily stored in the transformed weight buffers in the tile engines. They remain in the buffers and are reused until the computation with all the input tiles has been completed. While performing the computation, the NCU also prefetches the weight tiles for the next computation round, utilizing double buffering in the transformed weight buffers. The tile engine is composed of one NTT unit, Z PE arrays paired with Z transformed weight buffers, and one INTT unit, as shown in Figure 3. The domain transform operation is performed in this tile engine. The NTT unit transforms the L × L input tile into the NTT domain and broadcasts the result to the Z PE arrays. The PE arrays also load Z different transformed weight tiles in parallel from the corresponding transformed weight buffers. Each PE array, containing N × N PEs, performs element-wise MAC operations between the transformed input and weight tiles. A PE corresponds to the computation of one element in the NTT domain. It also accumulates the partial sums over the entire input channels and applies the modulo operation based on subtraction. The output tiles are converted back to the spatial domain using the INTT unit. Finally, the OaA units perform the overlap-and-add operation to produce the final convolution results. For efficiency, the input transform unit is shared by all the PE arrays within the same tile engine. Furthermore, the output transform unit can also be shared among all the PE arrays because of the linear property of the INTT. The summation in the spatial domain can be moved into the transform domain as shown in (4), performing the inverse domain transform only once.

  Σ_{Cin} F⁻¹(F(I) ⊙ F(W)) = F⁻¹( Σ_{Cin} F(I) ⊙ F(W) )   (4)

C. NTT/INTT Using Vector-Radix (VR)

Fig. 4. 8 × 8-point NTT and INTT using VR units.

To implement the transform unit (NTT/INTT) efficiently, we leverage the vector-radix (VR) method widely used in the FFT algorithm [41]. Unlike the row-column algorithm, which first transforms the rows using a one-dimensional transform and then transforms the columns after a matrix transposition, the vector-radix algorithm performs the transformation directly on the 2D input without requiring matrix transposition. The algorithm recursively divides a large N × N-point transform into successive half-point transforms until it reaches the basic module called the VR unit. The VR unit is a basic 2 × 2-point computation unit that interweaves the results from the lower-level NTT/INTT nodes into the higher-level NTT/INTT nodes. Thanks to the NTT parameters chosen in Section III-B, the calculation of the VR unit becomes much simpler than that of FFT, using only addition and shift operations. Figure 4 illustrates the computation process of the 8 × 8-point NTT and INTT using the VR algorithm. It shows the whole network of connections between the input points (x/Y for NTT/INTT) and the output points (X/y for NTT/INTT) of each VR unit. The NTT computation is processed in the forward direction, while the INTT computation proceeds in the opposite direction. The domain transform computation is composed of three stages for the 8 × 8-point NTT/INTT. The calculation requires 16 VR units in each stage, so a total of 48 VR units are needed to complete the NTT/INTT. As a result, the 8 × 8 input pixels are converted to 8 × 8 output pixels through this multi-stage module.
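As a small illustration of why the VR unit needs no multipliers under the chosen parameters, the sketch below (our own, not the RTL) computes the basic 2 × 2-point transform: with a 2-point transform the root of unity is −1, so each output is just a signed sum of the four inputs, reduced with the shift-and-subtract fold from Section III-B (reduce_mod_q and q from that sketch). The inter-stage twiddle factors of the full 8 × 8 VR network, which are powers of two applied as shifts, are omitted here.

def vr_unit(a, b, c, d):
    # 2 x 2-point 2D transform of h(0,0)=a, h(0,1)=b, h(1,0)=c, h(1,1)=d:
    # X(k1, k2) = sum over n1, n2 of h(n1, n2) * (-1)^(n1*k1 + n2*k2) (mod q).
    # Only additions/subtractions are needed; reduce_mod_q folds each result.
    x00 = reduce_mod_q(a + b + c + d)
    x01 = reduce_mod_q(a - b + c - d + 2 * q)   # the 2q offset keeps the argument >= 0
    x10 = reduce_mod_q(a + b - c - d + 2 * q)
    x11 = reduce_mod_q(a - b - c + d + 2 * q)
    return x00, x01, x10, x11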
D. Convolution Type Configurability

We fix the size of the PE array to N × N for efficient N × N-point NTT computation. Nonetheless, we can still adjust the input tile size L and the weight kernel size K. The correctness of the convolution operation is guaranteed if the parameters meet the condition K + L − 1 ≤ N. Therefore, we can choose a preferred L value according to the kernel size K. The OaA unit also supports this notion of configurable parameters: it performs additions on the overlapping K − 1 pixels with the adjacent output tiles accordingly. If the accelerator uses an 8 × 8-point NTT (i.e., N = 8), it can support kernel sizes smaller than the NTT size. To be specific, the input tile and kernel size pairs are {8, 1}, {6, 3}, {4, 5}, and {2, 7} for the 1 × 1, 3 × 3, 5 × 5, and 7 × 7 kernel sizes, respectively. If we need a larger kernel, such as an 11 × 11 convolution, we can use a 16 × 16-point NTT (N = 16) that can be implemented using four 8 × 8-point NTT modules. In summary, the accelerator can support various kernel sizes with a proper setting of the N, L, and K parameters in the hardware configuration, as listed in Table III.

TABLE III. Parameter Setting for Various Kernel Sizes.

E. Vector Unit

The VU consists of ALUs, pooling units, and a reshape buffer. It is designed to perform post-convolution operations such as scaling, bias addition, and non-linear operations (e.g., ReLU, ReLU6, and Bounded ReLU). The VU directly receives the convolution results from the NCU and applies post-processing operations in the ALUs. It can bypass the pooling (2 × 2 or 3 × 3) if not needed. The reshape buffer reshapes the final results into tiles before storing them back in memory.

V. DATA FLOW AND MAPPING

A. Input Streaming Dataflow

Figure 5 shows the proposed dataflow and data mapping on the PE arrays in the tile engines. We adopt an input streaming dataflow to maximize the throughput. It streams the input tiles while keeping the weights and output partial sums on the chip as long as possible. The transformed weights are reused in the transformed weight buffers, while the partial sums are accumulated in the PE arrays. Based on the tile-based processing, the proposed dataflow supports two types of parallelism: 1) weight tile parallelism, which computes multiple weight tiles with the same input tile in parallel (i.e., broadcasting the input tile to the PE arrays in the same row), and 2) input tile parallelism, which computes multiple input tiles for the same weight tile in parallel (i.e., loading the weight tile into the PE arrays in the same column).

Fig. 5. Dataflow and data mapping on PE arrays.

The numbers in the figure indicate the order of the computation loops for the convolution between the input and the weights. First, the lowest-order loop iterates the tile processing in the input channel direction ❶. Each PE array performs element-wise multiplications between the input and weight tiles and accumulates the partial sums until it reaches the end of the input channels. Then, the second- and third-order loops iterate in the width ❷ and height ❸ directions, respectively. The PE arrays iterate over the input tiles for the fixed weights until they cover the whole input maps. In the highest-order loop ❹, the PE arrays move to the next group of weights, repeating the above until all the weights are covered.
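The loop ordering above can be summarized with the following schedule sketch. It is our own illustration in Python and only records the visiting order of the four loop levels; it does not model the parallel execution across tile engines and PE arrays.

def schedule(num_weight_groups, tiles_h, tiles_w, c_in):
    # Visit order implied by the input streaming dataflow:
    # (4) weight group -> (3) tile row -> (2) tile column -> (1) input channel.
    order = []
    for wg in range(num_weight_groups):      # loop 4: next group of weights
        for th in range(tiles_h):            # loop 3: height direction
            for tw in range(tiles_w):        # loop 2: width direction
                for k in range(c_in):        # loop 1: input channel (innermost;
                    order.append((wg, th, tw, k))  # partial sums stay in the PEs)
                # here the accumulated tile is INTT-ed and overlap-added
    return order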
B. Configurable Data Mapping on PE Arrays

The proposed accelerator includes R tile engines, each of which has Z PE arrays. The PE array is the basic computational unit that corresponds to an NTT-based block convolution, performing all the convolution operations between an L × L input tile and a K × K weight tile with element-wise multiplications. One block convolution is equivalent to L × L convolutions in the spatial domain. Overall, the accelerator can simultaneously perform R × Z block convolutions at once. The number of rows in the PE arrays, m, represents the factor of input tile parallelism, and the number of columns in the PE arrays, s, represents the factor of weight tile parallelism. Figure 5 shows a data mapping example on the PE arrays with R = 4 and Z = 8 using an input and weight tile parallelism of m = 2 and s = 16.

Even though the R and Z values are fixed in a specific hardware implementation, we can change the m and s values through data mapping. The scheduler can efficiently control this mapping by simply assigning which input/weight tiles go to which tile engines. It is also possible to adjust the input and weight tile parallelism at runtime, depending on the layer shape. One limitation of the current tile architecture is that the PE arrays in the same tile engine must process the same input, as they share the transform unit and the transformed data. We could use fewer PE arrays per tile engine for better mapping flexibility, but this would require more hardware resources. As a result, the proposed accelerator can support an input tile parallelism m from 1 to R while the corresponding weight tile parallelism s changes from R × Z to Z.

VI. ARCHITECTURAL ANALYSIS

A. Performance Model

To analyze the performance of the proposed NTT-based CNN accelerator, we assume that it operates on one convolutional layer that involves input feature maps of size I × I × Cin and weight kernels of size K × K × Cin × Cout. Based on the OaA tiling data flow described in Algorithm 1, the accelerator chunks the input into tiles, each of size L × L. Therefore, the number of input tiles for the given layer is calculated as follows.

  #Input_tiles = ⌈I/L⌉ × ⌈I/L⌉ × Cin   (5)

Moreover, the weight is computed as a tile of size K × K for one convolution. Hence, the number of weight tiles for the computation can be calculated as follows.

  #Weight_tiles = Cin × Cout   (6)

The accelerator has R tile engines, and each tile engine has Z PE arrays for the main computation. It processes m input tiles and s weight tiles in parallel within the R × Z PE arrays. Note that each PE array has N × N PEs. Therefore, the total number of PEs in the accelerator is calculated as follows.

  #PEs = R × Z × N × N   (7)

On each clock cycle, each tile engine computes the convolution of one L × L input tile with Z K × K weight tiles from different output channels in parallel. In the next successive cycles, each tile engine performs the convolution for the following input channels, accumulating all the partial sums. Therefore, after Cin clock cycles, each tile engine produces Z output tiles. The accelerator continues the computation for all the input tiles. Next, it repeats the convolution for different weight tiles in the output channel direction until it covers all the weight tiles. Assuming that the accelerator's operating frequency is f_op and the available external bandwidth is BW_total, the total computation time to perform a convolutional layer can be estimated as follows.

  T_compute = Cin × (⌈I/L⌉ × ⌈I/L⌉ / m) × ⌈Cout / s⌉ × 1/f_op   (8)

The accelerator uses the input streaming data flow, which requires at most R L × L input tiles every cycle. The required input bandwidth is calculated as follows.

  BW_input = R × (L × L) / (1/f_op)   (9)

The available bandwidth is mostly allocated to the input streaming, but the accelerator also produces R × Z L × L output tiles every Cin cycles. The amount of external bandwidth needed to store this data back to external memory can be calculated as follows.

  BW_output = R × Z × (L × L) / (Cin × 1/f_op)   (10)


The weights are loaded from external memory to on-chip memory, and they are fully reused for all input tiles. For this computation, the accelerator requires at least K × K × Cin × Z weight data to be available in on-chip memory. They should be fetched using the remaining bandwidth BW_weight = BW_total − (BW_input + BW_output). While computing, the accelerator pre-fetches the required weight tiles for the next computation round. It repeats the data loading for the next computation until all the weight data has been processed. Note that the weights are loaded only once from the external memory because they are fully reused in on-chip memory. The total time to load all the weight data from external memory can be estimated as below.

  T_load = (K × K × Cin × s / BW_weight) × ⌈Cout / s⌉   (11)

The accelerator can load the data tiles while it computes using a double-buffering scheme. Therefore, the total processing time is given by the maximum of T_compute and T_load. Finally, the performance of a single convolutional layer can be estimated by dividing the total number of operations in the layer by the total processing time, as described in (12). For the whole CNN model, the performance is calculated by dividing the total number of operations in the model by the sum of each layer's processing time.

  Performance = #Operations / max(T_compute, T_load)
              = (2 × O × O × K × K × Cin × Cout) / max(T_compute, T_load)   (12)
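Equations (5)-(12) can be combined into a small calculator, sketched below in Python for illustration. Bandwidths are expressed in elements per second, which coincides with bytes per second for the 8-bit data used here; the layer shape in the example call is an assumed VGG-16-like 3 × 3 layer, and O = I − K + 1 (unit stride, no padding) is an assumption of this sketch rather than a statement about the evaluated models.

import math

def layer_performance(I, K, Cin, Cout, L, R, Z, m, s, f_op, bw_total):
    # Performance model of (5)-(12); bandwidths in elements per second.
    tiles = math.ceil(I / L) ** 2                                       # spatial tiles per channel
    t_compute = Cin * (tiles / m) * math.ceil(Cout / s) / f_op          # Eq. (8)
    bw_input = R * (L * L) * f_op                                       # Eq. (9)
    bw_output = R * Z * (L * L) * f_op / Cin                            # Eq. (10)
    bw_weight = bw_total - (bw_input + bw_output)
    t_load = (K * K * Cin * s / bw_weight) * math.ceil(Cout / s)        # Eq. (11)
    O = I - K + 1                                                       # output size (assumption)
    ops = 2 * O * O * K * K * Cin * Cout
    return ops / max(t_compute, t_load)                                 # Eq. (12), in op/s

# Example: a 3x3 layer on the implemented configuration (R = 4, Z = 8, L = 6),
# with m = 2 and s = 16 as in Fig. 5 and the 52.4 GB/s budget of Section VI-B.
print(layer_performance(I=56, K=3, Cin=256, Cout=256, L=6,
                        R=4, Z=8, m=2, s=16, f_op=200e6, bw_total=52.4e9))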
B. Scalability Analysis

To evaluate the performance scalability of the proposed accelerator, we use the performance model designed above. In this experiment, we use VGG-16 as the target CNN model and vary the number of PE arrays by changing the Z value from 2 to 8 while R is fixed to 4. The operating frequency f_op is set to 200 MHz and BW_total is set to 52.4 GB/s, while BW_input, BW_output, and BW_weight are configured to 28.8 GB/s, 3.6 GB/s, and 20 GB/s, respectively. We compare the proposed accelerator's on-the-fly method, which performs the weight domain transform on the chip, with the pre-computation method, which performs the weight domain transform offline. For the pre-computation case, we need to load transformed weight tiles of size N × N × β from external memory instead of the K × K weight tiles.

Fig. 6. Scalability analysis of the accelerator.

Figure 6 shows the performance scalability of the two designs. The result indicates that the on-the-fly method scales linearly as the number of PE arrays increases, while the performance increase of the pre-computation case decelerates from 16 PE arrays. Eventually, the on-the-fly method shows 1.5× better performance than the pre-computation method with 32 PE arrays. This is because loading transformed data from external memory puts more pressure on the memory bandwidth, as the accelerator needs to fetch more tiles for more PE arrays. In other words, the accelerator's data loading time becomes dominant in the overall processing time, due to the limited memory bandwidth, when the number of PE arrays is large enough, deteriorating its scalability. This result shows the importance of performing the domain transform on chip for the scalability of the accelerator.

VII. EVALUATION

A. Methodology

We implement the proposed NTT-based CNN accelerator on the Xilinx Alveo U50 FPGA card. We use the Xilinx Vitis platform 2020.2 for host communication and Xilinx Vivado 2020.2 for the FPGA implementation. To evaluate our accelerator with various kernel sizes, we choose three popular CNN models whose kernel sizes vary from 1 × 1 to 7 × 7: VGG-16 [4], GoogLeNet [5], and Darknet-19 [6]. Table IV describes the detailed layer information of the target CNN models. We use the 8-bit integer representation for both inputs and weights, based on research showing that CNNs can be quantized to a low-integer format (e.g., INT8) without an accuracy loss [7], [42]. In addition, we use a 21-bit integer for the internal transformed data to fully preserve the accuracy. For the FPGA implementation, we use the NTT parameters N = 8, q = 2^20 + 1, and g_N = 2^5. For the external memory access, we utilize a total of 11 HBM channels. We specifically allocate two channels (≈19.7 GB/s) for weight data loading and the other channels for input data streaming and output data streaming. We successfully integrate 4 tile engines with 32 PE arrays (R = 4 and Z = 8).

TABLE IV. CNN Model Configuration for Evaluation.

B. On-the-Fly Transform Vs. Pre-Computation

We first analyze the memory size requirement of the two popular approaches in fast convolution-based accelerators: the on-the-fly domain transform and the pre-computation method.


TABLE V. Weight Memory Size by Transform Method.

Table V summarizes the weight memory size requirement of each method when the NTT size is 8 × 8. The pre-computation method requires 275, 68, and 348 MB of memory for storing the transformed weights in the external memory, which is 18.7×, 11.3×, and 17.6× higher than the on-the-fly method for VGG-16, GoogLeNet, and Darknet-19, respectively. This result suggests that the on-the-fly transform is preferable in most cases, as the memory size requirement otherwise increases severely. The pre-computation method would put significant pressure on the system's external memory bandwidth.

C. FPGA Implementation Result

Figure 7 shows the final layout of the NTT-based CNN accelerator implemented on the Xilinx Alveo U50 FPGA. We successfully integrate 32 8 × 8 PE arrays with the configuration of R = 4 and Z = 8, using all the dies in the FPGA, called super logic regions (SLR0 and SLR1). We carefully place the four tile engines on the two SLRs with manual floorplanning to optimize the routing delay. In addition, we minimize the number of die-crossing signals and use enough pipelining registers for high frequency. As a result, we achieve a 200 MHz operating frequency while utilizing 69% of the DSP, 69% of the LUT, 81% of the BRAM, and 48% of the URAM resources. Table VI summarizes the detailed FPGA resource utilization.

Fig. 7. Implementation layout on Xilinx Alveo U50.

TABLE VI. FPGA Resource Usage on Xilinx Alveo U50.

D. Performance and Power Efficiency

For the performance evaluation, we measure the actual execution time of the FPGA device for running the target CNN layers and calculate the throughput as follows. We first calculate the total number of operations required for the target workload in the spatial domain, then divide it by the measured execution time. Figure 8 shows the performance and power efficiency of the proposed accelerator for the 22 layer types of the target CNN models. With the 32 8 × 8 PE arrays, the accelerator achieves an average of 2859.5, 990.3, and 805.6 giga operations per second (GOPS) throughput for VGG-16, GoogLeNet, and Darknet-19, respectively. The average power of the FPGA device is measured at 26 W, resulting in a power efficiency of 110.0, 38.1, and 31.0 GOPS per Watt (GOPS/W) for VGG-16, GoogLeNet, and Darknet-19, respectively.

Fig. 8. Performance and power efficiency of the proposed NTT-based CNN accelerator on deep CNN models.

In this evaluation, we observe three key points. First, the proposed accelerator achieves high performance when the input size is large. This trend is clearly seen in the results of VGG-16, whose input size decreases from 224 × 224 at Conv1 to 14 × 14 at Conv5, while the kernel size is fixed to 3 × 3. It shows good performance until the input size is 56 × 56 at Conv3, but the performance drops quickly as the input size becomes smaller than that point. This is because the proposed accelerator uses a tiling scheme on the input image. As the size of the PE array is fixed to 8 × 8, its utilization drops for the partially populated tiles at the input edges. Therefore, this utilization problem directly impacts the performance if the input size is not large enough compared to the array size. Second, the proposed accelerator performs strongly with 3 × 3 and 5 × 5 kernels but not as well with 1 × 1 and 7 × 7 kernels. This can be understood by calculating the number of equivalent operations involved in a single NTT-domain block convolution. Since the PE array performs the parallel convolution on L × L input points with a K × K sized kernel every cycle, the equivalent number of multiplications is L × L × K × K. Applying this calculation to our parameter settings, we get 64, 324, 400, and 196 multiplications for the 1 × 1, 3 × 3, 5 × 5, and 7 × 7 kernels, respectively.


TABLE VII. Comparison With State-of-the-Art Accelerators.

We can confirm that the performance on the Inception layers suffers because they contain many 1 × 1 kernels. However, it is noteworthy that the accelerator still achieves performance over, or close to, 1000 GOPS on the Inception layers, even with 1 × 1 kernels and small input sizes such as 28 × 28 and 14 × 14. Third, the proposed accelerator is less beneficial if the layer contains striding. Because NTT-based convolution essentially replaces the convolutions on neighboring points with element-wise multiplications, it cannot fully leverage this benefit if the layer skips input points. In fact, its performance is reduced to one-fourth if the stride is 2, as shown in GoogLeNet's Conv1 layer. However, striding is used very limitedly in modern CNN models.

E. Comparison With Other CNN Accelerators

Table VII shows the comparison of our work against state-of-the-art CNN accelerators on FPGAs using various methods (i.e., spatial-based, FFT-based, and Winograd-based). With the optimized architecture for compute-efficient NTT-based convolution, our work achieves the best performance among the accelerators. It shows average 2.9× and 10.2× speedups over the spatial domain accelerators [14] and [15], respectively. This is due to the reduction of required operations in NTT-based convolution over spatial domain convolution. It achieves a 4.3-9.6× speedup over the FFT-based accelerators [19], [31]. The low-cost domain transform enables more domain transform computation on chip without causing the external bandwidth problem seen in the FFT-based accelerators. The proposed NTT-based accelerator also achieves moderate performance gains, a 1.1-7.0× speedup, over the Winograd-based accelerators [28], [31], [33], [34]. It also achieves comparable performance to the Winograd-based accelerator [35], which used a flexible and large transform length. However, a large transform length in Winograd-based CNN can cause accuracy degradation, which is not evaluated in that work. Although not all previous accelerators reported their power efficiency, our work shows up to 7.0× higher power efficiency for the available data. Our work achieves higher power efficiency compared to the spatial domain and FFT-based accelerators, while it achieves comparable efficiency to the Winograd-based ones. The Winograd-based accelerators in [33] and [34] also utilized a low-cost domain transform, but only for a specific and small transform configuration, e.g., F(2 × 2, 3 × 3). This transform option is too limited, and its cost becomes significantly more expensive for larger transform lengths. For the numerical accuracy of the NTT-based convolution, we compare the NTT-based convolution result with the original convolution result from the spatial domain. We confirm that there are no errors between the two results. This means that the NTT-based convolution does not incur any numerical error, so the inference accuracy solely depends on the CNN model. In this regard, we use an INT8 VGG-16 model trained on the ImageNet dataset and use the same dataset for evaluation. The inference accuracy is 72.32% and 90.97% for Top-1 and Top-5 accuracy, respectively. The Winograd-based accelerators in [33] and [34] also reported no error in Winograd-based convolution. However, this is only applicable to small transform lengths, which require only divide-by-two operations during the data transformation. For larger transform lengths, larger precision is required to keep the error small during the Winograd transform. In addition, our accelerator is applicable to many CNN models with various kernel sizes and layer shapes by modifying the tiling parameters and data mapping, overcoming the need for the complex control switching overhead of [15] and the filter decomposition inefficiency of the Winograd-based works [28] and [34].


VIII. CONCLUSION

In this work, we present a fast convolution CNN accelerator using the number theoretic transform (NTT). A fast convolution algorithm has an advantage over spatial domain convolution by reducing the number of operations. However, the expensive domain transform and the memory overhead limit its potential for accelerating CNN computation. By refining the parameter selection of NTT, we utilize a low-cost domain transform that uses only simple addition and shift operations. We propose a multi-engine architecture for tile-based NTT processing and optimize the dataflow and mapping to exploit maximal parallelism. The proposed accelerator achieves 2859.5, 990.3, and 805.6 GOPS throughput and 110.0, 38.1, and 31.0 GOPS/W power efficiency for VGG-16, GoogLeNet, and Darknet-19, respectively. It outperforms the existing fast convolution-based CNN accelerators by up to 9.6×. This result proves that NTT is a promising alternative for fast convolution-based CNN. For future work, we will extend this method to support various types of convolution operations, such as convolution with a stride larger than 1, transposed convolution, and dilated convolution. We can also extend this NTT-based method using a 3D NTT algorithm to accelerate 3D convolutional layers. In addition, exploiting neural network optimization methods, including sparsity and quantization, in the NTT domain computation will further improve the performance of this method. We also plan to use the proposed accelerator in a self-driving car system to enable real-time object detection.

ACKNOWLEDGMENT

Prasetiyo would like to thank the Hyundai Motor Chung Mong-Koo Global Scholarship for the scholarship support.

REFERENCES

[1] L. Chen, S. Li, Q. Bai, J. Yang, S. Jiang, and Y. Miao, “Review of image classification algorithms based on convolutional neural networks,” Remote Sens., vol. 13, no. 22, p. 4712, Nov. 2021.
[2] S. S. A. Zaidi, M. S. Ansari, A. Aslam, N. Kanwal, M. Asghar, and B. Lee, “A survey of modern deep learning based object detection models,” Digit. Signal Process., vol. 126, Jun. 2022, Art. no. 103514.
[3] H. Yu, L. T. Yang, Q. Zhang, D. Armstrong, and M. J. Deen, “Convolutional neural networks for medical image analysis: State-of-the-art, comparisons, improvement and perspectives,” Neurocomputing, vol. 444, pp. 92–110, Jul. 2021.
[4] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. 3rd Int. Conf. Learn. Represent. (ICLR), San Diego, CA, USA, Y. Bengio and Y. LeCun, Eds., 2015, pp. 1–14.
[5] C. Szegedy et al., “Going deeper with convolutions,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1–9.
[6] J. Redmon and A. Farhadi, “YOLO9000: Better, faster, stronger,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 7263–7271.
[7] B. Jacob et al., “Quantization and training of neural networks for efficient integer-arithmetic-only inference,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 2704–2713.
[8] J. Qiu et al., “Going deeper with embedded FPGA platform for convolutional neural network,” in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2016, pp. 26–35.
[9] S. Han et al., “EIE: Efficient inference engine on compressed deep neural network,” ACM SIGARCH Comput. Archit. News, vol. 44, no. 3, pp. 243–254, 2016.
[10] Y. Ma, Y. Cao, S. Vrudhula, and J.-S. Seo, “Optimizing the convolution operation to accelerate deep neural networks on FPGA,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 7, pp. 1354–1367, Jul. 2018.
[11] Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks,” ACM SIGARCH Comput. Archit. News, vol. 44, no. 3, pp. 367–379, Jun. 2016.
[12] Y. Ma, Y. Cao, S. Vrudhula, and J.-S. Seo, “Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks,” in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2017, pp. 45–54.
[13] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing FPGA-based accelerator design for deep convolutional neural networks,” in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2015, pp. 161–170.
[14] W. Huang et al., “FPGA-based high-throughput CNN hardware accelerator with high computing resource utilization ratio,” IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 8, pp. 4069–4083, Aug. 2022.
[15] Y. Yu, T. Zhao, K. Wang, and L. He, “Light-OPU: An FPGA-based overlay processor for lightweight convolutional neural networks,” in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2020, pp. 122–132.
[16] X. Wu, Y. Ma, M. Wang, and Z. Wang, “A flexible and efficient FPGA accelerator for various large-scale and lightweight CNNs,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 69, no. 3, pp. 1185–1198, Mar. 2022.
[17] C. Zhang and V. Prasanna, “Frequency domain acceleration of convolutional neural networks on CPU-FPGA shared memory system,” in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2017, pp. 35–44.
[18] T. Abtahi, C. Shea, A. Kulkarni, and T. Mohsenin, “Accelerating convolutional neural network with FFT on embedded hardware,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 9, pp. 1737–1749, Sep. 2018.
[19] H. Zeng, R. Chen, C. Zhang, and V. Prasanna, “A framework for generating high throughput CNN implementations on FPGAs,” in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2018, pp. 117–126.
[20] W. Sun, H. Zeng, Y.-H.-E. Yang, and V. Prasanna, “Throughput-optimized frequency domain CNN with fixed-point quantization on FPGA,” in Proc. Int. Conf. ReConFigurable Comput. FPGAs (ReConFig), Dec. 2018, pp. 1–8.
[21] C. Ding, S. Wang, N. Liu, K. Xu, Y. Wang, and Y. Liang, “REQ-YOLO: A resource-aware, efficient quantization framework for object detection on FPGAs,” in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2019, pp. 33–42.
[22] H. Zeng, C. Zhang, and V. Prasanna, “Fast generation of high throughput customized deep learning accelerators on FPGAs,” in Proc. Int. Conf. ReConFigurable Comput. FPGAs (ReConFig), Dec. 2017, pp. 1–8.
[23] Y. Niu, R. Kannan, A. Srivastava, and V. Prasanna, “Reuse kernels or activations: A flexible dataflow for low-latency spectral CNN acceleration,” in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2020, pp. 266–276.
[24] Y. He, J. Yue, Y. Liu, and H. Yang, “Block-circulant neural network accelerator featuring fine-grained frequency-domain quantization and reconfigurable FFT modules,” in Proc. 26th Asia South Pacific Design Autom. Conf., Jan. 2021, pp. 813–818.
[25] C. Fang, L. He, H. Wang, J. Wei, and Z. Wang, “Accelerating 3D convolutional neural networks using 3D fast Fourier transform,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2021, pp. 1–5.
[26] A. Lavin and S. Gray, “Fast algorithms for convolutional neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 4013–4021.
[27] C. Yang, Y. Wang, X. Wang, and L. Geng, “WRA: A 2.2-to-6.3 TOPS highly unified dynamically reconfigurable accelerator using a novel Winograd decomposition algorithm for convolutional neural networks,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 66, no. 9, pp. 3480–3493, Sep. 2019.
[28] S. Kala, B. R. Jose, J. Mathew, and S. Nalesh, “High-performance CNN accelerator on FPGA using unified Winograd-GEMM architecture,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 12, pp. 2816–2828, Dec. 2019.

Authorized licensed use limited to: Amrita School of Engineering. Downloaded on November 09,2023 at 10:53:06 UTC from IEEE Xplore. Restrictions apply.
326 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 70, NO. 1, JANUARY 2023

Prasetiyo (Graduate Student Member, IEEE) received the B.S. degree in electrical engineering from the Institut Teknologi Bandung (ITB), Bandung, Indonesia, in 2015, and the M.S. degree in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 2020, where he is currently pursuing the Ph.D. degree. His current research interests include computer architecture, FPGA, and domain-specific accelerators.

Seongmin Hong (Graduate Student Member, IEEE) received the B.S. and M.S. degrees in electronic and electrical engineering from Hongik University, Seoul, South Korea, in 2016 and 2018, respectively. He is currently pursuing the Ph.D. degree with the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea. His current research interests include computer architecture and FPGA-based accelerators for machine learning.

Yashael Faith Arthanto (Graduate Student Member, IEEE) received the B.S. degree in electrical engineering from the Institut Teknologi Bandung (ITB), Bandung, Indonesia, in 2019. He is currently pursuing the M.S. degree with the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea. His research interests include hardware architecture, hardware accelerators for AI, and multi-FPGA infrastructures.

Joo-Young Kim (Senior Member, IEEE) received the B.S., M.S., and Ph.D. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 2005, 2007, and 2010, respectively. He is currently an Assistant Professor with the School of Electrical Engineering, KAIST. He is also the Director of the AI Semiconductor Systems Research Center. Before joining KAIST, he was a Senior Hardware Engineering Lead at Microsoft Azure, Redmond, WA, USA, working on hardware acceleration for its hyper-scale big data analytics platform named Azure Data Lake. He was also one of the initial members of the Catapult project at Microsoft Research, Redmond, where he deployed a fabric of field-programmable gate arrays (FPGAs) in datacenters to accelerate critical cloud services, such as machine learning, data storage, and networking. His research interests span various aspects of hardware design, including VLSI design, computer architecture, FPGA, domain-specific accelerators, hardware/software co-design, and agile hardware development. He was a recipient of the 2016 IEEE Micro Top Picks Award, the 2014 IEEE Micro Top Picks Award, the 2010 DAC/ISSCC Student Design Contest Award, the 2008 DAC/ISSCC Student Design Contest Award, and the 2006 A-SSCC Student Design Contest Award. He serves as an Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS.