High-Utilization, High-Flexibility Depth-First CNN Coprocessor For Image Pixel Processing On FPGA
…the main advantage over state-of-the-art architectures typically targeting only one algorithm and/or kernel.

4) Equations are derived in Sections III and IV to build a model. Section V uses this model to investigate the influence of hardware parameters and compute the digital signal processing (DSP) utilization and the benefits of DF for a given pixel processing CNN and a given FPGA.

Hereafter, Section VI performs a state-of-the-art comparison with FPGA implementations for both pixel processing algorithm applications and classification applications. The architecture does not support specific layer types that are not included in the networks of interest. Section VII describes how the current architecture can be adapted to handle these layers, with a small loss of performance for the networks under interest. This article ends with a conclusion in Section VIII.

II. BACKGROUND

A. Classification Tasks Versus Pixel Processing Algorithms

Traditionally, CNNs for image processing mostly targeted classification tasks. Popular classification networks are, for example, AlexNet [9], VGG-16 [10], and YOLO [11]. For these classification tasks, many FPGA implementations exist in the literature, where the computation of the CNN happens layer by layer [1], [2], [5], [12].

Nowadays, CNNs are not only used for classification tasks anymore. Another increasingly important application is pixel processing algorithms. These are algorithms that transform an input image into an enhanced output image [13], such as segmentation [1], [2], super-resolution [14], [15], denoising [16], depth estimation [12], [17], and optical flow [18].

Pixel processing algorithms are characterized by large OX and OY values (large intermediate feature map dimensions, e.g., full-HD images (3840 × 2160) here), and this for all layers until the end of the network. This characteristic stems from the fact that pixel processing algorithms output complete (transformed) images and not just a single classification value. Classification algorithms have, in several layers (especially the later ones), a much smaller OX and OY. Therefore, pixel processing algorithms lead to a new memory bottleneck, as the weights are not the dominant factor anymore. For traditional layer-by-layer implementations, even if every input must be read from off-chip memory only once and every output must be sent to off-chip memory only once, these high OX and OY lead to I/O and memory bottlenecks.

B. Parallelize Over Layers: Not Flexible Enough

Activation-dominated I/O and memory bottlenecks can be reduced by computing features of multiple layers in parallel. This strategy is presented for both classification tasks in the architectures of [7] and [8] and pixel processing algorithms in the architectures of [14] and [15]. These architectures are pipelined. A certain number of multipliers are assigned for the multiplications of every layer. This number of multipliers must be in proportion to the computation complexity of the given layer, as the computation of an output feature must be done roughly at the same time for every layer. Otherwise, multipliers of a given layer have to wait for other layers, leading to a decrease in throughput. This means that these architectures are optimized for just one CNN.

C. Consequences for Our Design

The targeted coprocessor for pixel processing algorithms will be flexible. This means that no reconfiguration of the FPGA is necessary between the executions of different networks, saving latency. However, this means that the strategy of [7], [8], [14], and [15] cannot be used. The restructuring of the processing elements due to the varying ratios of computation complexities over convolutional layers in different CNNs would lead to a huge overhead in the number of multiplexers and connections. To be able to reduce I/O and memory bottlenecks, the coprocessor will make use of the DF algorithm, as explained in Section III. The main difference is that the architecture will not pipeline over layers. In addition, as large activation tensors lead to many computations, the multipliers of the PE array must have high useful utilization, as explained in Section IV. The multipliers of the previously discussed architectures only have to compute convolutions with one fixed kernel size. In our design, multipliers must deliver high useful utilization for many kernel sizes. The PE array must be able to handle all kernel sizes from our benchmark networks: a segmentation [2], super-resolution [15], denoising [16], depth estimation [17], and optical flow [18] network. These kernel sizes are 3 × 3, 5 × 5, 7 × 7, and 9 × 9.
III. DEPTH-FIRST

Traditional implementations evaluate a CNN layer by layer, hereafter called layer-first operation. While layer-first computation is efficient in terms of weight movements, it results in a large I/O communication for input and output activations. More specifically, layer-first computation requires complete intermediate feature maps to either be stored in embedded memory on the FPGA or be sent to the external host processor and later fetched back for the computation of the following layer. While this was manageable for classification CNNs, this becomes a problem for pixel processing algorithms. They typically work with larger input images and also have large intermediate features. This article will, therefore, improve and exploit a DF processing principle, which immediately consumes feature outputs of a given convolutional layer to compute feature outputs of the next convolutional layer, avoiding I/O data transfers. Sections III-A–III-C discuss three state-of-the-art DF strategies. Sections III-D and III-E present three new DF methodologies. Section III-F performs the comparison between the six discussed DF principles.

A. Tile-Based Depth-First Without Reuse

The DF principle can naively be deployed in a tile-based manner without reuse of intermediate features [19], as shown in Fig. 1(a). To compute one pixel at the output of the CNN, a patch with the height and width of $F_N$ is necessary for the last intermediate layer. To obtain this, a patch with width and height $F_N + F_{N-1} - 1$ is needed in the previous intermediate layer. Therefore, by backpropagation, to compute one pixel at the output of the CNN, a patch with the height and width of
$$\sum_{i=1}^{N} (F_i - 1) + 1$$
is necessary at the input. This is called a tile. The output features of a convolutional layer can be used immediately to compute the output features of the next convolutional layer. In the example in the figure, a 3 × 3 patch in the intermediate layer is computed. These nine pixels are used immediately to compute the output pixel. In this way, every intermediate feature result propagates along with the depth of the network through all layers till the end of the neural network. This is why this technique is named "depth"-first. After this, the output is sent to the host processor, and the host processor gives new input features. Therefore, no I/O communication is needed for intermediate features. No intermediate features are stored on-chip. This technique has two major drawbacks: input features have to be sent multiple times to the FPGA, and intermediate features have to be computed multiple times. In the example, the 3 × 3 patches in the intermediate layers to compute the red and blue output pixels are overlapping. Every intermediate feature is computed $\left(\sum_{j=i+1}^{N} (F_j - 1) + 1\right)^2$ times instead of 1 in the $i$th convolutional layer.

Fig. 1. Tile-based DF (a) without and (b) with reuse. A CNN of three feature layers is shown, from top to bottom. In every convolutional layer, the pixels that are needed to compute the red and blue output pixels are indicated. Every row is computed pixel by pixel. The blue pixel is computed after the red pixel. After finishing one row, the next row is evaluated. In (a), these are indicated with squares because the technique of (a) does not make use of overlap. In (b), therefore, the pixels are filled with colors. With reuse, the intermediate features in purple are reused for the computation of the blue output feature. These are stored in memory.

B. Tile-Based Depth-First With Horizontal Data Reuse

Reusing input and intermediate features [purple in Fig. 1(b)] for the computation of subsequent output pixels due to the overlap of tiles reduces these two drawbacks [19]. They are stored in on-chip memory. After the computation of a new output feature, the leftmost column of the reusable intermediate features is replaced by the rightmost column of the computed intermediate features. Still, every intermediate feature is computed $\sum_{j=i+1}^{N} (F_j - 1) + 1$ times instead of 1 in the $i$th convolutional layer.

C. Pixel-Based Depth-First

To further exploit intermediate feature data reuse, one can also exploit reuse in the vertical direction [20], [21], as shown in Fig. 2(a). When the processor gets one new pixel from the input layer, one new feature of the first intermediate layer is computed. This feature is used immediately to compute one new feature of the second intermediate layer. In the example, we have just computed a new feature of layer 1 and use this to compute a new feature of layer 2. This principle is proven to be both memory and I/O bandwidth-efficient, as it is not necessary to load and unload all features in the memory of the FPGA. Every intermediate feature is computed only once, and every input feature has to be sent only once from the host processor to the FPGA.

The architectures in Section II-B can be seen as pipelined pixel-based DF architectures.
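To make the drawbacks of the tile-based variants concrete, the tile size and the per-layer recomputation factors of Sections III-A and III-B can be evaluated directly from a network's kernel-size list. The following Python sketch is illustrative only; the function names and the three-layer example are ours, not part of the architecture:

```python
# Sketch (not from the paper): tile size and recomputation factors for the
# tile-based DF variants of Sections III-A and III-B.

def tile_size(F):
    """Input patch height/width needed per output pixel: sum(Fi - 1) + 1."""
    return sum(f - 1 for f in F) + 1

def recompute_factors(F):
    """Per-layer recomputation counts for layer i (0-based).

    Without reuse (Fig. 1(a)) a feature of layer i is computed
    (sum_{j>i}(Fj - 1) + 1)^2 times; with horizontal reuse (Fig. 1(b))
    only (sum_{j>i}(Fj - 1) + 1) times.
    """
    factors = []
    for i in range(len(F)):
        extent = sum(f - 1 for f in F[i + 1:]) + 1
        factors.append((extent ** 2, extent))  # (no reuse, horizontal reuse)
    return factors

# Example: the three-layer denoising network [16] with 9x9, 5x5, 5x5 kernels.
F = [9, 5, 5]
print(tile_size(F))          # 17
print(recompute_factors(F))  # [(81, 9), (25, 5), (1, 1)]
```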
D. Line-Based Depth-First

The pixel-based DF principle can be extended to take in a complete line of input features in every pass through the network, as shown in Fig. 2(b). When the computation of one line of a layer is done, the network jumps to the next layer to process the new output line. Only when the line reaches the last layer, the network takes in a new input line and starts again in the first layer. For every intermediate layer, the number of rows to be stored in the internal memory of the FPGA processor is equal to the FX of the subsequent convolutional layer, instead of FX − 1 rows and FX pixels for the pixel-based DF principle. When a new line of intermediate features is computed, it can replace the oldest line entry of this layer, as this line is not used anymore. In the example shown in Fig. 2(b), the blue line will overwrite the orange line.

E. Intermediate-Line-Based and Multiline-Based Depth-First

In intermediate-line-based DF, only one complete line for each intermediate layer is stored. The other lines contain temporary results. When the computation of one line is done, this line is used to update the FX lines of the next layer on which it has influence. The lower FX − 1 lines are still intermediate results, and the upper one contains the final intermediate results and updates the subsequent layer. In the example shown in Fig. 2(c), we have just updated the five green lines of layer 1. Only the upper one of them contains final results, and the lower ones have to be updated later. This upper green line is now used to update the three blue lines of layer 2. It is the last line from layer 1 which is needed to compute the upper blue line of layer 2, so this line will contain complete results and will be used to propagate further. The two bottom blue lines have to be updated further during the next one/two propagations of a new input line throughout the network. In this way, every multiplication is also performed only once. The number of updates is equal to $F_X$.

Fig. 2. (a) Pixel-, (b) line-, (c) intermediate-line-, and (d) multiline-based DF approaches. Each big trapezoid represents a feature layer. Each small trapezoid represents all the channels from one line of the activation data of one layer. The colors of the lines represent the current state of the given line: computed or not yet, stored on-chip or already removed, sent to off-chip memory, and so on. These colors explain the DF principle: a new pixel/line/multiple lines of the input layer come in and propagate through to the end of the network with few memory resources.
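The line bookkeeping of the line-based approach can be illustrated with a short software model. The sketch below is a simplification under stated assumptions (single channel, stride 1, no padding, one kernel per layer); the hardware of Section IV implements the same schedule with parallel channels:

```python
# Minimal sketch (assumptions: single channel, stride 1, no padding) of the
# line-based DF principle: each layer keeps only FX rows; a freshly computed
# row immediately propagates to the next layer, and the oldest row is dropped.
from collections import deque
import numpy as np

def conv_row(rows, kernel):
    """1-D sliding window over a stack of FX buffered rows -> one output row."""
    fy = kernel.shape[1]
    width = rows.shape[1] - fy + 1
    return np.array([np.sum(rows[:, y:y + fy] * kernel) for y in range(width)])

def line_based_df(input_rows, kernels):
    # One rolling buffer of FX rows per layer; deque(maxlen=FX) overwrites
    # the oldest line automatically, as in Fig. 2(b).
    buffers = [deque(maxlen=k.shape[0]) for k in kernels]
    for row in input_rows:                  # one new input line per pass
        for buf, kernel in zip(buffers, kernels):
            buf.append(row)
            if len(buf) < kernel.shape[0]:  # not enough rows buffered yet
                break
            row = conv_row(np.stack(buf), kernel)  # new row of the next layer
        else:
            yield row                       # the row reached the output layer
```

Calling, e.g., `list(line_based_df(image_rows, [np.ones((3, 3)) / 9] * 2))` streams output rows while never holding more than FX rows per layer, matching the memory figures of Table II.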
F. Comparison of Depth-First Principles

This section compares the DF approaches where all features are computed only once. As explained in Sections III-A and III-B, this is not the case for tile-based DF. The four remaining DF approaches are compared with the layer-first approach in Table II. The table compares the I/O bandwidth, the memory to store the intermediate features, and the parallelization/unrolling options in hardware for the algorithms. The suffix "u" after a loop name indicates spatial unrolling [22].

The I/O bandwidth is much lower for all DF approaches in comparison with layer-first: 81×, 79.2×, 43.7×, 29.8×, and 3.9× for the super-resolution [15], segmentation [2], denoising [16], optical flow [18], and depth estimation network [17]. The last factor is lower because the CNN of the depth estimation algorithm contains many output channels: $K_N = 64$. The line-based DF approach needs a little more features' memory than the pixel-based one (50%, 25%, 24.7%, 25%, and 50%) but is preferred due to its parallelization options: OYu and higher potential for IYu. To fully exploit the additional parallelization option of the multiline-based approach, L has to be taken high enough, but this causes higher memory requirements. The intermediate-line-based approach needs more memory writes than the line-based approach. Therefore, in Section IV-A, a line-based DF supporting PE array is presented. The possible loop unrollings are OYu, IYu, FYu, Ku, Cu, IXu, and FXu.

G. On-Chip Memory Savings

The features' memory as in Table II will be too high for many FPGAs (2.3 MB for the optical flow network). Two solutions are introduced in the pixel-based approach of [20].
1) Divide the network into two or more subnetworks that are executed DF within the subnetworks but sequentially across the subnetworks.
2) Divide the image into vertical subimages that are processed independently with input width IY′ (and corresponding output width OY′, as shown in Fig. 3).

Fig. 3. Border effects between two neighboring subimages.

Both solutions will, however, slightly increase the required I/O bandwidth again due to the following:
1) sending internal layers in between the subnetworks to the host processor for solution 1;
2) overlap of the vertical subimages for solution 2, as illustrated in Fig. 3.

When dividing the image into subimages, there is overhead in the number of multiplications performed because the image size will not perfectly fit the PE array subimage size. The total number of vertical subimages is $\lceil OY/OY' \rceil$. Fig. 4 illustrates how the overhead can be reduced by using parts from two different images at the same time. The right part of the first image does not have enough columns to completely fill subimage 3. Therefore, some columns of the subsequent image are already placed in this subimage 3 to make efficient use of the columns. Those columns are not allowed to be the leftmost or rightmost ones of this subsequent image. There will be some overlap between the two parts, and therefore, some of the outputs will not be useful.

Table III shows the impact of the on-chip memory saving techniques on the I/O communication savings and overhead in numbers of multiplications for the segmentation network [2]. Based on this table, the network is split into two subnetworks to save on-chip weights memory. This also reduces the overhead at the borders between subimages. The decision about the width of the vertical subimages is made in Section V because this is architecture-dependent and the same decision must be made for all networks.
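The resulting multiplication overhead can be estimated with a short calculation. The sketch below follows the reasoning above; the per-border overlap width is left as a parameter because its exact value is network-dependent, and the example numbers are illustrative only:

```python
# Sketch: overhead of splitting an OY-column output into vertical subimages
# of width OY_sub (Section III-G). 'overlap' is the number of columns
# recomputed at each internal border; its exact value depends on the network.
import math

def subimage_overhead(OY, OY_sub, overlap):
    """Subimage count and relative multiplication overhead."""
    n_sub = math.ceil(OY / OY_sub)       # total number of vertical subimages
    computed = n_sub * OY_sub            # columns actually computed; without the
                                         # Fig. 4 trick the last subimage is padded
    computed += (n_sub - 1) * overlap    # double-computed columns at each border
    return n_sub, computed / OY - 1

# Illustrative numbers only: a 2160-column output, 96-column subimages, and
# 8 overlapping columns per internal border.
print(subimage_overhead(OY=2160, OY_sub=96, overlap=8))  # (23, ~0.10)
```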
TABLE II: Comparison of Layer-First and DF Approaches.

TABLE III: Impact of Memory Reduction Techniques. Three Numbers Are Given for Each Combination: The On-Chip Memory [kB], I/O Communication Savings in Comparison With Layer-First, and Overhead in Terms of Additional Multiplications [%].

Fig. 5. Architecture of an FPGA processor.

IV. ARCHITECTURE DISCUSSION

This section describes a coprocessor making use of the line-based DF approach with splits in vertical subimages. Fig. 5 summarizes the complete coprocessor architecture. This section first discusses the PE array and adder trees where the actual computations take place, followed by the memory hierarchy. The FPGA coprocessor is instruction programmable. The instruction contains all dimensions of the CNN, such as $N$, $C_i$, $K_N$, and $F_i$ of the different layers. The host sends this instruction to the FPGA. After this, the FPGA can handle the complete execution of the CNN and does not need any further instructions from the host. As a result, during execution, only data (and very few control bits) have to be sent between the FPGA and the host. This also means that the PE array must have a high useful utilization across kernel sizes in order to have a high throughput. Useful utilization means that, during every clock cycle, as many PEs in the PE array as possible perform useful computation.

Subsequently, the FSM and communication between the FPGA and the host processor are discussed in Section IV-C. The section ends with a paragraph discussing the other modules.

A. Flexible PE Array and Adder Trees With High Utilization Across Kernel Sizes

The module "PE array and adder trees" is the coprocessor's most important module. It gets data from the weights' memory and the features' memory and, variably, from the biases memory or its own outputs to add the temporary results, provided through the distributor module (see Fig. 5). Exploiting the parallelization options discussed in Section III-F, the pseudocode for the computation of one new line of a convolutional layer is given in Algorithm 1. Due to the DF principle, OX is not involved.

1) Spatial Unrolling of OY/IY and FY: 1-D Convolution: The four bottom lines of the pseudocode implement a 1-D convolution in the horizontal dimension. OY/IY and FY are placed in the inner loop to be able to reuse weights and
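Algorithm 1 itself falls outside this excerpt. As a rough reconstruction from the description above (the loop names, ordering, and indexing are inferred and should be treated as assumptions, not the paper's exact pseudocode), the loop nest for one new output line could look as follows, with the four bottom lines forming the 1-D convolution:

```python
# Hypothetical reconstruction of the Algorithm-1 loop nest for one new output
# line of a layer (inferred from the text; OX does not appear because the
# line-based DF schedule produces one row at a time). Loops marked "u" in the
# paper (OYu, FYu, Ku, Cu, FXu, ...) are spatially unrolled in hardware; here
# everything is written sequentially for clarity.
def compute_new_line(I, W, B, K, C, FX, FY, OY):
    # I[c][fx][iy]: the FX buffered input rows per channel (width OY + FY - 1)
    # W[k][c][fx][fy]: weights; B[k]: biases
    O = [[B[k] for _ in range(OY)] for k in range(K)]  # start from the bias
    for k in range(K):              # output channels
        for c in range(C):          # input channels
            for fx in range(FX):    # kernel rows
                # Four bottom lines: 1-D convolution in the horizontal
                # dimension; OY and FY innermost to maximize weight reuse.
                for oy in range(OY):
                    for fy in range(FY):
                        O[k][oy] += W[k][c][fx][fy] * I[c][fx][oy + fy]
    return O
```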
TABLE VI: Memory Dimensions.

TABLE V: Example for Memory Mapping. "Ly" Stands for "Layer."

Fig. 7. Processor contains a pipelined adder tree for every column of the PE array. For each supported kernel filter dimension, the colored squares at the left of the figure indicate how many rows of the PE array need data from the same input channel C. Each square at the right represents a register; a register with two inputs performs a summation.

Table VI contains the dimensions of all memories. The features' memories have three possible inputs.
1) The host interface memory to get the features of the input layer.
2) Their own output for vertical padding. This is not strictly necessary; it would be possible to use these features from their original addresses. However, this copying takes negligible time and simplifies the design of the memory address calculator.
3) The newly computed features.
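As an aside, the behavior of the pipelined adder tree of Fig. 7 can be mimicked in software to check its latency and throughput. The sketch below is a hypothetical model, not the RTL; a column height of n = 16 (a power of two) is assumed purely for illustration:

```python
# Sketch of a pipelined adder tree for one PE-array column (cf. Fig. 7).
# Each clock cycle a new vector of n products enters; every stage register
# sums two outputs of the previous stage. Latency is log2(n) cycles, and one
# result leaves the tree per cycle once the pipeline is full.
def adder_tree_pipeline(product_vectors, n=16):
    n_stages = n.bit_length() - 1                 # log2(n) register stages
    stages = [[0] * (n >> (s + 1)) for s in range(n_stages)]
    # Append n_stages zero vectors to flush the last results out.
    for products in product_vectors + [[0] * n] * n_stages:
        out = stages[-1][0]                       # result of the oldest vector
        # Shift the pipeline from the last stage back to the first, so every
        # stage reads the previous stage's value from the previous cycle.
        for s in reversed(range(1, n_stages)):
            prev = stages[s - 1]
            stages[s] = [prev[2 * i] + prev[2 * i + 1]
                         for i in range(len(prev) // 2)]
        stages[0] = [products[2 * i] + products[2 * i + 1] for i in range(n // 2)]
        yield out

# The first log2(n) outputs are pipeline-fill zeros:
print(list(adder_tree_pipeline([[1] * 16])))      # [0, 0, 0, 0, 16]
```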
TABLE VII: FPGA Details. An MAC Counts for Two Operations: A Multiplication and a Sum.

…is triggered when it must compute a new line of a given convolutional layer.

D. Other Modules

When the computation of outputs of the convolutional layer is ready, the results are requantized to 8 bits and rounded. If the convolutional layer under execution is the last one, the FPGA sends the outputs of this module to the host processor using a 32-bit-width FIFO as well. Otherwise, the outputs go to the ReLU operation and horizontal padding [23]. The horizontal padding is different for the subimages. For the leftmost subimage, horizontal padding happens on the left-hand side. For the rightmost subimage, horizontal padding happens on the right side. This horizontal padding happens for every line. Therefore, the FPGA processor must know which kind of vertical subimage the one under execution is. The host processor indicates this with two bits in the instruction sent from the host to the FPGA before the execution of the subimage: "00" for the leftmost, "01" for the rightmost, and "10" otherwise. The outputs of these modules go to the inputs of the features' memories.

V. MEASUREMENTS AND RESULTS

A. Setup and Goal

This section investigates the performance of the proposed innovations introduced in Section IV: both the throughput and I/O bandwidth gain for line-based DF in comparison with layer-first and the high utilization rate for different kernel sizes will be demonstrated. The experiments also prove the scalability of the design across platforms by implementing the design on multiple FPGAs with differences in the number of LUTs and DSPs available. The experiments to derive the number of DSPs used and the maximal clock frequency are done in Vivado 2018.2. In order to compare many networks on many FPGAs and sweep parameters, such as wstat, a model is built based on the equations as derived in Sections III and IV and equations to take the overlap between vertical subimages into account. Table VII contains the FPGAs under study, with their corresponding number of columns NC, clock frequency, and maximal number of GOPS. NC is chosen to be as high as the number of DSPs on the given board allows. Sometimes, choosing a smaller NC would increase the utilization but decrease the throughput, as fewer computations can happen in each clock cycle. In this work, the ambition is to combine high utilization with a high number of PEs to achieve optimal throughput.

Fig. 9. Simulation results for all networks on Zynq XCZU9EG with an input image with full-HD (3840 × 2160) resolution to investigate the influence of wstat.

First, a good value for wstat is derived based on results for the Zynq XCZU9EG. It will turn out that this is a tradeoff between memory use and utilization (and therefore throughput). The value of wstat is chosen based on the results for the segmentation [2], super-resolution [15], denoising [16], depth estimation [17], and SPyNet network [18]. This experiment shows the flexibility of the processor and the high utilization of the multipliers for all kernel dimensions. For the chosen value of wstat in the first experiment, the advantages of DF are tested for all FPGAs under study with the denoising network [16] with three convolutional layers (kernels 9 × 9, 5 × 5, and 5 × 5) with a full-HD image at the input.

B. Results and Discussion

1) Selection of wstat and Utilization: Fig. 9 shows the influence of wstat on the total useful utilization for the five networks under study. The useful utilization is shown both with and without the image combining trick of Fig. 4. When using this trick, the useful utilization rises with wstat as the subimage width increases, and therefore, fewer multiplications are performed twice at the border between subimages. When not using this trick, the total useful utilization is lower and also not strictly increasing with wstat. The overhead at borders between subimages decreases, but the overhead due to the fact that the image width does not perfectly fit the increasing subimage width becomes a dominant factor.

The useful utilization of the super-resolution network is the lowest. This is mostly due to the first and last layers. The first layer has a 3 × 3 kernel, so Cu = 5. However, the first layer has only one input channel, leading to a useful utilization of less than 20%. The last layer is also a 3 × 3 kernel, so Ku = 2. The last layer has only one output channel, leading to a useful utilization of less than 50%. In this network, the two intermediate convolutional layers are quite small, leading to a nonnegligible drop in useful utilization and, therefore, throughput.

For wstat bigger than 6, the utilization is not increasing significantly anymore. However, the amount of memory needed to execute those three networks with the same bitfile is increasing. The optimal value for wstat is a tradeoff between memory use and utilization (and therefore throughput). In this design, wstat equal to 6 is chosen. When using the image combining trick, the useful utilization is 78.7% for the super-resolution network and between 88.3% and 93.1% for the four other networks. This leads to throughputs between 588 and 695 GOPS.
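The quoted per-layer numbers follow from a simple occupancy argument: with unrolling factors Cu and Ku, input and output channels are processed in groups of Cu and Ku, and partially filled groups leave PEs idle. A back-of-the-envelope sketch of this argument is given below; the channel counts in the examples are assumed, as the exact layer dimensions are not listed in this excerpt, and the full model of Section V-A additionally accounts for subimage borders:

```python
# Sketch: useful utilization of the Cu x Ku unrolling for one layer. Lanes
# idle whenever C or K is not a multiple of the unrolling factor.
import math

def layer_utilization(C, K, Cu, Ku):
    """Fraction of the PE multipliers doing useful work for one layer."""
    util_c = C / (math.ceil(C / Cu) * Cu)   # idle lanes in the C unrolling
    util_k = K / (math.ceil(K / Ku) * Ku)   # idle lanes in the K unrolling
    return util_c * util_k

# First layer of the super-resolution network: 3x3 kernel => Cu = 5, but only
# one input channel (K = 32 output channels is an assumed value).
print(layer_utilization(C=1, K=32, Cu=5, Ku=2))   # 0.2 -> "less than 20%"

# Last layer: 3x3 kernel => Ku = 2, but only one output channel
# (C = 32 is again assumed).
print(layer_utilization(C=32, K=1, Cu=5, Ku=2))   # ~0.46 -> "less than 50%"
```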
TABLE VIII: Simulation Results for FPGA Coprocessor While Executing a Denoising Network [16] With an Input Image With Full-HD (3840 × 2160) Resolution.

…of the figure. The exact throughput gains for DF are given in Table VIII.

For a small FPGA with few columns and therefore small vertical subimages, the throughput is a little worse for DF in comparison with layer-first because the border effects between subsequent subimages lead to more double computations and the number of subsequent subimages itself also increases. For FPGAs with more DSPs, this problem is solved. The deeper the network and the bigger the kernel filter sizes, the more DSPs are necessary to have an advantage from DF in comparison with layer-first.
TABLE IX: Comparison With State of the Art. "ut" Stands for "Utilization."

…PE array needs access to 15 input channels at the same time, such that 15 features' memories are needed instead of 5. This drastically increases the required bandwidth. The additional features' memories can be used to store channels of the already supported kernels (as explained for the 7 × 7 kernel), but this leads to more multiplexers in front of the PEs. The theoretical maximal utilization is 100%.

B. 1-D Kernels

In modern networks, square kernels are sometimes replaced by a vertical and a horizontal kernel to save weights and reduce computation time. For a horizontal kernel, the "for fx" statement in the FSM for the computation of one line of a layer becomes "for fx = 0:0." As a 1 × 1 kernel can be supported as suggested earlier, a vertical kernel can be implemented as FX kernels of 1 × 1. The results can be added together in the same way as for a square kernel.

C. Stride of 2

To support strides of 2, the architecture needs to be changed, as the number of input pixels that are necessary to compute the output pixels of a given line doubles. Therefore, the current features' memory bandwidth is only high enough to keep half of the columns busy. To keep all of them busy, we can double the Ku: use the left half of the PE array for Ku output channels and the right half for another Ku output channels.

Fig. 11. Impact of residual blocks on the memory design.

D. Residual Blocks

In residual blocks [24], the output of a given convolutional layer is added to the output of a further convolutional layer, as illustrated in Fig. 11. In the line-based DF approach, lines that are not used anymore are removed from memory, as explained in Section III. When using residual blocks, more lines must be stored, as illustrated in the figure. The number of lines to be stored must be increased by (FX − 1)/2 for each convolutional layer before the features layer with which the output must be added.
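This bookkeeping rule is easy to mechanize. The sketch below computes the extra lines for one residual connection; the layer-list format and the exact range of affected layers are our interpretation of the rule above, so treat it as an assumption:

```python
# Sketch: extra feature lines to buffer for a residual connection that adds
# the output of layer `src` to the output of layer `dst`. Rule from the text:
# (FX - 1) / 2 extra lines for each convolutional layer in between.
def extra_lines_for_residual(FX_per_layer, src, dst):
    """FX_per_layer[l] is the kernel height of convolutional layer l."""
    return sum((FX_per_layer[l] - 1) // 2 for l in range(src + 1, dst + 1))

# Example: two 3x3 layers inside the residual block -> 2 extra lines.
print(extra_lines_for_residual([3, 3, 3, 3], src=0, dst=2))
```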
VIII. CONCLUSION

This article presents an FPGA CNN coprocessor optimized for pixel processing algorithms. There are four main innovations introduced to overcome the challenges of the compute- and memory-intensive pixel processing CNNs.

First, the I/O communication between the host processor and the FPGA is reduced to a minimum using the DF principle.

Second, six DF approaches are compared in terms of memory and parallelization options. The newly proposed line-based DF approach is chosen above the tile-, pixel-, intermediate-line-, and multiline-based approaches.

Third, a dataflow scheme maintains high utilization and a high data reuse factor across a wide range of CNN kernel sizes. This reduces on-chip memory bandwidth requirements and again increases throughput. Specifically, a mapping that is flexible in terms of the Cu and Ku unrolling in function of the kernel size is implemented.

Fourth, a model is built to investigate the influence of DF versus layer-first and the useful utilization rate. The model is also used to decide hardware parameters. For the benchmarked networks, the reduction in I/O communication due to DF goes up to 81×. The abovementioned model shows that the DF principle causes a gain in throughput in comparison with the layer-first principle due to the reduction in the number of clock cycles that the FPGA has to wait for I/O. For four out of the five benchmarked networks on the Zynq XCZU9EG platform, the useful utilization for the complete network is between 88.3% and 93.1%, higher than or on par with the state of the art, leading to a throughput between 659 and 695 GOPS. For one network, the utilization was lower: 78.7%.

The proposed solution is fully flexible and programmable. This allows executing a wide range of CNN models on the same FPGA platform, without intermediate costly reconfiguration of the bitfile.

REFERENCES

[1] Y. Lyu, L. Bai, and X. Huang, "ChipNet: Real-time LiDAR processing for drivable region segmentation on an FPGA," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 66, no. 5, pp. 1769–1779, May 2019.
[2] Y. Lyu, L. Bai, and X. Huang, "Real-time road segmentation using LiDAR data processing on an FPGA," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), 2018, pp. 1–5.
[3] S. Michalik, S. Michalik, J. Naghmouchi, and M. Berekovic, "Real-time smart stereo camera based on FPGA-SoC," in Proc. IEEE-RAS 17th Int. Conf. Humanoid Robot. (Humanoids), Nov. 2017, pp. 311–317.
[4] C. Hahne, A. Lumsdaine, A. Aggoun, and V. Velisavljevic, "Real-time refocusing using an FPGA-based standard plenoptic camera," IEEE Trans. Ind. Electron., vol. 65, no. 12, pp. 9757–9766, Dec. 2018.
[5] Y. Ma, Y. Cao, S. Vrudhula, and J.-S. Seo, "Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, Feb. 2017, pp. 45–54.
[6] J. Qiu et al., "Going deeper with embedded FPGA platform for convolutional neural network," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2016, pp. 26–35.
[7] H. Li, X. Fan, L. Jiao, W. Cao, X. Zhou, and L. Wang, "A high performance FPGA-based accelerator for large-scale convolutional neural networks," in Proc. 26th Int. Conf. Field Program. Log. Appl. (FPL), Aug. 2016, pp. 1–9.
[8] Z. Liu, Y. Dou, J. Jiang, and J. Xu, "Automatic code generation of convolutional neural networks in FPGA implementation," in Proc. Int. Conf. Field-Program. Technol. (FPT), Dec. 2016, pp. 61–68.
[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Commun. ACM, vol. 60, no. 6, pp. 84–90, May 2017.
[10] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014, arXiv:1409.1556. [Online]. Available: http://arxiv.org/abs/1409.1556
[11] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," 2018, arXiv:1804.02767. [Online]. Available: http://arxiv.org/abs/1804.02767
[12] Y. Sada, N. Soga, M. Shimoda, A. Jinguji, S. Sato, and H. Nakahara, "Fast monocular depth estimation on an FPGA," in Proc. IEEE Int. Parallel Distrib. Process. Symp. Workshops (IPDPSW), May 2020, pp. 143–146.
[13] S. Annadurai, Fundamentals of Digital Image Processing. London, U.K.: Pearson, 2007.
[14] T. Manabe, Y. Shibata, and K. Oguri, "FPGA implementation of a real-time super-resolution system using a convolutional neural network," in Proc. Int. Conf. Field-Programmable Technol. (FPT), Dec. 2016, pp. 249–252.
[15] T. Manabe, Y. Shibata, and K. Oguri, "FPGA implementation of a real-time super-resolution system using flips and an RNS-based CNN," IEICE Trans. Fundam. Electron., Commun. Comput. Sci., vol. 101, no. 12, pp. 2280–2289, Dec. 2018.
[16] H. Zhao, O. Gallo, I. Frosio, and J. Kautz, "Loss functions for image restoration with neural networks," IEEE Trans. Comput. Imag., vol. 3, no. 1, pp. 47–57, Mar. 2017.
[17] J. Žbontar and Y. LeCun, "Stereo matching by training a convolutional neural network to compare image patches," J. Mach. Learn. Res., vol. 17, no. 1, pp. 2287–2318, Jan. 2016.
[18] A. Ranjan and M. J. Black, "Optical flow estimation using a spatial pyramid network," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4161–4170.
[19] M. Alwani, H. Chen, M. Ferdman, and P. Milder, "Fused-layer CNN accelerators," in Proc. 49th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), Oct. 2016, pp. 1–12.
[20] K. Goetschalckx and M. Verhelst, "Breaking high-resolution CNN bandwidth barriers with enhanced depth-first execution," IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 9, no. 2, pp. 323–331, Jun. 2019.
[21] Q. Xiao, Y. Liang, L. Lu, S. Yan, and Y.-W. Tai, "Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on FPGAs," in Proc. 54th Annu. Design Autom. Conf., Jun. 2017, pp. 1–6.
[22] X. Yang, "Interstellar: Using Halide's scheduling language to analyze DNN accelerators," in Proc. 25th Int. Conf. Architectural Support Program. Lang. Operating Syst., 2020, pp. 369–383.
[23] M. Sewak, M. R. Karim, and P. Pujari, Practical Convolutional Neural Networks: Implement Advanced Deep Learning Models Using Python. Birmingham, U.K.: Packt Publishing Ltd., 2018.
[24] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.

Steven Colleman was born in Lier, Belgium, in 1995. He received the M.Sc. degree in electrical engineering from the KU Leuven, Leuven, Belgium, in 2018, where he is currently pursuing the Ph.D. degree with MICAS in the laboratory of Prof. Dr. Ir. Marian Verhelst. His master's thesis is entitled Optimalisation of a coarse grain reconfigurable array for the efficient mapping of convolutional neural networks.

His research interest lies in the field of efficient processing architectures for embedded deep learning.

Marian Verhelst (Senior Member, IEEE) received the Ph.D. degree from KU Leuven, Leuven, Belgium, in 2008.

She was a Visiting Scholar with the Berkeley Wireless Research Center (BWRC), University of California at Berkeley, Berkeley, CA, USA, in summer 2005. She was a Research Scientist with Intel Labs, Hillsboro, OR, USA, from 2008 to 2011. She is currently an Associate Professor with the MICAS Laboratories, Electrical Engineering Department, KU Leuven. Her research focuses on embedded machine learning, hardware accelerators, HW-algorithm codesigns, and low-power edge processing.

Dr. Verhelst was a member of the Young Academy of Belgium, an Associate Editor for TVLSI, TCAS-II, and JSSC, and a member of the STEM Advisory Committee to the Flemish Government. She is also a member of the Executive Committees of DATE and ISSCC, the TPC Co-Chair of AICAS 2020 and tinyML 2021, and a TPC Member of VLSI and ESSCIRC. She also holds a prestigious ERC Starting Grant from the European Union. She was a Laureate of the Royal Academy of Belgium in 2016. She is an SSCS Distinguished Lecturer.