
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 29, NO. 3, MARCH 2021

High-Utilization, High-Flexibility Depth-First CNN Coprocessor for Image Pixel Processing on FPGA

Steven Colleman and Marian Verhelst, Senior Member, IEEE

Abstract— Recently, CNNs are increasingly exploited for pixel processing tasks, such as denoising, which opens up new challenges due to the increased activation and operation count. This article presents a CNN coprocessor architecture to solve these challenges on field-programmable gate array (FPGA) through four main contributions. First, the I/O communication between the host processor and the FPGA is reduced to a minimum using a depth-first (DF) principle. Three new DF approaches are presented. Second, to ensure high throughput, the increased parallelization opportunities of the proposed line-based DF operation are analyzed. Third, programmability is introduced to the compute array to enable a broad deployment while maintaining high utilization of the available multipliers [digital signal processing (DSP) blocks], independently of the kernel dimensions and without control of the host processor. This is in contrast with many state-of-the-art FPGA implementations, which focus on only one algorithm and/or one kernel topology. Fourth, a model is built to investigate the influence of architecture parameters and show the benefits of DF. The scalable design can be deployed on a wide range of FPGAs, maintaining 78%–93% DSP utilization across all algorithms (denoising, optical flow, depth estimation, segmentation, and super-resolution) and FPGA platforms. Up to 695 GOPS is achieved on a Zynq XCZU9EG board, matching state-of-the-art performance with a more flexible design. The throughput is compared with other pixel processing architectures on FPGA.

Index Terms— Convolutional neural network (CNN), depth-first (DF), field-programmable gate array (FPGA), flexible processor design, high throughput, pixel processing.

Manuscript received July 8, 2020; revised October 6, 2020 and November 9, 2020; accepted December 14, 2020. Date of publication January 14, 2021; date of current version February 24, 2021. This work was supported in part by the European Union European Research Council (EU ERC) project ReSENSE under Grant Agreement ERC-2016-STG-715037, in part by the Fonds Wetenschappelijk Onderzoek Strategisch Basis Onderzoek (FWO SBO) project OmniDrone under Grant Agreement S003817N, and in part by the Flemish Government through the AI Research Program. (Corresponding author: Steven Colleman.) The authors are with the Department of Electrical Engineering, ESAT-MICAS, KU Leuven, 3001 Leuven, Belgium (e-mail: steven.colleman@esat.kuleuven.be; [email protected]). Digital Object Identifier 10.1109/TVLSI.2020.3046125.

I. INTRODUCTION

MACHINE learning (ML) algorithms are nowadays very important in many real-life applications. Convolutional neural networks (CNNs) in particular have shown unprecedented performance for image processing tasks. There is an increasing interest in the execution of such CNNs on field-programmable gate array (FPGA) in order to enable these algorithms at low cost into systems that already embed an FPGA, such as a LiDAR [1], [2] or other camera systems [3], [4]. Section II contains a discussion of state-of-the-art FPGA implementations for both classification and pixel processing algorithms. The general trends will be discussed, together with their shortcomings.

Hereafter, this article presents a memory-efficient, high-utilization FPGA coprocessor for a broad range of pixel processing algorithms consisting of convolutional, ReLU, and padding layers. This coprocessor can receive the network configuration (all parameters in Table I are configurable) from the host processor and execute the complete network independently of the host processor, leaving it available for other tasks.

TABLE I: Parameter Naming Conventions.

This article brings four main contributions.
1) An optimized depth-first (DF) processing principle lifts the external I/O bottleneck by replacing the traditional layer-by-layer processing principle [5]–[8], discussed in Section III. Three new DF approaches are presented: line-, intermediate-line-, and multiline-based approaches.
2) The proposed DF approaches are compared with the state-of-the-art ones in terms of parallelization options and memory use in Section III-F. The proposed approach will show to have more parallelization options, leading to higher utilization for large PE arrays. This work is the first to implement this line-based DF.
3) An efficiently scalable, widely parallel compute fabric guarantees a high-utilization rate across a wide range of kernels and on a wide range of FPGA platforms, as detailed in Section IV-A. The flexible coprocessor can be reprogrammed without requiring a new FPGA bitfile upload, hence avoiding any latency between the executions of two consecutive tasks. This flexibility is
the main advantage over state-of-the-art architectures typically targeting only one algorithm and/or kernel.
4) Equations are derived in Sections III and IV to build a model. Section V uses this model to investigate the influence of hardware parameters and compute the digital signal processing (DSP) utilization and the benefits of DF for a given pixel processing CNN and a given FPGA.

Hereafter, Section VI performs a state-of-the-art comparison with FPGA implementations for both pixel processing algorithm applications and classification applications. The architecture does not support specific layer types that are not included in the networks of interest. Section VII describes how the current architecture can be adapted to handle these layers, with a small loss of performance for the networks under interest. This article ends with a conclusion in Section VIII.

II. BACKGROUND

A. Classification Tasks Versus Pixel Processing Algorithms

Traditionally, CNNs for image processing mostly targeted classification tasks. Popular classification networks are, for example, AlexNet [9], VGG-16 [10], and YOLO [11]. For these classification tasks, many FPGA implementations exist in the literature, where the computation of the CNN happens layer by layer [1], [2], [5], [12].

Nowadays, CNNs are not only used for classification tasks anymore. Another increasingly important application is pixel processing algorithms. These are algorithms that transform an input image into an enhanced output image [13], such as segmentation [1], [2], super-resolution [14], [15], denoising [16], depth estimation [12], [17], and optical flow [18].

Pixel processing algorithms are characterized by large OX and OY values (large intermediate feature map dimensions, e.g., full-HD images (3840 × 2160) here), and this for all layers until the end of the network. This characteristic stems from the fact that pixel processing algorithms output complete (transformed) images and not just a single classification value. Classification algorithms have, in several layers (especially the later ones), a much smaller OX and OY. Therefore, pixel processing algorithms lead to a new memory bottleneck as the weights are not the dominant factor anymore. For traditional layer-by-layer implementations, even if every input must be read from off-chip memory only once and every output must be sent to off-chip memory only once, these high OX and OY lead to I/O and memory bottlenecks.

B. Parallelize Over Layers: Not Flexible Enough

Activation-dominated I/O and memory bottlenecks can be reduced by computing features of multiple layers in parallel. This strategy is presented for both classification tasks in the architectures of [7] and [8] and pixel processing algorithms in the architectures of [14] and [15]. These architectures are pipelined. A certain number of multipliers are assigned for the multiplications of every layer. This number of multipliers must be in proportion to the computation complexity of the given layer as the computation of an output feature must be done roughly at the same time for every layer. Otherwise, multipliers of a given layer have to wait for other layers, leading to a decrease in throughput. This means that these architectures are optimized for just one CNN.

C. Consequences for Our Design

The targeted coprocessor for pixel processing algorithms will be flexible. This means that no reconfiguration of the FPGA is necessary between the executions of different networks, saving latency. However, this means that the strategy of [7], [8], [14], and [15] cannot be used. The restructuring of the processing elements due to the varying ratios of computation complexities over convolutional layers in different CNNs would lead to a huge overhead in the number of multiplexers and connections. To be able to reduce I/O and memory bottlenecks, the coprocessor will make use of the DF algorithm, as explained in Section III. The main difference is that the architecture will not pipeline over layers. In addition, as large activation tensors lead to many computations, the multipliers of the PE array must have high useful utilization, as explained in Section IV. The multipliers of the previously discussed architectures only have to compute convolutions with one fixed kernel size. In our design, multipliers must deliver high useful utilization for many kernel sizes. The PE array must be able to handle all kernel sizes from our benchmark networks: a segmentation [2], super-resolution [15], denoising [16], depth estimation [17], and optical flow [18] network. These kernel sizes are 3 × 3, 5 × 5, 7 × 7, and 9 × 9.

III. DEPTH-FIRST

Traditional implementations evaluate a CNN layer by layer, hereafter called layer-first operation. While layer-first computation is efficient in terms of weight movements, it results in a large I/O communication for input and output activations. More specifically, layer-first computation requires complete intermediate feature maps to either be stored in embedded memory on FPGA or be sent to the external host processor and later fetched back for the computation of the following layer. While this was manageable for classification CNNs, this becomes a problem for pixel processing algorithms. They typically work with bigger sized input images and also have large intermediate features. This article will, therefore, improve and exploit a DF processing principle, which immediately consumes feature outputs of a given convolutional layer to compute feature outputs of the next convolutional layer, avoiding I/O data transfers. Sections III-A–III-C discuss three state-of-the-art DF strategies. Sections III-D and III-E present three new DF methodologies. Section III-F performs the comparison between the six discussed DF principles.

A. Tile-Based Depth-First Without Reuse

The DF principle can naively be deployed in a tile-based manner without reuse of intermediate features [19], as shown in Fig. 1(a). To compute one pixel at the output of the CNN, a patch with the height and width of FN is necessary for the last intermediate layer. To obtain this, a patch with width and height FN + FN−1 − 1 is needed in the previous intermediate
layer. Therefore, by backpropagation, to compute one pixel at the output of the CNN, a patch with the height and width of Σ_{i=1..N} (Fi − 1) + 1 is necessary at the input. This is called a tile. The output features of a convolutional layer can be used immediately to compute the output features of the next convolutional layer. In the example in the figure, a 3 × 3 patch in the intermediate layer is computed. These nine pixels are used immediately to compute the output pixel. In this way, every intermediate feature result propagates along with the depth of the network through all layers till the end of the neural network. This is why this technique is named "depth"-first. After this, the output is sent to the host processor, and the host processor gives new input features. Therefore, no I/O communication is needed for intermediate features. No intermediate features are stored on-chip. This technique has two major drawbacks: input features have to be sent multiple times to FPGA and intermediate features have to be computed multiple times. In the example, the 3 × 3 patches in the intermediate layers to compute the red and blue output pixels are overlapping. Every intermediate feature is computed (Σ_{j=i+1..N} (Fj − 1) + 1)² times instead of 1 in the i-th convolutional layer.

Fig. 1. Tile-based DF (a) without and (b) with reuse. A CNN of three feature layers is shown, from top to bottom. In every convolutional layer, the pixels that are needed to compute the red and blue output pixels are indicated. Every row is computed pixel by pixel. The blue pixel is computed after the red pixel. After finishing one row, the next row is evaluated. In (a), these are indicated with squares because the technique of (a) does not make use of overlap. In (b), therefore, the pixels are filled with colors. With reuse, the intermediate features in purple are reused for the computation of the blue output feature. These are stored in memory.

B. Tile-Based Depth-First With Horizontal Data Reuse

Reusing input and intermediate features [purple in Fig. 1(b)] for the computation of subsequent output pixels due to the overlap of tiles reduces these two drawbacks [19]. They are stored in on-chip memory. After the computation of a new output feature, the leftmost column of the reusable intermediate features is replaced by the rightmost column of the computed intermediate features. Still, every intermediate feature is computed Σ_{j=i+1..N} (Fj − 1) + 1 times instead of 1 in the i-th convolutional layer because this technique only makes use of the overlap of tiles for the computation of horizontally neighboring pixels.
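To make these two counts concrete, the following small Python sketch (an illustration added here, not taken from the article; layer indices are 0-based) computes the input tile side length and the per-layer recomputation factors for a given list of kernel sizes, e.g., the 9 × 9, 5 × 5, 5 × 5 kernels of the denoising network [16]:

```python
# Tile-based DF bookkeeping: input tile side length and how often an
# intermediate feature of layer i is recomputed, without and with
# horizontal reuse (Sections III-A and III-B).
def tile_side(kernel_sizes):
    return sum(f - 1 for f in kernel_sizes) + 1

def recompute_factor(kernel_sizes, i, horizontal_reuse=False):
    extent = sum(f - 1 for f in kernel_sizes[i + 1:]) + 1
    return extent if horizontal_reuse else extent ** 2

kernels = [9, 5, 5]                       # e.g., the denoising network [16]
print(tile_side(kernels))                 # 17: input patch side per output pixel
print([recompute_factor(kernels, i) for i in range(3)])        # [81, 25, 1]
print([recompute_factor(kernels, i, True) for i in range(3)])  # [9, 5, 1]
```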

C. Pixel-Based Depth-First

To further exploit intermediate feature data reuse, one can also exploit reuse in the vertical direction [20], [21], as shown in Fig. 2(a). When the processor gets one new pixel from the input layer, one new feature of the first intermediate layer is computed. This feature is used immediately to compute one new feature of the second intermediate layer. In the example, we have just computed a new feature of layer 1 and use this to compute a new feature of layer 2. This principle is proven to be both memory and I/O bandwidth-efficient, as it is not necessary to load and unload all features in the memory of the FPGA. Every intermediate feature is computed only once, and every input feature has to be sent only once from the host processor to the FPGA.

The architectures in Section II-B can be seen as pipelined pixel-based DF architectures.

D. Line-Based Depth-First

The pixel-based DF principle can be extended to take in a complete line of input features in every pass through the network, as shown in Fig. 2(b). When the computation of one line of a layer is done, the network jumps to the next layer to process the new output line. Only when the line reaches the last layer, the network takes in a new input line and starts again in the first layer. For every intermediate layer, the number of rows to be stored in the internal memory of the FPGA processor is equal to the FX of the subsequent convolutional layer, instead of FX − 1 rows and FX pixels for the pixel-based DF principle. When a new line of intermediate features is computed, it can replace the oldest line entry of this layer as this line is not used anymore. In the example shown in Fig. 2(b), the blue line will overwrite the orange line.

E. Intermediate-Line-Based and Multiline-Based Depth-First

In intermediate-line-based DF, only one complete line for each intermediate layer is stored. The other lines contain temporary results. When the computation of one line is done, this line is used to update the FX lines of the next layer on which it has influence. The lower FX − 1 lines are still intermediate results, and the upper one contains the final results and updates the subsequent layer. In the example shown in Fig. 2(c), we have just updated the five green lines of layer 1. Only the upper one of them contains final results, and the lower ones have to be updated later. This upper green line is now used to update the three blue lines of layer 2. It is the last line from layer 1, which is needed to compute the upper blue line of layer 2, so this line will contain complete results and will be used to propagate further. The two bottom blue lines have to be updated further during the next one/two propagations of a new input line throughout the network. In this way, every multiplication is performed also only once. The number of updates is equal
to Fi. Therefore, in the i-th convolutional layer, Fi times more writes are necessary.

Fig. 2. (a) Pixel-, (b) line-, (c) intermediate-line-, and (d) multiline-based DF approaches. Each big trapezoid represents a feature layer. Each small trapezoid represents all the channels from one line of the activation data of one layer. The colors of the lines represent the current state of the given line: computed or not yet, stored on-chip or already removed, sent to off-chip memory, and so on. These colors explain the DF principle: a new pixel/line/multiple lines of the input layer comes in and propagates through the end of the network with few memory resources.

The multiline-based DF approach is a variant of the line-based approach where L lines are computed at a time instead of 1. In Fig. 2(d), L equals 2.
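To get a feel for these storage requirements, the sketch below (an added illustration with placeholder channel counts and line width, not the values behind Table II) estimates the line-buffer footprint of plain line-based DF, where layer i keeps as many rows as the FX of the subsequent convolutional layer:

```python
# Rough line-buffer estimate for line-based DF with 8-bit features:
# each intermediate layer i stores F_{i+1} rows of K_i channels x line_width pixels.
def line_buffer_bytes(kernel_sizes, out_channels, line_width):
    total = 0
    for f_next, k in zip(kernel_sizes[1:], out_channels[:-1]):
        total += f_next * k * line_width
    return total

# Hypothetical three-layer network (channel counts and width are placeholders):
print(line_buffer_bytes([9, 5, 5], [32, 32, 1], 3840) / 1e6, "MB")  # ~1.23 MB
```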

F. Comparison of Depth-First Principles

This section compares the DF approaches where all features are computed only once. As explained in Sections III-A and III-B, this is not the case for tile-based DF. The four remaining DF approaches are compared with the layer-first approach in Table II. The table compares the I/O bandwidth, the memory to store the intermediate features, and the parallelization/unrolling options in hardware for the algorithms. The suffix "u" after a loop name indicates spatial unrolling [22].

The I/O bandwidth is much lower for all DF approaches in comparison with layer-first: 81×, 79.2×, 43.7×, 29.8×, and 3.9× for the super-resolution [15], segmentation [2], denoising [16], optical flow [18], and depth estimation network [17]. The last factor is lower because the CNN of the depth estimation algorithm contains many output channels: KN = 64. The line-based DF approach needs a little more features' memory than the pixel-based (50%, 25%, 24.7%, 25%, and 50%) but is preferred due to its parallelization options: OYu and higher potential for IYu. To fully exploit the additional parallelization option of the multiline-based approach, L has to be taken high enough, but this causes higher memory requirements. The intermediate-line-based approach needs more memory writes than the line-based approach. Therefore, in Section IV-A, a line-based DF supporting PE array is presented. The possible loop unrollings are OYu, IYu, FYu, Ku, Cu, IXu, and FXu.

G. On-Chip Memory Savings

The features' memory as in Table II will be too high for many FPGAs (2.3 MB for the optical flow network). Two solutions are introduced in the pixel-based approach of [20].
1) Divide the network into two or more subnetworks that are executed DF within the subnetworks but sequentially across the subnetworks.
2) Divide the image in vertical subimages that are processed independently with input width IY′ (and according output width OY′, as shown in Fig. 3).

Fig. 3. Border effects between two neighboring subimages.

Both solutions will, however, slightly increase the required I/O bandwidth again due to the following:
1) sending internal layers in between the subnetworks to the host processor for solution 1;
2) overlap of the vertical subimages for solution 2, as illustrated in Fig. 3.

When dividing the image in subimages, there is overhead in the number of multiplications performed because the image size will not perfectly fit the PE array subimage size. The total number of vertical subimages is ⌈OY/OY′⌉. Fig. 4 illustrates how the overhead can be reduced using parts from two different images at the same time. The right part of the first image does not have enough columns to completely fill subimage 3. Therefore, already some columns of the subsequent image are placed in this subimage 3 to have efficient use of the columns. Those columns are not allowed to be the leftmost or rightmost ones of this subsequent image. There will be some overlap between the two parts, and therefore, some of the outputs will not be useful.

Table III shows the impact of the on-chip memory saving techniques on the I/O communication savings and the overhead in the number of multiplications for the segmentation network [2]. Based on this table, the network is split into two subnetworks
to save on-chip weights memory. This also reduces the overhead at the borders between subimages. The decision about the width of the vertical subimages is made in Section V because this is architecture-dependent and the same decision must be made for all networks.
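The following sketch (an added illustration; it assumes that each border between subimages costs an overlap of Σ(Fi − 1) recomputed columns, and all numeric arguments are placeholders rather than the values of Table III) shows how the subimage count and the multiplication overhead can be estimated:

```python
import math

# Vertical subimage bookkeeping: ceil(OY / OY_sub) subimages, plus the relative
# overhead in multiplications caused by border overlap and by the last, only
# partially filled subimage (reduced by the two-image trick of Fig. 4).
def subimage_overhead(OY, OY_sub, kernel_sizes):
    n_sub = math.ceil(OY / OY_sub)
    overlap = sum(f - 1 for f in kernel_sizes)  # assumed columns recomputed per border
    computed = n_sub * OY_sub + (n_sub - 1) * overlap
    return n_sub, computed / OY - 1.0

print(subimage_overhead(OY=2160, OY_sub=96, kernel_sizes=[9, 5, 5]))
```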
TABLE II: Comparison of Layer-First and DF Approaches.

TABLE III: Impact of Memory Reduction Techniques. Three numbers are given for each combination: the on-chip memory [kB], the I/O communication savings in comparison with layer-first, and the overhead in terms of additional multiplications [%].

Fig. 4. Reduce subimage overhead using two frames/images.

Fig. 5. Architecture of an FPGA processor.

IV. ARCHITECTURE DISCUSSION

This section describes a coprocessor making use of the line-based DF approach with splits in vertical subimages. Fig. 5 summarizes the complete coprocessor architecture. This section first discusses the PE array and adder trees where the actual computations take place, followed by the memory hierarchy. The FPGA coprocessor is instruction programmable. The instruction contains all dimensions of the CNN, such as N, Ci, Ki, and Fi of the different layers. The host sends this instruction to the FPGA. After this, the FPGA can handle the complete execution of the CNN and does not need any further instructions from the host. This means that, during execution, only data (and very few control bits) have to be sent between FPGA and host. This also means that the PE array must have a high useful utilization across kernel sizes in order to have a high throughput. Useful utilization means that, during every clock cycle, as many PEs in the PE array as possible have to perform useful computation.

Subsequently, the FSM and communication between the FPGA and the host processor are discussed in Section IV-C. The section ends with a paragraph discussing the other modules.

A. Flexible PE Array and Adder Trees With High Utilization Across Kernel Sizes

The module "PE array and adder trees" is the coprocessor's most important module. It gets data from the weights' memory, the features' memory, possibly the biases memory, and its own outputs to add the temporary results provided through the distributor module (see Fig. 5). Exploiting the parallelization options discussed in Section III-F, the pseudocode for the computation of one new line of a convolutional layer is given in Algorithm 1. Due to the DF principle, OX is not involved.

Algorithm 1: Computation of One New Line of a Convolutional Layer (pseudocode).

1) Spatial Unrolling of OY/IY and FY: 1-D Convolution: The four bottom lines of the pseudocode implement a 1-D convolution in the horizontal dimension. OY/IY and FY are placed in the inner loop to be able to reuse weights and
features data over subsequent clock cycles. The weights are kept stationary for ⌈OY′/OYu⌉ clock cycles. This number is called wstat throughout the text. The spatial unrolling of FY, OY, and IY is shown in Fig. 6. FY is distributed horizontally (every multiplier in a row has the same weight) and IY diagonally (every multiplier in a diagonal has the same input feature). The outputs of the multipliers in the same column are accumulated to form a single intermediate result. NC + 6 inputs are needed to have NC outputs, with NC being the number of columns of the PE array. Every clock cycle, we shift the inputs NC positions to perform computations for the subsequent NC pixels of the currently computing line. After the first clock cycle, only NC new outputs are necessary from the memory because it is possible to reuse FY − 1 = 6 input features of the previous clock cycle to compute the NC subsequent outputs. This is illustrated with the dashed lines. This is done for wstat clock cycles, in which the weights are stationary. After this, we perform the convolution of the second kernel filter row with the corresponding input features row for wstat clock cycles. This is repeated FX times, once for every kernel filter row. The sum of the outputs of a given clock cycle has to be added with the outputs of the multipliers wstat clock cycles later as they contribute to the same output features.

Fig. 6. Data parallelism to compute 1-D convolution. A 7 × 7 kernel is used as an example. Every square is a multiplier. "k" is the value of "oy" in the pseudocode.

After FX · wstat clock cycles, a complete convolution for a vertical subimage is computed. To exploit more parallelism, we compute multiple convolutions in parallel, as discussed in the next paragraph.
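The reference-style Python model below is an added illustration of this schedule (single input and output channel, no padding, and arbitrary example sizes); it is not the actual Algorithm 1, but it mimics the loop order: the outer loop walks over the FX kernel rows with the weights kept stationary for wstat clock cycles, and the inner loops perform the 1-D convolution over OY and FY.

```python
import numpy as np

def compute_one_line(in_rows, weights, wstat, NC):
    """Software model of one output line: in_rows has shape (FX, IY),
    weights has shape (FX, FY); NC output pixels are produced per 'cycle'."""
    FX, FY = weights.shape
    OY = in_rows.shape[1] - FY + 1        # no padding in this toy model
    out = np.zeros(OY)
    for fx in range(FX):                  # one kernel filter row at a time
        for step in range(wstat):         # weights stay stationary for wstat cycles
            for oy in range(step * NC, min((step + 1) * NC, OY)):
                for fy in range(FY):      # 1-D convolution in the horizontal dimension
                    out[oy] += weights[fx, fy] * in_rows[fx, oy + fy]
    return out

# Example: IY = 102, FY = 7 -> OY = 96 = 6 x 16, i.e., wstat = 6 with NC = 16.
line = compute_one_line(np.random.rand(7, 102), np.random.rand(7, 7), wstat=6, NC=16)
```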
TABLE IV: Design Choices and Utilization Consequences for Supported Kernel Dimensions.

2) Spatial Unrolling of C and K: The spatial unrolling of K is in the outer loop to minimize the features' memory writes and/or the need for registers to store intermediate results. In the previous paragraph, we have shown that not every kernel needs the same number of rows of the PE array for the computation of one kernel. Therefore, we also unroll over C and K to obtain a high utilization for the processing elements. The PE array will consist of PE subarrays, such as in Fig. 6. The number of rows is chosen to be equal to 30. The corresponding values for Cu, Ku, and maximal utilization can be found in Table IV. FXu is chosen to be equal to 1.

C will not always be a multiple of Cu, and K will not always be a multiple of Ku. This is the consequence of making a processor which is generally applicable but not optimized for a given network. This will decrease the useful utilization of the rows. The theoretical maximal useful utilization is multiplied with a factor (C / (Cu · ⌈C/Cu⌉)) · (K / (Ku · ⌈K/Ku⌉)).
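A one-line helper (added here for illustration) makes this rounding loss explicit; the Cu = 5 and Ku = 2 values for a 3 × 3 kernel are the ones used later in Section V-B:

```python
import math

def useful_utilization(C, K, Cu, Ku, max_util=1.0):
    """Maximal utilization multiplied by the rounding losses when C is not a
    multiple of Cu and K is not a multiple of Ku."""
    return max_util * (C / (Cu * math.ceil(C / Cu))) * (K / (Ku * math.ceil(K / Ku)))

print(useful_utilization(C=32, K=3, Cu=5, Ku=2))  # ~0.69 for this example layer
```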
Each column of the PE array is followed by a pipelined adder tree to sum the results, as shown in Fig. 7. To add the 30 outputs of a given column with temporary results, the adder tree has two intermediate inputs. These inputs also make it possible to add a bias. A multiplexer is necessary at the input of the 16th and 32nd inputs of the PE array to select the correct bias/temporary result. The choice of 30 rows has the benefit that the number of inputs of the adder tree is a power of 2.

Fig. 7. Processor contains a pipelined adder tree for every column of the PE array. For each supported kernel filter dimension, the colored squares at the left of the figure indicate how many rows of the PE array need data from the same input channel C. Each square at the right represents a register; a register with two inputs performs a summation.

The input features and weights are expressed as dynamic fixed-point 8-bit values. To handle overload with the sum of the products, the outputs of the adder tree and, therefore, the temporary results have an accuracy of 24 bits.

B. Memory

The design contains five features' memories because the PE array needs access to at most five channels in parallel (see Table IV for FY = 3). All features that are required in a single clock cycle are stored across the five memories at identical memory addresses such that they can be read out with a single read address pointer. This simplifies the computation of the addresses in the FSM, reducing the number of FPGA LUTs. As a consequence, not all positions in all memories contain useful data. Specifically, for a 5 × 5 or 9 × 9 kernel, the PE array needs only access to three channels in parallel. For a 7 × 7 kernel, the PE array handles only two input channels in parallel. Theoretically, two of the five memories are sufficient to store input features of a 7 × 7 convolutional layer. The next
two memories store the next two input channels of this layer at the same address. This reduces the depth of the memories to map big networks, especially if all layers have 7 × 7 kernels, such as SPyNet [18]. Table V contains an example of how such a mapping would go for the complete example network in Fig. 8. As illustrated in Fig. 6, the PE array needs more inputs during the first clock cycle than during the next wstat − 1 clock cycles. Therefore, each of the five memories is split into two memories: a small one to store max(Fi − 1) = 8 input features of each line and a bigger one to store NC input features at each address. The depth of the bigger memory must be wstat times deeper than that of the smaller one because the PE array needs data from this memory during wstat subsequent clock cycles while only one clock cycle from the smaller memory. Registers are used to store the data from the big memory that the PE needs again the next clock cycle.

TABLE V: Example for Memory Mapping. "LY" stands for "layer."

Fig. 8. Example network.

TABLE VI: Memory Dimensions. (Note: f(Fi) is the number of features' memories used in parallel for the filter kernel dimension of the i-th convolutional layer, thus f(3) = 5, f(5) = 3, f(7) = 4, and f(9) = 3. Cu(Fi) and Ku(Fi) are, respectively, the numbers of input and output channels the PE array combines for the filter kernel dimension of the i-th convolutional layer. g(Fi) represents the number of biases that are stored in one line of the biases memory in function of the kernel size Fi, so g(3) = g(5) = g(7) = 1 and g(9) = 2.)

Table VI contains the dimensions of all memories. The features' memories have three possible inputs.
1) The host interface memory to get the features of the input layer.
2) Their own output for vertical padding. This is not strictly necessary; it would be possible to use these features from their original addresses. However, this copying takes negligible time and simplifies the design of the memory address calculator.
3) The newly computed features.

C. FSM and Host-FPGA Communication

The transfer of data from off-chip to FPGA and vice versa happens with a 32-bit-width FIFO and some status registers. Algorithm 2 describes the complete DF approach for the execution of a CNN. The pseudocode used to compute one line of one layer is given in Algorithm 1.

Algorithm 2: Computation of All Lines of All Convolutional Layers With Line-Based DF (pseudocode).

The module "FSM convolution" computes the memory addresses such that the PE array can independently perform the computation of one line of one layer (see Algorithm 1). The module computes the correct addresses to read the weights and read/write the features. It also makes sure that a given output channel of a convolutional layer (after ReLU and horizontal padding) is written in the correct one of the five features' memories. The module "FSM control" executes the top two lines of Algorithm 2 and inserts vertical padding such that the output of a convolutional layer has the same dimensions as the input of the convolutional layer. The horizontal padding is discussed in the following. The "FSM convolution" module
is triggered when it must compute a new line of a given convolutional layer.
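As an added, software-level illustration of the control flow that "FSM control" and "FSM convolution" implement together (single channel, no padding or ReLU, and dictionary-based layer descriptions used only for this sketch; it is not the actual Algorithm 1/2 pseudocode):

```python
from collections import deque
import numpy as np

def run_network(input_lines, layers):
    """Line-based DF schedule: a new input line propagates through all layers;
    each layer keeps only FX lines of its input in a small line buffer."""
    buffers = [deque(maxlen=layer["w"].shape[0]) for layer in layers]
    for in_line in input_lines:                    # "FSM control": next input line
        current = in_line
        for buf, layer in zip(buffers, layers):    # depth-first through the layers
            buf.append(current)                    # newest line replaces the oldest
            if len(buf) < buf.maxlen:              # warm-up: not enough rows stored yet
                break
            FX, FY = layer["w"].shape
            rows = np.stack(buf)
            # "FSM convolution" role: compute one output line of this layer
            current = sum(np.convolve(rows[fx], layer["w"][fx, ::-1], mode="valid")
                          for fx in range(FX))
        else:                                      # the line reached the last layer
            yield current                          # -> send to the host processor

layers = [{"w": np.random.rand(9, 9)}, {"w": np.random.rand(5, 5)}, {"w": np.random.rand(5, 5)}]
output_lines = list(run_network(np.random.rand(40, 128), layers))
```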
D. Other Modules

When the computation of outputs of the convolutional layer is ready, the results are requantized to 8 bits and rounded. If the convolutional layer under execution is the last one, the FPGA sends the outputs of this module to the host processor using a 32-bit-width FIFO as well. Otherwise, the outputs go to the ReLU operation and horizontal padding [23]. The horizontal padding is different for the subimages. For the leftmost subimage, horizontal padding happens on the left-hand side. For the rightmost subimage, horizontal padding happens on the right side. This horizontal padding happens for every line. Therefore, the FPGA processor must know which kind of vertical subimage the one under execution is. The host processor indicates this with two bits in the instruction sent from the host to FPGA before the execution of the subimage: "00" for the leftmost, "01" for the rightmost, and "10" otherwise. The outputs of these modules go to the inputs of the features' memories.

V. MEASUREMENTS AND RESULTS

A. Setup and Goal

This section investigates the performance of the proposed innovations introduced in Section IV: both the throughput and I/O bandwidth gain for line-based DF in comparison with layer-first and the high utilization rate for different kernel sizes will be demonstrated. The experiments also prove the scalability of the design across platforms by implementing the design on multiple FPGAs with differences in the number of LUTs and DSPs available. The experiments to derive the number of DSPs used and the maximal clock frequency are done in Vivado 2018.2. In order to compare many networks on many FPGAs and sweep parameters, such as wstat, a model is built based on the equations as derived in Sections III and IV and equations to take the overlap between vertical subimages into account. Table VII contains the FPGAs under study, with their corresponding number of columns NC, clock frequency, and maximal number of GOPS. NC is chosen to be as high as the number of DSPs on the given board allows. Sometimes, choosing a smaller NC would increase the utilization but decrease the throughput as fewer computations can happen in each clock cycle. In this work, it is the ambition to combine high utilization with a high number of PEs to achieve optimal throughput.

TABLE VII: FPGA Details. An MAC counts for two operations: a multiplication and a sum.

First, a good value for wstat is derived based on results for Zynq XCZU9EG. It will turn out that this is a tradeoff between memory use and utilization (and therefore throughput). The value of wstat is chosen based on the results for the segmentation [2], super-resolution [15], denoising [16], depth estimation [17], and SPyNet network [18]. This experiment shows the flexibility of the processor and the high utilization of the multipliers for all kernel dimensions. For the chosen value of wstat in the first experiment, the advantages of DF are tested for all FPGAs under study with the denoising network [16] with three convolutional layers (kernels 9 × 9, 5 × 5, and 5 × 5) with a full-HD image at the input.

B. Results and Discussion

Fig. 9. Simulation results for all networks on Zynq XCZU9EG with an input image with full-HD (3840 × 2160) resolution to investigate the influence of wstat.

1) Selection of wstat and Utilization: Fig. 9 shows the influence of wstat on the total useful utilization for the five networks under study. The useful utilization with and without the image combining trick, as shown in Fig. 4, is shown. When using this trick, the useful utilization rises with wstat as the subimage width is increasing, and therefore, fewer multiplications are performed twice at the border between subimages. When not using this trick, the total useful utilization is lower and also not strictly increasing with wstat. The overhead at borders between subimages is decreasing, but the overhead due to the fact that the image width does not fit perfectly the increasing subimage width becomes a dominant factor.

The useful utilization of the super-resolution network is the lowest. This is mostly due to the first and last layers. The first layer has a 3 × 3 kernel, so Cu = 5. However, the first layer has only one input channel, leading to a useful utilization of less than 20%. The last layer is also a 3 × 3 kernel, so Ku = 2. The last layer has only one output channel, leading to a useful utilization of less than 50%. In this network, the two intermediate convolutional layers are quite small, leading to a nonnegligible drop in useful utilization and, therefore, throughput.

For wstat bigger than 6, the utilization is not increasing significantly anymore. However, the amount of memory needed to execute those networks with the same bitfile is increasing. The optimal value for wstat is a tradeoff between memory use and utilization (and therefore throughput). In this design, wstat equal to 6 is chosen. When using the image combining trick, the useful utilization is 78.7% for the super-resolution network and between 88.3% and 93.1% for the four other networks. This leads to throughputs between 588 and 695 GOPS.
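These numbers follow from the PE-array geometry; the helper below (an added illustration: the 30 rows and the MAC = 2 operations convention come from Table IV and Table VII, while NC and the clock frequency are placeholders) shows the relation between utilization and throughput:

```python
def throughput_gops(nc_columns, clock_mhz, useful_utilization, rows=30):
    """GOPS = rows x NC MACs per cycle x 2 operations per MAC x clock x utilization."""
    return rows * nc_columns * 2 * (clock_mhz * 1e6) * useful_utilization / 1e9

print(throughput_gops(nc_columns=42, clock_mhz=300, useful_utilization=0.93))  # ~700 GOPS
```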

TABLE VIII: Simulation Results for FPGA Coprocessor While Executing a Denoising Network [16] With an Input Image With Full-HD (3840 × 2160) Resolution.

Fig. 10. Simulation results for an FPGA coprocessor while executing a denoising network [16] with an input image with full-HD (3840 × 2160) resolution.

2) Benefits of Depth-First: Fig. 10 shows the results of the experiments with the layer-first and DF coprocessor for the denoising network with wstat equal to 6 to illustrate the benefits of DF in comparison with layer-first, using the image combining trick.

The ratio in throughput gain for DF in comparison with layer-first for the different FPGAs is given in Table VIII.

The useful utilization during the execution is good for every convolutional layer but never the maximum due to the subimage overlap regions (see Section III) and the additional utilization loss for rows and columns (see Section IV-A).

In the middle row of Fig. 10, the computation time for each of the convolutional layers is shown, separated from the number of clock cycles the processor is stalled waiting for data. Note that this is not equal to the total number of communication clock cycles. The communication clock cycles in which the PE array is performing useful computations are assigned to the corresponding layers. If the number of DSPs increases, the computation time decreases. This increases the time that the processor has to wait for new data to be able to start a computation. These two effects together cause the percentage of clock cycles that the processor has to wait to be much higher for layer-first than for DF. This results in a higher overall useful utilization and, therefore, higher throughput for DF in comparison with layer-first, as shown in the bottom row of the figure. The exact throughput gains for DF are given in Table VIII.

For a small FPGA with few columns and therefore small vertical subimages, the throughput is a little worse for DF in comparison with layer-first because the border effects between subsequent subimages lead to more double computations and the number of subsequent subimages itself also increases. For FPGAs with more DSPs, this problem is solved. The deeper the network and the bigger the kernel filter sizes, the more DSPs are necessary to have an advantage from DF in comparison with layer-first.

VI. COMPARISON WITH STATE OF THE ART AND DISCUSSION

Table IX compares the implementation of the biggest FPGA with state-of-the-art CNN implementations. The SotA FPGA implementations are all optimized for one particular CNN and only have to perform computations with one or two different kernel sizes. The processor proposed in this article is the only DF implementation that is programmable to deal with a wide variety of kernels. Table VII mentions the maximal throughput for all the FPGAs. Table IX mentions the throughput and utilization for the example of the five benchmarked CNNs, as presented in Section II.

The processor is compared with five FPGA implementations for pixel processing algorithms and three FPGA implementations for classification algorithms. For both pixel processing and classification tasks, we compare with layer-first and (pipelined) pixel DF.

For both kinds of applications, the PE utilization rate is higher for DF-based architectures in comparison with layer-first architectures due to the reduction in stalling cycles. The PE utilization for the super-resolution network is the lowest of the five benchmarked networks, as explained earlier. It is, therefore, lower than the utilization of the architectures in [14] and [15]. These architectures were the least flexible, with one multiplier for each weight. The presented architecture performs on par or better in comparison with the other state-of-the-art architectures, while the design is much more flexible. With this, it is also proven that the PE array has high utilization across kernel sizes.

VII. FUTURE WORK

Nowadays, CNNs do not only contain classical convolutional layers but also strides, 1-D kernels, and residual blocks. The pixel processing algorithms of interest in this article do not contain these kinds of layers. Yet, the architecture can be extended to also incorporate these layer types. This section, therefore, describes how the presented architecture can be augmented to compute these layers efficiently.

A. Kernel of 1 × 1

To compute a 1 × 1 kernel efficiently, a new Cu and Ku for FX = FY = 1 must be selected. Because of the shape of the adder tree, Ku must be a power of 2. We would choose Cu = 15 and Ku = 2. The same two outputs as in the current adder can be reused. A Cu of 15, however, means that the
PE array needs access to 15 input channels at the same time such that 15 features' memories are needed instead of 5. This drastically increases the required bandwidth. The additional features' memories can be used to store channels of the already supported kernels (as explained for the 7 × 7 kernel), but this leads to more multiplexers in front of the PEs. The theoretical maximal utilization is 100%.

TABLE IX: Comparison With State of the Art. "ut" stands for "utilization."

B. 1-D Kernels

In modern networks, square kernels are sometimes replaced by a vertical and a horizontal kernel to save weights and reduce computation time. For a horizontal kernel, the "for fx" statement in the FSM for the computation of one line of a layer becomes "for fx = 0:0." As a 1 × 1 kernel can be supported as suggested earlier, a vertical kernel can be implemented as FX kernels of 1 × 1. The results can be added together in the same way as for a square kernel.

C. Stride of 2

To support strides of 2, the architecture needs to be changed as the number of input pixels that are necessary to compute NC output pixels of a given line doubles. Therefore, the current features' memory bandwidth is only high enough to keep half of the columns busy. To keep all of them busy, we can double the Ku: use the left half of the PE array for Ku output channels and the right half for Ku additional output channels. The features' memory bandwidth does not need to be increased.

D. Residual Blocks

In residual blocks [24], the output of a given convolutional layer is added to the output of a further convolutional layer, as illustrated in Fig. 11.

Fig. 11. Impact of residual blocks on the memory design.

In the line-based DF approach, lines that are not used anymore are removed from memory, as explained in Section III. When using residual blocks, more lines must be stored, as illustrated in the figure. The number of lines to be stored must be increased with (FX − 1)/2 for each convolutional layer before the features layer with which the output must be added.

VIII. CONCLUSION

This article presents an FPGA CNN coprocessor optimized for pixel processing algorithms. There are four main innovations introduced to overcome the challenges of the compute- and memory-intensive pixel processing CNNs.

First, the I/O communication between the host processor and the FPGA is reduced to a minimum using the DF principle.

Second, six DF approaches are compared in terms of memory and parallelization options. The newly proposed line-based DF approach is chosen above the tile-, pixel-, intermediate-line-, and multiline-based approaches.

Third, a dataflow scheme maintains high utilization and a high data reuse factor across a wide range of CNN kernel sizes. This reduces on-chip memory bandwidth requirements and again increases throughput. Specifically, a mapping that is flexible in terms of the Cu and Ku unrolling in function of the kernel size is implemented.

Fourth, a model is built to investigate the influence of DF versus layer-first and the useful utilization rate. The model is also used to decide hardware parameters. For the benchmarked networks, the reduction in I/O communication due to
DF goes up to 81×. The abovementioned model shows that the DF principle causes a gain in throughput in comparison with the layer-first principle due to the reduction in the number of clock cycles that the FPGA has to wait for I/O. For four out of the five benchmarked networks on the Zynq XCZU9EG platform, the useful utilization for the complete network is between 88.3% and 93.1%, higher than or on par with the state of the art, leading to a throughput between 659 and 695 GOPS. For one network, the utilization was lower: 78.7%.

The proposed solution is fully flexible and programmable. This allows executing a wide range of CNN models on the same FPGA platform, without intermediate costly reconfiguration of the bitfile.

REFERENCES

[1] Y. Lyu, L. Bai, and X. Huang, "ChipNet: Real-time LiDAR processing for drivable region segmentation on an FPGA," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 66, no. 5, pp. 1769–1779, May 2019.
[2] Y. Lyu, L. Bai, and X. Huang, "Real-time road segmentation using LiDAR data processing on an FPGA," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), 2018, pp. 1–5.
[3] S. Michalik, S. Michalik, J. Naghmouchi, and M. Berekovic, "Real-time smart stereo camera based on FPGA-SoC," in Proc. IEEE-RAS 17th Int. Conf. Humanoid Robot. (Humanoids), Nov. 2017, pp. 311–317.
[4] C. Hahne, A. Lumsdaine, A. Aggoun, and V. Velisavljevic, "Real-time refocusing using an FPGA-based standard plenoptic camera," IEEE Trans. Ind. Electron., vol. 65, no. 12, pp. 9757–9766, Dec. 2018.
[5] Y. Ma, Y. Cao, S. Vrudhula, and J.-S. Seo, "Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, Feb. 2017, pp. 45–54.
[6] J. Qiu et al., "Going deeper with embedded FPGA platform for convolutional neural network," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, Feb. 2016, pp. 26–35.
[7] H. Li, X. Fan, L. Jiao, W. Cao, X. Zhou, and L. Wang, "A high performance FPGA-based accelerator for large-scale convolutional neural networks," in Proc. 26th Int. Conf. Field Program. Log. Appl. (FPL), Aug. 2016, pp. 1–9.
[8] Z. Liu, Y. Dou, J. Jiang, and J. Xu, "Automatic code generation of convolutional neural networks in FPGA implementation," in Proc. Int. Conf. Field-Program. Technol. (FPT), Dec. 2016, pp. 61–68.
[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Commun. ACM, vol. 60, no. 6, pp. 84–90, May 2017.
[10] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014, arXiv:1409.1556.
[11] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," 2018, arXiv:1804.02767.
[12] Y. Sada, N. Soga, M. Shimoda, A. Jinguji, S. Sato, and H. Nakahara, "Fast monocular depth estimation on an FPGA," in Proc. IEEE Int. Parallel Distrib. Process. Symp. Workshops (IPDPSW), May 2020, pp. 143–146.
[13] S. Annadurai, Fundamentals of Digital Image Processing. London, U.K.: Pearson, 2007.
[14] T. Manabe, Y. Shibata, and K. Oguri, "FPGA implementation of a real-time super-resolution system using a convolutional neural network," in Proc. Int. Conf. Field-Programmable Technol. (FPT), Dec. 2016, pp. 249–252.
[15] T. Manabe, Y. Shibata, and K. Oguri, "FPGA implementation of a real-time super-resolution system using flips and an RNS-based CNN," IEICE Trans. Fundam. Electron., Commun. Comput. Sci., vol. 101, no. 12, pp. 2280–2289, Dec. 2018.
[16] H. Zhao, O. Gallo, I. Frosio, and J. Kautz, "Loss functions for image restoration with neural networks," IEEE Trans. Comput. Imag., vol. 3, no. 1, pp. 47–57, Mar. 2017.
[17] J. Žbontar and Y. LeCun, "Stereo matching by training a convolutional neural network to compare image patches," J. Mach. Learn. Res., vol. 17, no. 1, pp. 2287–2318, Jan. 2016.
[18] A. Ranjan and M. J. Black, "Optical flow estimation using a spatial pyramid network," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4161–4170.
[19] M. Alwani, H. Chen, M. Ferdman, and P. Milder, "Fused-layer CNN accelerators," in Proc. 49th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), Oct. 2016, pp. 1–12.
[20] K. Goetschalckx and M. Verhelst, "Breaking high-resolution CNN bandwidth barriers with enhanced depth-first execution," IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 9, no. 2, pp. 323–331, Jun. 2019.
[21] Q. Xiao, Y. Liang, L. Lu, S. Yan, and Y.-W. Tai, "Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on FPGAs," in Proc. 54th Annu. Design Autom. Conf., Jun. 2017, pp. 1–6.
[22] X. Yang, "Interstellar: Using Halide's scheduling language to analyze DNN accelerators," in Proc. 25th Int. Conf. Architectural Support Program. Lang. Operating Syst., 2020, pp. 369–383.
[23] M. Sewak, M. R. Karim, and P. Pujari, Practical Convolutional Neural Networks: Implement Advanced Deep Learning Models Using Python. Birmingham, U.K.: Packt Publishing, 2018.
[24] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.

Steven Colleman was born in Lier, Belgium, in 1995. He received the M.Sc. degree in electrical engineering from KU Leuven, Leuven, Belgium, in 2018, where he is currently pursuing the Ph.D. degree with MICAS in the laboratory of Prof. Dr. Ir. Marian Verhelst. His master's thesis is entitled "Optimalisation of a coarse grain reconfigurable array for the efficient mapping of convolutional neural networks."
His research interest lies in the field of efficient processing architectures for embedded deep learning.

Marian Verhelst (Senior Member, IEEE) received the Ph.D. degree from KU Leuven, Leuven, Belgium, in 2008.
She was a Visiting Scholar with the Berkeley Wireless Research Center (BWRC), University of California at Berkeley, Berkeley, CA, USA, in summer 2005. She was a Research Scientist with Intel Labs, Hillsboro, OR, USA, from 2008 to 2011. She is currently an Associate Professor with the MICAS Laboratories, Electrical Engineering Department, KU Leuven. Her research focuses on embedded machine learning, hardware accelerators, HW-algorithm codesign, and low-power edge processing.
Dr. Verhelst was a member of the Young Academy of Belgium, an Associate Editor for TVLSI, TCAS-II, and JSSC, and a member of the STEM Advisory Committee to the Flemish Government. She is also a member of the Executive Committees of DATE and ISSCC, the TPC Co-Chair of AICAS 2020 and tinyML 2021, and a TPC Member of VLSI and ESSCIRC. She also holds a prestigious ERC Starting Grant from the European Union. She was a Laureate of the Royal Academy of Belgium in 2016. She is an SSCS Distinguished Lecturer.
