Systolic Array Architecture for Educational Use

Flavius OPRIȚOIU, Mircea VLĂDUȚIU


Advanced Computing Systems and Architectures (ACSA) Laboratory,
Computer Science and Engineering Department, Politehnica University Timisoara,
2 V.Parvan Blvd, 300223 Timisoara, Romania
[email protected]; [email protected]

2023 27th International Conference on System Theory, Control and Computing (ICSTCC), Timisoara, Romania, October 11-13, 2023 | 979-8-3503-3798-3/23/$31.00 ©2023 IEEE | DOI: 10.1109/ICSTCC59206.2023.10308496

Abstract— This paper presents a systolic array architecture for General Matrix Multiplication. The system was designed and verified using the Verilog description language. The architecture was constructed for educational use, aiming to complement the practical activities of Computer Architecture classes. The proposed solution can be used as a design space exploration tool for the evaluation of matrix multiplication accelerators based on non-stationary systolic arrays. Both the width and format of the processed operands, as well as the number of stages of the systolic array's Processing Elements, can be customized. As a case study, the design and performance of a systolic array for accelerating the multiplication of square matrices of signed, 8-bit integers is presented. The overall architecture was synthesized for the Altera DE2 FPGA board in order to evaluate its performance.

Keywords— Systolic Array, Matrix Multiplication Accelerator, Pipelined Processing Elements, Educational Design

I. INTRODUCTION

The Computer Architecture class represents one of the foundational subjects in Computer Engineering curricula [1], laying the groundwork for concepts and methods in other courses such as compiler design, operating systems and high-performance computing, to name only a few.

Computer systems are used in ever more aspects of our daily lives. Until recently, computer performance improved at a steady pace, as indicated by Moore's law [2]. Whether the performance improvement of computers originated in the advancement of integration technology, or whether it was the result of better integration technologies coupled with improved architectures, to some extent the economy itself became reliant on the ever-improving performance of computer systems. However, in the current technological context, the prevalent opinion in the domain literature is that, if Moore's law hasn't completely ended, at the very least its extent and applicability have been reduced considerably [3]. Currently, the increase in performance, as measured by benchmark programs, has dropped to a few percent (3%) per year [4]. Moreover, sustaining a high rate of performance improvement for general-purpose processors became an increasingly difficult task for CPU designers [4].

Among the alternatives identified in the literature for preserving the performance and efficiency scaling of computers, one can mention several approaches: a) performance engineering, in relation to the software run by a computer system [5]; b) development of new, improved algorithms [5]; and c) design and use of domain-specific architectures [4], [5] or domain-specific hardware accelerators [6]. Google's Tensor Processing Unit (TPU) is an example of a domain-specific accelerator, running Deep Neural Networks (DNNs) 15 to 30 times faster than CPUs and GPUs from the same technological generation, while at the same time having 30 to 80 times better energy efficiency than the same CPUs and GPUs [7]. In the current technological context and with respect to energy efficiency, it is worth noting that for today's general purpose CPUs, the energy required for fetching, decoding and executing an instruction can be 10 to 4000 times higher than the energy required for performing a simple operation like integer addition [6].

With the advances provided by the field of machine learning, particularly with respect to algorithm development, the quality of the results offered by DNN architectures increased significantly, leading to the proliferation of machine learning based solutions. The improvements of Convolutional Neural Networks (CNNs), used either as dedicated architectures or as components/backbones for more complex models, benefit domains such as computer vision (facilitating higher accuracies in image recognition tasks), robotics, natural language processing, automated reasoning and autonomous driving, to name only a few. The rapid adoption of CNN-based solutions, coupled with the already mentioned slowdown in computer systems' performance scaling, led to the development of specialized hardware for improving CNNs' training and/or inference. These domain-specific architectures are of importance not only from a performance point of view but also for their energy efficiency when running CNN models [8]. The goal of improving the performance and energy efficiency of DNN accelerators created the opportune conditions for the domain specialization of GPUs from a dedicated purpose hardware component into a ubiquitous element in today's machine learning landscape.

The field literature documents several works geared towards hardware acceleration of DNN architectures, with solutions ranging from dedicated, dataflow-oriented structures to systems built around a systolic array architecture. It is worth mentioning that the systolic array paradigm appears to be preferred by commercial solutions such as Google's TPU [7], Tesla's Full Self-Driving computer [9], Habana's Gaudi architecture [10] or Nvidia's NVDLA [11].

Systolic arrays are known to be efficient architectures for realizing matrix-related operations in hardware: from General Matrix Multiplication (GEMM) to matrix inversion and Singular Value Decomposition. In consequence, both the convolutional layers of CNNs as well as fully connected layers and multilayer perceptrons can be mapped to systolic arrays. In a CNN architecture, as much as 90% of the inference computation time is used by convolutions [12]; consequently, support for hardware acceleration of GEMM operations is expected to have a considerable effect on the overall inference latency.

Using dedicated hardware for the execution of several types of layers in today's DNN architectures has an added advantage related to the format of the operands. More
precisely, for added performance and energy efficiency, during inference the neural network model is streamlined by means of data quantization and pruning, allowing the neural model to operate with narrow integers instead of floating-point values. It is relevant, in this context, that multiplication of 8-bit integers requires 6 times less energy than multiplication of 16-bit floating point operands and, for addition, the energy requirement favors the narrow, fixed-point format by a factor of 13. Some of the more recent designs use even narrower fixed-point operands [13]. There is also interest in using reduced-dimension floating-point formats, on as few as 8 bits, with NVIDIA, ARM and Intel jointly proposing the narrow floating-point formats FP8-E4M3, having 4-bit exponents and 4-bit significands, and FP8-E5M2 with 5-bit exponents and 3-bit significands [14].

II. RELATED WORK

The literature documents numerous architectural solutions based on systolic arrays, especially with respect to the acceleration of DNN inference. The solutions range from those targeting a specific format or group of formats for the operands processed by the array, to architectures aiming to reduce latency by means of efficient overlapping of operations within the array. While only very few references focus on the educational issue of designing and optimizing a systolic array architecture, there are several works implementing and evaluating such arrays either for the FPGA platform or as custom-designed, ASIC hardware.

In [15] the authors present a multi-pod systolic array architecture and possible approaches for maximizing its efficiency, as measured in terms of throughput/Watt. In doing so, three design parameters were found to be of relevance: array granularity, array interconnect and array tiling. A design space exploration approach was used for finding the optimal dimension of the systolic array. The authors propose a tiling approach for maximizing the pods' utilization and proved by means of experiments that appropriate selection of the partitioning size can improve utilization by up to a factor of 5.

The systolic array architecture described in [16] is a stationary design that separates the multiplication and addition operations. The MEISSA architecture shifts the addition operations outside the multiplication critical path, to be efficiently realized by adder trees. In doing so, the three stages of use for a systolic array (the loading of operands, the multiplication, and the offloading) are correspondingly scheduled and launched to facilitate the overlapped execution of the addition operations. The proposed solution is compared against non-stationary systolic array architectures, which receive both matrices during the GEMM operation, and against conventional stationary systolic array structures, which start the GEMM operation having one of the two matrices preloaded.

In [17], the authors propose a systolic array built around a multiply and accumulate unit for which the final Carry Propagate Adder (CPA) is partially factored, allowing for improved performance. Several baseline designs were presented, and experimental results supplement the work, justifying the performance increase over the higher number of flip-flops used inside the systolic array's Processing Elements (PEs). A complete system capable of accelerating digit recognition based on the MNIST dataset is presented in [18]. The CNN topology is made up of a convolution layer with a Rectified Linear Unit activation function, followed by a max pooling layer with a 2×2 filter and a final fully connected layer. A 5×5 stationary systolic array is used for implementing the convolution layer. Since the multiplication operation can take full advantage of the Sign-Magnitude representation, the authors add a negation block between the RAM and the PEs for on-the-fly conversion of Two's Complement operands into Sign-Magnitude. The immediate advantage relates to the multiplication being realized between two unsigned values, the final sign being calculated with minimal investment and delay. The authors included support for other activation functions, while retaining the power efficiency of their architecture.

III. SYSTOLIC ARRAYS

A systolic array is structured as a lattice of locally interconnected PEs, capable of running an iterative algorithm with minimal data dependencies [19]. Typically, the PEs are identical and realize the same operations; however, heterogeneity can also characterize the array's PEs.

In the following, the systolic array is considered to implement a matrix-to-matrix multiplication. It must be noted that, for convolutional layers, the matrix-to-matrix multiplication operates on matrices of dimension k × k (where k represents the size of the filter) and the convolution operation includes a final addition of all k² elements of the resulting product matrix. Following similar approaches from the literature, the focus of this work is the design of the hardware for GEMM acceleration, the final addition being deferred to dedicated adder tree based accumulators [7].

Using a formal approach, a systolic array accelerating a convolutional layer implements the operation in (1), providing the k² elements of the product matrix C [20] at its output.

A_{k×k} × B_{k×k} = C_{k×k}, with c_ij = Σ_{l=0}^{k−1} a_il · b_lj and i, j ∈ {0, 1, …, k − 1}    (1)

In order to compute element c_ij, the relations in (2) are used. The iterative nature of c_ij^(l)'s calculation is inherited by the systolic array itself.

c_ij^(l+1) = c_ij^(l) + a_il · b_lj, with l ∈ {0, …, k − 1} and c_ij^(0) = 0    (2)

The recurrent relation in (2) represents the basis for constructing the PE, which is required to store the current value of variable c_ij^(l) and to receive a_il and b_lj at its inputs. In order to facilitate data reuse (regarding the elements of matrices A and B), the PEs are interconnected to form rows and columns, while the delivery of the final c_ij value requires a third, diagonal output. The symbol of the PE is depicted in Fig. 1 together with its outputs' equations:

a_i^(l+1) = a_i^(l)
b_j^(l+1) = b_j^(l)
c_ij^(l+1) = c_ij^(l) + a_i^(l) · b_j^(l)    (3)

Fig. 1. PE symbol with its outputs' equations
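The recurrence in (2) can be checked against a direct matrix product with a short behavioral model. The sketch below is plain Python (the paper's implementation is Verilog, so this is only an illustrative model; the function name is chosen here), with each inner-loop iteration corresponding to one multiply and accumulate step of a PE.

```python
def gemm_recurrence(A, B):
    """Square GEMM via the PE recurrence of relation (2):
    c_ij^(l+1) = c_ij^(l) + a_il * b_lj, starting from c_ij^(0) = 0."""
    k = len(A)
    C = [[0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            c = 0                          # c_ij^(0) = 0
            for l in range(k):             # one multiply-accumulate step per clock
                c = c + A[i][l] * B[l][j]
            C[i][j] = c
    return C
```

Because addition is associative and commutative here, the order in which the k terms are accumulated does not matter, which is what later allows the array to feed each row and column vector starting from its last element.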

As can be seen from Fig. 1, the calculation of c_ij^(l+1) requires updating the previously stored value of c_ij^(l). Consequently, at the start of the GEMM operation, all PEs' internal state must be initialized to the value 0, as indicated in (2). The multiply and accumulate operation performed by the PE for updating c_ij, as described in (3), is the operation that defines the latency of the PE.

The value of c_ij, according to (2), is obtained as the inner product of row vector a_i from A and of column vector b_j from B. Consequently, the vectors a_i and b_j must be entirely sequenced at the PE's inputs. This is the reason why the two inputs of the PE unit in Fig. 1 and in (3) are modified from a_il into a_i^(l) and from b_lj into b_j^(l), indicating that in each clock cycle a new pair of values a_il and b_lj is delivered at the inputs. In order to present the manner in which the systolic array interconnects PE units on rows and on columns, the special case of a systolic architecture implementing GEMM for square matrices of size 3 will be considered. The dot products for calculating c_ij and c_i(j+1) are expressed in (4) according to relation (2), as follows:

c_ij = a_i0 · b_0j + a_i1 · b_1j + a_i2 · b_2j
c_i(j+1) = a_i0 · b_0(j+1) + a_i1 · b_1(j+1) + a_i2 · b_2(j+1)    (4)

Element c_i(j+1) of the product matrix uses the same row vector a_i as element c_ij. Consequently, it is expected that the PE unit situated to the right of the one depicted in Fig. 1 will deliver at its output element c_i(j+1). Considering that a single clock cycle is necessary for the execution of the multiply and accumulate operation, it follows that the PE unit calculating element c_i(j+1) will receive value a_i0 at its input one clock cycle later. Consequently, the PE unit in Fig. 1 must be capable of storing, internally, both values that are ultimately forwarded at its outputs: a_i^(l) and b_j^(l).

Another consequence of the 1 clock cycle delay in the transmission of a_i0 is that, in order to correctly calculate the value of c_i(j+1), the column vector b_(j+1) needs to be delayed by the same 1 clock cycle by which row vector a_i is delayed to the output of the PE unit calculating c_ij. In a symmetrical manner, the PE unit positioned beneath the PE unit of Fig. 1 is expected to provide at its output element c_(i+1)j of the resulting matrix and, by similar considerations, the row vector a_(i+1) will have to be delayed by the same 1 clock cycle by which the output b_j^(l+1) of the PE unit calculating c_ij is delivered.

Fig. 2. The architecture of a PE unit capable of executing the multiply and accumulate operation in one clock cycle

Fig. 2 depicts the structure of a PE unit that executes the multiply and accumulate operation in one clock cycle. Besides the multiplier and the adder devices, the unit includes 3 storage elements, correspondingly labeled in the figure.

Fig. 3 presents the computations of the first 3 clock cycles for a systolic array of size 3 × 3. The inputs on the left edge of the PEs correspond to the row vectors of matrix A_3×3 (of which only the first two row vectors are depicted), and the 3 values of each row are sequentially delivered, one per clock cycle. The first value of a row vector to be delivered is the rightmost one. In particular, for the row vector of index 0, the first element to be delivered at PE_00's input is a_02. Similarly, the column vectors of matrix B_3×3 are delivered sequentially, starting with the last element in the column, at the PEs' inputs situated on the upper edge. After the initialization to 0 of the 3 PEs depicted in Fig. 3, only the top left unit is presented at its inputs with elements to process: a_02 and b_20.

Fig. 3. Example of the computations performed for determining the first 3 elements of the product matrix, in a 3 × 3 systolic array: a) PEs are initialized; b) after the first clock cycle, c_00 is updated with its first term; c) after the second clock cycle, c_00 is updated with its second term, while c_01 and c_10 are updated with their first terms, respectively; d) element c_00 is correctly calculated and can be delivered at the array's output, while c_01 and c_10 are updated with their second terms, respectively

It is important to note that the 1 clock cycle delays required by the unit to the right of PE_00 and by the unit beneath PE_00 are assured, in this implementation, by connecting a value of 0 to the respective inputs. As one can observe, the first element of the product matrix to be delivered is c_00, and both elements c_01 and c_10 will be correctly calculated 1 clock cycle after the completion of c_00. The correlation between the delay required for each row and column input and the delay after which the corresponding output element is delivered by the array considerably reduces the complexity of the systolic architecture's control logic.
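The dataflow described above (each row and column vector skewed by one clock cycle per index, with zeros filling the leading input slots, and registered operand forwarding between neighbors) can be captured in a cycle-level behavioral model. This is an illustrative Python sketch under those assumptions, not the paper's Verilog; the function name is chosen here, and the output row selectors of the full architecture are omitted.

```python
def simulate_systolic(A, B):
    """Cycle-level model of an n x n non-stationary systolic array.
    Row i of A enters on the left edge starting at cycle i (rightmost
    element first); column j of B enters on the top edge starting at
    cycle j. Registered a/b forwarding adds one cycle per hop."""
    n = len(A)
    a_reg = [[0] * n for _ in range(n)]  # A operand latched by each PE
    b_reg = [[0] * n for _ in range(n)]  # B operand latched by each PE
    C = [[0] * n for _ in range(n)]
    for t in range(3 * n - 2):           # enough cycles for every PE to finish
        new_a = [[0] * n for _ in range(n)]
        new_b = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                # left input: external skewed feed for column 0, else neighbor
                if j == 0:
                    a_in = A[i][n - 1 - (t - i)] if i <= t < i + n else 0
                else:
                    a_in = a_reg[i][j - 1]
                # top input: external skewed feed for row 0, else neighbor
                if i == 0:
                    b_in = B[n - 1 - (t - j)][j] if j <= t < j + n else 0
                else:
                    b_in = b_reg[i - 1][j]
                C[i][j] += a_in * b_in   # single-cycle multiply and accumulate
                new_a[i][j] = a_in       # visible to the right neighbor next cycle
                new_b[i][j] = b_in       # visible to the unit below next cycle
        a_reg, b_reg = new_a, new_b
    return C
```

In this model PE_ij consumes its first operand pair at cycle i + j and its last at cycle i + j + n − 1, matching the observation that c_00 completes after 3 cycles in the 3 × 3 case and that c_ij follows i + j cycles later.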

It is evident that a systolic array capable of implementing the GEMM operation for square matrices of size k requires k² PE units distributed in a square 2D lattice. In [20], the authors construct a formal model for the evaluation of space-time optimal systolic arrays for matrix multiplication. Several planar systolic arrays of size 4 are constructed and evaluated. Some of the variants require interleaving the elements of the row and column vectors with values of 0, for the realization of relation (2), while also using more than 16 PE units. However, the authors also present array variants that keep at a minimum both the number of interspersed values of 0 and the number of PE components. The same reference formalizes the number of clock cycles consumed by the array while executing a GEMM operation as a function of the array's size, considering PE units that update their content in 1 clock cycle.

A. Multi-cycle multiply and accumulate

When using wider formats for the operands, both the addition and multiplication operations incur additional hardware complexity and larger latencies. In order to avoid large delays adversely affecting the architecture's overall performance, the entire multiply and accumulate operation is split over several clock cycles.

The computations performed by a systolic array, as described in Fig. 3 on a limited number of PEs, are directly affected by a multi-cycle multiply and accumulate unit. More precisely, because the c_ij^(l+1) output of the PE unit is obtained after a number of clock cycles, the other two outputs of the same PE, a_i^(l+1) and b_j^(l+1), need to be delayed by the same amount.

Fig. 4 depicts a PE structure that executes the multiply and accumulate operation in two clock cycles. In this particular case, an intermediate register stores the multiplication's result, to be added to the PE unit's current state one clock cycle later. Several approaches are described in the literature for sequencing and balancing the latency of a multiply and accumulate operation into 2 stages of 1 clock cycle each. Such a solution is described by the authors of [21].

Fig. 4. The architecture of a multi-cycle PE unit capable of executing the multiply and accumulate operation in two clock cycles

For the unit in Fig. 4, because 2 clock cycles are required for the multiply and accumulate operation, none of the downstream neighboring units of a particular PE element (neither the unit to the right nor the unit below) can use the propagated a_i or b_j sooner than the two clock cycles allocated for the multiply and accumulate operations of the respective PE.

Consequently, all 3 values delivered by the PE unit need to be synchronously available at its outputs; thus, one additional storage element is inserted on the propagation paths of both a_i and b_j. In addition, the number of clock cycles by which each row and column vector input is delayed, as previously discussed, must be scaled by the same amount. Consequently, for two-stage PEs, the row vector of index 1 must be delayed by 2 clock cycles, just like the column vector of the same index.

B. Non-stationary 3×3 systolic array

Fig. 5 depicts a non-stationary systolic array of size 3. The PE units execute the multiply and accumulate operation in one clock cycle, as can be seen for the PE_00 unit in the top left corner of the lattice. The systolic design concurrently receives elements of both matrices and correspondingly inserts the required delays for the correct operation of the array. On input A_row_i, in each clock cycle, one element of row i of matrix A is delivered, starting with a_i2, as indicated in Fig. 3. Similarly, on the B_col_j inputs, in each clock cycle, one element of column j of matrix B is available.

Fig. 5. A systolic array of size 3 that can accelerate the multiplication of square matrices of shape 3, using PE units that execute the multiply and accumulate operation in one clock cycle
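The two-stage PE of Fig. 4 can be mimicked with a small behavioral model (an illustrative Python sketch, not the paper's RTL; the class name and the exact stage split are assumptions based on the description above): the product is latched in the intermediate register in the first cycle and accumulated in the second, so the forwarded a and b operands must be delayed by two cycles to stay in step with the accumulator.

```python
class TwoStagePE:
    """Sketch of the two-stage PE of Fig. 4 (assumed split: cycle 1
    latches the product, cycle 2 accumulates it into c_ij)."""
    def __init__(self):
        self.c = 0            # accumulator holding c_ij
        self.prod_reg = 0     # intermediate register between the two stages
        self.a_pipe = [0, 0]  # a/b forwarded with a 2-cycle delay,
        self.b_pipe = [0, 0]  # keeping all 3 outputs in step

    def clock(self, a_in, b_in):
        a_out = self.a_pipe.pop(0)   # operands re-emerge 2 cycles later
        b_out = self.b_pipe.pop(0)
        self.c += self.prod_reg      # stage 2: accumulate previous product
        self.prod_reg = a_in * b_in  # stage 1: multiply current operands
        self.a_pipe.append(a_in)
        self.b_pipe.append(b_in)
        return a_out, b_out
```

Feeding the three operand pairs of a dot product followed by a flush cycle leaves the accumulator holding the full inner product, while each forwarded operand emerges exactly two clock edges after it entered, which is why the row and column input skews must be doubled for two-stage PEs.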

The delays are inserted by means of registers connected on every row and column input, except for the very first unit, PE_00, according to the details from Fig. 3. The registers, cleared at the beginning of the matrix multiplication operation, provide the values of 0 that the correct computation of the matrix elements requires.

The final, correct value of the product matrix element provided by each PE unit is delayed by a number of clock cycles given by the unit's position in the lattice. More precisely, unit PE_00 provides its correct output first, followed, one clock cycle later, by units PE_01 and PE_10. Two clock cycles after the result of PE_00 is available, three PE units are delivering their product matrix elements: PE_02, PE_11 and PE_20. A close analysis reveals that the number of clock cycles required for unit PE_ij to deliver its product matrix value equals i + j + DelayC_00, where DelayC_00 refers to the number of clock cycles needed for delivering the first element of the product matrix, c_00. Because the latencies of the values provided by the systolic array's PE units are aligned with the secondary diagonal of the lattice, computation with systolic array architectures is also referred to in the literature as wave computing.

Using the analysis in [20], for an array of size n, the delay for completing the multiplication is given in (5). In particular, the architecture in Fig. 5 requires 9 clock cycles to complete the multiplication of 2 square matrices of size 3.

N_iterations = 3n − 2    (5)

The systolic array architecture of Fig. 5 delivers the product matrix on rows; however, minimal modifications are required to adapt the structure for delivering the product matrix on columns. Because the elements of a product matrix row are available at different time moments, output row selector units are included to forward the correct element of that row in its corresponding clock cycle slot.

For the case of the 3 × 3 systolic array in Fig. 5, all elements of a product matrix row are delivered in 3 consecutive clock cycles. For example, the final value of c_00 is available 3 clock cycles after the array begins the matrix multiplication, as can be seen from Fig. 3. The c_01 element is available 4 clock cycles after the start and c_02 after 5 clock periods. Thus, the output row selector for the first row of the product matrix is implemented as a multiplexer which, depending on the number of clock cycles that have elapsed since the beginning of the computation, forwards one of c_00, c_01 or c_02 to the output. In order to correctly select the value to be forwarded to the C_row_i output, an iteration counter is required, whose content is updated each clock cycle. For a systolic array using multi-cycle PEs, the iteration counter is updated once every m clock cycles, with m indicating the number of stages of the PEs.

Another consequence of the manner in which PE units deliver their final values, for the architecture in Fig. 5, is the fact that each PE unit in a line of the systolic design adds a latency of 1 clock cycle to the delays of the units in the preceding line. As presented above, output C_row_0 provides its values 3, 4 and, respectively, 5 clock cycles after the start of multiplication. However, output C_row_1 provides its values 4, 5 and, respectively, 6 clock cycles from the start, while the last output delays its content by one additional clock period. The architecture in Fig. 5 delivers the rows of the product matrix concurrently, on all 3 outputs, in 3 consecutive clock cycles. In order to achieve this, additional delaying elements, in the form of registers, were added on each row output, starting with the first output row, which needs to be delayed the most, down to the last output row, which requires no delay.

In Fig. 5, once all output rows are synchronized, a combinational unit is added for activating the valid output, signaling that the current values at the systolic array's outputs are part of the product matrix.

If the PE units execute their operation in more than one clock cycle, the adaptation of the architecture in Fig. 5 to a multi-cycle PE unit design is straightforward and involves replicating each register present in the diagram by a factor equal to the number of stages inside the PE unit. This justifies how the PE architecture of Fig. 2 was adapted into the 2-stage design of Fig. 4.

IV. EXPERIMENTAL RESULTS

The proposed systolic array architecture was modeled using the Verilog Hardware Description Language. The design was validated using ModelSim-Intel Starter Edition. In order to evaluate the systolic design's performance, the implementation was synthesized for the Altera DE2-115 development and education board using the Quartus Prime Lite Edition software package. The PE units in the evaluated systolic array architectures execute the multiply and accumulate operation in a fused manner, with a delay of 1 clock cycle. The operands processed by the arrays are 8-bit signed integers represented in Two's Complement.

The fused multiply add units within the PEs were modeled at the Register Transfer Level. The PEs realize the 8-bit signed multiplication operation using the Booth radix-4 method [22] which, for the considered format, produces 4 partial products, correspondingly aligned. Considering the multiplicand Y to be Y = a_i^(l), as depicted in Fig. 2, each of the 4 partial products can have the value 0, Y, 2Y, −Y or −2Y. In order to restrict the result to 8-bit signed integers, the 16 bits of the product are truncated to the least significant 8 bits. As a solution for avoiding incorrect truncated results when using fixed-point formats for implementing convolutions, the operation's inputs and the kernel's weights are quantized in a pre-processing phase [7]. For increased performance, a separate addition of the 4 partial products is avoided and, instead, the 4 partial products are added together with the value c_ij that the PE unit provides at its output, using Carry Save Adders (CSAs) [22]. Consequently, the 4 partial products are first truncated so that, in the end, 5 partial products are reduced with 3 CSA stages. Finally, the redundant carry and sum vectors are added using an 8-bit, four-layer Conditional Sum Adder [22].

The Verilog implementation was developed in a parameterized manner, facilitating the construction, synthesis and evaluation of various array configurations. Several systolic array architectures were constructed and evaluated, for sizes 3, 4 and 5, proving the applicability of the design as a practical teaching resource for the specific topics of digital design and optimization in a Computer Architecture class.

Table I presents the synthesis parameters for the 3 systolic array setups. The area requirements are expressed in terms of the number of Logic Elements used by the synthesis. With respect to each architecture's performance, the maximum operating frequency and the throughput are provided.
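The Booth radix-4 recoding used inside the PEs can be illustrated with a short Python sketch (a behavioral model of the recoding arithmetic only, not the paper's RTL, and without the truncation and CSA reduction steps; the function name is chosen here): the 8-bit signed multiplier is recoded into 4 digits in {−2, −1, 0, 1, 2}, each selecting a partial product of 0, ±Y or ±2Y, shifted by two bit positions per digit.

```python
def booth_radix4_partial_products(x, y):
    """Recode the 8-bit signed multiplier x into 4 Booth radix-4 digits,
    each selecting a partial product of 0, +-y or +-2y (y being the
    multiplicand), aligned by 2 bit positions per digit."""
    assert -128 <= x <= 127 and -128 <= y <= 127
    xb = [(x >> i) & 1 for i in range(8)]  # two's-complement bits x_0..x_7
    xb = [0] + xb                          # implicit x_{-1} = 0 for digit 0
    pps = []
    for l in range(4):
        # digit_l = -2*x_{2l+1} + x_{2l} + x_{2l-1}, a value in {-2..2}
        digit = -2 * xb[2 * l + 2] + xb[2 * l + 1] + xb[2 * l]
        pps.append(digit * y * (4 ** l))   # partial product, shifted by 2l bits
    return pps
```

Summing the 4 shifted partial products reproduces the full signed product, which is what the CSA tree in the PE reduces together with the accumulated c_ij value.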

TABLE I. SYNTHESIS RESULTS FOR THREE SYSTOLIC ARRAY Annual International Symposium on Computer Architecture (ISCA),
CONFIGURATIONS 2011, pp. 365-376.
Systolic Slices Max frequency Throughput [4] John L. Hennessy, and David A. Patterson, “A new golden age for
array size (LE) (MHz) (Mbps) computer architecture,” Commun. ACM, vol. 62, no. 2, pp. 48–60,
2019.
3 878 245.64 2210.76
[5] Charles E. Leiserson, Neil C. Thompson, Joel S. Emer, Bradley C.
4 1604 240.15 2794.76 Kuszmaul, Butler W. Lampson, Daniel Sanchez et al., “There's plenty
of room at the Top: What will drive computer performance after
5 2535 238.21 3403.00 Moore's law?,” Science, vol. 368, no. 6495, pp. eaam9744, 2020.
[6] William J. Dally, Yatish Turakhia, and Song Han, “Domain-specific
hardware accelerators,” Commun. ACM, vol. 63, no. 7, pp. 48–57,
It can be observed from Table I that, although the maximum frequency degrades marginally as the size of the systolic array increases, the overall throughput improves, favoring larger systolic arrays over smaller ones. This phenomenon can be explained by noting that, for larger array sizes, the increased volume of data the systolic architecture delivers at its output at the end of the matrix multiplication offsets the latency penalty of operating a larger array (i.e., running the architecture for additional clock cycles and at a slower frequency). Consequently, larger arrays are preferable, one such decision being visible in the structure of the TPU [7], which includes a 256 by 256 systolic array.
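The relationship between array size, frequency and throughput in Table I can be reproduced with a simple analytical model. Assuming (this is an inference from the reported figures, not a formula stated in the paper) that an n-by-n array completes one n-by-n multiplication in 3n - 1 clock cycles and streams out n^2 results of 8 bits each, the throughput follows directly:

```python
# Sketch: reproduce the Table I throughput figures from the reported
# maximum frequencies. Assumed (inferred) model: an n x n array finishes
# one n x n multiplication in 3n - 1 clock cycles and delivers n^2
# results of 8 bits each.
def throughput_mbps(n: int, f_mhz: float, out_bits: int = 8) -> float:
    cycles = 3 * n - 1               # assumed latency of one multiplication
    bits_out = n * n * out_bits      # total output bits per multiplication
    return bits_out * f_mhz / cycles # Mbit/s, since f is given in MHz

for n, f in [(3, 245.64), (4, 240.15), (5, 238.21)]:
    print(f"n={n}: {throughput_mbps(n, f):.2f} Mbps")
```

For n = 3 and n = 5 this model matches Table I exactly; for n = 4 it lands within 0.3 Mbps of the reported value, which is consistent with rounding of the maximum frequency.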
CONCLUSIONS

In this article a parameterizable systolic array architecture was constructed, operating on signed 8-bit integers represented in two's complement. The design facilitates design space exploration of systolic array performance with respect to a number of design parameters: the size of the array; the format of the data it operates on (fixed-point, floating-point, or custom representation formats); the width of the chosen format; the architecture of the multiply-accumulate units inside the array's PEs; and the pipeline depth of these PEs. The proposed systolic array design can be used as a practical framework for evaluating various adder and multiplier architectures with respect to their area requirements and the length of their critical paths. In the same context, the proposed platform can be used for evaluating the performance impact of designing a dedicated DNN accelerator over the use of general-purpose CPU and/or GPU platforms for running a neural network model. Besides serving as a complementary study resource for practical class activities, the presented architecture also helps develop implementation and synthesis skills in conjunction with digital design aptitudes. The proposed systolic design was configured for three different array sizes and the resulting instances were synthesized for an FPGA platform; finally, the performance of the three architectures was evaluated in terms of the required FPGA resources and the delivered throughput.
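Since the conclusions position the design as a design space exploration framework, a software golden model is useful when checking synthesized RTL instances. The sketch below simulates, cycle by cycle, an n-by-n systolic array with skewed operand feeding. One deliberate simplification is hedged up front: it implements the classic output-stationary dataflow (each PE keeps its accumulator in place), whereas the paper's architecture is output-non-stationary, so the function and its structure are illustrative only, not the paper's implementation.

```python
import numpy as np

def systolic_matmul(A, B):
    # Cycle-level functional model of an n x n systolic array computing C = A x B.
    # NOTE: this uses the simpler *output-stationary* dataflow for clarity; the
    # paper's design is output-non-stationary. Inputs are fed skewed: row i of A
    # enters from the left starting at cycle i, column j of B from the top at cycle j.
    n = A.shape[0]
    acc   = np.zeros((n, n), dtype=np.int64)  # per-PE accumulators
    a_reg = np.zeros((n, n), dtype=np.int64)  # A operands moving left-to-right
    b_reg = np.zeros((n, n), dtype=np.int64)  # B operands moving top-to-bottom
    for t in range(3 * n - 2):                # last product lands at cycle 3n - 3
        for i in reversed(range(n)):          # reversed traversal emulates the
            for j in reversed(range(n)):      # simultaneous register update
                a = a_reg[i, j - 1] if j > 0 else (A[i, t - i] if 0 <= t - i < n else 0)
                b = b_reg[i - 1, j] if i > 0 else (B[t - j, j] if 0 <= t - j < n else 0)
                acc[i, j] += a * b            # multiply-accumulate in PE(i, j)
                a_reg[i, j], b_reg[i, j] = a, b
    return acc
```

Comparing systolic_matmul(A, B) against A @ B for random operands in the signed 8-bit range [-128, 127] gives a quick regression check to pair with an HDL testbench.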
REFERENCES

[1] E. Durant, J. Impagliazzo, S. Conry, R. Reese, H. Lam, V. Nelson et al., "CE2016: Updated computer engineering curriculum guidelines," in 2015 IEEE Frontiers in Education Conference (FIE), 2015, pp. 1-2.
[2] G. E. Moore, "Progress in digital integrated electronics [Technical literature, Copyright 1975 IEEE. Reprinted, with permission. Technical Digest. International Electron Devices Meeting, IEEE, 1975, pp. 11-13.]," IEEE Solid-State Circuits Society Newsletter, vol. 11, no. 3, pp. 36-37, 2006.
[3] H. Esmaeilzadeh, E. Blem, R. S. Amant, K. Sankaralingam, and D. Burger, "Dark silicon and the end of multicore scaling," in 2011 38th Annual International Symposium on Computer Architecture (ISCA), 2011, pp. 365-376.
[4] John L. Hennessy and David A. Patterson, "A new golden age for computer architecture," Commun. ACM, vol. 62, no. 2, pp. 48-60, 2019.
[5] Charles E. Leiserson, Neil C. Thompson, Joel S. Emer, Bradley C. Kuszmaul, Butler W. Lampson, Daniel Sanchez et al., "There's plenty of room at the Top: What will drive computer performance after Moore's law?," Science, vol. 368, no. 6495, eaam9744, 2020.
[6] William J. Dally, Yatish Turakhia, and Song Han, "Domain-specific hardware accelerators," Commun. ACM, vol. 63, no. 7, pp. 48-57, 2020.
[7] Norman P. Jouppi, Cliff Young, Nishant Patil, and David Patterson, "A domain-specific architecture for deep neural networks," Commun. ACM, vol. 61, no. 9, pp. 50-59, 2018.
[8] Raju Machupalli, Masum Hossain, and Mrinal Mandal, "Review of ASIC accelerators for deep neural network," Microprocessors and Microsystems, vol. 89, 104441, 2022.
[9] P. Bannon, G. Venkataramanan, D. D. Sarma, and E. Talpes, "Computer and Redundancy Solution for the Full Self-Driving Computer," in 2019 IEEE Hot Chips 31 Symposium (HCS), 2019, pp. 1-22.
[10] E. Medina and E. Dagan, "Habana Labs Purpose-Built AI Inference and Training Processor Architectures: Scaling AI Training Systems Using Standard Ethernet With Gaudi Processor," IEEE Micro, vol. 40, no. 2, pp. 17-24, 2020.
[11] F. Farshchi, Q. Huang, and H. Yun, "Integrating NVIDIA Deep Learning Accelerator (NVDLA) with RISC-V SoC on FireSim," in 2019 2nd Workshop on Energy Efficient Machine Learning and Cognitive Computing for Embedded Applications (EMC2), 2019, pp. 21-25.
[12] C. Zhu, K. Huang, S. Yang, Z. Zhu, H. Zhang, and H. Shen, "An Efficient Hardware Accelerator for Structured Sparse Convolutional Neural Networks on FPGAs," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 28, no. 9, pp. 1953-1965, 2020.
[13] S. K. Lee, A. Agrawal, J. Silberman, M. Ziegler, M. Kang, S. Venkataramani et al., "A 7-nm Four-Core Mixed-Precision AI Chip With 26.2-TFLOPS Hybrid-FP8 Training, 104.9-TOPS INT4 Inference, and Workload-Aware Throttling," IEEE Journal of Solid-State Circuits, vol. 57, no. 1, pp. 182-197, 2022.
[14] Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwaite et al., "FP8 Formats for Deep Learning," https://ui.adsabs.harvard.edu/abs/2022arXiv220905433M, 2022.
[15] Ahmet Caner Yüzügüler, Canberk Sönmez, Mario Drumond, Yunho Oh, Babak Falsafi, and Pascal Frossard, "Scale-out Systolic Arrays," ACM Trans. Archit. Code Optim., vol. 20, no. 2, Article 27, 2023.
[16] B. Asgari, R. Hadidi, and H. Kim, "MEISSA: Multiplying Matrices Efficiently in a Scalable Systolic Architecture," in 2020 IEEE 38th International Conference on Computer Design (ICCD), 2020, pp. 130-137.
[17] Kashif Inayat and Jaeyong Chung, "Hybrid Accumulator Factored Systolic Array for Machine Learning Acceleration," IEEE Trans. Very Large Scale Integr. Syst., vol. 30, no. 7, pp. 881-892, 2022.
[18] S. H. Chua, T. H. Teo, M. A. Tiruye, and I. C. Wey, "Systolic Array Based Convolutional Neural Network Inference on FPGA," in 2022 IEEE 15th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), 2022, pp. 128-133.
[19] Jaime H. Moreno and Tomas Lang, Matrix Computations on Systolic-Type Arrays. Kluwer Academic Publishers, 1992.
[20] I. Z. Milentijević, I. Ž. Milovanović, E. I. Milovanović, and M. K. Stojčev, "The design of optimal planar systolic arrays for matrix multiplication," Computers & Mathematics with Applications, vol. 33, no. 6, pp. 17-35, 1997.
[21] Dionysios Filippas, Christodoulos Peltekis, Giorgos Dimitrakopoulos, and Chrysostomos Nicopoulos, "Reduced-Precision Floating-Point Arithmetic in Systolic Arrays with Skewed Pipelines," https://ui.adsabs.harvard.edu/abs/2023arXiv230401668F, 2023.
[22] Mircea Vladutiu, Computer Arithmetic - Algorithms and Hardware Implementations. Springer, 2012.
