
Factored Radix-8 Systolic Array for Tensor Processing

ABSTRACT

Systolic arrays are regaining attention as the heart of accelerators for machine learning workloads. This paper shows that a large design space exists at the logic level despite the simple structure of systolic arrays, and proposes a novel systolic array based on factoring and radix-8 multipliers. The factored systolic array (FSA) extracts the Booth encoding and the hard-multiple generation that are common across all processing elements, reducing the delay and the area of the whole systolic array. This factoring comes at the cost of an increased number of registers; however, the reduced pipeline-register requirement of radix-8 offsets this effect. The proposed factored 16-bit multiplier achieves up to 15%, 13%, and 23% better delay, area, and power, respectively, compared with radix-4 multipliers, even when the register overhead is included. The proposed FSA architecture improves delay, area, and power by up to 11%, 20%, and 31%, respectively, for different bitwidths when compared with the conventional radix-4 systolic array.

Keywords
Machine Learning, Systolic Arrays, Booth Multipliers

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
DAC '20, June 05-09, 2016, Austin, TX, USA
© 2016 ACM. ISBN 978-1-4503-4236-0/16/06...$15.00
DOI: http://dx.doi.org/10.1145/2897937.2898092

1. INTRODUCTION
In recent years, deep learning has demonstrated predictive performance unbeatable by any other known method. Deep neural networks have replaced many hand-crafted algorithms in various fields including computer vision, image/video compression, natural language processing, and reinforcement learning [1–4]. However, deep neural networks require a massive amount of computation, which hinders the wide deployment of the models on various devices and slows down innovation in the field of artificial intelligence. Thus, the demand for more compute power is higher than ever before, and custom hardware to accelerate deep learning inference and training is being studied actively [5–7].

In most machine learning models including deep learning, matrix multiplication is the key primitive. Most computations required by the models are explicitly represented as matrix multiplications or are easily transformed into them [8, 9]. For example, 2D convolutions in a convolutional layer can be transformed into a set of matrix multiplications using the Winograd algorithm [8], which also reduces the number of multiplications. Owing to this property, custom accelerators for machine learning can provide matrix multiplication as the only data-processing instruction at the software-hardware interface, and faster speed and higher efficiency can be achieved by eliminating architectural features included to support general-purpose processing. In such computing systems, the datapath design for computer arithmetic deserves more attention and is even more important than in conventional microprocessors.

The systolic array (SA) is a parallel computer architecture consisting of processing elements (PEs) organized as a linear or two-dimensional array. It is configured or hard-wired for specific operations such as matrix multiplication. The systolic array was invented back in 1979 by H. T. Kung and Charles E. Leiserson [10, 11] and regained attention when it was employed in several accelerators to speed up deep learning [5, 12]. The first version of the TPU (TPUv1) [5] used a 256×256 8-bit integer SA designed to accelerate inference. A TPUv2 chip consists of two cores, each containing a 128×128 SA, while each of the two cores in a TPUv3 chip deploys two 128×128 SAs [13]; these are used to speed up training as well as inference. The authors at IBM [12] implemented a 28×28 wavefront SA on an FPGA to accelerate training.

With increasing interest in systolic arrays, many studies have been performed on them [5, 6, 14–17], but to the best of our knowledge, none deals with the logic-level design of systolic arrays. This may be because the multipliers in systolic arrays have been considered an atomic black-box primitive. This paper deals with the logic-level design of systolic arrays considering the structure of the multipliers together. The major contributions of this paper can be summarized as follows.

• We present the concept of factored systolic arrays and show that a large design space exists at the logic level despite the simple structure of systolic arrays.

• We propose a realization of factored systolic arrays based on radix-8 multipliers and demonstrate its benefits in area, delay, and power.

• We also show that the radix-8 multiplier uses far fewer pipeline registers than the conventional radix-4 multiplier, reducing or offsetting the cost of the proposed factoring.

2. BACKGROUND

2.1 Systolic Array Architecture
We consider a two-dimensional (2-D) systolic system consisting of a set of PEs interconnected as a 2-D array. Figure 1 shows a simplified example of a 3×3 systolic array performing the multiplication of two 3×3 matrices A and B and producing the resultant matrix C of the same size. The rows of the input matrix A are provided to the PEs on the left edge and passed to the right. The columns of the matrix B are provided to the PEs on the top and passed downward. Each PE contains a multiplication and accumulation (MAC) unit.
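The dataflow just described (rows of A entering from the left, columns of B from the top, one MAC per PE) can be mimicked by a small cycle-level Python model. This is an illustrative sketch only; the paper's designs are Verilog RTL, and the function name is ours.

```python
# Cycle-level model of an n x n output-stationary systolic array.
# Rows of A enter at the left edge and columns of B at the top, each
# skewed by one cycle per row/column; every PE multiplies the two
# values it currently holds and accumulates into its local register.

def systolic_matmul(A, B, n=3):
    acc = [[0] * n for _ in range(n)]      # per-PE accumulators (matrix C)
    a_reg = [[0] * n for _ in range(n)]    # A values flowing rightward
    b_reg = [[0] * n for _ in range(n)]    # B values flowing downward
    for t in range(3 * n - 2):             # cycles until the wavefront drains
        # PEs forward the previous cycle's inputs right and down
        for i in range(n):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(n):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # skewed injection at the edges (row i / column j delayed by i / j)
        for i in range(n):
            k = t - i
            a_reg[i][0] = A[i][k] if 0 <= k < n else 0
        for j in range(n):
            k = t - j
            b_reg[0][j] = B[k][j] if 0 <= k < n else 0
        # every PE performs one multiply-accumulate per cycle
        for i in range(n):
            for j in range(n):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc
```

With the skewed injection, PE(i, j) sees a_{ik} and b_{kj} together, so c11 completes first and c33 last, matching the wavefront of Figure 1.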
The MAC unit performs the multiplication operation, and the product is accumulated in the accumulator register to form the partial result. Every cycle, the next batch of inputs is provided to the edge PEs and the previous inputs are forwarded to the next PEs. In the given example, PE11 performs the multiplication on the inputs a11 and b11 in clock cycle t1. In the second clock cycle t2, PE11 performs the multiplication on the next inputs a12 and b21 and forwards the previous cycle's inputs a11 and b11 to PE12 and PE21, respectively. In the same clock cycle, PE12 and PE21 perform multiplications on the pairs (a11, b12) and (a21, b11), respectively. In the third clock cycle t3, PE11 performs the multiplication on the third inputs a13 and b31, and it outputs the final result c11 in the fourth cycle t4. PE12 and PE21 output the resultant elements c12 and c21 of matrix C in the fifth clock cycle. In this way, the values of the input matrices are propagated through the systolic array and the elements of the output matrix C are computed in a wavefront flow, as shown in Figure 1. In the same manner, this 3×3 systolic array is able to perform the multiplication of two matrices of sizes 3×k and k×3.

Figure 1: A 3×3 systolic array for matrix multiplication.

2.2 Related Works
Our work is mainly related to systolic array architecture and MAC unit design. Systolic arrays are usually designed for specific operations such as convolution [6, 14–16] and matrix multiplication [5, 17]. Due to the popularity of convolutional neural networks (CNNs), the majority of works target convolutions. Most of the existing studies on SAs are performed at the architectural level. They aim at minimizing memory access to reduce energy consumption and design dataflows to maximize data reuse and local accumulation. In [6], the authors target accelerating the convolution operations in CNNs, study existing dataflows such as no local reuse, weight stationary, and output stationary, and propose a novel dataflow called row stationary. In [14], data reuse in the row-stationary dataflow is further exploited by sharing the storage of processing elements. In [17], dataflows of systolic arrays for matrix multiplication are studied. Several studies [15, 16] also consider the physical mapping of processing elements onto 3D ICs and enable data reuse along the third dimension, achieving further speed-up and less energy consumption. Unlike these studies, our study deals with the logic-level design of systolic arrays. To the best of our knowledge, none of the existing works on SAs is studied at this level. We target the systolic array for matrix multiplication, which provides the same functionality as the conventional systolic array and achieves significant improvements in area, power, and delay.

Since the MAC unit is a fundamental building block of deep learning accelerators, studies on the unit [18, 19] have become active recently. In [18], using the weight-sharing property of CNNs, the authors propose the parallel accumulate shared MAC, which counts the frequency of each weight; the final accumulated value is calculated in a subsequent multiply phase. Unlike this design, our proposed design does not impose any functional limitations. In [19], the authors show that some pipeline registers in the MAC unit can be eliminated because only the final value is used out of the large number of multiply-accumulations in machine learning applications. Our proposed design is orthogonal to this approach. Unlike these MAC studies, we deal with the entire systolic array.

3. PROPOSED DESIGN

3.1 A General Idea to Improve Systolic Arrays
We consider a systolic array where PEs are organized in a grid. The systolic array accepts the inputs for one matrix at the left edge and the inputs for the other matrix at the top edge, and produces the outputs at the top edge. Each PE consists of a MAC which takes two inputs X and Y and performs the multiplication and accumulation operations every clock cycle. The input of the MAC fed from the left side is considered the multiplier X, and the input fed from the top the multiplicand Y. Let P be the value stored in the accumulator when the accumulation is done and Q be the dot product (i.e., the sum of the products). We assume that X, Y, and Q are represented in binary for compatibility, but P can be represented in any format. The MAC unit performs dedicated pre-processing on X and Y prior to the main processing. The P-dedicated post-processing takes the MAC output and produces Q. The modified Booth encoding multiplier widely employed in conventional general-purpose processors performs Booth recoding as the X-dedicated pre-processing. In the systolic array, the dedicated logic (DL) circuits for the X-, Y-, and P-dedicated processing are replicated across PEs; we can factor them out and place them at the inputs and the outputs of the array. Then, the area of the XDL (YDL and PDL) circuits is amortized over the PEs in a row (column) and becomes marginal as the size of the array grows. Also, the delay of the DL circuits can be eliminated from the critical path. We call this structure the factored systolic array. The logic to convert the number format or for rounding may be placed naturally at the inputs and the outputs, but in the factored systolic array, we extract logic circuits that are traditionally considered a part of the MAC unit, such as Booth encoders. In this structure, more aggressive, non-conventional approaches can be employed.

3.2 Parallel Multipliers in SAs
Multipliers dominate the datapath of the systolic array architecture. A parallel multiplier consists of three major parts: 1) partial product generation; 2) partial product reduction trees (e.g., a Wallace tree); and 3) addition of the pair of final sum and carry rows of the partial products using a carry-propagate adder (CPA) to get the final product.

Consider an M×N-bit multiplier such that X = x_{M-1} x_{M-2} ··· x_1 x_0 and Y = y_{N-1} y_{N-2} ··· y_1 y_0 are the multiplier and the multiplicand, respectively.
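The partial-product generation step can be sketched in a few lines of Python: each of the N multiplicand bits gates a copy of the multiplier bits, and the shifted rows are summed. This is illustrative only; the function name is ours, and the reduction tree plus CPA are collapsed into a single integer sum.

```python
# Radix-2 partial-product multiplier: N rows of M-bit partial products
# PP[i][j] = x_i * y_j, with row j weighted by 2**j. Bit-level Python
# sketch; a real design reduces the rows with a Wallace tree and a CPA.

def pp_multiply(X, Y, M, N):
    rows = []
    for j in range(N):
        y_j = (Y >> j) & 1                                     # multiplicand bit y_j
        rows.append([((X >> i) & 1) * y_j for i in range(M)])  # PP[i][j] = x_i * y_j
    # sum the shifted rows (stand-in for the reduction tree + CPA)
    return sum(sum(bit << i for i, bit in enumerate(row)) << j
               for j, row in enumerate(rows))
```

For example, pp_multiply(13, 11, 4, 4) builds the 4×4 partial-product array and returns 143.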
The N rows of M-bit partial products PP_{i,j} can be expressed as

  PP_{i,j} = x_i · y_j,  ∀ i, j,  0 ≤ i < M, 0 ≤ j < N,   (1)

which is taken from [20]. In fast multipliers, the modified Booth encoding algorithm is employed to reduce the height of the partial products [21, 22]. A radix-R = 2^r, r > 0, Booth multiplier reduces the height of the partial products from N to ⌈(N+1)/r⌉, reducing the size and enhancing the speed of the reduction tree. The existing systolic array architectures employ radix-4 Booth multipliers, which have ⌈(N+1)/2⌉ partial products.

Figure 2: (a) Conventional systolic array architecture. (b) Proposed factored systolic array architecture. XDL, YDL, and PDL are replicated across PEs and can be extracted and placed at the inputs and the outputs of the systolic array, which allows us to re-think what the best MAC design is for systolic arrays.

Figure 3: Factored radix-8 systolic array architecture. The circuit in the red-dotted box is considered the multiplier, which includes the input and the output registers.

3.3 Radix-8 Booth Multiplier
In radix-8 Booth multipliers the height of the partial products reduces to ⌈(N+1)/3⌉, which brings a noticeable reduction in the area of the partial product reduction tree and in the critical path delay. However, Booth multipliers with a radix higher than 2 require hard multiples, that is, multiples that are not a power of two and are not obtainable using simple shift and complement operations. In a radix-8 Booth multiplier, the multiples ±0, ±Y, ±2Y, ±3Y, ±4Y of the multiplicand Y are required, and the generation of the hard multiple 3Y involves the addition of Y and 2Y using a CPA, which slows down the multiplier.

The radix-8 Booth recoding partitions the multiplier X into sets of 4 bits, namely 3 adjacent binary bits x_{3(i+1)-1} x_{3i+1} x_{3i} and a borrow bit x_{3i-1} from the previous set; each set is encoded into the control lines s, d, t, q, n using

  s_i = ¬(x_{3(i+1)-1} ⊕ x_{3i+1}) ∧ (x_{3i} ⊕ x_{3i-1})
  d_i = (x_{3i} ⊕ x_{3i+1}) ∧ ¬(x_{3i} ⊕ x_{3i-1})
  t_i = (x_{3(i+1)-1} ⊕ x_{3i+1}) ∧ (x_{3i} ⊕ x_{3i-1})
  q_i = (x_{3(i+1)-1} ⊕ x_{3i+1}) ∧ ¬(x_{3i} ⊕ x_{3i+1}) ∧ ¬(x_{3i} ⊕ x_{3i-1})
  n_i = x_{3(i+1)-1}
  for i = 0, 1, 2, ..., ⌈N/3⌉ - 1,   (2)

where N is the width of the multiplier X and x_{-1} = 0. These five control lines select a multiple of the multiplicand Y among ±0, ±Y, ±2Y, ±3Y, ±4Y. For an M×N radix-8 multiplier, 5·⌈(N+1)/3⌉ control signals are generated in parallel. Except for the all-zero case, these five control lines use one-hot encoding, and the code could be re-designed to reduce the 5 lines to 4. However, this can incur an additional delay in the Booth selection, so we use this standard coding as it is. The complexity of the radix-8 Booth recoding logic is higher than that of radix-4, but we will show that the entire Booth recoding logic can be removed from the individual multipliers in a systolic array.

3.4 Factored Radix-8 Systolic Array (FSA)
A radix-8 Booth multiplier involves the dedicated pre-processing of complex Booth recoding on the multiplier X and of 3Y = Y + 2Y generation on the multiplicand Y. When the radix-8 Booth multiplier is employed in the PEs of the systolic array, the logic for the aforementioned dedicated pre-processing on X and Y is replicated across the PEs. The main idea of this paper is to use the radix-8 Booth encoding in designing the MACs of the systolic array and, at the same time, to factor out this repeated dedicated pre-processing logic from the PEs and place it at the inputs.

Figure 3 shows the architecture of the proposed radix-8 FSA. The X inputs accepted from the left edge are pre-processed in parallel, and the control lines for the Booth selection are forwarded to the corresponding rows of PEs. The Y inputs accepted from the top edge are pre-processed, and the resultant 3Y values along with the Y values are forwarded to the corresponding columns of PEs. In this way, the proposed FSA achieves the advantages of the radix-8 Booth multipliers and at the same time eliminates the associated drawbacks by factoring out the complex Booth encoding logic and the hard-multiple 3Y computation. This significantly decreases the delay, area, and power of the whole systolic array. However, this gain in the performance of the FSA is achieved at the cost of an increased number of input registers in the PEs. A PE in the radix-8 FSA registers and forwards 3Y along with the multiplicand Y to the next PE, compared with Y alone in the PEs of the radix-4 systolic array. Similarly, the PEs of the radix-8 FSA register and forward 5·⌈(N+1)/3⌉ control bits to the next PEs, compared with the N-bit multiplier X in the PEs of the radix-4 systolic array. However, this increase in the number of input registers is compensated by the reduced pipeline cut-set register cost in the multipliers of the proposed radix-8 FSA, explained in the coming section.

3.5 Pipeline Cut-set Cost in Booth Multipliers
The computing systems for machine learning are usually highly optimized for throughput. Achieving a high throughput of the overall system necessitates the pipelining of the multiplier's datapath.
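The recoding of Eq. (2) can also be viewed per group as one signed digit d_i in {-4, ..., 4} that the control lines s, d, t, q, n encode. A hedged Python sketch of this digit view follows (function names are ours; the control-line logic itself stays in hardware):

```python
# Radix-8 Booth recoding viewed as signed digits: each group of three
# new bits plus a borrow bit maps to d_i = -4*x_{3i+2} + 2*x_{3i+1}
# + x_{3i} + x_{3i-1} in {-4,...,4}, and X*Y = sum(d_i * 8**i * Y).
# A digit with |d_i| == 3 selects the hard multiple 3Y.

def radix8_digits(X, N):
    bit = lambda k: (X >> k) & 1 if k >= 0 else 0   # x_{-1} = 0
    groups = -(-(N + 1) // 3)                       # ceil((N+1)/3) digits
    return [-4 * bit(3 * i + 2) + 2 * bit(3 * i + 1)
            + bit(3 * i) + bit(3 * i - 1) for i in range(groups)]

def booth8_multiply(X, Y, N):
    # Booth selection: each digit picks 0, +/-Y, +/-2Y, +/-3Y, or +/-4Y
    return sum(d * (8 ** i) * Y for i, d in enumerate(radix8_digits(X, N)))
```

For X = 13 (1101b) and N = 4 the digits are [-3, 2] (since 13 = -3 + 2·8), so the first group requires the hard multiple 3Y.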
The partial product reduction tree makes up the major part of a parallel multiplier. The pipelining of a multiplier's datapath requires the placement of balanced cut-set registers at the Wallace tree levels, which is executed by registering all the partial product bits at that level. Thus, the number of partial product bits at any Wallace tree level decides the size of the pipeline cut-set. Figure 4 shows a summary of the pipeline cut-set sizes at various Wallace tree levels in radix-4 and factored radix-8 Booth multipliers. Since shifted partial products of various widths are compressed at each Wallace tree level, the widths of the intermediate partial products vary. In Figure 4, e_i and e'_i denote positive integers, and note that e_i, e'_i ≪ M. Table 1 lists the actual number of partial product bits in 8-, 16-, and 32-bit multipliers. Figure 4 and Table 1 show that the number of partial product bits in the factored radix-8 multiplier is about 2/3 of that in radix-4, and that the same 3:2 ratio of partial product bits is maintained at the same levels of the Wallace trees in these multipliers. Thus, the lower number of partial product bits in the Wallace tree of the factored radix-8 multiplier significantly reduces its pipeline cut-set register cost compared with that of radix-4.

Moreover, the datapath of the factored radix-8 multiplier doesn't include the Booth encoding; hence, a balanced pipeline cut-set tends to be placed at a lower level of the Wallace tree, with a further reduced number of partial products and thus a further reduction in the pipeline cut-set cost. This reduced usage of pipeline registers in factored radix-8 multipliers offsets the aforementioned growth in its input registers in comparison with radix-4. It should be noted that the lower height of the factored radix-8 partial products, along with the extraction of the Booth encoding, enables it to achieve a lower critical path delay not only in the non-pipelined case but also in the pipelined versions, at a significantly low cost.

Figure 4: Pipeline cut-set cost comparison in the proposed radix-8 and radix-4 multipliers. The pipeline cut-set size of the proposed multiplier is about 33% less than that of the conventional radix-4 multiplier at a corresponding level, reducing the number of required pipeline registers substantially.

Table 1: Number of partial product bits at each Wallace tree level in R4 and FR8 multipliers.

                      8-bit       16-bit       32-bit
  Wallace tree level  R4    FR8   R4    FR8    R4    FR8
  Level-0             53    35    169   119    593   395
  Level-1             43    31    131   94     431   300
  Level-2             37    28    105   81     316   246
  Level-3             29    -     92    69     259   191
  Level-4             -     -     76    59     199   159
  Level-5             -     -     62    -      168   123
  Level-6             -     -     -     -      123   -

4. EVALUATION AND ANALYSIS

4.1 Evaluation Setup & Baselines
We use the conventional radix-4 (R4) and basic radix-8 (R8) multipliers and the corresponding systolic arrays as the baselines. We compare the proposed factored radix-8 (FR8) designs to the baselines at the wordlengths WL = 8, 16, and 32. A Wallace tree using CSAs (3:2 compressors) is used for the reduction of the partial products in designing the multipliers. The multiplier designs include the input and the output registers. In a systolic array, the registers that propagate the (pre-processed) inputs across PEs are the input registers of the multipliers, so the factored multiplier design includes the cost as well as the benefit of the factoring.

All the designs are described in Verilog and verified using Synopsys VCS. The designs are synthesized and mapped to an industrial 32nm standard cell library using the Synopsys Design Compiler. The post-synthesis gate-level simulation is performed again using Synopsys VCS and generates a switching activity interchange format (SAIF) file using random vectors. The power dissipation is estimated by the PrimeTime-PX tool by annotating the SAIF file to the netlist. The power consumption is measured at a 1.5ns clock period unless stated otherwise. All the experiments are performed on a Linux machine with 128GB of memory. This memory capacity limits the largest systolic array size in the experiments.

Table 2: Performance comparison of radix-4, radix-8, and factored radix-8 multipliers. The values in parentheses are normalized to R4.

  WL  Design  Area (µm²)    Power (µW)    Delay (ns)   PDP (ns·µW)   ADP (ns·µm²)
  8   R4      1869  (1.00)  0.95  (1.00)  1.12 (1.00)  1.06  (1.00)  2093  (1.00)
      R8      1931  (1.03)  0.85  (0.89)  1.05 (0.94)  0.89  (0.84)  2028  (0.97)
      FR8     1625  (0.87)  0.79  (0.83)  0.99 (0.88)  0.78  (0.74)  1609  (0.77)
  16  R4      6413  (1.00)  3.44  (1.00)  1.48 (1.00)  5.09  (1.00)  9491  (1.00)
      R8      6416  (1.00)  3.36  (0.98)  1.49 (1.01)  5.01  (0.98)  9560  (1.01)
      FR8     5551  (0.87)  2.65  (0.77)  1.26 (0.85)  3.34  (0.66)  6994  (0.74)
  32  R4      22926 (1.00)  12.10 (1.00)  1.29 (1.00)  15.61 (1.00)  29575 (1.00)
      R8      22196 (0.97)  11.30 (0.93)  1.36 (1.05)  15.37 (0.98)  30187 (1.02)
      FR8     18305 (0.80)  8.49  (0.70)  1.14 (0.88)  9.68  (0.62)  20868 (0.71)

4.2 Multipliers
Balancing the area and the delay, we insert one pipeline cut-set into the 8-bit and 16-bit multipliers and two pipeline cut-sets into the 32-bit multipliers. The inserted pipeline registers are placed at delay-optimal positions using retiming. Table 2 compares the area, power, delay, power-delay product (PDP), and area-delay product (ADP) of the R4, R8, and FR8 multipliers. All the listed results in Table 2 are normalized with respect to those of the corresponding R4.
Although R8 consumes a little less power than R4, it does not provide significant benefits in area or power. This explains why R8 is not usually employed in practice. Compared to R8, FR8 does not have the adder for 3Y or the Booth encoder. Compared to R4, FR8 has the reduced height of the partial products owing to radix 8 and has fewer pipeline registers. Therefore, although FR8 contains a larger number of input registers than R4 and R8, it shows significant improvements in all the metrics over R4 and R8. The improvements are made across all the wordlengths. The area, power, and delay of FR8 are 13-20%, 17-30%, and 12-15% less than R4, respectively. The PDP and ADP of FR8 are 26-38% and 23-29% better than those of R4.

4.2.1 Sequential Area and Power Inversion
The proposed factoring reduces the delay and the combinational area significantly, but it comes at the cost of an increased number of registers. Although the added registers take a small area compared to the reduced combinational area, registers take a significant portion of the power consumption, and one may be concerned that the added registers degrade the overall power consumption. The idle-power increase may not be significant owing to the clock-gating technique, but the active power can increase. However, our proposed FR8 design can have a similar or smaller sequential area compared to R4 because of the fewer pipeline registers used. Figures 6a and 6b show the sequential area, delay, combinational power, and sequential power of the 32-bit R4 and FR8 as the number of pipeline stages varies. The power consumption in this experiment is measured at a 3ns clock period. The single-stage design means the non-pipelined version. The delay is reduced substantially even when the number of pipeline stages increases from 2 to 3, which makes the 3-stage multiplier a reasonable design. Figure 6a also breaks the sequential area into the area of the input and output (IO) registers and the area of the pipeline registers. FR8 has a larger number of IO registers than R4 due to the factoring, resulting in a larger sequential area and a higher sequential power consumption in the non-pipelined version. However, when a pipeline cut-set is inserted for the two-stage design, the overall sequential areas and powers become similar due to the fewer pipeline registers in FR8 than in R4. In the three-stage designs, the sequential areas and powers are finally inverted, making FR8 better in all aspects.

Figure 6: (a) Sequential area and delay. (b) Combinational and sequential power. When the number of pipeline stages increases, FR8 outperforms R4 even in terms of sequential area and sequential power.

4.3 Systolic Arrays
Table 4 shows the area and the power breakdown for the 16-bit R4 and FR8 systolic arrays of size 32×32. The R4 SA is just a grid of PEs, and the PE area takes up the whole SA area. In the R4 SA, the multipliers (Mult) occupy 86.13% of the whole SA area, so improvements in the multiplier design have a significant impact on this large block of 7.09 mm². The area of the others (Other), including the accumulators, takes up the rest. The arithmetic circuits usually have a higher switching activity than control logic circuits, and the combinational (Comb) power consumption accounts for 61.08% of the total SA power consumption. The clock network (ClkNet) and the registers (Reg) consume 38.92%. In FR8, the Booth encoder (BEncoder) and the 3Y adder (3YAdder) are extracted, but together they occupy just 0.39% of the area already in the 32×32 array. Compared to R4, the sequential power (ClkNet+Reg) increases slightly by 0.1W, but the combinational power is reduced by 0.9W.

Table 4: Area and power breakdown of 16-bit 32×32 systolic arrays.

        Area by Module (mm²)          Power by Component (W)
  R4    PE        7.09                ClkNet  1.1 (29.49%)
          Mult    6.11 (86.13%)       Reg     0.3 (9.43%)
          Other   0.98 (13.87%)       Comb    2.2 (61.08%)
        Total     7.09                Total   3.6
  FR8   PE        6.45 (99.61%)       ClkNet  1.2 (42.38%)
          Mult    5.03 (77.68%)       Reg     0.3 (10.75%)
          Other   1.42 (21.93%)       Comb    1.3 (46.87%)
        BEncoder  0.01 (0.14%)
        3YAdder   0.02 (0.25%)
        Total     6.48                Total   2.8

Table 3 compares the area, delay, power, PDP, and ADP of the R4, R8, and FR8 SAs at the sizes 16×16, 32×32, 64×64, and 128×128 and at the bitwidths WL = 8, 16, and 32, while Figure 5 shows the improvement in the performance of the proposed FR8 SA over the conventional R4 SA. The delay of the SAs doesn't depend on the SA size, so the delay reduction remains constant as the SA size changes. The area and power reductions from R4 increase as the size of the SA increases because the cost of the extracted logic is amortized over the PEs. The area reduction is prominent at 32-bit because the 3-stage pipeline is employed and FR8 reduces even the sequential area from R4. The delay reduction at 32-bit is less than that at 8-bit and 16-bit because the critical path appears in the accumulator. FR8 outperforms

Table 3: Implementation results for 8-, 16-, and 32-bit systolic arrays of various sizes. The PDP and ADP of R8 are omitted due to limited space.

                 Radix-4 (R4)                              Radix-8 (R8)           Proposed Design (FR8)
  WL  Size   Area    Delay  Power  PDP    ADP       Area    Delay  Power    Area    Delay  Power  PDP    ADP
             (mm²)   (ns)   (W)    (ns·W) (ns·mm²)  (mm²)   (ns)   (W)      (mm²)   (ns)   (W)    (ns·W) (ns·mm²)
  8   16     0.54    1.15   0.26   0.30   0.62      0.55    1.14   0.24     0.50    0.99   0.23   0.23   0.50
      32     2.16    1.15   1.04   1.19   2.48      2.20    1.14   0.96     2.00    0.99   0.91   0.90   1.98
      64     8.63    1.15   4.15   4.77   9.93      8.78    1.14   3.83     7.96    0.99   3.63   3.59   7.88
      128    34.54   1.15   16.61  19.10  39.72     35.13   1.14   15.33    31.79   0.99   14.48  14.33  31.47
  16  16     1.77    1.48   0.91   1.35   2.62      1.74    1.59   0.85     1.62    1.26   0.72   0.90   2.05
      32     7.09    1.48   3.64   5.38   10.50     6.96    1.59   3.39     6.47    1.26   2.85   3.59   8.16
      64     28.37   1.48   14.55  21.54  41.99     27.83   1.59   13.56    25.85   1.26   11.36  14.31  32.57
      128    113.48  1.48   58.15  86.06  167.95    111.33  1.59   54.19    103.28  1.26   45.38  57.18  130.14
  32  16     5.89    1.37   3.14   4.31   8.06      5.67    1.36   2.88     4.72    1.22   2.17   2.65   5.76
      32     23.54   1.37   12.58  17.23  32.25     22.69   1.36   11.52    18.84   1.22   8.67   10.58  22.99
      64     94.16   1.37   50.29  68.90  129.00    90.78   1.36   46.09    75.26   1.22   34.63  42.25  91.82

Figure 5: Performance improvements of the factored radix-8 systolic arrays with respect to the corresponding conventional radix-4 systolic arrays. The proposed design outperforms the conventional design in all metrics. (The bars give the area, delay, power, PDP, and ADP reductions in percent for 16×16, 32×32, 64×64, and 128×128 PE arrays at 8, 16, and 32 bits.)

[2] Oren Rippel and Lubomir Bourdev. Real-Time Adaptive Image Compression. In Proceedings of the 34th International Conference on Machine Learning, pages 2922–2930. JMLR.org, 2017.
[3] Tom Young, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria. Recent trends in deep learning based natural language processing. IEEE Computational Intelligence Magazine, 13(3):55–75, 2018.
[4] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.
[5] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-Datacenter Performance Analysis of a Tensor Processing Unit. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pages 1–12. IEEE, 2017.
[6] Yu-Hsin Chen, Tushar Krishna, Joel S. Emer, and Vivienne Sze. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks. IEEE Journal of Solid-State Circuits, 52(1):127–138, 2016.
[7] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. EIE: Efficient Inference Engine on Compressed Deep Neural Network. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pages 243–254. IEEE, 2016.
[8] Andrew Lavin and Scott Gray. Fast Algorithms for Convolutional Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4013–4021, 2016.
[9] Minsik Cho and Daniel Brand. MEC: Memory-Efficient Convolution for Deep Neural Network. In Proceedings of the 34th International Conference on Machine Learning, pages 815–824. JMLR.org, 2017.
[10] H. T. Kung and C. E. Leiserson. Algorithms for VLSI processor arrays. In C. Mead and L. Conway, editors, Introduction to VLSI Systems. Addison-Wesley, 1979.
[11] Hsiang-Tsung Kung. Why systolic architectures? IEEE Computer, 15(1):37–46, 1982.
[12] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep Learning with Limited Numerical Precision. In International Conference on Machine Learning, pages 1737–1746, 2015.
[13] Google. System Architecture. https://cloud.google.com/tpu/docs/system-architecture, 2019.
[14] Chen Xin, Qiang Chen, Miren Tian, Mohan Ji, Chenglong Zou, Xin'An Wang, and Bo Wang. COSY: An Energy-Efficient Hardware Architecture for Deep Convolutional Neural Networks Based on Systolic Array. In 2017 IEEE 23rd International Conference on Parallel and Distributed Systems (ICPADS), pages 180–189. IEEE, 2017.
[15] Hsiang-Tsung Kung, Bradley McDanel, and Sai Qian Zhang.
R4 in all metrics. The area, power, and delay of FR8 are Mapping systolic arrays onto 3d circuit structures: Accelerating
7.0 − 20.1%, 10.9 − 14.9%, 11.8 − 31.1% less than those of convolutional neural network inference. In 2018 IEEE
International Workshop on Signal Processing Systems (SiPS),
R4, respectively. The PDP and ADP performance of FR8 pages 330–336. IEEE, 2018.
is 24.1 − 38.7% and 20.0 − 28.8% better than those of R4, [16] HT Kung, Bradley McDanel, Sai Qian Zhang, CT Wang, Jin
respectively. Cai, CY Chen, Victor CY Chang, MF Chen, Jack YC Sun, and
Douglas Yu. Systolic building block for logic-on-logic 3d-ic
implementations of convolutional neural networks. In 2019
IEEE International Symposium on Circuits and Systems
5. CONCLUSIONS (ISCAS), pages 1–5. IEEE, 2019.
We have presented a novel systolic array based on factor- [17] Ananda Samajdar, Yuhao Zhu, Paul Whatmough, Matthew
ing and radix-8 and have demonstrated that the proposed Mattina, and Tushar Krishna. SCALE-Sim: Systolic CNN
Accelerator Simulator. arXiv preprint arXiv:1811.02883, 2018.
design can achieve significant reductions in area, delay, and [18] James Garland and David Gregg. Low Complexity Multiply
power from the conventional radix-4 design providing ex- Accumulate Unit for Weight-Sharing Convolutional Neural
actly the same functionality. The concept of the factored Networks. IEEE Computer Architecture Letters,
16(2):132–135, 2017.
systolic array can be realized in many different ways and
[19] Sungju Ryu, Naebeom Park, and Jae-Joon Kim.
non-conventional multipliers such as radix-8 can be practi- Feedforward-Cutset-Free Pipelined Multiply–Accumulate Unit
cal with the factored systolic array. We believe that more for the Machine Learning Accelerator. IEEE Transactions on
exploration is required in that research path, which may be Very Large Scale Integration (VLSI) Systems, 27(1):138–146,
2018.
our future work. [20] Alberto A Del Barrio, Román Hermida, and Seda
Ogrenci-Memik. A Combined Arithmetic-High-Level Synthesis
Solution to Deploy Partial Carry-Save Radix-8 Booth
6. REFERENCES Multipliers in Datapaths. IEEE Transactions on Circuits and
[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Systems I: Regular Papers, 66(2):742–755, 2018.
ImageNet Classification with Deep Convolutional Neural [21] Andrew D Booth. A signed binary multiplication technique.
Networks. In Advances in neural information processing The Quarterly Journal of Mechanics and Applied
systems, pages 1097–1105, 2012. Mathematics, 4(2):236–240, 1951.
[22] Milos D Ercegovac and Tomas Lang. Digital Arithmetic.
Elsevier, 2004.
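As a closing illustration, the factoring idea evaluated above can be sketched behaviorally in a few lines. This is our own illustrative sketch, not the paper's RTL: the function names (`booth8_digits`, `shared_multiples`, `pe_product`) are ours, and the model ignores pipelining and carry-save details. It shows why extracting the Booth encoder and the 3Y adder pays off: the Booth digits and the hard multiple 3Y are computed once per shared operand, and each PE is left with only selection, negation, shift, and accumulation.

```python
def booth8_digits(x, bits=16):
    """Radix-8 Booth recoding of a two's-complement value into signed
    digits d_i in {-4, ..., 4} such that x == sum(d_i * 8**i).
    In the FSA this runs once in the shared Booth encoder (BEncoder)."""
    assert -(1 << (bits - 1)) <= x < (1 << (bits - 1))
    bit = lambda i: (x >> i) & 1 if i >= 0 else 0  # Python >> sign-extends
    return [bit(3*i - 1) + bit(3*i) + 2*bit(3*i + 1) - 4*bit(3*i + 2)
            for i in range((bits + 3) // 3)]

def shared_multiples(y):
    """Multiples a radix-8 PE selects from. Only 3y (the 'hard multiple')
    needs a carry-propagate adder -- the shared 3YAdder; 2y and 4y are
    just wire shifts."""
    return {0: 0, 1: y, 2: y << 1, 3: (y << 1) + y, 4: y << 2}

def pe_product(digits, mult):
    """Work remaining inside each PE: per digit, select a precomputed
    multiple, negate if needed, shift by 3i, and accumulate."""
    p = 0
    for i, d in enumerate(digits):
        m = mult[abs(d)]
        p += (-m if d < 0 else m) << (3 * i)
    return p
```

With the shared parts computed once, `pe_product(booth8_digits(x), shared_multiples(y))` reproduces `x * y` exactly, which is the "same functionality" property the conclusions claim for FR8 versus R4.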