Factored Systolic Array Tensor Processing
which performs the multiplication operation, and the product is accumulated in the accumulator register to form the partial result. Every cycle, the next batch of inputs is provided to the edge PEs and the previous inputs are forwarded to the next PEs. In the example given, PE11 performs the multiplication on the inputs a11 and b11 in clock cycle t1. In the second clock cycle t2, PE11 performs the multiplication on the next inputs a12 and b21 and forwards the previous cycle's inputs a11 and b11 to PE12 and PE21, respectively. In the same clock cycle, PE12 and PE21 perform the multiplications on the pairs (a11, b12) and (a21, b11), respectively. In the third clock cycle t3, PE11 performs the multiplication on the third inputs a13 and b31, and it outputs the final result c11 in the fourth cycle t4. PE12 and PE21 output the resultant elements c12 and c21 of matrix C in the fifth clock cycle. In this way, the values of the input matrices are propagated through the systolic array and the elements of the output matrix C are computed in a wave-front flow, as shown in Figure 1. In the same manner, this 3 × 3 systolic array is able to perform the multiplication of two matrices of sizes 3 × k and k × 3.
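The wave-front schedule described above (operands skewed at the edges, one-cycle register delay per hop) can be simulated in a few lines of Python; this is an illustrative model of the dataflow, not the authors' implementation:

```python
def systolic_matmul(A, B):
    # Cycle-level sketch of the wave-front dataflow: matrix A streams in from
    # the left edge, B from the top edge, and every PE multiply-accumulates
    # its two inputs before forwarding them to its right/bottom neighbours.
    n, k, m = len(A), len(B), len(B[0])
    assert all(len(row) == k for row in A)
    acc = [[0] * m for _ in range(n)]    # one accumulator per PE
    a_reg = [[0] * m for _ in range(n)]  # operand forwarded to the right
    b_reg = [[0] * m for _ in range(n)]  # operand forwarded downward
    for t in range(n + m + k - 2):       # enough cycles to drain the array
        # sweep PEs bottom-right to top-left so each PE reads the value its
        # neighbour held in the PREVIOUS cycle (register semantics)
        for i in reversed(range(n)):
            for j in reversed(range(m)):
                a_in = a_reg[i][j - 1] if j > 0 else (A[i][t - i] if 0 <= t - i < k else 0)
                b_in = b_reg[i - 1][j] if i > 0 else (B[t - j][j] if 0 <= t - j < k else 0)
                acc[i][j] += a_in * b_in
                a_reg[i][j], b_reg[i][j] = a_in, b_in
    return acc
```

With a 3 × 3 grid of PEs the same code multiplies 3 × k by k × 3 matrices for any k, as noted above; PE(i, j) sees the product A[i][l]·B[l][j] at cycle l + i + j.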
2.2 Related Works
Our work is mainly related to systolic array architecture and MAC unit design. Systolic arrays are usually designed for specific operations such as convolution [6, 14–16] and matrix multiplication [5, 17]. Due to the popularity of convolutional neural networks (CNNs), the majority of works target convolutions. Most of the existing studies on SAs are performed at the architectural level. They aim at minimizing memory accesses to reduce energy consumption and design dataflows to maximize data reuse and local accumulation. In [6], the authors target accelerating the convolution operations in CNNs, study existing dataflows such as no local reuse, weight stationary, and output stationary, and propose a novel dataflow called row stationary. In [14], data reuse in the row stationary dataflow is further exploited by sharing the storage of processing elements. In [17], dataflows of systolic arrays for matrix multiplication are studied. Several studies [15, 16] also consider the physical mapping of processing elements onto 3D ICs and enable data reuse along the third dimension, achieving further speed-up and lower energy consumption. Unlike these studies, our study deals with the logic-level design of systolic arrays. To the best of our knowledge, none of the existing works study SAs at the logic level.

3. PROPOSED DESIGN

3.1 A General Idea to Improve Systolic Arrays
We consider a systolic array where PEs are organized in a grid. The systolic array accepts the inputs for one matrix at the left edge and the inputs for the other matrix at the top edge, and produces the outputs at the top edge. Each PE consists of a MAC which takes two inputs X and Y and performs the multiplication and accumulation operations every clock cycle. The input of the MAC fed from the left side is considered the multiplier X, and the input fed from the top the multiplicand Y. Let P be the value stored in the accumulator when the accumulation is done and Q be the dot-product (i.e., the sum of the products). We assume that X, Y, and Q are represented in binary for compatibility, but P can be represented in any format. The MAC unit performs dedicated pre-processing on X and Y prior to the main processing. The P-dedicated post-processing takes the MAC output and produces Q. The modified booth encoding multiplier widely employed in conventional general-purpose processors performs booth recoding as the X-dedicated pre-processing. In the systolic array, the dedicated logic (DL) circuits for X-, Y-, and P-dedicated processing are replicated across PEs; we can factor them out and place them at the inputs and the outputs of the array. Then, the area of the XDL (YDL and PDL) circuits is amortized over the PEs in a row (column) and becomes marginal as the size of the array grows. Also, the delay of the DL circuits can be eliminated from the critical path. We call this structure the factored systolic array. The logic to convert the number format or for rounding may be placed naturally at the inputs and the outputs, but in the factored systolic array, we extract logic circuits that are traditionally considered a part of the MAC unit, such as booth encoders. In this structure, more aggressive, non-conventional approaches can be employed.

3.2 Parallel Multipliers in SAs
Multipliers dominate the datapath of the systolic array architecture. A parallel multiplier consists of three major parts: 1) partial product generation; 2) a partial product reduction tree (e.g., a Wallace tree); and 3) the addition of the pair of final sum and carry rows of the partial products using a carry propagate adder (CPA) to obtain the final product.
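The three parts can be mirrored in a short Python sketch (an illustrative model for unsigned operands, not the synthesized design; a word-level 3:2 compression stands in for the Wallace tree):

```python
def multiply_unsigned(x, y, m, n):
    # 1) Partial product generation: N rows of M-bit products,
    #    row i weighted by 2**i.
    xb = [(x >> j) & 1 for j in range(m)]
    yb = [(y >> i) & 1 for i in range(n)]
    rows = [sum((xb[j] & yb[i]) << (i + j) for j in range(m)) for i in range(n)]

    # 2) Reduction tree: 3:2 compressors (full adders applied to whole words)
    #    squeeze the rows down to a final sum/carry pair, as in a Wallace tree.
    while len(rows) > 2:
        a, b, c = rows[0], rows[1], rows[2]
        s = a ^ b ^ c                                 # bitwise sum
        carry = ((a & b) | (b & c) | (a & c)) << 1    # bitwise carry, shifted
        rows = rows[3:] + [s, carry]

    # 3) Carry propagate adder on the surviving sum/carry pair.
    return rows[0] + (rows[1] if len(rows) > 1 else 0)
```

Each 3:2 compression removes one row, so the loop models the logarithmic-depth reduction that the CPA then completes.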
Figure 2: (a) Conventional systolic array architecture. (b) Proposed factored systolic array architecture. XDL, YDL, and PDL are replicated across PEs and can be extracted and placed at the inputs and the outputs of the systolic array, which allows us to re-think what the best MAC design is for systolic arrays.

Figure 3: Factored radix-8 systolic array architecture. The circuit in the red-dotted box is considered the multiplier, which includes the input and the output registers.

Consider an M × N-bit multiplier such that X = x_{M-1} x_{M-2} · · · x_1 x_0 and Y = y_{N-1} y_{N-2} · · · y_1 y_0 are the multiplier and multiplicand, respectively. The N rows of M-bit partial products PP_{i,j} can be expressed as

    PP_{i,j} = x_i y_j ,  ∀ i, j,  0 ≤ i < M, 0 ≤ j < N,    (1)

which is taken from [20]. In fast multipliers, the modified Booth encoding algorithm is employed to reduce the height of the partial products [21, 22]. A radix-R = 2^r, r > 0, booth multiplier reduces the height of the partial products from N to ⌈(N+1)/r⌉, reducing the size and enhancing the speed of the reduction tree. The existing systolic array architectures employ radix-4 booth multipliers, which have ⌈(N+1)/2⌉ partial products.

3.3 Radix-8 Booth Multiplier
In radix-8 booth multipliers, the height of the partial products reduces to ⌈(N+1)/3⌉, which brings a noticeable reduction in the area of the partial product reduction tree and in the critical path delay. However, booth multipliers with a radix higher than 2 require hard multiples, which are not powers of two and are not obtainable using simple shift and complement operations. In a radix-8 booth multiplier, the multiples ±0, ±Y, ±2Y, ±3Y, ±4Y of the multiplicand Y are required, and the generation of the hard multiple 3Y involves the addition of Y and 2Y using a CPA, which slows down the multiplier.

The radix-8 booth recoding partitions the multiplier X into sets of 4 bits: 3 adjacent binary bits x_{3(i+1)-1} x_{3i+1} x_{3i} and a borrow bit x_{3i-1} from the previous set, each set encoded into the control lines s, d, t, q, n using

    s_i = (x_{3(i+1)-1} ⊕ x_{3i+1})′ ∧ (x_{3i} ⊕ x_{3i-1})
    d_i = (x_{3i} ⊕ x_{3i+1}) ∧ (x_{3i} ⊕ x_{3i-1})′
    t_i = (x_{3(i+1)-1} ⊕ x_{3i+1}) ∧ (x_{3i} ⊕ x_{3i-1})
    q_i = (x_{3(i+1)-1} ⊕ x_{3i+1}) ∧ (x_{3i} ⊕ x_{3i+1})′ ∧ (x_{3i} ⊕ x_{3i-1})′
    n_i = x_{3(i+1)-1}
        for i = 0, 1, 2, · · · , N/3 − 1,    (2)

where ′ denotes the logical complement, N is the width of the multiplier X, and x_{-1} = 0. These five control lines select a multiple of the multiplicand Y among ±0, ±Y, ±2Y, ±3Y, ±4Y. For an M × N radix-8 multiplier, 5⌈(N+1)/3⌉ control signals are generated in parallel. Except in the all-zero case, these five control lines use one-hot encoding, and we could re-design the coding to reduce the 5 lines to 4. However, this can incur an additional delay in the booth selection, so we use this standard coding as it is. The complexity of the radix-8 booth recoding logic is higher than that of radix-4, but we will show that the entire booth recoding logic can be removed from the individual multipliers in a systolic array.

3.4 Factored Radix-8 Systolic Array (FSA)
A radix-8 booth multiplier involves the dedicated pre-processing of complex booth recoding on the multiplier X and of 3Y = Y + 2Y generation on the multiplicand Y. When the radix-8 booth multiplier is employed in the PEs of the systolic array, the logic for the aforementioned dedicated pre-processing on X and Y is replicated across PEs. The main idea of this paper is to use radix-8 booth encoding in designing the MACs of the systolic array and, at the same time, to factor out this repeated dedicated pre-processing logic from the PEs and place it at the inputs.

Figure 3 shows the architecture of the proposed radix-8 FSA. The X inputs accepted from the left edge are pre-processed in parallel, and the control lines for the booth selection are forwarded to the corresponding rows of PEs. The Y inputs accepted from the top edge are pre-processed, and the resultant 3Y values along with the Y values are forwarded to the corresponding columns of PEs. In this way, the proposed FSA achieves the advantages of the radix-8 booth multipliers and at the same time eliminates the associated drawbacks by factoring out the complex booth encoding logic and the hard multiple 3Y computation. This significantly decreases the delay, area, and power of the whole systolic array. However, this gain in the performance of the FSA is achieved at the cost of an increased number of input registers in the PEs. A PE in the radix-8 FSA registers and forwards 3Y along with the multiplicand Y to the next PE, compared with Y alone in the PEs of the radix-4 systolic array. Similarly, the PEs of the radix-8 FSA register and forward 5⌈(N+1)/3⌉ control bits to the next PEs, compared with the N-bit multiplier X in the PEs of the radix-4 systolic array. However, this increase in the number of input registers is compensated by the reduced pipeline cut-set register cost in the multipliers of the proposed radix-8 FSA, explained in the coming section.

3.5 Pipeline Cut-set Cost in Booth Multipliers
The computing systems for machine learning are usually highly optimized for throughput. Achieving a high throughput of the overall system necessitates the pipelining of the multipliers.
The reduced pipeline cut-set cost in the factored radix-8 multipliers offsets the aforementioned growth in the input registers in comparison with radix-4. It should be noted that the lower height of the factored radix-8 partial products, along with the extraction of the booth encoding, enables it to achieve a lower critical path delay not only in the non-pipelined case but also in the pipelined versions, at a significantly low cost.

Figure 4: Pipeline cut-set cost comparison in the proposed radix-8 and radix-4 multipliers. The pipeline cut-set size of the proposed multiplier is about 33% less than that of the conventional radix-4 multiplier at a corresponding level, reducing the number of required pipeline registers substantially.

Table 1: Number of partial product bits at each Wallace tree level in R4 and FR8 multipliers (8-bit, 16-bit, and 32-bit).

Table 2: Performance comparison of radix-4, radix-8, and factored radix-8 multipliers. Values in parentheses are normalized to R4 at the same wordlength (WL).

WL  Design  Area (µm²)     Power (µW)    Delay (ns)   PDP (ns·µW)   ADP (ns·µm²)
8   R4      1869  (1.00)   0.95  (1.00)  1.12 (1.00)  1.06  (1.00)  2093  (1.00)
8   R8      1931  (1.03)   0.85  (0.89)  1.05 (0.94)  0.89  (0.84)  2028  (0.97)
8   FR8     1625  (0.87)   0.79  (0.83)  0.99 (0.88)  0.78  (0.74)  1609  (0.77)
16  R4      6413  (1.00)   3.44  (1.00)  1.48 (1.00)  5.09  (1.00)  9491  (1.00)
16  R8      6416  (1.00)   3.36  (0.98)  1.49 (1.01)  5.01  (0.98)  9560  (1.01)
16  FR8     5551  (0.87)   2.65  (0.77)  1.26 (0.85)  3.34  (0.66)  6994  (0.74)
32  R4      22926 (1.00)   12.10 (1.00)  1.29 (1.00)  15.61 (1.00)  29575 (1.00)
32  R8      22196 (0.97)   11.30 (0.93)  1.36 (1.05)  15.37 (0.98)  30187 (1.02)
32  FR8     18305 (0.80)   8.49  (0.70)  1.14 (0.88)  9.68  (0.62)  20868 (0.71)
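The radix-8 recoding and multiple selection can be sanity-checked in software. The sketch below is illustrative only: it uses the digit-value form of the recoding rather than the gate-level control lines of Eq. (2), and the function name and bit handling are our own. It does, however, exercise the same selection among ±{0, Y, 2Y, 3Y, 4Y} with a single pre-computed hard multiple 3Y — the step the FSA factors out of the PEs:

```python
def radix8_booth_multiply(x, y, n):
    # x: n-bit two's-complement multiplier; y: integer multiplicand.
    def bit(i):
        if i < 0:
            return 0              # x_{-1} = 0, as in the recoding above
        return (x >> min(i, n - 1)) & 1   # sign-extend beyond the MSB

    y3 = y + 2 * y                # hard multiple 3Y = Y + 2Y (one CPA)
    multiples = {0: 0, 1: y, 2: 2 * y, 3: y3, 4: 4 * y}

    product = 0
    ndigits = -(-n // 3)          # ceil(n/3) overlapping digit groups
    for i in range(ndigits):
        # booth digit from bits x_{3i+2} x_{3i+1} x_{3i} and borrow x_{3i-1}
        d = bit(3 * i - 1) + bit(3 * i) + 2 * bit(3 * i + 1) - 4 * bit(3 * i + 2)
        sel = multiples[abs(d)]   # one-hot selection among 0, Y, 2Y, 3Y, 4Y
        product += (-sel if d < 0 else sel) << (3 * i)
    return product
```

Each digit d_i ∈ {−4, …, 4} contributes d_i · Y · 8^i, so the accumulated shifts reproduce X · Y exactly.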
Figure 5: Performance improvements (area, power, delay, PDP, and ADP reductions in %, for 8-, 16-, and 32-bit wordlengths) of the factored radix-8 systolic arrays with respect to the corresponding conventional radix-4 systolic arrays. The proposed design outperforms the conventional design in all metrics.
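As a back-of-the-envelope check of the input-register overhead discussed in Section 3.4, the bits each PE must latch and forward can be tallied. The widths below are our assumptions (a square n × n-bit MAC, 3Y carried with two extra bits), not figures from the paper:

```python
import math

def per_pe_forwarded_bits(n):
    # Register bits a PE latches and forwards to its neighbours (Section 3.4).
    r4 = {"X": n, "Y": n}                            # conventional radix-4 PE
    fr8 = {"booth controls": 5 * math.ceil((n + 1) / 3),  # s, d, t, q, n lines
           "Y": n,
           "3Y": n + 2}                              # hard multiple, 2 extra bits
    return sum(r4.values()), sum(fr8.values())
```

For n = 16 this gives 32 bits for R4 versus 64 for FR8 — the growth that the roughly 33% smaller pipeline cut-sets (Figure 4) must offset.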
Figure 6: (a) Sequential area and delay. (b) Combinational and sequential power. When the number of pipeline stages increases, FR8 outperforms R4 even in terms of sequential area and sequential power.

The values in parentheses in Table 2 are normalized with respect to those of the corresponding R4. Although R8 consumes a little less power than R4, it does not provide significant benefits in area or power. This explains why R8 is not usually employed in practice. Compared to R8, FR8 does not have the adder for 3Y and the booth encoder. Compared to R4, FR8 has a reduced height of the partial products owing to radix 8 and has fewer pipeline registers. Therefore, although FR8 contains a larger number of input registers than R4 and R8, it shows significant improvements in all the metrics over R4 and R8. The improvements are made across all the wordlengths. The area, power, and delay of FR8 are 13−20%, 17−30%, and 12−15% less than those of R4, respectively. The PDP and ADP performance of FR8 are 26−38% and 23−29% better than those of R4.

4.2.1 Sequential Area and Power Inversion
The proposed factoring reduces the delay and the combinational area significantly, but it comes at the cost of an increased number of input registers, and one may be concerned that the added registers degrade the overall power consumption. The idle power increase may not be significant owing to the clock-gating technique, but the active power can increase. However, our proposed FR8 design can have a similar or smaller sequential area compared to R4 because of the fewer pipeline registers used. Figures 6a and 6b show the sequential area, delay, combinational power, and sequential power of 32-bit R4 and FR8 as the number of pipeline stages varies. The power consumption in this experiment is measured at a 3 ns clock period. The single-stage design means the non-pipelined version. The delay is reduced substantially even when the number of pipeline stages increases from 2 to 3, which makes the 3-stage multiplier a reasonable design. Figure 6a also breaks the sequential area into the area of the input and output (IO) registers and the area of the pipeline registers. FR8 has a larger number of IO registers than R4 due to the factoring, resulting in a larger sequential area and a higher sequential power consumption in the non-pipelined version. However, when a pipeline cut-set is inserted for the two-stage design, the overall sequential areas and powers become similar due to the fewer pipeline registers in FR8 than in R4. In the three-stage designs, the sequential areas and powers are finally inverted, making FR8 better in all aspects.
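The inversion can be illustrated with a toy register-count model, sequential cost = IO registers + (stages − 1) × cut-set registers. All counts below are hypothetical, chosen only to mirror the roughly 33% smaller cut-sets of Figure 4 and FR8's larger IO register count; they are not measured values:

```python
def sequential_cost(io_regs, cutset_regs, stages):
    # Toy model: IO registers are always present; each extra pipeline stage
    # inserts one cut-set worth of pipeline registers.
    return io_regs + (stages - 1) * cutset_regs

# Hypothetical counts: FR8 carries more IO bits (Y, 3Y, booth controls,
# output registers) but each of its pipeline cut-sets is ~33% smaller.
r4  = [sequential_cost(64, 300, s) for s in (1, 2, 3)]
fr8 = [sequential_cost(164, 200, s) for s in (1, 2, 3)]
```

FR8 starts larger in the single-stage (non-pipelined) design, roughly matches R4 at two stages, and wins at three stages — the same crossover behavior reported for Figure 6.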
4.3 Systolic Arrays
Table 4 shows the area and power breakdown for 16-bit R4 and FR8 systolic arrays of size 32 × 32. The R4 SA is just a grid of PEs, and the PE area takes up the whole SA area. In the R4 SA, the multipliers (Mult) occupy 86.13% of the whole SA area, so improvements in the multiplier design have a significant impact on this large block of 7.09 mm². The area of the others (Other), including the accumulators, takes up the rest. The arithmetic circuits usually have a higher switching activity than the control logic circuits, and the combinational (Comb) power consumption accounts for 61.08% of the total SA power consumption. The clock network (ClkNet) and the registers (Reg) consume 38.92%. In FR8, the booth encoder (BEncoder) and the 3Y adder (3YAdder) are extracted, but together they occupy just 0.39% of the area already in the 32 × 32 array. Compared to the R4, the sequential power (ClkNet+Reg) increases slightly by 0.1 W, but the combinational power is reduced by 0.9 W.

Table 4: Area and power breakdown of 16-bit 32 × 32 systolic arrays.

           Area by Module (mm²)        Power by Component (W)
R4   PE        7.09                    ClkNet  1.1 (29.49%)
     Mult      6.11 (86.13%)           Reg     0.3 (9.43%)
     Other     0.98 (13.87%)           Comb    2.2 (61.08%)
     Total     7.09                    Total   3.6
FR8  PE        6.45 (99.61%)           ClkNet  1.2 (42.38%)
     Mult      5.03 (77.68%)           Reg     0.3 (10.75%)
     Other     1.42 (21.93%)           Comb    1.3 (46.87%)
     BEncoder  0.01 (0.14%)
     3YAdder   0.02 (0.25%)
     Total     6.48                    Total   2.8

Table 3 compares the area, delay, power, PDP, and ADP of the R4, R8, and FR8 SAs at different sizes (16 × 16, 32 × 32, 64 × 64, and 128 × 128) and at different bitwidths (WL = 8, 16, and 32), while Figure 5 shows the improvement in the performance of the proposed FR8 SA over the conventional R4 SA. The delay of an SA does not depend on the SA size, so the delay reduction remains constant as the SA size changes. The area and power reductions from R4 increase as the size of the SA increases because the cost of the extracted logic is amortized over the PEs. The area reduction is prominent at 32-bit because the 3-stage pipeline is employed and FR8 reduces even the sequential area from R4. The delay reduction at 32-bit is less than that at 8-bit and 16-bit because the critical path appears in the accumulator. FR8 outperforms R4 in all metrics. The area, power, and delay of FR8 are 7.0−20.1%, 10.9−14.9%, and 11.8−31.1% less than those of R4, respectively. The PDP and ADP performance of FR8 are 24.1−38.7% and 20.0−28.8% better than those of R4, respectively.

5. CONCLUSIONS
We have presented a novel systolic array based on factoring and radix-8 encoding, and we have demonstrated that the proposed design can achieve significant reductions in area, delay, and power from the conventional radix-4 design while providing exactly the same functionality. The concept of the factored systolic array can be realized in many different ways, and non-conventional multipliers such as radix-8 become practical with the factored systolic array. We believe that more exploration is warranted along this research path, which may be our future work.

6. REFERENCES
[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[2] Oren Rippel and Lubomir Bourdev. Real-Time Adaptive Image Compression. In Proceedings of the 34th International Conference on Machine Learning, pages 2922–2930. JMLR.org, 2017.
[3] Tom Young, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria. Recent Trends in Deep Learning Based Natural Language Processing. IEEE Computational Intelligence Magazine, 13(3):55–75, 2018.
[4] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the Game of Go Without Human Knowledge. Nature, 550(7676):354, 2017.
[5] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-Datacenter Performance Analysis of a Tensor Processing Unit. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pages 1–12. IEEE, 2017.
[6] Yu-Hsin Chen, Tushar Krishna, Joel S. Emer, and Vivienne Sze. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks. IEEE Journal of Solid-State Circuits, 52(1):127–138, 2016.
[7] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. EIE: Efficient Inference Engine on Compressed Deep Neural Network. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pages 243–254. IEEE, 2016.
[8] Andrew Lavin and Scott Gray. Fast Algorithms for Convolutional Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4013–4021, 2016.
[9] Minsik Cho and Daniel Brand. MEC: Memory-Efficient Convolution for Deep Neural Network. In Proceedings of the 34th International Conference on Machine Learning, pages 815–824. JMLR.org, 2017.
[10] H. T. Kung and C. E. Leiserson. Algorithms for VLSI Processor Arrays. In C. Mead and L. Conway, editors, Introduction to VLSI Systems. Addison-Wesley, 1979.
[11] Hsiang-Tsung Kung. Why Systolic Architectures? IEEE Computer, 15(1):37–46, 1982.
[12] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep Learning with Limited Numerical Precision. In International Conference on Machine Learning, pages 1737–1746, 2015.
[13] Google. System Architecture. https://cloud.google.com/tpu/docs/system-architecture, 2019.
[14] Chen Xin, Qiang Chen, Miren Tian, Mohan Ji, Chenglong Zou, Xin'An Wang, and Bo Wang. COSY: An Energy-Efficient Hardware Architecture for Deep Convolutional Neural Networks Based on Systolic Array. In 2017 IEEE 23rd International Conference on Parallel and Distributed Systems (ICPADS), pages 180–189. IEEE, 2017.
[15] Hsiang-Tsung Kung, Bradley McDanel, and Sai Qian Zhang. Mapping Systolic Arrays onto 3D Circuit Structures: Accelerating Convolutional Neural Network Inference. In 2018 IEEE International Workshop on Signal Processing Systems (SiPS), pages 330–336. IEEE, 2018.
[16] H. T. Kung, Bradley McDanel, Sai Qian Zhang, C. T. Wang, Jin Cai, C. Y. Chen, Victor C. Y. Chang, M. F. Chen, Jack Y.-C. Sun, and Douglas Yu. Systolic Building Block for Logic-on-Logic 3D-IC Implementations of Convolutional Neural Networks. In 2019 IEEE International Symposium on Circuits and Systems (ISCAS), pages 1–5. IEEE, 2019.
[17] Ananda Samajdar, Yuhao Zhu, Paul Whatmough, Matthew Mattina, and Tushar Krishna. SCALE-Sim: Systolic CNN Accelerator Simulator. arXiv preprint arXiv:1811.02883, 2018.
[18] James Garland and David Gregg. Low Complexity Multiply Accumulate Unit for Weight-Sharing Convolutional Neural Networks. IEEE Computer Architecture Letters, 16(2):132–135, 2017.
[19] Sungju Ryu, Naebeom Park, and Jae-Joon Kim. Feedforward-Cutset-Free Pipelined Multiply–Accumulate Unit for the Machine Learning Accelerator. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 27(1):138–146, 2018.
[20] Alberto A. Del Barrio, Román Hermida, and Seda Ogrenci-Memik. A Combined Arithmetic-High-Level Synthesis Solution to Deploy Partial Carry-Save Radix-8 Booth Multipliers in Datapaths. IEEE Transactions on Circuits and Systems I: Regular Papers, 66(2):742–755, 2018.
[21] Andrew D. Booth. A Signed Binary Multiplication Technique. The Quarterly Journal of Mechanics and Applied Mathematics, 4(2):236–240, 1951.
[22] Milos D. Ercegovac and Tomas Lang. Digital Arithmetic. Elsevier, 2004.