
Computer Arithmetic for ML/Crypto Acceleration

December 29, 2023

Kashif Inayat

Computer Sciences-European Exascale Accelerator


Barcelona Supercomputing Center

Content
• Introduction
• Motivation
• Background
• Adders
• Multipliers
• Systolic Array Architecture
• Data Flows
• MACs
• Problem Statement
• Conclusions
Machine Learning

Machine learning is everywhere!

• Image classification
• Face recognition
• Optical character recognition
• Autonomous driving
Transistors Are Not Getting Faster

• Moore's law and Dennard scaling are slowing down
• General-purpose microprocessors are not getting faster or more efficient
• Specialized, domain-specific hardware is needed for significant improvements in speed and energy efficiency

[Figure: Slowdown of Moore's law and Dennard scaling. Source: Sze Tutorials, [1]]

Key Parameters to Consider

• Performance
  • Conventional arithmetic blocks inside a processing element (PE) usually have a long critical path
  • Latency, throughput
• Energy/Power
  • The MAC is the most power-consuming unit
• Hardware Cost
  • Chip storage, chip area
• Flexibility
  • Determines the range of possible trade-offs (precision, etc.)

[Figure: Rough energy costs for various operations in 45 nm, 0.9 V. Source: Computing's Energy Problem (and what we can do about it), Mark Horowitz, [2]]

Commercial ML Accelerators

• Google (TPUv1)
  • Extended in TPUv2, TPUv3 and TPUv4
• Tesla
  • 96x96 independent array in NPU
• Xilinx (XDNN) Engine
  • Alveo Accelerator Card
• NVIDIA (NVDLA)
  • Jetson Xavier
• IBM
  • 28x28 wavefront systolic array
• Samsung
  • 1024-MAC NPU
• Intel, etc.

[Figures: 2D spatial arrays and 2D systolic arrays. Sources: TPUv1, [3], Google; NPU, [4], Tesla; XDNN, [5], Xilinx; 28x28 wavefront SA, [6], IBM.]

Matrix Multiplication

[Figure: Convolution and matrix multiplication. Source: Sze Tutorials, [1].]

• Matrix multiplication is the key primitive in machine learning


• Systolic arrays perform matrix multiplication with high data reuse
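
One way to see the link between convolution and matrix multiplication is to flatten every input window into a row and the filter into a column. The following is a minimal pure-Python sketch of that lowering (often called im2col). The function names `conv2d_direct` and `conv2d_as_matmul` are illustrative and not from the slides, and stride 1 with no padding is assumed.

```python
def conv2d_direct(image, kernel):
    """Valid 2D convolution in the cross-correlation form used in ML."""
    h, w, k = len(image), len(image[0]), len(kernel)
    out = []
    for r in range(h - k + 1):
        row = []
        for c in range(w - k + 1):
            acc = 0
            for i in range(k):
                for j in range(k):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out


def conv2d_as_matmul(image, kernel):
    """Same result as a (windows x k*k) by (k*k x 1) matrix product."""
    h, w, k = len(image), len(image[0]), len(kernel)
    out_w = w - k + 1
    # Each row of `patches` is one flattened k x k input window.
    patches = [
        [image[r + i][c + j] for i in range(k) for j in range(k)]
        for r in range(h - k + 1)
        for c in range(out_w)
    ]
    weights = [kernel[i][j] for i in range(k) for j in range(k)]
    flat = [sum(p * wgt for p, wgt in zip(patch, weights)) for patch in patches]
    # Reshape the flat column back into a 2D output map.
    return [flat[r * out_w:(r + 1) * out_w] for r in range(h - k + 1)]
```

With several filters, `weights` becomes a matrix with one column per filter, which is the shape of workload a systolic array consumes.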
Adders
• Ripple Carry Adder
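
The slide's figure is not reproduced here. As a rough bit-level model, a ripple carry adder chains one full adder per bit, so the carry has to travel through every stage. The names `full_adder` and `ripple_carry_add` below are illustrative only.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_carry_add(x, y, n_bits):
    """Add two unsigned n_bits-wide integers; returns (sum mod 2^n_bits, carry_out)."""
    carry = 0
    result = 0
    for i in range(n_bits):
        # Stage i cannot finish before the carry from stage i-1 is known.
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry
```

The loop-carried `carry` variable is the software counterpart of the critical path, which grows linearly with the word length.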

Adders
• Carry Save Adders
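
A carry save adder reduces three operands to two (a sum vector and a carry vector) without propagating carries between bit positions. The sketch below models that 3:2 compression on non-negative Python integers; `carry_save_add` and `sum_with_csa` are hypothetical helper names, not taken from the slides.

```python
def carry_save_add(x, y, z):
    """3:2 compressor: returns (sum_vector, carry_vector) with x + y + z == sum + carry."""
    sum_vec = x ^ y ^ z                               # per-bit sum, no carry chain
    carry_vec = ((x & y) | (x & z) | (y & z)) << 1    # per-bit majority, weighted by 2
    return sum_vec, carry_vec


def sum_with_csa(operands):
    """Add many non-negative integers using CSA stages and one final carry-propagate add."""
    ops = list(operands)
    while len(ops) > 2:
        s, c = carry_save_add(ops.pop(), ops.pop(), ops.pop())
        ops.extend([s, c])
    return sum(ops)  # the only step that needs a carry-propagate adder
```

Each CSA stage has the delay of a single full adder regardless of word length, which is why reduction trees in multipliers are built from them.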

Multipliers
• String Property (SP)
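
The slide's figure is not available here. The identity usually meant by the string property in Booth recoding, stated as an assumption about what the slide illustrates, is that a run of consecutive ones from bit i up to bit j can be replaced by one addition and one subtraction:

```latex
\sum_{k=i}^{j} 2^{k} = 2^{j+1} - 2^{i},
\qquad \text{e.g. } 0111\,1000_2 = 2^{7} - 2^{3} = 120 .
```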

Multipliers
• Radix-4 Booth Multiplier

Multipliers
• Radix-4 Booth Multiplier Truth Table using String Property
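
The truth table itself did not survive extraction. The standard radix-4 Booth recoding table is reproduced below as a Python dictionary, together with a small multiplier that uses it. This is a sketch under stated assumptions: the multiplier X is treated as an unsigned N-bit value, which needs ⌈(N+1)/2⌉ digits, and the names `BOOTH4_TABLE`, `booth4_digits` and `booth4_multiply` are illustrative.

```python
# Radix-4 Booth truth table: (x[2i+1], x[2i], x[2i-1]) -> signed digit.
BOOTH4_TABLE = {
    (0, 0, 0): 0,
    (0, 0, 1): +1,
    (0, 1, 0): +1,
    (0, 1, 1): +2,
    (1, 0, 0): -2,
    (1, 0, 1): -1,
    (1, 1, 0): -1,
    (1, 1, 1): 0,
}


def bit(x, i):
    """Bit i of x, with the implicit bit x[-1] = 0."""
    return (x >> i) & 1 if i >= 0 else 0


def booth4_digits(x, n_bits):
    """Recode an unsigned n_bits-wide multiplier into radix-4 Booth digits (LSD first)."""
    assert 0 <= x < (1 << n_bits)
    num_digits = (n_bits + 2) // 2  # ceil((n_bits + 1) / 2)
    return [
        BOOTH4_TABLE[(bit(x, 2 * i + 1), bit(x, 2 * i), bit(x, 2 * i - 1))]
        for i in range(num_digits)
    ]


def booth4_multiply(x, y, n_bits):
    """Sum of partial products d_i * y * 4^i; every d_i * y is a shift and/or negation of y."""
    return sum(d * y * 4 ** i for i, d in enumerate(booth4_digits(x, n_bits)))
```

For example, `booth4_digits(182, 8)` gives `[-2, 2, -1, -1, 1]`, and -2 + 2·4 - 16 - 64 + 256 = 182.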

Multipliers
• Radix-8 Booth Multiplier

Multipliers
• Radix-8 Booth Multiplier Truth Table using String Property
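
For radix-8, each digit is formed from four bits (three new bits plus the top bit of the previous group) and lies in the range -4 to +4. The sketch below assumes an unsigned N-bit multiplier and uses hypothetical names (`booth8_digits`, `booth8_multiply`). It precomputes the multiples of Y once, including 3Y = Y + 2Y, the only multiple that is not a plain shift.

```python
def booth8_digits(x, n_bits):
    """Recode an unsigned n_bits-wide multiplier into radix-8 Booth digits (LSD first)."""
    assert 0 <= x < (1 << n_bits)
    num_digits = (n_bits + 3) // 3  # ceil((n_bits + 1) / 3)
    digits = []
    prev = 0  # implicit bit x[-1] = 0
    for i in range(num_digits):
        b0 = (x >> (3 * i)) & 1
        b1 = (x >> (3 * i + 1)) & 1
        b2 = (x >> (3 * i + 2)) & 1
        digits.append(-4 * b2 + 2 * b1 + b0 + prev)
        prev = b2
    return digits


def booth8_multiply(x, y, n_bits):
    """Multiply using radix-8 Booth digits and precomputed multiples of y."""
    multiples = {
        0: 0,
        1: y,
        2: y << 1,
        3: y + (y << 1),  # the "hard multiple": needs a real addition
        4: y << 2,
    }
    product = 0
    for i, d in enumerate(booth8_digits(x, n_bits)):
        partial = multiples[abs(d)] << (3 * i)
        product += partial if d >= 0 else -partial
    return product
```

Compared with radix-4, fewer partial products enter the reduction tree, at the price of generating 3Y up front.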

Multipliers
• Sign Extension

Multipliers
• Montgomery modular multiplier
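
The slide's figure is missing, so the variant presented is unknown. As one common hardware-friendly form, the bit-serial (radix-2) Montgomery multiplication can be sketched as follows. It assumes an odd modulus n with 1 < n < 2^k and operands already reduced modulo n; `montgomery_multiply` and `mod_mul` are illustrative names.

```python
def montgomery_multiply(a, b, n, k):
    """Return a * b * 2^(-k) mod n for odd n < 2^k, one multiplier bit per iteration."""
    assert n % 2 == 1 and 1 < n < (1 << k)
    assert 0 <= a < n and 0 <= b < n
    acc = 0
    for i in range(k):
        acc += ((a >> i) & 1) * b   # conditionally add b
        if acc & 1:                 # make acc even by adding the odd modulus
            acc += n
        acc >>= 1                   # exact division by 2; acc stays below 2n
    return acc - n if acc >= n else acc


def mod_mul(a, b, n, k):
    """Plain a * b mod n, going through the Montgomery domain and back."""
    r2 = (1 << (2 * k)) % n                     # R^2 mod n with R = 2^k
    a_m = montgomery_multiply(a, r2, n, k)      # a * R mod n
    b_m = montgomery_multiply(b, r2, n, k)      # b * R mod n
    ab_m = montgomery_multiply(a_m, b_m, n, k)  # a * b * R mod n
    return montgomery_multiply(ab_m, 1, n, k)   # strip the factor R
```

The loop uses only additions and right shifts, with no trial division by n, which is what makes the method attractive for crypto accelerators.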

Multipliers
• School Method-Decomposed
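
Assuming the slide refers to splitting each operand into a high and a low half, the schoolbook method computes the product from four half-width multiplications. A minimal sketch, with the illustrative name `school_multiply_decomposed`:

```python
def school_multiply_decomposed(x, y, n_bits):
    """Multiply non-negative integers via four half-width products."""
    half = n_bits // 2
    mask = (1 << half) - 1
    x1, x0 = x >> half, x & mask   # x = x1 * 2^half + x0
    y1, y0 = y >> half, y & mask   # y = y1 * 2^half + y0
    return ((x1 * y1) << (2 * half)) + ((x1 * y0 + x0 * y1) << half) + x0 * y0
```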

Multipliers
• Karatsuba–Ofman algorithm
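
The Karatsuba-Ofman algorithm replaces the four half-width products of the schoolbook split with three, by computing the middle term as (x1 + x0)(y1 + y0) - x1·y1 - x0·y0. The recursive sketch below is illustrative; the function name and the 8-bit base-case threshold are assumptions.

```python
def karatsuba(x, y, n_bits):
    """Multiply non-negative integers of at most n_bits bits with three recursive products."""
    if n_bits <= 8:
        return x * y  # small enough for a direct multiplier
    half = n_bits // 2
    mask = (1 << half) - 1
    x1, x0 = x >> half, x & mask
    y1, y0 = y >> half, y & mask
    high = karatsuba(x1, y1, n_bits - half)
    low = karatsuba(x0, y0, half)
    # One extra bit is needed because x1 + x0 can overflow the half width.
    mid = karatsuba(x1 + x0, y1 + y0, n_bits - half + 1) - high - low
    return (high << (2 * half)) + (mid << half) + low
```

The saving of one multiplication per level is paid for with extra additions and subtractions, so the break-even word length depends on the target technology.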

Systolic Array Architecture

A × B = C

[Figure: A 3x3 systolic array for matrix multiplication. Each PE (PE11 to PE33) contains a multiplier (×) and an adder (+) and accumulates one output Cij. Row i of A enters from the left in the order ai1, ai2, ai3; the elements of B enter from the top. Inputs are skewed over cycles t1 to t5.]

Input schedule for B: t1: b11; t2: b12, b21; t3: b13, b22, b31; t4: b23, b32; t5: b33

Cycle at which each output is available:
t4: c11   t5: c12   t6: c13
t5: c21   t6: c22   t7: c23
t6: c31   t7: c32   t8: c33
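
The timing in the figure can be reproduced with a small cycle-by-cycle model. The sketch below assumes an output-stationary array in which row i of A is delayed by i cycles and column j of B by j cycles; `systolic_matmul` is an illustrative name, and the model ignores pipeline registers inside the PEs.

```python
def systolic_matmul(a, b):
    """Simulate an n x n output-stationary systolic array; returns (C, ready_at)."""
    n = len(a)
    a_reg = [[0] * n for _ in range(n)]   # A operand held in each PE, moves right
    b_reg = [[0] * n for _ in range(n)]   # B operand held in each PE, moves down
    c = [[0] * n for _ in range(n)]       # accumulator that stays in each PE
    ready_at = [[0] * n for _ in range(n)]

    for t in range(1, 3 * n - 1):         # cycles t1 .. t(3n-2)
        # Shift operands one PE to the right / down (far side first).
        for i in range(n):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(n):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # Feed the edges; zeros are injected outside the skewed window.
        for i in range(n):
            k = t - 1 - i
            a_reg[i][0] = a[i][k] if 0 <= k < n else 0
        for j in range(n):
            k = t - 1 - j
            b_reg[0][j] = b[k][j] if 0 <= k < n else 0
        # Every PE performs one multiply-accumulate per cycle.
        for i in range(n):
            for j in range(n):
                c[i][j] += a_reg[i][j] * b_reg[i][j]
                if t == i + j + n:        # last operand pair for this PE just arrived
                    ready_at[i][j] = t + 1
    return c, ready_at
```

For a 3x3 input, `ready_at` comes out as `[[4, 5, 6], [5, 6, 7], [6, 7, 8]]`, matching the t4 to t8 labels in the figure.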
Data Flows

• Output Stationary (OS)
• Input Stationary (IS)
• Weight Stationary (WS)

[Figure: Dataflows in systolic arrays. Source: Scale-sim, [7]]
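
As a loop-nest view of the three dataflows (a simplification, not the Scale-sim implementation), the operand that is "stationary" is the one fixed in the outer loops while the other values stream past it in the innermost loop. All function names are illustrative.

```python
def zeros(rows, cols):
    return [[0] * cols for _ in range(rows)]


def matmul_output_stationary(a, w):
    m, k, n = len(a), len(w), len(w[0])
    c = zeros(m, n)
    for i in range(m):
        for j in range(n):
            acc = 0                      # partial sum stays in place
            for kk in range(k):
                acc += a[i][kk] * w[kk][j]
            c[i][j] = acc
    return c


def matmul_weight_stationary(a, w):
    m, k, n = len(a), len(w), len(w[0])
    c = zeros(m, n)
    for kk in range(k):
        for j in range(n):
            weight = w[kk][j]            # weight stays in place
            for i in range(m):           # inputs stream past it
                c[i][j] += a[i][kk] * weight
    return c


def matmul_input_stationary(a, w):
    m, k, n = len(a), len(w), len(w[0])
    c = zeros(m, n)
    for i in range(m):
        for kk in range(k):
            x = a[i][kk]                 # input stays in place
            for j in range(n):           # weights stream past it
                c[i][j] += x * w[kk][j]
    return c
```

All three orderings compute the same product; they differ in which operand is reused locally and which must be moved every cycle.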


Radix-4 vs. Radix-8 Multiply-Accumulate (MAC) Unit

• Radix-4: number of partial products N → ⌈(N+1)/2⌉
• Radix-8: number of partial products N → ⌈(N+1)/3⌉
  • Hard multiple problem: 3Y = Y + 2Y (requires a carry-propagate adder)

[Figure: Radix-4 microarchitecture. Inputs X and Y; multiplier made of Booth recoding (of X), Booth selection, a Wallace reduction tree and a CPA; followed by a CPA and the accumulator.]
[Figure: Radix-8 microarchitecture. Same blocks as radix-4, plus a Y+2Y unit that generates the hard multiple.]
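
To make the two partial-product formulas concrete, the short sketch below evaluates them for a few operand widths. The helper name `partial_products` is illustrative, and the widths are example values rather than figures from the slides.

```python
def partial_products(n_bits, bits_per_digit):
    """ceil((n_bits + 1) / bits_per_digit): 2 bits per digit for radix-4, 3 for radix-8."""
    return -(-(n_bits + 1) // bits_per_digit)


for n in (8, 16, 32):
    print(f"N={n}: radix-4 -> {partial_products(n, 2)}, radix-8 -> {partial_products(n, 3)}")
```

For N = 8 this gives 5 partial products for radix-4 and 3 for radix-8, so the radix-8 reduction tree is smaller, but every multiplication needs the 3Y hard multiple.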


Problem Statement
• State-of-the-art (SOTA) work focuses on:
  • minimizing memory access
  • designing data flows
  • reducing data movement

• MACs perform 99% of the computations in systolic arrays (SAs)

Our Methodology
Conclusion
• We will focus on both:
  • µ-architectures, e.g., multipliers, adders, and pipeline stages in systolic-array-based architectures.
  • The RISC-V architecture, which we will explore for similar optimization approaches.

References
[1] Sze, Vivienne, et al. "Efficient processing of deep neural networks." Synthesis Lectures on Computer Architecture 15.2 (2020): 1-341.
[2] Horowitz, Mark. "1.1 computing's energy problem (and what we can do about it)." 2014 IEEE International Solid-State Circuits
Conference Digest of Technical Papers (ISSCC). IEEE, 2014.
[3] Google. System Architecture. [Link] 2019.
[4] Inside Tesla’s Neural Processor In The FSD Chip: [Link]
chip/
[5] Xilinx. Accelerating DNNs With Xilinx Alveo Accelerator Cards. [Online Available]: [Link]
documentation/white_papers/[Link]
[6] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep Learning with Limited Numerical Precision. In
International Conference on Machine Learning, pages 1737–1746, 2015.
[7] Samajdar, Ananda, et al. "Scale-sim: Systolic cnn accelerator simulator." arXiv preprint arXiv:1811.02883 (2018).
[8] Ullah, Inayat, Kashif Inayat, Joon-Sung Yang, and Jaeyong Chung. "Factored radix-8 systolic array for tensor processing." In 2020
57th ACM/IEEE Design Automation Conference (DAC), pp. 1-6. IEEE, 2020.
[9] Genc, H., Haj-Ali, A., Iyer, V., Amid, A., Mao, H., Wright, J., ... & Asanovic, K. (2019). Gemmini: An agile systolic array generator
enabling systematic evaluations of deep-learning architectures. arXiv preprint arXiv:1911.09925, 3, 25.
[10] Inayat, Kashif, and Jaeyong Chung. "Carry-propagation-adder-factored gemmini systolic array for machine learning acceleration."
Electronics 10.6 (2021): 652.
[11] Inayat, Kashif, and Jaeyong Chung. "Hybrid Accumulator Factored Systolic Array for Machine Learning Acceleration." IEEE
Transactions on Very Large Scale Integration (VLSI) Systems (2022).
[12] Inayat, Kashif, Inayat Ullah, and Jaeyong Chung. "Hard Multiple Carry Portioned Factored Radix-8 Systolic Array for Tensor Processing." IEEE TVLSI, 2023 (Accepted).
[13] Review and benchmarking of precision-scalable multiply-accumulate unit architectures for embedded neural-network
processing, IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9.4 (2019): 697-711.
[14] Ryu, Sungju, et al. "Bitblade: Area and energy-efficient precision-scalable neural network accelerator with bitwise
summation." Proceedings of the 56th Annual Design Automation Conference 2019. 2019.
[15] Sharma, Hardik, et al. "Bit fusion: Bit-level dynamically composable architecture for accelerating deep neural network."
2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2018.
[16] Gupta, Suyog, et al. "Deep learning with limited numerical precision." International Conference on Machine Learning. PMLR, 2015.

Thank you for Listening!
