Computer Arithmetic for ML/Crypto Acceleration
December 29, 2023
Kashif Inayat
Computer Sciences - European Exascale Accelerator
Barcelona Supercomputing Center
[Link]
Content
• Introduction
• Motivation
• Background
• Adders
• Multipliers
• Systolic Array Architecture
• Data Flows
• MACs
• Problem Statement
• Conclusions
Machine Learning
Machine learning is everywhere!
• Image classification
• Face recognition
• Optical character recognition
• Autonomous driving
Transistors Are Not Getting Faster
• Slowdown of Moore's law and Dennard scaling
• General-purpose microprocessors are no longer getting faster or more efficient
• Specialized / domain-specific hardware is needed for significant improvements in speed and energy efficiency
[Figure: Slowdown of Moore's law and Dennard scaling. Source: Sze Tutorials, [1]]
Key Parameters to Consider
• Performance
  • Conventional arithmetic blocks inside a PE usually have a long critical path
  • Latency, throughput
• Energy/Power
  • The MAC is the most power-consuming unit
• Hardware Cost
  • On-chip storage, chip area
• Flexibility
  • Determines the range of possible trade-offs (precision, etc.)
[Figure: Rough energy costs for various operations in 45 nm, 0.9 V. Source: Computing's Energy Problem (and what we can do about it), Mark Horowitz, [2]]
Commercial ML Accelerators: 2D Spatial / Systolic Arrays
• Google: TPUv1, extended in TPUv2, TPUv3, and TPUv4
• Tesla: 96x96 independent array in the NPU
• Xilinx: XDNN engine on the Alveo Accelerator Card
• NVIDIA: NVDLA in Jetson Xavier
• IBM: 28x28 wavefront systolic array
• Samsung: 1024-MAC NPU
• Intel, etc.
[Sources: TPUv1, Google [3]; NPU, Tesla [4]; XDNN, Xilinx [5]; 28x28 wavefront SA, IBM [6].]
Matrix Multiplication
• Matrix multiplication is the key primitive in machine learning
• Systolic arrays perform matrix multiplication with high data reuse (see the sketch below)
[Figure: Convolution and matrix multiplication. Source: Sze Tutorials, [1].]
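As an illustration of how convolution maps onto this primitive, a minimal NumPy sketch of the im2col lowering (function name and shapes are illustrative, not from the slides):

import numpy as np

def im2col(x, kh, kw):
    # Unroll every kh x kw window of a 2D input into one row,
    # so the convolution becomes a single matrix-vector product.
    H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((out_h * out_w, kh * kw))
    for i in range(out_h):
        for j in range(out_w):
            cols[i * out_w + j] = x[i:i + kh, j:j + kw].ravel()
    return cols

x = np.arange(16.0).reshape(4, 4)   # 4x4 input feature map
k = np.ones((3, 3))                 # 3x3 kernel
y = im2col(x, 3, 3) @ k.ravel()     # convolution as one matmul
print(y.reshape(2, 2))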
Adders
• Ripple Carry Adder
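A bit-level sketch of the idea (an illustrative Python model, not the slide's circuit): each full adder consumes the previous stage's carry, so the critical path grows linearly with the word length.

def full_adder(a, b, cin):
    # One full adder: sum and carry-out of three input bits.
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout

def ripple_carry_add(a, b, n=8):
    # n-bit ripple-carry adder: the carry ripples through all n stages.
    s, carry = 0, 0
    for i in range(n):
        bit, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        s |= bit << i
    return s, carry

assert ripple_carry_add(200, 100) == ((200 + 100) % 256, 1)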
Adders
• Carry Save Adders
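A minimal sketch of the 3:2 compression behind carry-save addition (illustrative Python): three operands are reduced to a sum word and a carry word with no carry propagation, so the stage delay is independent of the word length.

def carry_save_add(x, y, z):
    # 3:2 compressor applied bitwise: no carry chain between bit positions.
    s = x ^ y ^ z                              # per-bit sum
    c = ((x & y) | (x & z) | (y & z)) << 1     # per-bit carries, shifted
    return s, c

s, c = carry_save_add(25, 17, 9)
assert s + c == 25 + 17 + 9   # one final carry-propagate add resolves the result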
Multipliers
• String Property (SP)
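The property behind Booth recoding: a run of ones from bit i through bit j equals 2^(j+1) - 2^i, so a long string of ones can be replaced by one addition and one subtraction. A tiny numeric check:

# Bits 2..5 set: 0b0111100 = 60; the string property rewrites the run
# sum_{k=i..j} 2**k as 2**(j+1) - 2**i.
i, j = 2, 5
run = sum(2**k for k in range(i, j + 1))
assert run == 2**(j + 1) - 2**i == 60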
Multipliers
• Radix-4 Booth Multiplier
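A minimal sketch of the recoding and the resulting multiply (illustrative Python, treating the multiplier as an even-width two's-complement value):

def booth_radix4_digits(y, n=8):
    # Scan overlapping 3-bit groups (y[i+1], y[i], y[i-1]) of the
    # multiplier and emit n/2 digits in {-2, -1, 0, +1, +2}.
    y &= (1 << n) - 1
    digits, prev = [], 0                     # implicit bit y[-1] = 0
    for i in range(0, n, 2):
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)   # -2*y[i+1] + y[i] + y[i-1]
        prev = b1
    return digits

def booth4_multiply(x, y, n=8):
    # Sum the digit-weighted partial products x * d * 4**i.
    return sum(d * x * 4**i for i, d in enumerate(booth_radix4_digits(y, n)))

assert booth4_multiply(13, 11) == 143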
Multipliers
• Radix-4 Booth Multiplier Truth Table using String Property
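For reference, the standard recoding table (each overlapping group of multiplier bits selects a multiple of the multiplicand Y):

y[2i+1] y[2i] y[2i-1]   digit   selected multiple
  0       0      0        0          +0
  0       0      1       +1          +Y
  0       1      0       +1          +Y
  0       1      1       +2         +2Y
  1       0      0       -2         -2Y
  1       0      1       -1          -Y
  1       1      0       -1          -Y
  1       1      1        0          -0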
Multipliers
• Radix-8 Booth Multiplier
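A minimal sketch (illustrative Python; multiplier width a multiple of 3, two's-complement). Note the digit 3 that appears for y = 11: this is the hard multiple, which needs a real adder to form 3Y = Y + 2Y.

def booth_radix8_digits(y, n=9):
    # Overlapping 4-bit groups give digits in {-4, ..., +4}.
    y &= (1 << n) - 1
    digits, prev = [], 0
    for i in range(0, n, 3):
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        b2 = (y >> (i + 2)) & 1
        digits.append(-4 * b2 + 2 * b1 + b0 + prev)
        prev = b2
    return digits

def booth8_multiply(x, y, n=9):
    return sum(d * x * 8**i for i, d in enumerate(booth_radix8_digits(y, n)))

assert booth_radix8_digits(11) == [3, 1, 0]   # the '3' is the hard multiple
assert booth8_multiply(13, 11) == 143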
Multipliers
• Radix-8 Booth Multiplier Truth Table using String Property
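For reference, the standard radix-8 recoding of each overlapping 4-bit group (digits +/-3 are the hard multiples):

y[3i+2..3i-1]  digit      y[3i+2..3i-1]  digit
  0000           0          1000          -4
  0001          +1          1001          -3
  0010          +1          1010          -3
  0011          +2          1011          -2
  0100          +2          1100          -2
  0101          +3          1101          -1
  0110          +3          1110          -1
  0111          +4          1111           0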
Multipliers
• Sign Extension
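Booth partial products are signed and must be extended to the full product width before reduction; a minimal sketch of plain two's-complement sign extension (illustrative Python, not the slide's optimized extension scheme):

def sign_extend(v, bits):
    # Interpret the low 'bits' bits of v as two's complement.
    sign = 1 << (bits - 1)
    return (v & (sign - 1)) - (v & sign)

assert sign_extend(0b1011, 4) == -5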
Multipliers
• Montgomery modular multiplier
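A minimal sketch of Montgomery multiplication for an odd modulus n (illustrative Python; variable names are assumptions). It returns a*b*R^-1 mod n, replacing the division by n with shifts and masks:

def montgomery_mul(a, b, n, r_bits):
    R = 1 << r_bits
    n_prime = pow(-n, -1, R)       # n * n_prime == -1 (mod R)
    t = a * b
    m = (t * n_prime) & (R - 1)    # m = t * n_prime mod R
    u = (t + m * n) >> r_bits      # t + m*n is exactly divisible by R
    return u - n if u >= n else u

# Operands live in the Montgomery domain: x_m = x * R mod n.
n, r_bits = 97, 8
R = 1 << r_bits
a, b = 42, 57
a_m, b_m = a * R % n, b * R % n
c_m = montgomery_mul(a_m, b_m, n, r_bits)     # equals a*b*R mod n
assert c_m * pow(R, -1, n) % n == a * b % n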
Multipliers
• School Method-Decomposed
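A minimal sketch of the decomposition (illustrative Python, unsigned n-bit operands split into halves): four half-size products are combined with shifts.

def school_decomposed(x, y, n=16):
    h = n // 2
    mask = (1 << h) - 1
    x1, x0 = x >> h, x & mask      # high and low halves
    y1, y0 = y >> h, y & mask
    # (x1*2^h + x0)(y1*2^h + y0) expanded: four sub-multiplications
    return (x1 * y1 << n) + ((x1 * y0 + x0 * y1) << h) + x0 * y0

assert school_decomposed(1234, 5678) == 1234 * 5678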
Multipliers
• Karatsuba–Ofman algorithm
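A minimal recursive sketch (illustrative Python; the base-case width is arbitrary): Karatsuba-Ofman replaces the fourth half-size product of the school method with extra additions and subtractions, giving three recursive multiplications instead of four.

def karatsuba(x, y, n=16):
    if n <= 8:                     # base case: direct multiply
        return x * y
    h = n // 2
    mask = (1 << h) - 1
    x1, x0 = x >> h, x & mask
    y1, y0 = y >> h, y & mask
    z2 = karatsuba(x1, y1, h)
    z0 = karatsuba(x0, y0, h)
    z1 = karatsuba(x1 + x0, y1 + y0, h + 1) - z2 - z0   # middle term
    return (z2 << n) + (z1 << h) + z0

assert karatsuba(1234, 5678) == 1234 * 5678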
Systolic Array Architecture
A 3x3 systolic array for matrix multiplication, A × B = C.
[Figure: a 3x3 grid of PEs (PE11..PE33), each with a multiplier and an adder. The columns of B enter from the top, skewed in time (b11 at t1, ..., b33 at t5); the rows of A stream in from the left (a11..a33, likewise skewed over t1..t5). Results emerge along the diagonals: c11 at t4, c12 and c21 at t5, c13, c22 and c31 at t6, c23 and c32 at t7, c33 at t8.]
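A minimal cycle-level sketch of this dataflow (illustrative NumPy model of an output-stationary array; the register bookkeeping is an assumption, not the slide's design):

import numpy as np

def systolic_matmul(A, B):
    # A streams in from the left, B from the top, each skewed by one
    # cycle per row/column; every PE multiplies its current inputs and
    # accumulates its own output element in place.
    N = A.shape[0]
    C = np.zeros((N, N), dtype=A.dtype)
    a_reg = np.zeros((N, N), dtype=A.dtype)   # values moving rightward
    b_reg = np.zeros((N, N), dtype=A.dtype)   # values moving downward
    for t in range(3 * N - 2):                # enough cycles to drain
        a_reg[:, 1:] = a_reg[:, :-1]          # shift right
        b_reg[1:, :] = b_reg[:-1, :]          # shift down
        for i in range(N):                    # inject skewed edge inputs
            k = t - i
            a_reg[i, 0] = A[i, k] if 0 <= k < N else 0
            b_reg[0, i] = B[k, i] if 0 <= k < N else 0
        C += a_reg * b_reg                    # every PE does one MAC
    return C

A = np.arange(9).reshape(3, 3)
B = np.arange(9, 18).reshape(3, 3)
assert np.array_equal(systolic_matmul(A, B), A @ B)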
Data Flows
• Output Stationary (OS): each PE keeps its partial sum in place while inputs and weights stream through
• Input Stationary (IS): input activations stay resident in the PEs
• Weight Stationary (WS): weights stay resident in the PEs
The loop-nest sketch below makes the distinction concrete.
[Figure: Dataflows in systolic arrays. Source: Scale-sim, [7]]
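One way to see the difference is as loop orderings of the same matrix multiplication; a minimal sketch of two of them (illustrative Python; input stationary is the analogous nest that keeps A[i][k] resident):

def output_stationary(A, B, M, K, N):
    C = [[0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            acc = 0                  # the partial sum C[i][j] stays put
            for k in range(K):
                acc += A[i][k] * B[k][j]
            C[i][j] = acc
    return C

def weight_stationary(A, B, M, K, N):
    C = [[0] * N for _ in range(M)]
    for k in range(K):
        for j in range(N):
            w = B[k][j]              # the weight stays resident
            for i in range(M):
                C[i][j] += A[i][k] * w
    return C

A, B = [[1, 2], [3, 4]], [[5, 6], [7, 8]]
assert output_stationary(A, B, 2, 2, 2) == weight_stationary(A, B, 2, 2, 2)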
Radix-4 vs Radix-8 Multiply-Accumulate (MAC) Unit
• Radix-4: number of partial products for an N-bit multiplier is ⌈(N+1)/2⌉
• Radix-8: number of partial products drops to ⌈(N+1)/3⌉, but the hard multiple 3Y = Y + 2Y requires a carry-propagate adder
[Figure: Radix-4 and Radix-8 MAC microarchitectures. In both, the multiplier X is Booth-recoded and drives Booth selection of multiples of Y; a Wallace reduction tree compresses the partial products, and a CPA followed by an accumulator produces the result. The radix-8 datapath needs an additional CPA to generate Y + 2Y.]
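A quick numeric check of the trade-off for 16-bit operands (illustrative arithmetic only):

import math

N = 16
pp_radix4 = math.ceil((N + 1) / 2)   # 9 partial products
pp_radix8 = math.ceil((N + 1) / 3)   # 6 partial products
# Radix-8 saves rows in the Wallace tree, but generating +/-3Y
# costs one extra carry-propagate addition (3Y = Y + 2Y).
print(pp_radix4, pp_radix8)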
Problem Statement
• The state of the art (SOTA) focuses on:
  • minimizing memory accesses
  • designing data flows with less data movement
• However, MAC units perform 99% of the computations in systolic arrays (SAs)
Our Methodology
Conclusion
• We will focus on both:
  • µ-architectures, e.g., multipliers, adders, and pipeline stages in systolic-array-based architectures
  • the RISC-V architecture, which we will explore for similar optimization approaches
References
[1] Sze, Vivienne, et al. "Efficient Processing of Deep Neural Networks." Synthesis Lectures on Computer Architecture 15.2 (2020): 1-341.
[2] Horowitz, Mark. "1.1 Computing's Energy Problem (and What We Can Do About It)." 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC). IEEE, 2014.
[3] Google. System Architecture. [Link], 2019.
[4] Inside Tesla's Neural Processor in the FSD Chip. [Link]
[5] Xilinx. Accelerating DNNs with Xilinx Alveo Accelerator Cards. White paper. [Link]
[6] Gupta, Suyog, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. "Deep Learning with Limited Numerical Precision." International Conference on Machine Learning, pp. 1737-1746, 2015.
[7] Samajdar, Ananda, et al. "SCALE-Sim: Systolic CNN Accelerator Simulator." arXiv preprint arXiv:1811.02883 (2018).
[8] Ullah, Inayat, Kashif Inayat, Joon-Sung Yang, and Jaeyong Chung. "Factored Radix-8 Systolic Array for Tensor Processing." 2020 57th ACM/IEEE Design Automation Conference (DAC), pp. 1-6. IEEE, 2020.
[9] Genc, H., Haj-Ali, A., Iyer, V., Amid, A., Mao, H., Wright, J., ... & Asanovic, K. "Gemmini: An Agile Systolic Array Generator Enabling Systematic Evaluations of Deep-Learning Architectures." arXiv preprint arXiv:1911.09925 (2019).
[10] Inayat, Kashif, and Jaeyong Chung. "Carry-Propagation-Adder-Factored Gemmini Systolic Array for Machine Learning Acceleration." Electronics 10.6 (2021): 652.
[11] Inayat, Kashif, and Jaeyong Chung. "Hybrid Accumulator Factored Systolic Array for Machine Learning Acceleration." IEEE Transactions on Very Large Scale Integration (VLSI) Systems (2022).
[12] Inayat, Kashif, Inayat Ullah, and Jaeyong Chung. "Hard Multiple Carry Portioned Factored Radix-8 Systolic Array for Tensor Processing." IEEE TVLSI, 2023 (accepted).
[13] "Review and Benchmarking of Precision-Scalable Multiply-Accumulate Unit Architectures for Embedded Neural-Network Processing." IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9.4 (2019): 697-711.
[14] Ryu, Sungju, et al. "BitBlade: Area and Energy-Efficient Precision-Scalable Neural Network Accelerator with Bitwise Summation." Proceedings of the 56th Annual Design Automation Conference. 2019.
[15] Sharma, Hardik, et al. "Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Network." 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2018.
[16] "Deep Learning with Limited Numerical Precision." International Conference on Machine Learning. PMLR, 2015.
Thank you for Listening!