Computer Arithmetic

Computer Arithmetic for ML/Crypto Acceleration

December 29, 2023

Kashif Inayat

Computer Sciences-European Exascale Accelerator


Barcelona Supercomputing Center

https://www.bsc.es/inayat-kashif
Content
• Introduction
• Motivation
• Background
• Adders
• Multipliers
• Systolic Array Architecture
• Data Flows
• MACs
• Problem Statement
• Conclusions
12/29/2023 2  Kashif Inayat, 2023
Machine Learning

Machine learning is everywhere!
• Image classification
• Face recognition
• Optical character recognition
• Autonomous driving
Transistors Are Not Getting Faster

• Slowdown of Moore’s law and Dennard scaling
• General-purpose microprocessors are not getting faster or more efficient
• Specialized, domain-specific hardware is needed for significant improvements in speed and energy efficiency

Figure: Slowdown of Moore’s law and Dennard scaling. [Source: Sze Tutorials, [1]]

Key Parameters to Consider

• Performance
  • Conventional arithmetic blocks inside a processing element (PE) usually have a long critical path
  • Latency, throughput
• Energy/Power
  • The MAC is the most power-consuming unit
• Hardware Cost
  • Chip storage, chip area
• Flexibility
  • Determines the range of possible trade-offs (precision, etc.)

Figure: Rough energy costs for various operations in 45nm, 0.9V. [Source: Computing’s Energy Problem (and what we can do about it), Mark Horowitz, [2]]

Commercial ML Accelerators

• Google (TPUv1)
  • Extended in TPUv2, TPUv3 and TPUv4
• Tesla
  • 96x96 independent array in NPU
• Xilinx (XDNN) Engine
  • Alveo Accelerator Card
• NVIDIA (NVDLA)
  • Jetson Xavier
• IBM
  • 28x28 Wavefront Systolic array
• Samsung
  • 1024 MACs NPU
• Intel, etc.

Figures: 2D spatial arrays and 2D systolic arrays. [Sources: TPUv1, [3], Google; NPU, [4], Tesla; XDNN, [5], Xilinx; 28x28 wavefront SA, [6], IBM.]

Matrix Multiplication

• Matrix multiplication is the key primitive in machine learning
• Systolic arrays perform matrix multiplication with high data reuse

Figure: Convolution and matrix multiplication. [Source: Sze Tutorials, [1].]
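The link between convolution and matrix multiplication can be illustrated with a small sketch. It assumes the common "im2col" lowering (a technique not named on the slide), uses the ML convention of convolution without kernel flipping, and all function names are hypothetical.

```python
def im2col(image, k):
    """Unroll every k x k patch of a 2D image into one row of a matrix."""
    h, w = len(image), len(image[0])
    return [
        [image[r + i][c + j] for i in range(k) for j in range(k)]
        for r in range(h - k + 1)
        for c in range(w - k + 1)
    ]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def conv2d_direct(image, kernel):
    """Reference sliding-window convolution (no kernel flip, no padding)."""
    k = len(kernel)
    h, w = len(image), len(image[0])
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(k) for j in range(k))
            for c in range(w - k + 1)
        ]
        for r in range(h - k + 1)
    ]


def conv2d_as_matmul(image, kernel):
    """Same result, computed as (patch matrix) x (flattened kernel column)."""
    k = len(kernel)
    out_w = len(image[0]) - k + 1
    patches = im2col(image, k)                      # (out_h * out_w) x (k * k)
    weights = [[v] for row in kernel for v in row]  # (k * k) x 1
    flat = [row[0] for row in matmul(patches, weights)]
    return [flat[i:i + out_w] for i in range(0, len(flat), out_w)]
```

With several kernels, each one becomes an extra column of the weight matrix, which is the shape of workload a systolic array is built for.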
Adders
• Ripple Carry Adder
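A minimal bit-level sketch of a ripple carry adder follows. The function names and the default 8-bit width are assumptions for illustration; the loop mirrors how the carry ripples from one full adder to the next, which is why the critical path grows linearly with the width.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_carry_add(x, y, width=8):
    """Add two unsigned width-bit values; returns (sum mod 2**width, carry_out)."""
    carry = 0
    result = 0
    for i in range(width):
        # Each stage must wait for the carry produced by the previous stage.
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry
```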

Adders
• Carry Save Adders
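The idea of a carry save adder can be sketched as a 3:2 compressor: three operands are reduced to a sum word and a carry word without propagating carries, and only one final carry-propagate addition is needed. The names below are hypothetical and the sketch assumes non-negative integers.

```python
def carry_save_add(a, b, c):
    """3:2 compression: returns (sum_word, carry_word) with a + b + c == sum + carry."""
    sum_word = a ^ b ^ c                               # bitwise sum, no carry propagation
    carry_word = ((a & b) | (a & c) | (b & c)) << 1    # majority bits, shifted to their weight
    return sum_word, carry_word


def sum_operands(operands):
    """Accumulate many operands in carry-save form, then resolve once."""
    s, c = 0, 0
    for x in operands:
        s, c = carry_save_add(s, c, x)
    return s + c  # the single carry-propagate addition at the end
```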

Multipliers
• String Property (SP)
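Assuming the slide refers to the string property behind Booth recoding (a run of ones from bit i to bit j equals 2^(j+1) - 2^i), a small sketch can show how a two's-complement value is rewritten with digits in {-1, 0, +1}. Function names are hypothetical.

```python
def booth_radix2_digits(x, width):
    """Recode a width-bit two's-complement pattern into digits in {-1, 0, +1}, LSB first."""
    digits = []
    prev = 0  # implicit bit to the right of the LSB
    for i in range(width):
        bit = (x >> i) & 1
        digits.append(prev - bit)  # +1 at the end of a run of ones, -1 at its start
        prev = bit
    return digits


def digits_value(digits, radix=2):
    """Evaluate a signed-digit number given LSB first."""
    return sum(d * radix**i for i, d in enumerate(digits))
```

For example, 0111 (7) is recoded as +1 at weight 8 and -1 at weight 1, so three additions become one addition and one subtraction.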

Multipliers
• Radix-4 Booth Multiplier
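As a hedged software model of radix-4 Booth multiplication, the sketch below recodes the multiplier X into digits in {-2, -1, 0, +1, +2} and sums the selected multiples of Y. It assumes an even operand width and signed (two's-complement) X; the names are illustrative and the code does not model the hardware datapath.

```python
def booth_radix4_digits(x, width):
    """Radix-4 Booth digits of a width-bit two's-complement pattern (width even), LSB first."""
    x &= (1 << width) - 1
    padded = x << 1  # append the implicit bit b[-1] = 0
    digits = []
    for i in range(0, width, 2):
        triplet = (padded >> i) & 0b111  # bits b[i+1], b[i], b[i-1]
        b_hi, b_mid, b_lo = (triplet >> 2) & 1, (triplet >> 1) & 1, triplet & 1
        digits.append(-2 * b_hi + b_mid + b_lo)
    return digits


def booth_radix4_multiply(x, y, width=8):
    """Signed product x * y, with x taken as a width-bit two's-complement value."""
    product = 0
    for k, d in enumerate(booth_radix4_digits(x, width)):
        product += (d * y) << (2 * k)  # one partial product per digit, weight 4**k
    return product
```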

Multipliers
• Radix-4 Booth Multiplier Truth Table using String Property
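The table itself did not survive extraction. As a sketch, the standard radix-4 Booth recoding table can be regenerated from the digit formula d = -2·b[i+1] + b[i] + b[i-1]; this is the textbook table and is assumed, not confirmed, to match the slide.

```python
def radix4_booth_table():
    """Rows of (bits b[i+1] b[i] b[i-1], selected multiple of Y)."""
    rows = []
    for triplet in range(8):
        b_hi, b_mid, b_lo = (triplet >> 2) & 1, (triplet >> 1) & 1, triplet & 1
        rows.append((f"{triplet:03b}", -2 * b_hi + b_mid + b_lo))
    return rows


if __name__ == "__main__":
    for bits, digit in radix4_booth_table():
        print(f"{bits} -> {digit:+d} * Y")
```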

Multipliers
• Radix-8 Booth Multiplier
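A radix-8 Booth multiplier can be modelled in the same spirit. The sketch below assumes signed operands, sign-extends X to a multiple of 3 bits, and precomputes the multiple 3Y with an explicit addition, since it cannot be formed by shifting alone. All names are hypothetical.

```python
def booth_radix8_digits(x, width):
    """Radix-8 Booth digits (in -4..+4) of a width-bit two's-complement pattern, LSB first."""
    groups = -(-width // 3)  # ceil(width / 3)
    ext = 3 * groups
    x &= (1 << width) - 1
    if x >> (width - 1):  # sign-extend to a multiple of 3 bits
        x |= ((1 << ext) - 1) ^ ((1 << width) - 1)
    padded = x << 1  # append the implicit bit b[-1] = 0
    digits = []
    for i in range(0, ext, 3):
        q = (padded >> i) & 0b1111  # bits b[i+2], b[i+1], b[i], b[i-1]
        digits.append(-4 * ((q >> 3) & 1) + 2 * ((q >> 2) & 1) + ((q >> 1) & 1) + (q & 1))
    return digits


def booth_radix8_multiply(x, y, width=8):
    """Signed product x * y, with x taken as a width-bit two's-complement value."""
    y3 = y + (y << 1)  # 3Y = Y + 2Y: needs a real adder, unlike Y, 2Y and 4Y
    multiples = {0: 0, 1: y, 2: y << 1, 3: y3, 4: y << 2}
    product = 0
    for k, d in enumerate(booth_radix8_digits(x, width)):
        pp = multiples[abs(d)]
        product += (-pp if d < 0 else pp) << (3 * k)  # weight 8**k
    return product
```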

Multipliers
• Radix-8 Booth Multiplier Truth Table using String Property
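The table image is missing from the extracted text. A sketch that regenerates the standard radix-8 recoding table from d = -4·b[i+2] + 2·b[i+1] + b[i] + b[i-1] is given below; it is the textbook table and is assumed, not confirmed, to match the slide.

```python
def radix8_booth_table():
    """Rows of (bits b[i+2] b[i+1] b[i] b[i-1], selected multiple of Y)."""
    rows = []
    for q in range(16):
        digit = -4 * ((q >> 3) & 1) + 2 * ((q >> 2) & 1) + ((q >> 1) & 1) + (q & 1)
        rows.append((f"{q:04b}", digit))
    return rows


if __name__ == "__main__":
    for bits, digit in radix8_booth_table():
        print(f"{bits} -> {digit:+d} * Y")
```

The rows selecting ±3Y are the ones that make radix-8 harder to implement than radix-4.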

Multipliers
• Sign Extension
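Sign extension matters in multipliers because each signed partial product must be widened to the product width before it is summed. A minimal sketch of the operation on raw bit patterns, with hypothetical helper names:

```python
def sign_extend(value, from_bits, to_bits):
    """Widen a from_bits two's-complement pattern to to_bits by replicating the sign bit."""
    value &= (1 << from_bits) - 1
    if value >> (from_bits - 1):  # sign bit set: fill the new upper bits with ones
        value |= ((1 << to_bits) - 1) & ~((1 << from_bits) - 1)
    return value


def to_signed(value, bits):
    """Interpret a bits-wide pattern as a signed two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value
```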

Multipliers
• Montgomery modular multiplier
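As a hedged illustration of Montgomery modular multiplication, the sketch below uses the word-level REDC reduction with R = 2^k. It assumes an odd modulus n < 2^k and Python 3.8+ (for the modular inverse via `pow`); the function names are hypothetical, and hardware designs typically use a bit-serial or digit-serial variant instead.

```python
def montgomery_setup(n, k):
    """Return (R, n_prime) with R = 2**k and n * n_prime == -1 (mod R). Requires odd n."""
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r
    return r, n_prime


def redc(t, n, k, n_prime):
    """Montgomery reduction: returns t * R^-1 mod n for 0 <= t < n * R."""
    mask = (1 << k) - 1
    m = ((t & mask) * n_prime) & mask
    u = (t + m * n) >> k  # exact division by R: the low k bits are zero by construction
    return u - n if u >= n else u


def montgomery_multiply(a, b, n, k):
    """Compute a * b mod n through the Montgomery domain."""
    r, n_prime = montgomery_setup(n, k)
    a_bar = (a * r) % n  # conversion into the Montgomery domain
    b_bar = (b * r) % n
    prod_bar = redc(a_bar * b_bar, n, k, n_prime)  # equals a * b * R mod n
    return redc(prod_bar, n, k, n_prime)           # conversion back out
```

The conversions in and out cost extra work, so the method pays off when many multiplications share one modulus, as in modular exponentiation.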

Multipliers
• School Method-Decomposed
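Assuming "decomposed" means splitting each operand into a high and a low half, the school method needs four half-width multiplications. The sketch below is one interpretation with hypothetical names, for unsigned operands.

```python
def schoolbook_decomposed(x, y, bits):
    """Multiply two unsigned bits-wide values using four half-width products."""
    half = bits // 2
    mask = (1 << half) - 1
    x1, x0 = x >> half, x & mask  # x = x1 * 2**half + x0
    y1, y0 = y >> half, y & mask  # y = y1 * 2**half + y0
    high = x1 * y1
    middle = x1 * y0 + x0 * y1    # two cross products
    low = x0 * y0
    return (high << (2 * half)) + (middle << half) + low
```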

Multipliers
• Karatsuba–Ofman algorithm
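The Karatsuba-Ofman algorithm replaces the four half-width products of the decomposed school method with three, at the cost of a few extra additions. A recursive sketch for non-negative integers follows; the base-case threshold is an arbitrary assumption.

```python
def karatsuba(x, y, threshold_bits=8):
    """Multiply non-negative integers with three recursive half-size products."""
    if x.bit_length() <= threshold_bits or y.bit_length() <= threshold_bits:
        return x * y  # small operands: fall back to a direct multiply
    half = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << half) - 1
    x1, x0 = x >> half, x & mask
    y1, y0 = y >> half, y & mask
    high = karatsuba(x1, y1, threshold_bits)
    low = karatsuba(x0, y0, threshold_bits)
    # (x1 + x0)(y1 + y0) - high - low equals x1*y0 + x0*y1, using one multiply instead of two
    middle = karatsuba(x1 + x0, y1 + y0, threshold_bits) - high - low
    return (high << (2 * half)) + (middle << half) + low
```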

Systolic Array Architecture

A × B = C

Figure: A 3x3 systolic array for matrix multiplication.
• PEs: PE11 to PE33, each with a multiplier (×) and an adder (+), accumulating outputs C11 to C33.
• A inputs per PE row (first element entering first): a11, a12, a13 for row 1; a21, a22, a23 for row 2; a31, a32, a33 for row 3.
• B inputs, skewed in time: b11 at t1; b12 and b21 at t2; b13, b22 and b31 at t3; b23 and b32 at t4; b33 at t5.
• Output timing: c11 at t4; c12 and c21 at t5; c13, c22 and c31 at t6; c23 and c32 at t7; c33 at t8.
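The timing in the figure can be reproduced with a small cycle-level model. This is a sketch under stated assumptions: an output-stationary array where row i of A enters from the left delayed by i cycles, column j of B enters from the top delayed by j cycles, and a result is counted as available one cycle after its last multiply-accumulate. The function name and register model are hypothetical.

```python
def systolic_matmul(a, b):
    """Simulate an n x n output-stationary systolic array; returns (C, ready_at)."""
    n = len(a)
    acc = [[0] * n for _ in range(n)]
    count = [[0] * n for _ in range(n)]
    ready_at = [[None] * n for _ in range(n)]
    a_reg = [[None] * n for _ in range(n)]  # A operand seen by each PE (None = bubble)
    b_reg = [[None] * n for _ in range(n)]  # B operand seen by each PE
    for t in range(1, 3 * n - 1):  # cycles t1 .. t(3n-2)
        new_a = [[None] * n for _ in range(n)]
        new_b = [[None] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                ka, kb = t - 1 - i, t - 1 - j  # skewed injection indices at the edges
                if j > 0:
                    new_a[i][j] = a_reg[i][j - 1]  # A moves one PE to the right
                elif 0 <= ka < n:
                    new_a[i][j] = a[i][ka]
                if i > 0:
                    new_b[i][j] = b_reg[i - 1][j]  # B moves one PE down
                elif 0 <= kb < n:
                    new_b[i][j] = b[kb][j]
        a_reg, b_reg = new_a, new_b
        for i in range(n):
            for j in range(n):
                if a_reg[i][j] is not None and b_reg[i][j] is not None:
                    acc[i][j] += a_reg[i][j] * b_reg[i][j]
                    count[i][j] += 1
                    if count[i][j] == n:
                        ready_at[i][j] = t + 1
    return acc, ready_at
```

For a 3x3 input the model reports the first output ready at t4 and the last at t8, in line with the figure.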
Data Flows

• Output Stationary (OS)
• Input Stationary (IS)
• Weight Stationary (WS)

Figure: Dataflows in systolic arrays. [Source: Scale-sim, [7]]
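One way to read the three dataflows is as loop orderings of the same matrix product, differing in which operand is held fixed while the others stream past it. The sketch below is a conceptual software analogy with hypothetical function names, not a model of any particular accelerator.

```python
def matmul_os(x, w):
    """Output stationary: each out[i][j] stays in place while inputs and weights stream."""
    m, depth, n = len(x), len(w), len(w[0])
    out = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):  # out[i][j] is pinned
            for k in range(depth):
                out[i][j] += x[i][k] * w[k][j]
    return out


def matmul_ws(x, w):
    """Weight stationary: each weight is loaded once and reused for every input row."""
    m, depth, n = len(x), len(w), len(w[0])
    out = [[0] * n for _ in range(m)]
    for k in range(depth):
        for j in range(n):
            weight = w[k][j]  # pinned weight
            for i in range(m):
                out[i][j] += x[i][k] * weight
    return out


def matmul_is(x, w):
    """Input stationary: each input element is loaded once and reused for every output column."""
    m, depth, n = len(x), len(w), len(w[0])
    out = [[0] * n for _ in range(m)]
    for i in range(m):
        for k in range(depth):
            value = x[i][k]  # pinned input
            for j in range(n):
                out[i][j] += value * w[k][j]
    return out
```

All three produce the same result; what changes is which data must move each cycle, and therefore the memory traffic.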


Radix-4 Vs Radix-8 Multiply-Accumulate (MAC) Unit

• Radix-4
  • Number of partial products: N → ⌈(N+1)/2⌉
• Radix-8
  • Number of partial products: N → ⌈(N+1)/3⌉
  • Hard multiple problem: 3Y = Y + 2Y (requires a carry propagate adder)

Figure: Radix-4 microarchitecture and Radix-8 microarchitecture. Blocks in each: Booth recoding (input X), Booth selection (input Y), Wallace reduction tree, CPA (multiplier), CPA and accumulator. The Radix-8 design additionally has a Y+2Y block.
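To make the partial-product formulas concrete, the sketch below evaluates ⌈(N+1)/2⌉ and ⌈(N+1)/3⌉ for a few operand widths. It assumes N is the multiplier width in bits; the function name is hypothetical.

```python
import math


def partial_products(n_bits, bits_per_digit):
    """Partial products after Booth recoding: ceil((N + 1) / bits_per_digit)."""
    return math.ceil((n_bits + 1) / bits_per_digit)


if __name__ == "__main__":
    for n in (8, 16, 32):
        radix4 = partial_products(n, 2)  # radix-4 retires 2 bits per digit
        radix8 = partial_products(n, 3)  # radix-8 retires 3 bits per digit
        print(f"N={n}: radix-4 -> {radix4}, radix-8 -> {radix8}")
```

Radix-8 reduces the number of rows entering the reduction tree, but pays for it with the extra adder that forms 3Y.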


Problem Statement
• State of the art (SOTA) focuses on:
  • minimizing memory access
  • designing data flows
  • reducing data movement

• MACs perform 99% of the computations in systolic arrays (SAs)

Our Methodology
Conclusion
• We will focus on both:
  • µ-architectures, e.g., multipliers, adders, and pipeline stages in systolic-array-based architectures.
  • The RISC-V architecture, which we will explore at the same time for similar optimization approaches.

References
[1] Sze, Vivienne, et al. "Efficient processing of deep neural networks." Synthesis Lectures on Computer Architecture 15.2 (2020): 1-341.
[2] Horowitz, Mark. "1.1 Computing's energy problem (and what we can do about it)." 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC). IEEE, 2014.
[3] Google. System Architecture. https://cloud.google.com/tpu/docs/system-architecture, 2019.
[4] Inside Tesla’s Neural Processor In The FSD Chip: https://fuse.wikichip.org/news/2707/inside-teslas-neural-processor-in-the-fsd-chip/
[5] Xilinx. Accelerating DNNs With Xilinx Alveo Accelerator Cards. [Online]. Available: https://www.xilinx.com/support/documentation/white_papers/wp504-accel-dnns.pdf
[6] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep Learning with Limited Numerical Precision. In International Conference on Machine Learning, pages 1737–1746, 2015.
[7] Samajdar, Ananda, et al. "Scale-sim: Systolic CNN accelerator simulator." arXiv preprint arXiv:1811.02883 (2018).
[8] Ullah, Inayat, Kashif Inayat, Joon-Sung Yang, and Jaeyong Chung. "Factored radix-8 systolic array for tensor processing." In 2020 57th ACM/IEEE Design Automation Conference (DAC), pp. 1-6. IEEE, 2020.
[9] Genc, H., Haj-Ali, A., Iyer, V., Amid, A., Mao, H., Wright, J., ... & Asanovic, K. (2019). Gemmini: An agile systolic array generator enabling systematic evaluations of deep-learning architectures. arXiv preprint arXiv:1911.09925, 3, 25.
[10] Inayat, Kashif, and Jaeyong Chung. "Carry-propagation-adder-factored gemmini systolic array for machine learning acceleration." Electronics 10.6 (2021): 652.
[11] Inayat, Kashif, and Jaeyong Chung. "Hybrid Accumulator Factored Systolic Array for Machine Learning Acceleration." IEEE Transactions on Very Large Scale Integration (VLSI) Systems (2022).
[12] Kashif Inayat, Inayat Ullah, and Jaeyong Chung. "Hard Multiple Carry Portioned Factored Radix-8 Systolic Array for Tensor Processing." IEEE TVLSI, 2023 (Accepted).
[13] "Review and benchmarking of precision-scalable multiply-accumulate unit architectures for embedded neural-network processing." IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9.4 (2019): 697-711.
[14] Ryu, Sungju, et al. "Bitblade: Area and energy-efficient precision-scalable neural network accelerator with bitwise summation." Proceedings of the 56th Annual Design Automation Conference 2019. 2019.
[15] Sharma, Hardik, et al. "Bit fusion: Bit-level dynamically composable architecture for accelerating deep neural network." 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2018.
[16] "Deep learning with limited numerical precision." International Conference on Machine Learning. PMLR, 2015.

Thank you for Listening!
