Computer Arithmetic for ML/Crypto Acceleration
December 29, 2023
Kashif Inayat
Computer Sciences - European Exascale Accelerator
Barcelona Supercomputing Center
[Link]
Content
• Introduction
• Motivation
• Background
• Adders
• Multipliers
• Systolic Array Architecture
• Data Flows
• MACs
• Problem Statement
• Conclusions
Machine Learning
Machine learning is everywhere!
• Image classification
• Face recognition
• Optical character recognition
• Autonomous driving
Transistors Are Not Getting Faster
• Slowdown of Moore's law and Dennard scaling
• General-purpose microprocessors are no longer getting faster or more efficient
• Specialized / domain-specific hardware is needed for significant improvements in speed and energy efficiency
[Figure: Slowdown of Moore's law and Dennard scaling. Source: Sze Tutorials, [1]]
Key Parameters to Consider
• Performance
  • Conventional arithmetic blocks inside a PE usually have a long critical path
  • Latency, throughput
• Energy/Power
  • The MAC is the most power-consuming unit
• Hardware Cost
  • On-chip storage, chip area
• Flexibility
  • Determines the range of possible trade-offs (precision, etc.)
[Figure: Rough energy costs for various operations in 45 nm, 0.9 V. Source: Computing's Energy Problem (and what we can do about it), Mark Horowitz, [2]]
Commercial ML Accelerators: 2D Spatial / Systolic Arrays
• Google: TPUv1, extended in TPUv2, TPUv3, and TPUv4
• Tesla: 96x96 independent array in the NPU
• Xilinx: XDNN engine on the Alveo Accelerator Card
• NVIDIA: NVDLA in Jetson Xavier
• IBM: 28x28 wavefront systolic array
• Samsung: 1024-MAC NPU
• Intel, etc.
[Sources: TPUv1, Google [3]; NPU, Tesla [4]; XDNN, Xilinx [5]; 28x28 wavefront SA, IBM [6].]
Matrix Multiplication
• Matrix multiplication is the key primitive in machine learning
• Systolic arrays perform matrix multiplication with high data reuse (see the sketch below)
[Figure: Convolution and matrix multiplication. Source: Sze Tutorials, [1].]
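As an illustration of how convolution maps onto this primitive, a minimal NumPy sketch of the im2col lowering (function name and shapes are illustrative, not from the slides):

import numpy as np

def im2col(x, kh, kw):
    # Unroll every kh x kw window of a 2D input into one row,
    # so the convolution becomes a single matrix-vector product.
    H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((out_h * out_w, kh * kw))
    for i in range(out_h):
        for j in range(out_w):
            cols[i * out_w + j] = x[i:i + kh, j:j + kw].ravel()
    return cols

x = np.arange(16.0).reshape(4, 4)   # 4x4 input feature map
k = np.ones((3, 3))                 # 3x3 kernel
y = im2col(x, 3, 3) @ k.ravel()     # convolution as one matmul
print(y.reshape(2, 2))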
Adders
• Ripple Carry Adder
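A bit-level sketch of the idea (an illustrative Python model, not the slide's circuit): each full adder consumes the previous stage's carry, so the critical path grows linearly with the word length.

def full_adder(a, b, cin):
    # One full adder: sum and carry-out of three input bits.
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout

def ripple_carry_add(a, b, n=8):
    # n-bit ripple-carry adder: the carry ripples through all n stages.
    s, carry = 0, 0
    for i in range(n):
        bit, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        s |= bit << i
    return s, carry

assert ripple_carry_add(200, 100) == ((200 + 100) % 256, 1)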
Adders
• Carry Save Adders
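A minimal sketch of the 3:2 compression behind carry-save addition (illustrative Python): three operands are reduced to a sum word and a carry word with no carry propagation, so the stage delay is independent of the word length.

def carry_save_add(x, y, z):
    # 3:2 compressor applied bitwise: no carry chain between bit positions.
    s = x ^ y ^ z                              # per-bit sum
    c = ((x & y) | (x & z) | (y & z)) << 1     # per-bit carries, shifted
    return s, c

s, c = carry_save_add(25, 17, 9)
assert s + c == 25 + 17 + 9   # one final carry-propagate add resolves the result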
Multipliers
• String Property (SP)
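The property behind Booth recoding: a run of ones from bit i through bit j equals 2^(j+1) - 2^i, so a long string of ones can be replaced by one addition and one subtraction. A tiny numeric check:

# Bits 2..5 set: 0b0111100 = 60; the string property rewrites the run
# sum_{k=i..j} 2**k as 2**(j+1) - 2**i.
i, j = 2, 5
run = sum(2**k for k in range(i, j + 1))
assert run == 2**(j + 1) - 2**i == 60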
Multipliers
• Radix-4 Booth Multiplier
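A minimal sketch of the recoding and the resulting multiply (illustrative Python, treating the multiplier as an even-width two's-complement value):

def booth_radix4_digits(y, n=8):
    # Scan overlapping 3-bit groups (y[i+1], y[i], y[i-1]) of the
    # multiplier and emit n/2 digits in {-2, -1, 0, +1, +2}.
    y &= (1 << n) - 1
    digits, prev = [], 0                     # implicit bit y[-1] = 0
    for i in range(0, n, 2):
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)   # -2*y[i+1] + y[i] + y[i-1]
        prev = b1
    return digits

def booth4_multiply(x, y, n=8):
    # Sum the digit-weighted partial products x * d * 4**i.
    return sum(d * x * 4**i for i, d in enumerate(booth_radix4_digits(y, n)))

assert booth4_multiply(13, 11) == 143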
Multipliers
• Radix-4 Booth Multiplier Truth Table using String Property
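For reference, the standard recoding table (each overlapping group of multiplier bits selects a multiple of the multiplicand Y):

y[2i+1] y[2i] y[2i-1]   digit   selected multiple
  0       0      0        0          +0
  0       0      1       +1          +Y
  0       1      0       +1          +Y
  0       1      1       +2         +2Y
  1       0      0       -2         -2Y
  1       0      1       -1          -Y
  1       1      0       -1          -Y
  1       1      1        0          -0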
Multipliers
• Radix-8 Booth Multiplier
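A minimal sketch (illustrative Python; multiplier width a multiple of 3, two's-complement). Note the digit 3 that appears for y = 11: this is the hard multiple, which needs a real adder to form 3Y = Y + 2Y.

def booth_radix8_digits(y, n=9):
    # Overlapping 4-bit groups give digits in {-4, ..., +4}.
    y &= (1 << n) - 1
    digits, prev = [], 0
    for i in range(0, n, 3):
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        b2 = (y >> (i + 2)) & 1
        digits.append(-4 * b2 + 2 * b1 + b0 + prev)
        prev = b2
    return digits

def booth8_multiply(x, y, n=9):
    return sum(d * x * 8**i for i, d in enumerate(booth_radix8_digits(y, n)))

assert booth_radix8_digits(11) == [3, 1, 0]   # the '3' is the hard multiple
assert booth8_multiply(13, 11) == 143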
Multipliers
• Radix-8 Booth Multiplier Truth Table using String Property
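For reference, the standard radix-8 recoding of each overlapping 4-bit group (digits +/-3 are the hard multiples):

y[3i+2..3i-1]  digit      y[3i+2..3i-1]  digit
  0000           0          1000          -4
  0001          +1          1001          -3
  0010          +1          1010          -3
  0011          +2          1011          -2
  0100          +2          1100          -2
  0101          +3          1101          -1
  0110          +3          1110          -1
  0111          +4          1111           0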
Multipliers
• Sign Extension
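Booth partial products are signed and must be extended to the full product width before reduction; a minimal sketch of plain two's-complement sign extension (illustrative Python, not the slide's optimized extension scheme):

def sign_extend(v, bits):
    # Interpret the low 'bits' bits of v as two's complement.
    sign = 1 << (bits - 1)
    return (v & (sign - 1)) - (v & sign)

assert sign_extend(0b1011, 4) == -5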
Multipliers
• Montgomery modular multiplier
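A minimal sketch of Montgomery multiplication for an odd modulus n (illustrative Python; variable names are assumptions). It returns a*b*R^-1 mod n, replacing the division by n with shifts and masks:

def montgomery_mul(a, b, n, r_bits):
    R = 1 << r_bits
    n_prime = pow(-n, -1, R)       # n * n_prime == -1 (mod R)
    t = a * b
    m = (t * n_prime) & (R - 1)    # m = t * n_prime mod R
    u = (t + m * n) >> r_bits      # t + m*n is exactly divisible by R
    return u - n if u >= n else u

# Operands live in the Montgomery domain: x_m = x * R mod n.
n, r_bits = 97, 8
R = 1 << r_bits
a, b = 42, 57
a_m, b_m = a * R % n, b * R % n
c_m = montgomery_mul(a_m, b_m, n, r_bits)     # equals a*b*R mod n
assert c_m * pow(R, -1, n) % n == a * b % n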
Multipliers
• School Method-Decomposed
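A minimal sketch of the decomposition (illustrative Python, unsigned n-bit operands split into halves): four half-size products are combined with shifts.

def school_decomposed(x, y, n=16):
    h = n // 2
    mask = (1 << h) - 1
    x1, x0 = x >> h, x & mask      # high and low halves
    y1, y0 = y >> h, y & mask
    # (x1*2^h + x0)(y1*2^h + y0) expanded: four sub-multiplications
    return (x1 * y1 << n) + ((x1 * y0 + x0 * y1) << h) + x0 * y0

assert school_decomposed(1234, 5678) == 1234 * 5678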
Multipliers
• Karatsuba–Ofman algorithm
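A minimal recursive sketch (illustrative Python; the base-case width is arbitrary): Karatsuba-Ofman replaces the fourth half-size product of the school method with extra additions and subtractions, giving three recursive multiplications instead of four.

def karatsuba(x, y, n=16):
    if n <= 8:                     # base case: direct multiply
        return x * y
    h = n // 2
    mask = (1 << h) - 1
    x1, x0 = x >> h, x & mask
    y1, y0 = y >> h, y & mask
    z2 = karatsuba(x1, y1, h)
    z0 = karatsuba(x0, y0, h)
    z1 = karatsuba(x1 + x0, y1 + y0, h + 1) - z2 - z0   # middle term
    return (z2 << n) + (z1 << h) + z0

assert karatsuba(1234, 5678) == 1234 * 5678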
Systolic Array Architecture
A 3x3 systolic array for matrix multiplication, A × B = C.
[Figure: a 3x3 grid of PEs (PE11..PE33), each with a multiplier and an adder. The columns of B enter from the top, skewed in time (b11 at t1, ..., b33 at t5); the rows of A stream in from the left (a11..a33, likewise skewed over t1..t5). Results emerge along the diagonals: c11 at t4, c12 and c21 at t5, c13, c22 and c31 at t6, c23 and c32 at t7, c33 at t8.]
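A minimal cycle-level sketch of this dataflow (illustrative NumPy model of an output-stationary array; the register bookkeeping is an assumption, not the slide's design):

import numpy as np

def systolic_matmul(A, B):
    # A streams in from the left, B from the top, each skewed by one
    # cycle per row/column; every PE multiplies its current inputs and
    # accumulates its own output element in place.
    N = A.shape[0]
    C = np.zeros((N, N), dtype=A.dtype)
    a_reg = np.zeros((N, N), dtype=A.dtype)   # values moving rightward
    b_reg = np.zeros((N, N), dtype=A.dtype)   # values moving downward
    for t in range(3 * N - 2):                # enough cycles to drain
        a_reg[:, 1:] = a_reg[:, :-1]          # shift right
        b_reg[1:, :] = b_reg[:-1, :]          # shift down
        for i in range(N):                    # inject skewed edge inputs
            k = t - i
            a_reg[i, 0] = A[i, k] if 0 <= k < N else 0
            b_reg[0, i] = B[k, i] if 0 <= k < N else 0
        C += a_reg * b_reg                    # every PE does one MAC
    return C

A = np.arange(9).reshape(3, 3)
B = np.arange(9, 18).reshape(3, 3)
assert np.array_equal(systolic_matmul(A, B), A @ B)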
Data Flows
• Output Stationary (OS): each PE keeps its partial sum in place while inputs and weights stream through
• Input Stationary (IS): input activations stay resident in the PEs
• Weight Stationary (WS): weights stay resident in the PEs
The loop-nest sketch below makes the distinction concrete.
[Figure: Dataflows in systolic arrays. Source: Scale-sim, [7]]
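One way to see the difference is as loop orderings of the same matrix multiplication; a minimal sketch of two of them (illustrative Python; input stationary is the analogous nest that keeps A[i][k] resident):

def output_stationary(A, B, M, K, N):
    C = [[0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            acc = 0                  # the partial sum C[i][j] stays put
            for k in range(K):
                acc += A[i][k] * B[k][j]
            C[i][j] = acc
    return C

def weight_stationary(A, B, M, K, N):
    C = [[0] * N for _ in range(M)]
    for k in range(K):
        for j in range(N):
            w = B[k][j]              # the weight stays resident
            for i in range(M):
                C[i][j] += A[i][k] * w
    return C

A, B = [[1, 2], [3, 4]], [[5, 6], [7, 8]]
assert output_stationary(A, B, 2, 2, 2) == weight_stationary(A, B, 2, 2, 2)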
Radix-4 vs Radix-8 Multiply-Accumulate (MAC) Unit
• Radix-4: number of partial products for an N-bit multiplier is ⌈(N+1)/2⌉
• Radix-8: number of partial products drops to ⌈(N+1)/3⌉, but the hard multiple 3Y = Y + 2Y requires a carry-propagate adder
[Figure: Radix-4 and Radix-8 MAC microarchitectures. In both, the multiplier X is Booth-recoded and drives Booth selection of multiples of Y; a Wallace reduction tree compresses the partial products, and a CPA followed by an accumulator produces the result. The radix-8 datapath needs an additional CPA to generate Y + 2Y.]
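A quick numeric check of the trade-off for 16-bit operands (illustrative arithmetic only):

import math

N = 16
pp_radix4 = math.ceil((N + 1) / 2)   # 9 partial products
pp_radix8 = math.ceil((N + 1) / 3)   # 6 partial products
# Radix-8 saves rows in the Wallace tree, but generating +/-3Y
# costs one extra carry-propagate addition (3Y = Y + 2Y).
print(pp_radix4, pp_radix8)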
Problem Statement
• The state of the art (SOTA) focuses on:
  • minimizing memory accesses
  • designing data flows with less data movement
• However, MAC units perform 99% of the computations in systolic arrays (SAs)
Our Methodology
Conclusion
• We will focus on both:
  • µ-architectures, e.g., multipliers, adders, and pipeline stages in systolic-array-based architectures
  • the RISC-V architecture, which we will explore for similar optimization approaches
References
[1] Sze, Vivienne, et al. "Efficient Processing of Deep Neural Networks." Synthesis Lectures on Computer Architecture 15.2 (2020): 1-341.
[2] Horowitz, Mark. "1.1 Computing's Energy Problem (and What We Can Do About It)." 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC). IEEE, 2014.
[3] Google. System Architecture. [Link], 2019.
[4] Inside Tesla's Neural Processor in the FSD Chip. [Link]
[5] Xilinx. Accelerating DNNs with Xilinx Alveo Accelerator Cards. White paper. [Link]
[6] Gupta, Suyog, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. "Deep Learning with Limited Numerical Precision." International Conference on Machine Learning, pp. 1737-1746, 2015.
[7] Samajdar, Ananda, et al. "SCALE-Sim: Systolic CNN Accelerator Simulator." arXiv preprint arXiv:1811.02883 (2018).
[8] Ullah, Inayat, Kashif Inayat, Joon-Sung Yang, and Jaeyong Chung. "Factored Radix-8 Systolic Array for Tensor Processing." 2020 57th ACM/IEEE Design Automation Conference (DAC), pp. 1-6. IEEE, 2020.
[9] Genc, H., Haj-Ali, A., Iyer, V., Amid, A., Mao, H., Wright, J., ... & Asanovic, K. "Gemmini: An Agile Systolic Array Generator Enabling Systematic Evaluations of Deep-Learning Architectures." arXiv preprint arXiv:1911.09925 (2019).
[10] Inayat, Kashif, and Jaeyong Chung. "Carry-Propagation-Adder-Factored Gemmini Systolic Array for Machine Learning Acceleration." Electronics 10.6 (2021): 652.
[11] Inayat, Kashif, and Jaeyong Chung. "Hybrid Accumulator Factored Systolic Array for Machine Learning Acceleration." IEEE Transactions on Very Large Scale Integration (VLSI) Systems (2022).
[12] Inayat, Kashif, Inayat Ullah, and Jaeyong Chung. "Hard Multiple Carry Portioned Factored Radix-8 Systolic Array for Tensor Processing." IEEE TVLSI, 2023 (accepted).
[13] "Review and Benchmarking of Precision-Scalable Multiply-Accumulate Unit Architectures for Embedded Neural-Network Processing." IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9.4 (2019): 697-711.
[14] Ryu, Sungju, et al. "BitBlade: Area and Energy-Efficient Precision-Scalable Neural Network Accelerator with Bitwise Summation." Proceedings of the 56th Annual Design Automation Conference. 2019.
[15] Sharma, Hardik, et al. "Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Network." 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2018.
[16] "Deep Learning with Limited Numerical Precision." International Conference on Machine Learning. PMLR, 2015.
Thank you for Listening!