
Computer Architecture

A Quantitative Approach, Sixth Edition

Chapter 7

Domain-Specific Architectures

Copyright © 2019, Elsevier Inc. All Rights Reserved


Introduction: General Purpose Architectures
 Increasing transistor availability (Moore’s Law) enabled:
 Deep memory hierarchy
 Wide SIMD units
 Deep pipelines
 Branch prediction
 Out-of-order execution
 Speculative prefetching
 Multithreading
 Multiprocessing

 But the end of Dennard scaling => more power per transistor
 And the power budget of chips remained the same
=> can’t use the additional transistors
unless performance/joule drastically improves



Introduction
Performance/Joule
 Add more arithmetic operations per instruction to get better performance/joule



Introduction
Domain-Specific Architecture (DSA)
 Need a factor-of-100 improvement in the number of
operations per instruction
 Requires domain-specific architectures (DSAs)
 ASICs: amortizing NRE (non-recurring engineering) costs requires large volumes
 FPGAs: less efficient than ASICs

 The software ecosystem (e.g., new compilers and libraries)
for DSAs is a challenge



Guidelines for DSAs
Five Guidelines for DSAs
To achieve: (i) better area and energy efficiency,
(ii) design simplicity/lower NRE, and (iii) faster
response time for user-facing applications:
 Use dedicated memories (instead of caches) to minimize data
movement.
 Software knows when and where data is needed.
 Invest resources in more arithmetic units or bigger memories,
 instead of features like superscalar issue, multithreading, etc.
 Use the easiest form of parallelism that matches the domain.
 Reduce data size and type to the simplest needed for the
domain.
 Helps improve effective memory bandwidth, and enables more arithmetic units
 Use a domain-specific programming language.
 Languages like TensorFlow are a better match for porting DNN apps to
DSAs than standard languages like C++



Guidelines for DSAs



Example: Deep Neural Networks
 Inspired by neurons in the brain
 Neurons arranged in layers
 Each neuron computes a non-linear “activation” function of the weighted
sum of its input values (see the sketch below)
 Types of DNN:
 Multi-Layer Perceptrons (MLPs)
 Convolutional Neural Networks (CNNs)
 Recurrent Neural Networks (RNNs)
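
A minimal NumPy sketch of the computation above: one layer of neurons applies a nonlinear activation (here ReLU) to the weighted sum of its inputs. The layer sizes are illustrative, not taken from the slides.

import numpy as np

def neuron_layer(x, W, f=lambda v: np.maximum(v, 0.0)):
    # One layer of neurons: activation f applied to the weighted sum W @ x.
    # x has Dim[i-1] elements, W is Dim[i] x Dim[i-1], output has Dim[i] elements.
    return f(W @ x)

# Illustrative sizes: 4 inputs feeding 3 neurons.
x = np.random.rand(4)
W = np.random.randn(3, 4)
print(neuron_layer(x, W))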



Example: Deep Neural Networks
Example: Deep Neural Networks (DNNs)
 Most practitioners will choose an existing design
 Topology and data type
 Training (learning):
 Supervised learning: stochastic gradient descent
 Weights are calculated using the backpropagation algorithm
 Unsupervised learning: used in the absence of labeled training data;
reinforcement learning is one approach
 Inference: use the trained neural network, e.g., for classification
 ~100 ms per inference, much less than training time



Example: Deep Neural Networks
Multi-Layer Perceptrons (MLPs)
 Parameters:
 Dim[i]: number of neurons
 Dim[i-1]: dimension of input vector
 Number of weights: Dim[i-1] × Dim[i]
 Operations: 2 × Dim[i-1] × Dim[i]
 Operations/weight: 2
 Typical NLF (nonlinear function), ReLU:
f(x) = max(x, 0)
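
The parameter and operation counts above can be checked with a short Python sketch; the layer dimensions in the example call are illustrative, not from the slides.

def mlp_layer_counts(dim_prev, dim):
    # One fully connected (MLP) layer, following the slide's formulas:
    # weights = Dim[i-1] * Dim[i]; operations = 2 * weights (a multiply and an add per weight).
    weights = dim_prev * dim
    operations = 2 * weights
    return weights, operations, operations / weights  # ops/weight == 2

print(mlp_layer_counts(dim_prev=4096, dim=2048))  # (8388608, 16777216, 2.0)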



Example: Deep Neural Networks
Convolutional Neural Network
 Computer vision
 Each layer raises the level of abstraction
 First layer recognizes horizontal and vertical lines
 Second layer recognizes corners
 Third layer recognizes shapes
 Fourth layer recognizes features, such as ears of a dog
 Higher layers recognize different breeds of dogs



Example: Deep Neural Networks
Convolutional Neural Network
 Parameters:
 DimFM[i-1]: dimension of the (square) input feature map
 DimFM[i]: dimension of the (square) output feature map
 DimSten[i]: dimension of the (square) stencil
 NumFM[i-1]: number of input feature maps
 NumFM[i]: number of output feature maps
 Number of neurons: NumFM[i] × DimFM[i]²
 Number of weights per output feature map: NumFM[i-1] × DimSten[i]²
 Total number of weights per layer: NumFM[i] × number of weights per output feature map
 Number of operations per output feature map: 2 × DimFM[i]² × number of weights per output feature map
 Total number of operations per layer: NumFM[i] × number of operations per output feature map
= 2 × DimFM[i]² × NumFM[i] × number of weights per output feature map
= 2 × DimFM[i]² × total number of weights per layer
 Operations/weight: 2 × DimFM[i]²
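
As a sanity check on these formulas, here is a small Python sketch; the layer sizes in the example call are illustrative only.

def cnn_layer_counts(num_fm_prev, num_fm, dim_fm, dim_sten):
    # One convolutional layer with square feature maps and stencils,
    # following the slide's formulas.
    neurons = num_fm * dim_fm ** 2
    weights_per_ofm = num_fm_prev * dim_sten ** 2
    weights_per_layer = num_fm * weights_per_ofm
    ops_per_layer = 2 * dim_fm ** 2 * weights_per_layer
    ops_per_weight = ops_per_layer / weights_per_layer   # = 2 * DimFM[i]^2
    return neurons, weights_per_layer, ops_per_layer, ops_per_weight

# Illustrative sizes: 64 input and 128 output feature maps, 56x56 output, 3x3 stencil.
print(cnn_layer_counts(num_fm_prev=64, num_fm=128, dim_fm=56, dim_sten=3))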



Example: Deep Neural Networks
Recurrent Neural Network
 Speech recognition and language translation
 Long short-term memory (LSTM) network



Example: Deep Neural Networks
Recurrent Neural Network
 Parameters:
 Number of weights per cell:
3 × (3 × Dim × Dim) + (2 × Dim × Dim) + (1 × Dim × Dim) = 12 × Dim²
 Number of operations for the 5 vector-matrix multiplies per cell:
2 × number of weights per cell = 24 × Dim²
 Number of operations for the 3 element-wise multiplies and 1 addition
(vectors are all the size of the output): 4 × Dim
 Total number of operations per cell
(5 vector-matrix multiplies and the 4 element-wise operations): 24 × Dim² + 4 × Dim
 Operations/weight: ~2
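
These LSTM cell counts are easy to reproduce in a few lines of Python; the Dim value in the example call is illustrative only.

def lstm_cell_counts(dim):
    # One LSTM cell, following the slide's formulas.
    weights = 3 * (3 * dim * dim) + (2 * dim * dim) + (1 * dim * dim)   # 12 * Dim^2
    matmul_ops = 2 * weights                                            # 24 * Dim^2
    elementwise_ops = 4 * dim              # 3 element-wise multiplies + 1 addition
    total_ops = matmul_ops + elementwise_ops
    return weights, total_ops, total_ops / weights   # ops/weight is roughly 2

print(lstm_cell_counts(dim=1024))   # weights = 12582912, total ops = 25169920, ratio ~ 2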



Example: Deep Neural Networks
Convolutional Neural Network
 Batches:
 Reuse weights once fetched from memory across multiple inputs
 Increases operational intensity
 Quantization
 Use 8- or 16-bit fixed point (a minimal sketch follows after this list)
 Summary:
 Need the following kernels:
 Matrix-vector multiply
 Matrix-matrix multiply
 Stencil
 ReLU
 Sigmoid
 Hyperbolic tangent
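
A minimal sketch of the 8-bit quantization idea mentioned above: scale floating-point weights into the int8 range with one scale factor. This is a generic symmetric scheme for illustration, not the exact method any particular accelerator uses.

import numpy as np

def quantize_int8(x):
    # Symmetric 8-bit quantization: map values into [-127, 127] with one scale factor.
    scale = max(np.max(np.abs(x)), 1e-8) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_int8(w)
print(np.max(np.abs(dequantize(q, s) - w)))   # small quantization error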



Tensor Processing Unit
 Google’s DNN ASIC for high volume data center use
 256 x 256 8-bit matrix multiply unit
 Large software-managed scratchpad
 Coprocessor on the PCIe bus
 Simple, in-order, deterministic execution unit
 Meets the 99th-percentile response-time requirements of DNN inference



Tensor Processing Unit



Tensor Processing Unit
TPU ISA
 Read_Host_Memory
 Reads data from CPU host memory into the Unified Buffer
 Read_Weights
 Reads weights from the Weight Memory into the Weight FIFO as input
to the Matrix Unit
 MatrixMultiply/Convolve
 Performs a matrix-matrix multiply, a vector-matrix multiply, an element-wise
matrix multiply, an element-wise vector multiply, or a convolution
from the Unified Buffer into the Accumulators
 Takes a variable-sized B×256 input, multiplies it by a 256×256 constant
input, and produces a B×256 output, taking B pipelined cycles to
complete
 Activate
 Computes the activation function
 Write_Host_Memory
 Writes data from the Unified Buffer into host memory
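
To make the five instructions concrete, here is a toy functional model in Python of one fully connected layer flowing through them. The class and method names, buffer representation, and example sizes are assumptions for illustration; they are not Google's hardware interface or driver API.

import numpy as np

class TPUModel:
    # Toy functional model of the five TPU instructions (illustrative only).
    def __init__(self):
        self.unified_buffer = None   # stands in for the 24 MiB Unified Buffer
        self.weight_fifo = None      # stands in for the Weight FIFO
        self.accumulators = None     # stands in for the 4 MiB Accumulators

    def read_host_memory(self, host_array):          # host memory -> Unified Buffer
        self.unified_buffer = np.asarray(host_array, dtype=np.int8)

    def read_weights(self, weights):                  # Weight Memory -> Weight FIFO
        assert weights.shape == (256, 256)
        self.weight_fifo = np.asarray(weights, dtype=np.int8)

    def matrix_multiply(self):                        # B x 256 times 256 x 256 -> B x 256
        self.accumulators = self.unified_buffer.astype(np.int32) @ self.weight_fifo.astype(np.int32)

    def activate(self):                               # ReLU, back into the Unified Buffer
        self.unified_buffer = np.maximum(self.accumulators, 0)

    def write_host_memory(self):                      # Unified Buffer -> host memory
        return self.unified_buffer

tpu = TPUModel()
tpu.read_host_memory(np.random.randint(-128, 128, size=(8, 256)))   # B = 8
tpu.read_weights(np.random.randint(-128, 128, size=(256, 256)))
tpu.matrix_multiply()
tpu.activate()
print(tpu.write_host_memory().shape)   # (8, 256)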



Tensor Processing Unit
TPU ISA



Tensor Processing Unit
TPU ISA



Tensor Processing Unit
Improving the TPU



Tensor Processing Unit
The TPU and the Guidelines
 Use dedicated memories
 24 MiB dedicated buffer, 4 MiB accumulator buffers
 Invest resources in arithmetic units and dedicated
memories
 60% of the memory and 250X the arithmetic units of a server-class CPU
 Use the easiest form of parallelism that matches the
domain
 Exploits 2D SIMD parallelism
 Reduce the data size and type needed for the domain
 Primarily uses 8-bit integers
 Use a domain-specific programming language
 Uses TensorFlow



Microsoft Catapult
 Needed to be general
purpose and power efficient
 Uses an FPGA PCIe board with a
dedicated 20 Gbps network in a 6 × 8
torus
 Each of the 48 servers in half the
rack has a Catapult board
 Limited to 25 watts
 32 MiB Flash memory
 Two banks of DDR3-1600 (11
GB/s) and 8 GiB DRAM
 FPGA (unconfigured) has 3926
18-bit ALUs and 5 MiB of on-chip
memory
 Programmed in Verilog RTL
 Shell is 23% of the FPGA



Microsoft Catapult
Microsoft Catapult: CNN
 CNN accelerator, mapped across multiple FPGAs



Microsoft Catapult
Microsoft Catapult: CNN



Microsoft Catapult
Microsoft Catapult: Bing Search Ranking
 Feature extraction (1 FPGA)
 Extracts 4,500 features for every document-query pair, e.g., the frequency with which the query
appears in the page
 Systolic array of FSMs
 Free-form expressions (2 FPGAs)
 Calculates feature combinations
 Machine-learned Scoring (1 FPGA for compression, 3 FPGAs calculate
score)
 Uses results of previous two stages to calculate floating-point score
 One FPGA allocated as a hot-spare



Microsoft Catapult
Microsoft Catapult: Search Ranking

 Free-form expression evaluation
 60-core processor implemented on the FPGA
 Pipelined cores
 Each core supports four threads that can hide each other’s latency
 Threads are statically prioritized according to thread latency



Microsoft Catapult
Microsoft Catapult: Search Ranking
 Version 2 of Catapult
 Places the FPGA between the CPU
and the NIC
 Network increased from 10 Gb/s to
40 Gb/s
 The FPGA also performs network
acceleration
 Shell now consumes 44% of the
FPGA
 The FPGA now performs only feature
extraction
 Scales to all server FPGAs
connected through the data-center
network (not limited to a 48-FPGA
dedicated network)



Microsoft Catapult
Catapult and the Guidelines
 Use dedicated memories
 5 MiB dedicated memory
 Invest resources in arithmetic units and dedicated
memories
 3926 ALUs
 Use the easiest form of parallelism that matches the
domain
 2D SIMD for CNN, MISD parallelism for search scoring
 Reduce the data size and type needed for the
domain
 Uses mixture of 8-bit integers and 64-bit floating-point
 Use a domain-specific programming language
 Uses Verilog RTL; Microsoft did not follow this guideline



Intel Crest
 DNN training
 16-bit “flex point” format
 A 5-bit exponent that is part of the instruction and shared by all data
(a rough sketch of a shared-exponent format follows below)
 Operates on blocks of 32 × 32 matrices
 SRAM + HBM2
 ICC (Inter-Chip Controller), ICL (Inter-Chip Links)
 1 TB/s memory bandwidth
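
The exact Crest/Flexpoint encoding is not spelled out on this slide, so the following is only a rough Python sketch of the general idea of a shared-exponent, 16-bit-mantissa block format; the function names and the way the exponent is chosen here are assumptions for illustration.

import numpy as np

def to_flex16(block, mant_bits=16):
    # One shared exponent per block of values, 16-bit integer mantissas (illustrative).
    exp = int(np.ceil(np.log2(np.max(np.abs(block)) + 1e-30)))    # shared exponent for the block
    scale = 2.0 ** (exp - (mant_bits - 1))                        # place values in int16 range
    lim = 2 ** (mant_bits - 1)
    mant = np.clip(np.round(block / scale), -lim, lim - 1).astype(np.int16)
    return mant, exp

def from_flex16(mant, exp, mant_bits=16):
    return mant.astype(np.float32) * 2.0 ** (exp - (mant_bits - 1))

x = np.random.randn(32, 32).astype(np.float32)   # Crest operates on 32 x 32 blocks
m, e = to_flex16(x)
print(np.max(np.abs(from_flex16(m, e) - x)))     # small representation error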



Pixel Visual Core
Not Covered in HPCA 2023 Class



Pixel Visual Core

 Pixel Visual Core
 Image Processing Unit (IPU)
 Performs stencil operations (see the sketch below)
 Descended from Image Signal Processors (ISPs)
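
A minimal Python sketch of the kind of 2D stencil computation the IPU accelerates, written as a plain loop for clarity; the 3×3 box-blur kernel and image size are illustrative only.

import numpy as np

def stencil_3x3(image, kernel):
    # Apply a 3x3 stencil (no padding), the basic operation of image-processing pipelines.
    h, w = image.shape
    out = np.zeros((h - 2, w - 2), dtype=image.dtype)
    for y in range(h - 2):
        for x in range(w - 2):
            out[y, x] = np.sum(image[y:y+3, x:x+3] * kernel)
    return out

img = np.random.rand(8, 8).astype(np.float32)
blur = np.full((3, 3), 1.0 / 9, dtype=np.float32)   # simple box-blur stencil
print(stencil_3x3(img, blur).shape)                 # (6, 6)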



Pixel Visual Core

 Software written in Halide, a domain-specific language
 Compiled to a virtual ISA (vISA)
 vISA is lowered to the physical ISA (pISA) using application-specific
parameters
 pISA is VLIW
 Optimized for energy
 Power budget is 6 to 8 W for bursts of 10–20 seconds,
dropping to tens of milliwatts when not in use
 An 8-bit DRAM access costs as much energy as 12,500 8-bit
integer operations or 7 to 100 8-bit SRAM accesses
 IEEE 754 operations cost 22× to 150× as much as 8-bit
integer operations
 Optimized for 2D access
 2D SIMD unit
 On-chip SRAM structured using a square geometry

Pixel Visual Core



Pixel Visual Core



Pixel Visual Core



Pixel Visual Core
Visual Core and the Guidelines

 Use dedicated memories
 128 + 64 MiB dedicated memory per core
 Invest resources in arithmetic units and dedicated
memories
 16x16 2D array of processing elements per core and 2D
shifting network per core
 Use the easiest form of parallelism that matches the
domain
 2D SIMD and VLIW
 Reduce the data size and type needed for the
domain
 Uses mixture of 8-bit and 16-bit integers
 Use a domain-specific programming language
 Halide for image processing and TensorFlow for CNNs



Fallacies and Pitfalls

 Fallacy: It costs $100 million to design a custom chip
 Pitfall: Performance counters added as an afterthought
 Fallacy: Architects are tackling the right DNN tasks
 Fallacy: For DNN hardware, inferences per second (IPS)
is a fair summary performance metric
 Pitfall: Being ignorant of architecture history when
designing a DSA

