
Computer Architecture

A Quantitative Approach, Sixth Edition

Chapter 7

Domain-Specific Architectures

Copyright © 2019, Elsevier Inc. All Rights Reserved


Introduction: General Purpose Architectures
 Increasing transistor availability (Moore’s Law) enabled:
 Deep memory hierarchy
 Wide SIMD units
 Deep pipelines
 Branch prediction
 Out-of-order execution
 Speculative prefetching
 Multithreading
 Multiprocessing

 But the end of Dennard scaling => more power per transistor
 And the power budget of chips remained the same
=> can’t use the additional transistors
unless performance/joule drastically improves



Introduction
Performance/Joule
 Add more arithmetic operations per instruction to get better performance/joule



Introduction
Domain-Specific Architecture (DSA)
 Need a factor-of-100 improvement in the number of
operations per instruction
 Requires domain-specific architectures (DSAs)
 ASICs: amortizing NRE (non-recurring engineering) costs requires large volumes
 FPGAs: less efficient than ASICs

 The software ecosystem (e.g., new compilers and libraries)
for DSAs is a challenge



Guidelines for DSAs
Five Guidelines for DSAs
To achieve: (i) better area and energy efficiency,
(ii) design simplicity/lower NRE, and (iii) faster
response time for user-facing applications:
 Use dedicated memories (instead of caches) to minimize data
movement.
 Software knows when and where data is needed.
 Invest resources in more arithmetic units or bigger memories,
 instead of features like superscalar issue, multithreading, etc.
 Use the easiest form of parallelism that matches the domain.
 Reduce data size and type to the simplest needed for the
domain.
 Helps improve effective memory bandwidth, and enables more arithmetic units
 Use a domain-specific programming language.
 Languages like TensorFlow are a better match for porting DNN apps to
DSAs than standard languages like C++



Guidelines for DSAs



Example: Deep Neural Networks
 Inspired by neurons in the brain
 Neurons arranged in layers
 Each neuron computes a non-linear “activation” function of the weighted
sum of its input values (see the sketch below)
 Types of DNN:
 Multi-Layer Perceptrons (MLPs)
 Convolutional Neural Networks (CNNs)
 Recurrent Neural Networks (RNNs)
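
A minimal NumPy sketch of the computation above: one layer of neurons applies a nonlinear activation (here ReLU) to the weighted sum of its inputs. The layer sizes are illustrative, not taken from the slides.

import numpy as np

def neuron_layer(x, W, f=lambda v: np.maximum(v, 0.0)):
    # One layer of neurons: activation f applied to the weighted sum W @ x.
    # x has Dim[i-1] elements, W is Dim[i] x Dim[i-1], output has Dim[i] elements.
    return f(W @ x)

# Illustrative sizes: 4 inputs feeding 3 neurons.
x = np.random.rand(4)
W = np.random.randn(3, 4)
print(neuron_layer(x, W))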



Example: Deep Neural Networks
Example: Deep Neural Networks (DNNs)
 Most practitioners will choose an existing design
 Topology and data type
 Training (learning):
 Supervised learning: stochastic gradient descent
 Weights are calculated using the backpropagation algorithm
 Unsupervised learning: used in the absence of labeled training data;
reinforcement learning is one approach
 Inference: use the trained neural network, e.g., for classification
 ~100 ms per inference, much less than training time



Example: Deep Neural Networks
Multi-Layer Perceptrons (MLPs)
 Parameters:
 Dim[i]: number of neurons
 Dim[i-1]: dimension of input vector
 Number of weights: Dim[i-1] × Dim[i]
 Operations: 2 × Dim[i-1] × Dim[i]
 Operations/weight: 2
 Typical NLF (nonlinear function), ReLU:
f(x) = max(x, 0)
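
The parameter and operation counts above can be checked with a short Python sketch; the layer dimensions in the example call are illustrative, not from the slides.

def mlp_layer_counts(dim_prev, dim):
    # One fully connected (MLP) layer, following the slide's formulas:
    # weights = Dim[i-1] * Dim[i]; operations = 2 * weights (a multiply and an add per weight).
    weights = dim_prev * dim
    operations = 2 * weights
    return weights, operations, operations / weights  # ops/weight == 2

print(mlp_layer_counts(dim_prev=4096, dim=2048))  # (8388608, 16777216, 2.0)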



Example: Deep Neural Networks
Convolutional Neural Network
 Computer vision
 Each layer raises the level of abstraction
 First layer recognizes horizontal and vertical lines
 Second layer recognizes corners
 Third layer recognizes shapes
 Fourth layer recognizes features, such as ears of a dog
 Higher layers recognize different breeds of dogs



Example: Deep Neural Networks
Convolutional Neural Network
 Parameters:
 DimFM[i-1]: dimension of the (square) input feature map
 DimFM[i]: dimension of the (square) output feature map
 DimSten[i]: dimension of the (square) stencil
 NumFM[i-1]: number of input feature maps
 NumFM[i]: number of output feature maps
 Number of neurons: NumFM[i] × DimFM[i]²
 Number of weights per output feature map: NumFM[i-1] × DimSten[i]²
 Total number of weights per layer: NumFM[i] × number of weights per output feature map
 Number of operations per output feature map: 2 × DimFM[i]² × number of weights per output feature map
 Total number of operations per layer: NumFM[i] × number of operations per output feature map
= 2 × DimFM[i]² × NumFM[i] × number of weights per output feature map
= 2 × DimFM[i]² × total number of weights per layer
 Operations/weight: 2 × DimFM[i]²
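
As a sanity check on these formulas, here is a small Python sketch; the layer sizes in the example call are illustrative only.

def cnn_layer_counts(num_fm_prev, num_fm, dim_fm, dim_sten):
    # One convolutional layer with square feature maps and stencils,
    # following the slide's formulas.
    neurons = num_fm * dim_fm ** 2
    weights_per_ofm = num_fm_prev * dim_sten ** 2
    weights_per_layer = num_fm * weights_per_ofm
    ops_per_layer = 2 * dim_fm ** 2 * weights_per_layer
    ops_per_weight = ops_per_layer / weights_per_layer   # = 2 * DimFM[i]^2
    return neurons, weights_per_layer, ops_per_layer, ops_per_weight

# Illustrative sizes: 64 input and 128 output feature maps, 56x56 output, 3x3 stencil.
print(cnn_layer_counts(num_fm_prev=64, num_fm=128, dim_fm=56, dim_sten=3))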



Example: Deep Neural Networks
Recurrent Neural Network
 Speech recognition and language translation
 Long short-term memory (LSTM) network



Example: Deep Neural Networks
Recurrent Neural Network
 Parameters:
 Number of weights per cell:
3 × (3 × Dim × Dim) + (2 × Dim × Dim) + (1 × Dim × Dim) = 12 × Dim²
 Number of operations for the 5 vector-matrix multiplies per cell:
2 × number of weights per cell = 24 × Dim²
 Number of operations for the 3 element-wise multiplies and 1 addition
(vectors are all the size of the output): 4 × Dim
 Total number of operations per cell
(5 vector-matrix multiplies and the 4 element-wise operations): 24 × Dim² + 4 × Dim
 Operations/weight: ~2
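
These LSTM cell counts are easy to reproduce in a few lines of Python; the Dim value in the example call is illustrative only.

def lstm_cell_counts(dim):
    # One LSTM cell, following the slide's formulas.
    weights = 3 * (3 * dim * dim) + (2 * dim * dim) + (1 * dim * dim)   # 12 * Dim^2
    matmul_ops = 2 * weights                                            # 24 * Dim^2
    elementwise_ops = 4 * dim              # 3 element-wise multiplies + 1 addition
    total_ops = matmul_ops + elementwise_ops
    return weights, total_ops, total_ops / weights   # ops/weight is roughly 2

print(lstm_cell_counts(dim=1024))   # weights = 12582912, total ops = 25169920, ratio ~ 2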



Example: Deep Neural Networks
Convolutional Neural Network
 Batches:
 Reuse weights once fetched from memory across multiple inputs
 Increases operational intensity
 Quantization
 Use 8- or 16-bit fixed point (a minimal sketch follows after this list)
 Summary:
 Need the following kernels:
 Matrix-vector multiply
 Matrix-matrix multiply
 Stencil
 ReLU
 Sigmoid
 Hyperbolic tangent
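
A minimal sketch of the 8-bit quantization idea mentioned above: scale floating-point weights into the int8 range with one scale factor. This is a generic symmetric scheme for illustration, not the exact method any particular accelerator uses.

import numpy as np

def quantize_int8(x):
    # Symmetric 8-bit quantization: map values into [-127, 127] with one scale factor.
    scale = max(np.max(np.abs(x)), 1e-8) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_int8(w)
print(np.max(np.abs(dequantize(q, s) - w)))   # small quantization error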



Tensor Processing Unit
 Google’s DNN ASIC for high volume data center use
 256 x 256 8-bit matrix multiply unit
 Large software-managed scratchpad
 Coprocessor on the PCIe bus
 Simple, in-order, deterministic execution unit
 Meets the 99th-percentile response-time requirements of DNN inference



Tensor Processing Unit



Tensor Processing Unit
TPU ISA
 Read_Host_Memory
 Reads data from CPU host memory into the Unified Buffer
 Read_Weights
 Reads weights from the Weight Memory into the Weight FIFO as input
to the Matrix Unit
 MatrixMultiply/Convolve
 Performs a matrix-matrix multiply, a vector-matrix multiply, an element-wise
matrix multiply, an element-wise vector multiply, or a convolution
from the Unified Buffer into the Accumulators
 Takes a variable-sized B×256 input, multiplies it by a 256×256 constant
input, and produces a B×256 output, taking B pipelined cycles to
complete
 Activate
 Computes the activation function
 Write_Host_Memory
 Writes data from the Unified Buffer into host memory
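
To make the five instructions concrete, here is a toy functional model in Python of one fully connected layer flowing through them. The class and method names, buffer representation, and example sizes are assumptions for illustration; they are not Google's hardware interface or driver API.

import numpy as np

class TPUModel:
    # Toy functional model of the five TPU instructions (illustrative only).
    def __init__(self):
        self.unified_buffer = None   # stands in for the 24 MiB Unified Buffer
        self.weight_fifo = None      # stands in for the Weight FIFO
        self.accumulators = None     # stands in for the 4 MiB Accumulators

    def read_host_memory(self, host_array):          # host memory -> Unified Buffer
        self.unified_buffer = np.asarray(host_array, dtype=np.int8)

    def read_weights(self, weights):                  # Weight Memory -> Weight FIFO
        assert weights.shape == (256, 256)
        self.weight_fifo = np.asarray(weights, dtype=np.int8)

    def matrix_multiply(self):                        # B x 256 times 256 x 256 -> B x 256
        self.accumulators = self.unified_buffer.astype(np.int32) @ self.weight_fifo.astype(np.int32)

    def activate(self):                               # ReLU, back into the Unified Buffer
        self.unified_buffer = np.maximum(self.accumulators, 0)

    def write_host_memory(self):                      # Unified Buffer -> host memory
        return self.unified_buffer

tpu = TPUModel()
tpu.read_host_memory(np.random.randint(-128, 128, size=(8, 256)))   # B = 8
tpu.read_weights(np.random.randint(-128, 128, size=(256, 256)))
tpu.matrix_multiply()
tpu.activate()
print(tpu.write_host_memory().shape)   # (8, 256)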



Tensor Processing Unit
TPU ISA



Tensor Processing Unit
TPU ISA



Tensor Processing Unit
Improving the TPU



Tensor Processing Unit
The TPU and the Guidelines
 Use dedicated memories
 24 MiB dedicated buffer, 4 MiB accumulator buffers
 Invest resources in arithmetic units and dedicated
memories
 60% of the memory and 250X the arithmetic units of a server-class CPU
 Use the easiest form of parallelism that matches the
domain
 Exploits 2D SIMD parallelism
 Reduce the data size and type needed for the domain
 Primarily uses 8-bit integers
 Use a domain-specific programming language
 Uses TensorFlow



Microsoft Catapult
 Needed to be general
purpose and power efficient
 Uses an FPGA PCIe board with a
dedicated 20 Gbps network in a 6 × 8
torus
 Each of the 48 servers in half the
rack has a Catapult board
 Limited to 25 watts
 32 MiB Flash memory
 Two banks of DDR3-1600 (11
GB/s) and 8 GiB DRAM
 FPGA (unconfigured) has 3926
18-bit ALUs and 5 MiB of on-chip
memory
 Programmed in Verilog RTL
 Shell is 23% of the FPGA



Microsoft Catapult
Microsoft Catapult: CNN
 CNN accelerator, mapped across multiple FPGAs



Microsoft Catapult
Microsoft Catapult: CNN



Microsoft Catapult
Microsoft Catapult: Bing Search Ranking
 Feature extraction (1 FPGA)
 Extracts 4,500 features for every document-query pair, e.g., the frequency with which the query
appears in the page
 Systolic array of FSMs
 Free-form expressions (2 FPGAs)
 Calculates feature combinations
 Machine-learned Scoring (1 FPGA for compression, 3 FPGAs calculate
score)
 Uses results of previous two stages to calculate floating-point score
 One FPGA allocated as a hot-spare



Microsoft Catapult
Microsoft Catapult: Search Ranking

 Free-form expression evaluation
 60-core processor implemented on the FPGA
 Pipelined cores
 Each core supports four threads that can hide each other’s latency
 Threads are statically prioritized according to thread latency



Microsoft Catapult
Microsoft Catapult: Search Ranking
 Version 2 of Catapult
 Places the FPGA between the CPU
and the NIC
 Network increased from 10 Gb/s to
40 Gb/s
 The FPGA also performs network
acceleration
 Shell now consumes 44% of the
FPGA
 The FPGA now performs only feature
extraction
 Scales to all server FPGAs
connected through the data-center
network (not limited to a 48-FPGA
dedicated network)



Microsoft Catapult
Catapult and the Guidelines
 Use dedicated memories
 5 MiB dedicated memory
 Invest resources in arithmetic units and dedicated
memories
 3926 ALUs
 Use the easiest form of parallelism that matches the
domain
 2D SIMD for CNN, MISD parallelism for search scoring
 Reduce the data size and type needed for the
domain
 Uses mixture of 8-bit integers and 64-bit floating-point
 Use a domain-specific programming language
 Uses Verilog RTL; Microsoft did not follow this guideline



Intel Crest
 DNN training
 16-bit “flex point” format
 A 5-bit exponent that is part of the instruction and shared by all data
(a rough sketch of a shared-exponent format follows below)
 Operates on blocks of 32 × 32 matrices
 SRAM + HBM2
 ICC (Inter-Chip Controller), ICL (Inter-Chip Links)
 1 TB/s memory bandwidth
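
The exact Crest/Flexpoint encoding is not spelled out on this slide, so the following is only a rough Python sketch of the general idea of a shared-exponent, 16-bit-mantissa block format; the function names and the way the exponent is chosen here are assumptions for illustration.

import numpy as np

def to_flex16(block, mant_bits=16):
    # One shared exponent per block of values, 16-bit integer mantissas (illustrative).
    exp = int(np.ceil(np.log2(np.max(np.abs(block)) + 1e-30)))    # shared exponent for the block
    scale = 2.0 ** (exp - (mant_bits - 1))                        # place values in int16 range
    lim = 2 ** (mant_bits - 1)
    mant = np.clip(np.round(block / scale), -lim, lim - 1).astype(np.int16)
    return mant, exp

def from_flex16(mant, exp, mant_bits=16):
    return mant.astype(np.float32) * 2.0 ** (exp - (mant_bits - 1))

x = np.random.randn(32, 32).astype(np.float32)   # Crest operates on 32 x 32 blocks
m, e = to_flex16(x)
print(np.max(np.abs(from_flex16(m, e) - x)))     # small representation error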



Pixel Visual Core
Not Covered in HPCA 2023 Class



Pixel Visual Core

 Pixel Visual Core
 Image Processing Unit (IPU)
 Performs stencil operations (see the sketch below)
 Descended from Image Signal Processors (ISPs)
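
A minimal Python sketch of the kind of 2D stencil computation the IPU accelerates, written as a plain loop for clarity; the 3×3 box-blur kernel and image size are illustrative only.

import numpy as np

def stencil_3x3(image, kernel):
    # Apply a 3x3 stencil (no padding), the basic operation of image-processing pipelines.
    h, w = image.shape
    out = np.zeros((h - 2, w - 2), dtype=image.dtype)
    for y in range(h - 2):
        for x in range(w - 2):
            out[y, x] = np.sum(image[y:y+3, x:x+3] * kernel)
    return out

img = np.random.rand(8, 8).astype(np.float32)
blur = np.full((3, 3), 1.0 / 9, dtype=np.float32)   # simple box-blur stencil
print(stencil_3x3(img, blur).shape)                 # (6, 6)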



Pixel Visual Core

 Software written in Halide, a domain-specific language
 Compiled to a virtual ISA (vISA)
 vISA is lowered to the physical ISA (pISA) using application-specific
parameters
 pISA is VLIW
 Optimized for energy
 Power budget is 6 to 8 W for bursts of 10–20 seconds,
dropping to tens of milliwatts when not in use
 An 8-bit DRAM access costs as much energy as 12,500 8-bit
integer operations or 7 to 100 8-bit SRAM accesses
 IEEE 754 operations cost 22× to 150× as much as 8-bit
integer operations
 Optimized for 2D access
 2D SIMD unit
 On-chip SRAM structured using a square geometry

Pixel Visual Core



Pixel Visual Core



Pixel Visual Core



Pixel Visual Core
Visual Core and the Guidelines

 Use dedicated memories
 128 + 64 MiB dedicated memory per core
 Invest resources in arithmetic units and dedicated
memories
 16x16 2D array of processing elements per core and 2D
shifting network per core
 Use the easiest form of parallelism that matches the
domain
 2D SIMD and VLIW
 Reduce the data size and type needed for the
domain
 Uses mixture of 8-bit and 16-bit integers
 Use a domain-specific programming language
 Halide for image processing and TensorFlow for CNNs



Fallacies and Pitfalls

 Fallacy: It costs $100 million to design a custom chip
 Pitfall: Performance counters added as an afterthought
 Fallacy: Architects are tackling the right DNN tasks
 Fallacy: For DNN hardware, inferences per second (IPS)
is a fair summary performance metric
 Pitfall: Being ignorant of architecture history when
designing a DSA

