Gpu-Arc
December 5, 2019
GPU Architectures and Programming Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur
Course Organization
Topic                                            Week    Hours
Review of basic COA w.r.t. performance           1       2
Intro to GPU architectures                       2       3
Intro to CUDA programming                        3       2
Multi-dimensional data and synchronization       4       2
Warp Scheduling and Divergence                   5       2
Memory Access Coalescing                         6       2
Optimizing Reduction Kernels                     7       3
Kernel Fusion, Thread and Block Coarsening       8       3
OpenCL - runtime system                          9       3
OpenCL - heterogeneous computing                 10      2
Efficient Neural Network Training/Inferencing    11-12   6
Handling Data Level Parallelism
Data parallel algorithms handle multiple data points in each basic step (single thread of control); see the CUDA sketch after the list below.
- Vector Processors : early style of data-parallel compute
- Single Instruction Multiple Data (SIMD) in x86 : MMX (Multimedia Extensions), AVX (Advanced Vector Extensions)
- GPUs : have their own distinguishing characteristics
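As an illustration of the GPU flavor of this pattern, here is a minimal CUDA sketch (illustrative names, not from the slides): every thread executes the same code, a single thread of control, on its own data point.

// Minimal sketch: data-parallel vector add in CUDA (illustrative names).
// All threads run the same instruction stream; each handles one element.
__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    if (i < n)                                      // ignore out-of-range threads
        c[i] = a[i] + b[i];
}
// Launch with one thread per element, e.g.:
// vec_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);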
Vector Processors
- Vector registers : each vector register is a fixed-length bank holding a single vector
- Functional units are also vectorized
- Original scalar registers are also present
- VMIPS has eight vector registers, and each vector register holds 64 elements, each 64 bits wide (8 × 64 × 64 bits = 4 KB of vector-register state)
Vector Processors : Consider a simple Y = a × X + Y operation (DAXPY)

(a) MIPS code:

      L.D     F0,a          ;load scalar a
      DADDIU  R4,Rx,#512    ;last address to load
Loop: L.D     F2,0(Rx)      ;load X[i]
      MUL.D   F2,F2,F0      ;a × X[i]
      L.D     F4,0(Ry)      ;load Y[i]
      ADD.D   F4,F4,F2      ;a × X[i] + Y[i]
      S.D     F4,0(Ry)      ;store into Y[i]
      DADDIU  Rx,Rx,#8      ;increment index to X
      DADDIU  Ry,Ry,#8      ;increment index to Y
      DSUBU   R20,R4,Rx     ;compute bound
      BNEZ    R20,Loop      ;check if done

(b) VMIPS code:

      L.D     F0,a          ;load scalar a
      LV      V1,Rx         ;load vector X
      MULVS.D V2,V1,F0      ;vector-scalar multiply
      LV      V3,Ry         ;load vector Y
      ADDVV.D V4,V2,V3      ;add
      SV      V4,Ry         ;store the result

Figure: Assuming the data size < vector storage (Ref: Computer Architecture: A Quantitative Approach (Hennessy & Patterson))

Here is the VMIPS code for DAXPY. The most dramatic difference is that the vector processor greatly reduces the dynamic instruction bandwidth, executing only 6 instructions versus almost 600 for MIPS. This reduction occurs because the vector operations work on 64 elements at a time, so the overhead instructions that constitute nearly half the loop on MIPS are not present in the VMIPS code. When the compiler produces vector instructions for such a sequence and the resulting code spends much of its time running in vector mode, the code is said to be vectorized or vectorizable. Loops can be vectorized when they do not have dependences between iterations of a loop, which are called loop-carried dependences.

Another important difference between MIPS and VMIPS is the frequency of pipeline interlocks. In the straightforward MIPS code, every ADD.D must wait for a MUL.D, and every S.D must wait for the ADD.D. On the vector processor, each vector instruction will only stall for the first element in each vector; subsequent elements flow smoothly down the pipeline.
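For comparison with the course's GPU focus, a minimal CUDA sketch of the same DAXPY (illustrative code, not from the textbook): as in VMIPS, the per-element loop and its overhead instructions disappear, with each element handled by its own thread.

// Minimal CUDA sketch of Y = a × X + Y (DAXPY); names are illustrative.
// Each thread plays the role of one vector lane: no loop increment,
// bound computation, or branch appears in the kernel body itself.
__global__ void daxpy(int n, double a, const double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
    if (i < n)                                      // guard the vector tail
        y[i] = a * x[i] + y[i];
}
// Host-side launch, e.g.: daxpy<<<(n + 127) / 128, 128>>>(n, a, d_x, d_y);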
Vector Processors
Figure: two vector operands, elements A[1] ... A[9] and B[1] ... B[9], processed element-wise
GPUs
Ideas from parallel instruction handling in vector architectures, ILP techniques, etc. were borrowed to accelerate graphics processing
Figure: GPU systems (GeForce 8800) - Hennessy, Patterson (reproduced). Host CPU and system memory connect over a bridge to the GPU; a host interface feeds 16 SMs (each a group of SPs with SFUs), texture units with Tex L1 caches, and an interconnection network.
GPU Architecture (Tesla)
- The earlier figure depicts a GPU with an array of 128 streaming/scalar processor (SP) cores, organized as 16 multithreaded streaming multiprocessors (SMs)
- Each SM has 8 SPs (16 SMs × 8 SPs = 128 SPs)
- 2 SMs together are arranged as independent processing units called texture/processor clusters (TPCs)
Early GPUs
Early GPUs accelerated the logical graphics pipeline
Figure: the logical graphics pipeline, ending in the Pixel Shader and Raster Operations / Output Merger stages
Shader Programs
The graphics application sends the GPU a sequence of vertices grouped into geometric primitives: points, lines, triangles, and polygons.
- The input assembler collects vertices and primitives
- Vertex shader programs map the position of vertices onto the screen, altering their position, color, or orientation
- Geometry shader programs operate on geometric primitives (such as lines and triangles) defined by multiple vertices, changing them or generating additional primitives
Shader Programs
Shader programs are usually written in a dataflow style; they model how light interacts with different materials and render complex lighting and shadows.
- The setup and rasterizer unit generates pixel fragments (potential contributions to pixels) that are covered by a geometric primitive
- The pixel shader program fills the interior of primitives, including interpolating per-fragment parameters, texturing, and coloring
- The raster operations processing (or output merger) stage performs depth testing, stencil testing, color blending operations, etc.
Ref : "Computer Organization and Architecture" - Hennessy, Patterson (Appendix A
on GPUs) TECHNO
GPUs : massive multi-threading
Design goals
- Cover the latency of memory loads and texture fetches from DRAM
- Support fine-grained parallel graphics shader (and general parallel compute) programming models
- Virtualize the physical processors as threads and thread blocks to provide transparent scalability
- Simplify the parallel programming model to writing a serial program for one thread
First generation GPUs
Trade-off
Tesla architecture
We come back to the GeForce 8800 GPU with 128 SPs organized as 16 SMs
- External DRAM control and fixed-function raster operation processors (ROPs) perform color and depth frame buffer operations directly on memory
- The interconnection network carries computed pixel-fragment colors and depth values from the SPs to the ROPs
- The network also routes texture memory read requests from the SPs to DRAM, and the data read from DRAM back to the SPs through a level-2 cache
Graphics in Tesla
GPGPU
GPGPU
- Each SP core contains a scalar multiply-add (MAD) unit, giving the SM eight MAD units
- The SM uses its two SFU units for transcendental functions
- Each SFU also contains four floating-point multipliers
- In total, an SM has eight MAD units and eight (2 × 4) SFU floating-point multipliers
SIMT
SIMT
Warp execution
- In each operation cycle, the SM warp scheduler selects one of the 24 resident warps
- An issued warp executes over four processor cycles (32 threads over 8 SP cores)
- The SP cores and SFU units execute instructions independently
ISA
Register File
Fermi GTX 480 GPU
Has
- 16 SMs, 512 CUDA cores in total
- Each SM has 32 SPs and 32,768 32-bit registers, divided logically across the executing threads
- Each SIMD thread (warp) is limited to no more than 64 registers
- A warp thus has access to 64 × 32 registers, each 32 bits wide
- Alternatively, considering double-precision floating-point operands, a warp has access to 32 vector registers of 32 elements, each of which is 64 bits wide
This register budget limits how many warps can be resident at once; a sketch of capping register usage follows.
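Since the budget bounds warp residency (at the 64-register cap, 32,768 / 64 = 512 threads fit per SM), CUDA lets the programmer cap register usage per kernel. A minimal sketch, with assumed kernel name and bounds:

// Sketch: bounding per-thread register use so more warps stay resident.
// __launch_bounds__(maxThreadsPerBlock, minBlocksPerSM) is a compiler hint.
__global__ void __launch_bounds__(256, 4)
copy_kernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}
// A whole file can also be capped at compile time: nvcc --maxrregcount=32 file.cu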
Fermi Streaming Multiprocessor (SM)
Figure: Fermi Streaming Multiprocessor (SM) - an instruction cache, CUDA cores arranged as 16 SIMD lanes, 16 load/store (LD/ST) units issuing addresses to cache or DRAM, 4 SFUs, and an interconnect network; each CUDA core contains a dispatch port, an operand collector, FP and integer units, and a result queue
- Each SM has 4 SFUs; each SP has one FP unit and one integer ALU
Memory Hierarchy
Fermi Memory Hierarchy
- Shared memory enables threads to cooperate, facilitates reuse of on-chip data, and reduces off-chip traffic
- Each SM has 64 KB of on-chip memory that can be configured as 48 KB of shared memory with 16 KB of L1 cache, or as 16 KB of shared memory with 48 KB of L1 cache (the host call sketched below selects this split)
- Source : NVIDIA Whitepaper on Fermi
Figure: Fermi memory hierarchy - Thread, Shared Memory / L1 Cache, L2 Cache, DRAM
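A minimal sketch using the CUDA runtime call for this (kernel name and body are illustrative):

#include <cuda_runtime.h>

// This kernel leans on shared memory, so it prefers the 48-KB shared split.
__global__ void tile_kernel(float *data) {
    __shared__ float tile[1024];   // staging buffer in shared memory
    int i = threadIdx.x;
    tile[i] = data[i];             // stage in shared memory
    __syncthreads();
    data[i] = 2.0f * tile[i];
}

int main() {
    // Request 48 KB shared / 16 KB L1 (cudaFuncCachePreferL1 is the opposite).
    cudaFuncSetCacheConfig(tile_kernel, cudaFuncCachePreferShared);
    // ... allocate d_data, then launch: tile_kernel<<<1, 1024>>>(d_data);
    return 0;
}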
Fermi Memory Hierarchy
Figure: multiple DRAM partitions in the Fermi memory hierarchy
GPU ISA
PTX instructions
- Format : opcode.type d, a, b, c;
- a, b, c are source operands; d is the destination operand
- Source operands are 32-bit or 64-bit registers or a constant value
- All instructions can be predicated by 1-bit predicate registers, which can be set by a set-predicate instruction (setp); see the sketch below
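For instance, a guarded CUDA statement and the kind of setp-plus-predication it can compile to (a sketch; the PTX shown is indicative output, not taken from the slides):

// CUDA source: the guard below can be if-converted into predication.
__global__ void clamp_negatives(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && x[i] < 0.0f)
        x[i] = 0.0f;   // executed only where the predicate holds
}
// Indicative PTX in the opcode.type d, a, b, c format described above:
//   setp.lt.f32   %p1, %f1, 0f00000000;   // set predicate: x[i] < 0.0f
//   @%p1 st.global.f32 [%rd4], %f2;       // store guarded by predicate %p1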
GPUs becoming ubiquitous
GPUs have started finding wide usage in several domains where workloads have become intensive
- Mobile GPUs : ARM Mali, Adreno GPUs (Qualcomm) - accelerate graphics as well as compute tasks
- NVIDIA in the embedded space : Jetson TX / Nano / AGX Xavier ⇒ multi-core ARM CPU + 128-512 core GPU targeting AI / Deep Learning tasks
- NVIDIA Drive : for implementing autonomous car and ADAS functionality powered by deep learning (Tesla cars!)
GPUs as mobile workload accelerators
Figure: ARM-based mobile SoC
Integrated GPUs in Desktop Systems
With the release of AMD's Fusion and Intel's Ivy Bridge (i3, i5, i7) architectures in 2011-12, the trend of fused CPU-GPU architectures started
- CPU and GPU access the same physical memory, so that zero-copy transfers can be employed
- Zero-copy transfers ensure coherency; they translate pointers to memory buffers for the common CPU and GPU address space but do not actually transfer data (a setup sketch follows this list)
- Bad effect - CPU and GPU compete for the memory bandwidth of the shared physical memory
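A minimal sketch of setting up zero-copy in CUDA on such systems (buffer size and names are illustrative): the device receives a pointer alias into host memory, so no copy is performed.

#include <cuda_runtime.h>

int main() {
    float *h_buf = nullptr, *d_view = nullptr;
    cudaSetDeviceFlags(cudaDeviceMapHost);                  // allow mapping host memory
    cudaHostAlloc((void **)&h_buf, 1024 * sizeof(float),
                  cudaHostAllocMapped);                     // pinned + GPU-mappable
    cudaHostGetDevicePointer((void **)&d_view, h_buf, 0);   // GPU-visible alias
    // kernel<<<grid, block>>>(d_view);  // GPU reads/writes h_buf directly
    cudaFreeHost(h_buf);
    return 0;
}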
Integrated GPUs in Desktop Systems
- In more recent architectures, Intel Broadwell and beyond, CPU and GPU were further integrated
- They access the shared last-level cache (LLC)
- This helps the CPU and GPU execute computational kernels on the same data in parallel, collaboratively (the LLC enables cache coherence between CPU and GPU)
Figure: Fused CPU-GPU with shared LLC - CPU cores (each with L1-I$, L1-D$, and L2$) and the GPU (with its own cache and L2$) connected through a shared last-level cache and system bus to main memory (DDR)
Ref : "Co-Scheduling on Fused CPU-GPU ..."
Jetson Series from NVIDIA
- The TK1 SoC incorporates a quad-core 2.32 GHz 32-bit ARM machine and an integrated Kepler GK20a GPU
- The CPUs share a 2-MB L2 cache
- The GPU has 192 cores and a 128-KB L2 cache
- The CPU also has 'little' ARM cores (not shown) - low power, low performance
Figure: Jetson TK1 - CPU 0 ... CPU 3 (each with 32-KB L1-I and 32-KB L1-D caches) sharing a 2-MB L2; a 192-core GPU with its own L2; a memory controller; and DRAM Bank 0 ... Bank 31, 64 MB each
NVIDIA Drive series of systems
PX Platform