
This document outlines the course organization for a course on GPU architectures and programming. The course covers topics like GPU architectures, CUDA and OpenCL programming, optimization techniques like memory access coalescing and kernel fusion. It discusses handling data parallelism on vector processors, SIMD instructions, and GPUs. Vector processors use vector registers to hold multiple data elements and perform the same operation on these elements simultaneously using vectorized functional units.


GPU Architectures and Programming

Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur

December 5, 2019

Course Organization
Topic Week Hours
Review of basic COA w.r.t. performance 1 2
Intro to GPU architectures 2 3
Intro to CUDA programming 3 2
Multi-dimensional data and synchronization 4 2
Warp Scheduling and Divergence 5 2
Memory Access Coalescing 6 2
Optimizing Reduction Kernels 7 3
Kernel Fusion, Thread and Block Coarsening 8 3
OpenCL - runtime system 9 3
OpenCL - heterogeneous computing 10 2
Efficient Neural Network Training/Inferencing 11-12 6
Handling Data Level Parallelism

Data-parallel algorithms handle multiple data points in each basic step (single thread of control)
- Vector processors : an early style of data-parallel compute
- Single Instruction Multiple Data (SIMD) in x86 : MMX (Multimedia Extensions), AVX (Advanced Vector Extensions)
- GPUs : have their own distinguishing characteristics

Vector Processors

- Vector registers : each vector register is a fixed-length bank holding a single vector
- Functional units are also vectorized
- The original scalar registers are also present
- VMIPS has eight vector registers, and each vector register holds 64 elements, each 64 bits wide

Vector Processors : Consider a simple Y = a ∗ X + Y operation (DAXPY)

(a) MIPS code:

      L.D     F0,a        ;load scalar a
      DADDIU  R4,Rx,#512  ;last address to load
Loop: L.D     F2,0(Rx)    ;load X[i]
      MUL.D   F2,F2,F0    ;a × X[i]
      L.D     F4,0(Ry)    ;load Y[i]
      ADD.D   F4,F4,F2    ;a × X[i] + Y[i]
      S.D     F4,0(Ry)    ;store into Y[i]
      DADDIU  Rx,Rx,#8    ;increment index to X
      DADDIU  Ry,Ry,#8    ;increment index to Y
      DSUBU   R20,R4,Rx   ;compute bound
      BNEZ    R20,Loop    ;check if done

(b) Here is the VMIPS code for DAXPY:

      L.D     F0,a        ;load scalar a
      LV      V1,Rx       ;load vector X
      MULVS.D V2,V1,F0    ;vector-scalar multiply
      LV      V3,Ry       ;load vector Y
      ADDVV.D V4,V2,V3    ;add
      SV      V4,Ry       ;store the result

Figure: Assuming the data size < vector storage (Ref: CoA: A Quantitative Approach (Hennessy & Patterson))

The most dramatic difference is that the vector processor greatly reduces the dynamic instruction bandwidth, executing only 6 instructions versus almost 600 for MIPS. This reduction occurs because the vector operations work on 64 elements, and the overhead instructions that constitute nearly half the loop on MIPS are not present in the VMIPS code. When the compiler produces vector instructions for such a sequence and the resulting code spends much of its time running in vector mode, the code is said to be vectorized or vectorizable. Loops can be vectorized when they do not have dependences between iterations of a loop, which are called loop-carried dependences (see Section 4.5 of the reference).

Another important difference between MIPS and VMIPS is the frequency of pipeline interlocks. In the straightforward MIPS code, every ADD.D must wait for a MUL.D, and every S.D must wait for the ADD.D. On the vector processor, each vector instruction will stall only for the first element in each vector, and then subsequent elements flow smoothly down the pipeline.
Vector Processors

- A vector instruction passes a lot of parallel work to the hardware
- The FUs can be : fully parallel, or a combination of parallel and pipelined units
- If the clock rate of a vector processor is halved, doubling the number of lanes will retain the same potential performance (see the note below)
- Work for compilers - loop vectorization, dependency handling
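A quick sanity check on the lanes claim: the peak element throughput of the vector unit is roughly (number of lanes) × (clock rate), so halving the clock while doubling the lanes leaves the product, and hence the potential performance, unchanged.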

Vector Processors

Figure: A single add-pipeline lane versus four lanes. Four add pipelines can complete four additions per cycle; vector elements (A[i], B[i]) are interleaved across the pipelines.

GPUs
Ideas from parallel instruction handling in vector architectures, ILP techniques, etc. were borrowed to accelerate graphics processing.
Figure: GPU system (GeForce 8800), Hennessy & Patterson (reproduced) - the host CPU connects through a bridge to system memory and to the GPU's host interface; the GPU contains an input assembler, clip/setup/raster/zcull and compute/vertex/pixel work-distribution units, an HD video processor, and an array of TPCs, each holding SMs (with I-cache, MT issue, C-cache, SP cores, SFUs, and shared memory) and texture units with Tex L1 caches; an interconnection network links these to ROP/L2 units, DRAM, and the display interface.

GPU Architecture (Tesla)

- The earlier figure depicts a GPU with an array of 128 streaming/scalar processor (SP) cores, organized as 16 multithreaded streaming multiprocessors (SMs)
- Each SM has 8 SPs
- 2 SMs together are arranged as an independent processing unit called a texture/processor cluster (TPC)

Early GPUs
Early GPUs accelerated the logical graphics pipeline

Figure: Graphics logical pipeline - Input Assembler → Vertex Shader → Geometry Shader → Setup and Rasterizer → Pixel Shader → Raster Operations / Output Merger

Shader Programs

A graphics application sends the GPU a sequence of vertices grouped into geometric primitives - points, lines, triangles, and polygons.
- The input assembler collects vertices and primitives.
- Vertex shader programs map the position of vertices onto the screen, altering their position, color, or orientation.
- Geometry shader programs operate on geometric primitives (such as lines and triangles) defined by multiple vertices, changing them or generating additional primitives.

Shader Programs

Shader programs are usually written in a dataflow style; they model how light interacts with different materials and render complex lighting and shadows.
- The setup and rasterizer unit generates pixel fragments (which are potential contributions to pixels) that are covered by a geometric primitive.
- The pixel shader program fills the interior of primitives, including interpolating per-fragment parameters, texturing, and coloring.
- The raster operations processing (or output merger) stage performs depth testing, stencil testing, color blending, etc.
Ref : "Computer Organization and Design" - Patterson, Hennessy (Appendix A on GPUs)
GPUs : massive multi-threading

Design goals
- Cover the latency of memory loads and texture fetches from DRAM
- Support fine-grained parallel graphics shader (and general parallel compute) programming models
- Virtualize the physical processors as threads and thread blocks to provide transparent scalability
- Simplify the parallel programming model to writing a serial program for one thread (see the sketch below)
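A minimal CUDA sketch of this "serial program for one thread" view (the kernel name daxpy_gpu and the launch configuration are illustrative, not from the slides): each thread executes what looks like a single iteration of a serial loop.

      __global__ void daxpy_gpu(int n, double a, const double *x, double *y) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element index
          if (i < n)                                      // surplus threads do nothing
              y[i] = a * x[i] + y[i];
      }

      // Host-side launch: one thread per element, grouped into blocks of 256 threads.
      // daxpy_gpu<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);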

First generation GPUs

- GeForce 256, introduced in 1999
- Contained fixed-function vertex and pixel shaders, programmed with OpenGL and the Microsoft DX7 API
- GeForce 3 - the first programmable vertex processor executing vertex shaders
- Ref for the contents here and in subsequent slides : "NVIDIA Tesla: A Unified Graphics and Computing Architecture" by Erik Lindholm, John Nickolls, Stuart Oberman, John Montrym (NVIDIA), IEEE Micro, Volume 28, Issue 2, March 2008

Trade-off

- Vertex processors were designed for low-latency, high-precision math operations
- Pixel-fragment processors were optimized for high-latency, lower-precision texture filtering; they are typically busier (considering large triangulations)
- If these are fixed-function blocks, it is difficult to select a fixed processor ratio
- Primary design objective for the Tesla architecture - execute vertex and pixel-fragment shader programs on the same unified processor
- Unification helps in 1) dynamic load balancing of varying vertex- and pixel-processing workloads, 2) introducing other shaders
Tesla architecture

We come back to the GeForce 8800 GPU with 128 SPs organized as 16 SMs.
- External DRAM control and fixed-function raster operation processors (ROPs) perform color and depth frame-buffer operations directly on memory
- The interconnection network carries computed pixel-fragment colors and depth values from the SPs to the ROPs
- The network also routes texture memory read requests from the SPs to DRAM, and the data read from DRAM through a level-2 cache back to the SPs

Graphics in Tesla

- The input assembler collects vertex work
- The vertex work distributor distributes vertex work packets to the various TPCs
- The TPCs execute vertex/geometry shader programs
- Output data is written to on-chip buffers
- The buffers then pass their results to the viewport/clip/setup/raster/zcull block
We continue from here to general-purpose processing.

GPGPU

Each TPC has two SMs, and each SM has
- eight streaming/scalar processor (SP) cores,
- two special function units (SFUs),
- a multithreaded instruction fetch and issue unit (MT Issue),
- an instruction cache, a read-only constant cache,
- a 16-Kbyte read/write shared memory.

GPGPU

- Each SP core contains a scalar multiply-add (MAD) unit, giving the SM eight MAD units
- The SM uses its two SFUs for transcendental functions
- Each SFU also contains four floating-point multipliers
- In total, an SM has eight MAD units and eight additional floating-point multipliers (two SFUs × four multipliers each)

SIMT

GPU execution model
- The SIMT architecture is similar to a SIMD design, which applies one instruction to multiple data lanes.
- The difference is that SIMT applies one instruction to multiple independent threads in parallel, not just to multiple data lanes.
- A SIMD instruction controls a vector of multiple data lanes together; a SIMT instruction controls the execution and branching behavior of one thread.

SIMT

- In contrast to SIMD vector architectures, SIMT enables programmers to write thread-level parallel code for independent threads as well as data-parallel code for coordinated threads
- SIMT is essentially a single thread of SIMD instructions (one warp)
- Each SM's multithreaded instruction unit creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps
- Each SM manages a pool of 24 warps, for a total of 768 threads
- Each SM maps warp threads to the SP cores (a sketch of the warp/lane index computation follows)
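A small illustrative CUDA fragment (not from the slides) showing how a thread can derive its warp and lane position, assuming the 32-thread warps described above:

      #include <cstdio>

      __global__ void warp_lane_demo() {
          int tid  = blockIdx.x * blockDim.x + threadIdx.x;  // global thread id
          int lane = threadIdx.x % warpSize;                 // position within the warp (0..31)
          int warp = threadIdx.x / warpSize;                 // warp index within the block
          if (lane == 0)                                     // one report per warp
              printf("block %d, warp %d starts at thread %d\n", blockIdx.x, warp, tid);
      }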
Warp execution

- In each operation cycle, the SM warp scheduler selects one of the 24 warps
- An issued warp executes over four processor cycles
- The SP cores and SFU units execute instructions independently
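The four-cycle figure is consistent with the numbers given earlier: a warp has 32 threads and an SM has 8 SP cores, so issuing one warp instruction across all of its threads takes 32 / 8 = 4 processor cycles.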

ISA

- Support for floating-point, integer, bit, conversion, transcendental, flow control, and memory load/store instructions
- Floating-point and integer operations include add, multiply, multiply-add, minimum, maximum, compare, set predicate, and conversions between integer and floating-point numbers
- Transcendental function instructions include cosine, sine, binary exponential, binary logarithm, reciprocal, and reciprocal square root
- Bitwise operators include shift left, shift right, logic operators, and move
- Control flow includes branch, call, return, trap, and barrier synchronization
Register File

Each SIMD processor (SM)
- has a large vector register file;
- like a vector processor, these registers are divided logically across the SIMD lanes, i.e. the SPs;
- these numbers vary across architecture families.

Fermi GTX 480 GPU

Has
- 16 SMs, 512 CUDA cores in total
- Each SM has 32 SPs and 32,768 32-bit registers, divided logically across the executing threads
- Each SIMD thread (warp) is limited to no more than 64 registers
- A warp thus has access to 64 × 32 registers, each 32 bits wide,
- or, alternatively, considering double-precision floating-point operands, to 32 vector registers of 32 elements, each of which is 64 bits wide.
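As a quick check with these numbers: at the 64-register limit a warp occupies 64 × 32 = 2,048 registers, so at most 32,768 / 2,048 = 16 such warps can keep their registers resident on one SM; kernels that use fewer registers per thread allow more resident warps.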
Fermi Streaming Multiprocessor (SM)
- Each SM has 16 load/store (LD/ST) units (load/store data at each address to cache or DRAM) - 16 SIMD lanes
- Each lane has 2,048 registers
- Each SM has 4 SFUs; each SP (CUDA core) has one FP unit and one integer ALU
- The ALUs also support Boolean, shift, move, compare, convert, bit-field extract, ...

Figure: Fermi Streaming Multiprocessor (SM) - instruction cache, two warp schedulers with dispatch units, a 32,768 × 32-bit register file, 32 CUDA cores, 16 LD/ST units, 4 SFUs, an interconnect network, 64 KB shared memory / L1 cache, and a uniform cache.
Figure: A single SP (CUDA core) - dispatch port, operand collector, FP unit, INT unit, result queue.

Memory Hierarchy

- Local memory for per-thread, private, temporary data (implemented in external DRAM)
- Shared memory for low-latency access to data shared by threads in the same SM
- Global memory for data shared by all threads of a computing application (implemented in external DRAM); see the sketch below
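A minimal CUDA sketch (illustrative, not from the slides) of the three spaces - per-thread locals, per-block __shared__ memory, and global memory. It reverses each 256-element tile of a global array and assumes the array length is gridDim.x * TILE.

      #define TILE 256

      __global__ void reverse_tiles(const float *in, float *out) {
          __shared__ float tile[TILE];               // shared memory: visible to this block's threads
          int i = blockIdx.x * TILE + threadIdx.x;   // i, j are per-thread local values
          tile[threadIdx.x] = in[i];                 // global -> shared
          __syncthreads();                           // wait until the whole tile is loaded
          int j = blockIdx.x * TILE + (TILE - 1 - threadIdx.x);
          out[j] = tile[threadIdx.x];                // shared -> global, reversed within the tile
      }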

Fermi Memory Hierarchy

- Shared memory enables threads to cooperate, facilitates reuse of on-chip data, and reduces off-chip traffic.
- Each SM has 64 KB of on-chip memory that can be configured as 48 KB of shared memory with 16 KB of L1 cache, or as 16 KB of shared memory with 48 KB of L1 cache (a configuration call is sketched below).
- Source : NVIDIA whitepaper on Fermi

Figure: Per-thread view of the Fermi hierarchy - shared memory / L1 cache, then L2 cache, then DRAM.
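The split can be requested per kernel through the CUDA runtime; a minimal sketch under the assumption that my_kernel is a shared-memory-heavy kernel (the name and empty body are placeholders):

      #include <cuda_runtime.h>

      __global__ void my_kernel() {}  // placeholder kernel

      void configure_cache() {
          // Request the 48 KB shared / 16 KB L1 split for this kernel.
          cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferShared);
          // cudaFuncCachePreferL1 would request 16 KB shared / 48 KB L1 instead.
      }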
Fermi Memory Hierarchy

- L1 (data) cache + shared memory is private to each SM, along with the read-only texture and constant caches
- L2 is unified across all SMs; there are 6 high-bandwidth DRAM channels
- Compared to a CPU, GPUs have a larger register file and smaller L1/L2 caches with higher bandwidth
- Ref : "The Architecture and Evolution of CPU-GPU Systems for General Purpose Computing" - Manish Arora

Figure: Fermi memory system - SMs connected through an interconnect to multiple L2 cache slices, memory controllers, and DRAM channels.

GPU ISA

- The instruction set targeted by the NVIDIA compilers is an abstraction of the hardware instruction set
- PTX (Parallel Thread Execution) provides an instruction set for compilers that remains the same across different generations of GPUs
- PTX code gets translated to target hardware instructions while being loaded onto the GPU

PTX instructions

- Format : opcode.type d, a, b, c;
- a, b, c are source operands; d is the destination operand
- Source operands are 32-bit or 64-bit registers or a constant value
- All instructions can be predicated by 1-bit predicate registers, which can be set by a set-predicate instruction (setp); an illustration follows
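For illustration (not from the original slides): add.f64 d, a, b adds two double-precision source operands into d; setp.lt.f64 p, a, b sets the 1-bit predicate p when a < b; and a predicated instruction written as @p bra TARGET branches only for threads whose p is true.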

GPUs becoming ubiquitous

GPUs have started finding wide usage in several domains where workloads have become intensive.
- Mobile GPUs : ARM Mali, Adreno GPUs (Qualcomm) - accelerate graphics as well as compute tasks
- NVIDIA in the embedded space : Jetson TX / Nano / AGX Xavier ⇒ multi-core ARM CPU + 128-512 core GPU targeting AI / deep learning tasks
- NVIDIA Drive : for implementing autonomous-car and ADAS functionality powered by deep learning (Tesla cars!)

GPUs as mobile workload accelerators

- Objective : maximize performance and reduce power consumption
- Developers need to map the workload across the whole CPU + GPU system
- RenderScript for the Android SDK and OpenCL provide language support for data-parallel computation on mobile devices

Figure: Typical architecture of an ARM-based mobile SoC - CPU big cores, CPU small cores, and Mali GPU cores, each cluster with its own L2 cache, connected through an MMU.

Integrated GPUs in Desktop Systems

With the release of AMD's Fusion and Intel's Ivy Bridge architectures (i3, i5, i7) in 2011, the trend of fused CPU-GPU architectures started.
- CPU and GPU access the same physical memory, so zero-copy transfers can be employed
- Zero-copy transfers ensure coherency; they translate pointers to memory buffers for the common CPU and GPU address space, but do not actually transfer data (see the sketch below)
- Downside - the CPU and GPU compete for the memory bandwidth of the shared physical memory
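A rough CUDA sketch of the zero-copy idea under the assumptions above (buffer names are placeholders; on integrated parts the mapped device pointer refers to the same physical memory as the host pointer):

      #include <cuda_runtime.h>

      void make_zero_copy_buffer(size_t n, float **h_buf, float **d_buf) {
          // Some setups require cudaSetDeviceFlags(cudaDeviceMapHost) before any allocation.
          cudaHostAlloc((void **)h_buf, n * sizeof(float), cudaHostAllocMapped);  // mapped host memory
          cudaHostGetDevicePointer((void **)d_buf, *h_buf, 0);  // device view of the same buffer, no copy
      }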

Integrated GPUs in Desktop Systems
- In more recent architectures, Intel Broadwell and beyond, CPU and GPU were further integrated
- They access a shared last-level cache (LLC)
- This helps the CPU and GPU execute computational kernels on the same data in parallel, collaboratively (the LLC enables cache coherence between CPU and GPU)
- Ref : "Co-Scheduling on Fused CPU-GPU Architectures with Shared Last Level Caches" - Henkel et al.

Figure: Fused CPU-GPU with shared LLC - CPU cores (each with L1-I/L1-D and L2 caches) and the GPU (with its own cache and L2) share the last-level cache, which connects over the system bus to main memory (DDR).

Jetson Series from NVIDIA

- The TK1 SoC incorporates a quad-core 2.32 GHz 32-bit ARM machine and an integrated Kepler GK20a GPU
- The CPUs share a 2-MB L2 cache
- The GPU has 192 cores and a 128-KB L2 cache
- The CPU also has 'little' ARM cores (not shown) - low power, low performance

Figure: Jetson TK1 - four CPU cores (32-KB L1-I and L1-D caches each) sharing a 2-MB L2, the 192-core GPU with its 128-KB L2, a memory controller, and 32 DRAM banks of 64 MB each.

NVIDIA Drive series of systems

- The NVIDIA Drive PX 2 is based on one or two Tegra SoCs, where each SoC contains 2 Denver cores, 4 ARM A57 cores, and a GPU from the Pascal generation
- Useful for implementing high-throughput, real-time neural-net processing - self-driving / drive-assist systems

Figure: NVIDIA Drive PX platform (source: Wikipedia)

