
CS2115 Computer Organization

2024/2025 Sem A

Chapter 8: Multicore & GPU

Dr. Nan Guan


Department of Computer Science
City University of Hong Kong

1
Sequential Execution
Fetch → Decode → Execute → Memory → Writeback
[Figure: sequential datapath – PC, instruction register (IR), decoder, immediate/operand selection, register file, ALU, and main memory holding instructions and data]
2
Processing Models

• single instruction and single data (SISD)
• single instruction and multiple data (SIMD)
• multiple instruction and single data (MISD)
• multiple instruction and multiple data (MIMD)
3
Multicore Processors
• A single chip with two or more separate processing units (cores)
• Each of which reads and executes program instructions
• Increasing overall speed for programs that support multithreading
or other parallel computing techniques

Which processing model?
(among the four on the previous page)

4
Trend of Modern Processors
Moore’s Law: integrated circuit resources double every 18–24 months

5
Amdahl’s Law
• In a program with parallel processing, the few instructions that have to be performed in sequence place a limit on the program speedup, so adding more processors may not make the program run faster.

6
Amdahl’s Law
Speedup = ExeTime_old / ExeTime_new

• ExeTime_old: execution time of the entire task without the enhancement
• ExeTime_new: execution time of the entire task with the enhancement

ExeTime_new = ExeTime_old × [(1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

Speedup = ExeTime_old / ExeTime_new = 1 / [(1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]
7
Amdahl’s Law
• When executing software on a multicore processor, the ideal speedup equals the number of processor cores (called cores for short)
• In reality, only part of the software can be parallelized
• Example: suppose 80% of a software program can be
parallelized, what is the best speedup if it is executed on 100
processor cores?
Speedup = ExeTime_old / ExeTime_new = 1 / [(1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]
        = 1 / [(1 − 0.8) + 0.8/100] = 1/0.208 = 4.808
8
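The same calculation can be sketched in a few lines of host code; this helper (the file name and function name are my own, not from the slides) may also be handy for the exercises at the end of the chapter:

    // amdahl.cu (hypothetical) – host-only sketch of Amdahl's Law.
    // speedup = 1 / ((1 - f) + f / s), where f is the fraction that benefits
    // from the enhancement and s is the speedup of that fraction (e.g., core count).
    #include <stdio.h>

    double amdahl_speedup(double fraction_enhanced, double speedup_enhanced) {
        return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced);
    }

    int main(void) {
        // The example above: 80% parallelizable, 100 cores.
        printf("speedup = %.3f\n", amdahl_speedup(0.8, 100.0));  // prints 4.808
        return 0;
    }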
GPU (Graphics Processing Unit)

9
Graphics Applications
• The graphics pipeline has a lot of repetitive instructions
– Draw dots
– Draw triangles
– Draw lines
– Rasterize an image (convert from vector form to a pixel bitmap)
– Draw surfaces
– Determine which object is shown on each pixel
– Compute the lighting of each pixel
– Repeat

– All of these are lots of additions and multiplications over tons of independent data, in parallel!

10
GPU in Modern Systems
[Figure: NVIDIA GPU software stack]
• Domain-specific: cuDNN, nvGRAPH, TensorRT, NCCL
• Visual processing: NVIDIA NPP, DeepStream SDK, NVIDIA CODEC SDK, IndeX framework
• Linear algebra: cuSPARSE, cuBLAS, cuRAND, cuFFT
• Math algorithms: CUDA Math Library, AmgX, cuSOLVER, Thrust library

11
GPU: SIMD
• Idea: use one instruction decoder to drive operations on multiple data elements
• Basically, all of these data elements undergo the same operation
• Saves the logic required to fetch/decode tons of instructions

• SIMD: Single-instruction multiple-data

12
GPU Architecture
[Figure: an SM containing a 4×4 array of SPs and a shared memory; the SM connects to global memory (on device), which connects to the host (CPU)]
• SP: Scalar Processor (‘CUDA core’)
– Executes one ‘thread’
• SM: Streaming Multiprocessor
– Multiple SPs (16, 32, 48 or more)
– Fast local ‘shared memory’, shared among the SPs
– e.g., 16 KB or 64 KB
• In this lecture, we use NVIDIA GPUs as the example
13
GPU Architecture
• For example, GTX 480:
– 14 SMs × 32 SPs = 448 cores on the GPU
– GDDR global memory (on device): 512 MB – 6 GB
[Figure: the same SM / shared memory / global memory / host diagram as on the previous slide]
14
GPU Programming Models
• CUDA (Compute Unified Device Architecture): a parallel GPU programming API created by NVIDIA
• A hardware and software architecture for issuing and managing computations on the GPU
• API and libraries for the C/C++/Fortran languages

• Other models (not covered in this course):


– OpenGL – an open standard for GPU programming
– DirectX – a series of Microsoft multimedia programming interfaces

15
CUDA Structure
• Kernel: a function executed on the GPU
• Thread: the minimum unit managed by the CUDA runtime
– Threads execute the same copy of the code
• Warp: a scheduling unit of up to 32 threads
• Block: a user-defined group of threads (usually a multiple of 32)
• Grid: a group of one or more blocks
– A grid is created for each CUDA kernel
[Figure: the host launches Kernel 1 as Grid 1 and Kernel 2 as Grid 2 on the device; Grid 1 contains a 3×2 array of blocks, and Block (1,1) is expanded to show its 5×3 array of threads]

16
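A minimal launch sketch matching the dimensions in the figure (the kernel name mykernel and its empty body are placeholders, not taken from the slides):

    __global__ void mykernel() { /* work of one thread goes here */ }

    int main(void) {
        dim3 grid(3, 2);               // 3 × 2 blocks, as in Grid 1 of the figure
        dim3 block(5, 3);              // 5 × 3 threads per block
        mykernel<<<grid, block>>>();   // a grid is created for this kernel
        cudaDeviceSynchronize();       // wait for the kernel to finish
        return 0;
    }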
CUDA Structure
• One block executes on one SM
– All of its threads share the SM’s shared memory
– Blocks execute on SMs in parallel, independently of each other
– E.g., 128, 192, or 256 threads in a block
• A thread executes on one SP at a time
– But may execute on different SPs at different times
• Warp: a scheduling unit of up to 32 threads
– Each block is physically executed as multiple warps
[Figure: the same grid/block/thread diagram as on the previous slide]

17
CUDA Structure
• Each block has a unique ID
– blockIdx: identifier of a block (1D, 2D or 3D)
– E.g., (blockIdx.x, blockIdx.y, blockIdx.z)
• blockDim: the size of the block
• Each thread has a unique ID within its block
– threadIdx: identifier of a thread (1D, 2D or 3D)
– E.g., (threadIdx.x, threadIdx.y, threadIdx.z)
[Figure: the same grid/block/thread diagram as on the previous slide]

18
CUDA Structure
• Built-in variables, in this example (see the figure, and the sketch after this slide):
– gridDim.x = 3, gridDim.y = 2
– blockDim.x = 5, blockDim.y = 3
– blockIdx.x = 1, blockIdx.y = 1
– threadIdx.x = 2, threadIdx.y = 1
[Figure: the same grid/block/thread diagram as on the previous slides]

19
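Inside a kernel, these built-in variables are usually combined into global coordinates for the thread; a small sketch using the example values above (the variable names x and y are illustrative):

    // Global (x, y) position of the current thread within the whole grid.
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // 1 * 5 + 2 = 7
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // 1 * 3 + 1 = 4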
CUDA: Programming
• Declarations on functions: __host__, __global__, __device__
– __host__ indicates host functions, which execute on the CPU
– __global__ marks a function as a kernel, which executes on the GPU
– __device__ indicates device functions, which can only be called from other device or global functions; they cannot be called from host code

• Launch a kernel: maps the thread program onto the device

– kernel<<<grid_size, block_size>>>(<args>)

20
CUDA: Programming
mycode.cu
    int main_data;                   // Host only
    __shared__ int sdata;            // Device only

    main() { }                       // Host only
    __host__ hfunc () {              // Host only
        int hdata;
        gfunc<<<g,b,m>>>();          // Interface: host code launches the kernel
    }                                // (g: grid size, b: block size, m: shared memory bytes)
    __global__ gfunc() {             // Device: kernel, launched from host code
        int gdata;
        dfunc();
    }
    __device__ dfunc() {             // Device only
        int ddata;
    }

The host-only parts (main_data, main, __host__ hfunc) are compiled by the CPU compiler; the device parts (__shared__ sdata, __global__ gfunc, __device__ dfunc) are compiled by the GPU compiler; the kernel launch gfunc<<<g,b,m>>>() is the interface between the two.

21
Difference between CPU and GPU Programs

[Figure: side-by-side listings of a CPU program and the corresponding GPU program; a sketch of such a pair follows this slide]

22
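The slide’s two listings are not reproduced in this text; a typical contrast, using vector addition as an assumed example (the function names and launch configuration are my own):

    // CPU program: one thread loops over all n elements.
    void add_cpu(const float *a, const float *b, float *c, int n) {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    // GPU program: the loop disappears; each thread handles one element.
    __global__ void add_gpu(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                         // guard against surplus threads
            c[i] = a[i] + b[i];
    }
    // Host-side launch: add_gpu<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);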
Example 1
• Task to do
 Scan the elements of an array of numbers (each between 0 and 9)
 Count how many times “6” appears
 The array has 16 elements (SIZE = 16)
 Each block contains 4 threads (BLOCKSIZE = 4)
 So each thread checks 4 elements (SIZE/BLOCKSIZE = 4)
 1 block in the grid

3 6 7 5 3 5 6 2 9 1 2 7 0 9 3 6

threadIdx.x = 0 checks elements 0, 4, 8, 12


threadIdx.x = 1 checks elements 1, 5, 9, 13
threadIdx.x = 2 checks elements 2, 6, 10, 14
threadIdx.x = 3 checks elements 3, 7, 11, 15
23
Example 1
 Launch Kernel in Host Function: compute<<<1, 4>>>(d_in, d_out);

 Kernel Function:
 Each thread scans its subset of the array elements (see the code sketch after this slide)
 Calls a device function to compare each element with 6
 Computes a local (per-thread) result

 Device Function:
 Compares the current element with 6
 Returns 1 if they are the same, else 0

3 6 7 5 3 5 6 2 9 1 2 7 0 9 3 6

threadIdx.x = 0 checks elements 0, 4, 8, 12


threadIdx.x = 1 checks elements 1, 5, 9, 13
threadIdx.x = 2 checks elements 2, 6, 10, 14
threadIdx.x = 3 checks elements 3, 7, 11, 15
24
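A sketch of the kernel and device function described on this slide (the device-function name compare_6 and the use of per-thread partial counts in d_out are assumptions; the slides do not show the actual code):

    #define SIZE 16
    #define BLOCKSIZE 4

    // Device function: returns 1 if the element equals 6, else 0.
    __device__ int compare_6(int v) {
        return (v == 6) ? 1 : 0;
    }

    // Kernel: thread t checks elements t, t+4, t+8, t+12 and stores its local count.
    __global__ void compute(const int *d_in, int *d_out) {
        int count = 0;
        for (int i = threadIdx.x; i < SIZE; i += BLOCKSIZE)
            count += compare_6(d_in[i]);
        d_out[threadIdx.x] = count;   // the host then sums the 4 partial counts
    }

    // Host-side launch, as on the slide above: compute<<<1, 4>>>(d_in, d_out);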
Example 2
• Task to do
 Element-wise addition of two matrices.
 Both matrices have N*N elements.
 Each thread does addition for one pair of elements
 Each block contains 16*16 threads.
 The grid has (N/16) * (N/16) blocks.
[Figure: an N×N matrix tiled into 16×16 blocks; element (i, j) is handled by the thread with i = blockIdx.y * blockDim.y + threadIdx.y and j = blockIdx.x * blockDim.x + threadIdx.x; a kernel sketch follows this slide]

25
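A sketch of the kernel implied by these indices (the kernel name matAdd and the flat row-major storage of the matrices are assumptions):

    // Element-wise addition of two N × N matrices, one thread per element.
    __global__ void matAdd(const float *A, const float *B, float *C, int N) {
        int j = blockIdx.x * blockDim.x + threadIdx.x;   // column index
        int i = blockIdx.y * blockDim.y + threadIdx.y;   // row index
        if (i < N && j < N)
            C[i * N + j] = A[i * N + j] + B[i * N + j];
    }

    // Host-side launch: 16 × 16 threads per block, (N/16) × (N/16) blocks in the grid.
    // dim3 block(16, 16);
    // dim3 grid(N / 16, N / 16);
    // matAdd<<<grid, block>>>(d_A, d_B, d_C, N);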
GPU Tradeoffs
• Very high throughput

• Low single thread performance


– No complex pipeline
– No branch predictor

• Your program’s performance depends on the application’s parallelism


– If you can run many things in parallel → very fast
– Otherwise → slow

26
Interconnection
• For transferring information between
nodes or for broadcasting information to
all nodes
• Suitability of a network is judged by cost, bandwidth, and effective throughput
– Bandwidth is the capacity in bits per second
– Effective throughput is the actual rate achieved, which is lower because control information must also be transferred
Bus and Ring
• A bus is a set of lines providing a shared path
– Requires arbitration for one access at a time
– Simple bus is held until response is ready, but split-transaction bus can
overlap requests

• A ring provides point-to-point connections


– Bi-directional ring halves average latency
– Hierarchy of rings reduces latency for transfers within a lower-level
ring or between rings
Crossbar
• A crossbar provides a direct link
between any pair of units
connected to the network

• Can be implemented with a collection of switches, allowing simultaneous transfers to different destinations

29
Shared Memory Multiprocessors
• All processors have direct (hardware) access to all the main
memory, i.e., share the same address space
– Multicore processors, GPU, …

• Uniform Memory Access (UMA)

• Non-Uniform Memory Access (NUMA)

30
UMA vs NUMA
Uniform Memory Access (UMA) vs. Non-Uniform Memory Access (NUMA):
• Access time: UMA – memory access time is balanced (equal); NUMA – memory access time is not equal
• Controllers: UMA – a single memory controller is used; NUMA – multiple memory controllers are used
• Speed: UMA is usually slower than NUMA
• Bandwidth: UMA has limited bandwidth; NUMA has higher bandwidth than UMA

31
Shared Data
• Proc 0 writes to an address, followed by Proc 1 reading
    Proc 0          Proc 1
    Mem[A] = 1      …
                    print Mem[A]

• Mem[A] is cached (at either end)

32
Cache Coherence
• Basic question: If multiple processors cache the same block,
how do they ensure they all see a consistent state?
[Figure: P1 and P2, each with a private cache, connected through an interconnection network to main memory; location x in memory holds 1000, and neither cache holds a copy yet]
33
Cache Coherence
• Basic question: If multiple processors cache the same block,
how do they ensure they all see a consistent state?
[Figure: P2 executes lw r2, x and caches the value 1000; location x in main memory holds 1000]
34
Cache Coherence
• Basic question: If multiple processors cache the same block,
how do they ensure they all see a consistent state?
[Figure: P1 also executes lw r2, x; both caches now hold 1000, and location x in main memory holds 1000]
35
Cache Coherence
• Basic question: If multiple processors cache the same block,
how do they ensure they all see a consistent state?
[Figure: P1 executes add r1, r2, r4 and sw r1, x; P1’s cache now holds 2000 for x, while P2’s cache and main memory still hold 1000]
36
Cache Coherence
• Basic question: If multiple processors cache the same block,
how do they ensure they all see a consistent state?
[Figure: P2 then executes lw r5, x; it should NOT load the stale value 1000 (P1’s cache holds 2000, main memory still holds 1000)]
37
Cache Coherence
• Basic idea:
– A processor/cache broadcasts its write/update to a memory location to all other processors
– Another cache that holds that location either updates or invalidates its local copy
[Figure: P1’s cache holds 2000 for x and P2’s cache holds 1000; the update travels over the interconnection network; main memory holds 1000]

38
Snoopy Coherence Scheme
• All caches “snoop” all other caches’ read/write requests and keep their cache lines coherent; a cache invalidates its line if another cache changes the value

– Easy to implement if all caches share a common bus


• Each cache broadcasts its read/write operations on the bus
– Good for small-scale multiprocessors
– Full broadcast of all requests is not efficient for large-scale shared-
memory multiprocessors
39
Directory-based Cache Coherence Scheme
• Idea: a (logically) central directory keeps track of where the copies of each cache block reside; caches check the directory to ensure coherence.
– For each memory block, the directory identifies which nodes have copies
– If the block is modified, a request is sent to the module containing the block, which performs a limited broadcast or forwards the request to the owner

40
Directory-based Cache Coherence Scheme
• Requestor: the node that requests a read/write of a memory block.
• Owner: the node that owns the most recent state of the cache block; note that the directory may not always be up to date with the latest data.
• Sharer: one or more nodes that share a copy of the cache block.

41
Directory-based Cache Coherence Scheme
• Directory-based Cache Coherence for NUMA

42
Directory-based Cache Coherence Scheme
• Directory-based Cache Coherence for NUMA
– read miss to clean line

43
Directory-based Cache Coherence Scheme
• Directory-based Cache Coherence for NUMA
– read miss to clean line

1. Read miss message sent to “home node”

44
Directory-based Cache Coherence Scheme
• Directory-based Cache Coherence for NUMA
– read miss to clean line

1. Read miss message sent to “home node”


2. Home directory checks its entry for the line and returns the line of data from memory

45
Directory-based Cache Coherence Scheme
• Directory-based Cache Coherence for NUMA
– read miss to dirty line

46
Directory-based Cache Coherence Scheme
• Directory-based Cache Coherence for NUMA
– read miss to dirty line

1. If the data is dirty, the data must be sourced by another processor (the one with the up-to-date copy)
2. Home node tells the requesting node where to find the data

47
Directory-based Cache Coherence Scheme
• Directory-based Cache Coherence for NUMA
– read miss to dirty line

1. If the data is dirty, the data must be sourced by another processor (the one with the up-to-date copy)
2. Home node tells the requesting node where to find the data
3. Requesting node requests the data from the owner
4. Owner changes the line’s state in its cache to SHARED (read only) and responds to the requesting node

48
Directory-based Cache Coherence Scheme
• Directory-based Cache Coherence for NUMA
– read miss to dirty line

1. If the data is dirty, the data must be sourced by another processor (the one with the up-to-date copy)
2. Home node tells the requesting node where to find the data
3. Requesting node requests the data from the owner
4. Owner changes the line’s state in its cache to SHARED (read only) and responds to the requesting node
5. Owner also responds to the home node; the home clears the dirty bit and updates memory

49
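The read-miss handling above can be summarized as a (much simplified) home-node state machine. The sketch below is my own illustration, not code from the slides; it omits writes, sharer-list maintenance, and races:

    // Simplified home-node directory logic for a read miss (illustrative only).
    enum LineState { UNCACHED, SHARED, DIRTY };

    struct DirEntry {
        enum LineState state;
        int owner;   // node holding the dirty copy (meaningful only when state == DIRTY)
    };

    // Returns the node that should supply the data to the requestor.
    int handle_read_miss(struct DirEntry *e, int requestor, int home) {
        (void)requestor;             // a full protocol would add it to the sharer list
        if (e->state == DIRTY) {
            int owner = e->owner;    // data must come from the owner
            e->state = SHARED;       // owner downgrades to SHARED and writes back
            return owner;
        }
        e->state = SHARED;           // clean (or uncached) line: home memory is up to date
        return home;
    }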
Snoopy vs Directory-Based

• Communication: snoopy uses full broadcast; directory-based uses limited broadcast or point-to-point communication
• Control logic: snoopy is simple; directory-based is more complex (needs to maintain directory state)
• Latency: snoopy is low; directory-based is higher (needs to access the directory)
• Scalability: snoopy is limited; directory-based scales well

50
Message-Passing Multiprocessors
• Processors have their own physical address space
– memory accessed only by local processor
• Communicate via explicit message passing
– one processor needs to know when a message is sent, and the
receiving processor needs to know when a message arrives

51
Message-Passing Multiprocessors
• Each node is effectively a complete computer
• To build large-scale computers, high-performance message-passing interconnection networks are usually used
– Special interconnection networks offer better communication performance than, e.g., a LAN, but are much more expensive
• Much easier for hardware designers to build
– But requires more programming effort

52
Shared Memory vs Message Passing
Shared memory:
• Communication goes through a shared memory region
• Communicating processes share a common address space
• Provides maximum computation speed: system calls are needed only to establish the shared memory
• Faster communication strategy
• Can be used to exchange larger amounts of data
• Software on the different processors must ensure data consistency

Message passing:
• Communication goes through a message-passing facility
• Communicating processes have different address spaces
• More time-consuming: message passing crosses software layers (e.g., system calls provided by the operating system)
• Relatively slower communication strategy
• Typically used to exchange small amounts of data
• Communication actions are explicit; no consistency problem

53
Network-on-Chips (NoC)
• The bus has been the most popular interconnect for multiprocessor systems, but
– When expanding to a many-core system, contention decreases throughput
– When scaling sizes and frequencies, wire delays remain larger than the clock cycle
• Need an interconnect with deterministic delays and scalability
• Network-on-Chip (NoC): each on-chip component is connected by an intelligent switch (router) to particular communication wire(s)

Leveraging existing computer networking


principles for on-chip communications

54
Network-on-Chips (NoC)

[Figure: Mesh NoC]
System-on-Chip (SoC)
• Integrates most or all components of a computer on one chip
– Rather than many chips on a board
– Popular in domains sensitive to size and energy efficiency
• Mobile devices
• Embedded devices

56
Exercise 1
• 80% of a program can be perfectly parallelized; 10% can be parallelized on at most 4 cores; the remaining part has to execute sequentially
• What is the speedup if executed on 8 cores?

57
Exercise 2
• 40% of a program can be perfectly parallelized; 40% can be parallelized on at most 8 cores; 10% can be parallelized on at most 4 cores; the remaining part has to execute sequentially

• To achieve speedup of 4, how many cores are needed at least?

• To achieve speedup of 8, how many cores are needed at least?

58
