
CS2115 Computer Organization

2024/2025 Sem A

Chapter 8: Multicore & GPU

Dr. Nan Guan


Department of Computer Science
City University of Hong Kong

1
Sequential Execution
Fetch → Decode → Execute → Memory → Writeback
[Figure: sequential datapath – PC, instruction register (IR), decoder, immediate/operand selection, register file, ALU, and main memory holding instructions and data]
2
Processing Models

• single instruction and single data (SISD)
• single instruction and multiple data (SIMD)
• multiple instruction and single data (MISD)
• multiple instruction and multiple data (MIMD)
3
Multicore Processors
• A single chip with two or more separate processing units (cores)
• Each of which reads and executes program instructions
• Increasing overall speed for programs that support multithreading
or other parallel computing techniques

Which processing model?
(among the four on the previous page)

4
Trend of Modern Processors
Moore’s Law: integrated circuit resources double every 18–24 months

5
Amdahl’s Law
• In a program with parallel processing, the few instructions that have to be performed in sequence place a limit on the program speedup, so adding more processors may not make the program run faster.

6
Amdahl’s Law
Speedup = ExeTime_old / ExeTime_new

• ExeTime_old: execution time of the entire task without the enhancement
• ExeTime_new: execution time of the entire task with the enhancement

ExeTime_new = ExeTime_old × [(1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

Speedup = ExeTime_old / ExeTime_new = 1 / [(1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]
7
Amdahl’s Law
• When executing software on a multicore processor, the ideal speedup equals the number of processor cores (called cores for short)
• In reality, only part of the software can be parallelized
• Example: suppose 80% of a software program can be
parallelized, what is the best speedup if it is executed on 100
processor cores?
Speedup = ExeTime_old / ExeTime_new = 1 / [(1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]
        = 1 / [(1 − 0.8) + 0.8/100] = 1/0.208 = 4.808
8
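The same calculation can be sketched in a few lines of host code; this helper (the file name and function name are my own, not from the slides) may also be handy for the exercises at the end of the chapter:

    // amdahl.cu (hypothetical) – host-only sketch of Amdahl's Law.
    // speedup = 1 / ((1 - f) + f / s), where f is the fraction that benefits
    // from the enhancement and s is the speedup of that fraction (e.g., core count).
    #include <stdio.h>

    double amdahl_speedup(double fraction_enhanced, double speedup_enhanced) {
        return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced);
    }

    int main(void) {
        // The example above: 80% parallelizable, 100 cores.
        printf("speedup = %.3f\n", amdahl_speedup(0.8, 100.0));  // prints 4.808
        return 0;
    }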
GPU (Graphics Processing Unit)

9
Graphics Applications
• The graphics pipeline has a lot of repetitive instructions
– Draw dots
– Draw triangles
– Draw lines
– Rasterize an image (convert from vector form to a pixel bitmap)
– Draw surfaces
– Determine which object is shown on each pixel
– Compute the lighting of each pixel
– Repeat

– All of these are lots of additions and multiplications over tons of independent data, in parallel!

10
GPU in Modern Systems
[Figure: NVIDIA GPU software stack]
• Domain-specific: cuDNN, nvGRAPH, TensorRT, NCCL
• Visual processing: NVIDIA NPP, DeepStream SDK, NVIDIA CODEC SDK, IndeX framework
• Linear algebra: cuSPARSE, cuBLAS, cuRAND, cuFFT
• Math algorithms: CUDA Math Library, AmgX, cuSOLVER, Thrust library

11
GPU: SIMD
• Idea: use one instruction decoder to drive operations on multiple data elements
• Basically, all of these data elements undergo the same operation
• Saves the logic required to fetch/decode tons of instructions

• SIMD: Single-instruction multiple-data

12
GPU Architecture
[Figure: an SM containing a 4×4 array of SPs and a shared memory; the SM connects to global memory (on device), which connects to the host (CPU)]
• SP: Scalar Processor (‘CUDA core’)
– Executes one ‘thread’
• SM: Streaming Multiprocessor
– Multiple SPs (16, 32, 48 or more)
– Fast local ‘shared memory’, shared among the SPs
– e.g., 16 KB or 64 KB
• In this lecture, we use NVIDIA GPUs as the example
13
GPU Architecture
• For example, GTX 480:
– 14 SMs × 32 SPs = 448 cores on the GPU
– GDDR global memory (on device): 512 MB – 6 GB
[Figure: the same SM / shared memory / global memory / host diagram as on the previous slide]
14
GPU Programming Models
• CUDA (Compute Unified Device Architecture): a parallel GPU programming API created by NVIDIA
• A hardware and software architecture for issuing and managing computations on the GPU
• API and libraries for the C/C++/Fortran languages

• Other models (not covered in this course):


– OpenGL – an open standard for GPU programming
– DirectX – a series of Microsoft multimedia programming interfaces

15
CUDA Structure
• Kernel: a function executed on the GPU
• Thread: the minimum unit managed by the CUDA runtime
– Threads execute the same copy of the code
• Warp: a scheduling unit of up to 32 threads
• Block: a user-defined group of threads (usually a multiple of 32)
• Grid: a group of one or more blocks
– A grid is created for each CUDA kernel
[Figure: the host launches Kernel 1 as Grid 1 and Kernel 2 as Grid 2 on the device; Grid 1 contains a 3×2 array of blocks, and Block (1,1) is expanded to show its 5×3 array of threads]

16
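A minimal launch sketch matching the dimensions in the figure (the kernel name mykernel and its empty body are placeholders, not taken from the slides):

    __global__ void mykernel() { /* work of one thread goes here */ }

    int main(void) {
        dim3 grid(3, 2);               // 3 × 2 blocks, as in Grid 1 of the figure
        dim3 block(5, 3);              // 5 × 3 threads per block
        mykernel<<<grid, block>>>();   // a grid is created for this kernel
        cudaDeviceSynchronize();       // wait for the kernel to finish
        return 0;
    }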
CUDA Structure
• One block executes on one SM
– All of its threads share the SM’s shared memory
– Blocks execute on SMs in parallel, independently of each other
– E.g., 128, 192, or 256 threads in a block
• A thread executes on one SP at a time
– But may execute on different SPs at different times
• Warp: a scheduling unit of up to 32 threads
– Each block is physically executed as multiple warps
[Figure: the same grid/block/thread diagram as on the previous slide]

17
CUDA Structure
• Each block has a unique ID
– blockIdx: identifier of a block (1D, 2D or 3D)
– E.g., (blockIdx.x, blockIdx.y, blockIdx.z)
• blockDim: the size of the block
• Each thread has a unique ID within its block
– threadIdx: identifier of a thread (1D, 2D or 3D)
– E.g., (threadIdx.x, threadIdx.y, threadIdx.z)
[Figure: the same grid/block/thread diagram as on the previous slide]

18
CUDA Structure
• Built-in variables, in this example (see the figure, and the sketch after this slide):
– gridDim.x = 3, gridDim.y = 2
– blockDim.x = 5, blockDim.y = 3
– blockIdx.x = 1, blockIdx.y = 1
– threadIdx.x = 2, threadIdx.y = 1
[Figure: the same grid/block/thread diagram as on the previous slides]

19
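Inside a kernel, these built-in variables are usually combined into global coordinates for the thread; a small sketch using the example values above (the variable names x and y are illustrative):

    // Global (x, y) position of the current thread within the whole grid.
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // 1 * 5 + 2 = 7
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // 1 * 3 + 1 = 4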
CUDA: Programming
• Declarations on functions: __host__, __global__, __device__
– __host__ indicates host functions, which execute on the CPU
– __global__ marks a function as a kernel, which executes on the GPU
– __device__ indicates device functions, which can only be called from other device or global functions; they cannot be called from host code

• Launch a kernel: maps the thread program onto the device

– kernel<<<grid_size, block_size>>>(<args>)

20
CUDA: Programming
mycode.cu
    int main_data;                   // Host only
    __shared__ int sdata;            // Device only

    main() { }                       // Host only
    __host__ hfunc () {              // Host only
        int hdata;
        gfunc<<<g,b,m>>>();          // Interface: host code launches the kernel
    }                                // (g: grid size, b: block size, m: shared memory bytes)
    __global__ gfunc() {             // Device: kernel, launched from host code
        int gdata;
        dfunc();
    }
    __device__ dfunc() {             // Device only
        int ddata;
    }

The host-only parts (main_data, main, __host__ hfunc) are compiled by the CPU compiler; the device parts (__shared__ sdata, __global__ gfunc, __device__ dfunc) are compiled by the GPU compiler; the kernel launch gfunc<<<g,b,m>>>() is the interface between the two.

21
Difference between CPU and GPU Programs

[Figure: side-by-side listings of a CPU program and the corresponding GPU program; a sketch of such a pair follows this slide]

22
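The slide’s two listings are not reproduced in this text; a typical contrast, using vector addition as an assumed example (the function names and launch configuration are my own):

    // CPU program: one thread loops over all n elements.
    void add_cpu(const float *a, const float *b, float *c, int n) {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    // GPU program: the loop disappears; each thread handles one element.
    __global__ void add_gpu(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                         // guard against surplus threads
            c[i] = a[i] + b[i];
    }
    // Host-side launch: add_gpu<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);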
Example 1
• Task to do
 Scan the elements of an array of numbers (each between 0 and 9)
 Count how many times “6” appears
 The array has 16 elements (SIZE = 16)
 Each block contains 4 threads (BLOCKSIZE = 4)
 So each thread checks 4 elements (SIZE/BLOCKSIZE = 4)
 1 block in the grid

3 6 7 5 3 5 6 2 9 1 2 7 0 9 3 6

threadIdx.x = 0 checks elements 0, 4, 8, 12


threadIdx.x = 1 checks elements 1, 5, 9, 13
threadIdx.x = 2 checks elements 2, 6, 10, 14
threadIdx.x = 3 checks elements 3, 7, 11, 15
23
Example 1
 Launch Kernel in Host Function: compute<<<1, 4>>>(d_in, d_out);

 Kernel Function:
 Each thread scans its subset of the array elements (see the code sketch after this slide)
 Calls a device function to compare each element with 6
 Computes a local (per-thread) result

 Device Function:
 Compares the current element with 6
 Returns 1 if they are the same, else 0

3 6 7 5 3 5 6 2 9 1 2 7 0 9 3 6

threadIdx.x = 0 checks elements 0, 4, 8, 12


threadIdx.x = 1 checks elements 1, 5, 9, 13
threadIdx.x = 2 checks elements 2, 6, 10, 14
threadIdx.x = 3 checks elements 3, 7, 11, 15
24
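A sketch of the kernel and device function described on this slide (the device-function name compare_6 and the use of per-thread partial counts in d_out are assumptions; the slides do not show the actual code):

    #define SIZE 16
    #define BLOCKSIZE 4

    // Device function: returns 1 if the element equals 6, else 0.
    __device__ int compare_6(int v) {
        return (v == 6) ? 1 : 0;
    }

    // Kernel: thread t checks elements t, t+4, t+8, t+12 and stores its local count.
    __global__ void compute(const int *d_in, int *d_out) {
        int count = 0;
        for (int i = threadIdx.x; i < SIZE; i += BLOCKSIZE)
            count += compare_6(d_in[i]);
        d_out[threadIdx.x] = count;   // the host then sums the 4 partial counts
    }

    // Host-side launch, as on the slide above: compute<<<1, 4>>>(d_in, d_out);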
Example 2
• Task to do
 Element-wise addition of two matrices.
 Both matrices have N*N elements.
 Each thread does addition for one pair of elements
 Each block contains 16*16 threads.
 The grid has (N/16) * (N/16) blocks.
[Figure: an N×N matrix tiled into 16×16 blocks; element (i, j) is handled by the thread with i = blockIdx.y * blockDim.y + threadIdx.y and j = blockIdx.x * blockDim.x + threadIdx.x; a kernel sketch follows this slide]

25
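A sketch of the kernel implied by these indices (the kernel name matAdd and the flat row-major storage of the matrices are assumptions):

    // Element-wise addition of two N × N matrices, one thread per element.
    __global__ void matAdd(const float *A, const float *B, float *C, int N) {
        int j = blockIdx.x * blockDim.x + threadIdx.x;   // column index
        int i = blockIdx.y * blockDim.y + threadIdx.y;   // row index
        if (i < N && j < N)
            C[i * N + j] = A[i * N + j] + B[i * N + j];
    }

    // Host-side launch: 16 × 16 threads per block, (N/16) × (N/16) blocks in the grid.
    // dim3 block(16, 16);
    // dim3 grid(N / 16, N / 16);
    // matAdd<<<grid, block>>>(d_A, d_B, d_C, N);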
GPU Tradeoffs
• Very high throughput

• Low single thread performance


– No complex pipeline
– No branch predictor

• Your program’s performance depends on the application’s parallelism


– If you can run many things in parallel → very fast
– Otherwise → slow

26
Interconnection
• For transferring information between
nodes or for broadcasting information to
all nodes
• Suitability of a network is judged by cost, bandwidth, and effective throughput
– Bandwidth is the capacity in bits per second
– Effective throughput is the actual rate achieved, which is lower because control information must also be transferred
Bus and Ring
• A bus is a set of lines providing a shared path
– Requires arbitration for one access at a time
– Simple bus is held until response is ready, but split-transaction bus can
overlap requests

• A ring provides point-to-point connections


– Bi-directional ring halves average latency
– Hierarchy of rings reduces latency for transfers within a lower-level
ring or between rings
Crossbar
• A crossbar provides a direct link
between any pair of units
connected to the network

• Can be implemented with a collection of switches, allowing simultaneous transfers to different destinations

29
Shared Memory Multiprocessors
• All processors have direct (hardware) access to all the main
memory, i.e., share the same address space
– Multicore processors, GPU, …

• Uniform Memory Access (UMA)

• Non-Uniform Memory Access (NUMA)

30
UMA vs NUMA
Uniform Memory Access (UMA) vs. Non-Uniform Memory Access (NUMA):
• Access time: UMA – memory access time is balanced (equal); NUMA – memory access time is not equal
• Controllers: UMA – a single memory controller is used; NUMA – multiple memory controllers are used
• Speed: UMA is usually slower than NUMA
• Bandwidth: UMA has limited bandwidth; NUMA has higher bandwidth than UMA

31
Shared Data
• Proc 0 writes to an address, followed by Proc 1 reading
    Proc 0          Proc 1
    Mem[A] = 1      …
                    print Mem[A]

• Mem[A] is cached (at either end)

32
Cache Coherence
• Basic question: If multiple processors cache the same block,
how do they ensure they all see a consistent state?
[Figure: P1 and P2, each with a private cache, connected through an interconnection network to main memory; location x in memory holds 1000, and neither cache holds a copy yet]
33
Cache Coherence
• Basic question: If multiple processors cache the same block,
how do they ensure they all see a consistent state?
[Figure: P2 executes lw r2, x and caches the value 1000; location x in main memory holds 1000]
34
Cache Coherence
• Basic question: If multiple processors cache the same block,
how do they ensure they all see a consistent state?
[Figure: P1 also executes lw r2, x; both caches now hold 1000, and location x in main memory holds 1000]
35
Cache Coherence
• Basic question: If multiple processors cache the same block,
how do they ensure they all see a consistent state?
[Figure: P1 executes add r1, r2, r4 and sw r1, x; P1’s cache now holds 2000 for x, while P2’s cache and main memory still hold 1000]
36
Cache Coherence
• Basic question: If multiple processors cache the same block,
how do they ensure they all see a consistent state?
[Figure: P2 then executes lw r5, x; it should NOT load the stale value 1000 (P1’s cache holds 2000, main memory still holds 1000)]
37
Cache Coherence
• Basic idea:
– A processor/cache broadcasts its write/update to a memory location to all other processors
– Another cache that holds that location either updates or invalidates its local copy
[Figure: P1’s cache holds 2000 for x and P2’s cache holds 1000; the update travels over the interconnection network; main memory holds 1000]

38
Snoopy Coherence Scheme
• All caches “snoop” all other caches’ read/write requests and keep their cache lines coherent; a cache invalidates its line if another cache changes the value

– Easy to implement if all caches share a common bus


• Each cache broadcasts its read/write operations on the bus
– Good for small-scale multiprocessors
– Full broadcast of all requests is not efficient for large-scale shared-
memory multiprocessors
39
Directory-based Cache Coherence Scheme
• Idea: a (logically) central directory keeps track of where the copies of each cache block reside; caches check the directory to ensure coherence.
– For each memory block, the directory identifies which nodes have copies
– If the block is modified, a request is sent to the module containing the block, which performs a limited broadcast or forwards the request to the owner

40
Directory-based Cache Coherence Scheme
• Requestor: the node that requests a read/write of a memory block.
• Owner: the node that owns the most recent state of the cache block; note that the directory may not always be up to date with the latest data.
• Sharer: one or more nodes that share a copy of the cache block.

41
Directory-based Cache Coherence Scheme
• Directory-based Cache Coherence for NUMA

42
Directory-based Cache Coherence Scheme
• Directory-based Cache Coherence for NUMA
– read miss to clean line

43
Directory-based Cache Coherence Scheme
• Directory-based Cache Coherence for NUMA
– read miss to clean line

1. Read miss message sent to “home node”

44
Directory-based Cache Coherence Scheme
• Directory-based Cache Coherence for NUMA
– read miss to clean line

1. Read miss message sent to “home node”


2. Home directory checks its entry for the line and returns the line of data from memory

45
Directory-based Cache Coherence Scheme
• Directory-based Cache Coherence for NUMA
– read miss to dirty line

46
Directory-based Cache Coherence Scheme
• Directory-based Cache Coherence for NUMA
– read miss to dirty line

1. If the data is dirty, the data must be sourced by another processor (the one with the up-to-date copy)
2. Home node tells the requesting node where to find the data

47
Directory-based Cache Coherence Scheme
• Directory-based Cache Coherence for NUMA
– read miss to dirty line

1. If the data is dirty, the data must be sourced by another processor (the one with the up-to-date copy)
2. Home node tells the requesting node where to find the data
3. Requesting node requests the data from the owner
4. Owner changes the line’s state in its cache to SHARED (read only) and responds to the requesting node

48
Directory-based Cache Coherence Scheme
• Directory-based Cache Coherence for NUMA
– read miss to dirty line

1. If the data is dirty, the data must be sourced by another processor (the one with the up-to-date copy)
2. Home node tells the requesting node where to find the data
3. Requesting node requests the data from the owner
4. Owner changes the line’s state in its cache to SHARED (read only) and responds to the requesting node
5. Owner also responds to the home node; the home clears the dirty bit and updates memory

49
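The read-miss handling above can be summarized as a (much simplified) home-node state machine. The sketch below is my own illustration, not code from the slides; it omits writes, sharer-list maintenance, and races:

    // Simplified home-node directory logic for a read miss (illustrative only).
    enum LineState { UNCACHED, SHARED, DIRTY };

    struct DirEntry {
        enum LineState state;
        int owner;   // node holding the dirty copy (meaningful only when state == DIRTY)
    };

    // Returns the node that should supply the data to the requestor.
    int handle_read_miss(struct DirEntry *e, int requestor, int home) {
        (void)requestor;             // a full protocol would add it to the sharer list
        if (e->state == DIRTY) {
            int owner = e->owner;    // data must come from the owner
            e->state = SHARED;       // owner downgrades to SHARED and writes back
            return owner;
        }
        e->state = SHARED;           // clean (or uncached) line: home memory is up to date
        return home;
    }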
Snoopy vs Directory-Based

• Communication: snoopy uses full broadcast; directory-based uses limited broadcast or point-to-point communication
• Control logic: snoopy is simple; directory-based is more complex (needs to maintain directory state)
• Latency: snoopy is low; directory-based is higher (needs to access the directory)
• Scalability: snoopy is limited; directory-based scales well

50
Message-Passing Multiprocessors
• Processors have their own physical address space
– memory accessed only by local processor
• Communicate via explicit message passing
– one processor needs to know when a message is sent, and the
receiving processor needs to know when a message arrives

51
Message-Passing Multiprocessors
• Each node is effectively a complete computer
• To build large-scale computers, high-performance message-passing interconnection networks are usually used
– Special interconnection networks offer better communication performance than, e.g., a LAN, but are much more expensive
• Much easier for hardware designers to build
– But requires more programming effort

52
Shared Memory vs Message Passing
Shared memory:
• Communication goes through a shared memory region
• Communicating processes share a common address space
• Provides maximum computation speed: system calls are needed only to establish the shared memory
• Faster communication strategy
• Can be used to exchange larger amounts of data
• Software on the different processors must ensure data consistency

Message passing:
• Communication goes through a message-passing facility
• Communicating processes have different address spaces
• More time-consuming: message passing crosses software layers (e.g., system calls provided by the operating system)
• Relatively slower communication strategy
• Typically used to exchange small amounts of data
• Communication actions are explicit; no consistency problem

53
Network-on-Chips (NoC)
• The bus has been the most popular interconnect for multiprocessor systems, but
– When expanding to a many-core system, contention decreases throughput
– When scaling sizes and frequencies, wire delays remain larger than the clock cycle
• Need an interconnect with deterministic delays and scalability
• Network-on-Chip (NoC): each on-chip component is connected by an intelligent switch (router) to particular communication wire(s)

Leveraging existing computer networking


principles for on-chip communications

54
Network-on-Chips (NoC)

[Figure: Mesh NoC]
System-on-Chip (SoC)
• Integrates most or all components of a computer on one chip
– Rather than many chips on a board
– Popular in domains sensitive to size and energy efficiency
• Mobile devices
• Embedded devices

56
Exercise 1
• 80% of a program can be perfectly parallelized; 10% can be parallelized on at most 4 cores; the remaining part has to execute sequentially
• What is the speedup if executed on 8 cores?

57
Exercise 2
• 40% of a program can be perfectly parallelized; 40% can be parallelized on at most 8 cores; 10% can be parallelized on at most 4 cores; the remaining part has to execute sequentially

• To achieve speedup of 4, how many cores are needed at least?

• To achieve speedup of 8, how many cores are needed at least?

58
