Chapter 5 - General Purpose GPU (GPGPU), CUDA
CUDA
5.1 GPU Architecture
4
Performance factors (1)
5
Performance factors (2)
6
CPUs: 4 key factors
• Parallel processing
– Until relatively recently, each CPU only had a single core. Now
CPUs have multiple cores, where each can process multiple
instructions per cycle
• Clock frequency
– CPUs aim to maximise clock frequency, but this has now hit a
limit due to power restrictions (more later)
• Memory bandwidth
– CPUs use regular DDR memory, which has limited bandwidth
• Memory latency
– Latency from DDR is high, but CPUs strive to hide the latency
through:
– Large on-chip low-latency caches to stage data
– Multithreading
– Out-of-order execution
7
The Problem with CPUs
• The power used by a CPU core is proportional to
Clock Frequency × Voltage²
– e.g. at a fixed frequency, halving the voltage cuts dynamic power by a factor of 4
• In the past, computers got faster by increasing the
frequency
– Voltage was decreased to keep power reasonable.
• Now, voltage cannot be decreased any further
– 1s and 0s in a system are represented by different
voltages
– Reducing overall voltage further would reduce this
difference to a point where 0s and 1s cannot be properly
distinguished
8
Reproduced from http://queue.acm.org/detail.cfm?id=2181798
9
The Problem with CPUs
• Instead, performance increases can be achieved
through exploiting parallelism
• Need a chip which can perform many parallel
operations every clock cycle
– Many cores and/or many operations per core
• Want to keep power/core as low as possible
• Much of the power expended by CPU cores is on
functionality not generally that useful for HPC
– e.g. branch prediction
10
Accelerators
• So, for HPC, we want chips with simple, low power,
number-crunching cores
• But we need our machine to do other things as well
as the number crunching
– Run an operating system, perform I/O, set up calculation
etc
• Solution: “Hybrid” system containing both CPU and
“accelerator” chips
11
Accelerators
• It costs a huge amount of money to design and
fabricate new chips
– Not feasible for relatively small HPC market
• Luckily, over the last few years, Graphics
Processing Units (GPUs) have evolved for the
highly lucrative gaming market
– And largely possess the right characteristics for HPC
– Many number-crunching cores
12
AMD 12-core CPU
• Not much space on CPU is dedicated to compute
[die photo legend: highlighted area = compute unit (= core)]
13
NVIDIA Pascal GPU
• GPU dedicates much more space to compute
– At expense of caches, controllers, sophistication etc
[die photo legend: highlighted area = compute unit (= SM = 64 CUDA cores)]
14
GPUs: 4 key factors
• Parallel processing
– GPUs have a much higher extent of parallelism than
CPUs: many more cores (high-end GPUs have thousands
of cores).
• Clock frequency
– GPUs typically have lower clock-frequency than CPUs,
and instead get performance through parallelism.
• Memory bandwidth
– GPUs use high bandwidth GDDR or HBM2 memory.
• Memory latency
– Memory latency from GDDR/HBM2 is similar to that of DDR.
– GPUs hide latency through very high levels of
multithreading.
15
NVIDIA Tesla Series GPU
16
Performance trends
17
Programming GPU
• GPUs cannot be used instead of CPUs
– They must be used together
– GPUs act as accelerators
– Responsible for the computationally expensive parts of the code
18
GPU Accelerated Systems
• CPUs and GPUs are used together
– Communicate over PCIe bus
– Or, in case of newest Pascal P100 GPUs, NVLINK (more later)
[diagram: CPU with DRAM and I/O, connected over PCIe to a GPU with GDRAM/HBM2 and I/O]
19
Scaling to larger systems
• Can have multiple CPUs and GPUs within each “workstation” or
“shared memory node”
– E.g. 2 CPUs +2 GPUs (below)
– CPUs share memory, but GPUs do not
[diagram: two CPUs sharing DRAM, each attached over PCIe to its own GPU with GDRAM/HBM2; an interconnect allows multiple such nodes to be connected]
20
GPU Accelerated Supercomputer
21
Summary
• GPUs have higher compute and memory bandwidth
capabilities than CPUs
– Silicon dedicated to many simplistic cores
– Use of high bandwidth graphics or HBM2 memory
• Accelerators are typically not used alone, but work
in tandem with CPUs
• Most common are NVIDIA GPUs
– AMD also make high-performance GPUs, but these are less
widely used, largely due to less mature programming support
• GPU accelerated systems scale from simple
workstations to large-scale supercomputers
22
5.2 Introduction to CUDA Programming
23
What is CUDA?
Programming system for machines with GPUs
• Programming Language
• Compilers
• Runtime Environments
• Drivers
• Hardware
24
CUDA: Heterogeneous Parallel Computing
CPU optimized for fast single-thread execution
• Cores designed to execute 1 or 2 threads concurrently
• Large caches attempt to hide DRAM access times
• Cores optimized for low-latency cache accesses
• Complex control logic for speculation and out-of-order execution
25
CUDA C/C++ Program
Serial code executes in a Host (CPU) thread
Parallel code executes in many concurrent Device (GPU) threads
across multiple parallel processing elements
Host = CPU
Device = GPU
26
Compiling CUDA C/C++ Programs
// foo.cpp
int foo(int x)
{
    ...
}
float bar(float x)
{
    ...
}

// saxpy.cu
__global__ void saxpy(int n, float ...)
{
    int i = threadIdx.x + ...;
    if (i < n) y[i] = a*x[i] + y[i];
}

// main.cpp
void main() {
    float x = bar(1.0);
    if (x < 2.0f)
        saxpy<<<...>>>(foo(1), ...);
    ...
}

NVCC compiles the CUDA C functions, the host CPU compiler compiles the rest of the C/C++ application, and the linker combines the CUDA and CPU object files into a single CPU + GPU executable.
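The whole build can be driven by nvcc, which passes the plain C/C++ parts on to the host compiler. A minimal sketch, assuming the file names above and an arbitrary output name:

nvcc -o app foo.cpp main.cpp saxpy.cu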
27
Canonical execution flow
[diagram: a GPU with many SMs and their shared memory (SMEM), connected over the PCIe bus to a CPU with cores and cache]
28
Step 1 – copy data to GPU memory
29
Step 2 – launch kernel on GPU
30
Step 3 – execute kernel on GPU
31
Step 4 – copy data to CPU memory
32
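A hedged end-to-end sketch of this four-step flow; the kernel scale_by_two, the array names and the sizes are illustrative, not from the slides:

#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void scale_by_two(int n, float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;            // illustrative computation
}

int main()
{
    int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_data = (float*)malloc(bytes); // host (CPU) memory
    float *d_data;
    cudaMalloc((void**)&d_data, bytes);    // device (GPU) memory
    // ... fill h_data here ...

    // Step 1: copy data to GPU memory
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // Steps 2-3: launch the kernel and execute it on the GPU
    scale_by_two<<<(n + 255)/256, 256>>>(n, d_data);

    // Step 4: copy results back to CPU memory
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    free(h_data);
    return 0;
}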
Example: Fermi GPU Architecture
32 CUDA Cores per Streaming Multiprocessor (SM)
• 32 fp32 ops/clock
• 16 fp64 ops/clock
• 32 int32 ops/clock
2 warp schedulers per SM
• 1,536 concurrent threads
16 load/store units, 4 special-function units
64 KB shared memory + L1 cache
32 K 32-bit registers
33
Multithreading
CPU architecture attempts to minimize latency within each thread
GPU architecture hides latency with computation from other thread warps
[diagram: a low-latency CPU core context-switches between threads T1 and T2, while a GPU SM keeps many warps (W1, W2, W3, ...) in flight, running whichever warps are ready to execute while others wait for data]
34
CUDA Kernels
Parallel portion of application: executes as a kernel
• Entire GPU executes kernel
• Kernel launches create thousands of CUDA threads efficiently
CUDA threads
• Lightweight
• Fast switching
• 1000s execute simultaneously
Kernel launches create hierarchical groups of threads
• Threads are grouped into Blocks, and Blocks into Grids
• Threads and Blocks represent different levels of parallelism
35
CUDA C: C with a few keywords
Kernel: function that executes on device (GPU) and can be called
from host (CPU)
• Can only access GPU memory
• No variable number of arguments
• No static variables
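A minimal sketch of the __global__ keyword in use (kernel and variable names are illustrative):

__global__ void scale(int n, float a, float *x)   // executes on the device
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;                          // may only touch GPU memory
}

// called from the host, e.g.: scale<<<(n + 255)/256, 256>>>(n, 2.0f, d_x);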
36
CUDA Kernels: Parallel Threads
A kernel is a function executed in parallel by an array of threads
[figure: each thread works on its own element, e.g. in[i], in[i+1], in[i+2], in[i+3]]
37
5.3 Synchronization/Communication
38
CUDA Thread Organization
39
Blocks of threads
40
Grids of blocks
41
Blocks execute on Streaming Multiprocessors
[diagram: a thread block runs on one Streaming Multiprocessor; each thread uses the SM's streaming processors and registers, and the block has per-block shared memory (SMEM)]
42
Grids of blocks execute across the GPU
[diagram: the blocks of a grid are distributed across the GPU's SMs, each with its own shared memory (SMEM)]
43
Kernel Execution Recap
A thread executes on a single streaming processor
• Allows use of familiar scalar code within kernel
44
Block abstraction provides scalability
Blocks may execute in arbitrary order, concurrently or
sequentially, and parallelism increases with resources
• Depends on when execution resources become available
Independent execution of blocks provides scalability
• Blocks can be distributed across any number of SMs
45
Blocks enable efficient collaboration
Threads often need to collaborate
• Cooperatively load/store common data sets
• Share results or cooperate to produce a single result
• Synchronize with each other
46
Blocks must be independent
Any possible interleaving of blocks is allowed
• Blocks presumed to run to completion without pre-emption
• May run in any order, concurrently or sequentially
47
Thread and Block ID and Dimensions
Threads: 3D IDs, unique within a block
Thread Blocks: 2D IDs, unique within a grid
Dimensions set at launch: can be unique for each grid
Built-in variables
• threadIdx, blockIdx
• blockDim, gridDim
[figure: a Device running Grid 1; Block (1, 1) contains threads (0, 0), (1, 0), (2, 0), (3, 0), (4, 0), ...]
48
Examples of Indexes and Indexing
__global__ void kernel(int *a)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = 7;
}
Output: 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7

__global__ void kernel(int *a)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = blockIdx.x;
}
Output: 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3

__global__ void kernel(int *a)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = threadIdx.x;
}
Output: 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
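The 16-element outputs above match a launch of 4 blocks of 4 threads each; a hedged sketch of such a launch (d_a is an assumed device array of 16 ints):

kernel<<<4, 4>>>(d_a);   // gridDim.x = 4, blockDim.x = 4, so idx covers 0..15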
49
Example of 2D indexing
a[idx] = a[idx]+1;
}
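A hedged sketch of the usual 2D indexing kernel this fragment belongs to, together with a dim3 launch; dimx, dimy, d_a and the 16x16 block size are assumptions:

__global__ void kernel(int *a, int dimx, int dimy)
{
    int ix = blockIdx.x*blockDim.x + threadIdx.x;   // column index
    int iy = blockIdx.y*blockDim.y + threadIdx.y;   // row index
    if (ix < dimx && iy < dimy) {                   // guard partial blocks
        int idx = iy*dimx + ix;                     // flatten 2D position to 1D
        a[idx] = a[idx]+1;
    }
}

// host-side launch covering a dimx x dimy array:
dim3 block(16, 16);
dim3 grid((dimx + block.x - 1)/block.x, (dimy + block.y - 1)/block.y);
kernel<<<grid, block>>>(d_a, dimx, dimy);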
50
Independent address spaces
CPU and GPU have independent memory systems
• PCIe bus transfers data between CPU and GPU memory systems
[diagram: separate CPU and GPU memory systems linked by PCI-e]
51
Independent address spaces:
consequences
Cannot reliably determine whether a pointer references a host
(CPU) or device (GPU) address from the pointer value
• Dereferencing CPU/GPU pointer on GPU/CPU will likely cause crash
Unified virtual addressing (UVA) in CUDA 4.0
• One virtual address space shared by CPU thread and GPU threads
52
CUDA Memory Hierarchy
Thread
• Registers
• Local memory
[figure: each thread has its own registers and local memory, plus shared memory common to the threads of a block]
53
Shared Memory
__shared__ <type> x[<elements>];
Allocated per thread block
Scope: threads in block
Data lifetime: same as block
Capacity: small (about 48 kB)
Latency: a few cycles
Bandwidth: very high
• per SM: 32 * 4 B * 1.15 GHz / 2 = 73.6 GB/s
• whole GPU: 14 * 32 * 4 B * 1.15 GHz / 2 = 1.03 TB/s
Common uses: e.g. inter-thread communication within a block, user-managed cache
58
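A minimal sketch of a common pattern: staging data in shared memory and synchronizing with __syncthreads(), here reversing the elements within each block (names are illustrative; n is assumed to be a multiple of the 256-thread block size):

__global__ void reverse_in_block(float *d)
{
    __shared__ float s[256];          // one element per thread in the block
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;
    s[t] = d[i];                      // each thread stages one element
    __syncthreads();                  // wait until the whole block has loaded
    d[i] = s[blockDim.x - 1 - t];     // write back in reversed block order
}

// launch: reverse_in_block<<<n/256, 256>>>(d_data);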
Communication and Data Persistence
[diagram: sequential kernels (Kernel 1, Kernel 2, Kernel 3), each with its own per-block shared memory; global memory persists across the kernel launches and carries data between them]
59
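A hedged sketch of what the diagram implies: two sequential kernels communicating through global memory, since shared memory does not outlive the kernel that uses it (names are illustrative):

__global__ void produce(float *g) { g[threadIdx.x] = (float)threadIdx.x; } // writes results to global memory
__global__ void consume(float *g) { g[threadIdx.x] += 1.0f; }              // a later kernel reads the same data

// host side: data written by 'produce' persists in global memory for 'consume'
// produce<<<1, 256>>>(d_buf);
// consume<<<1, 256>>>(d_buf);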
Managing Device (GPU) Memory
Host (CPU) manages device (GPU) memory
cudaMalloc(void** pointer, size_t num_bytes)
cudaMemset(void* pointer, int value, size_t count)
cudaFree(void* pointer)
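A minimal sketch of host-side device-memory management with these calls (sizes and names are illustrative):

int n = 1024;
size_t bytes = n * sizeof(float);
float *d_x = NULL;
cudaMalloc((void**)&d_x, bytes);   // allocate n floats on the device
cudaMemset(d_x, 0, bytes);         // zero the allocation, byte-wise
// ... use d_x in kernels ...
cudaFree(d_x);                     // release the device memory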
60
Transferring Data
cudaMemcpy(void* dst, void* src, size_t num_bytes, enum cudaMemcpyKind direction);
• returns to host thread after the copy completes
• blocks CPU thread until all bytes have been copied
• doesn’t start copying until previous CUDA calls complete
Direction controlled by enum cudaMemcpyKind
cudaMemcpyHostToDevice
cudaMemcpyDeviceToHost
cudaMemcpyDeviceToDevice
CUDA also provides non-blocking copies (e.g. cudaMemcpyAsync)
• Allows program to overlap data transfer with concurrent computation
on host and device
• Need to ensure that source locations are stable and destination locations are
not accessed
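A hedged sketch of the non-blocking variant, using cudaMemcpyAsync with a CUDA stream and pinned host memory (the slide does not name the API; all names are illustrative):

size_t bytes = 1 << 20;
float *h_x, *d_x;
cudaMallocHost((void**)&h_x, bytes);   // pinned host memory, needed for true async copies
cudaMalloc((void**)&d_x, bytes);
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMemcpyAsync(d_x, h_x, bytes, cudaMemcpyHostToDevice, stream);  // returns immediately
// ... independent work on the host can proceed here ...
cudaStreamSynchronize(stream);         // wait until the copy has finished
cudaStreamDestroy(stream);
cudaFree(d_x);
cudaFreeHost(h_x);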
61
Example: SAXPY Kernel [1/4]
// [computes] for (i = 0; i < n; i++) y[i] = a*x[i] + y[i];
// Each thread processes one element
__global__ void saxpy(int n, float a, float *x, float *y)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

Host Code
int main()
{
    ...
    // invoke parallel SAXPY kernel with 256 threads/block
    int nblocks = (n + 255)/256;
    saxpy<<<nblocks, 256>>>(n, 2.0, d_x, d_y);
    ...
}
62
Example: SAXPY Kernel [2/4]
int main()
{
    // allocate and initialize host (CPU) memory
    float* x = ...;
    float* y = ...;
65
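A hedged sketch of how such host code typically continues before the kernel launch, allocating device memory and copying x and y to the GPU (d_x, d_y are illustrative names):

    // allocate device (GPU) memory
    float *d_x, *d_y;
    cudaMalloc((void**)&d_x, n * sizeof(float));
    cudaMalloc((void**)&d_y, n * sizeof(float));

    // copy x and y from host memory to device memory
    cudaMemcpy(d_x, x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, n * sizeof(float), cudaMemcpyHostToDevice);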
Example: SAXPY Kernel [3/4]
    // invoke parallel SAXPY kernel with 256 threads/block
    int nblocks = (n + 255)/256;
    saxpy<<<nblocks, 256>>>(n, 2.0, d_x, d_y);

    // do something with the result
    return 0;
}
67
Example: SAXPY Kernel [4/4]
Standard C Code
void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}
// invoke host SAXPY function
saxpy_serial(n, 2.0, x, y);

CUDA C Code
__global__ void saxpy(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}
// invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy<<<nblocks, 256>>>(n, 2.0, x, y);
69
Thank you for your attention!
70