
General Purpose GPU (GPGPU),

CUDA
References
• Michael J. Quinn. Parallel Computing: Theory and Practice. McGraw-Hill.
• Albert Y. Zomaya. Parallel and Distributed Computing Handbook. McGraw-Hill.
• Ian Foster. Designing and Building Parallel Programs. Addison-Wesley.
• Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar. Introduction to Parallel Computing, Second Edition. Addison-Wesley.
• Joseph JaJa. An Introduction to Parallel Algorithms. Addison-Wesley.
• Nguyễn Đức Nghĩa. Parallel Computing (Tính toán song song). Hanoi, 2003.

3
5.1 GPU Architecture

4
Performance factors (1)

[Diagram: a processor and memory exchanging DATA IN / DATA OUT, with DATA PROCESSED inside the processor]

• Amount of data processed at one time (parallel processing)
• Processing speed on each data element (clock frequency)
• Amount of data transferred at one time (memory bandwidth)
• Time for each data element to be transferred (memory latency)

5
Performance factors (2)

[Diagram: the same processor/memory picture as before]

• Different computational problems are sensitive to these factors in different ways from one another
• Different architectures address these factors in different ways

6
CPUs: 4 key factors
• Parallel processing
– Until relatively recently, each CPU only had a single core. Now
CPUs have multiple cores, where each can process multiple
instructions per cycle
• Clock frequency
– CPUs aim to maximise clock frequency, but this has now hit a
limit due to power restrictions (more later)
• Memory bandwidth
– CPUs use regular DDR memory, which has limited bandwidth
• Memory latency
– Latency from DDR is high, but CPUs strive to hide the latency
through:
– Large on-chip low-latency caches to stage data
– Multithreading
– Out-of-order execution

7
The Problem with CPUs
• The power used by a CPU core is proportional to Clock Frequency × Voltage²
• In the past, computers got faster by increasing the
frequency
– Voltage was decreased to keep power reasonable.
• Now, voltage cannot be decreased any further
– 1s and 0s in a system are represented by different
voltages
– Reducing overall voltage further would reduce this
difference to a point where 0s and 1s cannot be properly
distinguished
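
A worked illustration of this relation (the numbers are assumed for illustration only, not from the source):

% Dynamic power scaling: P is proportional to f * V^2
\frac{P_{\mathrm{new}}}{P_{\mathrm{old}}}
  = \frac{f_{\mathrm{new}}}{f_{\mathrm{old}}}
    \left(\frac{V_{\mathrm{new}}}{V_{\mathrm{old}}}\right)^{2}
% Example: raising f from 2 GHz to 3 GHz (1.5x) while lowering V from 1.0 V to 0.82 V
% gives 1.5 * 0.82^2 ~ 1.0, i.e. roughly constant power.
% With V held fixed, the same frequency increase raises power by the full 1.5x.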
8
[Figure reproduced from http://queue.acm.org/detail.cfm?id=2181798]

9
The Problem with CPUs
• Instead, performance increases can be achieved
through exploiting parallelism
• Need a chip which can perform many parallel
operations every clock cycle
– Many cores and/or many operations per core
• Want to keep power/core as low as possible
• Much of the power expended by CPU cores is on
functionality not generally that useful for HPC
– e.g. branch prediction

10
Accelerators
• So, for HPC, we want chips with simple, low power,
number-crunching cores
• But we need our machine to do other things as well
as the number crunching
– Run an operating system, perform I/O, set up calculation
etc
• Solution: “Hybrid” system containing both CPU and
“accelerator” chips

11
Accelerators
• It costs a huge amount of money to design and
fabricate new chips
– Not feasible for relatively small HPC market
• Luckily, over the last few years, Graphics
Processing Units (GPUs) have evolved for the
highly lucrative gaming market
– And largely possess the right characteristics for HPC
– Many number-crunching cores

• GPUs now firmly established in HPC industry

12
AMD 12-core CPU
• Not much space on the CPU is dedicated to compute

[Die photo: highlighted region = compute unit (= core)]

13
NVIDIA Pascal GPU
• The GPU dedicates much more space to compute
  – At the expense of caches, controllers, sophistication, etc.

[Die photo: highlighted region = compute unit (= SM = 64 CUDA cores)]

14
GPUs: 4 key factors
• Parallel processing
– GPUs have a much higher extent of parallelism than
CPUs: many more cores (high-end GPUs have thousands
of cores).
• Clock frequency
– GPUs typically have lower clock-frequency than CPUs,
and instead get performance through parallelism.
• Memory bandwidth
– GPUs use high bandwidth GDDR or HBM2 memory.
• Memory latency
– Memory latency from GDDR/HBM2 is similar to that of DDR.
– GPUs hide latency through very high levels of multithreading.

15
NVIDIA Tesla Series GPU

• The chip is partitioned into Streaming Multiprocessors (SMs) that act independently of each other
• Multiple cores per SM. Groups of cores act in "lock-step": they perform the same instruction on different data elements
• The number of SMs, and cores per SM, varies across products. High-end GPUs have thousands of cores

16
Performance trends

• GPU performance has been increasing much more rapidly than CPU performance

17
Programming GPUs
• GPUs cannot be used instead of CPUs
  – They must be used together
  – GPUs act as accelerators
  – Responsible for the computationally expensive parts of the code

• CUDA: extensions to the C language which allow interfacing to the hardware (NVIDIA specific)
• OpenCL: similar to CUDA but cross-platform (including AMD and NVIDIA)

18
GPU Accelerated Systems
• CPUs and GPUs are used together
  – Communicate over the PCIe bus
  – Or, in the case of the newest Pascal P100 GPUs, NVLink (more later)

[Diagram: CPU with DRAM and GPU with GDDR/HBM2 memory, connected over PCIe; each side has its own I/O]

19
Scaling to larger systems
• Can have multiple CPUs and GPUs within each "workstation" or "shared memory node"
  – E.g. 2 CPUs + 2 GPUs (below)
  – CPUs share memory, but GPUs do not
• An interconnect allows multiple nodes to be connected

[Diagram: two CPU + DRAM units sharing memory within a node, each attached over PCIe to a GPU with GDDR/HBM2 memory; nodes are linked by the interconnect]

20
GPU Accelerated Supercomputer

[Diagram: a 2D array of GPU+CPU nodes connected together to form a supercomputer]

21
Summary
• GPUs have higher compute and memory bandwidth
capabilities than CPUs
– Silicon dedicated to many simplistic cores
– Use of high bandwidth graphics or HBM2 memory
• Accelerators are typically not used alone, but work
in tandem with CPUs
• Most common are NVIDIA GPUs
– AMD also makes high-performance GPUs, but these are less widely used due to weaker programming support
• GPU accelerated systems scale from simple
workstations to large-scale supercomputers
22
5.2 Introduction to CUDA
Programming

23
What is CUDA?
A programming system for machines with GPUs:
• Programming Language
• Compilers
• Runtime Environments
• Drivers
• Hardware

24
CUDA: Heterogeneous Parallel Computing
CPU optimized for fast single-thread execution
• Cores designed to execute 1 or 2 threads concurrently
• Large caches attempt to hide DRAM access times
• Cores optimized for low-latency cache accesses
• Complex control logic for speculation and out-of-order execution

GPU optimized for high multi‐thread throughput


• Cores designed to execute many parallel threads concurrently
• Cores optimized for data‐parallel, throughput computation
• Chips use extensive multithreading to tolerate DRAM access times

25
CUDA C/C++ Program
Serial code executes in a Host (CPU) thread
Parallel code executes in many concurrent Device (GPU) threads across multiple parallel processing elements

[Diagram: execution alternates between serial host (CPU) code and parallel device (GPU) kernels]

26
Compiling CUDA C/C++ Programs

// foo.cpp
int foo(int x)
{
    ...
}
float bar(float x)
{
    ...
}

// saxpy.cu
__global__ void saxpy(int n, float ...)
{
    int i = threadIdx.x + ...;
    if (i < n) y[i] = a*x[i] + y[i];
}

// main.cpp
void main() {
    float x = bar(1.0);
    if (x < 2.0f)
        saxpy<<<...>>>(foo(1), ...);
    ...
}

[Diagram: NVCC separates the CUDA C functions from the rest of the C application; NVCC produces CUDA object files, the CPU compiler produces CPU object files, and the linker combines them into a single CPU + GPU executable]

© 2010, 2011 NVIDIA Corporation

27
Canonical execution flow

[Diagram: GPU with SMs and per-SM shared memory (SMEM) attached to device global memory (GDDR), connected over the PCIe bus to the CPU with its cache and host memory (DRAM)]
28
Step 1 – copy data to GPU memory

[Diagram: data is copied from host memory (DRAM) across the PCIe bus into device memory (GDDR)]

29
Step 2 – launch kernel on GPU

[Diagram: the CPU issues the kernel launch across the PCIe bus to the GPU]

30
Step 3 – execute kernel on GPU

[Diagram: the GPU cores execute the kernel, reading and writing device memory (GDDR)]

31
Step 4 – copy data to CPU memory

[Diagram: results are copied from device memory (GDDR) back across the PCIe bus to host memory (DRAM)]

32
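
For reference, a minimal sketch of the four steps as CUDA C host code (the kernel, data and sizes are illustrative assumptions, not from the source):

// Hedged sketch of the canonical flow: copy in, launch, execute, copy out.
#include <cuda_runtime.h>

__global__ void scale(float *d_data, int n)    // illustrative kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] *= 2.0f;              // step 3: executes on the GPU
}

int main()
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_data = (float*)malloc(bytes);     // host (CPU) buffer
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    float *d_data;
    cudaMalloc((void**)&d_data, bytes);        // allocate device (GPU) memory

    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // step 1
    scale<<<(n + 255) / 256, 256>>>(d_data, n);                  // step 2
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // step 4 (also waits for the kernel)

    cudaFree(d_data);
    free(h_data);
    return 0;
}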
Example: Fermi GPU Architecture
32 CUDA cores per Streaming Multiprocessor (SM)
• 32 fp32 ops/clock
• 16 fp64 ops/clock
• 32 int32 ops/clock
2 warp schedulers per SM
• 1,536 concurrent threads
4 special-function units
16 load/store units
64 KB shared memory + L1 cache
32 K 32-bit registers

Fermi GPUs have as many as 16 SMs
• 24,576 concurrent threads

33
Multithreading
CPU architecture attempts to minimize latency within each thread
GPU architecture hides latency with computation from other thread warps

[Diagram: a GPU streaming multiprocessor (throughput processor) interleaves warps W1–W4, switching to a warp that is ready to execute whenever the current one is waiting for data; a CPU core (low-latency processor) runs threads T1 and T2 with occasional context switches]

34
CUDA Kernels
Parallel portions of the application execute as kernels
• The entire GPU executes each kernel
• Kernel launches create thousands of CUDA threads efficiently

Host (CPU): executes functions
Device (GPU): executes kernels

CUDA threads
• Lightweight
• Fast switching
• 1000s execute simultaneously

Kernel launches create hierarchical groups of threads
• Threads are grouped into Blocks, and Blocks into Grids
• Threads and Blocks represent different levels of parallelism

35
CUDA C: C with a few keywords
Kernel: a function that executes on the device (GPU) and can be called from the host (CPU)
• Can only access GPU memory
• No variable number of arguments
• No static variables

Functions must be declared with a qualifier:
__global__ : GPU kernel function launched by the CPU, must return void
__device__ : can be called from GPU functions
__host__   : can be called from CPU functions (default)
__host__ and __device__ qualifiers can be combined

Qualifiers determine how functions are compiled
• Controls which compiler is used to compile each function
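
As an illustration, a minimal sketch of how the qualifiers might appear in code (the function names are assumptions for illustration):

// Hedged sketch: illustrative use of CUDA function qualifiers.
__host__ __device__ float square(float v)          // callable from both CPU and GPU code
{
    return v * v;
}

__device__ float scaled_square(float v, float a)   // callable only from GPU code
{
    return a * square(v);
}

__global__ void apply(float *d_out, const float *d_in, float a, int n)  // launched from the CPU
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_out[i] = scaled_square(d_in[i], a);
}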

36
CUDA Kernels: Parallel Threads
A kernel is a function executed on the GPU as an array of parallel threads

All threads execute the same kernel code, but can take different paths

Each thread has an ID, used to
• Select input/output data
• Control decisions

float x = in[threadIdx.x];
float y = func(x);
out[threadIdx.x] = y;

[Diagram: thread i reads in[i] and writes out[i], for i, i+1, i+2, i+3, ...]
37
5.3 Synchronization/Communication

38
CUDA Thread Organization

GPUs can handle thousands of concurrent threads


CUDA programming model supports even more
• Allows a kernel launch to specify more threads than the GPU can
execute concurrently
• Helps to amortize kernel launch times

39
Blocks of threads

[Diagram: threads grouped into three blocks]

Threads are grouped into blocks

40
Grids of blocks

[Diagram: Block (0), Block (1) and Block (2) grouped into a grid]

Threads are grouped into blocks
Blocks are grouped into a grid
A kernel is executed as a grid of blocks of threads

41
Blocks execute on Streaming Multiprocessors

[Diagram: a thread runs on a streaming processor using its registers; a thread block runs on a streaming multiprocessor using per-block shared memory (SMEM); both access global (device) memory]

42
Grids of blocks execute across the GPU

[Diagram: the blocks of a grid are distributed across the GPU's SMs, each with its own shared memory (SMEM); all blocks access global (device) memory]

43
Kernel Execution Recap
A thread executes on a single streaming processor
• Allows use of familiar scalar code within the kernel

A block executes on a single streaming multiprocessor
• Threads and blocks do not migrate to different SMs
• All threads within a block execute concurrently, in parallel

A streaming multiprocessor may execute multiple blocks
• Must be able to satisfy the aggregate register and memory demands

A grid executes on a single device (GPU)
• Blocks from the same grid may execute concurrently or serially
• Blocks from multiple grids may execute concurrently
• A device can execute multiple kernels concurrently

44
Block abstraction provides scalability
Blocks may execute in arbitrary order, concurrently or sequentially, and parallelism increases with resources
• Depends on when execution resources become available
Independent execution of blocks provides scalability
• Blocks can be distributed across any number of SMs

[Diagram: a kernel launch produces a grid of Blocks 0–7; a device with 2 SMs runs four blocks per SM over time, while a device with 4 SMs runs two blocks per SM]

45
Blocks enable efficient collaboration
Threads often need to collaborate
• Cooperatively load/store common data sets
• Share results or cooperate to produce a single result
• Synchronize with each other

Threads in the same block
• Can communicate through shared and global memory
• Can synchronize using fast synchronization hardware (see the sketch below)

Threads in different blocks of the same grid
• Cannot synchronize reliably
  – No guarantee that both threads are alive at the same time
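
For illustration, a minimal sketch of in-block collaboration: a block-wide sum reduction (the kernel name and sizes are illustrative assumptions; it assumes 256 threads per block, a power of two):

// Hedged sketch: threads in one block cooperate through shared memory
// and synchronize with __syncthreads() to produce a per-block sum.
__global__ void block_sum(const float *d_in, float *d_block_sums, int n)
{
    __shared__ float s_data[256];                 // assumes blockDim.x == 256
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    s_data[tid] = (i < n) ? d_in[i] : 0.0f;       // cooperative load
    __syncthreads();                              // all loads visible before reducing

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s_data[tid] += s_data[tid + stride];
        __syncthreads();                          // keep every reduction step in step
    }
    if (tid == 0) d_block_sums[blockIdx.x] = s_data[0];   // one result per block
}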

46
Blocks must be independent
Any possible interleaving of blocks is allowed
• Blocks are presumed to run to completion without pre-emption
• May run in any order, concurrently or sequentially

Programs that depend on block execution order within a grid for correctness are not well formed
• May deadlock or return incorrect results

Blocks may coordinate but not synchronize
• Shared queue pointer: OK (see the sketch below)
• Shared lock: BAD ... can easily deadlock
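
A minimal sketch of the "shared queue pointer" style of coordination, using an atomic counter so that no block ever waits on another (the names and the per-element work are illustrative assumptions; d_next is assumed to be zeroed with cudaMemset before launch):

// Hedged sketch: blocks coordinate through an atomic work-queue counter.
// No block depends on another block's progress, so any execution order is safe.
__global__ void process_items(int *d_next, const float *d_in, float *d_out, int n)
{
    __shared__ int base;
    while (true) {
        if (threadIdx.x == 0)
            base = atomicAdd(d_next, blockDim.x);   // block claims the next chunk of work
        __syncthreads();                            // broadcast the claimed base index
        if (base >= n) return;                      // uniform across the block: safe to exit

        int i = base + threadIdx.x;
        if (i < n) d_out[i] = 2.0f * d_in[i];       // placeholder per-element work
        __syncthreads();                            // finish the chunk before claiming another
    }
}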

47
Thread and Block ID and Dimensions
Threads: 3D IDs, unique within a block
Blocks: 2D IDs, unique within a grid
Dimensions are set at launch and can be unique for each grid
Built-in variables
• threadIdx, blockIdx
• blockDim, gridDim
Programmers usually select dimensions that simplify the mapping of the application data to CUDA threads

[Diagram: a device running Grid 1; Block (1,1) is expanded to show its threads, Thread (0,0) through Thread (4,2)]

48
Examples of Indexes and Indexing
__global__ void kernel(int *a)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = 7;
}                            Output: 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7

__global__ void kernel(int *a)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = blockIdx.x;
}                            Output: 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3

__global__ void kernel(int *a)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = threadIdx.x;
}                            Output: 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3

49
Example of 2D indexing

__global__ void kernel(int *a, int dimx, int dimy)
{
    int ix  = blockIdx.x*blockDim.x + threadIdx.x;
    int iy  = blockIdx.y*blockDim.y + threadIdx.y;
    int idx = iy*dimx + ix;

    a[idx] = a[idx] + 1;
}
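
A possible host-side launch for a kernel of this shape, using dim3 to cover a dimx × dimy array (the sizes and the device pointer d_a are illustrative assumptions):

// Hedged sketch: a 2D execution configuration covering a dimx x dimy array.
int dimx = 1024, dimy = 768;                       // illustrative problem size
int *d_a;
cudaMalloc((void**)&d_a, dimx * dimy * sizeof(int));
cudaMemset(d_a, 0, dimx * dimy * sizeof(int));

dim3 block(16, 16);                                // 16 x 16 = 256 threads per block
dim3 grid((dimx + block.x - 1) / block.x,          // round up so every element is covered
          (dimy + block.y - 1) / block.y);
kernel<<<grid, block>>>(d_a, dimx, dimy);

Because the grid is rounded up, it may contain slightly more threads than elements; in practice a bounds check such as if (ix < dimx && iy < dimy) would normally be added inside the kernel.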

50
Independent address spaces
CPU and GPU have independent memory systems
• The PCIe bus transfers data between the CPU and GPU memory systems

Typically, the CPU thread and GPU threads access what are logically different, independent virtual address spaces

[Diagram: system memory, GPU0 memory and GPU1 memory are separate address spaces joined by PCIe]

51
Independent address spaces: consequences
Cannot reliably determine whether a pointer references a host (CPU) or device (GPU) address from the pointer value
• Dereferencing a CPU/GPU pointer on the GPU/CPU will likely cause a crash
Unified Virtual Addressing (UVA) in CUDA 4.0
• One virtual address space shared by the CPU thread and GPU threads

[Diagram: with UVA, system memory, GPU0 memory and GPU1 memory share one virtual address space across PCIe]

52
CUDA Memory Hierarchy
Thread
• Registers
• Local memory

Thread Block
• Shared memory

All Thread Blocks
• Global Memory (DRAM)

[Diagram: each thread has its own registers and local memory; the threads of a block share per-block shared memory; all blocks access global memory]

56
Shared Memory
__shared__ <type> x[<elements>];
Allocated per thread block
Scope: threads in the block
Data lifetime: same as the block
Capacity: small (about 48 KB)
Latency: a few cycles
Bandwidth: very high
• per SM: 32 * 4 B * 1.15 GHz / 2 = 73.6 GB/s
• whole GPU: 14 * 32 * 4 B * 1.15 GHz / 2 = 1.03 TB/s
Common uses
• Sharing data among threads in a block
• User-managed cache, to reduce global memory accesses (see the sketch below)

57
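
As an illustration of the user-managed cache use, a minimal sketch of a 3-point stencil that stages a tile of global memory in shared memory (the kernel name, tile size and zero-padding at the array ends are illustrative assumptions; it assumes 256 threads per block):

// Hedged sketch: shared memory as a user-managed cache for a 3-point average.
#define TILE 256
__global__ void stencil3(const float *d_in, float *d_out, int n)
{
    __shared__ float tile[TILE + 2];                  // TILE elements plus a 1-element halo each side
    int g = blockIdx.x * blockDim.x + threadIdx.x;    // global index
    int l = threadIdx.x + 1;                          // local index inside the tile

    tile[l] = (g < n) ? d_in[g] : 0.0f;               // each thread loads one element (0 past the end)
    if (threadIdx.x == 0)                             // first thread also loads the left halo
        tile[0] = (g > 0) ? d_in[g - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)                // last thread loads the right halo
        tile[l + 1] = (g + 1 < n) ? d_in[g + 1] : 0.0f;
    __syncthreads();                                  // tile fully populated before use

    if (g < n)
        d_out[g] = (tile[l - 1] + tile[l] + tile[l + 1]) / 3.0f;
}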
Global Memory
Allocated explicitly by the host (CPU) thread
Scope: all threads of all kernels
Data lifetime: determined by the host (CPU) thread
• cudaMalloc(void **pointer, size_t nbytes)
• cudaFree(void *pointer)
Capacity: large (1–6 GB)
Latency: 400–800 cycles
Bandwidth: 156 GB/s
• Data access patterns will limit the bandwidth achieved in practice
Common uses
• Staging data transfers to/from the CPU
• Staging data between kernel launches

58
Communication and Data Persistence

[Diagram: sequential kernels 1, 2 and 3 each use their own, non-persistent shared memory; data that must persist or be communicated between kernel launches lives in global memory]

59
Managing Device (GPU) Memory
The host (CPU) manages device (GPU) memory:
cudaMalloc(void **pointer, size_t num_bytes)
cudaMemset(void *pointer, int value, size_t count)
cudaFree(void *pointer)

Example: allocate and initialize an array of 1024 ints on the device

// allocate and initialize int x[1024] on the device
int n = 1024;
int num_bytes = 1024 * sizeof(int);
int *d_x = 0;                      // holds device pointer
cudaMalloc((void**)&d_x, num_bytes);
cudaMemset(d_x, 0, num_bytes);
cudaFree(d_x);

60
Transferring Data
cudaMemcpy(void *dst, void *src, size_t num_bytes, enum cudaMemcpyKind direction);
• Returns to the host thread after the copy completes
• Blocks the CPU thread until all bytes have been copied
• Doesn't start copying until previous CUDA calls complete
Direction is controlled by enum cudaMemcpyKind:
• cudaMemcpyHostToDevice
• cudaMemcpyDeviceToHost
• cudaMemcpyDeviceToDevice
CUDA also provides non-blocking copies (see the sketch below)
• Allows the program to overlap data transfer with concurrent computation on host and device
• Need to ensure that source locations are stable and destination locations are not accessed

61
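
A minimal sketch of the non-blocking path using cudaMemcpyAsync with a stream and pinned host memory (the function and buffer names are illustrative assumptions):

// Hedged sketch: overlapping a host-to-device copy with independent host work.
// Pinned (page-locked) host memory is used so the copy can proceed asynchronously.
#include <cuda_runtime.h>

void async_copy_example(float *d_x, int n)             // d_x: device buffer (illustrative)
{
    float *h_x;
    cudaMallocHost((void**)&h_x, n * sizeof(float));   // pinned host memory
    for (int i = 0; i < n; ++i) h_x[i] = 1.0f;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaMemcpyAsync(d_x, h_x, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);   // returns immediately to the CPU thread

    // ... independent CPU work can run here while the copy is in flight ...

    cudaStreamSynchronize(stream);                     // wait before touching h_x or relying on d_x
    cudaStreamDestroy(stream);
    cudaFreeHost(h_x);
}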
Example: SAXPY Kernel [1/4]
// [computes] for (i = 0; i < n; i++) y[i] = a*x[i] + y[i];
// Each thread processes one element

// Device Code
__global__ void saxpy(int n, float a, float *x, float *y)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

// Host Code
int main()
{
    ...
    // invoke parallel SAXPY kernel with 256 threads/block
    int nblocks = (n + 255)/256;
    saxpy<<<nblocks, 256>>>(n, 2.0, d_x, d_y);
    ...
}

62
Example: SAXPY Kernel [2/4]
int main()
{
    // allocate and initialize host (CPU) memory
    float *x = ...;
    float *y = ...;

    // allocate device (GPU) memory
    float *d_x, *d_y;
    cudaMalloc((void**)&d_x, n*sizeof(float));
    cudaMalloc((void**)&d_y, n*sizeof(float));

    // copy x and y from host memory to device memory
    cudaMemcpy(d_x, x, n*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, n*sizeof(float), cudaMemcpyHostToDevice);

    // invoke parallel SAXPY kernel with 256 threads/block
    int nblocks = (n + 255)/256;
    saxpy<<<nblocks, 256>>>(n, 2.0, d_x, d_y);

65
Example: SAXPY Kernel [3/4]
    // invoke parallel SAXPY kernel with 256 threads/block
    int nblocks = (n + 255)/256;
    saxpy<<<nblocks, 256>>>(n, 2.0, d_x, d_y);

    // copy y from device (GPU) memory to host (CPU) memory
    cudaMemcpy(y, d_y, n*sizeof(float), cudaMemcpyDeviceToHost);

    // do something with the result ...

    // free device (GPU) memory
    cudaFree(d_x);
    cudaFree(d_y);

    return 0;
}

67
Example: SAXPY Kernel [4/4]
// Standard C Code
void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}
// invoke host SAXPY function
saxpy_serial(n, 2.0, x, y);

// CUDA C Code
__global__ void saxpy(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}
// invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255)/256;
saxpy<<<nblocks, 256>>>(n, 2.0, x, y);

69
Thank you for your attention!

70
