002 - Introduction To CUDA Programming - 1
[Figure: CPU and GPU memory organization; GPU memory is 1GB on our systems; interconnect bandwidths of 6.4GB/sec – 31.92GB/sec at 8B per transfer.]
Target Applications
int a[N]; // N is large
for all elements of a compute
a[i] = a[i] * fade
• Lots of independent computations
– CUDA threads need not be independent (see the kernel sketch below)
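A minimal sketch of how this loop maps onto a CUDA kernel, one thread per element; the kernel name and the fade parameter are illustrative, mirroring the darradd example developed later in these notes:
__global__ void fadearray (float *a, float fade, int N)
{
int i = blockIdx.x * blockDim.x + threadIdx.x; // one array element per thread
if (i < N) a[i] = a[i] * fade; // guard: the last block may be only partially used
}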
Programmer’s View of the GPU
• Threads within a block can cooperate with each other by:
– Synchronizing their execution
• For hazard-free shared memory accesses
– Efficiently sharing data through a low latency shared memory
• Two threads from two different blocks cannot cooperate
[Figure: a kernel launch creates a grid of thread blocks on the device (Grid 1, Grid 2); each block, e.g., Block (1, 1), contains threads Thread (0, 0) … Thread (4, 0).]
• Global memory
– Main means of communicating R/W Data between
host and device
– Contents visible to all threads
[Figure: CUDA software layers — CUDA libraries (e.g., fft()), the CUDA runtime API (cuda…() calls), and the CUDA driver API (cu…() calls).]
Reasoning about CUDA call ordering
• GPU communication via cuda…() calls and
kernel invocations
– cudaMalloc, cudaMemcpy, …
• Asynchronous from the CPU’s perspective
– CPU places a request in a “CUDA” queue
– requests are handled in-order
• Streams allow for multiple queues
– More on this much later on
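A minimal sketch of what this asynchrony looks like in host code (the kernel and variable names are illustrative):
// the launch is placed in the queue and control returns to the CPU immediately
akernel <<<blocks, threads_block>>> (da, N);
// ... the CPU is free to do other work here ...
cudaThreadSynchronize (); // blocks the CPU until all queued requests have completed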
CUDA API: Example
int a[N];
for (i = 0; i < N; i++)
a[i] = a[i] + x;
1. Allocate CPU Data Structure
2. Initialize Data on CPU
3. Allocate GPU Data Structure
4. Copy Data from CPU to GPU
5. Define Execution Configuration
6. Run Kernel
7. CPU synchronizes with GPU
8. Copy Data from GPU to CPU
9. De-allocate GPU and CPU memory
1. Allocate CPU Data Structure
float *ha;
main (int argc, char *argv[]) {
int N = atoi (argv[1]);
ha = (float *) malloc (sizeof (float) * N);
...
}
2. Initialize Data on CPU
int i;
for (i = 0; i < N; i++) ha[i] = i;
3. Allocate GPU Data Structure
float *da;
cudaMalloc ((void **) &da, sizeof (float) * N);
4. Copy Data from CPU to GPU
cudaMemcpy ((void *) da, (void *) ha, sizeof (float) * N, cudaMemcpyHostToDevice);
• enum cudaMemcpyKind
– cudaMemcpyHostToDevice
– cudaMemcpyDeviceToHost
– cudaMemcpyDeviceToDevice
5. Define Execution Configuration
• How many blocks and how many threads per block
• Round the block count up so that all N elements are covered:
blocks = (N + threads_block - 1) / threads_block;
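For example (the numbers are illustrative), with N = 1000 and threads_block = 64 this gives blocks = (1000 + 64 - 1) / 64 = 16, i.e., 16 × 64 = 1024 threads; the extra 24 threads do no work because of the i < N check in the kernel.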
6. Launch Kernel & 7. CPU/GPU Synchronization
• cudaThreadSynchronize ()
– Block CPU until all preceding cuda…() and kernel requests
have completed
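Putting steps 6 and 7 together, a minimal sketch using the darradd kernel and the execution configuration computed above:
darradd <<<blocks, threads_block>>> (da, 10.0f, N); // step 6: launch the kernel on the GPU
cudaThreadSynchronize (); // step 7: block the CPU until the kernel has completed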
8. Copy Data from GPU to CPU & 9. De-allocate Memory
float *da;
float *ha;
cudaMemcpy ((void *) ha, (void *) da, sizeof (float) * N, cudaMemcpyDeviceToHost);
cudaFree (da);
// display or process results here
free (ha);
The GPU Kernel
__global__ void darradd (float *da, float x, int N)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < N) da[i] = da[i] + x;
}
[Figure: mapping of threads to array elements, assuming blockDim.x = 64. Within each block threadIdx.x runs from 0 to 63, so i = blockIdx.x * blockDim.x + threadIdx.x covers i = 0 … 63 in block 0, i = 64 … 127 in block 1, and so on (… 128 … 255, 256 …).]
Generic Unique Thread and Block Index Calculations #1
• 1D Grid / 1D Blocks:
UniqueBlockIndex = blockIdx.x;
UniqueThreadIndex = blockIdx.x * blockDim.x +
threadIdx.x;
• 1D Grid / 2D Blocks:
UniqueBlockIndex = blockIdx.x;
UniqueThreadIndex = blockIdx.x * blockDim.x * blockDim.y
+ threadIdx.y * blockDim.x + threadIdx.x;
• 1D Grid / 3D Blocks:
UniqueBlockIndex = blockIdx.x;
UniqueThreadIndex = blockIdx.x * blockDim.x * blockDim.y
* blockDim.z + threadIdx.z * blockDim.y * blockDim.x +
threadIdx.y * blockDim.x + threadIdx.x;
• Source: http://forums.nvidia.com/lofiversion/index.php?t82040.html
Generic Unique Thread and Block Index Calculations #2
• 2D Grid / 1D Blocks:
UniqueBlockIndex = blockIdx.y * gridDim.x + blockIdx.x;
UniqueThreadIndex = UniqueBlockIndex * blockDim.x + threadIdx.x;
• 2D Grid / 2D Blocks:
UniqueBlockIndex = blockIdx.y * gridDim.x + blockIdx.x;
UniqueThreadIndex = UniqueBlockIndex * blockDim.y * blockDim.x +
threadIdx.y * blockDim.x + threadIdx.x;
• 2D Grid / 3D Blocks:
UniqueBlockIndex = blockIdx.y * gridDim.x + blockIdx.x;
UniqueThreadIndex = UniqueBlockIndex * blockDim.z * blockDim.y *
blockDim.x + threadIdx.z * blockDim.y * blockDim.x +
threadIdx.y * blockDim.x + threadIdx.x;
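A minimal sketch of a kernel that uses the 2D grid / 2D blocks calculation to index a flat array (the kernel name and array are illustrative):
__global__ void akernel2d (float *da, float x, int N)
{
int UniqueBlockIndex = blockIdx.y * gridDim.x + blockIdx.x;
int UniqueThreadIndex = UniqueBlockIndex * blockDim.y * blockDim.x +
threadIdx.y * blockDim.x + threadIdx.x;
if (UniqueThreadIndex < N) da[UniqueThreadIndex] += x;
}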
• uint3 blockIdx
– Block ID, in 2D (blockIdx.z = 1 always)
• dim3 blockDim
– Number of threads per block, in 3D
• uint3 threadIdx
– Thread ID in block, in 3D
Execution Configuration Examples
• 1D grid / 1D blocks
dim3 gd(1024);
dim3 bd(64);
akernel<<<gd, bd>>>(...);
gridDim.x = 1024, gridDim.y = 1,
blockDim.x = 64, blockDim.y = 1, blockDim.z = 1
• 2D grid / 3D blocks
dim3 gd(4, 128);
dim3 bd(64, 16, 4);
akernel<<<gd, bd>>>(...);
gridDim.x = 4, gridDim.y = 128,
blockDim.x = 64, blockDim.y = 16, blockDim.z = 4
Error Handling
• Most cuda…() functions return a cudaError_t
– If cudaSuccess: Request completed without a problem
• cudaGetLastError():
– returns the last error to the CPU
– Use with cudaThreadSynchronize():
cudaError_t code;
cudaThreadSynchronize ();
code = cudaGetLastError ();
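A minimal usage sketch (the kernel launch is illustrative); cudaGetErrorString() converts the error code into a readable message:
cudaError_t code;
darradd <<<blocks, threads_block>>> (da, 10.0f, N);
cudaThreadSynchronize ();
code = cudaGetLastError ();
if (code != cudaSuccess)
printf ("CUDA error: %s\n", cudaGetErrorString (code));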
cutCreateTimer (&htimer);
cudaThreadSynchronize ();
cutStartTimer(htimer);
WHAT WE ARE INTERESTED IN
cudaThreadSynchronize ();
cutStopTimer(htimer);
printf ("time: %f\n", cutGetTimerValue(htimer));
Code Overview: Host side
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cutil.h>
unsigned int htimer;
float *ha, *da;
int main (int argc, char *argv[]) {
int N = atoi (argv[1]);
ha = (float *) malloc (sizeof (float) * N);
for (int i = 0; i < N; i++) ha[i] = i;
cutCreateTimer (&htimer);
cudaMalloc ((void **) &da, sizeof (float) * N);
cudaMemcpy ((void *) da, (void *) ha, sizeof (float) * N,
cudaMemcpyHostToDevice);
cudaThreadSynchronize ();
cutStartTimer(htimer);
int threads_block = 64; // threads per block, as assumed earlier
int blocks = (N + threads_block - 1) / threads_block;
darradd <<<blocks, threads_block>>> (da, 10.0f, N);
cudaThreadSynchronize ();
cutStopTimer(htimer);
cudaMemcpy ((void *) ha, (void *) da, sizeof (float) * N,
cudaMemcpyDeviceToHost);
cudaFree (da);
free (ha);
printf ("processing time: %f\n", cutGetTimerValue(htimer));
return 0;
}
Code Overview: Device Side
__device__ float addmany (float a, float b, int count)
{
while (count--) a += b;
return a;
}
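A __device__ function can only be called from code running on the GPU. A minimal sketch of a kernel that uses addmany (the kernel name and count parameter are illustrative):
__global__ void darradd_many (float *da, float x, int count, int N)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < N) da[i] = addmany (da[i], x, count); // adds x to da[i] count times on the GPU
}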
• __device__
– stored in device memory (large, high latency, no cache)
– Allocated with cudaMalloc (__device__ qualifier implied)
– accessible by all threads
– lifetime: application
• __constant__
– same as __device__, but cached and read-only by GPU
– written by CPU via cudaMemcpyToSymbol(...) call
– lifetime: application
• __shared__
– stored in on-chip shared memory (very low latency)
– accessible by all threads in the same thread block
– lifetime: kernel launch
• Unqualified variables:
– scalars and built-in vector types are stored in registers
– arrays with more than 4 elements, or arrays accessed with run-time indices, are stored in device memory
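A minimal sketch putting these qualifiers together (the variable and kernel names are illustrative; blockDim.x is assumed to be 64 as in the earlier examples):
__constant__ float cfactor; // written by the CPU, read-only and cached on the GPU

__global__ void scale (float *da, int N)
{
__shared__ float tile[64]; // on-chip, visible to all threads of this block
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < N) {
tile[threadIdx.x] = da[i]; // staged here only to illustrate __shared__
da[i] = tile[threadIdx.x] * cfactor;
}
}
// host side, before the kernel launch (hfactor is a host float):
// cudaMemcpyToSymbol (cfactor, &hfactor, sizeof (float));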
Measurement Methodology
• You will not get exactly the same time
measurements every time
– Other processes running / external events (e.g., network
activity)
– Cannot control
– “Non-determinism”
• Must take sufficient samples
– say 10 or more
– There is theory on what the number of samples must be
• Measure average
• Will discuss this next time or will provide a handout
online
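A minimal sketch of taking several samples and averaging, using the cutil timer from the earlier slides (the sample count of 10 is illustrative):
float total = 0.0f;
for (int s = 0; s < 10; s++) { // say 10 or more samples
cutResetTimer (htimer);
cudaThreadSynchronize ();
cutStartTimer (htimer);
darradd <<<blocks, threads_block>>> (da, 10.0f, N);
cudaThreadSynchronize ();
cutStopTimer (htimer);
total += cutGetTimerValue (htimer);
}
printf ("average time: %f\n", total / 10);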
Handling Large Input Data Sets – 1D Example
• Recall gridDim.[xy] <= 65535
• Host calls kernel multiple times:
float *dac = da; // starting offset for current kernel
while (n_blocks)
{
int bn = n_blocks;
int elems; // array elements processed in this kernel
if (bn > 65535) bn = 65535;
elems = bn * block_size;
darradd <<<bn, block_size>>> (dac, 10.0f, elems);
n_blocks -= bn;
dac += elems;
}
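For context, n_blocks and block_size are assumed to have been set up before the loop along the lines of the earlier execution-configuration slide; a minimal sketch:
int block_size = 64; // threads per block (illustrative value)
int n_blocks = (N + block_size - 1) / block_size; // enough blocks to cover all N elements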