5. Moving to Parallel With CUDA - Hello Program

CUDA Programming

Hello Program
Outline
CUDA Programming
Function Qualifiers
Built-in Device Variables
Variable Qualifiers
Addition on the device
Moving to parallel using blocks
Moving to parallel using threads
Combining blocks and threads
CUDA Programming
• Kernels are C functions with some restrictions:
– Can only access GPU memory
– Must have void return type
– No variable number of arguments ("varargs")
– Not recursive
– No static variables
• Function arguments are automatically copied from CPU to GPU memory
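
A minimal sketch of a kernel that obeys these rules (the name add_one and its arguments are illustrative, not from the slides):

```cuda
// A legal kernel: void return type, fixed argument list, no recursion,
// no static locals. It touches only GPU memory: d_data must point to
// memory allocated with cudaMalloc; the scalar n is copied to the GPU
// automatically at launch.
__global__ void add_one(int *d_data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_data[i] += 1;
}

// WRONG: passing a CPU address to a kernel
// int x = 0;
// add_one<<<1, 32>>>(&x, 1);   // &x is host memory; the kernel cannot read it
```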
Function Qualifiers
• __global__ : invoked from host (CPU) code
– cannot be called from device (GPU) code
– must return void
• __device__ : called from other GPU functions
– cannot be called from host (CPU) code
• __host__ : can only be executed by the CPU, called from host code
• The __host__ and __device__ qualifiers can be combined
– Sample use: overloading operators
– Compiler will generate both CPU and GPU code
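
One way the three qualifiers might fit together, as a sketch (function names are illustrative, not from the slides):

```cuda
#include <cuda_runtime.h>

// __device__: callable only from GPU code.
__device__ float square_dev(float x) { return x * x; }

// __host__ __device__: the compiler emits both a CPU and a GPU version,
// so the same helper can be called from either side.
__host__ __device__ float clampf(float x, float lo, float hi) {
    return x < lo ? lo : (x > hi ? hi : x);
}

// __global__: launched from host code, must return void.
__global__ void kernel(float *d_out) {
    d_out[threadIdx.x] = clampf(square_dev((float)threadIdx.x), 0.0f, 100.0f);
}

int main(void) {
    float *d_out;
    cudaMalloc(&d_out, 32 * sizeof(float));
    kernel<<<1, 32>>>(d_out);    // host-side launch of the __global__ kernel
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```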
Variable Qualifiers (GPU code)
• __device__
– Stored in device memory (large, high latency, no cache)
– Allocated with cudaMalloc (__device__ qualifier implied)
– Accessible by all threads
– Lifetime: application
• __shared__
– Stored in on-chip shared memory (very low latency)
– Allocated by execution configuration or at compile time
– Accessible by all threads in the same thread block
– Lifetime: kernel execution
• Unqualified variables:
– Scalars and built-in vector types are stored in registers
– Arrays of more than 4 elements stored in device memory
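
A sketch of where each qualifier places its data (the kernel and the names d_counter and s_partial are illustrative, not from the slides):

```cuda
__device__ int d_counter;           // device memory, visible to all threads,
                                    // lives for the whole application

__global__ void sum_block(const int *d_in) {
    __shared__ int s_partial[256];  // on-chip shared memory: one copy per
                                    // thread block, lives for the kernel launch
    int idx   = blockIdx.x * blockDim.x + threadIdx.x;
    int local = d_in[idx];          // unqualified scalar: kept in a register

    s_partial[threadIdx.x] = local;
    __syncthreads();                // all threads of the block see s_partial

    if (threadIdx.x == 0)
        atomicAdd(&d_counter, s_partial[0]);
}
```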
CUDA Built-in Device Variables
All __global__ and __device__ functions have access to these automatically defined variables:
• dim3 gridDim;
– Dimensions of the grid in blocks (at most 2D)
• dim3 blockDim;
– Dimensions of the block in threads
• uint3 blockIdx;
– Block index within the grid
• uint3 threadIdx;
– Thread index within the block

© NVIDIA Corporation
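
The usual way these variables combine is to give each thread a unique global index, as in this sketch (the kernel scale is illustrative, not from the slides):

```cuda
// Each thread computes a unique global index from the built-in variables.
// With gridDim.x blocks of blockDim.x threads each, indices run from
// 0 to gridDim.x * blockDim.x - 1.
__global__ void scale(float *d_v, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    d_v[i] *= factor;
}

// Launch example: 4 blocks of 256 threads covers 1024 elements.
// scale<<<4, 256>>>(d_v, 2.0f);
```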
CUDA Compile
Hello World!

#include <stdio.h>

int main(void) {
    printf("Hello World!\n");
    return 0;
}

• Standard C that runs on the host
• The NVIDIA compiler (nvcc) can be used to compile programs with no device code

Output:
$ nvcc hello_world.cu
$ a.out
Hello World!
$

© NVIDIA 2013
Hello World! with Device Code
__global__ void mykernel(void) {
}

int main(void) {
mykernel<<<1,1>>>();
printf("Hello World!\n");
return 0;
}

Two new syntactic elements…

Hello World! with Device Code
mykernel<<<1,1>>>();

• Triple angle brackets mark a call from host code to device code
– Also called a "kernel launch"
– We'll return to the parameters (1,1) in a moment
• That's all that is required to execute a function on the GPU!
Hello World! with Device Code

__global__ void mykernel(void) {
}

int main(void) {
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}

Output:
$ nvcc hello.cu
$ a.out
Hello World!
$

• mykernel() does nothing

Hello World! with Device Code

__global__ void mykernel(void) {
    printf("Hello World!\n");
}

int main(void) {
    mykernel<<<1,1>>>();
    cudaDeviceSynchronize();  // wait for the kernel so the device-side printf is flushed
    return 0;
}

Output:
$ nvcc hello.cu
$ a.out
Hello World!
$

Hello World! with Device Code

__global__ void mykernel(void) {
    printf("Hello World!\n");
}

int main(void) {
    mykernel<<<2,2>>>();
    cudaDeviceSynchronize();  // wait for the kernel so the device-side printf is flushed
    return 0;
}

Output:
$ nvcc hello.cu
$ a.out
Hello World!
Hello World!
Hello World!
Hello World!
$
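
The four lines of output above are indistinguishable; printing the built-in index variables shows which block and thread produced each line. This is a small variation on the slide's kernel, not part of the original deck; note the order of the lines is unspecified:

```cuda
#include <stdio.h>

__global__ void mykernel(void) {
    printf("Hello World! from block %d, thread %d\n",
           blockIdx.x, threadIdx.x);
}

int main(void) {
    mykernel<<<2,2>>>();         // 2 blocks x 2 threads = 4 lines of output
    cudaDeviceSynchronize();     // flush the device-side printf buffer
    return 0;
}
```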