Module 05 Massive Multi-Core Programming GPGPUs, CUDA
• Block diagrams of the Nvidia Titan GPU and the Intel i7-5960X
octa-core CPU make it clear that while cache memory dominates
the die in the CPU case, compute logic dominates in the case of
the GPU.
CSIS BITS Pilani
Nvidia’s KEPLER
• Kepler is Nvidia's third GPU architecture.
• The cores in a Kepler GPU are arranged in groups
called Streaming Multiprocessors (abbreviated SMX
in Kepler).
• Each Kepler SMX contains 192 cores that execute in a
SIMD fashion, i.e., they run the same sequence of
instructions but on different data. Each SMX can run
its own program.
• The most powerful chip in the Kepler family is the GTX Titan,
with a total of 15 SMXs. One of the SMXs is disabled in order
to improve production yields, resulting in a total of
14 · 192 = 2688 cores!
CUDA
CUDA’S Programming Model: Threads, Blocks, Grids
• GPUs are coprocessors that can be used to accelerate
parts of a program.
• A CUDA program executes like a sequential program,
delegating work to the GPU whenever parallelism is required.
#include <iostream>
#include <algorithm>
#include <cstdlib>

#define N 1024
#define RADIUS 3
#define BLOCK_SIZE 16

// parallel fn: the 1D stencil kernel, executed on the device
__global__ void stencil_1d(int *in, int *out) {
  __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
  int gindex = threadIdx.x + blockIdx.x * blockDim.x;
  int lindex = threadIdx.x + RADIUS;
  // Read input elements into shared memory (including the halo)
  temp[lindex] = in[gindex];
  if (threadIdx.x < RADIUS) {
    temp[lindex - RADIUS] = in[gindex - RADIUS];
    temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
  }
  // Synchronize (ensure all the data is available)
  __syncthreads();
  // Apply the stencil
  int result = 0;
  for (int offset = -RADIUS; offset <= RADIUS; offset++)
    result += temp[lindex + offset];
  out[gindex] = result;
}

int main(void) {
  int *in, *out;     // host copies
  int *d_in, *d_out; // device copies
  int size = (N + 2 * RADIUS) * sizeof(int);
  // serial code: allocate and initialize host memory
  in = (int *)malloc(size);  std::fill_n(in, N + 2 * RADIUS, 1);
  out = (int *)malloc(size); std::fill_n(out, N + 2 * RADIUS, 1);
  // Allocate device memory
  cudaMalloc((void **)&d_in, size);
  cudaMalloc((void **)&d_out, size);
  // Copy to device
  cudaMemcpy(d_in, in, size, cudaMemcpyHostToDevice);
  cudaMemcpy(d_out, out, size, cudaMemcpyHostToDevice);
  // parallel code: launch the kernel (offset past the halo), copy result back
  stencil_1d<<<N / BLOCK_SIZE, BLOCK_SIZE>>>(d_in + RADIUS, d_out + RADIUS);
  cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);
  // Cleanup (serial code)
  free(in); free(out);
  cudaFree(d_in); cudaFree(d_out);
  return 0;
}
• Output:
$ nvcc hello_world.cu
$ ./a.out
Hello World!
$
The program can be compiled and executed as follows (CUDA programs should
be stored in files with a .cu extension):
$ nvcc -arch=sm_20 hello.cu -o hello
$ ./hello
• The “architecture” switch (-arch=sm_20) in the Nvidia CUDA Compiler
(nvcc) driver command line above instructs the compiler to generate GPU
code for a device of compute capability 2.0; the resulting code is
compatible with devices of capability 2.0 and higher.
Hello World! In CUDA
[Figure: four blocks of M = 8 threads each, numbered 0–7 within every
block; the highlighted thread has blockIdx.x = 2 and threadIdx.x = 5,
i.e., global index 2 · 8 + 5 = 21.]
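The indexing scheme in the figure can be sketched as a small kernel; each thread combines its block and thread IDs into a unique global index (the kernel name and launch configuration below are illustrative, not from the slides):

```cuda
#include <cstdio>

// Each thread computes its global index from its block and thread IDs.
// With 8 threads per block, the thread with blockIdx.x = 2 and
// threadIdx.x = 5 gets global index 2 * 8 + 5 = 21.
__global__ void print_global_id(void) {
  int gid = blockIdx.x * blockDim.x + threadIdx.x;
  printf("block %d, thread %d -> global id %d\n",
         blockIdx.x, threadIdx.x, gid);
}

int main(void) {
  print_global_id<<<4, 8>>>(); // 4 blocks of 8 threads: global ids 0..31
  cudaDeviceSynchronize();     // wait for the device-side printf output
  return 0;
}
```

Note that device-side printf requires compute capability 2.0 or higher, matching the -arch=sm_20 switch discussed above.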
• Because threads execute in blocks, and each block executes warp by warp,
explicit synchronization of the threads must take place between the discrete
phases of a kernel, e.g., between initializing a shared-memory histogram
array and starting to calculate the histogram, and so on.
• The __syncthreads() function can be called inside a kernel to act as a
barrier for all the threads in a block.
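The histogram example mentioned above can be sketched as follows; this is a minimal sketch, and the 256-bin layout, kernel name, and grid-stride loop are assumptions rather than code from the slides:

```cuda
#define BINS 256

// Shared-memory histogram with barriers separating the kernel's phases.
__global__ void histogram(const unsigned char *in, int n, unsigned int *out) {
  __shared__ unsigned int hist[BINS];

  // Phase 1: all threads cooperate to zero the shared-memory histogram.
  for (int i = threadIdx.x; i < BINS; i += blockDim.x)
    hist[i] = 0;
  __syncthreads(); // barrier: every bin is zeroed before counting starts

  // Phase 2: accumulate counts with atomic updates in shared memory.
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += gridDim.x * blockDim.x)
    atomicAdd(&hist[in[i]], 1u);
  __syncthreads(); // barrier: all counts are in before writing out

  // Phase 3: merge this block's histogram into the global result.
  for (int i = threadIdx.x; i < BINS; i += blockDim.x)
    atomicAdd(&out[i], hist[i]);
}
```

Without the first barrier, fast threads could start counting into bins that slower threads have not yet zeroed; the barrier is what makes the phases discrete.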
• Streams
For a given stream, cudaStreamSynchronize() will block until all the
operations queued in that stream are complete.
• Events
The time instant at which a command (and everything preceding it
in a stream) completes can be captured in the form of an
event. CUDA uses the cudaEvent_t type for managing events.
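A typical use of cudaEvent_t is timing a kernel by recording one event before it and one after it in the same stream; a minimal, self-contained sketch (the kernel being timed is a placeholder of my own, not from the slides):

```cuda
#include <cstdio>

// Placeholder kernel: trivial per-element work to have something to time.
__global__ void busy_kernel(float *x, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

int main(void) {
  const int n = 1 << 20;
  float *d_x;
  cudaMalloc(&d_x, n * sizeof(float));

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start, 0);                    // marker before the kernel
  busy_kernel<<<(n + 255) / 256, 256>>>(d_x, n);
  cudaEventRecord(stop, 0);                     // marker after the kernel

  cudaEventSynchronize(stop);                   // block until "stop" completes
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);       // elapsed time in milliseconds
  printf("kernel took %.3f ms\n", ms);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  cudaFree(d_x);
  return 0;
}
```

Because the events are enqueued in the same stream as the kernel, the elapsed time measures the kernel alone, not any host-side overhead before or after the launch.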
• OpenCL:

__kernel void vecadd(__global const float *a, __global const float *b,
                     __global float *result) {
  int id = get_global_id(0);
  result[id] = a[id] + b[id];
}

• CUDA:

__global__ void cuenergy(…) {
  unsigned int xindex = blockIdx.x * blockDim.x + threadIdx.x;
  unsigned int yindex = blockIdx.y * blockDim.y + threadIdx.y;
  unsigned int outaddr = gridDim.x * blockDim.x * UNROLLX * yindex + xindex;
  …
}
// Set the arguments of the kernel
clStatus = clSetKernelArg(kernel, 0, sizeof(float), (void *)&alpha);
clStatus = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&A_clmem);
clStatus = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&B_clmem);
clStatus = clSetKernelArg(kernel, 3, sizeof(cl_mem), (void *)&C_clmem);

// Execute the OpenCL kernel on the list
size_t global_size = VECTOR_SIZE; // Process the entire list
size_t local_size = 64;           // Process 64 items per work-group
clStatus = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
                                  &global_size, &local_size,
                                  0, NULL, NULL);

// Finally release all OpenCL allocated objects and host buffers
clStatus = clReleaseKernel(kernel);
clStatus = clReleaseProgram(program);
clStatus = clReleaseMemObject(A_clmem);
clStatus = clReleaseMemObject(B_clmem);
clStatus = clReleaseMemObject(C_clmem);
clStatus = clReleaseCommandQueue(command_queue);
clStatus = clReleaseContext(context);
free(A);
free(B);
free(C);
free(platforms);
free(device_list);
return 0;
}
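For comparison, the OpenCL host sequence above collapses to very little code in CUDA: the triple-chevron launch syntax plays the role of clSetKernelArg plus clEnqueueNDRangeKernel, with the grid and block dimensions standing in for global_size and local_size. A minimal sketch (the SAXPY kernel and sizes here are assumptions chosen to mirror the alpha/A/B/C arguments above):

```cuda
// SAXPY kernel: C[i] = alpha * A[i] + B[i], one thread per element.
__global__ void saxpy(float alpha, const float *A, const float *B,
                      float *C, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) C[i] = alpha * A[i] + B[i];
}

// Launch: kernel arguments are passed directly, and
// <<<grid, block>>> corresponds to global_size / local_size.
// saxpy<<<(n + 63) / 64, 64>>>(alpha, d_A, d_B, d_C, n);
```

The explicit argument-marshalling and release calls of OpenCL buy portability across vendors; CUDA trades that portability for the terser, type-checked launch syntax.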
Thank You