Overview of GPGPUs
Deepika H.V
C-DAC Knowledge Park
[email protected]
Software
Based on industry-standard C
Prerequisites
You (probably) need experience with C or C++
CPU threads typically take thousands of clock cycles to generate and schedule.
Triple angle brackets mark a call from host code to device code
-- Sometimes called a “kernel launch”
-- We’ll discuss the parameters inside the angle brackets later (a minimal sketch follows)
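A minimal launch sketch, assuming a made-up kernel named fill and a configuration of 1 block of 64 threads (both are illustrative choices, not from the slides):

// Hypothetical kernel: each thread writes one element of the array.
__global__ void fill(int *data, int value)
{
    data[threadIdx.x] = value;
}

int main(void)
{
    int *d_data;
    cudaMalloc((void **)&d_data, 64 * sizeof(int));

    // Triple angle brackets: <<<blocks in the grid, threads per block>>>
    fill<<<1, 64>>>(d_data, 7);

    cudaDeviceSynchronize();   // wait until the kernel has finished
    cudaFree(d_data);
    return 0;
}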
Memory Copy
cudaMemcpy(dst, src, count, kind)
-- dst: pointer to the destination location
-- src: pointer to the source data object
-- count: number of bytes to copy
-- kind: type of memory involved in the copy, i.e. the direction of the transfer (e.g. host to device)
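A usage sketch under these assumptions: a single float array of 256 elements and the illustrative names h_a / d_a:

#include <cuda_runtime.h>
#include <stdlib.h>

int main(void)
{
    const int N = 256;
    size_t bytes = N * sizeof(float);

    float *h_a = (float *)malloc(bytes);   // host buffer
    float *d_a;
    cudaMalloc((void **)&d_a, bytes);      // device (global memory) buffer

    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);   // host -> device

    // ... launch kernels that work on d_a ...

    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);   // device -> host

    cudaFree(d_a);
    free(h_a);
    return 0;
}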
What is a kernel?
How do you call a kernel?
How do you synchronize threads?
[Figure: an array of threads, threadID 0, 1, 2, 3, 4, 5, 6, ..., each executing the same kernel body]

float x = input[threadID];
float y = func(x);
output[threadID] = y;
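A runnable version of this idea, where the undefined func(x) is replaced by an illustrative squaring operation and threadID is computed from the block and thread indices:

// Each thread processes one element; every thread runs the same code.
__global__ void apply_func(const float *input, float *output, int n)
{
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;   // global thread ID
    if (threadID < n) {
        float x = input[threadID];
        float y = x * x;              // stand-in for func(x)
        output[threadID] = y;
    }
}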
Kernel invocation

[Figure: how are, say, 1024 threads organized? Kernel 1 launches Grid 1 with Block (0,0) .. Block (2,1); Kernel 2 launches Grid 2; Block (1,1) contains Thread (0,0) .. Thread (4,1)]

-- The computational grid consists of a grid of thread blocks
-- The application specifies the grid and block size
-- The grid layouts can be 1- or 2-dimensional
-- The maximal sizes are determined by GPU memory and card capability
-- Each block has a unique block ID
-- Each thread has a unique thread ID (within the block)
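An illustrative 2D configuration matching the block and thread labels in the figure (a 3x2 grid of 5x2 blocks; the kernel name and output array are made up):

__global__ void show_ids(int *out, int width)
{
    int tx = threadIdx.x, ty = threadIdx.y;   // thread ID within the block
    int bx = blockIdx.x,  by = blockIdx.y;    // block ID within the grid

    int col = bx * blockDim.x + tx;           // global 2D position
    int row = by * blockDim.y + ty;
    out[row * width + col] = row * width + col;
}

// Host-side launch:
//   dim3 grid(3, 2);    // Block (0,0) .. Block (2,1)
//   dim3 block(5, 2);   // Thread (0,0) .. Thread (4,1)
//   show_ids<<<grid, block>>>(d_out, 3 * 5);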
Purpose:
-- Global memory: I/O for the grid
-- Shared memory: thread collaboration
-- Registers: per-thread space

Global Memory
-- Main means of communicating R/W data between host and device
-- Contents visible to all threads
-- Long latency (100s of cycles)
-- Off-chip, read/write access
-- Host can read/write
-- Capacity: up to 4 GB on GT200, up to 6 GB on GF100
Summary

Memory           Scope       Access      Location  Cached
Registers        Per thread  Read-write  On-chip   No
Local memory     Per thread  Read-write  Off-chip  No
Shared memory    Per block   Read-write  On-chip   No
Global memory    Per grid    Read-write  Off-chip  No
Constant memory  Per grid    Read-only   Off-chip  Yes
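A declaration sketch showing where each space appears in code (all names are illustrative):

__constant__ float coeffs[16];                 // constant memory: per grid, read-only, cached

__global__ void memory_spaces(float *g_data)   // g_data points into global memory (per grid)
{
    __shared__ float tile[128];                // shared memory: per block, on-chip
    float r = g_data[threadIdx.x];             // r lives in a register: per thread

    tile[threadIdx.x] = r * coeffs[0];
    __syncthreads();
    g_data[threadIdx.x] = tile[threadIdx.x];
}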
Thread Synchronization

Every thread in the block (Thread 0, 1, 2, 3, ...) runs the same code:

Mds[i] = Md[j];
__syncthreads();
func(Mds[i], Mds[i+1]);

The animation frames (Time 0 through Time 5) show the threads reaching the barrier at different times: no thread executes func() until every thread in the block has passed __syncthreads(), so each thread's write to Mds[i] is complete before a neighbouring thread reads Mds[i+1].
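A minimal sketch of this pattern as a complete kernel, assuming a block size of 256 and taking func() to be an adjacent-pair sum (both choices are illustrative):

__global__ void neighbour_sum(const float *Md, float *out, int n)
{
    __shared__ float Mds[256];                  // assumes blockDim.x == 256
    int i = threadIdx.x;
    int j = blockIdx.x * blockDim.x + i;

    if (j < n)
        Mds[i] = Md[j];                         // stage data in shared memory
    __syncthreads();                            // barrier: all writes are done

    if (j + 1 < n && i + 1 < blockDim.x)
        out[j] = Mds[i] + Mds[i + 1];           // safe: Mds[i+1] is now valid
}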
Sample: Dot Product
c = a · b
c = (a0, a1, a2, a3) · (b0, b1, b2, b3)
c = a0*b0 + a1*b1 + a2*b2 + a3*b3
[Figure: each thread computes one product -- Block 0 handles A0*B0, A1*B1, A2*B2, A3*B3, A4*B4, ...; Block N handles A512*B512, A513*B513, A514*B514, A515*B515, A516*B516, ... After __syncthreads(), the products within a block are added into a partial sum. How do we add the partial sums from all blocks into C?]
*c += sum is a Read-Modify-Write:
-- Block 0 (sum = 3): reads *c = 0, computes 0 + 3, writes *c = 3
-- Block 1 (sum = 4): reads *c = 3, computes 3 + 4, writes *c = 7
If one block reads *c before another block's write lands, a partial sum is lost, so the read-modify-write must be atomic.
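A sketch of the full dot-product kernel under these assumptions: 256 threads per block, and atomicAdd used to make *c += sum safe across blocks (float atomicAdd requires compute capability 2.0 or later):

__global__ void dot(const float *a, const float *b, float *c, int n)
{
    __shared__ float cache[256];                 // one product per thread
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    cache[threadIdx.x] = (tid < n) ? a[tid] * b[tid] : 0.0f;
    __syncthreads();

    // Tree reduction of the block's products in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();
    }

    // One atomic read-modify-write of the global result per block.
    if (threadIdx.x == 0)
        atomicAdd(c, cache[0]);
}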
nvcc outputs:
- C code
- assembly code (PTX)
- or object code directly
Debugging
-- cuda-gdb
-- cuda-memcheck
-- Parallel Nsight Debugger
Performance Analysis
-- CUDA Visual Profiler
-- Parallel Nsight Analyser
Note: speedup is calculated from computation time only; data-transfer time is not included.
References