
CS 179: GPU Computing
Recitation 2: Synchronization, Shared Memory, Matrix Transpose
Synchronization
Ideal case for parallelism:
● no resources shared between threads
● no communication between threads

Many algorithms that require just a little bit of resource sharing can still be
accelerated by the massive parallelism of the GPU.
Examples needing synchronization
(1) Parallel BFS
(2) Summing a list of numbers
(3) Loading data into a GPU’s shared memory
__syncthreads()
• __syncthreads() synchronizes all threads in a block.

• Remember that shared memory is per block. Every block that is launched
  has to allocate shared memory for itself on its resident SM.

• The __syncthreads() call is very useful for kernels that use shared memory.
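As a minimal sketch of this pattern (the kernel and names here are illustrative, not part of the course code), each thread stages one element into shared memory, the block synchronizes, and only then do threads read elements written by other threads:

__global__
void reverseBlockKernel(const float *input, float *output) {
    // hypothetical example: each block of 128 threads reverses its own
    // 128-element chunk of the input (assumes the input length equals
    // gridDim.x * blockDim.x and blockDim.x == 128)
    __shared__ float buf[128];

    const int idx = threadIdx.x + blockDim.x * blockIdx.x;
    buf[threadIdx.x] = input[idx];   // each thread stages one element

    __syncthreads();                 // wait until the whole block has written buf

    // now it is safe to read an element written by a different thread
    output[idx] = buf[blockDim.x - 1 - threadIdx.x];
}

Note that every thread in the block must reach the __syncthreads() call; placing it inside a divergent branch can deadlock the kernel.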
Atomic instructions: motivation
Two threads try to increment variable x=42 concurrently.
Final value should be 44.
Possible execution order:
thread 0 load x (=42) into register r0
thread 1 load x (=42) into register r1
thread 0 increment r0 to 43
thread 1 increment r1 to 43
thread 0 store r0 (=43) into x
thread 1 store r1 (=43) into x

Actual final value of x: 43 :(
Atomic instructions
• An atomic instruction executes as a single unit and cannot be interrupted.

• Serializes access
Atomic instructions on CUDA
atomic{Add, Sub, Exch, Min, Max, Inc, Dec, CAS, And, Or, Xor}

Syntax: atomicAdd(float *address, float val)

Work in both global and shared memory!
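For example, the race on x = 42 from the motivation slide disappears if the read-modify-write is done atomically (a minimal sketch; the kernel name is illustrative):

__global__
void incrementKernel(int *x) {
    // the load, add, and store happen as one indivisible operation,
    // so two concurrent increments always yield x + 2
    atomicAdd(x, 1);
}

Launching this kernel with two threads on x = 42 always produces 44.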


(Synchronization) budget advice
Do more cheap things and fewer expensive things!

Example: computing the sum of a list of numbers

Naive:
each thread atomically adds its number to an accumulator in global memory
Sum example
Smarter solution:
● each thread computes its own sum in a register
● use warp shuffle (next slide) to compute the sum over the warp; see the
  sketch after that slide
● each warp does a single atomic add to the accumulator in global memory
● reduces the number of atomic instructions by a factor of 32 (the warp size)
Warp-synchronous programming
What if I only need to synchronize between all threads in a warp?
Warps are already synchronized!

Can reduce __syncthreads() calls


Warp shuffle
Read the value of a register from another thread in the warp.
int __shfl(int var, int srcLane, int width=warpSize)

Extremely useful for computing the sum of values across a warp.

First available on Kepler (not Fermi; requires compute capability >= 3.0)
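Putting the last two slides together, here is a minimal sketch of the "one atomic per warp" sum described earlier. It is illustrative only, and it uses __shfl_down_sync, the synchronized variant of the shuffle that CUDA 9 and later require, instead of the older __shfl shown above:

__global__
void sumKernel(const float *data, int n, float *accumulator) {
    // each thread accumulates its own partial sum in a register
    float sum = 0.0f;
    for (int i = threadIdx.x + blockDim.x * blockIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        sum += data[i];

    // warp shuffle reduction: after 5 steps, lane 0 holds the warp's total
    for (int offset = 16; offset > 0; offset /= 2)
        sum += __shfl_down_sync(0xffffffff, sum, offset);

    // one atomic per warp instead of one per thread
    if (threadIdx.x % 32 == 0)
        atomicAdd(accumulator, sum);
}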


Quick Aside: blur_v from Lab 1
Shared memory is a great place to put blur_v:

1) blur_v is relatively small and easily fits in shared memory.

2) Every thread reads from blur_v.

3) Stride 0 access: no bank conflicts when i > GAUSSIAN_SIZE
   (the majority of threads).
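As a rough sketch of the idea (the kernel body and the GAUSSIAN_SIZE value below are illustrative, not the Lab 1 reference code; only the names blur_v and GAUSSIAN_SIZE come from the lab), the first GAUSSIAN_SIZE threads of each block can copy blur_v into shared memory before any thread uses it:

#define GAUSSIAN_SIZE 128   // placeholder value, assumed <= blockDim.x

__global__
void blurKernel(const float *in, const float *blur_v, float *out, int n) {
    __shared__ float s_blur_v[GAUSSIAN_SIZE];

    // cooperatively stage blur_v into shared memory, then synchronize
    if (threadIdx.x < GAUSSIAN_SIZE)
        s_blur_v[threadIdx.x] = blur_v[threadIdx.x];
    __syncthreads();

    const int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) {
        float acc = 0.0f;
        // every thread in a warp reads the same s_blur_v[j] in the same
        // iteration (a stride 0 broadcast), so once all lanes have
        // i >= GAUSSIAN_SIZE there are no bank conflicts
        for (int j = 0; j < GAUSSIAN_SIZE && j <= i; j++)
            acc += s_blur_v[j] * in[i - j];
        out[i] = acc;
    }
}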
Lab 2
(1) Questions on latency hiding, thread divergence, coalesced memory access,
bank conflicts, and instruction dependencies.

(2) What you actually have to do: comment on all non-coalesced memory accesses
and bank conflicts in the provided kernel code. Then improve the matrix
transpose kernel using cache and memory optimizations.
Matrix Transpose
An interesting IO problem, because you have a stride 1 access and a stride n
access. Not a trivial access pattern like “blur_v” from Lab 1.

Transpose is just a fancy memcpy, so memcpy provides a great performance target.

Note: This example output is for a clean project without the shmem and optimal
kernels completed. Your final output should show a decline in kernel time across
the different kernels.
Matrix Transpose
__global__
void naiveTransposeKernel(const float *input, float *output, int n) {
    // launched with (64, 16) block size and (n / 64, n / 64) grid size
    // each block transposes a 64x64 block

    const int i = threadIdx.x + 64 * blockIdx.x;
    int j = 4 * threadIdx.y + 64 * blockIdx.y;
    const int end_j = j + 4;

    for (; j < end_j; j++) {
        output[j + n * i] = input[i + n * j];
    }
}
Shared memory & matrix transpose
Idea to avoid non-coalesced accesses:
● Load from global memory with stride 1
● Store into shared memory with stride x
● __syncthreads()
● Load from shared memory with stride y
● Store to global memory with stride 1

Choose values of x and y that perform the transpose (a sketch appears after the
bank conflict discussion below).


Bank Conflicts
Example of an SM’s shared memory cache

Let’s populate shared memory with random integers. Here’s what the first 8 of
32 banks look like:

[Figure slides: successive views of the first 8 of 32 shared memory banks.]
Avoiding bank conflicts
You can choose x and y to avoid bank conflicts.

Remember that there are 32 banks and the GPU runs threads in batches of 32
(called warps).

A stride n access to shared memory avoids bank conflicts iff gcd(n, 32) == 1.
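Combining the shared memory idea with this rule, here is a minimal sketch of a padded-tile transpose. It is illustrative only, not the lab’s reference solution; it keeps the (64, 16) block and 64x64 tile layout of the naive kernel and pads each tile row to 65 floats so the strided shared memory accesses hit distinct banks, since gcd(65, 32) == 1:

__global__
void shmemTransposeSketch(const float *input, float *output, int n) {
    // 64x64 tile padded to 65 columns: column-wise accesses then have
    // stride 65, and gcd(65, 32) == 1, so they avoid bank conflicts
    __shared__ float tile[64][65];

    const int i = threadIdx.x + 64 * blockIdx.x;   // global column (stride 1)
    int j = 4 * threadIdx.y + 64 * blockIdx.y;     // global row

    // coalesced read from global memory, strided store into shared memory
    for (int k = 0; k < 4; k++)
        tile[threadIdx.x][4 * threadIdx.y + k] = input[i + n * (j + k)];

    __syncthreads();

    // swap the block indices so the tile lands in its transposed position
    const int ti = threadIdx.x + 64 * blockIdx.y;  // output column (stride 1)
    int tj = 4 * threadIdx.y + 64 * blockIdx.x;    // output row

    // strided read from shared memory, coalesced write to global memory
    for (int k = 0; k < 4; k++)
        output[ti + n * (tj + k)] = tile[4 * threadIdx.y + k][threadIdx.x];
}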
ta_utils.cpp
● Included in the UNIX version of this set

● Should minimize lag or infinite waits on GPU function calls.

● Please leave these functions in the code if you are using Haru.

● Namespace TA_Utilities