CS 179: GPU Computing
Recitation 2: Synchronization, Shared Memory
Atomic instructions on CUDA
• Serializes access
• atomic{Add, Sub, Exch, Min, Max, Inc, Dec, CAS, And, Or, Xor}
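A minimal sketch of how these are used (the kernel name and counter are hypothetical, not from the slides): atomicAdd performs the read-modify-write as one indivisible operation, so concurrent updates to the same address are serialized instead of lost.

__global__ void count_positive(const float *data, int n, int *counter) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0.0f) {
        // read, add, and write back happen as a single atomic operation
        atomicAdd(counter, 1);
    }
}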
Sum example
Naive:
● each thread atomically adds its number to an accumulator in global memory
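A hedged sketch of the naive approach (kernel name hypothetical): every thread issues its own atomicAdd, so all n updates contend for the single accumulator address and are serialized.

__global__ void sum_naive(const float *data, int n, float *accumulator) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // one atomic per element: heavy contention on *accumulator
        atomicAdd(accumulator, data[i]);
    }
}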
Smarter solution:
● each thread computes its own sum in a register
● use warp shuffle (next slide) to compute the sum over the warp
● each warp does a single atomic add to the accumulator in global memory
● reduces the number of atomic instructions by a factor of 32 (the warp size)
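A sketch of the smarter version (kernel name hypothetical; assumes blockDim.x is a multiple of 32). The shuffle intrinsic is spelled __shfl_down_sync in CUDA 9 and later, __shfl_down in older toolkits.

__global__ void sum_warp(const float *data, int n, float *accumulator) {
    float local = 0.0f;
    // each thread accumulates its share in a register (grid-stride loop)
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        local += data[i];
    }
    // warp shuffle reduction: sum over the 32 lanes of the warp
    for (int offset = 16; offset > 0; offset /= 2) {
        local += __shfl_down_sync(0xffffffff, local, offset);
    }
    // only lane 0 touches global memory: 32x fewer atomics than the naive version
    if ((threadIdx.x & 31) == 0) {
        atomicAdd(accumulator, local);
    }
}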
Warp-synchronous programming
What if I only need to synchronize between all
threads in a warp?
Warps are already synchronized!
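A minimal sketch of the idea, assuming a single-warp block and a hypothetical kernel name: threads in one warp can exchange data without a block-wide __syncthreads(). On the GPUs this slide assumes, the warp executes in lockstep; on Volta and newer GPUs the cheap warp-level barrier __syncwarp() makes the same pattern explicit.

__global__ void rotate_within_warp(float *out) {
    __shared__ float buf[32];
    int lane = threadIdx.x & 31;       // lane index within the warp
    buf[lane] = (float)lane;           // each lane publishes a value
    __syncwarp();                      // warp-level barrier, cheaper than __syncthreads()
    out[lane] = buf[(lane + 1) & 31];  // read the neighboring lane's value
}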
Bank Conflicts
Example of an SM’s shared memory cache
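Shared memory is divided into 32 banks, with successive 4-byte words mapped to successive banks; when several lanes of a warp touch different words in the same bank, the accesses are serialized. A hedged sketch of the worst case (kernel name hypothetical, single-warp block assumed for simplicity):

__global__ void column_conflict(float *out) {
    __shared__ float tile[32][32];
    int lane = threadIdx.x & 31;
    // column access: addresses are 32 floats apart, so every lane hits bank 0
    // and the warp's accesses are split into 32 serialized transactions
    tile[lane][0] = (float)lane;
    __syncthreads();
    out[threadIdx.x] = tile[lane][0];
}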
Avoiding bank conflicts
You can choose x and y (e.g., the dimensions of your shared-memory array) to avoid bank conflicts; see the sketch below.
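Assuming x and y refer to the dimensions of a shared-memory array, the standard trick is to pad the row length so that elements of a column land in different banks; a hedged sketch (kernel name hypothetical, single-warp block assumed):

__global__ void column_padded(float *out) {
    __shared__ float tile[32][33];   // 33 = 32 + 1 padding column
    int lane = threadIdx.x & 31;
    // addresses are now 33 floats apart: lane i maps to bank (33*i) % 32 = i,
    // so the same column access is conflict-free
    tile[lane][0] = (float)lane;
    __syncthreads();
    out[threadIdx.x] = tile[lane][0];
}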
● Namespace TA_Utilities