Unit 5 Part 2

The document discusses synchronization and scalability in CUDA programming, highlighting the use of barrier synchronization with __syncthreads() to coordinate thread execution within blocks. It explains the assignment of thread blocks to streaming multiprocessors (SMs) and the importance of maintaining resource limits to prevent excessive waiting times. Additionally, it covers thread scheduling through warps to optimize execution and manage latency effectively.


SYNCHRONIZATION AND TRANSPARENT SCALABILITY
Coordinate the execution of multiple threads
• CUDA allows threads in the same block to coordinate their activities
using a barrier synchronization function __syncthreads().
• When a kernel function calls __syncthreads(), all threads in a block
will be held at the calling location until every thread in the block
reaches the location.
• It ensures that all threads in a block have completed a phase of their
execution of the kernel before any of them can move on to the next
phase.
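As a minimal sketch (an added illustration, not code from the text), the kernel below uses __syncthreads() to separate two phases of execution within a block; the kernel name shiftLeft and the block size of 256 are assumptions.

// Illustrative kernel (assumed to be launched with blocks of BLOCK_SIZE
// threads). Phase 1: each thread stages one element in shared memory.
// The barrier guarantees phase 1 is complete for the whole block before
// phase 2 reads an element written by a neighbouring thread.
#define BLOCK_SIZE 256

__global__ void shiftLeft(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK_SIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Phase 1: stage this thread's element.
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;

    // All threads in the block are held here until every thread arrives.
    __syncthreads();

    // Phase 2: safe to read the element written by the next thread.
    if (i < n && threadIdx.x + 1 < blockDim.x)
        out[i] = tile[threadIdx.x + 1];
}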
Barrier synchronization: __syncthreads()

• In CUDA, a __syncthreads() statement, if present, must be executed by all threads in a block.
• When a __syncthreads() statement is placed in an if statement, either all threads in a block execute the path that includes the __syncthreads() or none of them does.
• For an if-then-else statement, if each path has a __syncthreads() statement, either all threads in a block execute the __syncthreads() on the then path or all of them execute the __syncthreads() on the else path.
• The two __syncthreads() are different barrier synchronization points.
• If a thread in a block executes the then path and another executes the else
path, they would be waiting at different barrier synchronization points.
• They would end up waiting for each other forever. It is the programmer's responsibility to write code so that these requirements are satisfied.
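The hypothetical kernels below (added for illustration; the names and the per-thread work are assumptions) contrast the unsafe pattern described above with a safe one.

// UNSAFE (sketch): even and odd threads of the same block reach two
// *different* __syncthreads() calls, i.e., two different barrier
// synchronization points, and may end up waiting for each other forever.
__global__ void divergentBarrier_unsafe(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0) {
        data[i] *= 2.0f;      // work on the "then" path
        __syncthreads();      // barrier A
    } else {
        data[i] += 1.0f;      // work on the "else" path
        __syncthreads();      // barrier B: a different synchronization point
    }
}

// SAFE (sketch): the condition depends only on blockIdx, so it evaluates the
// same way for every thread in a block; either all of them execute this
// __syncthreads() or none of them does.
__global__ void uniformBarrier_safe(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (blockIdx.x % 2 == 0) {
        data[i] *= 2.0f;
        __syncthreads();
    }
}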
Synchronize
• The ability to synchronize also imposes execution constraints on
threads within a block. These threads should execute in close time
proximity with each other to avoid excessively long waiting times.
• In fact, one needs to make sure that all threads involved in the barrier
synchronization have access to the necessary resources to eventually
arrive at the barrier.
• Otherwise, a thread that never arrives at the barrier synchronization point can cause all other threads to wait forever.
CUDA runtime systems
• CUDA runtime systems satisfy this constraint by assigning execution
resources to all threads in a block as a unit.
• A block can begin execution only when the runtime system has secured
all the resources needed for all threads in the block to complete
execution.
• When a thread of a block is assigned to an execution resource, all other
threads in the same block are also assigned to the same resource.
• This ensures the time proximity of all threads in a block and prevents
excessive or indefinite waiting time during barrier synchronization.
• The CUDA runtime system can execute blocks in any order relative to each other, since none of them need to wait for each other.
Lack of synchronization constraints between
blocks enables transparent scalability for CUDA
programs.
ASSIGNING RESOURCES TO BLOCKS
• Once a kernel is launched, the CUDA runtime system generates the
corresponding grid of threads.
• Threads are assigned to execution resources on a block-by-block
basis.
• The execution resources are organized into streaming multiprocessors
(SMs).
Thread block assignment to SMs.

(Figure: three thread blocks are assigned to each SM.)
• Multiple thread blocks can be assigned to each SM.
• Each device has a limit on the number of blocks that can be assigned
to each SM.
• In situations where there is an insufficient amount of any one or more
types of resources needed for the simultaneous execution of n blocks,
the CUDA runtime automatically reduces the number of blocks
assigned to each SM until their combined resource usage falls under
the limit.
• With a limited number of SMs and a limited number of blocks that can be assigned to each SM, there is a limit on the number of blocks that can be actively executing in a CUDA device.
Thread block assignment to SMs.

• The runtime system maintains a list of blocks that need to execute and assigns new blocks to SMs as they complete executing the blocks previously assigned to them.
• One of the SM resource limitations is the number of threads that can be simultaneously tracked and scheduled.
• It takes hardware resources for the SM to maintain the thread and block indices and track their execution status.
Advancement
• In more recent CUDA device designs, up to 1,536 threads can be assigned to each SM.
• That could take the form of 6 blocks of 256 threads each, 3 blocks of 512 threads each, etc.
• If the device only allows up to 8 blocks in an SM, it should be obvious that 12 blocks of 128 threads each is not a viable option.
• If a CUDA device has 30 SMs and each SM can accommodate up to 1,536 threads, the device can have up to 46,080 threads simultaneously residing in the CUDA device for execution.
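A small host-side sketch (an added illustration using the standard CUDA runtime query cudaGetDeviceProperties) that reproduces this arithmetic for whatever device is present; the exact limits it prints vary by device generation.

#include <cstdio>
#include <cuda_runtime.h>

// Maximum resident threads = number of SMs x maximum threads per SM
// (e.g., 30 SMs x 1,536 threads/SM = 46,080 in the example above).
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0 is assumed

    int maxResident = prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;

    printf("SMs:                   %d\n", prop.multiProcessorCount);
    printf("Max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max resident threads:  %d\n", maxResident);
    return 0;
}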
THREAD SCHEDULING AND LATENCY TOLERANCE
Thread scheduling

• In most implementations to date, once a block is assigned to an SM, it is further divided into 32-thread units called warps.
• The size of warps is implementation-specific. In fact, warps are not
part of the CUDA specification.
• The warp is the unit of thread scheduling in SMs.
(Figure: blocks are partitioned into warps for thread scheduling.)
Division of blocks into warps
• Each warp consists of 32 threads of consecutive threadIdx values:
threads 0 to 31 form the first warp, 32 to 63 the second warp, and so
on.
• In this example, there are three blocks (block 1, block 2, and block 3), all assigned to the same SM.
• Each of the three blocks is further divided into warps for scheduling
purposes.
• We can calculate the number of warps that reside in an SM for a given
block size and a given number of blocks assigned to each SM.
Example
• If each block has 256 threads, each block has 256/32 = 8 warps. With three blocks in each SM, we have 8 * 3 = 24 warps in each SM (a small code sketch of this arithmetic appears after this slide).
• An SM is designed to execute all threads in a warp following the single
instruction, multiple data (SIMD) model.
• At any instant of time, one instruction is fetched and executed for all
threads in the warp.
• This is implemented with a single instruction fetch/dispatch unit shared among the execution units in the SM.
• Note that these threads will apply the same instruction to different
portions of the data. As a result, all threads in a warp will always have
the same execution timing.
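A host-side sketch of the warp arithmetic in the example above; the warp size of 32 and the example's block count and block size are taken from the text, nothing is queried from hardware.

#include <cstdio>

// Warps per block, then warps per SM, for the example values in the text.
int main()
{
    const int kWarpSize        = 32;
    const int kThreadsPerBlock = 256;
    const int kBlocksPerSM     = 3;

    // Round up, in case the block size is not a multiple of the warp size.
    int warpsPerBlock = (kThreadsPerBlock + kWarpSize - 1) / kWarpSize; // 256/32 = 8
    int warpsPerSM    = warpsPerBlock * kBlocksPerSM;                   // 8 * 3 = 24

    printf("%d warps per block, %d warps per SM\n", warpsPerBlock, warpsPerSM);
    return 0;
}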
Hardware streaming processors
• In general, there are fewer SPs (hardware streaming processors) than threads assigned to each SM.
• Each SM has only enough hardware to execute instructions from a
small subset of all threads assigned to the SM at any point in time.
• In earlier GPU designs, each SM could execute only one instruction for a single warp at any given instant.
• In more recent designs, each SM can execute instructions for a small
number of warps at any given point in time.
Why Warps
• When an instruction executed by the threads in a warp needs to wait
for the result of a previously initiated long-latency operation, the
warp is not selected for execution.
• Another resident warp that is no longer waiting for results will be
selected for execution.
• If more than one warp is ready for execution, a priority mechanism is
used to select one for execution.
• This mechanism of filling the latency time of operations with work
from other threads is often called latency tolerance or latency hiding.
Warp scheduling.
• Warp scheduling is also used for tolerating other types of operation
latencies such as pipelined floating-point arithmetic and branch
instructions.
• The selection of ready warps for execution does not introduce any
idle time into the execution timeline, which is referred to as zero-
overhead thread scheduling.
• With warp scheduling, the long waiting time of warp instructions is
“hidden” by executing instructions from other warps.
Exercise
• Assume that a CUDA device allows up to 8 blocks and 1,024 threads per SM, whichever becomes a limitation first. Furthermore, it allows up to 512 threads in each block. For matrix-matrix multiplication, should we use 8 * 8, 16 * 16, or 32 * 32 thread blocks?

Analyse the pros and cons of each choice.
8*8
• If we use 8 * 8 blocks, each block would have only 64 threads.
• We will need 1,024/64 = 16 blocks to fully occupy an SM.
• But with the limit of up to 8 blocks in each SM, we will end up with only 64 * 8 = 512 threads in each SM.
• The SM execution resources will likely be underutilized because there
will be fewer warps to schedule around long-latency operations.
16*16
• The 16 * 16 blocks give 256 threads per block.
• Each SM can take 1,024/256 = 4 blocks. This is within the 8-block
limitation.
• This is a good configuration since we will have full thread capacity in each SM and a maximal number of warps for scheduling around the long-latency operations.
32*32
• The 32 * 32 blocks would give 1,024 threads in each block, exceeding the limit of 512 threads per block for this device.
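A short host-side sketch (an added illustration) that reproduces the comparison above under the exercise's stated limits; the limits are the exercise's assumptions, not values queried from a real device.

#include <cstdio>

// Compare 8x8, 16x16, and 32x32 thread blocks under the stated limits:
// at most 8 blocks per SM, 1,024 threads per SM, and 512 threads per block.
int main()
{
    const int kMaxBlocksPerSM     = 8;
    const int kMaxThreadsPerSM    = 1024;
    const int kMaxThreadsPerBlock = 512;
    const int dims[]              = {8, 16, 32};

    for (int d : dims) {
        int threadsPerBlock = d * d;
        if (threadsPerBlock > kMaxThreadsPerBlock) {
            printf("%2d x %2d: %4d threads/block exceeds the %d-thread block limit\n",
                   d, d, threadsPerBlock, kMaxThreadsPerBlock);
            continue;
        }
        // Blocks per SM are limited by both the block-count and thread-count caps.
        int blocks = kMaxThreadsPerSM / threadsPerBlock;
        if (blocks > kMaxBlocksPerSM)
            blocks = kMaxBlocksPerSM;

        int residentThreads = blocks * threadsPerBlock;
        printf("%2d x %2d: %d blocks/SM, %4d resident threads (%d%% of thread capacity)\n",
               d, d, blocks, residentThreads, 100 * residentThreads / kMaxThreadsPerSM);
    }
    return 0;
}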
