Lecture4 CUDA Threads Part2
[Figure: grid of thread blocks; Block 2 through Block 7 shown]
Each block can execute in any order relative to other blocks.
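The order independence stated above can be illustrated with a small host-side sketch. It is not CUDA code: `runBlocksInOrder` is a hypothetical helper that plays the role of a kernel launch, running the "blocks" one at a time in whatever order it is given. It assumes each block writes only to its own slice of the data, which is the property that makes any execution order safe.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-in for a kernel launch: runs the "blocks"
// one after another in the order listed in blockOrder.
// Each block doubles its own slice of the data and touches nothing else.
std::vector<int> runBlocksInOrder(std::vector<int> data,
                                  std::size_t threadsPerBlock,
                                  const std::vector<std::size_t>& blockOrder) {
    for (std::size_t block : blockOrder) {
        for (std::size_t t = 0; t < threadsPerBlock; ++t) {
            // Element owned by thread t of this block.
            std::size_t i = block * threadsPerBlock + t;
            if (i < data.size()) {
                data[i] *= 2;
            }
        }
    }
    return data;
}
```

Calling it with block orders {0, 1, 2, 3} and {3, 1, 0, 2} on the same input gives the same output, because no block reads or writes another block's elements.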
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 3
ECE498AL, University of Illinois, Urbana-Champaign
G80 CUDA mode – A Review
• Processors execute computing threads
• New operating mode/HW interface for computing
[Figure: G80 in CUDA mode. Labels: Host, Input Assembler, Parallel Data Cache (×8), Texture (×8), Global Memory]
G80 Example: Executing Thread Blocks
[Figure: blocks of threads t0 t1 t2 … tm assigned to SM 0 and SM 1; each SM is labelled with MT IU and SP]
– For 8×8 blocks, there are 64 threads per block. Each SM can take
up to 768 threads, which would allow 768 / 64 = 12 blocks. However,
each SM can take at most 8 blocks, so only 8 × 64 = 512 threads go
into each SM!
– For 16×16 blocks, there are 256 threads per block. Each SM can
take up to 768 threads, so it can take 3 blocks and reach full
capacity, unless other resource considerations overrule.
– For 32×32 blocks, there are 1024 threads per block. Not even one
block fits into an SM!
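The arithmetic in these three cases can be written as a short host-side sketch. The two limits are the per-SM figures quoted above for G80. The function names are illustrative, and the sketch deliberately ignores registers and shared memory, which are the "other resource considerations" that can lower the count further.

```cpp
#include <algorithm>

// Per-SM limits quoted above for G80.
constexpr int kMaxThreadsPerSM = 768;
constexpr int kMaxBlocksPerSM = 8;

// Number of blocks of the given size that can be resident on one SM,
// considering only the thread limit and the block limit.
int residentBlocksPerSM(int threadsPerBlock) {
    if (threadsPerBlock <= 0) {
        return 0;
    }
    int byThreads = kMaxThreadsPerSM / threadsPerBlock;  // integer division
    return std::min(byThreads, kMaxBlocksPerSM);
}

// Number of threads those resident blocks put on the SM.
int residentThreadsPerSM(int threadsPerBlock) {
    return residentBlocksPerSM(threadsPerBlock) * threadsPerBlock;
}
```

For 64, 256 and 1024 threads per block this gives 8, 3 and 0 resident blocks, that is 512, 768 and 0 resident threads, matching the three cases above.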
• dim3 gridDim;
– Dimensions of the grid in blocks (gridDim.z unused)
• dim3 blockDim;
– Dimensions of the block in threads
• dim3 blockIdx;
– Block index within the grid
• dim3 threadIdx;
– Thread index within the block
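A common use of these four variables is to give each thread a unique global coordinate. The sketch below shows that arithmetic on the host. `Dim3` is a hypothetical stand-in for the built-in dim3 type, and the values are passed as ordinary parameters; in a real kernel they are provided by the runtime.

```cpp
// Hypothetical host-side stand-in for CUDA's built-in dim3 type.
struct Dim3 {
    unsigned x, y, z;
};

// Global x coordinate (column) of a thread within the whole grid.
unsigned globalX(Dim3 blockIdx, Dim3 blockDim, Dim3 threadIdx) {
    return blockIdx.x * blockDim.x + threadIdx.x;
}

// Global y coordinate (row) of a thread within the whole grid.
unsigned globalY(Dim3 blockIdx, Dim3 blockDim, Dim3 threadIdx) {
    return blockIdx.y * blockDim.y + threadIdx.y;
}

// Total number of threads in a launch; gridDim.z is unused, as noted above.
unsigned totalThreads(Dim3 gridDim, Dim3 blockDim) {
    return gridDim.x * gridDim.y * blockDim.x * blockDim.y * blockDim.z;
}
```

For example, with 16×16 blocks, thread (5, 3) of block (2, 1) has global coordinates (2·16 + 5, 1·16 + 3) = (37, 19).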
Common Runtime Component:
Mathematical Functions
• pow, sqrt, cbrt, hypot
• exp, exp2, expm1
• log, log2, log10, log1p
• sin, cos, tan, asin, acos, atan, atan2
• sinh, cosh, tanh, asinh, acosh, atanh
• ceil, floor, trunc, round
• Etc.
– When executed on the host, a given function uses
the C runtime implementation if available
– These functions are only supported for scalar types,
not vector types
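Because the functions accept scalars only, a vector value has to be processed one component at a time. The following host-side sketch assumes a hypothetical `Float4` struct standing in for a four-component vector type; on the host, `std::sqrt` resolves to the C/C++ runtime implementation.

```cpp
#include <cmath>

// Hypothetical stand-in for a four-component vector type.
struct Float4 {
    float x, y, z, w;
};

// sqrt is defined for scalars only, so apply it to each component in turn.
Float4 sqrtComponents(Float4 v) {
    return Float4{std::sqrt(v.x), std::sqrt(v.y), std::sqrt(v.z), std::sqrt(v.w)};
}
```

The same pattern applies to any of the functions listed above: write a small wrapper that calls the scalar function once per component.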
Device Runtime Component:
Mathematical Functions
• Some mathematical functions (e.g. sin(x))
have a less accurate, but faster, device-only
version (e.g. __sinf(x), single precision)
– __powf
– __logf, __log2f, __log10f
– __expf
– __sinf, __cosf, __tanf