GPU Architecture
[Figure: CPU memory hierarchy — registers, caches, main memory; the memory bus transfers 8B per transfer at 12.8GB/sec – 31.92GB/sec]
• What is a GPU
– A specialized processor for graphics
– Massively parallel: lots of threads, each of which reads data, calculates, and writes results
– Used to be fixed-function
– Has become increasingly programmable
• What is CUDA
– A C extension for programming NVIDIA GPUs
– Straightforward to learn
– The challenge is in getting performance (a sketch of a kernel follows)
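
A minimal sketch of what a CUDA kernel looks like; the kernel name scale and its parameters are illustrative, not from the course:

// Illustrative kernel: each thread reads one element, calculates, writes one element.
__global__ void scale(float *out, const float *in, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's global index
    if (i < n)                                      // guard against overshooting the array
        out[i] = in[i] * factor;                    // read, calculate, write
}

Each thread follows the read–calculate–write pattern above on one element; performance tuning (the hard part) is about how those memory accesses and threads are organized.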
[Figure: GPU memory hierarchy — GPU memory (1GB on our systems), connected over a bus transferring 8B per transfer at 12.8GB/sec – 31.92GB/sec]
[Figure: a 60-element array partitioned across 5 thread blocks of 12 elements each]
Block 0: a[0] … a[11]
…
Block 4: a[48] … a[59]
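
A sketch of how this decomposition maps to code, assuming 5 blocks of 12 threads each (so gridDim.x = 5 and blockDim.x = 12); the kernel name touch and the device pointer d_a are illustrative:

// Each thread computes its global index from its block and thread IDs.
__global__ void touch(int *a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // block 4, thread 11 -> i = 59
    a[i] = i;
}

// Launch: touch<<<5, 12>>>(d_a);
// Block 0 then covers a[0]..a[11] and Block 4 covers a[48]..a[59], as in the figure.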
• Memory Hierarchy
– Anything declared inside the kernel (with no qualifier) is private to each thread
– __shared__ int … declares storage shared by all threads of one block
– __device__ int … declares a variable in global memory, visible to all threads (__global__ marks kernel functions, not variables) — a sketch follows
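
A short sketch contrasting the three storage classes; the names counter, tile, and kernel are illustrative, and a launch with 128 threads per block is assumed:

__device__ int counter;             // global memory: one copy, visible to every thread

__global__ void kernel(int *out)
{
    int local = threadIdx.x;        // declared inside the kernel: private to each thread
    __shared__ int tile[128];       // __shared__: one copy per block, shared by its threads

    tile[threadIdx.x] = local;
    __syncthreads();                // make the block's shared-memory writes visible
    atomicAdd(&counter, 1);         // every block updates the same global variable
    out[blockIdx.x * blockDim.x + threadIdx.x] = tile[threadIdx.x];
}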
GPU Computing
• GPU computing is the use of a GPU (graphics processing unit) as a co-processor to accelerate CPUs for general-purpose scientific and engineering computing.
• The GPU accelerates applications running on the CPU by offloading some of the compute-intensive and time-consuming portions of the code.
• The rest of the application still runs on the CPU. From a user's perspective, the application runs faster because it's using the massively parallel processing power of the GPU to boost performance. This is known as "heterogeneous" or "hybrid" computing (sketched below).
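
A hedged sketch of this offload pattern, reusing the illustrative scale kernel from earlier; the problem size n is an arbitrary assumption:

#include <stdlib.h>
#include <cuda_runtime.h>

// The data-parallel "hot" portion, offloaded to the GPU.
__global__ void scale(float *out, const float *in, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * factor;
}

int main(void)
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Host (CPU) side: the rest of the application lives here.
    float *h_in  = (float *)malloc(bytes);
    float *h_out = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    // Device (GPU) side: allocate, copy in, compute, copy back.
    float *d_in, *d_out;
    cudaMalloc(&d_in,  bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    scale<<<(n + 255) / 256, 256>>>(d_out, d_in, 2.0f, n);  // GPU does the hot loop

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_in); cudaFree(d_out);
    free(h_in); free(h_out);
    return 0;   // control never left the CPU; only the hot loop ran on the GPU
}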
Data Parallelism
• Modern applications process large amounts of data that incur significant execution time on sequential computers. An example is pixel processing, e.g. an application that converts sRGB pixels to grayscale: to process a 1920x1080 image, it must process 2,073,600 pixels.
• Processing all those pixels on a traditional uniprocessor CPU takes a very long time, since the execution is sequential (the time taken is proportional to the number of pixels in the image).
• It is also very inefficient, since the operation performed on each pixel is the same; only the data differs (SPMD: single program, multiple data).
• Since processing one pixel is independent of processing any other pixel, all the pixels can be processed in parallel.
• If we use 2,073,600 threads ("workers") and each thread processes one pixel, the task is, in principle, reduced to constant time.
• Millions of such threads can be launched on modern GPUs (a sketch follows).
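
A sketch of the grayscale example with one thread per pixel (2,073,600 threads for 1920x1080). The interleaved 3-bytes-per-pixel layout and the Rec. 601 luma weights are assumptions, and gamma is ignored for brevity:

__global__ void rgb_to_gray(unsigned char *gray, const unsigned char *rgb,
                            int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        int p = y * width + x;                // this thread's pixel
        float r = rgb[3 * p + 0];
        float g = rgb[3 * p + 1];
        float b = rgb[3 * p + 2];
        gray[p] = (unsigned char)(0.299f * r + 0.587f * g + 0.114f * b);
    }
}

// Launch with one thread per pixel:
//   dim3 block(16, 16);
//   dim3 grid((1920 + 15) / 16, (1080 + 15) / 16);
//   rgb_to_gray<<<grid, block>>>(d_gray, d_rgb, 1920, 1080);

Every thread runs the same program on different data, which is exactly the SPMD structure noted above.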
2. Thread blocks
• As the name implies, a thread block (or CUDA block) is a grouping of threads that can be executed together, in series or in parallel.
• This logical grouping of threads enables more efficient data mapping. Thread blocks share memory on a per-block basis (a sketch follows).
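
A sketch of threads in one block cooperating through that per-block shared memory; the kernel name reverse_tile and the block size of 256 are assumptions, and the array length is assumed to be a multiple of 256:

__global__ void reverse_tile(int *data)
{
    __shared__ int tile[256];                      // visible only within this block
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = data[i];                   // each thread stages one element
    __syncthreads();                               // wait until the whole block has written
    data[i] = tile[blockDim.x - 1 - threadIdx.x];  // read another thread's element
}

// Launch: reverse_tile<<<n / 256, 256>>>(d_data);

Threads in different blocks cannot see each other's tile; shared memory is strictly per-block.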
3. Kernel grids
• The next layer of abstraction up from thread blocks is the kernel grid. A kernel grid is the grouping of all the thread blocks launched by one kernel invocation. Grids can be used to perform larger computations in parallel (a sketch follows).
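
A host-side sketch of sizing a grid, reusing the illustrative touch kernel from earlier; n and the block size of 256 are assumptions, and d_a is a previously allocated device pointer:

int n = 1 << 24;                         // a larger problem: 16M elements
dim3 block(256);                         // threads per block
dim3 grid((n + block.x - 1) / block.x);  // enough blocks to cover all n elements

touch<<<grid, block>>>(d_a);             // one launch = one grid of thread blocks

// 2-D grids suit 2-D data, as in the image example:
//   dim3 block2(16, 16);
//   dim3 grid2((width + 15) / 16, (height + 15) / 16);

Scaling the computation up just means asking for more blocks in the grid; the kernel itself is unchanged.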