GPUs and GPU Programming
Stewart Weiss
The availability of these more powerful graphics chips led software makers to create software that used these
chips, fueled by the public's insatiable desire for life-like real-time 3D graphics, and the demands of the
computer gaming industry, the movie industry, and the television industry. Chip manufacturers completed
this cycle by responding in turn with more powerful chips. Eventually, this cycle of growth resulted in
graphics controllers that had as much processing power as the CPU itself, although their limited purpose
made them unsuitable to be used as CPUs.
The first graphics processing unit (GPU), NVIDIA's GeForce 256, appeared in 1999. In addition to the
operations that had become standard by then, this chip incorporated hardware functions for transforms
(movement in 3D) and for lighting and shading (altering the color of surfaces of the scene based on
lighting information). In November 2006, NVIDIA's GeForce 8800 gave birth to their new GPU Computing
model. The GeForce 8800 was based on the G80 architecture and brought several key innovations to GPU
computing. The G80 series was the largest commercial GPU at the time, containing approximately 686
million transistors. By 2008, eight generations of the GeForce had been built, some of which provided full
support for 3D graphics libraries such as Direct3D.
Over the next few years, GPUs became more and more programmable, replacing fixed-function dedicated
logic with programmable processors. Integer arithmetic was replaced by floating-point arithmetic, and the
degree of parallelism within the chips increased dramatically. It was not long before chip manufacturers started
adding instructions and memory to the GPUs so that they could support general-purpose programming, and
this turned them into fully general-purpose processors, known as GPGPUs. At this point the GPGPU is a
processor with unprecedented floating-point performance and programmability.
Although the hardware is SIMD, the Tesla programmer interface creates the illusion that the multiprocessor
is MIMD. It achieves this by making certain threads inactive when they are not supposed to execute a
given instruction, and by highly parallel fine-grained multithreading. At its best, all threads are busy
all of the time. The hardware is more like SIMT (single instruction, multiple thread) than SIMD. But if
the programmer does not write the code carefully, the machine will not take advantage of the maximum
parallelism possible.
A GPU is a multiprocessor, sometimes containing hundreds of processors. The intended purpose of a GPU
is to perform graphics operations, which is what they do well, but they can be used for other computations
as well. So how are they different from CPUs at this point?
• GPUs do not perform all of the operations that a CPU can perform; their instruction sets are narrowly
focused on graphics acceleration.
• The programming interfaces to GPUs are high-level application programming interfaces such as OpenGL,
Cg, and DirectX, together with high-level graphics shading languages such as C for Graphics (Cg) and
the High Level Shader Language (HLSL). These languages are supported by compilers that generate
intermediate languages, which are optimized by the specific GPU driver software, which generates the
specific machine instructions for the GPU.
• Graphics processing includes many stages in a pipeline of operations, including vertex shading, geometry
shading, rasterization, and pixel shading. These operations, described below, are performed on a
massively parallel scale in a pipelined fashion.
• Vertices can be drawn independently, and pixel fragments can be rendered independently. This inde-
pendence allows the computation to proceed using many independent and parallel threads of control.
• GPUs are designed to work well on 4-tuples. This is because a vertex in three dimensions is represented
by a set of four coordinates, (x, y, z, w), called homogeneous coordinates. The fourth coordinate,
w, is used to facilitate projecting the 3D point into two dimensions in a way that creates the illusion of
depth (known to the artist as perspective drawing and to the mathematician as projective geometry).
Also, having the fourth coordinate makes the basic transformations of rotation, translation, and scaling
obtainable by matrix multiplication (see the sketch after this list). Pixels also consist of four coordinates,
from a color space with an alpha channel, (r, g, b, alpha). Vertices and pixels each consist of four 32-bit
floating-point numbers.
• Unlike general purpose applications, graphics computations have working sets that can be hundreds of
megabytes.
• There is much more data parallelism in graphics applications than general purpose applications.
• GPUs do not rely on multi-level caches to overcome long latencies to memory. Instead, they rely on
hardware multithreading, with enough threads in flight to hide the latency.
• GPU main memory is designed for high bandwidth rather than low latency. GPUs have smaller
memories than CPUs.
• GPU processors are multithreaded and each is highly parallel. In the past, they had specialized
processors for each stage of the pipeline, but more of them now have homogeneous processors.
• GPUs used to have four-stream SIMD processors within them, but now these are being replaced by
regular processors and scalar instructions.
• GPUs traditionally had no need for double-precision floating-point instructions, but as they are used more and
more for non-graphics processing, this functionality is being added to them.
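As a small illustration of why the fourth coordinate is useful, the sketch below (with illustrative values, not taken
from these notes) shows that translation, which cannot be expressed as a 3x3 matrix product, becomes a single
4x4 matrix-vector multiplication in homogeneous coordinates:

/* Multiply a 4x4 transform M by a homogeneous point p = (x, y, z, w). */
void transform(const float M[4][4], const float p[4], float out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = M[i][0]*p[0] + M[i][1]*p[1] + M[i][2]*p[2] + M[i][3]*p[3];
}

/* Translation by (2, 3, 4) expressed as a 4x4 matrix: */
float T[4][4] = {
    {1, 0, 0, 2},
    {0, 1, 0, 3},
    {0, 0, 1, 4},
    {0, 0, 0, 1}
};
float p[4] = {1, 1, 1, 1};   /* the point (1, 1, 1) with w = 1 */
/* transform(T, p, out) yields (3, 4, 5, 1), the translated point. */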
There is a growing base of general-purpose applications that have been ported to GPUs. The term general-purpose
GPU, or GPGPU, refers to a method of using GPUs for non-graphics applications. NVIDIA has developed
a programming language, CUDA (Compute Unified Device Architecture), that enables programmers
to write C code for GPUs.
Before we look at the GPU architecture in detail, it is best to explain briefly the various graphics operations
that are performed in graphics processing.
A 3D image is represented by a 3D model, which is a collection of 3D points. Usually, 3D models start out
as triangulated surfaces, as shown in Figure 2. The vertices of the triangles define the shape and are the
starting point for display of the object. Thus, a 3D shape begins as a set of these vertices.
Graphics processing proceeds as a sequence of pipelined stages. The basic graphics pipeline is shown in
Figure 3. The individual stages are described below.
1.4.2 Z-Buffers
One of the major tasks in three-dimensional (3D) graphics processing is determining where each 3D point
should be placed on the 2D screen (which is essentially a projection problem), and what color it should have.
A key concept underlying the projection problem is how to determine which 3D points are visible and which
are obscured by other points; this is the visibility problem.
In computer graphics, z-buffering is the management of image depth coordinates in three-dimensional
(3D) graphics, usually done in hardware, sometimes in software. It is one solution to the visibility problem,
which is the problem of deciding which elements of a rendered scene are visible and which are hidden. The
painter's algorithm is another common solution which, though less efficient, can also handle non-opaque
scene elements. Z-buffering is also known as depth buffering.
An arbitrary point on a surface in 3D is represented by the coordinates (x, y, z), where z is the depth coordinate.
Initially all depth coordinates are normalized, varying between 0 and 1, with 1 being the furthest distance
from the viewing plane. When an object is rendered by a 3D graphics card, the depth of a generated pixel
(its z coordinate) is stored in the z-buffer (or depth buffer). This buffer is usually arranged as a two-
dimensional array (x-y) with one element for each screen pixel. The intensity value for that pixel (its color
and alpha channel) is stored in a parallel 2D buffer called the refresh buffer.
When multiple objects in the scene must be rendered at the same pixel, the graphics card must decide which
object's pixel is closest to the observer. The chosen depth is then saved to the z-buffer, replacing the old one,
and its intensity replaces the current value of the intensity at that pixel in the refresh buffer. In the end, the
z-buffer allows the graphics card to correctly reproduce the usual depth perception: a close object hides
a farther one. This is called z-culling.
Determining which pixel is closest is a calculation based upon the planar surfaces of the 3D object. When
a surface is triangulated, every point is either a vertex of a triangle or a point inside the triangle formed
by the three vertices. Each triangle defines a plane, whose equation can be obtained by solving a system
of linear equations using those three sets of coordinates. This leads to a planar equation of the form
Ax + By + Cz = D, where A, B, C, and D are constants. Each point in that plane satisfies this equation.
By algebraic manipulation, if x and y are known, z can be obtained from the equation

    z = (-Ax - By + D) / C

The z-values so obtained lie between 0 and 1.0.
For efficiency, the depth-buffering algorithm processes each scan line (horizontal line of pixels) in succession.
Two adjacent pixels differ only in their x-coordinate: (x, y) is followed by (x + 1, y). If the depth value for
(x, y) is z, then the depth value z' for (x + 1, y) can be obtained by a single subtraction operation, since

    z' = (-A(x + 1) - By + D) / C = (-Ax - By + D) / C - A/C = z - A/C
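As a sketch of how a z-buffer might use this incremental update along a scan line (the buffer names, sizes, and
color handling here are illustrative assumptions, not part of these notes):

#define WIDTH 1024

/* Process the pixels of one scan line y, from x_start to x_end, belonging to a
   triangle whose plane is Ax + By + Cz = D. zbuf holds the current depths and
   refresh holds the current intensities; color is the candidate pixel color. */
void scan_line(float A, float B, float C, float D,
               int y, int x_start, int x_end, float color,
               float zbuf[][WIDTH], float refresh[][WIDTH])
{
    float z  = (-A * x_start - B * y + D) / C;   /* depth at the first pixel  */
    float dz = -A / C;                           /* change in depth per pixel */
    for (int x = x_start; x <= x_end; x++) {
        if (z < zbuf[y][x]) {    /* closer than the stored depth: keep it */
            zbuf[y][x]    = z;
            refresh[y][x] = color;
        }
        z += dz;                 /* one addition per pixel, as derived above */
    }
}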
The input data may be a list of vertices, lines, and so on, but it might also include an index buffer. Without
an index buffer, a vertex might appear multiple times in the input to the graphics processor, because the
vertex data was arranged in the order in which the larger primitives (e.g., lines, triangles) would be processed.
Rather than storing the vertices themselves in an array, the array can contain references to vertices, which
are kept in a secondary structure. This array is an index buffer. By using an index buffer, each vertex is
stored exactly once, and references to it as a part of larger primitives are indices into this array.
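For instance, here is a minimal sketch (the values are illustrative) of two triangles that share an edge: without
an index buffer the two shared vertices would appear twice in the vertex stream, but with one, each vertex is
stored exactly once and the triangles refer to vertices by index.

float vertices[4][3] = {          /* 4 unique vertices (x, y, z) */
    {0, 0, 0}, {1, 0, 0}, {1, 1, 0}, {0, 1, 0}
};
unsigned int indexBuffer[6] = {   /* 2 triangles, 3 indices each */
    0, 1, 2,                      /* first triangle  */
    0, 2, 3                       /* second triangle */
};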
The input may also include a collection of textures that are to be applied to the scene to be displayed. You
can think of a texture as a two-dimensional shape, usually a rectangle, containing colors or an image to be
applied to a surface.
Vertex Shading Shader programs in general determine how lighting and shadows interact with the surfaces
to be rendered. A vertex shader is a graphics processing function that maps vertices onto the screen and
adds special effects to objects in a 3D environment by performing mathematical operations on the objects'
vertex data. One of its purposes is to transform each vertex's 3D position in virtual space to the 2D
coordinate at which it appears on the screen, as well as a depth value for the Z-buffer, and then to apply
color to it.
Examples of vertex shading effects include matrix palette skinning, which allows programmers to create
realistic character animation with up to 32 "bones" per joint; deformation of surfaces, which gives developers
the power to create realistic surfaces such as waves; and vertex morphing, which is used to morph triangle
meshes from one shape to another, providing smooth skeletal animation. Vertex shaders are run once for
each input vertex. Although vertex shaders can manipulate properties such as position, color, and texture
coordinates, they cannot create new vertices. The output of the vertex shader goes to the next stage in the
pipeline, which is either a geometry shader, if present, or the rasterizer otherwise.
Geometry Shading Geometry shading is the stage of the graphics pipeline after vertex shading. Its
purpose is to enhance the details and accuracy of the 2D image by working at a coarser degree of granularity
than individual vertices. Its inputs are geometric primitives consisting of more than one vertex, such
as lines and triangles. A geometry shader can take as its input, for example, the three points of a triangle
and output intermediate points that can be used to refine the surface. It can only do this because it operates
on whole primitives rather than single vertices. The geometry shader can also modify the positions and
orientation of the primitives.
Rasterization The word "raster" was originally used for the raster scan of cathode ray tubes (CRTs), which
paint the image line by line; the term is now used to mean a grid of pixels. Rasterization is the process
of converting vectorized input to bitmap form, i.e., a 2D array of pixels. Given a triangle, for example,
represented by three vertices, it determines the locations of all pixels and pixel fragments that lie inside,
or on the edge of, this triangle. There can be pixel fragments because a triangle is a mathematical object
consisting of lines of infinitely small width that can intersect pixels rather than lie between them. Therefore,
a single pixel can lie partly on each side of a line segment. The fragment of a pixel on the inside edge of a
triangle's perimeter is a pixel fragment belonging to that triangle.
The rasterization stage actually does other processing as well. Typically it also does clipping, meaning
throwing away regions that are outside of the view frustum. A frustum is a 3D shape that can be
described as a 4-sided pyramid with its top lopped off. It is the 3D analog of a trapezoid. When you view
a scene in perspective, you are usually looking at a frustum lying on its side. The front plane is the bottom
of the pyramid, and the back plane is the lopped-off top. This is the view frustum.
The rasterization step typically does z-culling of pixels. As it is generating pixels, it can discover that some
of them are behind others and should be culled. When this step is finished, what is left are the visible pixels
and pixel fragments, represented by their screen positions and their depth values. These are passed to the
pixel shader.
Pixel Shading A pixel shader is a function that computes the color and other attributes of each pixel or
pixel fragment. Pixel shaders range from always outputting the same color, to applying a lighting value, to
doing bump mapping, shadows, specular highlights, translucency, and other phenomena. They can alter the
depth of the pixel (for Z-buffering), or output more than one color if multiple render targets are active. A
pixel shader alone cannot produce very complex effects, because it operates only on a single pixel, without
knowledge of a scene's geometry or of neighboring pixels.
The following program, taken from the Patterson-Hennessy text, can give you an idea of the type of work
that a pixel shader can do. It is written in Cg. Each pixel is run through this program.
void reflection(
    float2 texCoord       : TEXCOORD0,
    float3 reflection_dir : TEXCOORD1,
    uniform float shiny,
    uniform sampler2D surfaceMap,
    uniform samplerCUBE envMap,
    out float4 color      : COLOR )
{
    // Fetch the surface color from a texture
    float4 surfaceColor = tex2D(surfaceMap, texCoord);
    // Fetch the reflected color by sampling the cube map in the
    // direction of the reflection
    float4 reflectedColor = texCUBE(envMap, reflection_dir);
    // Output a weighted average of the two colors
    color = lerp(surfaceColor, reflectedColor, shiny);
}
The first input parameter, texCoord, is the (x, y) position of the pixel. This is used as an index into a
texture map in order to apply a particular texture to that pixel. The texture map, surfaceMap, is a large
2D array that is passed to all pixels in the fourth parameter. The second input parameter, reflection_dir,
is a 3D vector that represents the direction of the view with respect to the surface. To understand this,
imagine that you could draw a line between the pixel and your eye. Next, rotate the plane in which the
pixel lies so that it is the horizontal plane, and then translate the plane so that the pixel is at the origin.
Assume that the z-axis is upward and x and y are in the horizontal plane. The line from the pixel to your
eye is a vector called the reflection direction. It is the direction in which the light will bounce from the
surface to your eye. The envMap parameter stores the colors that each face of a cube would have given the
lighting conditions for the surface. Those colors are used to determine what color the reflected light would
be, assuming the pixel did not absorb any of the light's spectrum.
The reflection function calls two functions, tex2D() and texCUBE(), to compute the surface color and reflected
color, and then uses a weighted average of these two colors to return the value by which to color this particular
pixel. The weighted average function is lerp(). Its third parameter is a floating-point value that represents
the weight to apply to each of the two colors. lerp is short for linear interpolation, which, given two
values a1 and a2 and a constant 0 ≤ c ≤ 1, produces the value (1 − c)a1 + ca2. When c = 0, this evaluates
to a1, and when c = 1, it evaluates to a2. For all values in between, the distance of the interpolated point
from a1 is proportional to c, and hence the name. Linear interpolation is used more generally as follows. If
f(x) is a function of x whose values are either hard to compute or unknown, and an approximation or
estimate of it is needed, and two of its values are known, say at points x1 and x2, then a linear interpolation
of f(x) between the pair of points is given by (1 − c)f(x1) + cf(x2), 0 ≤ c ≤ 1, which in effect treats the
function as a linear function (a straight line) between the points. The approximation to f, denoted f̂(x),
would be defined by letting c = (x − x1)/(x2 − x1).
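A minimal C sketch of lerp and of its use to approximate a function between two known values (a scalar
version; the Cg built-in behaves the same way on each component):

/* Linear interpolation: (1 - c)*a1 + c*a2 for 0 <= c <= 1. */
float lerp(float a1, float a2, float c)
{
    return (1.0f - c) * a1 + c * a2;
}

/* Approximate f(x) between two known values f(x1) and f(x2). */
float approx_f(float x, float x1, float fx1, float x2, float fx2)
{
    float c = (x - x1) / (x2 - x1);
    return lerp(fx1, fx2, c);
}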
A discrete GPU is one that sits on a card that is plugged into the PCI-Express interconnect bus. In
contrast, a motherboard-GPU is integrated into the chipset on the motherboard. Tesla-based GPUs can
have from 1 to 16 nodes, each called a streaming multiprocessor (SM). NVIDIA uses the terms "node"
and "streaming multiprocessor" interchangeably. The largest version available in 2008 was the GeForce 8800
GTX, with 16 SMs, each containing 8 multithreaded single-precision floating-point and integer processing
units, which used to be called streaming processors (SPs) but are now called cores. The clock rate was
1.35 GHz. The later Fermi-based GPUs support 32 cores per streaming multiprocessor, implying that they
have up to 512 cores. The newest architecture, at the time of this writing, is the Kepler, which has up to 15
Streaming Multiprocessor (SMX) units, each of which has up to 192 single-precision CUDA cores, with each
core having fully pipelined floating-point and integer arithmetic logic units. Figure 6 shows the architecture
of a single Kepler SMX processor.
To give you some idea of the power of these GPUs, consider the fact that the GeForce 8800 GTX has a
single-precision multiply-add instruction. This means that the add and multiply operations take place in a
single instruction cycle. Given the preceding parameters, we can compute the peak performance of the 8800
GTX, i.e., the performance when all processors are kept busy all of the time:

    16 SMs × 8 cores/SM × 2 floating-point operations/cycle × 1.35 GHz = 345.6 GFLOPS

With the same clock speed, the Kepler GK110, with 15 SMXs of 192 cores each, would have a peak performance
of almost 8 Teraflops!
Each of the 16 SMs of the GeForce 8800 has a local store with a capacity of 16 KB as well as 8192 32-bit
registers. The memory shared by the SMs is partitioned into six partitions of 900 MHz GDDR3 DRAM, each
with an 8-byte-wide datapath and 128 MB per partition. Therefore, there is a total of 768 MB of memory,
and the peak bandwidth is
    6 partitions × 8 bytes × 2 transfers/cycle × 0.9 GHz = 86.4 GB/second
Figure 4 depicts the architecture of an NVIDIA GeForce 8800 that has 14 streaming multiprocessors, dis-
tributed into 7 pairs of SMs. Each SM pair is integrated into a unit called a texture/processor cluster
(TPC) that contains a shared geometry controller, a streaming multiprocessor controller (SMC), a shared
texture unit and cache, and shared lines for load/store and I/O, as shown in Figure 5. The SMC controls
access to the shared texture unit, load/store path, and I/O path. Each TPC is connected to an interconnection
network that connects it to the device memory, L2 cache (containing textures), raster processors, and
the interface to the actual display device, as well as to the bridge to the CPU and system memory.
Fig. 6: The SMX architecture of the Kepler GK110 (from the NVIDIA Kepler GK110 whitepaper).
Each SM in the Tesla GPUs contains 8 cores (32 in the Fermi GPUs), as well as its own local shared
memory, an instruction cache, a constant cache, a multithreaded instruction unit, and two special function
units (SFUs). The special function units compute special functions such as the transcendental functions
(e.g., trigonometric functions), reciprocals, and square roots.
A core (SP) is the primary thread processor. In the GeForce 8800, each core is a multithreaded processor
supporting 96 threads, with a register file containing 1024 scalar 32-bit registers for the use of these threads.
The processor is fully pipelined and implements all 32-bit and 64-bit integer arithmetic, comparison, conversion,
and logical PTX instructions (PTX instructions are parallel thread execution instructions; each such instruction
is issued to multiple threads simultaneously), as well as IEEE 754 standard single-precision floating-point operations
and a compatible add/multiply operation. Earlier GPUs executed vector instructions, but the later GPUs
were designed so that core processors executed ordinary scalar instructions.
Fig. 7: Die photo of the Kepler GK110 with 15 SMX processors, and overlay showing locations.
Until recently, there were two different ways to use a GPU. The earlier method was called General Purpose
computing on a GPU (GPGPU) and evolved as a means to do general-purpose computing by using the
GPU's graphics API to do non-graphics tasks. Subsequently, as GPUs included more general-purpose
instructions in their instruction sets, it became possible to write non-graphics programs using a parallel
programming language and API. This paradigm was called GPU computing.
In 2007, NVIDIA released a software architecture and computational model to make it easier to write C
or C++ programs that could exploit the high degree of parallelism in the GPU. They called this model
Compute Unified Device Architecture, or CUDA for short. CUDA consists of a software library and a
compiler that maps the enhanced C/C++ code into instructions for the GPU and CPU. It essentially allows
the programmer to write highly parallel C/C++ programs by using additional data types and functions
provided by the library. In fact, CUDA can be used to write parallel programs for multi-core CPUs as
well.
The computational hierarchy ranges from threads at the lowest level, to thread blocks, which are groups of
threads, to grids, which are groups of blocks. Each level has an associated memory that maps naturally to
the physical memories inside the GPU. Similarly, there are methods of barrier synchronization that can be
used with threads or with thread blocks. Figure 8 shows the relationship between computational units and
their associated memories.
CUDA threads may access data from multiple memory spaces during their execution, as illustrated by Figure
8. Each thread has private local memory, called per-thread memory. Each thread block has shared
memory visible to all threads of the block and with the same lifetime as the block. This is called per-block
shared memory in the figure. All threads have access to the same global memory.
There are also two additional read-only memory spaces accessible by all threads: the constant and texture
memory spaces. (In the Kepler models, each SMX has a separate 48 KB read-only data cache accessible to
all threads in a thread block.) We will see later that the global, constant, and texture memory spaces are
optimized for different memory usages. The global, constant, and texture memory spaces are persistent across
invocations of threads within the same application. In other words, these three memory spaces retain their
data as threads are created and destroyed. A large part of the challenge of GPU computing is understanding
the capacities and methods of access of each different type of memory in order to maximize performance.
A program consists of different kinds of data and functions. Certain variables and data are local to a single
thread, others to thread blocks, and others to the entire application. Similarly, special functions called
kernels are functions that are executed in parallel by all threads. An example of a kernel is the following,
which adds two arrays A and B of size N and stores the result into array C:
__global__
void VecAdd(float* A, float* B, float* C )
{
int i = threadIdx.x;
C[i] = A[i] + B[i];
}
The qualifier __global__ is a CUDA extension to C that identifies the function as a kernel function, which
means that every thread executes it. Each thread is given a unique integer identifier, which in this case is
stored in the variable threadIdx.x. To call this function from the main program, one would use the syntax
int main()
{
...
// Kernel invocation with one block of N threads
int numBlocks = 1;
dim3 threadsPerBlock(N);
VecAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
}
The triple-angle brackets <<<...>>> identify the execution configuration of the kernel, which indicates
how many blocks per grid (up to three dimensions) and how many threads per block (up to three dimensions).
In this example, the grid consists of one block because numBlocks = 1, and the block is a set of N threads.
threadsPerBlock is declared as a dim3, which is a 3-dimensional structure having x, y, and z members; the
initialization sets x = N, and y and z default to 1.
This is the flavor of computing in CUDA, but there is much more to be said about it later. For now, the
important observation is that the CUDA programming model requires an underlying architecture that can
execute many threads extremely quickly and be able to switch among groups of threads as well. It also
has to provide the different memory spaces in an efficient way. This leads us to explore the multithreading
capabilities of the GPU. GPUs use extensive hardware multithreading for several reasons:
• To hide the latency of memory loads and texture fetches from DRAM and shared block memories.
Memory accesses can take hundreds of processor cycles. Multithreading allows the processor to switch
to another thread while one thread is waiting for a load or texture fetch to complete. The extremely
high degree of multithreading can keep many cores busy even though many threads might be stalled
waiting for memory loads, because if there are enough active threads, then the probability that there
are threads to keep all cores busy will be high.
• To support fine-grained parallel graphics shader programming models and parallel computing models.
Graphics shader programs typically execute many different stages dynamically, from vertex shading to
pixel shading. Because of this, the streaming multiprocessors are designed to execute different thread
programs concurrently.
• To virtualize the physical processors as threads and thread blocks in order to make them highly scalable.
Furthermore, each thread can have its own private registers, private memory, program counter, and
thread execution state, and can execute its own independent code sequence. To make all of this
possible, the GPU multiprocessor is hardware multithreaded, managing hundreds of threads without
scheduling overhead. Threads within a thread block can synchronize with each other using a barrier
synchronization instruction (like the one we saw in Chapter 7 notes).
• To simplify the parallel programming model so that the programmer only has to write serial kernel
functions.
All of the concurrency in the CUDA extensions to C and C++ takes place in kernel functions. This
has the potential to simplify the logic of many data parallel programs.
Each SP core has scalar integer and floating-point arithmetic units that execute most of its instructions.
It is hardware multithreaded and can support anywhere from 32 to 96 threads, depending on the model.
The SP core is also pipelined, and can run several threads concurrently. The program declares how many
of the SP's 1024 registers each thread needs. If each thread uses few registers, then more threads can run
concurrently. If each thread uses many registers, then fewer threads can run at a time. For example, a pixel
shader program usually uses no more than 16 registers per thread, so the 1024 registers can be allocated to
64 threads, which implies that each SP core can run 64 pixel shader threads concurrently.
The SM uses a model something like SIMD, called Single Instruction Multiple Thread, or SIMT for
short. It creates groups of threads called warps². A warp is a group of 32 threads scheduled on a single
SM together. A single SM will generally execute multiple warps, interleaving their executions to hide stalls,
as shown in Figure 10. A thread block is never smaller than a warp, though thread blocks can consist of
multiple warps. Individual threads composing a warp start together at the same program address, but they
have their own instruction address counter and register state and are therefore free to branch and execute
independently.

² The term warp originates from weaving, the first parallel thread technology. A half-warp is either the first or second half
of a warp. A quarter-warp is either the first, second, third, or fourth quarter of a warp.
In the Tesla, warps are scheduled onto the cores in units of four clock cycles; a warp of 32 threads is
distributed among 8 cores with four threads per core. Over four clock cycles, each of the four threads
executes an instruction, so that after four clock cycles, 32 threads have executed one instruction each. In the
Fermi architecture, each SM has two warp schedulers and two instruction dispatch units, allowing two warps
to be issued and executed concurrently. Fermi's dual warp scheduler selects two warps and issues one
instruction from each warp to a group of sixteen cores, sixteen load/store units, or four SFUs. Running two
warps in parallel increases the ability to hide stalls due to memory accesses. In the Kepler architecture, each
SMX has four warp schedulers and eight instruction dispatch units, allowing four warps to be issued and
executed concurrently. Kepler's quad warp scheduler selects four warps, and two independent instructions
per warp can be dispatched each cycle. Figure 11 depicts one of the four warp schedulers' actions over time.
Unlike Fermi, which did not permit double-precision instructions to be paired with other instructions, the
Kepler GK110 allows double-precision instructions to be paired with other instructions.
The SM tries to identify which threads can execute simultaneously and schedules them dynamically. The
processors can handle data hazards and conditional instructions by allowing threads to follow different paths.
Because some threads are executing different paths than others, certain threads will have to be inactive at
times (otherwise they would be executing the wrong instructions). Thus, when branch conditions evaluated
on different data cause different paths to be followed, the throughput is diminished. When the threads
rejoin, they start executing in unison again. If the paths through the code on different branches are the same
length, the performance loss is not as great as when they are of unequal lengths.
The Tesla uses fine-grained multithreading to schedule 24 warps over time, which run in blocks of four cycles
each. The Tesla will not switch threads more frequently than every two clock cycles. Figure 12 illustrates
how warps are scheduled on a single Tesla SM. As noted above, each warp is distributed across eight cores.
Since there are 32 threads, at best it will take 4 cycles to execute all threads of the warp. It will take
longer if there are many conditional branches and consequent inactive threads.
The SM's SIMT multithreaded instruction unit picks a warp that is ready to execute its next instruction
(because it has enough active threads) and issues that instruction to the active threads in that warp. There
may be warps of different types being executed concurrently in a single SM; a warp of pixels may execute
concurrently with a warp of vertices.
Thus, from a large-scale view, it is MIMD. Each Tesla processor consists of eight streaming processors with
a SIMD architecture. Each streaming processor executes the same instruction on different data. Within the
processor there is a shared memory that each streaming processor can access.
Although the hardware is SIMD, the Tesla programmer interface creates the illusion that the multiprocessor
is MIMD. It achieves this by making certain threads inactive when they are not supposed to execute a given
instruction, and by highly parallel fine-grained multithreading. At its best, all 32 threads in a warp are
busy all of the time. The hardware is more like SIMT (single instruction, multiple thread) than SIMD. But
if the programmer does not write the code carefully, the machine will not take advantage of the maximum
parallelism possible.
Another issue is a practical one: how does one actually write general-purpose programs that can be run on
a GPU? There are different approaches to this problem, but the easiest solution is to use NVIDIA's CUDA
extension to the C/C++ language. These notes present an overview of CUDA Version 4.0. They are not
intended as a reference manual nor as a technical guide. Most of the material comes from the NVIDIA
CUDA C Programming Guide, Version 4.0.
CUDA extends C by adding constants, types, and functions that expose the capabilities of a GPU. As noted
earlier, it consists primarily of three key abstractions:
• a hierarchy of thread groups,
• shared memories, and
• barrier synchronization,
that allow the programmer to write programs that have fine-grained data parallelism and thread parallelism,
as well as coarse-grained data parallelism and task parallelism.
For the purpose of writing correct programs, it is enough to learn the syntax and semantics of a relatively
small subset of CUDA, but for writing programs with optimal performance, it is important to understand
the underlying execution model and the memory model. We start with an overview of the memory model
and then look at some of the details of CUDA.
Global memory instructions support reading or writing words of size equal to 1, 2, 4, 8, or 16 bytes. Any
access (via a variable or a pointer) to data residing in global memory compiles to a single global memory
instruction if and only if the size of the data type is 1, 2, 4, 8, or 16 bytes and the data is naturally aligned
(i.e. its address is a multiple of that size). If the read (or write) requests from multiple threads can be made
to addresses in global memory that satisfy these constraints, then the reads (or writes) are coalesced into a
single instruction of higher bandwidth. This will be illustrated by example below.
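As a minimal sketch of the idea (the kernels and sizes here are illustrative assumptions): when the threads of
a warp access consecutive, naturally aligned words, the accesses fall into a small number of segments and are
coalesced, whereas a large stride between the addresses touched by adjacent threads spreads the accesses over
many segments.

// Coalesced: adjacent threads read adjacent 4-byte floats.
__global__ void copy_coalesced(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Not coalesced: adjacent threads read floats that are 'stride' words apart.
__global__ void copy_strided(const float* in, float* out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}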
To achieve high bandwidth, shared memory is divided into equally-sized memory modules, called banks ,
which can be accessed simultaneously. Any memory read or write request made of n addresses that fall in
n distinct memory banks can therefore be serviced simultaneously, yielding an overall bandwidth that is n
times as high as the bandwidth of a single module.
However, if two addresses of a memory request fall in the same memory bank, there is a bank conflict and the
access has to be serialized. The hardware splits a memory request with bank conflicts into as many separate
conflict-free requests as necessary, decreasing throughput by a factor equal to the number of separate memory
requests. If the number of separate memory requests is n, the initial memory request is said to cause n-way
bank conflicts.
To get maximum performance, it is therefore important to understand how memory addresses map to memory
banks in shared memory in order to schedule the memory requests so as to minimize bank conflicts.
For devices of compute capability 1.x, shared memory has 16 banks that are organized such that successive
32-bit words are assigned to successive banks, i.e., interleaved. Each bank has a bandwidth of 32 bits per two
clock cycles. A shared memory request for a warp is split into two memory requests, one for each half-warp,
that are issued independently. As a consequence, there can be no bank conflict between a thread belonging
to the first half of a warp and a thread belonging to the second half of the same warp.³ Again, this is best
illustrated with an example.
³ For devices of compute capability 2.x, shared memory has 32 banks. Therefore, unlike for devices of lower compute
capability, there may be bank conflicts between a thread belonging to the first half of a warp and a thread belonging to the
second half of the same warp.
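Here is a small sketch for a compute-capability 1.x device (16 banks, successive 32-bit words in successive
banks); the kernel and sizes are illustrative assumptions. Within a half-warp, reads at a stride of one word hit
16 distinct banks and are conflict-free, while reads at a stride of two words map pairs of threads to the same
bank, causing 2-way bank conflicts.

// Launch with at most 256 threads per block so the indices stay in bounds.
__global__ void bank_demo(float* out)
{
    __shared__ float buf[512];
    buf[2 * threadIdx.x]     = (float) threadIdx.x;
    buf[2 * threadIdx.x + 1] = 0.0f;
    __syncthreads();

    float a = buf[threadIdx.x];        // stride 1: conflict-free within a half-warp
    float b = buf[2 * threadIdx.x];    // stride 2: 2-way bank conflicts

    out[blockIdx.x * blockDim.x + threadIdx.x] = a + b;
}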
For each of the basic types char, uchar, int, uint, short, ushort, long, ulong, longlong, ulonglong, and float:
if X is one of these types, then there are vector types X1, X2, X3, and X4. For example, there are types
int1, int2, int3, and int4. The term vector type is a misnomer; these are not vectors, but structures.
The 1st, 2nd, 3rd, and 4th components are accessible through the members x, y, z, and w, respectively. For
example, if we declare

uint3 point;

then point.x, point.y, and point.z are the members of point. There are also double1 and double2, but
not double3 or double4.
The type dim3 is an extension of uint3. It is used to specify the dimensions of things such as thread
blocks and grids. Unlike uint3, though, when an object of type dim3 is declared, its constructor initializes
all uninitialized components to the value 1. For example,
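dim3 gridSize(4, 3);    // gridSize.x == 4, gridSize.y == 3, gridSize.z is set to 1
dim3 blockSize(16);     // blockSize.x == 16; blockSize.y and blockSize.z are set to 1

(The variable names above are arbitrary, chosen only for illustration.)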
1. An automatic variable declared in device code without any of the qualifiers __device__, __shared__,
and __constant__ usually resides in a register.
2. The __device__ qualifier declares a variable that resides on the device (meaning global memory), has
the lifetime of an application, and is accessible from all the threads within the grid and from the host
through the runtime library (with specific functions designed to allow the host program to access the
GPU's device memory).
3. The __shared__ qualifier, optionally used together with __device__, declares a variable that resides
in the shared memory space of a thread block, has the lifetime of the block, and is accessible only from
the threads within the block.
4. The __constant__ qualifier, optionally used together with __device__, declares a variable that resides
in constant memory space, has the lifetime of an application, and is accessible from all the threads
within the grid and from the host through the runtime library (with specific functions).
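A short sketch of how these qualifiers might appear in device code (the names and sizes are illustrative, not
taken from these notes):

__constant__ float coeff[16];        // constant memory; lifetime of the application
__device__   float table[1024];      // global (device) memory; visible to the whole grid

// Assumes the kernel is launched with 256 threads per block.
__global__ void scaleBlock(float* data)
{
    __shared__ float tile[256];       // shared memory; one copy per thread block
    int i = threadIdx.x;              // automatic variable; usually held in a register
    int g = blockIdx.x * blockDim.x + i;

    tile[i] = data[g];                // stage the block's data in shared memory
    __syncthreads();
    data[g] = tile[i] * coeff[0] + table[i];
}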
2.5 Kernels
A kernel is a C function that is executed in parallel by more than one CUDA thread, as opposed to only once
like an ordinary C function. A kernel is defined using the __global__ declaration specifier. For example,
// kernel definition
__global__
void VecAdd(float* A, float* B, float* C )
{
int i = threadIdx.x;
C[i] = A[i] + B[i];
}
specifies to the compiler that VecAdd() is a kernel. The number of threads that execute the kernel is
determined by how the function is called. CUDA has a special execution configuration syntax for this
purpose. Any call to a __global__ function must specify the execution configuration for that call. The
execution configuration defines the dimension of the grid and blocks that will be used to execute the function
on the device. The syntax of the call is

KernelFunction<<<Dg, Db, Ns, S>>>(argument_list);

where
• Dg is of type int or dim3 and specifies the dimension and size of the grid, such that Dg.x * Dg.y *
Dg.z equals the number of blocks being launched.⁴
• Db is of type int or dim3 and specifies the dimension of the block. Db.x * Db.y * Db.z equals the
number of threads being launched in each block.
• Ns is of type size_t and specifies the number of bytes in shared memory that is dynamically allocated
per block for this call in addition to the statically allocated memory; this dynamically allocated memory
is used by any of the variables declared as an external array. Ns is an optional argument which defaults
to 0.
• S denotes a cudaStream_t, which we will ignore here. It is an optional argument that defaults to 0.
int main()
{
...
// Kernel invocation with N threads
VecAdd<<<1, N>>>(A, B, C);
}
In this case, Dg = 1 and Db = N, so there is a single block containing N threads.
Each thread that executes the kernel is given a unique thread ID that is accessible within the kernel through
the built-in threadIdx variable. The threadIdx variable is of type uint3, so (threadIdx.x, threadIdx.y,
threadIdx.z) are the coordinates of the thread within the block. Coordinates are Cartesian, not matrix. In
other words, the x coordinate is the column position and the y coordinate is the row position.
Each block is given a unique block ID that is accessible within the kernel through the built-in blockIdx
variable. The blockIdx variable is of type uint3, so (blockIdx.x, blockIdx.y, blockIdx.z) are the
coordinates of the block within the grid.
The dimensions of a block are specified with up to three dimensions and are accessible to each thread in the
kernel through the built-in blockDim variable, which is of type dim3. (The dimensions of the grid are
similarly available through the built-in gridDim variable.)
To illustrate, suppose we want to process a matrix M that has 32 columns and 18 rows by dividing it into a
grid that has 4 columns and 3 rows of blocks, as shown in Figure 13. Each block will consist of 6 rows and
8 columns of unique threads. The main program would call our kernel using the execution configuration

dim3 gridSize(4,3);
dim3 blockSize(8,6);
float A[18][32];
ProcessMatrix<<<gridSize, blockSize>>>(A);

⁴ Dg.z must be equal to 1 for devices of compute capability 1.x.
// kernel definition
__global__
void ProcessMatrix(float** A )
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    // process A[row][col]
}
Figure 13 shows how matrix element M[9][17] is processed by the thread with id (1,3) in the block with
id (2,1): its row is blockIdx.y * blockDim.y + threadIdx.y = 1*6 + 3 = 9 and its column is
blockIdx.x * blockDim.x + threadIdx.x = 2*8 + 1 = 17.
Fig. 13: Decomposition of a matrix into a grid of thread blocks. In this example, the matrix is 18 rows by
32 columns. Each thread block has 8 columns and 6 rows of threads. A thread id within a block
is (x, y), 0 ≤ x ≤ 7, 0 ≤ y ≤ 5. The figure shows how the thread with id (1,3) within the block with
id (2,1) is associated with the matrix entry (17,9). Matrix elements are referred to using Cartesian
coordinates: column first, then row.
Matrix multiplication is not commutative; AB is not the same as BA in general, so when we say that we
are multiplying a matrix A on the right by matrix B, we mean the product AB. We can multiply a matrix A
that is r × s on the right by a matrix B only if B has s rows. In other words, B must be of dimensions s × t
for some t. The formal definition is that if A is an r × s matrix and B is an s × t matrix, then the product
C = AB is the r × t matrix whose entries are defined as

    Ci,j = Σ (k = 1 to s) Ai,k · Bk,j,    for 1 ≤ i ≤ r, 1 ≤ j ≤ t
In other words, Ci,j is the dot product (inner product) of the ith row of A and the jth column of B. The
dot product is only defined if these are the same length, which is why the width of A must match the height
of B. Below is an example.
    [  1   3 ]                    [ 13  11   9   7 ]
    [ -1   2 ]  ·  [ 1 2 3 4 ]  = [  7   4   1  -2 ]
    [ -2   1 ]     [ 4 3 2 1 ]    [  2  -1  -4  -7 ]
The matrices are passed as dynamic arrays because otherwise one dimension of each matrix would have to
be a fixed size, making it a rather useless function. Client code that allocated the matrices and called this
function might look like the following:

multiply(A, B, C, Aheight, Awidth, Bwidth);
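The listing of multiply() itself is not shown above; a plain C version consistent with that call, with each matrix
passed as a dynamically allocated one-dimensional array in row-major order, might look like this (a sketch, not
the notes' original listing):

/* C = A*B, where A is Aheight x Awidth and B is Awidth x Bwidth.
   Each matrix is a dynamically allocated 1D array in row-major order. */
void multiply(const float* A, const float* B, float* C,
              int Aheight, int Awidth, int Bwidth)
{
    for (int i = 0; i < Aheight; i++)
        for (int j = 0; j < Bwidth; j++) {
            float sum = 0.0f;
            for (int k = 0; k < Awidth; k++)
                sum += A[i * Awidth + k] * B[k * Bwidth + j];
            C[i * Bwidth + j] = sum;
        }
}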
Figure 14 illustrates how a single entry of the product is obtained. This matrix multiplication algorithm
requires O(n³) multiplications and additions. Now we see how it can be done using the GPU.
In C/C++ and in CUDA on the GPU, two-dimensional matrices are stored in row-major order. This means
that row 0 is stored first, followed by row 1, then row 2, and so on. We can therefore represent a two-dimensional
matrix in one dimension: element (row, col) of a matrix with width columns is stored at index row*width + col
of the one-dimensional array.
The listing below is not a main program, but a function that can utilize the resources of the GPU to perform
matrix multiplication. It computes the matrix product C = AB, where A and B are matrices of type float.
The macro

#define BLOCK_SIZE 16

declares the thread block size; each thread block will be a square block of 256 threads.
The Algorithm This code is an implementation of matrix multiplication that does not take advantage of
the shared memory within each SM (streaming multiprocessor). There will be a thread for each cell of the
result matrix C. Let A be an M by N matrix and B an N by R matrix. We assume that M, N, and R are
divisible by 16, the block size for this implementation.
The A and B matrices are copied from host memory to device memory. There, a kernel is run on a grid of
blocks of threads, each of which computes a single cell of C. To compute a single cell C[i][j] of C, a thread has
to read the ith row of A and the jth column of B, each of size N. Many threads read the ith row of A and
many read the jth column of B, but only one thread reads that particular row-column pair.
To be clear, each row of A is read R times (for each column of B) and each column of B is read M times
(for each row of A). Therefore, A is read R times from global memory and B is read M times. The basic
steps are
1. Allocate memory on the device for matrix A, and copy A onto the device's global memory.
2. Do the same for matrix B.
3. Allocate space on the device for C, the matrix product.
4. Compute the shape of the grid and the number of blocks needed to cover the product matrix C.
5. Launch the kernel with the grid and blocks.
6. When the kernel is done, copy matrix C from the device memory to host memory.
/*
   CUDA-based MATRIX MULTIPLICATION Without Using Shared Memory
*/
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime_api.h>
#include "matrixmult.h"

#define BLOCK_SIZE 16

// Forward declaration of the matrix multiplication kernel (in separate listing)
__global__ void MatMulKernel(const Matrix, const Matrix, Matrix);
/*
   Matrix multiplication - Host code
   Matrix dimensions are assumed to be multiples of BLOCK_SIZE
*/
void MatMul(const Matrix A, const Matrix B, Matrix C)
{
    size_t size;

    // Declare the matrices that will reside on the device (the GPU),
    // allocate device memory for them, and copy A and B to the device.
    Matrix d_A;
    d_A.width = A.width;  d_A.height = A.height;
    size = A.width * A.height * sizeof(float);   /* its size in bytes */
    cudaMalloc(&d_A.elements, size);
    cudaMemcpy(d_A.elements, A.elements, size, cudaMemcpyHostToDevice);

    Matrix d_B;
    d_B.width = B.width;  d_B.height = B.height;
    size = B.width * B.height * sizeof(float);
    cudaMalloc(&d_B.elements, size);
    cudaMemcpy(d_B.elements, B.elements, size, cudaMemcpyHostToDevice);

    // Allocate matrix C on the device
    Matrix d_C;
    d_C.width = C.width;  d_C.height = C.height;
    size = C.width * C.height * sizeof(float);
    cudaMalloc(&d_C.elements, size);

    // Execution configuration: one thread per element of C
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
    /* Now the kernel is invoked on this grid of blocks of threads */
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

    // Copy the result matrix C from device memory back to host memory
    cudaMemcpy(C.elements, d_C.elements, size, cudaMemcpyDeviceToHost);

    /*
       cudaError_t cudaFree(void* devPtr)
       This frees the memory space pointed to by devPtr, which must have been
       returned by a previous call to cudaMalloc() or cudaMallocPitch().
       Otherwise, or if cudaFree(devPtr) has already been called before,
       an error is returned. If devPtr is 0, no operation is performed.
    */
    cudaFree(d_A.elements);
    cudaFree(d_B.elements);
    cudaFree(d_C.elements);
}
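The kernel launched above is described in the next paragraphs, but its listing does not appear here. A sketch
consistent with that description (one thread per element of C, reading A and B directly from global memory,
and using the width, height, and elements fields of the Matrix type shown in the shared-memory version later)
is:

__global__ void MatMulKernel(const Matrix A, const Matrix B, Matrix C)
{
    // Each thread computes one element of C, accumulating the dot product
    // of one row of A and one column of B in Cvalue.
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    // Terminate threads that fall outside the product matrix (this happens
    // only in blocks that overhang its right or bottom edge).
    if (row >= C.height || col >= C.width)
        return;

    float Cvalue = 0.0f;
    for (int k = 0; k < A.width; ++k)
        Cvalue += A.elements[row * A.width + k] * B.elements[k * B.width + col];

    // Copy the accumulated sum into the product matrix in global memory.
    C.elements[row * C.width + col] = Cvalue;
}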
The if statement terminates the thread if its row or column places it outside the bounds of the product
matrix. This will happen only in those blocks that overhang either the right or bottom side of the matrix.
The loop that follows runs over the entries of the row of A and the column of B (these have the same size)
needed to compute the (row, col)-entry of the product, and the sum of these products is accumulated in the
Cvalue variable. The last line of the kernel copies the sum of the products into the appropriate element of
the product matrix C, in the device's global memory.
Notice that each thread makes two reads from global memory on every iteration of its
loop, one from matrix A and one from matrix B. Since accesses to global memory are relatively slow, this
slows down the kernel code, leaving many threads idle for hundreds of clock cycles.
One way to reduce the number of accesses to global memory is to have the threads load portions of matrices
A and B into shared memory, where they can access them much more quickly. The problem is that shared
memory is not large enough to store two large matrices. Devices of compute capability 1.x have 16 KB of
shared memory per multiprocessor, and devices of compute capability 2.x have 48 KB.
So instead, portions of A and B are loaded into shared memory as needed, and used as efficiently as
possible while they are loaded. Figure 15 shows how the matrix product can be computed in a block-
structured way. Matrix A is shown on the left and matrix B is shown at the top, with matrix C, their
product, at the bottom-right. Each element of C is the dot product of the row to its left in A and the column
above it in B. The matrices A and B are partitioned into 16 by 16 submatrices (BLOCK_SIZE = 16).
Each thread block is responsible for computing one square sub-matrix (Csub in the code) of the product
matrix C, and each thread within it is responsible for computing one element of that sub-matrix. (The
yellow square in matrix C represents the submatrix computed by a thread block, whereas the red box inside
the yellow square represents a single entry in C, which is computed by a single thread within that block.)
The thread responsible for this entry computes its value by computing the dot product of the red row of A
and the red column of B. Unfortunately, it is not as simple as this, because the thread must do this in pieces,
as not all of its input data will be in shared memory at the same time.
Csub is equal to the product of two rectangular matrices: the sub-matrix of A of dimension A_width x
BLOCK_SIZE that has the same row indices as Csub, and the sub-matrix of B of dimension BLOCK_SIZE x
A_width that has the same column indices as Csub. These are the yellow rectangular regions within the A
and B matrices in Figure 15. These regions will not all be in shared memory together. But notice that the
red row and red column pass through the same number of submatrices, since they are of equal length. This
leads to the idea: if we load the left-most of those submatrices of matrix A into shared memory, and the
top-most of those submatrices of matrix B into shared memory, then we can compute the first BLOCK_SIZE
products of the dot product entirely from shared memory. Every thread in the thread block can do this
for its own little red square from the figure, and because those submatrices are in shared memory, this
is very fast.
When this is done, we no longer need the left-most block of A and the top-most block of B. We load from
global memory to shared memory the block of A to the right of the previous one and the block of B below
the previous one, and repeat the above step. Each thread computes the dot product of the next BLOCK_SIZE
entries. This gets added to the running total that the thread is maintaining for its (i,j) entry. This process
continues until the entire row of A has been multiplied by the entire column of B. When this is finished, the
resulting square submatrix is written back to global memory.
To make the copying from global to shared memory efficient, each thread is responsible for copying a single
element from each of the A and B matrices. The copying is done in such a way as to maximize the memory
bandwidth, which will be explained within the listing below.
// Matrices are stored in row-major order:
//     M(row, col) = *(M.elements + row * M.stride + col)
// The stride of a matrix is the number of elements from the start of a row to
// the start of the next row. The stride is not necessarily equal to the width;
// the stride can be larger so that rows are aligned in shared memory to
// achieve good bandwidth.
typedef struct {
    int width;        // width of matrix
    int height;       // height of matrix
    int stride;       // number of elements from start of row k to start of row k+1
    float* elements;  // pointer to actual matrix data
} Matrix;
// Thread block size
#define BLOCK_SIZE 16

// The __global__ qualifier declares a function as being a kernel.
// A kernel is a function that is executed on the device (the GPU), and
// is callable from the host (CPU).
// This is a forward declaration of the device multiplication function
__global__ void Muld(float*, float*, int, int, float*);
// Host multiplication function
// Compute C = A * B
//   height_A is the height of A
//   width_A  is the width of A
//   width_B  is the width of B
void Multiply(const float* A,
              const float* B,
              int height_A,
              int width_A,
              int width_B,
              float* C)
{
    int size;
    /*
       Copy matrices A and B to the device memory.
       Copying requires cudaMemcpy, which can copy either from host
       to device, from device to host, or from device to device.
    */
    float* A_on_device;
    size = height_A * width_A * sizeof(float);
    cudaMalloc((void**)&A_on_device, size);
    cudaMemcpy(A_on_device, A, size, cudaMemcpyHostToDevice);

    float* B_on_device;
    size = width_A * width_B * sizeof(float);
    cudaMalloc((void**)&B_on_device, size);
    cudaMemcpy(B_on_device, B, size, cudaMemcpyHostToDevice);

    // Allocate matrix C on the device
    float* C_on_device;
    size = height_A * width_B * sizeof(float);
    cudaMalloc((void**)&C_on_device, size);
    // Compute the execution configuration assuming
    // the matrix dimensions are multiples of BLOCK_SIZE.
    // The dim3 declaration is used here. This specifies that dimBlock
    // is BLOCK_SIZE x BLOCK_SIZE x 1, and dimGrid has one block per
    // BLOCK_SIZE x BLOCK_SIZE tile of the product matrix C.
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(width_B / dimBlock.x, height_A / dimBlock.y);

    // Launch the device computation
    Muld<<<dimGrid, dimBlock>>>(A_on_device,
                                B_on_device,
                                width_A,
                                width_B,
                                C_on_device);

    // Copy the result matrix C from device memory back to host memory
    cudaMemcpy(C, C_on_device, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(A_on_device);
    cudaFree(B_on_device);
    cudaFree(C_on_device);
}
/*
   Note again the __global__ qualifier: this is a kernel function.
   This also means that every thread executes this function.
   When a thread executes this function, it has a specific thread id
   and block id. The thread id is the value of threadIdx, used below,
   and the block id is stored in blockIdx, used below.
   threadIdx and blockIdx are each of type uint3.
*/
__global__
void Muld(float* A, float* B, int width_A, int width_B, float* C)
{
    // Block index
    int block_col = blockIdx.x;
    int block_row = blockIdx.y;

    // Thread index
    int thread_col = threadIdx.x;
    int thread_row = threadIdx.y;

    // Index of the first sub-matrix of A processed by the block
    int aBegin = width_A * BLOCK_SIZE * block_row;

    // Index of the last sub-matrix of A processed by the block
    int aEnd = aBegin + width_A - 1;

    // Step sizes used to iterate through the sub-matrices of A and B
    int aStep = BLOCK_SIZE;
    int bStep = BLOCK_SIZE * width_B;

    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * block_col;

    // The element of the product computed by this thread
    float Csub = 0;

    // Loop over all the sub-matrices of A and B required to compute
    // this block's sub-matrix of C
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {

        // The __shared__ qualifier declares a variable that
        //   * resides in the shared memory space of a thread block,
        //   * has the lifetime of the block, and
        //   * is only accessible from the threads within the block.
        // The As and Bs matrices declared below are in the shared memory of
        // the block; As is for the sub-matrix of A, and Bs, a sub-matrix of B.
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
        /*
           The next step loads the sub-matrices from global memory to shared
           memory; each thread loads one element of each matrix.
        */
        As[thread_row][thread_col] = A[a + width_A * thread_row + thread_col];
        Bs[thread_row][thread_col] = B[b + width_B * thread_row + thread_col];

        // Synchronize to make sure the matrices are loaded.
        // __syncthreads() is a barrier synchronization call; all threads
        // wait here until every thread has made the call, at which point it
        // returns in each thread.
        __syncthreads();
        /*
           The loop below computes the inner product of the row of As and
           the column of Bs assigned to this thread, As[thread_row] and
           Bs[..][thread_col], i.e., the sum

             As[i][0]*Bs[0][j] + As[i][1]*Bs[1][j] + ... + As[i][BLOCK_SIZE-1]*Bs[BLOCK_SIZE-1][j]

           Each iteration executes the assignment statement

             Csub += As[thread_row][k] * Bs[k][thread_col];
        */
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[thread_row][k] * Bs[k][thread_col];
        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    // Write this thread's element of the block sub-matrix to device
    // (global) memory; each thread writes one element.
    int c = width_B * BLOCK_SIZE * block_row + BLOCK_SIZE * block_col;
    C[c + width_B * thread_row + thread_col] = Csub;
}
void initMatrix(float M[], float value, int width, int height)
{
    int i, j;
    for (i = 0; i < height; i++)
        for (j = 0; j < width; j++)
            M[j + i * width] = value;
}

void printMatrix(float M[], int width, int height)
{
    int i, j;
    for (i = 0; i < height; i++) {
        for (j = 0; j < width; j++)
            printf("%8.2f ", M[j + i * width]);
        printf("\n");
    }
}
int main()
{
    /* Matrix dimensions: assumed values (they must be multiples of BLOCK_SIZE;
       the call to Multiply below treats the matrices as square). */
    int width  = 32;
    int height = 32;
    float *A, *B, *C;

    A = (float*) calloc(width * height, sizeof(float));
    if (A == NULL)
        exit(1);

    B = (float*) calloc(width * height, sizeof(float));
    if (B == NULL)
        exit(1);

    C = (float*) calloc(width * height, sizeof(float));
    if (C == NULL)
        exit(1);

    initMatrix(A, 2.0, width, height);
    initMatrix(B, 3.0, width, height);
    initMatrix(C, 0.0, width, height);

    Multiply(A, B, width, width, height, C);

    printMatrix(C, width, height);
    return 0;
}