GPUs and GPU Programming
Stewart Weiss
The availability of these more powerful graphics chips led software makers to create software that used these
chips, fueled by the public's insatiable desire for life-like real-time 3D graphics, and the demands of the
computer gaming industry, the movie industry, and the television industry. Chip manufacturers completed
this cycle by responding in turn with more powerful chips. Eventually, this cycle of growth resulted in
graphics controllers that had as much processing power as the CPU itself, although their limited purpose
made them unsuitable to be used as CPUs.
The first graphics processing unit (GPU), NVIDIA's GeForce 256, appeared in 1999. In addition to the
operations that had become standard by then, this chip incorporated hardware functions for transforms
(movement in 3D) and for lighting and shading (altering the color of surfaces of the scene based on
lighting information). In November 2006, NVIDIA's GeForce 8800 gave birth to their new GPU Computing
model. The GeForce 8800 was based on the G80 architecture and brought several key innovations to GPU
computing. The G80 series was the largest commercial GPU at the time, containing approximately 686
million transistors. By 2008, eight generations of the GeForce had been built, some of which provided full
support for 3D graphics libraries such as Direct3D.
Over the next few years, GPUs became more and more programmable, replacing fixed-function dedicated
logic with programmable processors. Integer arithmetic was replaced by floating-point arithmetic, and the
degree of parallelism within the chips increased dramatically. It was not long before chip manufacturers started
adding instructions and memory to the GPUs so that they could support general-purpose programming, and
this turned them into fully general-purpose processors, known as GPGPUs. At this point the GPGPU is a
processor with unprecedented floating-point performance and programmability.
Although the hardware is SIMD, the Tesla programmer interface creates the illusion that the multiprocessor
is MIMD. It achieves this by making certain threads inactive when they are not supposed to execute a
given instruction, and by highly parallel fine-grained multithreading. At its best, all threads are busy
all of the time. The hardware is more like SIMT (single instruction, multiple thread) than SIMD. But if
the programmer does not write the code carefully, the machine will not take advantage of the maximum
parallelism possible.
A GPU is a multiprocessor, sometimes containing hundreds of processors. The intended purpose of a GPU
is to perform graphics operations, which is what they do well, but they can be used for other computations
as well. So how are they different from CPUs at this point?
• GPUs do not perform all of the operations that a CPU can perform; their instruction sets are narrowly
focused on graphics acceleration.
• The programming interfaces to GPUs are high-level application programming interfaces such as OpenGL,
Cg, and DirectX, together with high-level graphics shading languages such as C for Graphics (Cg) and
the High Level Shader Language (HLSL). These languages are supported by compilers that generate
intermediate languages, which are optimized by the specific GPU driver software, which generates the
specific machine instructions for the GPU.
• Graphics processing includes many stages in a pipeline of operations, including vertex shading, geometry
shading, rasterization, and pixel shading. These operations, described below, are performed on a
massively parallel scale in a pipelined fashion.
• Vertices can be drawn independently, and pixel fragments can be rendered independently. This inde-
pendence allows the computation to proceed using many independent and parallel threads of control.
• GPUs are designed to work well on 4-tuples. This is because a vertex in three dimensions is represented
by a set of four coordinates, (x, y, z, w), called homogeneous coordinates. The fourth coordinate,
w, is used to facilitate projecting the 3D point into two dimensions in a way that creates the illusion of
depth (known to the artist as perspective drawing and to the mathematician as projective geometry).
Also, having the fourth coordinate makes the basic transformations of rotation, translation, and scaling
obtainable by matrix multiplication (see the sketch after this list). Pixels also consist of four coordinates,
from a color space with an alpha channel, (r, g, b, alpha). Vertices and pixels each consist of four 32-bit
floating-point numbers.
• Unlike general purpose applications, graphics computations have working sets that can be hundreds of
megabytes.
• There is much more data parallelism in graphics applications than general purpose applications.
• GPUs do not rely on multi-level caches to overcome long latencies to memory. Instead, they rely on
hardware multithreading, with enough threads in flight to hide the latency.
• GPU main memory is designed for high bandwidth rather than low latency. GPUs have smaller
memories than CPUs.
• GPU processors are multithreaded and each is highly parallel. In the past, they had specialized
processors for each stage of the pipeline, but more of them now have homogeneous processors.
• GPUs used to have four-stream SIMD processors within them, but now these are being replaced by
regular processors and scalar instructions.
• GPUs traditionally had no need for double-precision floating-point instructions, but as they are used more and
more for non-graphics processing, this functionality is being added to them.
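As a small illustration of why the fourth coordinate is useful, the sketch below (with illustrative values, not taken
from these notes) shows that translation, which cannot be expressed as a 3x3 matrix product, becomes a single
4x4 matrix-vector multiplication in homogeneous coordinates:

/* Multiply a 4x4 transform M by a homogeneous point p = (x, y, z, w). */
void transform(const float M[4][4], const float p[4], float out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = M[i][0]*p[0] + M[i][1]*p[1] + M[i][2]*p[2] + M[i][3]*p[3];
}

/* Translation by (2, 3, 4) expressed as a 4x4 matrix: */
float T[4][4] = {
    {1, 0, 0, 2},
    {0, 1, 0, 3},
    {0, 0, 1, 4},
    {0, 0, 0, 1}
};
float p[4] = {1, 1, 1, 1};   /* the point (1, 1, 1) with w = 1 */
/* transform(T, p, out) yields (3, 4, 5, 1), the translated point. */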
There is a growing base of general-purpose applications that have been ported to GPUs. The term general-purpose
GPU, or GPGPU, refers to a method of using GPUs for non-graphics applications. NVIDIA has developed
a programming language, CUDA (Compute Unified Device Architecture), that enables programmers
to write C code for GPUs.
Before we look at the GPU architecture in detail, it is best to explain briefly the various graphics operations
that are performed in graphics processing.
A 3D image is represented by a 3D model, which is a collection of 3D points. Usually, 3D models start out
as triangulated surfaces, as shown in Figure 2. The vertices of the triangles define the shape and are the
starting point for display of the object. Thus, a 3D shape begins as a set of these vertices.
Graphics processing proceeds as a sequence of pipelined stages. The basic graphics pipeline is shown in
Figure 3. The individual stages are described below.
1.4.2 Z-Buffers
One of the major tasks in three-dimensional (3D) graphics processing is determining where each 3D point
should be placed on the 2D screen (which is essentially a projection problem), and what color it should have.
A key concept underlying the projection problem is how to determine which 3D points are visible and which
are obscured by other points; this is the visibility problem.
In computer graphics, z-buffering is the management of image depth coordinates in three-dimensional
(3D) graphics, usually done in hardware, sometimes in software. It is one solution to the visibility problem,
which is the problem of deciding which elements of a rendered scene are visible and which are hidden. The
painter's algorithm is another common solution which, though less efficient, can also handle non-opaque
scene elements. Z-buffering is also known as depth buffering.
An arbitrary point on a surface in 3D is represented by the coordinates (x, y, z), where z is the depth coordinate.
Initially all depth coordinates are normalized, varying between 0 and 1, with 1 being the furthest distance
from the viewing plane. When an object is rendered by a 3D graphics card, the depth of a generated pixel
(its z coordinate) is stored in the z-buffer (or depth buffer). This buffer is usually arranged as a two-
dimensional array (x-y) with one element for each screen pixel. The intensity value for that pixel (its color
and alpha channel) is stored in a parallel 2D buffer called the refresh buffer.
When multiple objects in the scene must be rendered at the same pixel, the graphics card must decide which
object's pixel is closest to the observer. The chosen depth is then saved to the z-buffer, replacing the old one,
and its intensity replaces the current value of the intensity at that pixel in the refresh buffer. In the end, the
z-buffer allows the graphics card to correctly reproduce the usual depth perception: a close object hides
a farther one. This is called z-culling.
Determining which pixel is closest is a calculation based upon the planar surfaces of the 3D object. When
a surface is triangulated, every point is either a vertex of a triangle or a point inside the triangle formed
by the three vertices. Each triangle defines a plane, whose equation can be obtained by solving a system
of linear equations using those three sets of coordinates. This leads to a planar equation of the form
Ax + By + Cz = D, where A, B, C, and D are constants. Each point in that plane satisfies this equation.
By algebraic manipulation, if x and y are known, z can be obtained from the equation

    z = (-Ax - By + D) / C

The z-values so obtained lie between 0 and 1.0.
For efficiency, the depth-buffering algorithm processes each scan line (horizontal line of pixels) in succession.
Two adjacent pixels differ only in their x-coordinate: (x, y) is followed by (x + 1, y). If the depth value for
(x, y) is z, then the depth value z' for (x + 1, y) can be obtained by a single subtraction operation, since

    z' = (-A(x + 1) - By + D) / C = (-Ax - By + D) / C - A/C = z - A/C
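As a sketch of how a z-buffer might use this incremental update along a scan line (the buffer names, sizes, and
color handling here are illustrative assumptions, not part of these notes):

#define WIDTH 1024

/* Process the pixels of one scan line y, from x_start to x_end, belonging to a
   triangle whose plane is Ax + By + Cz = D. zbuf holds the current depths and
   refresh holds the current intensities; color is the candidate pixel color. */
void scan_line(float A, float B, float C, float D,
               int y, int x_start, int x_end, float color,
               float zbuf[][WIDTH], float refresh[][WIDTH])
{
    float z  = (-A * x_start - B * y + D) / C;   /* depth at the first pixel  */
    float dz = -A / C;                           /* change in depth per pixel */
    for (int x = x_start; x <= x_end; x++) {
        if (z < zbuf[y][x]) {    /* closer than the stored depth: keep it */
            zbuf[y][x]    = z;
            refresh[y][x] = color;
        }
        z += dz;                 /* one addition per pixel, as derived above */
    }
}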
The input data may be a list of vertices, lines, and so on, but it might also include an index buffer. Without
an index buffer, a vertex might appear multiple times in the input to the graphics processor, because the
vertex data was arranged in the order in which the larger primitives (e.g., lines, triangles) would be processed.
Rather than storing the vertices themselves in an array, the array can contain references to vertices, which
are kept in a secondary structure. This array is an index buffer. By using an index buffer, each vertex is
stored exactly once, and references to it as a part of larger primitives are indices into this array.
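For instance, here is a minimal sketch (the values are illustrative) of two triangles that share an edge: without
an index buffer the two shared vertices would appear twice in the vertex stream, but with one, each vertex is
stored exactly once and the triangles refer to vertices by index.

float vertices[4][3] = {          /* 4 unique vertices (x, y, z) */
    {0, 0, 0}, {1, 0, 0}, {1, 1, 0}, {0, 1, 0}
};
unsigned int indexBuffer[6] = {   /* 2 triangles, 3 indices each */
    0, 1, 2,                      /* first triangle  */
    0, 2, 3                       /* second triangle */
};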
The input may also include a collection of textures that are to be applied to the scene to be displayed. You
can think of a texture as a two-dimensional shape, usually a rectangle, containing colors or an image to be
applied to a surface.
Vertex Shading Shader programs in general determine how lighting and shadows interact with the surfaces
to be rendered. A vertex shader is a graphics processing function that maps vertices onto the screen and
adds special effects to objects in a 3D environment by performing mathematical operations on the objects'
vertex data. One of its purposes is to transform each vertex's 3D position in virtual space to the 2D
coordinate at which it appears on the screen, as well as a depth value for the Z-buffer, and then to apply
color to it.
Examples of vertex shading effects include matrix palette skinning, which allows programmers to create
realistic character animation with up to 32 "bones" per joint; deformation of surfaces, which gives developers
the power to create realistic surfaces such as waves; and vertex morphing, which is used to morph triangle
meshes from one shape to another, providing smooth skeletal animation. Vertex shaders are run once for
each input vertex. Although vertex shaders can manipulate properties such as position, color, and texture
coordinates, they cannot create new vertices. The output of the vertex shader goes to the next stage in the
pipeline, which is either a geometry shader, if present, or the rasterizer otherwise.
Geometry Shading Geometry shading is the stage of the graphics pipeline after vertex shading. Its
purpose is to enhance the details and accuracy of the 2D image by working at a coarser degree of granularity
than individual vertices. Its inputs are geometric primitives consisting of more than one vertex, such
as lines and triangles. A geometry shader can take as its input, for example, the three points of a triangle
and output intermediate points that can be used to refine the surface. It can only do this because it operates
on whole primitives rather than single vertices. The geometry shader can also modify the positions and
orientation of the primitives.
Rasterization The word "raster" was originally used for the raster scan of cathode ray tubes (CRTs), which
paint the image line by line; the term is now used to mean a grid of pixels. Rasterization is the process
of converting vectorized input to bitmap form, i.e., a 2D array of pixels. Given a triangle, for example,
represented by three vertices, it determines the locations of all pixels and pixel fragments that lie inside,
or on the edge of, this triangle. There can be pixel fragments because a triangle is a mathematical object
consisting of lines of infinitely small width that can intersect pixels rather than lie between them. Therefore,
a single pixel can lie partly on each side of a line segment. The fragment of a pixel on the inside edge of a
triangle's perimeter is a pixel fragment belonging to that triangle.
The rasterization stage actually does other processing as well. Typically it also does clipping, meaning
throwing away regions that are outside of the view frustum. A frustum is a 3D shape that can be
described as a 4-sided pyramid with its top lopped off. It is the 3D analog of a trapezoid. When you view
a scene in perspective, you are usually looking at a frustum lying on its side. The front plane is the bottom
of the pyramid, and the back plane is the lopped-off top. This is the view frustum.
The rasterization step typically does z-culling of pixels. As it is generating pixels, it can discover that some
of them are behind others and should be culled. When this step is finished, what is left are the visible pixels
and pixel fragments, represented by their screen positions and their depth values. These are passed to the
pixel shader.
Pixel Shading A pixel shader is a function that computes the color and other attributes of each pixel or
pixel fragment. Pixel shaders range from always outputting the same color, to applying a lighting value, to
doing bump mapping, shadows, specular highlights, translucency, and other phenomena. They can alter the
depth of the pixel (for Z-buffering), or output more than one color if multiple render targets are active. A
pixel shader alone cannot produce very complex effects, because it operates only on a single pixel, without
knowledge of a scene's geometry or of neighboring pixels.
The following program, taken from the Patterson-Hennessy text, can give you an idea of the type of work
that a pixel shader can do. It is written in Cg. Each pixel is run through this program.
void reflection(
    float2 texCoord       : TEXCOORD0,
    float3 reflection_dir : TEXCOORD1,
    uniform float shiny,
    uniform sampler2D surfaceMap,
    uniform samplerCUBE envMap,
    out float4 color      : COLOR )
{
    // Fetch the surface color from a texture
    float4 surfaceColor = tex2D(surfaceMap, texCoord);
    // Fetch the reflected color by sampling the cube map in the
    // direction of the reflection
    float4 reflectedColor = texCUBE(envMap, reflection_dir);
    // Output a weighted average of the two colors
    color = lerp(surfaceColor, reflectedColor, shiny);
}
The first input parameter, texCoord, is the (x, y) position of the pixel. This is used as an index into a
texture map in order to apply a particular texture to that pixel. The texture map, surfaceMap, is a large
2D array that is passed to all pixels in the fourth parameter. The second input parameter, reflection_dir,
is a 3D vector that represents the direction of the view with respect to the surface. To understand this,
imagine that you could draw a line between the pixel and your eye. Next, rotate the plane in which the
pixel lies so that it is the horizontal plane, and then translate the plane so that the pixel is at the origin.
Assume that the z-axis is upward and x and y are in the horizontal plane. The line from the pixel to your
eye is a vector called the reflection direction. It is the direction in which the light will bounce from the
surface to your eye. The envMap parameter stores the colors that each face of a cube would have given the
lighting conditions for the surface. Those colors are used to determine what color the reflected light would
be, assuming the pixel did not absorb any of the light's spectrum.
The reflection function calls two functions, tex2D() and texCUBE(), to compute the surface color and reflected
color, and then uses a weighted average of these two colors to return the value by which to color this particular
pixel. The weighted average function is lerp(). Its third parameter is a floating-point value that represents
the weight to apply to each of the two colors. lerp is short for linear interpolation, which, given two
values a1 and a2 and a constant 0 ≤ c ≤ 1, produces the value (1 − c)a1 + ca2. When c = 0, this evaluates
to a1, and when c = 1, it evaluates to a2. For all values in between, the distance of the interpolated point
from a1 is proportional to c, and hence the name. Linear interpolation is used more generally as follows. If
f(x) is a function of x whose values are either hard to compute or unknown, and an approximation or
estimate of it is needed, and two of its values are known, say at points x1 and x2, then a linear interpolation
of f(x) between the pair of points is given by (1 − c)f(x1) + cf(x2), 0 ≤ c ≤ 1, which in effect treats the
function as a linear function (a straight line) between the points. The approximation to f, denoted f̂(x),
would be defined by letting c = (x − x1)/(x2 − x1).
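A minimal C sketch of lerp and of its use to approximate a function between two known values (a scalar
version; the Cg built-in behaves the same way on each component):

/* Linear interpolation: (1 - c)*a1 + c*a2 for 0 <= c <= 1. */
float lerp(float a1, float a2, float c)
{
    return (1.0f - c) * a1 + c * a2;
}

/* Approximate f(x) between two known values f(x1) and f(x2). */
float approx_f(float x, float x1, float fx1, float x2, float fx2)
{
    float c = (x - x1) / (x2 - x1);
    return lerp(fx1, fx2, c);
}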
A discrete GPU is one that sits on a card that is plugged into the PCI-Express interconnect bus. In
contrast, a motherboard-GPU is integrated into the chipset on the motherboard. Tesla-based GPUs can
have from 1 to 16 nodes, each called a streaming multiprocessor (SM). NVIDIA uses the terms "node"
and "streaming multiprocessor" interchangeably. The largest version available in 2008 was the GeForce 8800
GTX, with 16 SMs, each containing 8 multithreaded single-precision floating-point and integer processing
units, which used to be called streaming processors (SPs) but are now called cores. The clock rate was
1.35 GHz. The later Fermi-based GPUs support 32 cores per streaming multiprocessor, implying that they
have up to 512 cores. The newest architecture, at the time of this writing, is the Kepler, which has up to 15
Streaming Multiprocessor (SMX) units, each of which has up to 192 single-precision CUDA cores, with each
core having fully pipelined floating-point and integer arithmetic logic units. Figure 6 shows the architecture
of a single Kepler SMX processor.
To give you some idea of the power of these GPUs, consider the fact that the GeForce 8800 GTX has a
single-precision multiply-add instruction. This means that the add and multiply operations take place in a
single instruction cycle. Given the preceding parameters, we can compute the peak performance of the 8800
GTX, i.e., the performance when all processors are kept busy all of the time:

    16 SMs × 8 cores/SM × 2 floating-point operations/cycle × 1.35 GHz = 345.6 GFLOPS

With the same clock speed, the Kepler GK110, with 15 SMXs of 192 cores each, would have a peak performance
of almost 8 Teraflops!
Each of the 16 SMs of the GeForce 8800 has a local store with a capacity of 16 KB as well as 8192 32-bit
registers. The memory shared by the SMs is partitioned into six partitions of 900 MHz GDDR3 DRAM, each
with an 8-byte-wide datapath and 128 MB per partition. Therefore, there is a total of 768 MB of memory,
and the peak bandwidth is
    6 partitions × 8 bytes × 2 transfers/cycle × 0.9 GHz = 86.4 GB/second
Figure 4 depicts the architecture of an NVIDIA GeForce 8800 that has 14 streaming multiprocessors, dis-
tributed into 7 pairs of SMs. Each SM pair is integrated into a unit called a texture/processor cluster
(TPC) that contains a shared geometry controller, a streaming multiprocessor controller (SMC), a shared
texture unit and cache, and shared lines for load/store and I/O, as shown in Figure 5. The SMC controls
access to the shared texture unit, load/store path, and I/O path. Each TPC is connected to an interconnection
network that connects it to the device memory, L2 cache (containing textures), raster processors, and
the interface to the actual display device, as well as to the bridge to the CPU and system memory.
Fig. 6: The SMX architecture of the Kepler GK110 (from the NVIDIA Kepler GK110 whitepaper).
Each SM in the Tesla GPUs contains 8 cores (32 in the Fermi GPUs), as well as its own local shared
memory, an instruction cache, a constant cache, a multithreaded instruction unit, and two special function
units (SFUs). The special function units compute special functions such as the transcendental functions
(e.g., trigonometric functions), reciprocals, and square roots.
A core (SP) is the primary thread processor. In the GeForce 8800, each core is a multithreaded processor
supporting 96 threads, with a register file containing 1024 scalar 32-bit registers for the use of these threads.
The processor is fully pipelined and implements all 32-bit and 64-bit integer arithmetic, comparison, conversion,
and logical PTX instructions (PTX instructions are parallel thread execution instructions; each such instruction
is issued to multiple threads simultaneously), as well as IEEE 754 standard single-precision floating-point operations
and a compatible add/multiply operation. Earlier GPUs executed vector instructions, but the later GPUs
were designed so that core processors executed ordinary scalar instructions.
Fig. 7: Die photo of the Kepler GK110 with 15 SMX processors, and overlay showing locations.
Until recently, there were two different ways to use a GPU. The earlier method was called General Purpose
computing on a GPU (GPGPU) and evolved as a means to do general-purpose computing by using the
GPU's graphics API to do non-graphics tasks. Subsequently, as GPUs included more general-purpose
instructions in their instruction sets, it became possible to write non-graphics programs using a parallel
programming language and API. This paradigm was called GPU computing.
In 2007, NVIDIA released a software architecture and computational model to make it easier to write C
or C++ programs that could exploit the high degree of parallelism in the GPU. They called this model
Compute Unified Device Architecture, or CUDA for short. CUDA consists of a software library and a
compiler that maps the enhanced C/C++ code into instructions for the GPU and CPU. It essentially allows
the programmer to write highly parallel C/C++ programs by using additional data types and functions
provided by the library. In fact, CUDA can be used to write parallel programs for multi-core CPUs as
well.
The computational hierarchy ranges from threads at the lowest level, to thread blocks, which are groups of
threads, to grids, which are groups of blocks. Each level has an associated memory that maps naturally to
the physical memories inside the GPU. Similarly, there are methods of barrier synchronization that can be
used with threads or with thread blocks. Figure 8 shows the relationship between computational units and
their associated memories.
CUDA threads may access data from multiple memory spaces during their execution, as illustrated by Figure
8. Each thread has private local memory, called per-thread memory. Each thread block has shared
memory visible to all threads of the block and with the same lifetime as the block. This is called per-block
shared memory in the figure. All threads have access to the same global memory.
There are also two additional read-only memory spaces accessible by all threads: the constant and texture
memory spaces. (In the Kepler models, each SMX has a separate 48 KB read-only data cache accessible to
all threads in a thread block.) We will see later that the global, constant, and texture memory spaces are
optimized for different memory usages. The global, constant, and texture memory spaces are persistent across
invocations of threads within the same application. In other words, these three memory spaces retain their
data as threads are created and destroyed. A large part of the challenge of GPU computing is understanding
the capacities and methods of access of each different type of memory in order to maximize performance.
A program consists of different kinds of data and functions. Certain variables and data are local to a single
thread, others to thread blocks, and others to the entire application. Similarly, special functions called
kernels are functions that are executed in parallel by all threads. An example of a kernel is the following,
which adds two arrays A and B of size N and stores the result into array C:
__global__
void VecAdd(float* A, float* B, float* C )
{
int i = threadIdx.x;
C[i] = A[i] + B[i];
}
The qualifier __global__ is a CUDA extension to C that identifies the function as a kernel function, which
means that every thread executes it. Each thread is given a unique integer identifier, which in this case is
stored in the variable threadIdx.x. To call this function from the main program, one would use the syntax
int main()
{
...
// Kernel invocation with one block of N threads
int numBlocks = 1;
dim3 threadsPerBlock(N);
VecAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
}
The triple-angle brackets <<<...>>> identify the execution configuration of the kernel, which indicates
how many blocks per grid (up to three dimensions) and how many threads per block (up to three dimensions).
In this example, the grid consists of one block because numBlocks = 1, and the block is a set of N threads.
threadsPerBlock is declared as a dim3, which is a 3-dimensional structure having x, y, and z members; the
initialization sets x = N, and y and z default to 1.
This is the flavor of computing in CUDA, but there is much more to be said about it later. For now, the
important observation is that the CUDA programming model requires an underlying architecture that can
execute many threads extremely quickly and be able to switch among groups of threads as well. It also
has to provide the different memory spaces in an efficient way. This leads us to explore the multithreading
capabilities of the GPU. GPUs use extensive hardware multithreading for several reasons:
• To hide the latency of memory loads and texture fetches from DRAM and shared block memories.
Memory accesses can take hundreds of processor cycles. Multithreading allows the processor to switch
to another thread while one thread is waiting for a load or texture fetch to complete. The extremely
high degree of multithreading can keep many cores busy even though many threads might be stalled
waiting for memory loads, because if there are enough active threads, then the probability that there
are threads to keep all cores busy will be high.
• To support fine-grained parallel graphics shader programming models and parallel computing models.
Graphics shader programs typically execute many different stages dynamically, from vertex shading to
pixel shading. Because of this, the streaming multiprocessors are designed to execute different thread
programs concurrently.
• To virtualize the physical processors as threads and thread blocks in order to make them highly scalable.
Furthermore, each thread can have its own private registers, private memory, program counter, and
thread execution state, and can execute its own independent code sequence. To make all of this
possible, the GPU multiprocessor is hardware multithreaded, managing hundreds of threads without
scheduling overhead. Threads within a thread block can synchronize with each other using a barrier
synchronization instruction (like the one we saw in Chapter 7 notes).
• To simplify the parallel programming model so that the programmer only has to write serial kernel
functions.
All of the concurrency in the CUDA extensions to C and C++ takes place in kernel functions. This
has the potential to simplify the logic of many data parallel programs.
Each SP core has scalar integer and floating-point arithmetic units that execute most of its instructions.
It is hardware multithreaded and can support anywhere from 32 to 96 threads, depending on the model.
The SP core is also pipelined, and can run several threads concurrently. The program declares how many
of the SP's 1024 registers each thread needs. If each thread uses few registers, then more threads can run
concurrently. If each thread uses many registers, then fewer threads can run at a time. For example, a pixel
shader program usually uses no more than 16 registers per thread, so the 1024 registers can be allocated to
64 threads, which implies that each SP core can run 64 pixel shader threads concurrently.
The SM uses a model something like SIMD, called Single Instruction Multiple Thread, or SIMT for
short. It creates groups of threads called warps². A warp is a group of 32 threads scheduled on a single
SM together. A single SM will generally execute multiple warps, interleaving their executions to hide stalls,
as shown in Figure 10. A thread block is never smaller than a warp, though thread blocks can consist of
multiple warps. Individual threads composing a warp start together at the same program address, but they
have their own instruction address counter and register state and are therefore free to branch and execute
independently.

² The term warp originates from weaving, the first parallel thread technology. A half-warp is either the first or second half
of a warp. A quarter-warp is either the first, second, third, or fourth quarter of a warp.
In the Tesla, warps are scheduled onto the cores in units of four clock cycles; a warp of 32 threads is
distributed among 8 cores with four threads per core. Over four clock cycles, each of the four threads
executes an instruction, so that after four clock cycles, 32 threads have executed one instruction each. In the
Fermi architecture, each SM has two warp schedulers and two instruction dispatch units, allowing two warps
to be issued and executed concurrently. Fermi's dual warp scheduler selects two warps and issues one
instruction from each warp to a group of sixteen cores, sixteen load/store units, or four SFUs. Running two
warps in parallel increases the ability to hide stalls due to memory accesses. In the Kepler architecture, each
SMX has four warp schedulers and eight instruction dispatch units, allowing four warps to be issued and
executed concurrently. Kepler's quad warp scheduler selects four warps, and two independent instructions
per warp can be dispatched each cycle. Figure 11 depicts one of the four warp schedulers' actions over time.
Unlike Fermi, which did not permit double-precision instructions to be paired with other instructions, the
Kepler GK110 allows double-precision instructions to be paired with other instructions.
The SM tries to identify which threads can execute simultaneously and schedules them dynamically. The
processors can handle data hazards and conditional instructions by allowing threads to follow different paths.
Because some threads are executing different paths than others, certain threads will have to be inactive at
times (otherwise they would be executing the wrong instructions). Thus, when branch conditions evaluated
on different data cause different paths to be followed, the throughput is diminished. When the threads
rejoin, they start executing in unison again. If the paths through the code on different branches are the same
length, the performance loss is not as great as when they are of unequal lengths.
The Tesla uses fine-grained multithreading to schedule 24 warps over time, which run in blocks of four cycles
each. The Tesla will not switch threads more frequently than every two clock cycles. Figure 12 illustrates
how warps are scheduled on a single Tesla SM. As noted above, each warp is distributed across eight cores.
Since there are 32 threads, at best it will take 4 cycles to execute all threads of the warp. It will take
longer if there are many conditional branches and consequent inactive threads.
The SM's SIMT multithreaded instruction unit picks a warp that is ready to execute its next instruction
(because it has enough active threads) and issues that instruction to the active threads in that warp. There
may be warps of different types being executed concurrently in a single SM; a warp of pixels may execute
concurrently with a warp of vertices.
Thus, from a large-scale view, it is MIMD. Each Tesla processor consists of eight streaming processors with
a SIMD architecture. Each streaming processor executes the same instruction on different data. Within the
processor there is a shared memory that each streaming processor can access.
Although the hardware is SIMD, the Tesla programmer interface creates the illusion that the multiprocessor
is MIMD. It achieves this by making certain threads inactive when they are not supposed to execute a given
instruction, and by highly parallel fine-grained multithreading. At its best, all 32 threads in a warp are
busy all of the time. The hardware is more like SIMT (single instruction, multiple thread) than SIMD. But
if the programmer does not write the code carefully, the machine will not take advantage of the maximum
parallelism possible.
Another issue is a practical one: how does one actually write general-purpose programs that can be run on
a GPU? There are different approaches to this problem, but the easiest solution is to use NVIDIA's CUDA
extension to the C/C++ language. These notes present an overview of CUDA Version 4.0. They are not
intended as a reference manual nor as a technical guide. Most of the material comes from the NVIDIA
CUDA C Programming Guide, Version 4.0.
CUDA extends C by adding constants, types, and functions that expose the capabilities of a GPU. As noted
earlier, it consists primarily of three key abstractions:
• a hierarchy of thread groups,
• shared memories, and
• barrier synchronization,
that allow the programmer to write programs that have fine-grained data parallelism and thread parallelism,
as well as coarse-grained data parallelism and task parallelism.
For the purpose of writing correct programs, it is enough to learn the syntax and semantics of a relatively
small subset of CUDA, but for writing programs with optimal performance, it is important to understand
the underlying execution model and the memory model. We start with an overview of the memory model
and then look at some of the details of CUDA.
Global memory instructions support reading or writing words of size equal to 1, 2, 4, 8, or 16 bytes. Any
access (via a variable or a pointer) to data residing in global memory compiles to a single global memory
instruction if and only if the size of the data type is 1, 2, 4, 8, or 16 bytes and the data is naturally aligned
(i.e. its address is a multiple of that size). If the read (or write) requests from multiple threads can be made
to addresses in global memory that satisfy these constraints, then the reads (or writes) are coalesced into a
single instruction of higher bandwidth. This will be illustrated by example below.
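As a minimal sketch of the idea (the kernels and sizes here are illustrative assumptions): when the threads of
a warp access consecutive, naturally aligned words, the accesses fall into a small number of segments and are
coalesced, whereas a large stride between the addresses touched by adjacent threads spreads the accesses over
many segments.

// Coalesced: adjacent threads read adjacent 4-byte floats.
__global__ void copy_coalesced(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Not coalesced: adjacent threads read floats that are 'stride' words apart.
__global__ void copy_strided(const float* in, float* out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}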
To achieve high bandwidth, shared memory is divided into equally-sized memory modules, called banks ,
which can be accessed simultaneously. Any memory read or write request made of n addresses that fall in
n distinct memory banks can therefore be serviced simultaneously, yielding an overall bandwidth that is n
times as high as the bandwidth of a single module.
However, if two addresses of a memory request fall in the same memory bank, there is a bank conflict and the
access has to be serialized. The hardware splits a memory request with bank conflicts into as many separate
conflict-free requests as necessary, decreasing throughput by a factor equal to the number of separate memory
requests. If the number of separate memory requests is n, the initial memory request is said to cause n-way
bank conflicts.
To get maximum performance, it is therefore important to understand how memory addresses map to memory
banks in shared memory in order to schedule the memory requests so as to minimize bank conflicts.
For devices of compute capability 1.x, shared memory has 16 banks that are organized such that successive
32-bit words are assigned to successive banks, i.e., interleaved. Each bank has a bandwidth of 32 bits per two
clock cycles. A shared memory request for a warp is split into two memory requests, one for each half-warp,
that are issued independently. As a consequence, there can be no bank conflict between a thread belonging
to the first half of a warp and a thread belonging to the second half of the same warp.³ Again, this is best
illustrated with an example.
³ For devices of compute capability 2.x, shared memory has 32 banks. Therefore, unlike for devices of lower compute
capability, there may be bank conflicts between a thread belonging to the first half of a warp and a thread belonging to the
second half of the same warp.
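Here is a small sketch for a compute-capability 1.x device (16 banks, successive 32-bit words in successive
banks); the kernel and sizes are illustrative assumptions. Within a half-warp, reads at a stride of one word hit
16 distinct banks and are conflict-free, while reads at a stride of two words map pairs of threads to the same
bank, causing 2-way bank conflicts.

// Launch with at most 256 threads per block so the indices stay in bounds.
__global__ void bank_demo(float* out)
{
    __shared__ float buf[512];
    buf[2 * threadIdx.x]     = (float) threadIdx.x;
    buf[2 * threadIdx.x + 1] = 0.0f;
    __syncthreads();

    float a = buf[threadIdx.x];        // stride 1: conflict-free within a half-warp
    float b = buf[2 * threadIdx.x];    // stride 2: 2-way bank conflicts

    out[blockIdx.x * blockDim.x + threadIdx.x] = a + b;
}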
For each of the basic types char, uchar, int, uint, short, ushort, long, ulong, longlong, ulonglong, and float:
if X is one of these types, then there are vector types X1, X2, X3, and X4. For example, there are types
int1, int2, int3, and int4. The term vector type is a misnomer; these are not vectors, but structures.
The 1st, 2nd, 3rd, and 4th components are accessible through the members x, y, z, and w, respectively. For
example, if we declare

uint3 point;

then point.x, point.y, and point.z are the members of point. There are also double1 and double2, but
not double3 or double4.
The type dim3 is an extension of uint3. It is used to specify the dimensions of things such as thread
blocks and grids. Unlike uint3, though, when an object of type dim3 is declared, its constructor initializes
all uninitialized components to the value 1. For example,
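dim3 gridSize(4, 3);    // gridSize.x == 4, gridSize.y == 3, gridSize.z is set to 1
dim3 blockSize(16);     // blockSize.x == 16; blockSize.y and blockSize.z are set to 1

(The variable names above are arbitrary, chosen only for illustration.)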
1. An automatic variable declared in device code without any of the qualifiers __device__, __shared__,
and __constant__ usually resides in a register.
2. The __device__ qualifier declares a variable that resides on the device (meaning global memory), has
the lifetime of an application, and is accessible from all the threads within the grid and from the host
through the runtime library (with specific functions designed to allow the host program to access the
GPU's device memory).
3. The __shared__ qualifier, optionally used together with __device__, declares a variable that resides
in the shared memory space of a thread block, has the lifetime of the block, and is accessible only from
the threads within the block.
4. The __constant__ qualifier, optionally used together with __device__, declares a variable that resides
in constant memory space, has the lifetime of an application, and is accessible from all the threads
within the grid and from the host through the runtime library (with specific functions).
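A short sketch of how these qualifiers might appear in device code (the names and sizes are illustrative, not
taken from these notes):

__constant__ float coeff[16];        // constant memory; lifetime of the application
__device__   float table[1024];      // global (device) memory; visible to the whole grid

// Assumes the kernel is launched with 256 threads per block.
__global__ void scaleBlock(float* data)
{
    __shared__ float tile[256];       // shared memory; one copy per thread block
    int i = threadIdx.x;              // automatic variable; usually held in a register
    int g = blockIdx.x * blockDim.x + i;

    tile[i] = data[g];                // stage the block's data in shared memory
    __syncthreads();
    data[g] = tile[i] * coeff[0] + table[i];
}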
2.5 Kernels
A kernel is a C function that is executed in parallel by more than one CUDA thread, as opposed to only once
like an ordinary C function. A kernel is defined using the __global__ declaration specifier. For example,
// kernel definition
__global__
void VecAdd(float* A, float* B, float* C )
{
int i = threadIdx.x;
C[i] = A[i] + B[i];
}
specifies to the compiler that VecAdd() is a kernel. The number of threads that execute the kernel is
determined by how the function is called. CUDA has a special execution configuration syntax for this
purpose. Any call to a __global__ function must specify the execution configuration for that call. The
execution configuration defines the dimension of the grid and blocks that will be used to execute the function
on the device. The syntax of the call is

KernelFunction<<<Dg, Db, Ns, S>>>(argument_list);

where
• Dg is of type int or dim3 and specifies the dimension and size of the grid, such that Dg.x * Dg.y *
Dg.z equals the number of blocks being launched.⁴
• Db is of type int or dim3 and specifies the dimension of the block. Db.x * Db.y * Db.z equals the
number of threads being launched in each block.
• Ns is of type size_t and specifies the number of bytes in shared memory that is dynamically allocated
per block for this call in addition to the statically allocated memory; this dynamically allocated memory
is used by any of the variables declared as an external array. Ns is an optional argument which defaults
to 0.
• S denotes a cudaStream_t, which we will ignore here. It is an optional argument that defaults to 0.
int main()
{
...
// Kernel invocation with N threads
VecAdd<<<1, N>>>(A, B, C);
}
In this case, Dg = 1 and Db = N, so there is a single block containing N threads.
Each thread that executes the kernel is given a unique thread ID that is accessible within the kernel through
the built-in threadIdx variable. The threadIdx variable is of type uint3, so (threadIdx.x, threadIdx.y,
threadIdx.z) are the coordinates of the thread within the block. Coordinates are Cartesian, not matrix. In
other words, the x coordinate is the column position and the y coordinate is the row position.
Each block is given a unique block ID that is accessible within the kernel through the built-in blockIdx
variable. The blockIdx variable is of type uint3, so (blockIdx.x, blockIdx.y, blockIdx.z) are the
coordinates of the block within the grid.
The dimensions of a block are specified with up to three dimensions and are accessible to each thread in the
kernel through the built-in blockDim variable, which is of type dim3. (The dimensions of the grid are
similarly available through the built-in gridDim variable.)
To illustrate, suppose we want to process a matrix M that has 32 columns and 18 rows by dividing it into a
grid that has 4 columns and 3 rows of blocks, as shown in Figure 13. Each block will consist of 6 rows and
8 columns of unique threads. The main program would call our kernel using the execution configuration

dim3 gridSize(4,3);
dim3 blockSize(8,6);
float A[18][32];
ProcessMatrix<<<gridSize, blockSize>>>(A);

⁴ Dg.z must be equal to 1 for devices of compute capability 1.x.
// kernel definition
__global__
void ProcessMatrix(float** A )
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    // process A[row][col]
}
Figure 13 shows how matrix element M[9][17] is processed by the thread with id (1,3) in the block with
id (2,1): its row is blockIdx.y * blockDim.y + threadIdx.y = 1*6 + 3 = 9 and its column is
blockIdx.x * blockDim.x + threadIdx.x = 2*8 + 1 = 17.
Fig. 13: Decomposition of a matrix into a grid of thread blocks. In this example, the matrix is 18 rows by
32 columns. Each thread block has 8 columns and 6 rows of threads. A thread id within a block
is (x, y), 0 ≤ x ≤ 7, 0 ≤ y ≤ 5. The figure shows how the thread with id (1,3) within the block with
id (2,1) is associated with the matrix entry (17,9). Matrix elements are referred to using Cartesian
coordinates: column first, then row.
Matrix multiplication is not commutative; AB is not the same as BA in general, so when we say that we
are multiplying a matrix A on the right by matrix B, we mean the product AB. We can multiply a matrix A
that is r × s on the right by a matrix B only if B has s rows. In other words, B must be of dimensions s × t
for some t. The formal definition is that if A is an r × s matrix and B is an s × t matrix, then the product
C = AB is the r × t matrix whose entries are defined as

    Ci,j = Σ (k = 1 to s) Ai,k · Bk,j,    for 1 ≤ i ≤ r, 1 ≤ j ≤ t
In other words, Ci,j is the dot product (inner product) of the ith row of A and the jth column of B. The
dot product is only defined if these are the same length, which is why the width of A must match the height
of B. Below is an example.
    [  1   3 ]                    [ 13  11   9   7 ]
    [ -1   2 ]  ·  [ 1 2 3 4 ]  = [  7   4   1  -2 ]
    [ -2   1 ]     [ 4 3 2 1 ]    [  2  -1  -4  -7 ]
The matrices are passed as dynamic arrays because otherwise one dimension of each matrix would have to
be a fixed size, making it a rather useless function. Client code that allocated the matrices and called this
function might look like the following:

multiply(A, B, C, Aheight, Awidth, Bwidth);
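The listing of multiply() itself is not shown above; a plain C version consistent with that call, with each matrix
passed as a dynamically allocated one-dimensional array in row-major order, might look like this (a sketch, not
the notes' original listing):

/* C = A*B, where A is Aheight x Awidth and B is Awidth x Bwidth.
   Each matrix is a dynamically allocated 1D array in row-major order. */
void multiply(const float* A, const float* B, float* C,
              int Aheight, int Awidth, int Bwidth)
{
    for (int i = 0; i < Aheight; i++)
        for (int j = 0; j < Bwidth; j++) {
            float sum = 0.0f;
            for (int k = 0; k < Awidth; k++)
                sum += A[i * Awidth + k] * B[k * Bwidth + j];
            C[i * Bwidth + j] = sum;
        }
}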
Figure 14 illustrates how a single entry of the product is obtained. This matrix multiplication algorithm
requires O(n³) multiplications and additions. Now we see how it can be done using the GPU.
In C/C++ and in CUDA on the GPU, two-dimensional matrices are stored in row-major order. This means
that row 0 is stored first, followed by row 1, then row 2, and so on. We can therefore represent a two-dimensional
matrix in one dimension: element (row, col) of a matrix with width columns is stored at index row*width + col
of the one-dimensional array.
The listing below is not a main program, but a function that can utilize the resources of the GPU to perform
matrix multiplication. It computes the matrix product C = AB, where A and B are matrices of type float.
The macro

#define BLOCK_SIZE 16

declares the thread block size; each thread block will be a square block of 256 threads.
The Algorithm This code is an implementation of matrix multiplication that does not take advantage of
the shared memory within each SM (streaming multiprocessor). There will be a thread for each cell of the
result matrix C. Let A be an M by N matrix and B an N by R matrix. We assume that M, N, and R are
divisible by 16, the block size for this implementation.
The A and B matrices are copied from host memory to device memory. There, a kernel is run on a grid of
blocks of threads, each of which computes a single cell of C. To compute a single cell C[i][j] of C, a thread has
to read the ith row of A and the jth column of B, each of size N. Many threads read the ith row of A and
many read the jth column of B, but only one thread reads that particular row-column pair.
To be clear, each row of A is read R times (for each column of B) and each column of B is read M times
(for each row of A). Therefore, A is read R times from global memory and B is read M times. The basic
steps are
1. Allocate memory on the device for matrix A, and copy A onto the device's global memory.
2. Do the same for matrix B.
3. Allocate space on the device for C, the matrix product.
4. Compute the shape of the grid and the number of blocks needed to cover the product matrix C.
5. Launch the kernel with the grid and blocks.
6. When the kernel is done, copy matrix C from the device memory to host memory.
/*
   CUDA-based MATRIX MULTIPLICATION Without Using Shared Memory
*/
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime_api.h>
#include "matrixmult.h"

#define BLOCK_SIZE 16

// Forward declaration of the matrix multiplication kernel (in separate listing)
__global__ void MatMulKernel(const Matrix, const Matrix, Matrix);
/*
   Matrix multiplication - Host code
   Matrix dimensions are assumed to be multiples of BLOCK_SIZE
*/
void MatMul(const Matrix A, const Matrix B, Matrix C)
{
    size_t size;

    // Declare the matrices that will reside on the device (the GPU),
    // allocate device memory for them, and copy A and B to the device.
    Matrix d_A;
    d_A.width = A.width;  d_A.height = A.height;
    size = A.width * A.height * sizeof(float);   /* its size in bytes */
    cudaMalloc(&d_A.elements, size);
    cudaMemcpy(d_A.elements, A.elements, size, cudaMemcpyHostToDevice);

    Matrix d_B;
    d_B.width = B.width;  d_B.height = B.height;
    size = B.width * B.height * sizeof(float);
    cudaMalloc(&d_B.elements, size);
    cudaMemcpy(d_B.elements, B.elements, size, cudaMemcpyHostToDevice);

    // Allocate matrix C on the device
    Matrix d_C;
    d_C.width = C.width;  d_C.height = C.height;
    size = C.width * C.height * sizeof(float);
    cudaMalloc(&d_C.elements, size);

    // Execution configuration: one thread per element of C
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
    /* Now the kernel is invoked on this grid of blocks of threads */
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

    // Copy the result matrix C from device memory back to host memory
    cudaMemcpy(C.elements, d_C.elements, size, cudaMemcpyDeviceToHost);

    /*
       cudaError_t cudaFree(void* devPtr)
       This frees the memory space pointed to by devPtr, which must have been
       returned by a previous call to cudaMalloc() or cudaMallocPitch().
       Otherwise, or if cudaFree(devPtr) has already been called before,
       an error is returned. If devPtr is 0, no operation is performed.
    */
    cudaFree(d_A.elements);
    cudaFree(d_B.elements);
    cudaFree(d_C.elements);
}
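The kernel launched above is described in the next paragraphs, but its listing does not appear here. A sketch
consistent with that description (one thread per element of C, reading A and B directly from global memory,
and using the width, height, and elements fields of the Matrix type shown in the shared-memory version later)
is:

__global__ void MatMulKernel(const Matrix A, const Matrix B, Matrix C)
{
    // Each thread computes one element of C, accumulating the dot product
    // of one row of A and one column of B in Cvalue.
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    // Terminate threads that fall outside the product matrix (this happens
    // only in blocks that overhang its right or bottom edge).
    if (row >= C.height || col >= C.width)
        return;

    float Cvalue = 0.0f;
    for (int k = 0; k < A.width; ++k)
        Cvalue += A.elements[row * A.width + k] * B.elements[k * B.width + col];

    // Copy the accumulated sum into the product matrix in global memory.
    C.elements[row * C.width + col] = Cvalue;
}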
The if statement terminates the thread if its row or column places it outside the bounds of the product
matrix. This will happen only in those blocks that overhang either the right or bottom side of the matrix.
The loop that follows runs over the entries of the row of A and the column of B (these have the same size)
needed to compute the (row, col)-entry of the product, and the sum of these products is accumulated in the
Cvalue variable. The last line of the kernel copies the sum of the products into the appropriate element of
the product matrix C, in the device's global memory.
Notice that each thread makes two reads from global memory on every iteration of its
loop, one from matrix A and one from matrix B. Since accesses to global memory are relatively slow, this
slows down the kernel code, leaving many threads idle for hundreds of clock cycles.
One way to reduce the number of accesses to global memory is to have the threads load portions of matrices
A and B into shared memory, where they can access them much more quickly. The problem is that shared
memory is not large enough to store two large matrices. Devices of compute capability 1.x have 16 KB of
shared memory per multiprocessor, and devices of compute capability 2.x have 48 KB.
So instead, portions of A and B are loaded into shared memory as needed, and used as efficiently as
possible while they are loaded. Figure 15 shows how the matrix product can be computed in a block-
structured way. Matrix A is shown on the left and matrix B is shown at the top, with matrix C, their
product, at the bottom-right. Each element of C is the dot product of the row to its left in A and the column
above it in B. The matrices A and B are partitioned into 16 by 16 submatrices (BLOCK_SIZE = 16).
Each thread block is responsible for computing one square sub-matrix (Csub in the code) of the product
matrix C, and each thread within it is responsible for computing one element of that sub-matrix. (The
yellow square in matrix C represents the submatrix computed by a thread block, whereas the red box inside
the yellow square represents a single entry in C, which is computed by a single thread within that block.)
The thread responsible for this entry computes its value by computing the dot product of the red row of A
and the red column of B. Unfortunately, it is not as simple as this, because the thread must do this in pieces,
as not all of its input data will be in shared memory at the same time.
Csub is equal to the product of two rectangular matrices: the sub-matrix of A of dimension A_width x
BLOCK_SIZE that has the same row indices as Csub, and the sub-matrix of B of dimension BLOCK_SIZE x
A_width that has the same column indices as Csub. These are the yellow rectangular regions within the A
and B matrices in Figure 15. These regions will not all be in shared memory together. But notice that the
red row and red column pass through the same number of submatrices, since they are of equal length. This
leads to the idea: if we load the left-most of those submatrices of matrix A into shared memory, and the
top-most of those submatrices of matrix B into shared memory, then we can compute the first BLOCK_SIZE
products of the dot product entirely from shared memory. Every thread in the thread block can do this
for its own little red square from the figure, and because those submatrices are in shared memory, this
is very fast.
When this is done, we no longer need the left-most block of A and the top-most block of B. We load from
global memory to shared memory the block of A to the right of the previous one and the block of B below
the previous one, and repeat the above step. Each thread computes the dot product of the next BLOCK_SIZE
entries. This gets added to the running total that the thread is maintaining for its (i,j) entry. This process
continues until the entire row of A has been multiplied by the entire column of B. When this is finished, the
resulting square submatrix is written back to global memory.
To make the copying from global to shared memory efficient, each thread is responsible for copying a single
element from each of the A and B matrices. The copying is done in such a way as to maximize the memory
bandwidth, which will be explained within the listing below.
// Matrices are stored in row-major order:
//     M(row, col) = *(M.elements + row * M.stride + col)
// The stride of a matrix is the number of elements from the start of a row to
// the start of the next row. The stride is not necessarily equal to the width;
// the stride can be larger so that rows are aligned in shared memory to
// achieve good bandwidth.
typedef struct {
    int width;        // width of matrix
    int height;       // height of matrix
    int stride;       // number of elements from start of row k to start of row k+1
    float* elements;  // pointer to actual matrix data
} Matrix;
// Thread block size
#define BLOCK_SIZE 16

// The __global__ qualifier declares a function as being a kernel.
// A kernel is a function that is executed on the device (the GPU), and
// is callable from the host (CPU).
// This is a forward declaration of the device multiplication function
__global__ void Muld(float*, float*, int, int, float*);
// Host multiplication function
// Compute C = A * B
//   height_A is the height of A
//   width_A  is the width of A
//   width_B  is the width of B
void Multiply(const float* A,
              const float* B,
              int height_A,
              int width_A,
              int width_B,
              float* C)
{
    int size;
    /*
       Copy matrices A and B to the device memory.
       Copying requires cudaMemcpy, which can copy either from host
       to device, from device to host, or from device to device.
    */
    float* A_on_device;
    size = height_A * width_A * sizeof(float);
    cudaMalloc((void**)&A_on_device, size);
    cudaMemcpy(A_on_device, A, size, cudaMemcpyHostToDevice);

    float* B_on_device;
    size = width_A * width_B * sizeof(float);
    cudaMalloc((void**)&B_on_device, size);
    cudaMemcpy(B_on_device, B, size, cudaMemcpyHostToDevice);

    // Allocate matrix C on the device
    float* C_on_device;
    size = height_A * width_B * sizeof(float);
    cudaMalloc((void**)&C_on_device, size);
    // Compute the execution configuration assuming
    // the matrix dimensions are multiples of BLOCK_SIZE.
    // The dim3 declaration is used here. This specifies that dimBlock
    // is BLOCK_SIZE x BLOCK_SIZE x 1, and dimGrid has one block per
    // BLOCK_SIZE x BLOCK_SIZE tile of the product matrix C.
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(width_B / dimBlock.x, height_A / dimBlock.y);

    // Launch the device computation
    Muld<<<dimGrid, dimBlock>>>(A_on_device,
                                B_on_device,
                                width_A,
                                width_B,
                                C_on_device);

    // Copy the result matrix C from device memory back to host memory
    cudaMemcpy(C, C_on_device, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(A_on_device);
    cudaFree(B_on_device);
    cudaFree(C_on_device);
}
/*
   Note again the __global__ qualifier: this is a kernel function.
   This also means that every thread executes this function.
   When a thread executes this function, it has a specific thread id
   and block id. The thread id is the value of threadIdx, used below,
   and the block id is stored in blockIdx, used below.
   threadIdx and blockIdx are each of type uint3.
*/
__global__
void Muld(float* A, float* B, int width_A, int width_B, float* C)
{
    // Block index
    int block_col = blockIdx.x;
    int block_row = blockIdx.y;

    // Thread index
    int thread_col = threadIdx.x;
    int thread_row = threadIdx.y;

    // Index of the first sub-matrix of A processed by the block
    int aBegin = width_A * BLOCK_SIZE * block_row;

    // Index of the last sub-matrix of A processed by the block
    int aEnd = aBegin + width_A - 1;

    // Step sizes used to iterate through the sub-matrices of A and B
    int aStep = BLOCK_SIZE;
    int bStep = BLOCK_SIZE * width_B;

    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * block_col;

    // The element of the product computed by this thread
    float Csub = 0;

    // Loop over all the sub-matrices of A and B required to compute
    // this block's sub-matrix of C
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {

        // The __shared__ qualifier declares a variable that
        //   * resides in the shared memory space of a thread block,
        //   * has the lifetime of the block, and
        //   * is only accessible from the threads within the block.
        // The As and Bs matrices declared below are in the shared memory of
        // the block; As is for the sub-matrix of A, and Bs, a sub-matrix of B.
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
        /*
           The next step loads the sub-matrices from global memory to shared
           memory; each thread loads one element of each matrix.
        */
        As[thread_row][thread_col] = A[a + width_A * thread_row + thread_col];
        Bs[thread_row][thread_col] = B[b + width_B * thread_row + thread_col];

        // Synchronize to make sure the matrices are loaded.
        // __syncthreads() is a barrier synchronization call; all threads
        // wait here until every thread has made the call, at which point it
        // returns in each thread.
        __syncthreads();
        /*
           The loop below computes the inner product of the row of As and
           the column of Bs assigned to this thread, As[thread_row] and
           Bs[..][thread_col], i.e., the sum

             As[i][0]*Bs[0][j] + As[i][1]*Bs[1][j] + ... + As[i][BLOCK_SIZE-1]*Bs[BLOCK_SIZE-1][j]

           Each iteration executes the assignment statement

             Csub += As[thread_row][k] * Bs[k][thread_col];
        */
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[thread_row][k] * Bs[k][thread_col];
        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    // Write this thread's element of the block sub-matrix to device
    // (global) memory; each thread writes one element.
    int c = width_B * BLOCK_SIZE * block_row + BLOCK_SIZE * block_col;
    C[c + width_B * thread_row + thread_col] = Csub;
}
void initMatrix(float M[], float value, int width, int height)
{
    int i, j;
    for (i = 0; i < height; i++)
        for (j = 0; j < width; j++)
            M[j + i * width] = value;
}

void printMatrix(float M[], int width, int height)
{
    int i, j;
    for (i = 0; i < height; i++) {
        for (j = 0; j < width; j++)
            printf("%8.2f ", M[j + i * width]);
        printf("\n");
    }
}
int main()
{
    /* Matrix dimensions: assumed values (they must be multiples of BLOCK_SIZE;
       the call to Multiply below treats the matrices as square). */
    int width  = 32;
    int height = 32;
    float *A, *B, *C;

    A = (float*) calloc(width * height, sizeof(float));
    if (A == NULL)
        exit(1);

    B = (float*) calloc(width * height, sizeof(float));
    if (B == NULL)
        exit(1);

    C = (float*) calloc(width * height, sizeof(float));
    if (C == NULL)
        exit(1);

    initMatrix(A, 2.0, width, height);
    initMatrix(B, 3.0, width, height);
    initMatrix(C, 0.0, width, height);

    Multiply(A, B, width, width, height, C);

    printMatrix(C, width, height);
    return 0;
}