GPU Parallel Program Development Using CUDA
Horst Simon
Deputy Director
Lawrence Berkeley National Laboratory
Berkeley, California, U.S.A.
Tolga Soyata
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize
to copyright holders if permission to publish in this form has not been obtained. If any copyright material
has not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, trans-
mitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter
invented, including photocopying, microfilming, and recording, or in any information storage or retrieval
system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com
(http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive,
Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and regis-
tration for a variety of users. For organizations that have been granted a photocopy license by the CCC,
a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are
used only for identification and explanation without intent to infringe.
Library of Congress Cataloging-in-Publication Data
Names: Soyata, Tolga, 1967- author.
Title: GPU parallel program development using CUDA
/ by Tolga Soyata.
Description: Boca Raton, Florida : CRC Press, [2018] | Includes bibliographical
references and index.
Identifiers: LCCN 2017043292 | ISBN 9781498750752 (hardback) |
ISBN 9781315368290 (e-book)
Subjects: LCSH: Parallel programming (Computer science) | CUDA (Computer architecture) |
Graphics processing units–Programming.
Classification: LCC QA76.642.S67 2018 | DDC 005.2/75–dc23
LC record available at https://lccn.loc.gov/2017043292
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
List of Figures
1.1 Harvesting each coconut requires two consecutive 30-second tasks (threads).
Thread 1: get a coconut. Thread 2: crack (process) that coconut using the
hammer. 4
1.2 Simultaneously executing Thread 1 (“1”) and Thread 2 (“2”). Accessing
shared resources will cause a thread to wait (“-”). 6
1.3 Serial (single-threaded) program imflip.c flips a 640×480 dog picture (left)
horizontally (middle) or vertically (right). 8
1.4 Running gdb to catch a segmentation fault. 20
1.5 Running valgrind to catch a memory access error. 23
2.1 Windows Task Manager, showing 1499 threads; however, there is 0% CPU
utilization. 33
3.1 The life cycle of a thread. From the creation to its termination, a thread is
cycled through many different statuses, assigned by the OS. 60
3.2 Memory access patterns of MTFlipH() in Code 2.8. A total of 3200 pixels’
RGB values (9600 Bytes) are flipped for each row. 65
3.3 The memory map of a process when only a single thread is running within
the process (left) or multiple threads are running in it (right). 75
4.1 Inside a computer containing an i7-5930K CPU [10] (CPU5 in Table 3.1),
and 64 GB of DDR4 memory. This PC has a GTX Titan Z GPU that will
be used to test a lot of the programs in Part II. 80
4.2 The imrotate.c program rotates a picture by a specified angle. Original dog
(top left), rotated +10◦ (top right), +45◦ (bottom left), and −75◦ (bottom
right) clockwise. Scaling is done to avoid cropping of the original image area. 84
4.3 The architecture of one core of the i7-5930K CPU (the PC in Figure 4.1).
This core is capable of executing two threads (hyper-threading, as defined
by Intel). These two threads share most of the core resources, but have their
own register files. 92
4.4 Architecture of the i7-5930K CPU (6C/12T). This CPU connects to the
GPUs through an external PCI express bus and memory through the mem-
ory bus. 94
5.1 The imedge.c program is used to detect edges in the original image
astronaut.bmp (top left). Intermediate processing steps are: GaussianFilter()
(top right), Sobel() (bottom left), and finally Threshold() (bottom right). 108
6.1 Turning the dog picture into a 3D wire frame. Triangles are used to represent
the object, rather than pixels. This representation allows us to map a texture
to each triangle. When the object moves, so does each triangle, along with
their associated textures. To increase the resolution of this kind of an object
representation, we can divide triangles into smaller triangles in a process
called tessellation. 139
6.2 Steps to move triangulated 3D objects. Triangles contain two attributes:
their location and their texture. Objects are moved by performing mathe-
matical operations only on their coordinates. A final texture mapping places
the texture back on the moved object coordinates, while a 3D-to-2D transfor-
mation allows the resulting image to be displayed on a regular 2D computer
monitor. 140
6.3 Three farmer teams compete in Analogy 6.1: (1) Arnold competes alone
with his 2× bigger tractor and “the strongest farmer” reputation, (2) Fred
and Jim compete together in a much smaller tractor than Arnold. (3) Tolga,
along with 32 boy and girl scouts, compete together using a bus. Who wins? 145
6.4 Nvidia Runtime Engine is built into your GPU drivers, shown in your Win-
dows 10 Pro SysTray. When you click the Nvidia symbol, you can open the
Nvidia control panel to see the driver version as well as the parameters of
your GPU(s). 156
6.5 Creating a Visual Studio 2015 CUDA project named imflipG.cu. Assume
that the code will be in a directory named Z:\code\imflipG in this example. 172
6.6 Visual Studio 2015 source files are in the Z:\code\imflipG\imflipG direc-
tory. In this specific example, we will remove the default file, kernel.cu, that
VS 2015 creates. After this, we will add an existing file, imflipG.cu, to the
project. 173
6.7 The default CPU platform is x86. We will change it to x64. We will also
remove the GPU debugging option. 174
6.8 The default Compute Capability is 2.0. This is too old. We will change it to
Compute Capability 3.0, which is done by editing Code Generation under
Device and changing it to compute_30, sm_30. 175
6.9 Compiling imflipG.cu to get the executable file imflipG.exe in the
Z:\code\imflipG\x64\Debug directory. 176
6.10 Running imflipG.exe from a CMD command line window. 177
6.11 The /usr/local directory in Unix contains your CUDA directories. 181
6.12 Creating a new CUDA project using the Eclipse IDE in Unix. 183
7.1 The PCIe bus connects the host (CPU) to the device(s) (GPUs).
The host and each device have their own I/O controllers to allow transfers
through the PCIe bus, while both the host and the device have their own
memory, with a dedicated bus to it; in the GPU this memory is called global
memory. 205
8.1 Analogy 8.1 for executing a massively parallel program using a significant
number of GPU cores, which receive their instructions and data from differ-
ent sources. Melissa (Memory controller ) is solely responsible for bringing
the coconuts from the jungle and dumping them into the big barrel (L2$).
Larry (L2$ controller ) is responsible for distributing these coconuts into the
smaller barrels (L1$) of Laura, Linda, Lilly, and Libby; eventually, these four
folks distribute the coconuts (data) to the scouts (GPU cores). On the right
side, Gina (Giga-Thread Scheduler ) has the big list of tasks (list of blocks to
be executed ); she assigns each block to a school bus (SM or streaming mul-
tiprocessor ). Inside each bus, one person (Tolga, Tony, Tom, or Tim)
is responsible for assigning them to the scouts (instruction schedulers). 228
8.2 The internal architecture of the GTX550Ti GPU. A total of 192 GPU cores
are organized into six streaming multiprocessor (SM) groups of 32 GPU
cores. A single L2$ is shared among all 192 cores, while each SM has its
own L1$. A dedicated memory controller is responsible for bringing data in
and out of the GDDR5 global memory and dumping it into the shared L2$,
while a dedicated host interface is responsible for shuttling data (and code)
between the CPU and GPU over the PCIe bus. 230
8.3 A sample output of the imedgeG.cu program executed on the astronaut.bmp
image using a GTX Titan Z GPU. Kernel execution times and the amount
of data movement for each kernel is clearly shown. 242
9.1 GF110 Fermi architecture with 16 SMs, where each SM houses 32 cores, 16
LD/ST units, and 4 Special Function Units (SFUs). The highest end Fermi
GPU contains 512 cores (e.g., GTX 580). 264
9.2 GF110 Fermi SM structure. Each SM has a 128 KB register file that contains
32,768 (32 K) registers, where each register is 32-bits. This register file feeds
operands to the 32 cores and 4 Special Function Units (SFU). 16 Load/Store
(LD/ST) units are used to queue memory load/store requests. A 64 KB total
cache memory is used for L1$ and shared memory. 265
9.3 GK110 Kepler architecture with 15 SMXs, where each SMX houses 192
cores, 48 double precision units (DPU), 32 LD/ST units, and 32 Special
Function Units (SFU). The highest end Kepler GPU contains 2880 cores
(e.g., GTX Titan Black); its “double” version GTX Titan Z contains 5760
cores. 266
9.4 GK110 Kepler SMX structure. A 256 KB (64 K-register) register file feeds
192 cores, 64 Double-Precision Units (DPU), 32 Load/Store units, and 32
SFUs. Four warp schedulers can schedule four warps, which are dispatched
as 8 half-warps. Read-only cache is used to hold constants. 267
9.5 GM200 Maxwell architecture with 24 SMMs, housed inside 6 larger GPC
units; each SMM houses 128 cores, 32 LD/ST units, and 32 Special Function
Units (SFU), but does not contain double-precision units (DPUs). The highest
end Maxwell GPU contains 3072 cores (e.g., GTX Titan X). 268
9.6 GM200 Maxwell SMM structure consists of 4 identical sub-structures with
32 cores, 8 LD/ST units, 8 SFUs, and 16 K registers. Two of these sub-
structures share an L1$, while four of them share a 96 KB shared memory. 269
9.7 GP100 Pascal architecture with 60 SMs, housed inside 6 larger GPC units,
each containing 10 SMs. The highest end Pascal GPU contains 3840 cores
(e.g., P100 compute accelerator). NVLink and High Bandwidth Memory
List of Tables
2.1 Serial and multithreaded execution time of imflipP.c, both for vertical flip
and horizontal flip, on an i7-960 (4C/8T) CPU. 51
4.1 imrotate.c execution times for the CPUs in Table 3.1 (+45◦ rotation). 89
4.2 imrotate.c threading efficiency (η) and parallelization overhead (1 − η) for
CPU3, CPU5. The last column reports the speedup achieved by using CPU5
that has more cores/threads, although there is no speedup up to 6 launched
SW threads. 90
4.3 imrotateMC.c execution times for the CPUs in Table 3.1. 105
5.1 Array variables and their types, used during edge detection. 111
5.2 imedge.c execution times for the W3690 CPU (6C/12T). 118
5.3 imedgeMC.c execution times for the W3690 CPU (6C/12T) in ms for a vary-
ing number of threads (above). For comparison, execution times of imedge.c
are repeated from Table 5.2 (below). 126
5.4 imedgeMCT.c execution times (in ms) for the W3690 CPU (6C/12T), using
the Astronaut.bmp image file (top) and Xeon Phi 5110P (60C/240T) using
the dogL.bmp file (bottom). 134
6.1 CUDA keyword and symbols that we learned in this chapter. 170
7.1 Vflip() kernel execution times (ms) for different size images on a GTX TITAN
Z GPU. 188
7.2 Variables available to a kernel upon launch. 190
7.3 Specifications of different computers used in testing the imflipG.cu program,
along with the execution results, compiled using Compute Capability 3.0. 202
7.4 Introduction date and peak bandwidth of different bus types. 203
7.5 Introduction date and peak throughput of different CPU and GPU memory
types. 206
7.6 Results of the imflipG2.cu program, which uses the VfCC20() and PxCC20()
kernels and works in Compute Capability 2.0. 215
9.1 Nvidia microarchitecture families and their peak computational power for
single precision (GFLOPS) and double-precision floating point (DGFLOPS). 273
9.2 Comparison of kernel performances between (Hflip() and Hflip2()) as well as
(Vflip() and Vflip2()). 289
9.3 Kernel performances: Hflip(),· · · ,Hflip3(), and Vflip(),· · · ,Vflip3(). 293
9.4 Kernel performances: Hflip(),· · · ,Hflip4(), and Vflip(),· · · ,Vflip4(). 295
9.5 Kernel performances: Hflip(),· · · ,Hflip5(), and Vflip(),· · · ,Vflip5(). 297
9.6 Kernel performances: PixCopy(), PixCopy2(), and PixCopy3(). 298
9.7 Kernel performances: BWKernel() and BWKernel2(). 300
9.8 Kernel performances: GaussKernel() and GaussKernel2(). 301
10.1 Nvidia microarchitecture families and the size of global memory, L1$, L2$
and shared memory in each one of them. 305
10.2 Kernel performances: Hflip() vs. Hflip6() and Vflip() vs. Vflip6(). 309
10.3 Kernel performances: Hflip(), Hflip6(), and Hflip7() using mars.bmp. 311
10.4 Kernel performances: Hflip6(), Hflip7(), Hflip8() using mars.bmp. 313
10.5 Kernel performances: Vflip(), Vflip6(), Vflip7(), and Vflip8(). 315
10.6 Kernel performances: Vflip(), Vflip6(), Vflip7(), Vflip8(), and Vflip9(). 316
10.7 Kernel performances: PixCopy(), PixCopy2(), . . . , PixCopy5(). 317
10.8 Kernel performances: PixCopy(), PixCopy4(), . . . , PixCopy7(). 318
10.9 Kernel performances: BWKernel(), BWKernel2(), and BWKernel3(). 320
10.10 Kernel performances: GaussKernel(), GaussKernel2(), GaussKernel3() 322
10.11 Kernel performances: GaussKernel(), . . . , GaussKernel4(). 324
10.12 Kernel performances: GaussKernel1(), . . . , GaussKernel5(). 325
10.13 Kernel performances: GaussKernel3(), . . . , GaussKernel6(). 327
10.14 Kernel performances: GaussKernel3(), . . . , GaussKernel7(). 330
10.15 Kernel performances: GaussKernel3(), . . . , GaussKernel8(). 331
11.1 Runtime for edge detection and horizontal flip for astronaut.bmp (in ms). 346
11.2 Execution timeline for the second team in Analogy 11.1. 347
11.3 Streaming performance results (in ms) for imGStr, on the astronaut.bmp
image. 371
Preface
I am from the days when computer engineers and scientists had to write assembly language
on IBM mainframes to develop high-performance programs. Programs were written on
punch cards and compilation was a one-day process; you dropped off your punch-code
written program and picked up the results the next day. If there was an error, you did
it again. In those days, a good programmer had to understand the underlying machine
hardware to produce good code. I get a little nervous when I see computer science students
being taught only at a high level of abstraction, using languages like Ruby. Although abstraction
is a beautiful thing that lets you build software without getting bogged down in unnecessary details,
it is a bad thing when you are trying to develop super high performance code.
Since the introduction of the first CPU, computer architects added incredible features
into CPU hardware to “forgive” bad programming skills; while you had to order the sequence
of machine code instructions by hand two decades ago, CPUs do that in hardware for you
today (e.g., out of order processing). A similar trend is clearly visible in the GPU world.
Most of the techniques that were taught as performance improvement techniques in GPU
programming five years ago (e.g., thread divergence, shared memory bank conflicts, and
reduced usage of atomics) are becoming less relevant with improved GPU architectures,
because GPU architects are adding hardware features that mitigate these inefficiencies so
effectively that, within another 5–10 years, it may not even matter if a programmer is
sloppy about them. However, this is just a guess. What GPU architects can do depends on
their (i) transistor budget, as well as (ii) their customers’ demands. When I say transistor
budget, I am referring to how many transistors the GPU manufacturers can cram into an
Integrated Circuit (IC), aka a “chip.” When I say customer demands, I mean that even if
they can implement a feature, the applications that their customers are using might not
benefit from it, which will mean a wasted transistor budget.
From the standpoint of writing a book, I took all of these facts to heart and decided
that the best way to teach GPU programming is to show the differences among different
families of GPUs (e.g., Fermi, Kepler, Maxwell, and Pascal) and point out the trend, which
lets the reader be prepared about the upcoming advances in the next generation GPUs,
and the next, and the next . . . I put a lot of emphasis on concepts that will stay relevant
for a long period of time, rather than concepts that are platform-specific. That being said,
GPU programming is all about performance and you can get a lot higher performance if
you know the exact architecture you are running on, i.e., if you write platform-dependent
code. So, providing platform-dependent explanations is as valuable as teaching generalized GPU
concepts. I engineered this book so that the later the chapter, the more
platform-specific it gets.
I believe that the most unique feature of this book is the fact that it starts explaining
parallelism by using CPU multi-threading in Part I. GPU massive parallelism (which differs
from CPU parallelism) is introduced in Part II. Due to the way the CPU parallelism is
explained in Part I, there is a smooth transition into understanding GPU parallelism in
Part II. I devised this methodology within the past six years of teaching GPU programming;
I realized that the concept of massive parallelism was not clear to students who have never
Tolga Soyata
PART I
Understanding CPU Parallelism
CHAPTER 1
Introduction to CPU Parallel Programming
This book is a self-sufficient GPU and CUDA programming textbook. I can imagine the
surprise of somebody who purchased a GPU programming book whose first chapter
is titled "Introduction to CPU Parallel Programming." The idea is that this book expects
its readers to be proficient in a low-level programming language, like C, but not in
CPU parallel programming. To make this book a self-sufficient GPU programming resource
for somebody who meets this criterion, no prior CPU parallel programming experience
can be expected from the readers; yet it is not difficult to gain sufficient CPU parallel
programming skills within a few weeks with an introduction such as Part I of this book.
No worries, in these few weeks of learning CPU parallel programming, no time will be
wasted toward our eventual goal of learning GPU programming, since almost every concept
that I introduce here in the CPU world will be applicable to the GPU world. If you are
skeptical, here is one example for you: The thread ID, or, as we will call it tid, is the identifier
of an executing thread in a multi-threaded program, whether it is a CPU or GPU thread.
All of the CPU parallel programs we write will use the tid concept, which will make the
programs directly transportable to the GPU environment. Don’t worry if the term thread
is not familiar to you. Half the book is about threads, as they are the backbone of how CPUs
or GPUs execute multiple tasks simultaneously.
FIGURE 1.1 Harvesting each coconut requires two consecutive 30-second tasks
(threads). Thread 1: get a coconut. Thread 2: crack (process) that coconut using
the hammer.
coconuts per minute, they still have a performance improvement from 1 to 1.5 coconuts per
minute.
After harvesting a few coconuts, Jim asks himself the question: “Why do I have to wait
for Fred to crack the coconut? When he is cracking the coconut, I can immediately walk
to the tree, and get the next coconut. Since Th1 and Th2 take exactly the same amount
of time, we never have to be in a situation where they are waiting for the cracker to be
free. Exactly when Fred is back from picking the next coconut, I will be done cracking
mine, and we will both be 100% busy.” This genius idea brings them back to the
2 coconuts/minute speed without even needing an extra tractor. The big deal was that
Jim re-engineered the program, which is the sequence of the threads to execute, so the
threads are never caught in a situation where they are waiting for the shared resources inside
the core, like the cracker inside the tractor. As we will see very shortly, shared resources
inside a core include the ALU, the FPU, cache memory, and more... For now, don't worry about
these.
The two scenarios I described in this analogy are having two cores (2C), each executing a
single thread (1T) versus having a single core (1C) that is capable of executing two threads
(2T). In the CPU world, they are called 2C/2T versus 1C/2T. In other words, there are two
ways to give a program the capability to execute two simultaneous threads: 2C/2T (2 cores,
which are capable of executing a single thread each—just like two separate tractors for
Jim and Fred) or 1C/2T (a single core, capable of executing two threads—just like a single
tractor shared by Jim and Fred). Although, from the programmer’s standpoint, both of them
mean the ability to execute two threads, they are very different options from the hardware
standpoint, and they require the programmer to be highly aware of the implications of the
threads that share resources. Otherwise, the performance advantages of the extra threads
could vanish. Just to reiterate: our almighty Intel i7-5960X [11] CPU is an 8C/16T,
which has eight cores, each capable of executing two threads.
Three options are shown in Figure 1.2: (a) is the 2C/2T option with two separate cores.
(b) is the 1C/2T option with bad programming, yielding only 1.5 coconuts per minute, and
(c) is the sequence-corrected version, where the access to the cracker is never simultaneous,
yielding 2 coconuts per minute.
FIGURE 1.3 Serial (single-threaded) program imflip.c flips a 640 × 480 dog picture
(left) horizontally (middle) or vertically (right).
memory access speed, and ones that are memory intensive, which are highly sensitive to
the memory access speed, as I have just shown.
and then measure only the amount of time that we spend flipping the image. Due to the
drastic differences in the data transfer speeds of different hardware components, we need to
analyze the amount of time spent in the disk, memory, and CPU separately.
In many of the parallel programs we will develop in this book, our focus is CPU time
and memory access time, because we can influence them; disk access time (which we will
call I/O time) typically saturates even with a single thread, thereby seeing almost no benefit
from multithreaded programming. Also, make a mental note that this slow I/O speed will
haunt us when we start GPU programming; since I/O is the slowest part of a computer
and the data from the CPU to the GPU is transferred through the PCI express bus, which
is a part of the I/O subsystem, we will have a challenge in feeding data to the GPU fast
enough. But, nobody said that GPU programming was easy! To give you an idea about the
magnitudes of transfer speeds of different hardware components, let me now itemize them:
• A typical network interface card (NIC) has a transfer speed of 1 Gbps (Giga-bits-per-
second or billion-bits-per-second). These cards are called “Gigabit network cards” or
“Gig NICs” colloquially. Note that 1 Gbps is only the amount of “raw data,” which
includes a significant amount of error correction coding and other synchronization
signals. The amount of actual data that is transferred is less than half of that. Since
my goal is to give the reader a rough idea for comparison, this detail is not that
important for us.
• A typical hard disk drive (HDD) can barely reach transfer speeds of 1–2 Gbps, even if
connected to a SATA3 cable that has a peak 6 Gbps transfer speed. The mechanical
read-write nature of HDDs simply does not allow them to access the data that fast.
The transfer speed isn’t even the worst problem with an HDD, but the seek time is;
it takes the mechanical head of an HDD some time to locate the data on the spinning
platter, therefore forcing it to wait until the platter rotates to the position
where the data resides. This could take milliseconds (ms) if the data is distributed
in an irregular fashion (i.e., fragmented). Therefore, HDDs could have transfer speeds
that are far less than the peak speed of the SATA3 cable that they are connected to.
• A flash disk that is hooked up to a USB 2.0 port has a peak transfer speed of 480 Mbps
(Mega-bits-per-second or million-bits-per-second). However, the USB 3.0 standard has
a faster 5 Gbps transfer speed. The newer USB 3.1 can reach around 10 Gbps transfer
rates. Since flash disks are built using flash memory, there is no seek time, as they are
directly accessible by simply providing the data address.
• A typical solid state disk (SSD) can be read from a SATA3 cable at speeds close
to 4–5 Gbps. Therefore, an SSD is really the only device that can saturate a SATA3
cable, i.e., deliver data at its intended peak rate of 6 Gbps.
• Once the data is transferred from I/O (SSD, HDD, or flash disk) into the memory
of the CPU, transfer speeds are drastically higher. The Core i7 family, all the way up to
the sixth generation (i7-6xxx), and the higher-end Xeon CPUs use DDR2, DDR3,
and DDR4 memory technologies and have memory-to-CPU transfer speeds of 20–60
GBps (Giga-Bytes-per-second). Notice that this speed is Giga-Bytes; a Byte is 8 bits,
thereby translating to memory transfer speeds of 160–480 Gbps (Giga-bits-per-second)
just to compare readily to the other slower devices.
• As we will see in Part II and beyond, transfer speeds of GPU internal memory sub-
systems can reach 100–1000 GBps. The new Pascal series GPUs, for example, have
an internal memory transfer rate, which is close to the latter number. This translates
to 8000 Gbps, which is an order-of-magnitude faster than the CPU internal memory
and three orders-of-magnitude faster than a flash disk, and almost four orders-of-
magnitude faster than an HDD.
between the actual code that flips the image that is in memory and intentionally excluded
the I/O time.
Before reporting the elapsed time, the imflip.c program de-allocates all of the memory
that was allocated by ReadBMP() using a bunch of free() functions to avoid memory leaks.
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include "ImageStuff.h"
#define REPS 129
struct ImgProp ip;
img[row][col] = img[ip.Vpixels-(row+1)][col];
img[row][col+1] = img[ip.Vpixels-(row+1)][col+1];
img[row][col+2] = img[ip.Vpixels-(row+1)][col+2];
img[ip.Vpixels-(row+1)][col] = pix.B;
img[ip.Vpixels-(row+1)][col+1] = pix.G;
img[ip.Vpixels-(row+1)][col+2] = pix.R;
row++;
}
}
return img;
}
ReadBMP() places the image width and height in the two variables ip.Hpixels and
ip.Vpixels, respectively. The number of bytes needed to store each row of the image
is placed in ip.Hbytes. The FlipImageV() function has two loops: the outer loop goes through
all ip.Hbytes of the image and, for every column, the inner loop swaps the corresponding
vertically mirrored pixels one at a time.
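For reference, a rough sketch of that two-loop structure is shown below; it assumes the global ip and the unsigned char** image returned by ReadBMP(), and its swap body is the same as the fragment shown above (the full version is in the imflip.c listing):
unsigned char** FlipImageV(unsigned char** img)
{
   struct Pixel pix;                          // temporary pixel used for the swap
   int row, col;
   for(col=0; col<ip.Hbytes; col+=3){         // outer loop: every column (3 bytes per pixel)
      row = 0;
      while(row<ip.Vpixels/2){                // inner loop: swap vertically mirrored rows
         pix.B = img[row][col];
         pix.G = img[row][col+1];
         pix.R = img[row][col+2];
         img[row][col]   = img[ip.Vpixels-(row+1)][col];
         img[row][col+1] = img[ip.Vpixels-(row+1)][col+1];
         img[row][col+2] = img[ip.Vpixels-(row+1)][col+2];
         img[ip.Vpixels-(row+1)][col]   = pix.B;
         img[ip.Vpixels-(row+1)][col+1] = pix.G;
         img[ip.Vpixels-(row+1)][col+2] = pix.R;
         row++;
      }
   }
   return img;
}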
//horizontal flip
for(row=0; row<ip.Vpixels; row++){ // go through the rows
col = 0;
while(col<(ip.Hpixels*3)/2){ // go through the columns
pix.B = img[row][col];
pix.G = img[row][col+1];
pix.R = img[row][col+2];
img[row][col] = img[row][ip.Hpixels*3-(col+3)];
img[row][col+1] = img[row][ip.Hpixels*3-(col+2)];
img[row][col+2] = img[row][ip.Hpixels*3-(col+1)];
img[row][ip.Hpixels*3-(col+3)] = pix.B;
img[row][ip.Hpixels*3-(col+2)] = pix.G;
img[row][ip.Hpixels*3-(col+1)] = pix.R;
col+=3;
}
}
return img;
}
pix.B = img[row][col];
pix.G = img[row][col+1];
pix.R = img[row][col+2];
These lines simply read one pixel at vertical position row and horizontal position col; the blue color
component of the pixel is at address img[row][col], the green component is at address
img[row][col + 1], and the red component is at img[row][col + 2]. The pointer to the
beginning address of the image, img, was passed to the FlipImageH() function by main()
after ReadBMP() allocated space for it, as we will see in the next chapter.
text terminal, “Cygwin64 Terminal,” which is a Unix bash shell. Using a text-only shell
has a few implications:
1. Since every single program we are developing operates on an image, we need a way
to display our images outside the terminal. In Cygwin64, since each directory you are
browsing on the Cygwin64 terminal corresponds to an actual Windows directory, all
you have to do is to find that Windows directory and display the input and output
images using a simple program like mspaint or an Internet Explorer browser. Both
programs will allow you to resize the monster 3200×2400 image to any size you want
and display it comfortably.
2. Cygwin commands ls, mkdir, and cd are all indeed working on a Windows directory. The
cygwin64-Windows directory mapping is:
~/Cyg64dir ←→ C:\cygwin64\home\Tolga\Cyg64dir
where Tolga is my login, and, hence, my Cygwin64 root directory name. Every cyg-
win64 user’s home directory will be under the same C:\cygwin64\home directory. In
many cases, there will be only one user, which is your name.
3. We need to run Notepad++ outside the Cygwin64 terminal (i.e., from Windows) by
drag-and-dropping the C source files inside Notepad++ to edit them. Once edited,
we compile them in the Cygwin64 terminal, and display them outside the terminal.
4. There is another way to run Notepad++ and display the images in Cygwin64, without
going to Windows. Type the following command lines:
One thing might look mysterious to you: Why did I precede our program's name with ./
and didn't do the same thing for cygstart? Type this:
echo $PATH
and you will not have ./ in the current PATH environment variable after an initial Cygwin64
install. Therefore, Cygwin64 won't know to search the current directory for any command
you type. If you already have the ./ in your PATH, you do not have to worry about this.
If you don't, you can add that to your PATH within your .bash_profile file, and now it will
start recognizing it. This file is in your home directory and the line to add is:
export PATH=$PATH:./
Since the cygstart command was in one of the paths that existed in your PATH environment
variable, you didn't need to precede it with any directory name such as ./, which implies the
current directory.
The ls -al command enables you to see the sizes and permissions of a directory and the
files contained in it (i.e., detailed listing) no matter which directory you are in. You will
also see two directories that Unix created for you automatically, with the special names . (meaning
this directory) and .. (meaning the upper directory), relative to where you are. So, for example,
the command ./imflip ... is telling Unix to run imflip from this directory.
Using the pwd command to find out where you are, you will get a directory that doesn't
start with a tilde, but rather looks something like /home/Tolga/cuda. Why? Because pwd
reports where you are relative to the Unix root, which is /, rather than to your home
directory /home/Tolga/ or its ~/ short notation. While the cd command will take you to your
home directory, the cd / command will take you to the Unix root, where you will see the directory
named home. You can drill down into home/Tolga with the command cd home/Tolga and end
up at your home directory, but clearly the short notation cd is much more convenient.
rmdir command removes a directory as long as it is empty. However, if it has files or
other directories in it (i.e., subdirectories), you will get an error indicating that the di-
rectory is not empty and cannot be removed. If you want to remove a directory that has
files in it, use the file deletion command rm with the switch “-r” that implies “recur-
sive.” What rm -r dirname means is: remove every file from the directory dirname along
with all of its subdirectories. There is possibly no need to emphasize how dangerous this
command is. Once you issue this command, the directory is gone, and so are its entire con-
tents, including all of its subdirectories. So, use this command with extreme
caution.
mv command works for files and also directories. For example, mv dir1 dir2 “moves” a
directory dir1 into dir2. This effectively renames the directory dir1 as dir2 and the old
directory dir1 is gone. When you ls, you will only see the new directory dir2.
program generates. For example, to run the imflip.c serial program we saw in Section 1.4,
which flips a picture, you need the program itself, you need to compile it, and when this
program generates an output BMP picture, you need to be able to see that picture. You
also need to bring (copy) the picture into this directory. There is also a Makefile that I
created for you which helps the compilation. Here are some common Unix commands for
file manipulation:
• cat Makefile | grep imflip pipes the output of the cat command into another command
grep that looks and lists the lines containing the keyword imflip. grep is excellent for
searching some text strings inside text files. The output of any Unix command could
be piped into grep.
• ls -al | grep imflip pipes the output of the ls command into the grep imflip. This is
effectively looking for the string imflip in the directory listing. This is very useful in
determining file names that contain a certain string.
• make imflip finds the rule imflip : file1 file2 file3 ... inside Makefile and remakes imflip if
any file in the dependency list has been modified.
• cp imflip if1 copies the executable file imflip that you just created under a different
filename if1, so you do not lose it.
• man cp displays a help file for the cp command. This is great to display detailed
information about any Unix command.
• ls -al can be used to show the permissions and file sizes of source files and input/output
files. For example, it is perfect to check whether the sizes of the input and output BMP
files dogL.bmp and dogH.bmp are identical. If they are not, this is an early indication
of a bug!
• ls imf* lists every file whose names start with “imf.” This is great for listing files that
you know contain this “imf” prefix, like the ones we are creating in this book named
imflip, imflipP, ... Star (*) is a wildcard that means “anything.” Of course, you can
get fancier with the *, like ls imf*12, which means files starting with “imf” and ending
with “12.” Another example is ls imf*12*, which means files starting with “imf” and
having “12” somewhere in the middle of the file name.
• diff file1 file2 displays the differences between two text files. This is great to determine
if a file has changed. It can also be used for binary files.
• imflip or imflip dog... runs the program if ./ is in your $PATH. Otherwise, you have
to use ./imflip dog...
• touch imflip updates the “last access time” of the file imflip.
• rm imflip deletes the imflip executable file.
• mv command, just like renaming directories, can also be used to rename files as well
as truly move them. mv file1 file2 renames file1 as file2 and keeps it in the same
directory. If you want to move files from one directory to another, precede the filename
with the directory name and it will move to that directory. You can also move a file
without renaming it. Most Unix commands allow such versatility. For example, the
cp command can be used exactly the way mv is used to copy files from one directory
to another.
• history lists the commands you used since you opened the terminal.
The Unix commands to compile our first serial program imflip.c and turn it into the
executable imflip (or, imflip.exe in Windows) will produce an output that looks something
like the listing below. Only the important commands that the user entered are shown on
the left, and the Unix output is shown indented to the right:
ls
ImageStuff.c ImageStuff.h Makefile dogL.bmp imflip.c
cat Makefile
imflip : imflip.c ImageStuff.c ImageStuff.h
g++ imflip.c ImageStuff.c -o imflip
make imflip
ls
ImageStuff.c ImageStuff.h Makefile dogL.bmp imflip.c imflip
imflip
Usage : imflip [input][output][v/h]
imflip dogL.bmp dogH.bmp h
Input BMP File Name : dogL.bmp (3200x2400)
Output BMP File Name : dogH.bmp (3200x2400)
If you run ls -al, each file's permissions are shown as -rwxr-x, etc. Your output might
be slightly different depending on the computer or organization where you are running these
commands. The Unix command chmod is used to change these permissions (to make files
read-only, executable, and so on).
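For example, two illustrative invocations (these particular commands are not from the book's listings) would be:
$ chmod 444 dogL.bmp      # make the input image read-only for everyone
$ chmod u+x imflip        # make imflip executable by its owner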
The Unix make tool allows us to automate routinely executed commands and makes it
easy to compile a file, among other tasks. In our case, “make imflip” asks Unix to look inside
the Makefile and execute the line “gcc imflip.c ImageStuff.c -o imflip” which will invoke the
gcc compiler and will compile imflip.c and ImageStuff.c source files and will produce an
executable file named imflip. In our Makefile, the first line is showing file dependencies: It
instructs make to make the executable file imflip only if one of the listed source files, imflip.c,
ImageStuff.c, or ImageStuff.h have changed. To force a compile, you can use the touch Unix
command.
To use the debugging tools described below, compile your code with a debug flag, typically “-g”.
This tells the compiler to include debug symbols, which include
things like line numbers, to tell you where your code went wrong. An example is shown
below:
$ gcc imflip.c imageStuff.c -o imflip -g
1.7.1 gdb
For the sake of showing what happens when you mess up your code, I’ve inserted a memory
free() into imflip.c before the code is done using the data. This intentionally causes a
segmentation fault in the code, as shown below:
$ gcc imflip.c imageStuff.c -o imflip -g
$ ./imflip dogL.bmp flipped.bmp V
Segmentation fault (core dumped)
Since imflip was compiled with debug symbols, gdb, the GNU debugger, can be run to try
to figure out where the segmentation fault is happening. The output from gdb is given in
Figure 1.4. gdb is first called by running
$ gdb ./imflip
Once gdb is running, the program arguments are set by the command:
set args dogL.bmp flipped.bmp V
After this, the program is run using the simple run command. gdb then proceeds to spit out
a bunch of errors saying that your code is all messed up. The where command helps give a
little more information as to where the code went wrong. Initially, gdb thought the error was
in the WriteBMP() function within ImageStuff.c at line 73, but the where command narrows
it down to line 98 in imflip.c. Further inspection in imflip.c code reveals that a free(data)
command was called before writing data to a BMP image with the WriteBMP() function.
This is just a simple example, but gdb can be expanded to use break points, watching
specific variables, and a host of other options. A sample of common commands is listed in
Table 1.1.
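Independent of that table, a few standard gdb commands of this kind (generic gdb functionality, not specific to imflip; the file, line, and variable names below are only placeholders) are:
break imflip.c:98      set a breakpoint at a given source line
run                    start (or restart) the program with the arguments set earlier
where                  print the call stack at the current stop point
print ip.Vpixels       inspect the value of a variable or expression
watch row              stop whenever the named variable changes
continue               resume execution until the next breakpoint or fault
quit                   exit gdb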
Most integrated development environments (IDEs) have a built-in debugging module
that makes debugging quite easy to use. Typically the back-end is still gdb or some propri-
etary debugging engine. Regardless of whether you have an IDE to use, gdb is still available
from the command line and contains just as much, if not more, functionality than
your IDE's built-in debugger (depending on your IDE of choice).
printf(): The old-school approach is to stick printf() statements in various parts
of the code. These are nothing more than manual breakpoints. If you feel that the bug in
your code is fairly easy to discover, there is no reason to go through the lengthy gdb process,
as I described in Section 1.7.1. Stick a bunch of printf()’s inside the code and they will tell
you what is going on. A printf() can display a lot of information about multiple variables,
clearly being a much more powerful tool than the few lights.
assert: An assert statement does not do anything unless a condition that you specified
is violated, as opposed to printf(), which always prints something. For example, if
your code had the following lines:
ImgPtr=malloc(...);
assert(ImgPtr != NULL);
In this case, you are simply trying to make sure that the pointer returned by malloc() is not
NULL, which would red-flag a huge problem with memory allocation. While assert() does
nothing under normal circumstances, it issues an error like the one shown below if the
condition is violated:
Assertion violation: file mycode.c, line 36: ImgPtr != NULL
Comment lines: Surprisingly enough, there is something easier than sticking a bunch
of printf()'s in your code. Although C doesn't care about “lines,” it is fairly common
for C programmers to write their code in a roughly line-by-line fashion, much
like Python. This is why Python received some criticism for making the line-by-line style
the actual syntax of the language, rather than an option as in C. In comment-driven
debugging, you simply comment out a line that you are suspicious of, then recompile and re-execute
to see whether the problem went away, even though the result is, of course, no longer correct. This
is perfect in situations where you are getting core dumps, etc. In the example below, your
program will give you a Divide By 0 error if the user enters 0 for speed. You insert the
printf() there to give you an idea about where it might be crashing, but an assert() is much
better, because assert() does nothing under normal circumstances, avoiding the clutter
on your screen during debugging.
scanf("%d", &speed);
printf("DEBUG: user entered speed=%d\n",speed);
assert(speed != 0);
distance=100; time=distance/speed;
Comments are super practical, because you can insert them in the middle of the code in
case there are multiple C statements in your code as shown below:
scanf("%d", &speed);
distance=100; // time=distance/speed;
1.7.3 valgrind
Another extremely useful debugging tool is a framework called valgrind. Once
the code has been compiled with debug symbols, valgrind is simple to run. It has a host
of options, similar to gdb, but the basic usage is quite easy. The output from
valgrind on the same imflip code with a memory error is shown in Figure 1.5. It catches
quite a few more errors and even locates the proper line of the error on line 96 in imflip.c
where the improper free() command is located.
valgrind also excels at finding memory errors that won't show up as crashes at run time. Typ-
ically, memory leaks are harder to find with simple print statements or a debugger like gdb.
For example, if imflip did not free any of its memory at the end, a memory leak would be
present, and valgrind would pick up on it. valgrind also has a module called cachegrind that
helps simulate how your code interacts with the CPU's cache memory system. cachegrind is
called with the --tool=cachegrind command line option. Further options and documentation
can be found at http://valgrind.org.
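For reference, typical invocations look like the following (assuming the same debug build used in the gdb example; --leak-check and --tool are standard valgrind options, not anything specific to this book):
$ valgrind ./imflip dogL.bmp flipped.bmp V
$ valgrind --leak-check=full ./imflip dogL.bmp flipped.bmp V
$ valgrind --tool=cachegrind ./imflip dogL.bmp flipped.bmp V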
memory access time should be. Alternatively, if we have almost a completely dedicated core
when processing the small image, we are running the entire program inside the core itself,
without needing to go to the main memory. And, we are not sharing that core with anyone.
So, there is nearly zero uncertainty in determining the execution time. If these concepts are
slightly blurry, do not worry about them; there will be an entire chapter dedicated to the
CPU architecture. Here is the meaning of the C/T (cores/threads) notation: xC/yT denotes
a CPU with x cores that can execute y simultaneous threads; for example, the i7-5960X
mentioned earlier is 8C/16T, i.e., eight cores, each capable of executing two threads.
OK, this is a good enough brain warm-up ... Let's write our first parallel program.
CHAPTER 2
Developing Our First Parallel CPU Program
This chapter is dedicated to understanding our very first CPU parallel program, imflipP.c.
Notice the 'P' at the end of the file name, which indicates parallel. For CPU parallel
programs, the development platform makes no difference. In this chapter, I will slowly start
introducing the concepts that are the backbone of a parallel program, and these concepts
will be readily applicable to GPU programming when we start developing GPU programs
in Part II. As you might have noticed, I never say GPU Parallel Programming, but, rather,
GPU Programming. This is much like how there is no reason to say a car with wheels; it suffices
to say a car. In other words, there is no GPU serial programming, which would mean using
one GPU thread out of the available 100,000s! So, GPU programming by definition implies
GPU parallel programming.
2. pthread_join() allows you to join any given thread to the thread that originally cre-
ated it. Think of the “join” process as “uncreating” threads, or like the top thread
“swallowing” the thread it just created.
3. pthread_attr_init() allows you to initialize attributes for threads.
4. pthread_attr_setdetachstate() allows you to set attributes for the threads you just
initialized (a minimal skeleton using these calls is sketched right below).
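The following is a minimal, self-contained sketch that exercises these calls (hypothetical names, unrelated to imflipP.c; compile with gcc -pthread):
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#define N 4
void *Worker(void *arg)                     // hypothetical thread function
{
   long tid = (long)arg;                    // unique ID handed to us by main()
   printf("thread %ld running\n", tid);
   pthread_exit(NULL);
}
int main()
{
   pthread_t      th[N];
   pthread_attr_t attr;
   long i;
   pthread_attr_init(&attr);                                      // initialize attributes
   pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);   // make threads joinable
   for(i=0; i<N; i++)
      if(pthread_create(&th[i], &attr, Worker, (void *)i) != 0)   // create and launch
         exit(EXIT_FAILURE);
   for(i=0; i<N; i++)
      pthread_join(th[i], NULL);            // wait for (swallow) each child thread
   return 0;
}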
#include <sys/time.h>
...
struct timeval t;
double StartTime, EndTime;
double TimeElapsed;
...
gettimeofday(&t, NULL);
StartTime = (double)t.tv_sec*1000000.0 + ((double)t.tv_usec);
// work is done here : thread creation, task/data assignment, join
...
gettimeofday(&t, NULL);
EndTime = (double)t.tv_sec*1000000.0 + ((double)t.tv_usec);
TimeElapsed=(EndTime-StartTime)/1000.00;
TimeElapsed/=(double)REPS;
...
printf("\n\nTotal execution time: %9.4f ms ...",TimeElapsed,...
switch (argc){
case 3: NumThreads=1; Flip=’V’; break;
case 4: NumThreads=1; Flip=toupper(argv[3][0]); break;
case 5: NumThreads=atoi(argv[4]); Flip=toupper(argv[3][0]); break;
default:printf("Usage: imflipP input output [v/h] [threads]");
printf("Example: imflipP infile.bmp out.bmp h 8\n\n");
return 0;
}
if((Flip != ’V’) && (Flip != ’H’)) {
printf("Invalid option ’%c’ ... Exiting...\n",Flip);
exit(EXIT_FAILURE);
}
if((NumThreads<1) || (NumThreads>MAXTHREADS)){
printf("Threads must be in [1..%u]... Exiting...\n",MAXTHREADS);
exit(EXIT_FAILURE);
}else{
if(NumThreads != 1){
printf("\nExecuting %u threads...\n",NumThreads);
MTFlipFunc = (Flip==’V’) ? MTFlipV:MTFlipH;
}else{
printf("\nExecuting the serial version ...\n");
FlipFunc = (Flip==’V’) ? FlipImageV:FlipImageH;
}
}
TheImage = ReadBMP(argv[1]);
gettimeofday(&t, NULL);
StartTime = (double)t.tv_sec*1000000.0 + ((double)t.tv_usec);
...
}
#include <pthread.h>
...
#define REPS 129
#define MAXTHREADS 128
...
long NumThreads; // Total # of parallel threads
int ThParam[MAXTHREADS]; // Thread parameters ...
pthread_t ThHandle[MAXTHREADS]; // Thread handles
pthread_attr_t ThAttr; // Pthread attributes
...
pthread_attr_init(&ThAttr);
pthread_attr_setdetachstate(&ThAttr, PTHREAD_CREATE_JOINABLE);
for(a=0; a<REPS; a++){
...
}
The OS mostly cares about the first and second arguments: the second argument, &ThAttr, is
the same for all threads and contains the thread attributes. The first argument receives the
“handle” of each thread and is very important to the OS, which uses it to keep track of the
threads. If the OS cannot create a thread for any reason, pthread_create() returns a nonzero
error code (0 means success), and this is our clue that we can no longer create a thread. This is a show-stopper,
so our program issues a runtime error and exits.
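The pthread_create() calls themselves are elided ("...") in the loop above; a rough sketch of that loop body, using the variables declared earlier (i is a loop counter assumed to be declared in main(); the exact types and casts are in the full imflipP.c listing), looks like this:
for(i=0; i<NumThreads; i++){
   ThParam[i] = i;                      // this thread's tid
   ThErr = pthread_create(&ThHandle[i], &ThAttr, MTFlipFunc, (void *)&ThParam[i]);
   if(ThErr != 0){
      printf("\nThread Creation Error %d. Exiting abruptly...\n", ThErr);
      exit(EXIT_FAILURE);
   }
}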
Here is the interesting question: If main() creates two threads, is our program a
dual-threaded program? As we will see shortly, when main() creates two threads using
pthread create(), the best we can expect is a 2x program speed-up. What about the main()
itself? It turns out main() itself is most definitely a thread too. So, there are 3 threads
involved in a program where main() created two child threads. The reason we only expect
a 2x speed-up is the fact that, while main() is only doing trivial work, the other two threads
are doing heavy work.
To quantify this: the main() function creates threads, assigns tasks to them, and joins
them, which constitutes, say, 1% of the activity, while the other 99% of the activity is
caused by the other two threads doing the actual heavy work (about 49.5% each). That
being the case, the amount of time that the third thread takes, running the main() function,
is negligible. Figure 2.1 shows my PC’s Windows Task Manager, which indicates 1499 active
threads. However, the CPU load is negligible (almost 0%). These 1499 are the threads that
the Windows OS created to listen to network packets, keyboard strokes, other interrupts,
etc. If, for example, the OS realizes that a network packet has arrived, it wakes up the
responsible thread, immediately processes that packet in a very short period of time and
the thread goes back to sleep, although it is still active. Remember: the CPU is drastically faster
than the network.
and the OS sets up the memory and stack area and allocates two of its available virtual CPUs
to these two super-active threads. After successful creation of the threads, they must be launched.
pthread_create() also implies launching a thread that has just been created. Launching a
thread effectively corresponds to calling the following functions:
(*MTFlipFunc)(ThParam[0]);
(*MTFlipFunc)(ThParam[1]);
which will turn into either one of the horizontal or vertical flip functions, determined at
runtime, based on user input. If the user chose ’H’ as the flip option, the launch will be
effectively equivalent to this:
...
MTFlipFunc=MTFlipH;
...
(*MTFlipH)(ThParam[0]);
(*MTFlipH)(ThParam[1]);
pthread_join(ThHandle[0], NULL);
pthread_join(ThHandle[1], NULL);
After the first pthread_join(), we are down to 1500 threads. The first child thread got swal-
lowed by main(). After the second pthread_join(), we are down to 1499 threads. The second
child thread got gobbled up too. This stops the tornado! And, a few ms later, main()
reports the time and exits. As we will see in Code 2.5, ImageStuff.c contains code to dy-
namically allocate the memory area to store the image that is read from the disk. The malloc()
function is used for dynamic (i.e., at run time) memory allocation. Before exiting main(),
all of this memory is deallocated using free(), as shown below.
...
// free() the allocated memory for the image
for(i = 0; i < ip.Vpixels; i++) { free(TheImage[i]); }
free(TheImage);
printf("\n\nTotal execution time: %9.4f ms (%s flip)",TimeElapsed,
Flip==’V’?"Vertical":"Horizontal");
printf(" (%6.3f ns/pixel)\n",
1000000*TimeElapsed/(double)(ip.Hpixels*ip.Vpixels));
return (EXIT_SUCCESS);
}
When main() exits, the parent OS thread swallows the child thread running main(). These
threads are an interesting life form; they are like some sort of bacteria that create and
swallow each other!
In Analogy 2.1, harvesting a portion of the coconut trees is the task of each thread
and it is the same for every farmer, regardless of how many farmers show up. The farmers
are threads that are executing. Each farmer must be given a unique ID to know which
part of the trees he or she must harvest. This unique ID is analogous to a Thread ID, or,
tid. The number of trees is 1800, which is all of the data elements to process. The most
interesting thing is that the task (i.e., the top part of the instructions) can be separated
from the data (i.e., the bottom part of the instructions). While the task is the same for
every farmer, the data is completely different. So, in a sense, there is only a single task,
applied to different data elements, determined by tid.
It is clear that the task can be completely predetermined at compile-time, which means,
during the preparation of the instructions. However, it looks like the data portion must
be determined at runtime, i.e., when everybody shows up and we know exactly how many
farmers we have. The key question is whether the data portion can also be determined at
compile-time. In other words, the mayor of the town can write only one set of instructions
and make, say, 60 photocopies (i.e., the maximum number of farmers that is ever expected)
and never have to prepare anything else when the farmers show up. If 2 farmers show up, the
mayor hands out 2 instructions and assigns tid = 0 and tid = 1 to them. If 5 farmers show
up, he hands out 5 instructions and assigns them tid = 0, tid = 1, tid = 2, tid = 3, tid = 4.
More generally, the only thing that must be determined at runtime is the tid assignment,
i.e., tid = 0 ... tid = N − 1. Everything else is determined at compile time, including the
parameterized task. Is this possible? It turns out, it is most definitely possible. In the end,
for N farmers, we clearly know what the data splitting will look like: Each farmer will get
1800/N coconut trees to harvest, and farmer number tid will have to harvest trees in the
range
[ (1800/N) × tid  ...  (1800/N) × (tid + 1) − 1 ]                    (2.1)
To validate this, let us calculate the data split for tid = [0...4] (5 farmers). They end
up being [360 × tid ... 360 × tid + 359] for a given tid. Therefore, for 5 farmers, the data
split ends up being [0...359], [360...719], [720...1079], [1080...1439], and [1440...1799]. This
is precisely what we wanted. This means that for an N -threaded program, such as flipping
a picture horizontally, we really need to write a single function that will get assigned to a
launched thread at runtime. All we need to do is to let the launched thread know what its
tid is ... The thread, then, will be able to figure out exactly what portion of the data to
process at run time using an equation similar to Equation 2.1.
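As a sketch in code (hypothetical function and variable names, assuming the global NumThreads and a tid passed in via ThParam[] as in imflipP.c; the real multithreaded flip functions, such as MTFlipH() in Code 2.8, follow the same pattern), the runtime data split of Equation 2.1 boils down to a few lines executed by each thread:
void *HarvestTrees(void *tid_ptr)                // hypothetical thread function
{
   long tid = (long)(*((int *)tid_ptr));         // unique ID: 0 .. N-1
   long TreesPerThread = 1800 / NumThreads;      // N farmers -> 1800/N trees each
   long MyStart = TreesPerThread * tid;          // first tree for this thread
   long MyEnd   = TreesPerThread * (tid+1) - 1;  // last tree for this thread
   long t;
   for(t = MyStart; t <= MyEnd; t++){
      // harvest (process) tree t ...
   }
   pthread_exit(NULL);
}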
One important note here is that none of the data elements have dependencies, i.e., they
can be processed independently and in parallel. Therefore, we expect that when we launch
N threads, the entire task (i.e., 1800 coconut trees) can be processed N times faster. In
other words, if 1800 coconut trees took 1800 hours to harvest, when 5 farmers show up, we
expect it to take 360 hours. As we will see shortly, this perfectness proves to be difficult to
achieve. There is an inherent overhead in parallelizing a task, called the parallelization
overhead. Because of this overhead, 5 farmers might take, say, 400 hours to complete the
job (a 4.5× speedup instead of the ideal 5×, i.e., a threading efficiency of 90%). The details
will depend on the hardware and the way we wrote the function for each
thread. We will be paying a lot of attention to this issue in the coming chapters.
If you look at Analogy 2.2, the clerk was able to get all of the information that is
necessary to draw an entire picture of the 40x45 tree farm by providing only a single
tree picture (OneTree.BMP) and the picture of the one that looked a little different than
the other 1798 (DifferentTree.BMP). Assume that each of these pictures occupies 1 KB of
storage. Including the text file that the clerk provided, this information fits in roughly 3 KB
on January 1, 2015. If we were to make a BMP file out of the entire 40x45 tree farm, we
would need 1 KB for each of the 1800 trees, occupying 1800 KB. Repetitious (i.e., redundant) data
allowed the clerk to substantially reduce the size of the file needed to deliver the same
information.
This concept is called data compression and can be applied to any kind of data that has
redundancies in it. This is why an uncompressed image format like BMP will be drastically
larger in size in comparison to a compressed image format like JPEG (or JPG) that com-
presses the information before storing it; the techniques used in compression require
knowledge of frequency domain analysis; however, the abstract idea is simple and is exactly
what is conceptualized in Analogy 2.2.
A BMP file stores “raw” image pixels without compressing them; because compression
is not performed, no additional processing is necessary before each pixel is stored in a BMP
file. This contrasts with a JPEG file, which first applies a frequency-domain transformation
like the Discrete Cosine Transform. Another interesting artifact of a JPEG file is that only 90–99%
of the actual image information might be there; this concept of losing part of the image
information — though not noticeable to the eye — means that a JPEG file is a lossy image
storage format, whereas no information is lost in a BMP file because each pixel is stored
without any transformation. Considering that a 20 MB BMP file could be stored as a 1 MB
JPG file if we could tolerate a 1% loss in image data, this trade-off is perfectly acceptable
to almost any user. This is why almost every smartphone stores images in JPG format to
avoid quickly filling your storage space.
struct ImgProp{
int Hpixels;                    // horizontal resolution (pixels per row)
int Vpixels;                    // vertical resolution (number of rows)
unsigned char HeaderInfo[54];   // the 54-byte BMP header, saved for writing the file back
unsigned long int Hbytes;       // bytes per row, padded to a multiple of 4 (Equation 2.2)
};
struct Pixel{
unsigned char R;
unsigned char G;
unsigned char B;
};
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include "ImageStuff.h"
ip.Vpixels = height;                    // fill in the ImgProp fields from the BMP header
ip.Hpixels = width;
int RowBytes = (width*3 + 3) & (~3);    // Equation 2.2: pad each row to a multiple of 4 bytes
ip.Hbytes = RowBytes;
printf("\n Input BMP File name: %20s (%u x %u)",filename,ip.Hpixels,ip.Vpixels);
unsigned char tmp;
unsigned char **TheImage = (unsigned char **)malloc(height * sizeof(unsigned char*));
for(i=0; i<height; i++) {
    TheImage[i] = (unsigned char *)malloc(RowBytes * sizeof(unsigned char));
}
for(i = 0; i < height; i++) {
fread(TheImage[i], sizeof(unsigned char), RowBytes, f);
}
fclose(f);
return TheImage; // remember to free() it in caller!
}
ReadBMP() extracts Hpixels and Vpixels values from the BMP header, i.e., the first 54
bytes of the BMP file, and calculates Hbytes from Equation 2.2. It dynamically allocates
sufficient memory for the image using the malloc() function, which will be released using
the free() function at the end of main(). The image is read from a user-specified file name
that is passed onto ReadBMP() within the string filename. This BMP file header is saved in
HeaderInfo[] to use when we need to write the processed file back to the disk.
Both the ReadBMP() and WriteBMP() functions use the C library function fopen() with
either the "rb" or "wb" option, which means read or write a binary file, respectively. If the file cannot
be opened by the OS, the return value of fopen() is NULL, and an error is issued to the user.
This would happen due to a wrong file name or an existing lock on the file. fopen() allocates
a file handle and a read/write buffer area for the new file and returns it to the caller.
Depending on the fopen() parameters, it also places a lock on the file to prevent multiple
programs from corrupting the file due to simultaneous access. Each byte is read/written
from/to the file one byte at a time (i.e., the C variable type unsigned char) by using
this buffer. Function fclose() de-allocates this buffer and removes the lock (if any) from
the file.
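As a small illustration of the fopen() pattern just described (this is only a sketch with a made-up function name and error message, not the book's ReadBMP() listing):

#include <stdio.h>
#include <stdlib.h>

// Open a BMP file for binary reading ("rb"); exit if the OS cannot open it.
FILE *open_bmp_for_reading(const char *filename)
{
    FILE *f = fopen(filename, "rb");
    if (f == NULL) {                       // wrong name, missing file, or a lock
        printf("\n%s NOT FOUND\n", filename);
        exit(EXIT_FAILURE);
    }
    return f;                              // the caller must fclose(f) when done
}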
• main() is responsible for creating the threads and assigning a unique tid to each one
at runtime (e.g., ThParam[i] shown below).
• main() invokes a function for each thread (function pointer MTFlipFunc).
• main() must also pass other necessary values to the thread, if any (also passed in
ThParam[i] below).
• main() is also responsible for letting the OS know what type of thread it is creating (i.e.,
thread attributes, passed in &ThAttr). In the end, main() is nothing but another thread
itself, and it speaks on behalf of the other child threads it will create in a moment.
• The OS is responsible for deciding whether a thread can be created. Threads are
nothing but resources that must be managed by the OS. If a thread can be created,
the OS is responsible for assigning a handle to that thread (ThHandle[i]). If not, OS
returns NULL (ThErr).
• If the OS cannot create a thread, main() is also responsible for either exiting or taking some
other action.
if(ThErr != 0){
    printf("\nThread Creation Error %d. Exiting abruptly...\n",ThErr);
    exit(EXIT_FAILURE);
}
• The responsibility of each thread is to receive its tid and perform its task MTFlipFunc()
only on the data portion that it is required to process. We will spend multiple pages
on this.
• The final responsibility of main() is to wait for the threads to be done and join them.
This instructs the OS to de-allocate the thread resources. (A minimal sketch of the entire
create/join flow appears after the join code below.)
pthread_attr_destroy(&ThAttr);
for(i=0; i<NumThreads; i++){
pthread_join(ThHandle[i], NULL);
}
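Putting these responsibilities together, the create/join flow looks roughly like the self-contained sketch below. The variable names (ThHandle, ThAttr, ThParam, ThErr, NumThreads) follow the text; the thread body is only a stand-in for MTFlipFunc, and MAXTHREADS is an illustrative constant of my choosing:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define MAXTHREADS 128                       // illustrative upper bound

long NumThreads = 4;

void *DummyThread(void *tid)                 // stand-in for MTFlipV()/MTFlipH()
{
    printf("thread %d running\n", *((int *) tid));
    pthread_exit(NULL);
}

int main(void)
{
    pthread_t      ThHandle[MAXTHREADS];     // OS-assigned thread handles
    pthread_attr_t ThAttr;                   // thread attributes
    int            ThParam[MAXTHREADS];      // per-thread parameter (tid)
    int            ThErr, i;

    pthread_attr_init(&ThAttr);
    pthread_attr_setdetachstate(&ThAttr, PTHREAD_CREATE_JOINABLE);
    for (i = 0; i < NumThreads; i++) {
        ThParam[i] = i;                      // this thread's tid
        ThErr = pthread_create(&ThHandle[i], &ThAttr, DummyThread,
                               (void *) &ThParam[i]);
        if (ThErr != 0) {
            printf("\nThread Creation Error %d. Exiting abruptly...\n", ThErr);
            exit(EXIT_FAILURE);
        }
    }
    pthread_attr_destroy(&ThAttr);
    for (i = 0; i < NumThreads; i++) {
        pthread_join(ThHandle[i], NULL);     // wait for each child, then join it
    }
    return 0;
}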
The first two functions are exactly what we had in Chapter 1.1. These are the serial
functions that flip an image in the vertical or horizontal direction. We just introduced their
multithreaded versions (the last two above), that will do exactly what the serial version did,
except faster by using multiple threads (hopefully)! Note that the multithreaded versions
will need the tid as we described before, whereas the serial versions don’t ...
Now, our goal is to understand how we pass the function pointer and data to each
launched thread. The serial versions of the function are slightly modified to eliminate the
return value (i.e., void), so they are consistent with the multithreaded versions that also do
not return a value. All four of these functions simply modify the image that is pointed to
by the pointer TheImage. It turns out, we do not really have to pass the function pointer to
the thread. Instead, we have to call the function that is pointed to by the function pointer.
This process is called thread launch.
The way we pass the data and launch a thread differs based on whether we are launching
the serial or multithreaded version of the function. I designed imflipP.c to be able to run
the older serial versions of the code as well as the new multithreaded versions, based on the
user command-line parameters. Since the input variables of the two families of functions
are slightly different, it was easier to define two separate function pointers, FlipFunc and
MTFlipFunc, that were responsible for launching the serial and multithreaded version of the
functions. I maintained two function pointers shown below:
Let us clarify the difference between creating and launching a thread, both of which are
implied in pthread_create(). Creating a thread involves a request/grant mechanism between
the parent thread main() and the OS. If the OS says No, nothing else can happen. So, it
is the OS that actually creates the thread and sets up a memory area, a handle, a virtual
CPU, and a stack area for it, and gives a nonzero thread handle to the parent thread, thereby
granting the parent permission to launch (aka run) another thread in parallel.
...
void (*FlipFunc)(unsigned char** img); // Serial flip function ptr
void* (*MTFlipFunc)(void *arg); // Multi-threaded flip func ptr
...
void FlipImageV(unsigned char** img)
{
...
Notice that, although the parent now has the license to run another thread, nothing is happening yet. Launch-
ing a thread is effectively a parallel function call. In other words, main() knows that another
child thread is running after the launch, and can communicate with it if it needs to.
The main() function may never communicate (e.g., pass data back and forth) with its
child thread(s), as exemplified in Code 2.2, since it doesn't need to. Child threads modify
the required memory areas and return. The tasks assigned to the child threads, in this spe-
cific case, leave only a single responsibility to main(): wait until the child threads are
done and terminate (join) them. Therefore, the only thing that main() cares about is that it
has the handle of the new thread, and it can determine when that thread has finished
execution (i.e., returned) by using pthread_join(). So, effectively, pthread_join(x) means:
wait until the thread with handle x is done. When that thread is done, it
means that it executed a return and finished its job. There is no reason to keep it
around.
When the thread (with handle x) joins main(), the OS gets rid of all of the memory,
virtual CPU, and stack areas it allocated to that thread, and this thread disappears. How-
ever, main() is still alive and well ... until it reaches the last code line and returns (last line
in Code 2.2) ... When main() executes a return, the OS de-allocates all of the resources it
allocated for main() (i.e., the imflipP program). The program has just completed its execu-
tion. You then get a prompt back in Unix, waiting for your next Unix command, since the
execution of imflipP has just been completed.
Refreshing our memory with Code 1.2, the FlipImageV() function that swaps pixels looked some-
thing like this. Note: the return value type is modified to be void to be consistent with the
multithreaded versions of the same program. Otherwise, the rest of the code below looks
exactly like Code 1.2.
//vertical flip
for(col=0; col<ip.Hbytes; col+=3){
row = 0;
while(row<ip.Vpixels/2){
pix.B = img[row][col];
...
row++;
}
}
return img;
}
The question now is: how to modify this FlipImageV() function to allow multithreading? The
multithreaded version of the function, MTFlipV(), will receive one parameter named tid as
we emphasized before. The image it will work on is a global variable TheImage, so it doesn’t
need to be passed as an additional input. Since our friend pthread_create() expects us to
give it a function pointer, we will define MTFlipV() as follows:
During the course of this book, we will encounter other types of functions that are
not so amenable to being parallelized. A function that doesn't parallelize easily is commonly
referred to as a function that doesn't thread well. There should be no question in any
reader's mind at this point that, if a function doesn't thread well, we should not expect it to be
GPU-friendly. Here, in this section, I am also making the point that such a function is
likely not CPU-multithreading-friendly either.
So, what do we do when a task is "born to be serial"? You clearly do not run this task
on a GPU. You keep it on the CPU ... keep it serial ... run it fast. Most modern CPUs,
such as the i7-5960x [11] I mentioned in Section 1.1, have a feature called Turbo Boost that
allows the CPU to achieve very high performance when running serial
(single-threaded) code. They achieve this by clocking one of the cores at, say, 4 GHz, while
the other cores run at, say, 3 GHz, thereby significantly boosting the performance of single-
threaded code. This allows the CPU to achieve good performance for both modern parallel and
old-fashioned serial code ...
...
long NumThreads; // Total # threads working in parallel
unsigned char** TheImage; // This is the main image
struct ImgProp ip;
...
void *MTFlipV(void* tid)
{
struct Pixel pix; //temp swap pixel
int row, col;
...
TheImage[row][col] = TheImage[ip.Vpixels-(row+1)][col];
TheImage[row][col+1] = TheImage[ip.Vpixels-(row+1)][col+1];
TheImage[row][col+2] = TheImage[ip.Vpixels-(row+1)][col+2];
TheImage[ip.Vpixels-(row+1)][col] = pix.B;
TheImage[ip.Vpixels-(row+1)][col+1] = pix.G;
TheImage[ip.Vpixels-(row+1)][col+2] = pix.R;
row++;
}
}
pthread_exit(NULL);
}
The entire code listing for MTFlipV() is shown in Code 2.7. Comparing this to the serial
version of the function, shown in Code 1.2, there aren’t really a lot of differences other than
the concept of tid, which acts as the data partitioning agent. Please note that this code is
an overly simple example of multithreaded code. Normally, what each thread does completely depends
on the logic of the programmer. For our purposes, though, this simple example is perfect
to demonstrate the basic ideas. Additionally, the FlipImageV() function is a well-mannered
function that is very amenable to multithreading.
• FlipImageV() is designed to process the entire image, while its parallel counterpart
MTFlipV() is designed to process only a portion of the image, defined by an equation
similar to Equation 2.1. Therefore, MTFlipV() needs the variable tid passed to it to know
which portion it owns. This is done when launching the thread using pthread_create().
• Besides the option of using the MTFlipFunc function pointer when launching threads with
pthread_create(), nothing prevents us from simply calling the function ourselves through
the MTFlipFunc function pointer (or its serial counterpart FlipFunc). To call the
functions that these pointers point to, the following notation has to be used:
FlipFunc = FlipImageV;
MTFlipFunc = MTFlipV;
...
(*FlipFunc)(TheImage); // call the serial version
(*MTFlipFunc)(void *(&ThParam[0])); // call the multi-threaded version
• Each image row occupies ip.Hbytes bytes. For example, for the 640 × 480 image
dog.bmp, ip.Hbytes= 1920 bytes according to Equation 2.2. The serial function
FlipImageV() clearly has to loop through every byte in the range [0...1919]. However,
the multithreaded version MTFlipV() partitions these horizontal 1920 bytes based on
tid. If 4 threads are launched, the byte (and pixel) range that has to be processed for
each thread is:
tid = 0 : Pixels [0...159] Hbytes [0...479]
tid = 1 : Pixels [160...319] Hbytes [480...959]
tid = 2 : Pixels [320...479] Hbytes [960...1439]
tid = 3 : Pixels [480...639] Hbytes [1440...1919]
• The multithreaded function's first task is to calculate which data range it has to process.
If every thread does this, all 4 of the pixel ranges shown above can be processed in
parallel. The thread, as its very first task, calculates its ts and te values (thread start and
thread end); a minimal sketch of this computation appears after this list. These are
Hbytes ranges, similar to the ones shown above, and the split is based on Equation 2.1.
Since each pixel occupies 3 bytes (one byte for each of the RGB colors), the function adds
3 to the col variable in each iteration of the for loop. The FlipImageV() function doesn't
have to do such a computation, since it is expected to process everything, i.e., the Hbytes
range 0...1919.
• The image to process is passed via the pointer argument img in the serial FlipImageV(),
to stay compatible with the version introduced in Chapter 1, whereas a global variable
(TheImage) is used in MTFlipV() for reasons that will become clear in the coming chapters.
• The multithreaded function executes pthread_exit() to let main() know that it is done.
This is when the pthread_join() call for the finished thread returns and main() advances
to the next line.
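To make the range computation concrete, here is a small stand-alone sketch (not the book's listing) that reproduces the Hbytes split shown above for a 640×480 image (ip.Hbytes = 1920) and 4 threads:

#include <stdio.h>

int main(void)
{
    long Hbytes = 1920, NumThreads = 4;
    long tid, ts, te;
    for (tid = 0; tid < NumThreads; tid++) {
        ts = tid * (Hbytes / NumThreads);        // thread start: first Hbytes index
        te = ts + (Hbytes / NumThreads) - 1;     // thread end:  last Hbytes index
        printf("tid = %ld : Hbytes [%ld...%ld]\n", tid, ts, te);
    }
    return 0;    // prints [0...479], [480...959], [960...1439], [1440...1919]
}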
...
long NumThreads; // Total # threads working in parallel
unsigned char** TheImage; // This is the main image
struct ImgProp ip;
...
void *MTFlipH(void* tid)
{
struct Pixel pix; //temp swap pixel
int row, col;
...
TheImage[row][col] = TheImage[row][ip.Hpixels*3-(col+3)];
TheImage[row][col+1] = TheImage[row][ip.Hpixels*3-(col+2)];
TheImage[row][col+2] = TheImage[row][ip.Hpixels*3-(col+1)];
TheImage[row][ip.Hpixels*3-(col+3)] = pix.B;
TheImage[row][ip.Hpixels*3-(col+2)] = pix.G;
TheImage[row][ip.Hpixels*3-(col+1)] = pix.R;
col+=3;
}
}
pthread_exit(NULL);
}
TABLE 2.1  Serial and multithreaded execution time of imflipP.c, both for vertical flip
and horizontal flip, on an i7-960 (4C/8T) CPU.
#Threads Command line Run time (ms)
Serial imflipP dogL.bmp dogV.bmp v 131
2 imflipP dogL.bmp dogV2.bmp v 2 70
3 imflipP dogL.bmp dogV3.bmp v 3 46
4 imflipP dogL.bmp dogV4.bmp v 4 67
5 imflipP dogL.bmp dogV5.bmp v 5 55
6 imflipP dogL.bmp dogV6.bmp v 6 51
8 imflipP dogL.bmp dogV8.bmp v 8 52
9 imflipP dogL.bmp dogV9.bmp v 9 47
10 imflipP dogL.bmp dogV10.bmp v 10 51
12 imflipP dogL.bmp dogV10.bmp v 12 44
Serial imflipP dogL.bmp dogH.bmp h 81
2 imflipP dogL.bmp dogH2.bmp h 2 41
3 imflipP dogL.bmp dogH3.bmp h 3 28
4 imflipP dogL.bmp dogH4.bmp h 4 41
5 imflipP dogL.bmp dogH5.bmp h 5 33
6 imflipP dogL.bmp dogH6.bmp h 6 28
8 imflipP dogL.bmp dogH8.bmp h 8 32
9 imflipP dogL.bmp dogH9.bmp h 9 30
10 imflipP dogL.bmp dogH10.bmp h 10 33
12 imflipP dogL.bmp dogH7.bmp h 12 29
For each row that a thread is responsible for, each pixel’s 3-byte RGB values are swapped
with its horizontal mirror. This swap starts at col= [0...2] which holds the RGB values of
pixel 0, and continues until the last RGB (3 byte) value has been swapped. For a 640×480
image, since Hbytes= 1920, and there is no wasted byte, the last pixel (i.e., pixel 639) is at
col= [1917...1919].
So, what do these results tell us? First of all, in both the vertical and horizontal flip case, it
is clear that using more than a single thread helps. So, our efforts to parallelize the program
weren’t for nothing. However, the troubling news is that, beyond 3 threads, there seems to
be no performance improvement at all, in both the vertical and horizontal case. For ≥ 4
threads, you can simply regard the data as noise!
What Table 2.1 clearly shows is that multithreading helps up to 3 threads. Of course,
this is not a generalized statement. This statement strictly applies to my i7-960 test CPU
(4C/8T) and the code I have shown in Code 2.7 and Code 2.8, which are the heart of the
imflipP.c code. By this time, you should have a thousand questions in your mind. Here are
some of them:
• Would the results be different with a less powerful CPU like a 2C/2T?
We parallelized our first serial program imflip.c and developed its parallel version
imflipP.c in Chapter 2. The parallel version achieved a reasonable speed-up using
pthreads, as shown in Table 2.1. Using multiple threads reduced the execution time from
131 ms (serial version) down to 70 ms, and 46 ms when we launched two and three threads,
respectively, on an i7-960 CPU with 4C/8T. Introducing more threads (i.e., ≥ 4) didn’t
help. In this chapter, we want to understand the factors that contributed to the perfor-
mance numbers that were reported in Table 2.1. We might not be able to improve them,
but we have to be able to explain why we cannot improve them. We do not want to achieve
good performance by luck !
• Compiler is a large piece of code that is packed with routines for two
things: (1) compilation and (2) optimization. (1) is the compiler's primary job, and (2) is
the additional work the compiler does at compile time to potentially optimize
the inefficient code that the programmer wrote. So, the compiler is the "organizer"
at compile time. At compile time, time is frozen, meaning that the compiler could
contemplate many alternative scenarios that can happen at run time and produce
the best code for run time. When we run the program, the clock starts ticking. One
thing the compiler cannot know is the data, which could completely change the flow
of the program. The data can only be known at runtime, when the OS and CPU are
in action.
• The Operating System (OS) is the software that is the “boss” or the “manager” of the
hardware at run time. Its job is to allocate and map the hardware resources efficiently
at run time. Hardware resources include the virtual CPUs (i.e., threads), memory,
hard disk, flash drives (via Universal Serial Bus [USB] ports), network cards, keyboard,
monitor, GPU (to a certain degree), and more. A good OS knows its resources and
how to map them very well. Why is this important? Because the resources themselves
(e.g., CPU) have no idea what to do. They simply follow orders. The OS is the general
and the threads are the soldiers.
• Hardware is the CPU+memory+peripherals. The OS takes the binary code that the
compiler produced and assigns it to virtual cores at run time. The virtual cores execute
it as fast as possible. The OS also facilitates the data movement between
the CPU and the memory, disk, keyboard, network card, etc.
• User is the final piece of the puzzle: Understanding the user is also important in
writing good code. The user of a program isn’t a programmer, yet the programmer
has to appeal to the user and has to communicate with him or her. This is not an
easy task!
In this book, the major focus will be on hardware, particularly the CPU and memory (and,
later in Part II, GPU and memory). Understanding the hardware holds the key to developing
high-performance code, whether for CPU or GPU. In this chapter, we will discover the truth
about whether it is possible to speed up our first parallel program, imflipP.c. If we can, how?
The only problem is: we don’t know which part of the hardware we can use more efficiently
for performance improvement. So, we will look at everything.
Analogy 3.1 emphasizes the performance advantage of OoO execution; an OoO core
can execute independent dependence-chains (i.e., chains of CPU instructions whose results
do not depend on each other) in parallel, without having to wait for a stalled instruction to finish,
achieving a healthy speed-up. But there are other trade-offs when a CPU is designed using
one of the two paradigms. One wonders which one is the better design idea: (1) more cores
that are inO, or (2) fewer cores that are OoO? What if we took the idea to the extreme
and placed, say, 60 inO cores in a CPU? Would this work faster than a CPU
that has 8 OoO cores? The answer is not as easy as just picking one of them.
Here are the facts related to inO versus OoO CPUs:
• Since both design ideas are valid, there is a real inO CPU design like this: the
Xeon Phi, manufactured by Intel. One model, the Xeon Phi 5110P, has 60 inO cores with
4 threads in each core, making it capable of executing 240 threads. It is considered a
Many Integrated Core (MIC) device rather than a CPU; each core works at a fairly low clock speed,
such as 1 GHz, but it gets its computational advantage from the sheer number of cores
and threads. Since inO cores consume much less power, a 60C/240T Xeon Phi's power
consumption is only slightly higher than that of a comparable 6C/12T Core i7 CPU. I will
be providing execution times on the Xeon Phi 5110P shortly.
• An inO CPU would only benefit a restricted set of applications; not every application
can take advantage of so many cores or threads. In most applications, we get diminish-
ing returns beyond a certain number of cores or threads. Generally, image and signal
processing applications are perfect for inO CPUs or MICs. Scientific high-performance
processing applications are also typically good candidates for inO CPUs.
• Another advantage of inO cores is their low power consumption. Since each core is
much simpler, it does not consume as much power as a comparable OoO core. This
is why most of today’s netbooks incorporate Intel Atom CPUs, which have inO cores.
An Atom CPU consumes only 2–10 Watts. The Xeon Phi MIC is basically 60 Atom
cores, with 4 threads/core, stuffed into a chip.
• If having so many cores and threads can benefit even just a small set of applications,
why not take this idea even farther and put thousands of cores in a compute unit
that can execute even more than 4 threads per core? It turns out, even this idea is
valid. Such a processor, which could execute something like hundreds of thousands of
threads in thousands of cores, is called a GPU, which is what this book is all about!
• Maybe the right way to phrase this question is: is there a difference between launching
and executing a thread?
• When we design a program to be “8-threaded,” what are we assuming about runtime?
Are we assuming that all 8 threads are executing?
• Remember from Section 2.1.5: there were 1499 threads launched on the computer, yet
the CPU utilization was 0%. So, not every thread is executing in parallel. Otherwise,
CPU utilization would be hitting the roof. If a thread is not executing, what is it
doing? Who is managing these threads at runtime?
TABLE 3.2 imflipP.c execution times (ms) for the CPUs listed in Table 3.1.
#Threads CPU1 CPU2 CPU3 CPU4 CPU5 CPU6
Serial V 109 131 159 117 181 185
2 V 93 70 50 58 104 95
3 V 78 46 33 43 75 64
4 V 78 67 49 59 54 49
5 V 93 55 40 52 35 57
6 V 78 51 35 55 35 48
8 V 78 52 37 53 26 37
9 V 47 34 52 25 49
10 V 40 23 45
12 V 35 28 38
Serial H 62 81 50 60 66 73
2 H 31 41 25 36 57 38
3 H 46 28 16 29 39 25
4 H 46 41 25 41 23 19
5 H 33 20 34 13 28
6 H 28 18 31 17 24
8 H 32 20 23 13 18
9 H 30 19 21 12 24
10 H 20 11 22
12 H 18 14 19
• Probably, the answer to why ≥ 4 threads does not help our performance in Table 3.2
is hidden in these questions.
• Yet another question is whether thick versus thin threads could change this answer.
• Another obvious question: all of the CPUs in Table 3.1 are OoO; would the results look
any different on an inO processor?
• Compiler compiles the thread creation routines into machine code (CPU language)
at compile time. The final product of the compiler is an executable instruction list,
or the binary executable. Note that the compiler has minimal information (or idea)
about what is going to happen at runtime when it is compiling from the programming
language to machine code.
• The OS is responsible for the runtime. Why do we need such an intermediary?
It is because a multitude of different things can happen when executing the binary
that the compiler produced. BAD THINGS could happen such as: (1) the disk could
be full, (2) memory could be full, (3) the user could enter a response that causes
the program to crash, (4) the program could request an extreme number of threads,
for which there are no available thread handles. Alternatively, even if nothing goes
wrong, somebody has to be responsible for RESOURCE EFFICIENCY, i.e., running
programs efficiently by taking care of things like: (1) who gets which virtual CPU;
(2) when a program asks to allocate memory, should it get it, and if so, what is the pointer;
(3) if a program wants to create a child thread, do we have enough resources to create
it, and if so, which thread handle to give it; (4) accessing disk resources; (5) network
resources; (6) any other resource you can imagine. Resources are managed at runtime
and there is no way to know them precisely at compile time.
• Hardware executes the machine code. The machine code for the CPU to execute
is assigned to the CPU at runtime by the OS. Similarly, the memory is read and
transferred mostly by the peripherals that are under the OS’s control (e.g., Direct
Memory Access – DMA controller).
• User enjoys the program which produces excellent results if the program was written
well and everything goes right at runtime.
FIGURE 3.1 The life cycle of a thread. From the creation to its termination, a thread
is cycled through many different statuses, assigned by the OS (Created, Runnable,
Running, Stopped, Terminated).
The OS, then, tries to find a virtual CPU resource that is available to execute this code.
The parent thread does not care which virtual CPU the OS chooses, since this is a resource
management issue that the OS is responsible for. The OS maps the thread handle it just
assigned to an available virtual CPU at runtime (e.g., handle 1763 → vCPU4), assuming
that virtual CPU 4 (vCPU4) is available right at the time that pthread_create() is
called.
For a thread to actually start executing, two things must happen: (1) a virtual CPU (vCPU) must be found for that thread to
execute on, and (2) the status of the thread must change to Running.
The Runnable=⇒Running status change is handled by a part of the OS called the
dispatcher. The OS treats each one of the available CPU threads as a virtual CPU (vCPU);
so, for example, an 8C/16T CPU has 16 vCPUs. The thread that was waiting in the queue
starts running on a vCPU. A sophisticated OS will pay attention to where it places the
threads to optimize performance. This placement, called core affinity, can actually
be manually modified by the user to override a potentially suboptimal placement by the OS.
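Core affinity is not something imflipP.c relies on, but as an aside, on Linux a whole process can be pinned from the shell (e.g., taskset -c 4 ./imflipP ...) or a thread can pin itself with the glibc-specific pthread_setaffinity_np() call; the sketch below is only an illustration, with a function name of my choosing:

// Pin the calling thread to vCPU 4 (Linux/glibc only; illustrative).
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

int pin_me_to_vcpu4(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(4, &set);                          // logical CPU (vCPU) number 4
    return pthread_setaffinity_np(pthread_self(),
                                  sizeof(cpu_set_t), &set);   // returns 0 on success
}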
The OS allows each thread to run for a certain period of time (called quantum) before it
switches to another thread that has been waiting in the Runnable status. This is necessary
to avoid starvation, i.e., a thread being stuck forever in the Runnable status. When a thread
is switched from Running=⇒Runnable, all of its register information – and more – has
to be saved in an area; this information is called the context of a thread. Accordingly, the
Running=⇒Runnable status change is called a context switch. A context switch takes
a certain amount of time to complete and has performance implications, although it is an
unavoidable reality.
During execution (in the Running status), a thread might call a function, say scanf(), to
read a keyboard input. Reading the keyboard is much slower than any other CPU operation;
so, there is no reason why the OS should keep our thread in the Running status
while waiting for the keyboard input, which would starve other threads of core time. In
this case, the OS cannot switch this thread to Runnable either, since the Runnable
queue is dedicated to threads that can be immediately switched over to the Running
status when the time is right. A thread that is waiting for a keyboard input could wait
for an indefinite amount of time; the input could arrive immediately or within
10 minutes, in case the user has left to get coffee! So, there is another status to distinguish
this case; it is called Stopped.
A thread undergoes a Running=⇒Stopped status switch when it requests a resource
that is not going to be available for a period of time, or it has to wait for an event to hap-
pen for an indeterminate amount of time. When the requested resource (or data) becomes
available, the thread undergoes a Stopped=⇒Runnable status switch and is placed in
the queue of Runnable threads that are waiting for their time to start executing again.
It would make no sense for the OS to switch this thread to Running either, as this would
mean a chaotic unscheduling of another peacefully executing thread, i.e., kicking it out of
the core! So, to do things in a calm and orderly fashion, the OS places the Stopped thread
back in the Runnable queue and decides when to allow it to execute again later. It might,
however, assign a different priority to threads that should be dispatched ahead of others
for whatever reason.
Finally, when a thread completes its execution and, say, the pthread_join() function is called for it,
the OS makes the Running=⇒Terminated status switch and the thread is permanently
out of the Runnable queue. Once its memory areas, etc., are cleaned up, the handle for that
thread is destroyed and becomes available later for another pthread_create().
at any given point in time. It picks one of the 1499 tasks, and assigns one of the people to
do it. If another task becomes more urgent for that one person (for example, if a network
packet arrives requiring immediate attention), the OS switches that person to doing that
more urgent task and suspends what he or she was currently doing.
We are curious about how these status switches affect our application’s performance. In
the case of 1499 threads in Figure 2.1, it is highly likely that something like 1495 threads are
Stopped or Runnable, waiting for you to hit some key on the keyboard or a network packet
to arrive, and only four threads are Running, probably your multithreaded application
code. Here is an analogy:
In Analogy 3.2, sitting on the chair corresponds to the thread status Runnable and doing
what is written on the paper corresponds to the thread status Running and the people
are the virtual CPUs. The manager, who is allowed to switch the status of people, is the
OS, whereas his or her notebook is where the thread contexts are saved to be used later
during a context switch. Crumpling up a paper (task) is equivalent to switching it to the
Terminated status.
The number of the launched threads could fluctuate from 1499 to 1505 down to 1365,
etc., but the number of available virtual CPUs cannot change (e.g., 8 in this example),
since they are a “physical” entity. A good way to define the 1499 quantity is software
threads, i.e., the threads that the OS creates. The available number of physical threads
(virtual CPUs) is the hardware threads, i.e., the maximum number of threads that the
CPU manufacturer designed the CPU to be able to execute. It is a little bit confusing
that both of them are called “threads,” since the software threads are nothing but a data
structure containing information about the task that the thread will perform as well as the
thread handle, memory areas, etc., whereas the hardware threads are the physical hardware
component of the CPU that is executing machine code (i.e., the compiled version of the
task). The job of the OS is to find an available hardware thread for each one of the software
threads it is managing. The OS is responsible for managing the virtual CPUs, which are
hardware resources, much like the available memory.
designed. When you launch a program that executes two heavily active threads, the OS
will do its best to bring them into the Running status as soon as possible. Possibly one
more thread, belonging to the OS's thread scheduler, is very active, bringing the number of
highly active threads to 3.
So, how does this help in explaining the results in Table 3.2? Although the exact answer
depends on the model of the CPU, there are some highly distinguishable patterns that can
be explained with what we just learned. Let us pick CPU2 as an example. While CPU2
should be able to execute 8 threads in parallel (it is a 4C/8T), the performance falls off a
cliff beyond 3 launched threads. Why? Let’s try to guess this by fact-checking:
• Remember our Analogy 1.1, where two farmers were sharing a tractor. By timing the
tasks perfectly, together, they could get 2x more work done. This is the hope behind
4C/8T getting an 8T performance, otherwise, you really have only 4 physical cores
(i.e., tractors).
• If this best case scenario happened here in our Code 2.1 and Code 2.2, we should
expect the performance improvement to continue to 8 threads, or at least, something
like 6 or 7. This is not what we see in Table 3.2!
• So, what if one of the tasks required one of the farmers to use the hammer and other
resources in the tractor in a chaotic way? The other wouldn't be able to do anything
useful, since they would be continuously bumping into each other, stumbling over each other, etc.;
the performance wouldn't even be close to 2x (i.e., 1+1 = 2)! It would be more like
0.9x! As far as efficiency is concerned, 1+1 = 0.9 sounds pretty bad! In other words,
if both threads are “thick threads,” they are not meant to work simultaneously with
another thread inside the same core ... I mean, efficiently ... This must be somehow
the case in Code 2.1 and Code 2.2, since we are not getting anything out of the dual
threads inside each core ...
• What about memory? We will see an entire architectural organization of the cores
and memory in Chapter 4. But, for now, it suffices to say that, no matter how many
cores/threads you have in a CPU, you only have a single main memory for all of the
threads to share. So, if one thread was a memory-unfriendly thread, it would mess
up everybody’s memory accesses. This is another possibility in explaining why the
performance hits a brick wall at ≥ 4 threads.
• Let’s say that we explained the problems with why we are not able to use the double-
threads in each core (called hyper-threading by Intel), but why does the performance
stop improving at 3 threads, not 4? The performance from 3 to 4 threads is lower,
which is counterintuitive. Are these threads not even able to use all of the cores?
A similar pattern is visible in almost every CPU, although the exact thread count
depends on the maximum available threads and varies from CPU to CPU.
Where is the best place to start? If you want to improve a computer program’s perfor-
mance, the best place to start is the innermost loops. Let’s start with the MTFlipH() function
shown in Code 2.8. This function is taking a pixel value and moving it to another memory
area one byte at a time. The MTFlipV() function shown in Code 2.7 is very similar. For each
pixel, both functions move R, G, and B values one byte at a time. What is wrong with this
picture? A lot! When we go through the details of the CPU and memory architecture in
Chapter 4, you will be amazed with how horribly inefficient Code 2.7 and Code 2.8 are. But,
for now, we just want to find obvious fixes and apply them and observe the improvements
quantitatively. We will not comment on them until we learn more about the memory/core
architecture in Chapter 4.
...
for(row=ts; row<=te; row++) {
col=0;
while(col<ip.Hpixels*3/2){
// example: Swap pixel[42][0] , pixel[42][3199]
pix.B = TheImage[row][col];
pix.G = TheImage[row][col+1];
pix.R = TheImage[row][col+2];
TheImage[row][col] = TheImage[row][ip.Hpixels*3-(col+3)];
...
So, to improve this program, we should carefully analyze the memory access patterns.
Figure 3.2 shows the memory access patterns of the MTFlipH() function during the processing
of the 22 MB image dogL.bmp. This dog picture consists of 2400 rows and 3200 columns. For
example, when flipping Row 42 horizontally (no specific reason for choosing this number),
here is the swap pattern for pixels (also shown in Figure 3.2):
[42][0]←→[42][3199], [42][1]←→[42][3198] ... [42][1598]←→[42][1601], [42][1599]←→[42][1600]
FIGURE 3.2 Memory access patterns of MTFlipH() in Code 2.8. A total of 3200 pixels'
RGB values (9600 Bytes) are flipped for each row of the 2400-row, 3200-column (≈22 MB)
image in main memory.
In Figure 3.2, notice that each pixel corresponds to 3 consecutive bytes holding that pixel’s
RGB values. During just this one pixel swap, the function MTFlipH() requests 6 memory
accesses, 3 to read the bytes [0..2] and 3 to write them into the flipped pixel location
held at bytes [9597..9599]. This means that, to merely flip one row, our MTFlipH() function
requests 3200×6 = 19, 200 memory accesses, with mixed read and writes. Now, let’s see what
happens when, say, 4 threads are launched. Each thread is trying to finish the following
tasks, consisting of flipping 600 rows.
tid= 0 : Flip Row[0] , Flip Row[1] ... Flip Row [598] , Flip Row [599]
tid= 1 : Flip Row[600] , Flip Row[601] ... Flip Row [1198] , Flip Row [1199]
tid= 2 : Flip Row[1200] , Flip Row[1201] ... Flip Row [1798] , Flip Row [1799]
tid= 3 : Flip Row[1800] , Flip Row[1801] ... Flip Row [2398] , Flip Row [2399]
Notice that each one of the 4 threads is requesting memory accesses just as frequently as the
single thread did. If each thread was designed improperly, causing chaotic memory access
requests, 4 of them together will create 4x the mess! Let's look at the very early part of the
execution, when main() launches all 4 threads and assigns them the MTFlipH() function to
execute. If we assume that all 4 threads started executing at exactly the same time, this is
what all 4 threads are trying to do simultaneously for the first few bytes:
Although there will be slight variations in the progression of each thread, it doesn’t
change the story. What do you see when you look at these memory access patterns? The
very first thread, tid= 0 is trying to read the pixel [0][0], whose value is at memory addresses
mem(00000000..00000002). This is the very beginning of tid= 0’s task that requires it to
swap the entire row 0 first before it moves on to row 1.
While tid= 0 is waiting for its 3 bytes to come in from the memory, at precisely the same
time, tid= 1 is trying to read pixel[600][0] that is the very first pixel of row 600, located at
memory addresses mem(05760000..05760002), i.e., 5.5 MB (Megabytes) away from the very
first request. Hold on, tid= 2 is not standing still. It is trying to do its own job that starts
by swapping the entire row 1200. The first pixel to be read is pixel[1200][0], located at the 3
consecutive bytes with memory addresses mem(11520000..11520002), i.e., 11 MB away from
the 3 bytes that tid= 0 is trying to read. Similarly, tid= 3 is trying to read the 3 bytes that
are 16.5 MB away from the very first 3 bytes ... Remember that the total image was 22 MB
and the processing of it was divided into 4 threads, each responsible for a 5.5 MB chunk
(i.e., 600 rows).
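To see where these addresses come from: each row of dogL.bmp occupies 3200 pixels × 3 B = 9600 B, so row 600 starts at byte 600 × 9600 = 5,760,000 (≈ 5.5 MB), row 1200 at 1200 × 9600 = 11,520,000 (≈ 11 MB), and row 1800 at 1800 × 9600 = 17,280,000 (≈ 16.5 MB), which is exactly where the addresses quoted above point.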
When we learn the detailed inner-workings of a DRAM (Dynamic Random Access Mem-
ory) in Chapter 4, we will understand why this kind of a memory access pattern is nothing
but a disaster, but, for now, we will find a very simple fix for this problem. For the folks
who are craving to get into the GPU world, let me make a comment here that the DRAM
in the CPU and the GPU are almost identical operationally. So, anything we learn here will
be readily applicable to the GPU memory with some exceptions resulting from the massive
parallelism of the GPUs. An identical “disaster memory access” example will be provided
for the GPUs and you will be able to immediately guess what the problem is by relying on
what you learned in the CPU world.
While this is an excellent guide to improving the performance of our first parallel program
imflipP.c, let’s first check to see if we were obeying these rules in the first place. Here is the
summary of the MTFlipH() function’s memory access patterns (Code 2.8):
• Granularity rule is clearly violated, since we are trying to access one byte at a time.
• Locality rule wouldn’t be violated if there was only a single thread. However, multiple
simultaneous (and distant) accesses by different threads cause violations.
• L1, L2, L3 caching do not help us at all since there isn’t a good “data reuse” scenario.
This is because we never need any data element more than once.
With almost every rule violated, it is no wonder that the performance of imflipP.c is
miserable. Unless we obey the access rules of DRAM, we are just creating massively inefficient
memory access patterns that cripple the overall performance.
The key observation is that, since the dog picture is in the main memory (i.e., DRAM),
every single pixel-read triggers an access to DRAM. According to Table 3.3, we know that
DRAM doesn’t like to be bothered frequently.
unsigned char Buffer[16384]; // This is the buffer to use to get the entire row
...
for(row=ts; row<=te; row++) {
// bulk copy from DRAM to cache
memcpy((void *) Buffer, (void *) TheImage[row], (size_t) ip.Hbytes);
col=0;
while(col<ip.Hpixels*3/2){
pix.B = Buffer[col];
pix.G = Buffer[col+1];
pix.R = Buffer[col+2];
Buffer[col] = Buffer[ip.Hpixels*3-(col+3)];
Buffer[col+1] = Buffer[ip.Hpixels*3-(col+2)];
Buffer[col+2] = Buffer[ip.Hpixels*3-(col+1)];
Buffer[ip.Hpixels*3-(col+3)] = pix.B;
Buffer[ip.Hpixels*3-(col+2)] = pix.G;
Buffer[ip.Hpixels*3-(col+1)] = pix.R;
col+=3;
}
// bulk copy back from cache to DRAM
memcpy((void *) TheImage[row], (void *) Buffer, (size_t) ip.Hbytes);
...
When we transfer the 9600 B from the main memory into the Buffer, we are relying on
the efficiency of the memcpy() function, which is provided as part of the standard C library.
During the execution of memcpy(), 9600 bytes are transferred from the main memory into
the memory area that we name Buffer. This access is super efficient, since it only involves
a single continuous memory transfer that obeys every rule in Table 3.3.
Let’s not kid ourselves: Buffer is also in the main memory; however, there is a huge
difference in the way we will use these 9600 bytes. Since we will access them continuously,
they will be cached and will no longer bother DRAM. This is what will allow the accesses
to the Buffer memory area to be significantly more efficient, thereby obeying most of the
rules in Table 3.3. Let us now re-engineer the code to use the Buffer.
in the 16 KB range that will allow MTFlipHM() to be compliant with the L1 caching rule of
thumb, shown in Table 3.3.
Here are some lines from Code 3.1, highlighting the buffering in MTFlipHM(). Note that
the global array TheImage[] is in DRAM, since it was read into DRAM by the ReadBMP()
function (see Code 2.5). This is the variable that should obey strict DRAM rules in Table 3.3.
I guess we cannot do better than accessing it once to read 9600 B of data and copying this
data into our local memory area. This makes it 100% DRAM-friendly.
unsigned char Buffer[16384]; // This is the buffer to use to get the entire row
...
for(...){
// bulk copy from DRAM to cache
memcpy((void *) Buffer, (void *) TheImage[row], (size_t) ip.Hbytes);
...
while(...){
... =Buffer[...]
... =Buffer[...]
... =Buffer[...]
Buffer[...]=Buffer[...]
...
Buffer[...]=...
Buffer[...]=...
...
}
// bulk copy back from cache to DRAM
memcpy((void *) TheImage[row], (void *) Buffer, (size_t) ip.Hbytes);
...
The big question is: Why is the local variable Buffer[] fair game? We modified the
innermost loop and made it access the Buffer[] array as terribly as we were accessing
TheImage[] before. What is so different with the Buffer[] array? Also, another nagging
question is the claim that the contents of the Buffer[] array will be “cached.” Where did
that come from? There is no indication in the code that says “put these 9600 bytes into the
cache.” How are we so sure that it does go into cache? The answer is actually surprisingly
simple and has everything to do with the design of the CPU architecture.
A CPU caching algorithm predicts which values inside DRAM (the “bad area”) should
be temporarily brought into the cache (the “good area”). These guesses do not have to
be 100% accurate, since if the guess is bad, it could always be corrected later. The result
is an efficiency penalty rather than a crash or something. Bringing “recently used DRAM
contents” into the cache memory is called caching. The CPU could get lazy and bring
everything into the cache, but this is not possible since there are only small amounts of
cache memory available. In the i7 family processors, L1 cache is 32 KB for data elements
and L2 cache is 256 KB. L1 is faster to access than L2. Caching helps for three major reasons:
• Access Patterns: Cache memory is SRAM (static random access memory), not
DRAM like the main memory. The rules governing SRAM access patterns are much
less strict than the DRAM efficiency rules listed in Table 3.3.
• Speed: Since SRAM is much faster than DRAM, accessing cache is substantially
faster once something is cached.
• Isolation: Each core has its own cache memory (L1$ and L2$). So, if each thread
was accessing up to a 256 KB of data frequently, this data would be very efficiently
cached in that core’s cache and would not bother the DRAM.
We will get into the details of how the CPU cores and CPU main memory work together
in Chapter 4. However, we have learned enough so far about the concept of buffering to
improve our code. Note that caching is extremely important for CPUs and even more so
for GPUs. So, understanding the buffering concept that causes data to be cached is extremely
important. There is no way to tell the CPU to cache something explicitly, although some
theoretical research has investigated this topic. It is done completely automatically by the
CPU. However, the programmer can influence the caching dramatically by the memory
access patterns of the code. We experienced first-hand what happens when the memory
access patterns are chaotic like the ones shown in Code 2.7 and Code 2.8. The CPU caching
algorithms simply cannot correct for these chaotic patterns, since their simplistic caching/
eviction algorithms throw in the towel. The compiler cannot correct for these either, since
it literally requires the compiler to read the programmer’s mind in many cases! The only
thing that can help the performance is the logic of the programmer.
TABLE 3.4 imflipPM.c execution times (ms) for the CPUs listed in Table 3.1.
#Threads CPU1 CPU2 CPU3 CPU4 CPU5 CPU6
Serial W 4.116 5.49 3.35 4.11 5.24 3.87
2 W 3.3861 3.32 2.76 2.43 3.51 2.41
3 W 3.0233 2.90 2.66 1.96 2.78 2.52
4 W 3.1442 3.48 2.81 2.21 1.57 1.95
5 W 3.1442 3.27 2.71 2.17 1.47 2.07
6 W 3.05 2.73 2.04 1.69 2.00
8 W 3.02 2.75 2.03 1.45 2.09
9 W 2.74 1.45 2.26
10 W 2.74 1.98 1.45 1.93
12 W 2.75 1.33 1.91
Serial I 35.8 49.4 29.0 34.6 45.3 42.6
2 I 23.7 25.2 14.7 17.6 34.5 21.4
3 I 21.2 17.4 9.8 12.3 19.5 14.3
4 I 22.7 20.1 14.6 17.6 12.5 10.9
5 I 22.3 17.1 11.8 14.3 8.8 15.8
6 I 21.8 15.8 10.5 11.8 10.5 13.2
8 I 18.4 10.4 12.1 8.3 10.0
9 I 9.8 7.5 13.5
10 I 16.6 9.5 11.6 6.9 12.3
12 I 9.2 8.6 11.2
TABLE 3.5  Comparing imflipP.c execution times (H, V type flips in Table 3.2) to
imflipPM.c execution times (I, W type flips in Table 3.4).
#Threads CPU5 CPU5 Speedup CPU5 CPU5 Speedup
V W V→W H I H→I
Serial 181 5.24 34× 66 45.3 1.5×
2 104 3.51 30× 57 34.5 1.7×
3 75 2.78 27× 39 19.5 2×
4 54 1.57 34× 23 12.5 1.8×
5 35 1.47 24× 13 8.8 1.5×
6 35 1.69 20× 17 10.5 1.6×
8 26 1.45 18× 13 8.3 1.6×
9 25 1.45 17× 12 7.5 1.6×
10 23 1.45 16× 11 6.9 1.6×
12 28 1.33 21× 14 8.6 1.6×
MTFlipVM() and MTFlipHM() and the unfriendly ones MTFlipV() and MTFlipH(). It is hard
to make a generic comment such as "major improvement all across the board," since this is
not really what we are seeing here. The improvements in the horizontal-family and vertical-
family functions are so different that we need to comment on them separately.
chapter. To be able to make educated guesses, let’s look at the facts. First, let’s explain
why there would be a difference in the vertical versus horizontal family of flips, although,
in the end, both of the functions are flipping exactly the same number of pixels:
Comparing the MTFlipH() in Code 2.8 to its memory-friendly version MTFlipHM() in
Code 3.1, we see that the only difference is the local buffering, and the rest of the code is
identical. In other words, if there is any speedup between these two functions, it is strictly
due to buffering. So, it is fair to say that
• Local buffering allowed us to utilize cache memory, which resulted in a 1.6× speedup.
• This number fluctuates minimally with more threads.
On the other hand, comparing the MTFlipV() in Code 2.7 to its memory-friendly ver-
sion MTFlipVM() in Code 3.2, we see that we turned the function from a core-intensive
function to a memory-intensive function. While MTFlipV() is picking at the data one byte
at a time, and keeping the core’s internal resources completely busy, MTFlipVM() uses
the memcpy() bulk memory copy function and does everything through the bulk mem-
ory transfer, possibly bypassing the core involvement almost completely. The magical memcpy()
function is extremely efficient at copying something from DRAM when you are grabbing a big
chunk of data, like we are here. This is also consistent with our DRAM efficiency rules in
Table 3.3.
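Although the exact listing is Code 3.2, the bulk-transfer idea can be sketched roughly as follows (this is my paraphrase, with row2 and the two row buffers named for illustration; string.h is assumed to be included for memcpy()):

unsigned char Buffer [16384];     // holds one row from this thread's (top) half
unsigned char Buffer2[16384];     // holds the vertically mirrored row
int row, row2;
...
for(row=ts; row<=te; row++){                        // ts..te: rows assigned to this thread
    row2 = ip.Vpixels - (row+1);                    // the mirror row in the bottom half
    memcpy((void *) Buffer,  (void *) TheImage[row],  (size_t) ip.Hbytes);  // bulk read
    memcpy((void *) Buffer2, (void *) TheImage[row2], (size_t) ip.Hbytes);  // bulk read
    memcpy((void *) TheImage[row],  (void *) Buffer2, (size_t) ip.Hbytes);  // bulk write: swap
    memcpy((void *) TheImage[row2], (void *) Buffer,  (size_t) ip.Hbytes);  // the two rows
}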
If this is all true, why is the speedup somehow saturating? In other words, why are we
getting a lower speedup when we launch more threads? It looks like the program execution
time is not going below a ≈ 1.5× speedup, no matter what the number of threads is. This
can actually be explained intuitively as follows:
• When a program is highly memory-intensive, its performance will be strictly determined
by the memory bandwidth.
• We seem to have saturated the memory bandwidth at ≈ 4 threads.
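A rough sanity check supports this: a W-type flip must read and write the whole ≈ 22 MB image at least once, i.e., roughly 44 MB of DRAM traffic, and doing that in about 1.5 ms (CPU5 in Table 3.4) corresponds to roughly 30 GB/s. That is already a sizable fraction of what such a DRAM subsystem can sustain in practice, so piling on more threads cannot buy much more.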
First of all, we are asking the executable program imflipPM (or, imflipPM.exe in Windows)
to be launched. To launch this program (i.e., start executing), the OS creates a process with
a Process ID assigned to it. When this program is executing, it will need three different
memory areas:
• A stack area to store the function call return addresses and arguments that are
passed/returned onto/from the function calls. This area grows from top to bottom
(from high address to the low addresses), since this is how each microprocessor uses
a stack.
• A heap area to store the dynamically allocated memory contents using the malloc()
function. This memory area grows in the opposite direction of the stack to al-
low the OS to use every possible byte of memory without bumping into the stack
contents.
• A code area to store the program code and the constants that were declared within
the program. This area is not modified at run time; the constants are stored here
because they, too, are never modified.
The memory map of the process that the OS created will look like Figure 3.3: First, since
the program is launched with only a single thread that is running main(), the memory map
looks like Figure 3.3 (left). As the four pthreads are launched using pthread_create(),
the memory map will look like Figure 3.3 (right). The stack of each thread is saved even
if the OS decides to switch out of that thread to allow another one to run (i.e., context
switching). The context of the thread is saved in the same memory area. Furthermore, the
code is on the bottom memory area and the shared heap among all threads is just above
the code. This is all the threads need to resume their operation when they are scheduled to
run again, following a context switch.
The sizes of the stack and the heap are not known to the OS when it launches imflipPM
for the first time. There are default settings that can be modified. Unix and Mac OS allow
specifying them at the command prompt using switches, while Windows allows changing
them through the application's properties (right click, then modify the properties). Since the programmer is the
one who has the best idea of how much stack and heap a program needs, generous stack and
heap areas should be allocated to an application to avoid a core dump
caused by an invalid memory address access when these memory areas clash with
each other.
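As an illustration of what "switches at the command prompt" and per-thread settings look like (the sizes below are arbitrary examples of mine): on Linux/Mac the bash built-in ulimit -s 65536 raises the default stack limit to 64 MB (the value is in KB), and pthreads lets the programmer request a per-thread stack size through the attribute object:

#include <pthread.h>

pthread_attr_t ThAttr;
pthread_attr_init(&ThAttr);
pthread_attr_setstacksize(&ThAttr, 8 * 1024 * 1024);   // request an 8 MB stack per thread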
Let’s look at our favorite Figure 2.1 again; it shows 1499 launched threads and 98
processes. What this means is that many of the processes that OS launched internally are
multithreaded, very possibly all of them, resembling a memory map shown in Figure 3.3;
FIGURE 3.3 The memory map of a process when only a single thread is running within
the process (left) or multiple threads are running in it (right); in the multithreaded case,
each thread gets its own stack, while the heap is shared.
each process seems to have launched an average of 15 threads, all of which must have a very
low activity ratio. We saw what happens when even 5 or 6 threads are super active for a short
period of time; in Figure 2.1, if all 1499 threads had a high activity ratio like the threads
that we wrote so far, your CPU would possibly choke and you would not even be able to
move your mouse on your computer.
There is another thing to keep in mind when it comes to the 1499 threads: the OS
writers must design their threads to be as thin as possible to prevent the OS from interfering
with application performance. In other words, if any of the OS threads create a lot
of disturbance when changing their status from Runnable to Running, they will over-
tax some of the core resources and will not allow their hyper-thread partner to work efficiently
when one of your threads is scheduled alongside an OS thread. Of course, not every task
can be performed so thinly, and what I just described about the OS has its limits. The
other side of the coin is that the application designer should also pay attention to making the
application threads thin. We will only go into the details of this a little bit, since this is not
a CPU parallel programming book, but rather a GPU one. Throughout the book, though, when
I get a chance to comment on how a CPU thread could have been made a little thinner, I
will point it out.
TABLE 3.6  Comparing imflipP.c execution times (H, V type flips in Table 3.2) to
imflipPM.c execution times (I, W type flips in Table 3.4) for the Xeon Phi 5110P.
Xeon Phi Speedup Speedup
#Threads V W V→W H I H→I
Serial 673 60.9 11× 358 150 2.4×
2 330 30.8 10.7× 179 75 2.4×
4 183 16.4 11.1× 90 38 2.35×
8 110 11.1 9.9× 52 22 2.35×
16 54 11.9 4.6× 27 15 1.8×
32 38 16.1 2.4× 22 18 1.18×
64 39 29.0 1.3× 28.6 29.4 0.98×
128 68 56.7 1.2× 48 53 0.91×
256 133 114 1.15× 90 130 0.69×
512 224 234 0.95× 205 234 0.87×
Here is the command line to compile imflipPM.c to execute on Xeon Phi to get the
performance numbers in Table 3.6:
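A representative invocation, assuming the Intel C compiler (icc) is used to build a native Xeon Phi (KNC) binary that is then copied to the card and run there, might look like:

icc -mmic -pthread imflipPM.c ImageStuff.c -o imflipPM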
Performance results for the Xeon Phi 5110P executing the imflipPM.c program are shown
in Table 3.6. While there is a healthy improvement from multiple threads up to 16 or 32
threads, the performance improvement limit is reached at 32 threads. Sixty-four threads
provide no additional performance improvement. The primary reason for this is that the
threads in our imflipPM.c program are so thick that they cannot take advantage of the
multiple threads in each core.
The entire Part I of this book is dedicated to understanding how to "think parallel";
in fact, even this is not enough. You have to start thinking "massively parallel." When we
had 2, 4, or 8 threads to execute in the examples shown before, it was somehow easy to adjust
the sequence to make every thread do useful work. However, in the GPU world, you will
be dealing with thousands of threads. Teaching how to think in such an absurdly parallel
world should start by learning how to sequence two threads first! This is the reason why
the CPU environment was the perfect place to warm up to parallelism, and it is the philosophy of this
book. By the time you finish Part I of this book, you will not only have learned CPU parallelism,
but you will be totally ready to take on the massive parallelism that the GPUs bring you in
Part II.
If you are still not convinced, let me mention this: GPUs actually support hundreds of thousands of threads, not just thousands! Convinced yet? A corporation like IBM, with hundreds of thousands of employees, can run as smoothly as a corporation with one or two employees, and yet IBM is able to harvest the manpower of all of its employees. But it takes extreme discipline and a systematic approach. This is what GPU programming is all about. If you cannot wait until we reach Part II to learn GPU programming, you can peek at it now; but unless you understand the concepts introduced in Part I, something will always be missing.
GPUs are here to give us sheer computational power. A GPU program that works 10× faster than a comparable CPU program is better than one that works only 5× faster. If somebody could come in and rewrite the same GPU program to be 20× faster, that person is the king (or queen). There is no point in writing a GPU program unless you are targeting speed. There are three things that matter in a GPU program: speed, speed, and speed! So, the goal of this book is to turn you into a GPU programmer who writes super-fast GPU code. This doesn't happen unless we systematically learn every important concept, so that, in the middle of a program, when we encounter some weird bottleneck, we can explain it and remove it. Otherwise, if you are going to write slow GPU code, you might as well spend your time learning much better CPU multithreading techniques, since there is no point in using a GPU unless your code gets every little bit of extra speed out of it. This is why we will time our code throughout the entire book and keep finding ways to make our GPU code faster.
FIGURE 4.1 Inside a computer containing an i7-5930K CPU [10] (CPU5 in Table 3.1),
and 64 GB of DDR4 memory. This PC has a GTX Titan Z GPU that will be used
to test a lot of the programs in Part II.
DDR4 standard (contained inside the PC in Figure 4.1). These standards are determined by a large consortium of chip manufacturers and define the precise timing of the memory chips. If the memory manufacturers design their memory to be 100% compliant with these standards, and INTEL designs their CPUs to be 100% compliant, there is no need for both of these chips to be manufactured by the same company.
In the past four decades, the main memory has always been made out of DRAM. A new DRAM standard was released every 2–3 years to take advantage of the exciting developments in DRAM manufacturing technology. As the CPUs improved, so did the DRAM. However, not only did the improvements in CPU and DRAM technology follow different patterns, but improvement also meant something different for a CPU than for a DRAM:
• For CPU designs, improvement meant more work done per second.
• For DRAM designs, improvement meant more data read per second (bandwidth) as well as more storage (capacity).
CPU manufacturers improved their CPUs using better architectural designs and by tak-
ing advantage of more MOS transistors that became available at each shrinking technology
node (130 nm, 90 nm, ... and 14 nm as of 2016). On the other hand, DRAM manufacturers
improved their memories by the ability to continuously pack more capacitors into the same
area, thereby resulting in more storage. Additionally, they were able to continuously improve
their bandwidths by the newer standards that became available (e.g., DDR3, DDR4).
FIGURE 4.2 The imrotate.c program rotates a picture by a specified angle. Origi-
nal dog (top left), rotated +10◦ (top right), +45◦ (bottom left), and −75◦ (bottom
right) clockwise. Scaling is done to avoid cropping of the original image area.
• To avoid this cropping, the resulting image is scaled, so that the resulting image always
fits in its original size.
• This scaling naturally implies empty areas in the resulting image (black pixels with
RGB=000 values are used to fill the blank pixels).
• The scaling is not automatic, i.e., the same exact amount of scaling is applied at the
beginning, thereby leaving more blank area for certain rotation amounts, as is clearly
evidenced in Figure 4.2.
#include <pthread.h>
#include <stdint.h>
#include <ctype.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <sys/time.h>
#include "ImageStuff.h"
#define REPS 1
#define MAXTHREADS 128
long NumThreads; // Total number of threads
int ThParam[MAXTHREADS]; // Thread parameters ...
double RotAngle; // rotation angle
pthread_t ThHandle[MAXTHREADS]; // Thread handles
pthread_attr_t ThAttr; // Pthread attributes
void* (*RotateFunc)(void *arg); // Func. ptr to rotate img
unsigned char** TheImage; // This is the main image
unsigned char** CopyImage; // This is the copy image
struct ImgProp ip;
...
int main(int argc, char** argv)
{
int RotDegrees, a, i, ThErr;
struct timeval t;
double StartTime, EndTime, TimeElapsed;
switch (argc){
   case 3 : NumThreads=1;              RotDegrees=45;               break;
   case 4 : NumThreads=1;              RotDegrees=atoi(argv[3]);    break;
   case 5 : NumThreads=atoi(argv[4]);  RotDegrees=atoi(argv[3]);    break;
   default: printf("\n\nUsage: imrotate inputBMP outputBMP [RotAngle] [1-128]");
            printf("\n\nExample: imrotate infilename.bmp outname.bmp 45 8\n\n");
            printf("\n\nNothing executed ... Exiting ...\n\n");
            exit(EXIT_FAILURE);
}
if((NumThreads<1) || (NumThreads>MAXTHREADS)){
   printf("\nNumber of threads must be between 1 and %u... \n",MAXTHREADS);
   printf("\n'1' means Pthreads version with a single thread\n");
   printf("\n\nNothing executed ... Exiting ...\n\n");    exit(EXIT_FAILURE);
}
if((RotDegrees<-360) || (RotDegrees>360)){
   printf("\nRotation angle of %d degrees is invalid ...\n",RotDegrees);
   printf("\nPlease enter an angle between -360 and +360 degrees ...\n");
   printf("\n\nNothing executed ... Exiting ...\n\n");    exit(EXIT_FAILURE);
}
...
...
if((RotDegrees<-360) || (RotDegrees>360)){
...
}
printf("\nExecuting the Pthreads version with %u threads ...\n",NumThreads);
RotAngle=2*3.141592/360.000*(double) RotDegrees; // Convert the angle to radians
printf("\nRotating %d deg (%5.4f rad) ...\n",RotDegrees,RotAngle);
RotateFunc=Rotate;
TheImage = ReadBMP(argv[1]);
CopyImage = CreateBlankBMP();
gettimeofday(&t, NULL);
StartTime = (double)t.tv_sec*1000000.0 + ((double)t.tv_usec);
pthread_attr_init(&ThAttr);
pthread_attr_setdetachstate(&ThAttr, PTHREAD_CREATE_JOINABLE);
for(a=0; a<REPS; a++){
   for(i=0; i<NumThreads; i++){
      ThParam[i] = i;
      ThErr = pthread_create(&ThHandle[i], &ThAttr, RotateFunc,
                             (void *)&ThParam[i]);
      if(ThErr != 0){
         printf("\nThread Creation Error %d. Exiting abruptly... \n",ThErr);
         exit(EXIT_FAILURE);
      }
   }
   pthread_attr_destroy(&ThAttr);
   for(i=0; i<NumThreads; i++){  pthread_join(ThHandle[i], NULL);  }
}
gettimeofday(&t, NULL);
EndTime = (double)t.tv_sec*1000000.0 + ((double)t.tv_usec);
TimeElapsed=(EndTime-StartTime)/1000.00;
TimeElapsed/=(double)REPS;
//merge with header and write to file
WriteBMP(CopyImage, argv[2]);
tn = *((int *) tid);
tn *= ip.Vpixels/NumThreads;
TABLE 4.1 imrotate.c execution times (in ms) for the CPUs in Table 3.1 (+45◦ rotation).
# HW Threads 2C/4T 4C/8T 4C/8T 4C/8T 6C/12T 8C/16T
i5-4200M i7-960 i7-4770K i7-3820 i7-5930K E5-2650
CPU1 CPU2 CPU3 CPU4 CPU5 CPU6
# SW Threads
1 951 1365 782 1090 1027 845
2 530 696 389 546 548 423
3 514 462 261 368 365 282
4 499 399 253 322 272 227
5 499 422 216 295 231 248
6 387 283 338 214 213
8 374 237 297 188 163
9 237 177 199
10 341 228 285 163 201
12 217 158 171
TABLE 4.2 imrotate.c threading efficiency (η) and parallelization overhead (1−η) for CPU3 and CPU5. The last column reports the speedup achieved by using CPU5, which has more cores/threads; note that there is no speedup until 6 or more SW threads are launched.
# SW CPU3: i7-4770K 4C/8T CPU5:i7-5930K 6C/12T Speedup
Thr Time η 1−η Time η 1−η CPU5→CPU3
1 782 100% 0% 1027 100% 0% 0.76×
2 389 100% 0% 548 94% 6% 0.71×
3 261 100% 0% 365 94% 6% 0.72×
4 253 77% 23% 272 95% 5% 0.93×
5 216 72% 28% 231 89% 11% 0.94×
6 283 46% 54% 214 80% 20% 1.32×
8 237 42% 58% 188 68% 32% 1.26×
9 237 41% 59% 177 65% 35% 1.34×
10 228 34% 66% 163 63% 37% 1.4×
12 217 30% 70% 158 54% 46% 1.37×
execution time of 1027 ms as our baseline (i.e., 100% efficiency), then, when we launch two threads, we ideally expect half of that execution time (1027/2 = 513.5 ms). However, we see 548 ms. Not bad, but only ≈94% of the performance we hoped for.
In other words, launching the additional thread improved the execution time, but hurt
the efficiency of the CPU. Quantifying this efficiency metric (η) is fairly straightforward as
shown in Equation 4.4. One corollary of Equation 4.4 is that parallelization has an overhead
that can be defined as shown in Equation 4.5.
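A plausible form of these two equations, consistent with the ≈94% figure computed above (here T1 denotes the single-threaded execution time and TN the execution time with N launched threads; the exact notation used in Equations 4.4 and 4.5 may differ slightly), is

\eta = \frac{T_1/N}{T_N} = \frac{T_1}{N \cdot T_N}    (cf. Equation 4.4)

1-\eta = \text{parallelization overhead}    (cf. Equation 4.5)

For example, with two threads on CPU5 we get η = 1027/(2 × 548) ≈ 94%, i.e., roughly a 6% parallelization overhead, which matches the corresponding row of Table 4.2.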
Looking at the last column of Table 4.2, for up to 5 launched threads we see that CPU3 beats CPU5. However, beyond 5 launched threads, CPU5 beats CPU3 for any number of threads. The reason has to do with both the cores and the memory, as we will see shortly in the next section. Our imrotate.c program is by nature designed inefficiently, so it is not taking advantage of the advanced architectural improvements that are built into CPU5.
We will get into the architectural details shortly but, for now, it suffices to say that just because a CPU is a later generation doesn't mean that it will work faster for every program. Newer generation CPUs' architectural improvements are typically geared toward programs that are written well. A program that causes choppy memory and core access patterns, like imrotate.c, will not benefit from the beautiful architectural improvements of newer generation CPUs. INTEL's message to the programmers is clear:
• Newer generation CPUs and memories are always designed to work more efficiently when rules, like the ones in Table 3.3, are obeyed.
• The influence of these rules will keep increasing in future generations.
• CPU says: If you're gonna be a bad program designer, I'll be a bad CPU!
more sophisticated than addition and multiplication, so there is a separate unit for
division. All of these execution units are shared by both threads.
• In each generation, more sophisticated computational units are available as shared ex-
ecution units. However, multiple units are incorporated for common operations that
might be executed by both threads, such as ALUs. Also, Figure 4.3 is overly simpli-
fied and the exact details of each generation might change. However, the ALU-FPU
functionality separation has never changed in the past 3–4 decades of CPU designs.
• The addresses that both threads generate must be calculated to write the data from
both threads back into the memory. For address computations, load and store address
generation units (LAGU and SAGU) are shared by both threads, as well as a unit
that properly orders the destination memory addresses (MOB).
• Instructions are prefetched and decoded only once and routed to the owner thread.
Therefore, prefetcher and decoder are shared by both threads.
• Acronyms are:
ALU=Arithmetic Logic Unit
FPU=Floating Point Unit
FPMUL=Floating Point Dedicated Multiplier
FPADD = FP Adder
MUL/DIV=Dedicated Multiplier/Divider
LAGU=Load Address Generation Unit
SAGU=Store Address Generation Unit
MOB=Memory Order Buffer
The most important message from Figure 4.3 is as follows:
• Our program performance will suffer if both threads inside a core are requesting exactly the same shared core resources.
• For example, if both threads require heavy floating point operations, they will pressure the FP resources.
• On the data side, if both threads are extremely memory intensive, they will pressure L1 D$ or L2$, and eventually L3$ and main memory.
FIGURE 4.4 Architecture of the i7-5930K CPU (6C/12T): six cores sharing a 15 MB L3$, with an on-chip memory controller. This CPU connects to the GPUs through an external PCI express bus and to the DDR4 memory through the memory bus.
...
void* (*RotateFunc)(void *arg); // Func. ptr to rotate the image (multi-threaded)
...
void *Rotate(void* tid)
{
...
}
...
int main(int argc, char** argv)
{
...
RotateFunc=Rotate;
...
}
To keep improving this function, we will design different versions of it and will allow the user to select the desired version from the command line. To run the imrotateMC.c program, the following command line is used:
imrotateMC InputfileName OutputfileName [degrees] [threads] [func]
where degrees specifies the clockwise rotation, [threads] specifies the number of threads to launch as before, and the newly added [func] parameter (1–7) specifies which function to run (i.e., 1 runs Rotate(), 2 runs Rotate2(), etc.). The improved functions are consistently named Rotate2(), Rotate3(), etc., and the appropriate function pointer is assigned to RotateFunc based on the command line argument, as shown in Code 4.5. The name of this new program contains "MC" for "memory and core friendly."
switch (argc){
   case 3 : NumThreads=1;           RotDegrees=45;      Function=1;               break;
   case 4 : NumThreads=1;           RotDegrees=at...    Function=1;               break;
   case 5 : NumThreads=at...        RotDegrees=at...    Function=1;               break;
   case 6 : NumThreads=at...        RotDegrees=at...    Function=atoi(argv[5]);   break;
   default: printf("\nUsage: %s inputBMP outBMP [RotAngle] [1-128] [1-7]...");
            printf("Example: %s infilename.bmp outname.bmp 125 4 3\n\n",argv[0]);
            printf("Nothing executed ... Exiting ...\n\n");
            exit(EXIT_FAILURE);
}
if((NumThreads<1) || (NumThreads>MAXTHREADS)){
   ...
}
if((RotDegrees<-360) || (RotDegrees>360)){
   ...
}
switch(Function){
   case 1: strcpy(FuncName,"Rotate()");    RotateFunc=Rotate;    break;
   case 2: strcpy(FuncName,"Rotate2()");   RotateFunc=Rotate2;   break;
   case 3: strcpy(FuncName,"Rotate3()");   RotateFunc=Rotate3;   break;
   case 4: strcpy(FuncName,"Rotate4()");   RotateFunc=Rotate4;   break;
   case 5: strcpy(FuncName,"Rotate5()");   RotateFunc=Rotate5;   break;
   case 6: strcpy(FuncName,"Rotate6()");   RotateFunc=Rotate6;   break;
   case 7: strcpy(FuncName,"Rotate7()");   RotateFunc=Rotate7;   break;
   // case 8: strcpy(FuncName,"Rotate8()"); RotateFunc=Rotate8;  break;
   // case 9: strcpy(FuncName,"Rotate9()"); RotateFunc=Rotate9;  break;
   default: printf("Wrong function %d ... \n",Function);
            printf("\n\nNothing executed ... Exiting ...\n\n");
            exit(EXIT_FAILURE);
}
printf("\nLaunching %d Pthreads using function: %s\n",NumThreads,FuncName);
RotAngle=2*3.141592/360.000*(double) RotDegrees; // Convert the angle to radians
printf("\nRotating image by %d degrees ...\n",RotDegrees);
TheImage = ReadBMP(argv[1]);
...
}
H=(double)ip.Hpixels;
V=(double)ip.Vpixels;
Diagonal=sqrt(H*H+V*V);
ScaleFactor=(ip.Hpixels>ip.Vpixels) ? V/Diagonal : H/Diagonal;
Moving these lines outside both loops will not change the functionality at all, since they
really only need to be calculated once. We are particularly interested in understanding what
parts of the CPU core in Figure 4.3 these computations use and how much speedup we will
get from this move. The revised Rotate2() function is shown in Code 4.6. Lines that are identical to the original Rotate() function are not repeated; they are denoted as "..." to improve readability.
An entire list of execution times for every version of this function will be provided later
in this chapter in Table 4.3. For now, let’s quickly compare the single-threaded performance
of Rotate2() versus Rotate() on CPU5. To run the single-threaded version of Rotate2(), type:
imrotateMC dogL.bmp d.bmp 45 1 2
where 45, 1, and 2 are rotation, number of threads (single), and function ID (Rotate2()). We
reduced the single-threaded run time from 1027 ms down to 498 ms, a 2.06× improvement.
Now, let's analyze the instructions on the four lines that we moved, to see why we got a 2.06× improvement. The first two computations are integer-to-double-precision floating point (FP) cast operations used to calculate H and V. They are simple enough, but they do use an FPU resource (shown in Figure 4.3). The next line, which calculates Diagonal, is absolutely a resource hog, since square root is very compute-intensive. This harmless-looking line requires two FP multiplications (FP-MUL) to compute H×H and V×V and one FP-ADD to compute their sum. After this, the super-expensive square root operation is performed. As if sqrt() were not enough torture for the core, we see a floating point division next, which is as bad as the square root, followed by an integer comparison! So, when the CPU core hits the instructions that compute these four lines, core resources are chewed up and spit out! When we move them outside both loops, it is no wonder that we get a 2.06× speedup.
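A minimal sketch of this hoisting idea is shown below (illustrative code, not the exact Code 4.6 listing; Rotate2Sketch is a hypothetical name, and the per-pixel rotation work is only indicated by a comment):

void *Rotate2Sketch(void* tid)                    // hypothetical name, not from the book
{
   long tn;    int row, col;
   double H, V, Diagonal, ScaleFactor;

   tn = *((int *) tid);                           // my thread ID, as in the other listings
   tn *= ip.Vpixels/NumThreads;                   // first row this thread is responsible for

   H = (double)ip.Hpixels;    V = (double)ip.Vpixels;                  // invariant: computed once
   Diagonal = sqrt(H*H + V*V);                                         // expensive sqrt: executed once
   ScaleFactor = (ip.Hpixels > ip.Vpixels) ? V/Diagonal : H/Diagonal;  // expensive FP divide: executed once

   for(row = tn; row < tn + ip.Vpixels/NumThreads; row++){
      col = 0;
      while(col < ip.Hpixels*3){
         // per-pixel rotation work goes here; H, V, Diagonal, and ScaleFactor are simply reused
         col += 3;
      }
   }
   pthread_exit(NULL);
}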
newX=cos(RotAngle)*X-sin(RotAngle)*Y;
newY=sin(RotAngle)*X+cos(RotAngle)*Y;
We are simply forcing the CPU to compute sin() and cos() for every pixel! There is no need for that; the Rotate3() function defines a precomputed variable called CRA (the precomputed cosine of RotAngle) and uses it in the innermost loop whenever cos(RotAngle) is needed. The revised Rotate3() function is shown in Code 4.7; its single-threaded run time on CPU5 is reduced from 498 ms to 376 ms, a 1.32× improvement.
Function Rotate4() (shown in Code 4.8) does the same thing by precomputing sin(RotAngle) as SRA. The Rotate4() single-threaded run time is reduced from 376 ms to 235 ms, another 1.6× improvement. The simplified code lines in the Rotate4() function in Code 4.8 are
newX=CRA*X-SRA*Y;
newY=SRA*X+CRA*Y;
When we compare these two lines to the two lines above, the summary is as follows:
• Rotate3() needs the calculation of sin(), cos().
• Rotate3() performs 4 double-precision FP multiplications.
• Rotate3() performs 2 double-precision FP addition/subtractions.
We notice that we are calculating a value called ip.Hpixels*3 just to turn back around
and divide it by 3 a few lines below. Knowing that integer divisions are expensive, why
not do this in a way where we can eliminate the integer division altogether? To do this, we
observe that the variable c is doing nothing but mirror the same value as col/3. Since we
are starting the variable col at 0 before we get into the while() loop, why not start the
c variable at 0 also? Since we are incrementing the value of the col variable by 3 at the
end of the while loop, we can simply increment the c variable by one. This will create two
variables that completely track each other, with the relationship col = 3 × c without ever
having to use an integer division.
The Rotate5() function that implements this idea is shown in Code 4.9; it simply trades the integer division c = col/3 for an integer increment operation, c++. Additionally, to find the half-way point of the picture, ip.Hpixels and ip.Vpixels must be divided by 2 as shown above. Since this can also be precomputed, it is moved outside both loops in the implementation of Rotate5(). All said and done, the Rotate5() run time is reduced from 235 ms to 210 ms, an improvement of 1.12×. Not bad ...
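A minimal sketch of this trade is shown below (illustrative only, not the exact Code 4.9 listing; only the index bookkeeping is shown). The variable c is kept equal to col/3 purely by incrementing it, and the half-way points are precomputed outside the loops:

int col = 0;
int c   = 0;                      // invariant maintained by the loop: c == col/3
int h   = ip.Hpixels/2;           // half-way points, precomputed outside the loops
int v   = ip.Vpixels/2;

while(col < ip.Hpixels*3){
   // ... use c wherever col/3 used to appear, and h, v for the image center ...
   col += 3;                      // advance by one RGB pixel (3 bytes)
   c++;                           // a cheap increment replaces the integer division c = col/3
}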
X=(double)c-(double)h; Y=(double)v-(double)row;
// pixel rotation matrix
newX=CRA*X-SRA*Y; newY=SRA*X+CRA*Y;
newX=newX*ScaleFactor; newY=newY*ScaleFactor;
After all of our modifications, these lines ended up in this order. So, the natural question to ask ourselves is: can we consolidate the computations of newX and newY, and can we precompute the variables X or Y? A look at the loop shows that, although the variable X is stuck inside the innermost loop, we can move the computation of the variable Y outside the innermost loop (although we cannot move it outside both loops). Considering that this saves the repetitive computation of Y many times (3200 times for a 3200-row image!), the savings are worth the effort. The revised Rotate6(), shown in Code 4.10, implements these ideas by using two additional variables named SRAYS and CRAYS. The Rotate6() runtime improved from 210 ms to 185 ms (1.14× better). Not bad ...
TABLE 4.3 imrotateMC.c execution times (in ms) for the CPUs in Table 3.1.
#Th Func CPU1 CPU2 CPU3 CPU4 CPU5 CPU6
1 951 1365 782 1090 1027 845
2 530 696 389 546 548 423
3 514 462 261 368 365 282
4 Rotate() 499 399 253 322 272 227
6 387 283 338 214 213
8 374 237 297 188 163
10 341 228 285 163 201
1 468 580 364 441 498 659
2 280 301 182 222 267 330
3 249 197 123 148 194 220
4 Rotate2() 280 174 126 165 137 165
6 207 127 138 101 176
8 195 138 134 84 138
10 125 141 67
1 327 363 264 301 376 446
2 218 189 131 151 202 223
3 187 123 88 101 142 149
4 Rotate3() 202 93 106 108 101 112
6 123 97 106 75 116
8 117 101 110 59 89
10 106 92 47
1 202 227 161 182 235 240
2 140 124 80 91 135 120
3 109 80 54 61 92 80
4 Rotate4() 109 65 73 54 69 60
8 88 62 69 37 47
10 58 55 29
1 171 209 145 158 210 207
2 140 108 73 78 117 104
3 93 73 49 53 80 69
4 Rotate5() 93 61 69 72 61 52
6 72 51 62 44 53
8 81 56 60 36 40
10 59 48 29
1 156 180 125 128 185 176
2 124 92 63 64 109 88
3 93 78 43 45 78 59
4 Rotate6() 93 57 63 65 55 44
6 60 43 43 37 44
8 65 51 49 30 33
1 140 155 107 110 161 156
2 109 75 53 55 97 78
3 93 52 36 37 64 52
4 Rotate7() 62 70 53 56 46 39
6 61 36 38 36 39
8 56 40 42 24 29
10 43 45 21
Aside from these simple rules, Table 4.3 shows that if the threads we design are thick, we will not be able to take advantage of the multiple threads that a core can execute. In our code so far, even the improved Rotate7() function has thick threads. So, the performance falls off sharply either when the number of launched threads gets close to the number of physical cores or when we saturate the memory. This brings up the concept of "whichever constraint comes first." In other words, when we change our program to improve it, we might eliminate one bottleneck (say, the FPU inside the core) but create a totally different bottleneck (say, main memory bandwidth saturation).
The “whichever constraint comes first” concept will be the key ingredient in designing
GPU code, since inside a GPU, there are multiple constraints that can saturate and the
programmer has to be fully aware of every single one of them. For example, the programmer
could saturate the total number of launched threads before the memory is saturated. The
constraint ecosystem inside a GPU will be very much like the one we will learn in the CPU.
More on that coming up very shortly ...
Before we jump right into the GPU world, we have one more thing to learn: thread
synchronization in Chapter 5. Then, we are ready to take on the GPU challenge starting
in Chapter 6.
CHAPTER 5
Thread Management and Synchronization
FIGURE 5.1 The imedge.c program is used to detect edges in the original im-
age astronaut.bmp (top left). Intermediate processing steps are: GaussianFilter()
(top right), Sobel() (bottom left), and finally Threshold() (bottom right).
to this program and the readers should definitely attempt to discover them. However, the
choices made in this program are geared toward improving the instructional value of this
chapter, rather than the ultimate performance of the imedge.c program. Specifically,
• A much more robust edge detection program can be designed by adding other post-
processing steps. They have not been incorporated into our program.
• The granularity of the processing is limited to a single row when running the “MCT”
version (imedgeMCT.c) at the end of this chapter. Although a finer granularity pro-
cessing is possible, it does not add any instructional value.
• To calculate some of the computed pixel values, the choice of the double variable type
is a little overkill; however, double type has been employed in the program because it
improves the instructional value of the code.
• Multiple arrays are used to store different processed versions of the image before
its final edge-detected version is computed: TheImage array stores the original image.
BWImage array stores the B&W version of this original. GaussImage array stores the
Gaussian filtered version. Gradient and Theta arrays store the Sobel-computed pixels.
The final edge-detected image is stored in CopyImage. Although these arrays could be
combined, separate arrays have been used to improve the clarity of the code.
2 4 5 4 2
where Gauss is the filter kernel and ∗ is the convolution operation, which is one of the most common operations in digital signal processing (DSP). The operation in Equation 5.2 convolves the B&W image we created in Equation 5.1 – contained in the variable BWImage – with the Gauss filter mask, contained in the variable Gauss, to produce the blurred image, contained in the variable GaussImage. Note that other filter kernels are available that result in different levels of blurring; however, for our demonstration purposes, this filter kernel is totally fine. At the end of this step, our image looks like Figure 5.1 (top right).
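As a rough single-threaded sketch of what this convolution boils down to (the actual Code 5.4 splits the rows among the threads; the 2-pixel border handling shown here is only illustrative), each output pixel is a weighted sum of its 5×5 neighborhood, normalized by 159, the sum of all the kernel entries:

int row, col, i, j;
double G;

for(row = 2; row < ip.Vpixels - 2; row++){               // skip a 2-pixel border
   for(col = 2; col < ip.Hpixels - 2; col++){
      G = 0.0;
      for(i = -2; i <= 2; i++){                           // 5x5 neighborhood of [row][col]
         for(j = -2; j <= 2; j++){
            G += Gauss[i+2][j+2] * BWImage[row+i][col+j];
         }
      }
      GaussImage[row][col] = G / 159.0;                   // 159 = sum of the Gauss kernel entries
   }
}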
Sobel(): The goal of this step is to use a Sobel gradient operator on each pixel to deter-
mine the direction – and existence – of edges. From Equation 5.3, Sobel kernels are
G_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}, \qquad
G_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix} \tag{5.3}
where Gx and Gy are the Sobel kernels used to determine the edge gradient for each pixel in the x and y directions. These kernels are convolved with the previously computed GaussImage to compute a gradient image as follows:
G_X = Im \ast G_x, \qquad G_Y = Im \ast G_y, \qquad G = \sqrt{G_X^2 + G_Y^2}, \qquad \theta = \tan^{-1}\!\left(\frac{G_X}{G_Y}\right) \tag{5.4}
where the CopyImage array stores the final binary image. In case the gradient value of a
pixel is in between these two thresholds, we use the second array, Theta, to determine the
direction of the edge. We classify the direction into one of four possible values: horizontal
(EW), vertical (NS), left diagonal (SW-NE), and right diagonal (SE-NW). The idea is that
TABLE 5.1 Array variables and their types, used during edge detection.

Function to perform       Starting array variable        Destination array variable    Destination type
Convert image to B&W      TheImage                       BWImage                       unsigned char
Gaussian filter           BWImage                        GaussImage                    double
Sobel filter              GaussImage                     Gradient                      double
                                                         Theta                         double
Threshold                 Gradient (Theta if needed)     CopyImage                     unsigned char
if the Theta of the edge is pointing to vertical (that is, up/down), we determine whether this
pixel is an edge or not by looking at the pixel above or below it. Similarly, for a horizontal
pixel, we look at its horizontal neighbors. This well-studied method is formulated below:
where Θ = Θ[x, y] is the angle of edge [x, y] and L and H are the low, high thresholds. The
gradient is shown as ∆. The final result of the imedge.c program is shown in Figure 5.1
(bottom right). The #define values determine how EDGE/NOEDGE pixels will be colored; in this program I assigned 0 (black) to EDGE and 255 (white) to NOEDGE. This makes the edges print-friendly.
double** CreateBlankDouble()
{
int i;
double** img = (double **)malloc(ip.Vpixels * sizeof(double*));
for(i=0; i<ip.Vpixels; i++){
img[i] = (double *)malloc(ip.Hpixels*sizeof(double));
memset((void *)img[i],0,(size_t)ip.Hpixels*sizeof(double));
}
return img;
}
double Gauss[5][5] = { { 2, 4, 5, 4, 2 },
{ 4, 9, 12, 9, 4 },
{ 5, 12, 15, 12, 5 },
{ 4, 9, 12, 9, 4 },
{ 2, 4, 5, 4, 2 } };
// Calculate Gaussian filtered GaussFilter[][] array from BW image
void *GaussianFilter(void* tid)
{
long tn; // My thread number (ID) is stored here
int row,col,i,j;
double G; // temp to calculate the Gaussian filtered version
tn = *((int *) tid); // Calculate my Thread ID
tn *= ip.Vpixels/NumThreads;
5.2.5 Sobel
Code 5.5 implements the gradient computation. It achieves this by applying Equation 5.3
to the GaussImage array to generate the two resulting arrays: Gradient array contains the
magnitude of the edge gradients, whereas the Theta array contains the angle of the edges.
Resource usage characteristics of the Sobel() function are
• Gx and Gy arrays are small and should be cached nicely.
• GaussImage array is accessed 18 times for each pixel and should also be cache-friendly.
• Gradient and Theta arrays are written once for each pixel and take no advantage of
the cache memory inside the cores.
5.2.6 Threshold
Code 5.6 shows the implementation of the thresholding function that determines whether a given pixel at location [x, y] should be classified as an EDGE or NOEDGE. If the gradient value is lower than ThreshLo or higher than ThreshHi, the EDGE/NOEDGE determination requires only Equation 5.5 and the Gradient array. Any gradient value between these two values requires a more elaborate computation, based on Equation 5.6.
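A bare-bones sketch of this decision is shown below (a simplification of Code 5.6: CopyImage is indexed per pixel here for clarity, and the Theta-based neighbor comparison of Equation 5.6 is only indicated by a comment):

int row, col;
double G;

for(row = 1; row < ip.Vpixels - 1; row++){
   for(col = 1; col < ip.Hpixels - 1; col++){
      G = Gradient[row][col];
      if(G < ThreshLo)         CopyImage[row][col] = NOEDGE;   // clearly not an edge (Equation 5.5)
      else if(G > ThreshHi)    CopyImage[row][col] = EDGE;     // clearly an edge (Equation 5.5)
      else{
         // in-between case: look up Theta[row][col] and compare G against the neighbor
         // in that direction to decide between EDGE and NOEDGE (Equation 5.6)
      }
   }
}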
The if condition makes the resource determination of Threshold() more complicated:
• Gradient has a good reuse ratio; it should therefore be cached nicely.
• Theta array is accessed based on pixel – and edge – values. So, it is hard to determine
its cache-friendliness.
• CopyImage array is accessed only once per pixel, making it cache-unfriendly.
TABLE 5.2 imedge.c execution times (in ms) for the W3690 CPU (6C/12T).
#Th/Func 1 2 4 8 10 12
ReadBMP() 73 70 71 72 73 72
Create arrays 749 722 741 724 740 734
GaussianFilter() 5329 2643 1399 1002 954 880
Sobel() 18197 9127 4671 2874 2459 2184
Threshold() 499 260 147 132 95 92
WriteBMP() 70 70 66 60 61 62
Total without IO 24850 12829 7030 4798 4313 3957
components of each pixel and divide the resulting value by 3. Later, when the Gaussian filter is applied, as shown in Equation 5.2, this B&W value for each pixel gets multiplied by the Gaussian kernel entries 2, 4, 5, 9, 12, and 15. This kernel – essentially a 5×5 matrix – possesses favorable symmetries, containing only six different values (2, 4, 5, 9, 12, and 15).
A close look at Equation 5.2 shows that the constant 1/3 multiplier for the BWImage array variable, as well as the constant 1/159 multiplier for the Gauss array matrix, can be taken
outside the computation and be dealt with at the very end of the computation. That way
the computational burden is only experienced once for the entire formula, rather than for
each pixel. Therefore, to compute Equation 5.1 and Equation 5.2, one can get away with
multiplying each pixel value with simple integer numbers.
It gets better ... look at the Gauss kernel; the corner value is 2, which means that some pixel at some point gets multiplied by 2. Since the other corner value is also 2, some pixel four columns ahead (horizontally) also gets multiplied by 2; the same holds four rows below, and four rows and four columns below. An extensive set of such symmetries brings about the following idea to speed up the entire convolution operation:
For each given pixel with a B&W value of, say, X, why not precompute the different multiples of X and save them somewhere only once, right after we know what the pixel's B&W value is? These multiples are clearly 2X, 4X, 5X, 9X, 12X, and 15X. A careful observer will come up with yet another optimization: instead of multiplying X by 2, why not simply add X to itself and save the result in another place? Once we have this 2X value, we can add it to itself to get 4X and then add X to 4X to get 5X, thereby completely avoiding multiplications.
Since we saved each pixel's B&W value as a double, each multiplication and addition is a double type operation; so, saving the multiplications will clearly help reduce the core computation intensity. Additionally, each pixel is accessed only once, rather than 25 times, during the convolution operation that Equation 5.2 prescribes.
#define EDGE 0
#define NOEDGE 255
#define MAXTHREADS 128
struct ImgProp{
int Hpixels;
int Vpixels;
unsigned char HeaderInfo[54];
unsigned long int Hbytes;
};
struct Pixel{
unsigned char R;
unsigned char G;
unsigned char B;
};
struct PrPixel{
unsigned char R;
unsigned char G;
unsigned char B;
unsigned char x; // unused. to make it an even 4B
float BW;
float BW2,BW4,BW5,BW9,BW12,BW15;
float Gauss, Gauss2;
float Theta,Gradient;
};
5.4.5 PrGaussianFilter
The PrGaussianFilter() function, shown in Code 5.10, has the exact same functionality as Code 5.4, which is the version that does not use the precomputed values. The difference between GaussianFilter() and PrGaussianFilter() is that the latter achieves the same computation result by adding the appropriate precomputed values for the corresponding pixels, rather than performing the actual computation.
The inner-loop simply computes the Gaussian-filtered pixel value from Equation 5.2 by
using the precomputed values that were stored in the struct in Code 5.7.
// Function that calculates the .Gradient and .Theta for each pixel.
// Uses the pre-computed .Gauss and .Gauss2x values
void *PrSobel(void* tid)
{
int row,col,i,j; float GX,GY; float RPI=180.0/PI;
long tn = *((int *) tid); tn *= ip.Vpixels/NumThreads;
5.4.6 PrSobel
The PrSobel() function, shown in Code 5.11, has the exact same functionality as Code 5.5, which is the version that does not use the precomputed values. The difference between Sobel() and PrSobel() is that the latter achieves the same computation result by adding the appropriate precomputed values for the corresponding pixels, rather than performing the actual computation.
The inner-loop simply computes the Sobel-filtered pixel value from Equation 5.3 by
using the precomputed values that were stored in the struct in Code 5.7.
5.4.7 PrThreshold
The PrThreshold() function, shown in Code 5.12, has the exact same functionality as Code 5.6, which is the version that does not use the precomputed values. The difference between Threshold() and PrThreshold() is that the latter achieves the same computation result by using the appropriate precomputed values for the corresponding pixels, rather than performing the actual computation.
The inner-loop simply computes the resulting binary (thresholded) pixel value from
Equation 5.5 by using the precomputed values that were stored in the struct in Code 5.7.
TABLE 5.3 imedgeMC.c execution times for the W3690 CPU (6C/12T) in ms for a
varying number of threads (above). For comparison, execution times of imedge.c
are repeated from Table 5.2 (below).
Function #threads =⇒ 1 2 4 8 10 12
PrReadBMP() 2836 2846 2833 2881 2823 2898
Create arrays 31 32 31 36 31 31
PrGaussianFilter() 2179 1143 570 526 539 606
PrSobel() 7475 3833 1879 1141 945 864
PrThreshold() 358 193 121 107 113 107
WriteBMP() 61 60 61 61 60 61
imedgeMC.c runtime no I/O 12940 8107 5495 4752 4511 4567
ReadBMP() 73 70 71 72 73 72
Create arrays 749 722 741 724 740 734
GaussianFilter() 5329 2643 1399 1002 954 880
Sobel() 18197 9127 4671 2874 2459 2184
Threshold() 499 260 147 132 95 92
WriteBMP() 70 70 66 60 61 62
imedge.c runtime no I/O 24850 12829 7030 4798 4313 3957
Speedup 1.92× 1.58× 1.28× 1.01× 0.96× 0.87×
FIGURE 5.2 Example barrier synchronization for 4 threads (individual runtimes between the create and join points: tid=0 1835 ms, tid=1 1981 ms, tid=2 2246 ms, tid=3 2016 ms). Serial runtime is 7281 ms and the 4-threaded runtime is 2246 ms (the slowest thread). The speedup of 3.24× is close to the best-expected 4×, but not equal due to the imbalance of each thread's runtime.
FIGURE 5.3 Multiple threads updating the shared variables f, a (float), b (int), and c (char): each thread locks MUTEX M, performs its update (f = a+(float)(b+c)), and then unlocks MUTEX M.
The count updating problem mentioned in Analogy 5.1 is precisely what happens when
multiple threads attempt to update the same variable without using an additional structure
like a MUTEX, shown in Figure 5.3. It is a very common bug in multithreaded code. The
red flag solution is precisely what is used in multithreaded programming. The underlying
idea to prevent incorrect updating is very simple: instead of a thread recklessly updating a
variable (i.e., unsafely), a shared MUTEX variable is used.
If a variable is a MUTEX variable, it has to be updated according to the rules of updating
such a variable; each thread knows that it is updating a MUTEX variable and does not touch
the variable itself before it lets the other threads know that it is updating it. It does this by
locking the MUTEX, which is equivalent to bringing up the red flag in Analogy 5.1. Once it
locks the MUTEX, it is free to make any update it desires to the variable that is controlled
by this MUTEX. In other words, it either excludes itself or the other threads from updating
a MUTEX variable, hence the name mutually exclusive, or in short, MUTEX.
Before the terms get confused, let me make something clear: there is a difference between
a MUTEX itself and MUTEX variables. For example, in Figure 5.3, the name of the MUTEX is
M , while the variables that MUTEX M controls are f , a, b, and c. In such a setup, the
multithreaded functions are supposed to lock and unlock the MUTEX M itself. After a lock
has been obtained, they are free to update the four MUTEX-controlled variables, f , a, b, and
c. When done, they are supposed to unlock MUTEX M.
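A minimal Pthreads sketch of this locking discipline is shown below (illustrative code, not the Figure 5.3 listing itself; the function name and the variable types are assumptions). Note how the extra indentation between the lock and the unlock makes it obvious which updates are protected:

#include <pthread.h>

pthread_mutex_t M = PTHREAD_MUTEX_INITIALIZER;   // the MUTEX itself
float f;   float a;   int b;   char c;           // the variables that MUTEX M controls

void *UpdateSharedVars(void *arg)                // hypothetical thread function
{
   pthread_mutex_lock(&M);                       // "raise the red flag"
      f = a + (float)(b + c);                    // safe: no other thread can update f, a, b, c now
   pthread_mutex_unlock(&M);                     // let the other threads proceed
   return NULL;
}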
Let me make one point very clear: although a MUTEX eliminates the incorrect-updating problem, implementing a MUTEX requires hardware-level atomic operations. For example, the CAS (compare and swap) instructions in the x86 Intel ISA achieve this. Luckily, a programmer enjoys the readily available MUTEX implementation functions that are a part of POSIX, and this is what we will use in our implementation of imedgeMCT.c.
There is absolutely no mechanism that checks whether a thread has updated a variable safely by locking/unlocking the controlling MUTEX; therefore, it is also a common bug for a programmer to forget that he or she was supposed to lock/unlock a MUTEX for a variable that is being shared. In this case, exactly the same problems that were mentioned in Analogy 5.1 will creep up, presenting an impossible-to-debug problem. It is also common for a programmer to realize that a variable should really have been a MUTEX variable and declare a MUTEX for it halfway into the program development. However, declaring a MUTEX does not magically solve the incorrect-updating problem; correct locking/unlocking does. All it takes is forgetting one place where the variable is accessed and forgetting to lock/unlock its corresponding MUTEX; worse yet, most incorrect-update problems are infrequent and manifest themselves as weird intermittent problems, keeping most programmers up at night! So, good upfront planning is the best way to prevent these problems.
• The indentation of the variables inside the lock/unlock make it easy to see the variables
that are being updated while the MUTEX is locked.
• We have to create and destroy each MUTEX using the pthread_mutex_init() and pthread_mutex_destroy() functions.
• For N threads that are being launched, there are N + 2 MUTEX variables, all con-
trolled by the same MUTEX named CtrMutex. These variables are: NextRowToProcess,
LastRowRead, and the array ThreadCtr[0]..ThreadCtr[N-1].
pthread_mutex_t CtrMutex;
struct PrPixel **PrIm;
int NextRowToProcess, LastRowRead;
int ThreadCtr[MAXTHREADS]; // Counts # rows processed by each thread
void *AMTPreCalcRow(void* ThCtr)
{
   unsigned char r, g, b;      int i,j,Last;
   float R, G, B, BW, BW2, BW3, BW4, BW5, BW9, BW12, Z=0.0;

   do{                                       // get the next row number safely
      pthread_mutex_lock(&CtrMutex);
      Last=LastRowRead;   i=NextRowToProcess;
      if(Last>=i){
         NextRowToProcess++;      j = *((int *)ThCtr);
         *((int *)ThCtr) = j+1;   // One more row processed by this thread
      }
      pthread_mutex_unlock(&CtrMutex);
      if(Last<i)          continue;
      if(i>=ip.Vpixels)   break;
      for(j=0; j<ip.Hpixels; j++){
         r=PrIm[i][j].R;   g=PrIm[i][j].G;   b=PrIm[i][j].B;
         R=(float)r;   G=(float)g;   B=(float)b;   BW3=R+G+B;
         PrIm[i][j].BW  = BW  = BW3*0.33333;    PrIm[i][j].BW2  = BW2  = BW+BW;
         PrIm[i][j].BW4 = BW4 = BW2+BW2;        PrIm[i][j].BW5  = BW5  = BW4+BW;
         PrIm[i][j].BW9 = BW9 = BW5+BW4;        PrIm[i][j].BW12 = BW12 = BW9+BW3;
         PrIm[i][j].BW15 = BW12+BW3;            PrIm[i][j].Gauss = PrIm[i][j].Gauss2 = Z;
         PrIm[i][j].Theta = PrIm[i][j].Gradient = Z;
      }
   }while(i<ip.Vpixels);
   pthread_exit(NULL);
}
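One plausible way for the main program to initialize CtrMutex and launch these pre-calculation threads is sketched below (this is an assumption, not the book's exact main(); it reuses the ThHandle/ThAttr/ThErr machinery from the earlier listings and omits the row-reading logic that advances LastRowRead):

pthread_mutex_init(&CtrMutex, NULL);               // create the MUTEX before any thread runs
NextRowToProcess = 0;
for(i = 0; i < NumThreads; i++){
   ThreadCtr[i] = 0;                               // each thread gets its own row counter
   ThErr = pthread_create(&ThHandle[i], &ThAttr, AMTPreCalcRow, (void *)&ThreadCtr[i]);
   if(ThErr != 0){
      printf("\nThread Creation Error %d. Exiting abruptly... \n", ThErr);
      exit(EXIT_FAILURE);
   }
}
for(i = 0; i < NumThreads; i++){   pthread_join(ThHandle[i], NULL);   }
pthread_mutex_destroy(&CtrMutex);                  // destroy the MUTEX after all threads have joined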
TABLE 5.4 imedgeMCT.c execution times (in ms) for the W3690 CPU (6C/12T),
using the Astronaut.bmp image file (top) and Xeon Phi 5110P (60C/240T) using
the dogL.bmp file (bottom).
Function #threads =⇒ 1 2 4 8 10 12
PrAMTReadBMP() 2267 1264 920 1014 1020 1078
Create arrays 33 31 31 33 32 33
PrGaussianFilter() 2223 1157 567 556 582 611
PrSobel() 7415 3727 1910 1124 948 842
PrThreshold() 341 195 119 107 99 104
WriteBMP() 61 62 60 63 61 63
imedgeMCT.c w/o IO 12640 6436 3607 2897 2742 2731
PrReadBMP() 2836 2846 2833 2881 2823 2898
Create arrays 31 32 31 36 31 31
PrGaussianFilter() 2179 1143 570 526 539 606
PrSobel() 7475 3833 1879 1141 945 864
PrThreshold() 358 193 121 107 113 107
WriteBMP() 61 60 61 61 60 61
imedgeMC.c w/o IO 12940 8107 5495 4752 4511 4567
Speedup (W3690) 1.02× 1.26× 1.52× 1.64× 1.64× 1.67×
CHAPTER 6
Introduction to GPU
Parallelism and CUDA
This story gets worse. Even if you had a 486DX CPU, the FPU inside your 486DX was
still not fast enough for most of the games. Any exciting game demanded a 20× (or even
50×) higher-than-achievable floating point computational power from its host CPU. Surely,
in every generation the CPU manufacturers kept improving their FPU performance, just to
witness a demand for FPU power that grew much faster than the improvements they could
provide. Eventually, starting with the Pentium generation, the FPU was an integral part
of a CPU, rather than an option, but this didn’t change the fact that significantly higher
FPU performance was needed for games. In an attempt to provide much higher scale FPU
performance, Intel went on a frenzy to introduce vector processing units inside their CPUs:
the first ones were called MMX, then SSE, then SSE2, and the ones in 2016 are SSE4.2.
These vector processing units were capable of processing many FPU operations in parallel
and their improvement has never stopped.
Although these vector processing units helped certain applications a lot — and they still
do – the demand for an ever-increasing amount of FPU power was insane! When Intel could
deliver a 2× performance improvement, game players demanded 10× more. When they
could eventually manage to deliver 10× more, they demanded 100× more. Game players
were just monsters that ate lots of FLOPS! And, they were always hungry! Now what? This
was the time when a paradigm shift had to happen. The late 1990s is when the manufacturers of many plug-in boards for PCs — such as sound cards or ethernet controllers — came up with the idea of a card that could be used to accelerate floating point operations.
Furthermore, routine image coordinate conversions during the course of a game, such as
3D-to-2D conversions and handling of triangles, could be performed significantly faster by
dedicated hardware rather than wasting precious CPU time. Note that the actual unit
element of a monster in a game is a triangle, not a pixel. Using triangles allows the games
to associate a texture for the surface of any object, like the skin of a monster or the surface
of a tank, something that you cannot do with simple pixels.
These efforts of the PC card manufacturers to introduce products for the game market
gave birth to a type of card that would soon be called a Graphics Processing Unit. Of course,
we love acronyms: it is a GPU ... A GPU was designed to be a “plug-in card” that required
a connector such as PCI, AGP, PCI Express, etc. Early GPUs in the late 1990s strictly focused on delivering as high a floating point performance as possible. This freed up CPU resources and allowed a PC to perform 5× or 20× better in games (or even more if you were willing to spend a lot of money on a fancy GPU). Someone could purchase a $100 GPU for a PC that was worth $500; for this 20% extra investment, the computer performed 5× faster in games. Not a bad deal. Alternatively, by purchasing a $200 card (i.e., a 40% extra investment), your computer could perform 20× faster in games. The late 1990s were the point of no return, after which the GPU was an indispensable part of every computer, not just for games but for a multitude of other applications explained below. Apple computers used a different strategy to build GPU-like processing power into their computers, but eventually (by 2017, the release year of this book) the PC and Mac lines converged and started using GPUs from the same manufacturers.
FIGURE 6.1 Turning the dog picture into a 3D wire frame. Triangles are used to rep-
resent the object, rather than pixels. This representation allows us to map a texture
to each triangle. When the object moves, so does each triangle, along with their as-
sociated textures. To increase the resolution of this kind of an object representation,
we can divide triangles into smaller triangles in a process called tessellation.
or cos(), or even floating point computations of any sort. The entire game could run by
performing integer operations, thereby requiring only an ALU. Even a low-powered CPU
was perfectly sufficient to compute all of the required movements in real time. However, for gamers of the 1990s, who had watched the Terminator 2 movie a few years earlier, the Pacman game was far from exciting. First of all, objects had to be 3D in any good computer game, and the movements were substantially more sophisticated than Pacman's — and in 3D, requiring every transcendental operation you can think of. Furthermore, because the
result of any transcendental function due to a sophisticated object move — such as the
rotation operation in Equation 4.1 or the scaling operation in Equation 4.3 — required
the use of floating point variables to maintain image coordinates, GPUs, by definition, had
to be computational units that incorporated significant FPU power. Another observation
that the GPU manufacturers made was that the GPUs could have a significant edge in
performance if they also included dedicated processing units that performed routine con-
versions from pixel-based image coordinates to triangle-based object coordinates, followed
by texture mapping.
To appreciate what a GPU has to do, consider Figure 6.1, in which our dog is represented
by a bunch of triangles. Such a representation is called a wire-frame. In this representation,
a 3D object is represented using triangles, rather than an image using 2D pixels. The unit element of this representation is a triangle with an associated texture. Constructing a 3D
wire-frame of the dog will allow us to design a game in which the dog jumps up and down;
as he makes these moves, we have to apply some transformation — such as rotation, using
the 3D equivalent of Equation 4.1 — to each triangle to determine the new location of that
triangle and map the associated texture to each triangle’s new location. Much like a 2D
image, this 3D representation has the same “resolution” concept; to increase the resolution
of a triangulated object, we can use tessellation, in which a triangle is further subdivided into
smaller triangles as shown in Figure 6.1. Note: Only 11 triangles are shown in Figure 6.1
to avoid cluttering the image and make our point on a simple figure; in a real game, there
could be millions of triangles to achieve sufficient resolution to please the game players.
Now that we appreciate what it takes to create scenes in games where 3D objects
are moving freely in the 3-dimensional space, let’s turn our attention to the underlying
FIGURE 6.2 Steps to move triangulated 3D objects. Triangles contain two attributes:
their location and their texture. Objects are moved by performing mathematical
operations only on their coordinates. A final texture mapping places the texture
back on the moved object coordinates, while a 3D-to-2D transformation allows the
resulting image to be displayed on a regular 2D computer monitor.
computations to create such a game. Figure 6.2 depicts a simplified diagram of the steps
involved in moving a 3D object. The designer of a game is responsible for creating a wire-
frame of each object that will take part in the game. This wire-frame includes not only the
locations of the triangles — composed of 3 points for each triangle, having an x, y, and z
coordinate each — but also a texture for each triangle. This operation decouples the two
components of each triangle: (1) the location of the triangle, and (2) the texture that is
associated with that triangle. After this coupling, triangles can be moved freely, requiring
only mathematical operations on the coordinates of the triangles. The texture information
— stored in a separate memory area called texture memory — doesn’t need to be taken into
account until all of the moves have been computed and it is time to display the resulting
object in its new location. Texture memory does not need to be changed at all, unless, of
course, the object is changing its texture, as in the Hulk movie, where the main character
turns green when stressed out! In this case, the texture memory also needs to be updated
in addition to the coordinates, however, this is a fairly infrequent update when compared
to the updates on the triangle coordinates. Before displaying the moved object, a texture
mapping step fills the triangles with their associated texture, turning the wire-frame back
into an object. Next, the recomputed object has to be displayed on a computer screen;
because every computer screen is composed of 2D pixels, a 3D-to-2D transformation has to
be performed to display the object as an image on the computer screen.
• Ability to convert from triangle coordinates back to image coordinates for display on a computer screen (Box IV)
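To make the coordinate side of this work slightly more concrete, here is a bare-bones sketch (not from the book; the struct and function names are assumptions) of rotating the three vertices of one triangle about the z-axis. It reuses the same CRA/SRA rotation form we used in Rotate4() and never touches the texture associated with the triangle:

#include <math.h>

struct Vertex3D   { double x, y, z; };
struct Triangle3D { struct Vertex3D v[3]; /* plus a reference into texture memory, omitted */ };

void RotateTriangleZ(struct Triangle3D *t, double RotAngle)
{
   double CRA = cos(RotAngle), SRA = sin(RotAngle);   // precomputed once per triangle
   int i;
   for(i = 0; i < 3; i++){
      double X = t->v[i].x,   Y = t->v[i].y;
      t->v[i].x = CRA*X - SRA*Y;                      // same rotation form as Equation 4.1
      t->v[i].y = SRA*X + CRA*Y;
      /* z is unchanged for a rotation about the z-axis */
   }
}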
Based on this observation, right from the first day, every GPU was manufactured with
the ability to implement some sort of functionality that matched all of these boxes. GPUs
kept evolving by incorporating faster Box IIs, although the concept of Box I, III, and IV
never changed too much. Now, imagine that you are a graduate student in the late 1990s — in a physics department — trying to write a particle simulation program that requires an extensive amount of floating point computations. Before the introduction of the GPUs,
all you could use was a CPU that had an FPU in it and, potentially, a vector unit. However,
when you bought one of these GPUs at an affordable price and realized that they could
perform a much higher volume of FPU operations, you would naturally start thinking:
“Hmmm... I wonder if I could use one of these GPU things in my particle simulations?”
This investigation would be worth every minute you put into it because you know that these
GPUs are capable of 5× or 10× faster FPU computations. The only problem at that time
was that the functionality of Box III and Box IV couldn’t be “shut off.” In other words,
GPUs were not designed for non-gamers who are trying to do particle simulations!
Nothing can stop a determined mind! It didn’t take too long for our graduate student
to realize that if he or she mapped the location of the particles as the triangle locations
of the monsters and somehow performed particle movement operations by emulating them
as monster movements, it could be possible to “trick” the GPU into thinking that you
are actually playing a game, in which particles (monsters) are moving here and there and
smashing into each other (particle collisions). You can only imagine the massive challenges
our student had to endure: First, the native language of the games was OpenGL, in which
objects were graphics objects and computer graphics APIs had to be used to “fake” particle
movements. Second, there were major inefficiencies in the conversions from monster-to-
particle and particle-back-to-monster. Third, accuracy was not that great because the initial
cards could only support single precision FPU operations, not double precision. It is not
like our student could make a suggestion to the GPU manufacturers to incorporate double
precision to improve the particle simulation accuracy; GPUs were game cards and they were
game card manufacturers, period! None of these challenges stopped our student! Whoever
that student was, the unsung hero, created a multibillion dollar industry of GPUs that are
in almost every top supercomputer today.
Extremely proud of the success in tricking the GPU, the student published the results
... The cat was out of the bag ... This started an avalanche of interest; if this trick can be
applied to particle simulations, why not circuit simulations? So, another student applied it
to circuit simulations. Another one to astrophysics, another one to computational biology,
another ... These students invented a way to do general purpose computations using GPUs,
hence the birth of the term GPGPU.
example, oil explorers could analyze the underwater SONAR data to find oil under water, an
application that requires a substantial volume of floating point operations. Alternatively,
the academic and research market, including many universities and research institutions
such as NASA or Sandia National Labs, could use the GPGPUs for extensive scientific
simulations. For these simulations, they would actually purchase hundreds of the most
expensive versions of GPGPUs and GPU manufacturers could make a significant amount
of money in this market and create an alternative product to the already-healthy game
products.
In the late 1990s, GPU manufacturers were small companies that saw GPUs as ordinary
add-on cards that were no different than hard disk controllers, sound cards, ethernet cards,
or modems. They had no vision of the month of September 2017, when Nvidia would become
a company that is worth $112 B (112 billion US dollars) in the Nasdaq stock market (Nasdaq
stock ticker NVDA), a pretty impressive 20-year accomplishment considering that Intel, the
biggest semiconductor manufacturer on the planet with its five decade history, was worth
$174 B the same month (Nasdaq stock ticker INTC). The vision of the card manufacturers
changed fairly quickly when the market realized that GPUs were not in the same category
as other add-on cards; it didn’t take a genius to figure out that the GPU market was ready
for an explosion. So the gold rush started. GPU cards needed five main ingredients: (1) the
GPU chips, responsible for all of the computation, (2) GPU memory, something that could
be manufactured by the CPU DRAM manufacturers that were already making memory
chips for the CPU market, (3) interface chips to interface to the PCI bus, (4) power supply
chips that provide the required voltages to all of these chips, and (5) other semiconductors
to make all of these work together, sometimes called “glue logic.”
The market already had manufacturers for (2), (3), and (4). Many small companies
were formed to manufacture (1), the GPU “chips,” so the functionality shown in Figure 6.2
could be achieved. The idea was that GPU chip designers — such as Nvidia — would
design their chips and have them manufactured by third parties — such as TSMC — and
sell the GPU chips to contractor manufacturers such as FoxConn. FoxConn would purchase
the other components (2,3,4, and 5) and manufacture GPU add-on cards. Many GPU chip
designers entered the market, only to see a massive consolidation toward the end of the 1990s. Some of them went bankrupt and some sold out to bigger manufacturers. As of 2016, only three key players remain in the market (Intel, AMD, and Nvidia), two of them being actual CPU manufacturers. Nvidia became the biggest GPU manufacturer in the world as of 2016 and made multiple pushes to enter the GPU/CPU market by incorporating ARM cores into their Tegra line of GPUs. Intel and AMD kept incorporating GPUs into their CPUs to provide an alternative to consumers who didn't want to buy a discrete GPU. Intel has gone through many generations of designs, eventually incorporating Intel HD Graphics and Intel Iris GPUs into their CPUs. Intel's GPU performance improved to the point where, in 2016, Apple deemed the built-in Intel GPU performance sufficient to be included in their MacBooks as the only GPU, instead of discrete GPUs. Additionally, Intel
introduced the Xeon Phi cards to compete with Nvidia in the high-end supercomputing
market. While this major competition was taking place in the desktop market, the mobile
market saw a completely different set of players emerge. QualComm and Broadcom built
GPU cores into their mobile processors by licensing them from other GPU designers. Apple
purchased processor designers to design their “A” family processors that had built-in CPUs
and GPUs with extreme low power consumption. By about 2011 or 2012, CPUs couldn’t be
thought of as the only processing unit of any computer or mobile device. CPU+GPU was
the new norm.
GPU code side ... Furthermore, a single compiler would be great to compile both sides’ code, without requiring two separate compilations.
• There is no such thing as GPU Programming ...
• GPU always interfaces to the CPU through certain APIs ...
• So, there is always CPU+GPU programming ...
Given these facts, CUDA had to be based on the C programming language (for the
CPU side) to provide high performance. The GPU side also had to be almost exactly like
the CPU side with some specific keywords to distinguish between host versus device code.
The burden of determining how execution would take place at runtime, regarding CPU versus GPU execution sequences, had to fall on the CUDA compiler. GPU parallelism had to be exposed on the GPU side with a mechanism similar to the Pthreads we saw in Part I of this book. By taking into account all of these facts, Nvidia designed
its nvcc compiler that is capable of compiling CPU and GPU code simultaneously. CUDA,
since its inception, has gone through many version updates, incorporating an increasing set
of sophisticated features. The version I use in this book is CUDA 8.0, released in September
2016. Parallel to the progress of CUDA, Nvidia GPU architectures have gone through
massive updates as I will document shortly.
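To make the “single compiler for both sides” idea concrete, here is a minimal single-file sketch of the kind of program nvcc handles; the kernel name fill() and the array size are made up for illustration and are not part of any listing in this book:

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void fill(int *d_arr, int val)   // device-side (GPU) code
{
    d_arr[threadIdx.x] = val;               // each thread writes one element
}

int main()                                  // host-side (CPU) code
{
    int h_arr[256], *d_arr;
    cudaMalloc((void **)&d_arr, sizeof(h_arr));                       // allocate GPU memory
    fill <<< 1, 256 >>> (d_arr, 7);                                   // launch 1 block of 256 threads
    cudaMemcpy(h_arr, d_arr, sizeof(h_arr), cudaMemcpyDeviceToHost);  // bring the results back
    printf("h_arr[0] = %d\n", h_arr[0]);
    cudaFree(d_arr);
    return 0;
}

nvcc splits such a .cu file automatically: main() is handed to the host C/C++ compiler, while fill() is compiled for the GPU, and both end up in a single executable.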
FIGURE 6.3 Three farmer teams compete in Analogy 6.1: (1) Arnold competes alone
with his 2× bigger tractor and “the strongest farmer” reputation, (2) Fred and
Jim compete together in a much smaller tractor than Arnold. (3) Tolga, along with
32 boy and girl scouts, compete together using a bus. Who wins?
Analogy 6.1 is depicted in Figure 6.3 with three alternatives: Arnold represents a single-
threaded CPU that can work at 4 GHz, while Fred and Jim together in their smaller tractor
represent a dual-core CPU in which each core works at something like 2.5 GHz. We have done
major evaluations on the performance differences between these two alternatives in Part I
of the book. The interesting third alternative in Figure 6.3 is Tolga with the 32 boy/girl
scouts. This represents a single CPU core — probably working at 2.5 GHz — and a GPU
co-processor composed of 32 small cores that each work at something like 1 GHz. How could
we compare this alternative to the first two?
One of the reasons the GPUs win is the fact that the GPU cores are much simpler and work at a lower
speed. This allows the GPU chip designers to build a significantly higher number of cores
into their GPU chips and the lower speed keeps the power consumptions below the magic
200–250 W, which is about the peak power you can consume from any semiconductor device
(i.e., “chip”).
Note that the power that is consumed by the GPU is not proportional to the frequency
of each core; instead, the dependence is something like quadratic. In other words, a 4 GHz
CPU core is expected to consume 16× more power than the same core working at 1 GHz.
This very fact allows GPU manufacturers to pack hundreds or even thousands of cores
into their GPUs without reaching the practical power consumption limits. This is actually
exactly the same design philosophy behind multicore CPUs too. A single core CPU working
at 4 GHz versus a dual-core CPU in which both cores work at 3 GHz could consume similar
amounts of power. So, as long as the parallelization overhead is low (i.e., η is close to 1),
a dual-core 3 GHz CPU is a better alternative than a single-core 4 GHz CPU. GPUs are
nothing more than this philosophy taken to the ultimate extreme with one big exception:
while the CPU multicore strategy calls for using multiple sophisticated (out-of-order)
cores that work at lower frequencies, this design strategy only works if you are trying to
put 2, 4, 8, or 16 cores inside the CPU. It simply won’t work for 1000! So, the GPUs had to
go through an additional step of making each core simpler. Simpler means that each core
is in-order (see Section 3.2.1), works at a lower frequency, and its L1$ memory is not
coherent. Many of these details are going to become clear as we go through the following
few chapters. For now, the take-away from this section should be that GPUs incorporate a
lot of architectural changes — as compared to CPUs — to provide a manageable execution
environment for such a high core count.
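As a quick sanity check of this argument, using the simplified quadratic model P ∝ f² stated above (a back-of-the-envelope sketch; real power consumption also depends on voltage, architecture, and workload):

\[
\frac{P(4\,\mathrm{GHz})}{P(1\,\mathrm{GHz})} \approx \left(\frac{4}{1}\right)^{2} = 16,
\qquad
\frac{2 \times P(3\,\mathrm{GHz})}{1 \times P(4\,\mathrm{GHz})} \approx \frac{2 \times 3^{2}}{4^{2}} = \frac{18}{16} \approx 1.13 .
\]

So the dual-core 3 GHz part burns roughly the same power as the single-core 4 GHz part while offering up to 1.5× the aggregate clock throughput, which is exactly the trade-off described above.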
4. The existence of the warp concept has dramatic implications on the GPU architecture.
In Figure 6.3, we never talked about how the coconuts arrive in the bus. If you brought
only 5 coconuts into the bus, 27 of the scouts would sit there doing nothing; so, the data elements must be brought into the GPU in similarly bulky amounts, although the unit of these data chunks is a half warp, or 16 elements.
5. The fact that the data arrives into the GPU cores in half warp chunks means that the
memory sub-system that is bringing the data into the GPU cores should be bringing
in the data 16-at-a-time. This implies a parallel memory subsystem that is capable of
shuttling around data elements 16 at a time, either 16 floats or 16 integers, etc. This
is why the GPU DRAM memory is made from GDDR5, which is parallel memory.
6. Because the CPU cores and GPU cores are completely different processing units,
it is expected that they have different ISAs (instruction set architectures). In other
words, they speak a different language. So, two different sets of instructions must
be written: one for Tolga, one for the scouts. In the GPU world, a single compiler
— nvcc — compiles both the CPU instructions and GPU instructions, although there
are two separate programs that the developer must write. Thank goodness, the CUDA
language combines them together and makes them so similar that the programmer can
write both of these programs without having to learn two totally different languages.
1. First, the CPU will read the command line arguments and will parse them and place
the parsed values in the appropriate CPU-side variables. Exactly the same story as
the plain-simple CPU version of the code, the imflipP.c.
2. One of the command line variables will be the file name of the image file we have to
flip, like the file that contains the dog picture, dogL.bmp. The CPU will read that
file by using a CPU function that is called ReadBMP(). The resulting image will be
placed inside a CPU-side array named TheImg[]. Notice that the GPU does absolutely
nothing so far.
3. Once we have the image in memory and are ready to flip it, now it is time for the
GPU’s sun to shine! Horizontal or vertical flipping are both massively parallel tasks, so
the GPU should do it. At this point in time, because the image is in a CPU-side array
(more generally speaking, in CPU memory), it has to be transferred to the device
side. What is obvious from this discussion is that the GPU has its own memory, in
addition to the CPU’s own memory — DRAM — that we have been studying since
the first time we saw it in Section 3.5.
4. The fact that the CPU memory versus GPU memory are completely different memory
areas (or “chips”) should be pretty clear because the GPU is a different plug-in device
that shares none of the electronic components with the CPU. The CPU memory is
soldered on the motherboard and the GPU memory is soldered on the GPU plug-in
card; the only way a data transfer can happen between these two memory areas is an
explicit data transfer — using the available APIs we will see shortly in the following
pages — through the PCI Express bus that is connecting them. I would like the reader
to refresh his or her memory with Figure 4.3, where I showed how the CPU connected
to the GPU through the X99 chipset and the PCI Express bus. The X99 chip facilitates
the transfers, while the I/O portion of the CPU “chip” employs hardware to interface
to the X99 chip and shuttle the data back and forth between the GPU memory and
the DRAM of the CPU (by passing through the L3$ of the CPU along the way).
5. So, this transfer must take place from the CPU’s memory into the GPU’s memory
before the GPU cores can do anything with the image data. This transfer occurs by
using an API function that looks like an ordinary CPU function.
6. After this transfer is complete, now somebody has to tell the GPU cores what to do
with that data. It is the GPU side code that will accomplish this. Well, the reality is
that you should have transferred the code before the data, so by the time the image
data arrives at the GPU cores they are aware of what to do with it. This implies that
we are really transferring two things to the GPU side: (1) data to process, (2) code
to process the data with (i.e., compiled GPU instructions).
7. After the GPU cores are done processing the data, another GPU→CPU transfer must
transfer the results back to the CPU.
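Putting these seven steps into a rough host-side sketch (error checking and timing omitted; the pointer, kernel, and helper names follow the imflipG.cu listings discussed over the next few pages, so treat this as a preview rather than the verbatim code):

// Steps 1-2: CPU-side work: parse the command line, read the image into CPU DRAM
TheImg = ReadBMPLin(InputFileName);                    // CPU-side pointer

// Steps 3-4: allocate GPU-side memory for the image and its flipped copy
cudaMalloc((void **)&GPUImg,     IMAGESIZE);
cudaMalloc((void **)&GPUCopyImg, IMAGESIZE);

// Step 5: CPU -> GPU transfer over the PCI Express bus
cudaMemcpy(GPUImg, TheImg, IMAGESIZE, cudaMemcpyHostToDevice);

// Step 6: launch the GPU-side code (the kernel) to process the data
Vflip <<< NumBlocks, ThrPerBlk >>> (GPUCopyImg, GPUImg, IPH, IPV);

// Step 7: GPU -> CPU transfer of the result, then write it to disk
cudaMemcpy(CopyImg, GPUCopyImg, IMAGESIZE, cudaMemcpyDeviceToHost);
WriteBMPlin(CopyImg, OutputFileName);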
Using our Figure 6.3 analogy, it is as if Tolga is first giving a piece of paper with the
instructions to the scouts so they know what to do with the coconuts (GPU side code),
grabbing 32 coconuts at a time (read from CPU memory), dumping 32 coconuts at a
time in front of the scouts (CPU→GPU data transfer), telling the scouts to execute their
given instructions, which calls for harvesting the coconuts that just got dumped in front
of them (GPU-side execution), and grabbing what is in front of them when they are done
(GPU→CPU data transfer in the reverse direction) and putting the harvested coconuts back
in the area where he got them (write the results back to the CPU memory).
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <iostream>
#include <ctype.h>
#include <cuda.h>
This process sounds a little inefficient due to the continuous back-and-forth data transfer,
but don’t worry. There are multiple mechanisms that Nvidia built into their GPUs to make
the process efficient and the sheer processing power of the GPU eventually partially hides
the underlying inefficiencies, resulting in a huge performance improvement.
imflipG.cu starts with the #include directives shown above, followed by the main() function and the following five pointers that facilitate image storage in CPU and GPU memory:
• TheImg variable is the pointer to the memory that will be malloc()’d by the
ReadBMPLin() function to hold the image that is specified in the command line (e.g.,
dogL.bmp) in the CPU’s memory. Notice that this variable, TheImg, is a pointer to the
CPU DRAM memory.
• CopyImg variable is another pointer to the CPU memory and is obtained from a sep-
arate malloc() to allocate space for a copy of the original image (the one that will be
flipped while the original is not touched). Note that we have done nothing with the
GPU memory so far.
• As we will see very shortly, there are APIs that we will use to allocate memory in the
GPU memory. When we do this, using an API called cudaMalloc(), we are asking the
GPU memory manager to allocate memory for us inside the GPU DRAM. So, what
the cudaMalloc() returns back to us is a pointer to the GPU DRAM memory. Yet, we
will take that pointer and will store it in a CPU-side variable, GPUImg. This might look
confusing at first because we are saving a pointer to the GPU side inside a CPU-side
variable. It actually isn’t confusing. Pointers are nothing more than “values” or more
specifically 64-bit integers. So, they can be stored, copied, added, and subtracted in
exactly the same way 64-bit integers can be. When do we store GPU-side pointers
on the CPU side? The rule is simple: Any pointer that you will ever use in an API
that is called by the CPU must be saved on the CPU side. Now, let’s ask ourselves
the question: will the variable GPUImg ever be used by the CPU side? The answer
is definitely yes, because we will need to transfer data between the CPU and the GPU using cudaMemcpy(), which takes this GPU-side pointer as one of its arguments. We know that cudaMalloc() and cudaMemcpy() are CPU-side functions, although their responsibility has a lot to do with the GPU. So, we need to store the pointers to both
sides in CPU-side variables. We will most definitely use the same GPU-side pointer
on the GPU side itself as well! However, we are now making a copy of it at the host
(CPU), so the CPU has the chance of accessing it when it needs it. If we didn’t do
this, the CPU would never have access to it in the future and wouldn’t be able to
initiate memory transfers to the GPU that involved that specific pointer.
• The other GPU-side pointers, GPUCopyImg and GPUResult, have the same story. They
are pointers to the GPU memory, where the resulting “flipped” image will be stored
(GPUResult) and another temporary variable that the GPU code needs for its operation
(GPUCopyImg). These two variables are CPU-side variables that store pointers that we
will obtain from cudaMalloc(); storing GPU pointers in CPU variables shouldn’t be
confusing.
There are multiple #include directives you will see in every CUDA program, which are
<cuda_runtime.h>, <cuda.h>, and <device_launch_parameters.h> to allow us to use Nvidia
APIs. These APIs, such as cudaMalloc(), are the bridge between the CPU and the GPU
side. Nvidia engineers wrote them and they allow you to transfer data between the CPU
and the GPU side magically without worrying about the details.
Note the types that are defined here, ul, uch, and ui, to denote unsigned long, unsigned char, and unsigned int, respectively. They are used so often that it makes the code cleaner to define them as user-defined types. This serves, in this case, no purpose other than
to reduce the clutter in the code. The variables to hold the file names are InputFileName
and OutputFileName, which both come from the command line. The ProgName variable is
hard-coded into the program for use in reporting as we will see later in this chapter.
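For reference, the type definitions in question amount to something like the following sketch:

typedef unsigned long  ul;
typedef unsigned char  uch;
typedef unsigned int   ui;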
Analogy 6.2 actually has quite a bit of detail. Let’s understand it.
• The city of cocoTown is the CPU and cudaTown is the GPU. Launching the spaceship
between the two cities is equivalent to executing GPU code. The notebook they left in
the spaceship contains the function parameters for the GPU-side function (for exam-
ple, the Vflip()); without these parameters cudaTown couldn’t execute any function.
• It is clear that the data transfer from the earth (cocoTown) to the moon (cudaTown)
is a big deal; it takes a lot of time and might even marginalize the amazing execution
speed at cudaTown. The spaceship is representing the data transfer engine, while the
space itself is the PCI Express bus that is connecting the CPU and GPU.
• The satellite phone represents the CUDA runtime API library for cocoTown and
cudaTown to communicate. One important detail is that just because the satellite phone operator can relay information to cudaTown, it doesn’t guarantee that a copy of it is also saved in cudaTown; so, these parameters (e.g., warehouse number) must still be put inside the
spaceship (written inside the notebook).
The variables time1, time2, time3, and time4 are all CPU-side variables that store time-
stamps during the transfers between the CPU and GPU, as well as the execution of the
GPU code on the device side. A curious observation from the code above is that we only
use Nvidia APIs to time-stamp the GPU-related events. Anything that touches the GPU
must be time-stamped with the Nvidia APIs, specifically cudaEventRecord() in this case.
But, why? Why can’t we simply use the good old gettimeofday() function we saw in the
CPU code listings?
The answer is in Analogy 6.2: We totally rely on Nvidia APIs (the people from the
moon) to time anything that relates to the GPU side. If we are doing that, we might as
well let them time all of the space travel time, both forward and back. We are recording
the beginning and end of these data transfers and GPU kernel execution as events, which
allows us to use Nvidia event timing APIs, such as cudaEventRecord(), to time them. To be used in this API, an event must first be created using the cudaEventCreate() API. Because
the event recording mechanism is built into Nvidia APIs, we can readily use them to time
our GPU kernels and the CPU←→GPU transfers, much like we did with our CPU code.
In Code 6.3, we use time1 to time-stamp the very beginning of the code and time2 to
time-stamp the point when the CPU→GPU transfer is complete. Similarly, time3 is when
the GPU code execution is done and time4 is when the arrival of the results to the CPU
side is complete. The difference between any two of these time-stamps will tell us how long each one of these events took to complete. Not surprisingly, the difference must also be
calculated by using the cudaEventElapsedTime() API — shown in Code 6.4 — in the CUDA
API library, because the stored time-stamps are in a format that is also a part of the Nvidia
APIs rather than ordinary variables.
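A minimal sketch of this event-based timing pattern, shown here only for the CPU→GPU copy (the variable names mirror the ones used in Code 6.3):

cudaEvent_t time1, time2;
float tfrCPUtoGPU;

cudaEventCreate(&time1);  cudaEventCreate(&time2);         // create the events first
cudaEventRecord(time1, 0);                                 // time-stamp: before the copy
cudaMemcpy(GPUImg, TheImg, IMAGESIZE, cudaMemcpyHostToDevice);
cudaEventRecord(time2, 0);                                 // time-stamp: after the copy
cudaEventSynchronize(time2);                               // make sure the event has completed
cudaEventElapsedTime(&tfrCPUtoGPU, time1, time2);          // elapsed time, in ms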
Nvidia Runtime Engine contains a mechanism — through the cudaMalloc() API — for
the CPU to “ask” Nvidia to see if it can allocate a given amount of GPU memory. The
answer is returned in a variable of type cudaError_t. If the answer is cudaSuccess, we know that the Nvidia Runtime Engine was able to allocate the GPU memory we asked for and placed the starting address of this memory area in a pointer named GPUImg. Remember
from Code 6.1 that the GPUImg is a CPU-side variable, pointing to a GPU-side memory
address.
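In code form, that request and its status check look roughly like this sketch (based on the variable names above, not the verbatim listing):

cudaError_t cudaStatus;
cudaStatus = cudaMalloc((void **)&GPUImg, IMAGESIZE);      // ask the GPU memory manager for space
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed! Can't allocate GPU memory.\n");
    exit(EXIT_FAILURE);
}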
FIGURE 6.4 Nvidia Runtime Engine is built into your GPU drivers, shown in your
Windows 10 Pro SysTray. When you click the Nvidia symbol, you can open the
Nvidia control panel to see the driver version as well as the parameters of your
GPU(s).
Much like the memory allocation API cudaMalloc(), the memory transfer API
cudaMemcpy() also uses the same status type cudaError_t, which returns cudaSuccess if
the transfer completes without an error. If it doesn’t, then we know that something went
wrong during the transfer.
Going back to our Analogy 6.2, the cudaMemcpy() API is a specialized function that
the spaceship has; a way to transfer 166,656 coconuts super fast in the spaceship, instead
of worrying about each coconut one by one. Fairly soon, we will see that this memory
transfer functionality will become a lot more sophisticated and the transfer time will end
up being a big problem that will face us. We will see a set of more advanced memory transfer
functions from Nvidia to ease the pain! In the end, just because the transfers take a lot of
time, cudaTown people do not want to lose business. So, they will invent ways to make the
coconut transfer a lot more efficient to avoid discouraging cocoTown people from sending
business their way.
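For reference, the CPU→GPU copy that produces the cudaStatus value checked below is, in all likelihood, a call of this form (pointer names as in Code 6.1):

cudaStatus = cudaMemcpy(GPUImg, TheImg, IMAGESIZE, cudaMemcpyHostToDevice);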
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaMemcpy failed! %s",cudaGetErrorString(cudaStatus));
exit(EXIT_FAILURE);
}
It is fairly common for programmers to write a wrapper function that wraps every single
CUDA API call around some sort of error checking as shown below:
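A representative call site might look like the following hypothetical line (not verbatim from the listings):

chkCUDAErr(cudaMemcpy(GPUImg, TheImg, IMAGESIZE, cudaMemcpyHostToDevice));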
where the wrapper function — chkCUDAErr() — is one that we write within our C code,
which directly uses the error code coming out of a CUDA API. An example wrapper function
is shown below, which exits the program when a GPU runtime error code is returned by any
CUDA API:
// helper function that wraps CUDA API calls, reports any error and exits
void chkCUDAErr(cudaError_t ErrorID)
{
if (ErrorID != cudaSuccess){
printf("CUDA ERROR :::%s\n", cudaGetErrorString(ErrorID));
exit(EXIT_FAILURE);
}
}
The Flip parameter is set based on the command line argument the user enters. When
the option ’H’ is chosen by the user, the Hflip() GPU-side function is called and the three
specified arguments (GPUCopyImg, GPUImg, and IPH) are passed onto Hflip() from the CPU
side. The ’V’ option launches the Vflip() kernel with four arguments, as opposed to the
three arguments in the Hflip() kernel; GPUCopyImg, GPUImg, IPH, and IPV. Once we look at
the details of both kernels, it will be clear why we need the additional argument inside
Vflip().
The following lines show what happens when the user chooses the ’T’ (transpose) or ’C’
(copy) options in the command line. I could have implemented transpose in a more efficient
way by writing a specific kernel for it; however, my goal was to show how two kernels can
be launched, one after the other. So, to implement ’T’, I launched Hflip followed by Vflip,
which effectively transposes the image. For the implementation of the ’C’ option, though, I
designed a totally different kernel PixCopy().
switch (Flip){
...
case ’T’: Hflip <<< NumBlocks, ThrPerBlk >>> (GPUCopyImg, GPUImg, IPH);
Vflip <<< NumBlocks, ThrPerBlk >>> (GPUImg, GPUCopyImg, IPH, IPV);
GPUResult = GPUImg; GPUDataTransfer = 4*IMAGESIZE;
break;
case ’C’: NumBlocks = (IMAGESIZE+ThrPerBlk-1) / ThrPerBlk;
PixCopy <<< NumBlocks, ThrPerBlk >>> (GPUCopyImg, GPUImg, IMAGESIZE);
GPUResult = GPUCopyImg; GPUDataTransfer = 2*IMAGESIZE;
break;
}
When the option ’H’ is chosen by the user, the execution of the following line is handled
by the Nvidia Runtime Engine, which involves launching the Hflip() kernel and passing the
three aforementioned arguments to it from the CPU side.
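Based on the ’T’ case shown above, that line is, in all likelihood:

Hflip <<< NumBlocks, ThrPerBlk >>> (GPUCopyImg, GPUImg, IPH);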
Going forward, I will use the terminology launching GPU kernels. This contrasts with the
terminology of calling CPU functions; while the CPU calls a function within its own planet,
say earth according to Analogy 6.2, this is possibly not a good terminology for GPU kernels.
Because the GPU really acts as a co-processor, plugged into the CPU using a far slower
connection than the CPU’s own internal buses, calling a function in a far location such as the moon deserves a more dramatic term like launching. In the GPU kernel launch line above,
Hflip() is the GPU kernel name, and the two parameters that are inside the ≪ and ≫
symbols (NumBlocks and ThrPerBlk) tell the Nvidia Runtime Engine what dimensions to
run this kernel with; the first argument (NumBlocks) indicates how many blocks to launch,
and the second argument (ThrPerBlk) indicates how many threads are launched in each
block. Remember from Analogy 6.2 that these two numbers are what the cudaTown people
wanted to know; the number of boxes (NumBlocks) and the number of coconuts in each box
(ThrPerBlk). The generalized kernel launch line is as follows:
GPU Kernel Name <<< dimension, dimension >>> (arg1, arg2, ...);
where arg1, arg2, ... are the parameters passed from the CPU side onto the GPU kernel. In
Code 6.3, the arguments are the two pointers (GPUCopyImg and GPUImg) that were given to us
by cudaMalloc() when we created memory areas to store images in the GPU memory and
IPH is a variable that holds the number of pixels in the horizontal dimension of the image
(ip.Hpixels). GPU kernel Hflip() will need these three parameters during its execution and
would have no way of getting them had we not passed them during the kernel launch.
Remember that the two launch dimensions in Analogy 6.2 were 166,656 and 256, effectively
corresponding to the following launch line:
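Reconstructed with those concrete numbers plugged into the generalized form (and the same argument list as the ’H’ launch above):

Hflip <<< 166656, 256 >>> (GPUCopyImg, GPUImg, IPH);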
This tells the Nvidia Runtime Engine to launch 166,656 blocks of the Hflip() kernel and pass
the three parameters onto every single one of these blocks. So, the following blocks will be
launched: Block 0, Block 1, Block 2, ... Block 166,655. Every single one of these blocks will
execute 256 threads (tid = 0, tid = 1, ... , tid = 255), identical to the pthreads examples we
saw in Part I of the book. What we are really saying is that we are launching a total of
166,656×256 ≈ 41 M threads with this single launch line.
It is worth noting the difference between Million and Mega: a million threads means 1,000,000 threads, while a Mega thread count means 1024×1024 = 1,048,576 threads. Similarly, Thousand is 1000 and Kilo is 1024. I will notate 41 Mega threads as 41 M threads, and similarly 41,664 Kilo threads as 41,664 K threads. To summarize: Thousand = 1000, Kilo (K) = 1024, Million = 1,000,000, and Mega (M) = 1,048,576.
One important note to take here is that the GPU kernel is a bunch of GPU machine code
instructions, generated by the nvcc compiler, on the CPU side. These are the instructions
for the cudaTown people to execute in Analogy 6.2. Let’s say you wanted them to flip the
order in which the coconuts are stored in the boxes and send them right back to earth. You
then need to send them instructions about how to flip them (Hflip()), because the cudaTown people do not know what to do with the coconuts once they receive them. They need the
coconuts (data), as well as the sequence of commands to execute (instructions). So, the
compiled instructions also travel to cudaTown in the spaceship, written on a big piece of
paper. At runtime, these instructions are executed on each block independently. Clearly,
the performance of your GPU program depends on the efficiency of the kernel instructions,
i.e., the programmer.
Let’s refresh our memory with Code 2.8, which was the MTFlipH() CPU function that
accepted a single parameter named tid. By looking at the tid parameter that is passed onto
it, this CPU function knew “who it was.” Based on who it was it processed a different part of
the image, indexed by tid in some fashion. The GPU kernel Hflip() has stark similarities to
it: This kernel acts almost exactly like its CPU sister MTFlipH() and the entire functionality
of the Hflip() kernel will be dictated by a thread ID. Let’s now compare them:
• MTFlipH() function is launched with 4–8 threads, while the Hflip() kernel is launched
with almost 40 million threads. I talked about the overhead in launching CPU threads
in Part I, which was really high. This overhead is almost negligible in the GPU world,
allowing us to launch a million times more of them.
• MTFlipH() expects the Pthread API call to pass the tid to it, while the Hflip() kernel
will receive its thread ID (0...255) directly from Nvidia Runtime Engine, at runtime.
As the GPU programmer, all we have to worry about is to tell the kernel how many
threads to launch and they will be numbered automatically.
• Due to the million-times-higher number of threads we launch, some sort of hierarchy is necessary. This is why the thread numbering is broken down into two values: a block ID and a thread ID within the block; the blocks are little chunks of execution, with 256 threads in each (in this example). Each block executes completely independently of the others.
cudaEventSynchronize(time1); cudaEventSynchronize(time2);
cudaEventSynchronize(time3); cudaEventSynchronize(time4);
cudaEventElapsedTime(&totalTime, time1, time4);
cudaEventElapsedTime(&tfrCPUtoGPU, time1, time2);
cudaEventElapsedTime(&kernelExecutionTime, time2, time3);
cudaEventElapsedTime(&tfrGPUtoCPU, time3, time4);
cudaStatus = cudaDeviceSynchronize();
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "\n Program failed after cudaDeviceSynchronize()!");
free(TheImg); free(CopyImg); exit(EXIT_FAILURE);
}
WriteBMPlin(CopyImg, OutputFileName); // Write the flipped image back to disk
...
The cudaDeviceSynchronize() function waits for every single launched kernel to complete its
execution. The result could be an error, in which case cudaDeviceSynchronize() will return
an error code. Otherwise, everything is good and we move onto reporting the results.
cudaEventRecord(time3, 0);
// Copy output (results) from GPU buffer to host (CPU) memory.
cudaStatus = cudaMemcpy(CopyImg, GPUResult, IMAGESIZE, cudaMemcpyDeviceToHost);
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaMemcpy GPU to CPU failed!");
exit(EXIT_FAILURE);
}
cudaEventRecord(time4, 0);
cudaEventSynchronize(time1); cudaEventSynchronize(time2);
cudaEventSynchronize(time3); cudaEventSynchronize(time4);
cudaEventElapsedTime(&totalTime, time1, time4);
cudaEventElapsedTime(&tfrCPUtoGPU, time1, time2);
cudaEventElapsedTime(&kernelExecutionTime, time2, time3);
cudaEventElapsedTime(&tfrGPUtoCPU, time3, time4);
PTX instructions; the Nvidia Runtime Engine further half-compiles them at runtime and feeds the fully compiled instructions into the GPU cores. In Windows, all of the “Nvidia magic code” that facil-
itates this “further-half-compiling” is built into a Dynamic Link Library (DLL) named
cudart (CUDA Run Time). There are two flavors: in modern x64 OSs, it is cudart64
and in old 32-bit OSs, it is cudart32, although the latter should never be used because
all modern Nvidia GPUs require a 64-bit OS for efficient use. In my Windows 10 Pro
PC, for example, I was using cudart64_80.dll (the Runtime Dynamic Link Library for CUDA
8.0). This file is not something you explicitly have to worry about; the nvcc compiler
will put it in the executable directory for you. I am just mentioning it so you are aware
of it.
Let’s compare Code 6.7 to its CPU sister Code 2.7. Let’s assume that both of them are
trying to flip the astronaut.bmp image in Figure 5.1 vertically. astronaut.bmp is a 7918×5376
image that takes ≈ 121 MB on disk. How would their functionality be different?
• For starters, assume that Code 2.7 uses 8 threads; it will assign the flipping task of 672
lines to each thread (i.e., 672 × 8 = 5376). Each thread will, then, be responsible for
processing ≈ 15 MB of information out of the entire image, which contains ≈ 121 MB
of information in its entirety. Because the launch of more than 10–12 threads will not
help on an 8C/16T CPU, as we witnessed over and over again in Part I, we cannot
really do better than this when it comes to the CPU.
• The GPU is different though. In the GPU world, we can launch a gazillion threads
without incurring any overhead. What if we went all the way to the bitter extreme and
had each thread swap a single pixel? Let’s say that each GPU thread takes a single
pixel’s RGB value (3 bytes) from the source image GPU memory area (pointed to by
*ImgSrc) and writes it into the intended vertically flipped destination GPU memory
area (pointed to by *ImgDst).
• Remember, in the GPU world, our unit of launch is blocks, which are clumps of threads,
each clump being 32, 64, 128, 256, 512, or 1024 threads. Also remember that it cannot
be less than 32, because “32” is the smallest amount of parallelism we can have and
32 threads are called a warp, as I explained earlier in this chapter. Let’s say that each
one of our blocks will have 256 threads to flip the astronaut image. Also, assume that
we are processing one row of the image at a time using multiple blocks. This means
that we need ⌈7918/256⌉ = 31 blocks to process each row.
• Because we have 5376 rows in the image, we will need to launch 5376 × 31 = 166,656
blocks to vertically flip the astronaut.bmp image.
• We observe that 31 blocks-per-row will yield some minor loss, because 31 × 256 = 7936
and we will have 18 threads (7936 − 7918 = 18) doing nothing to process each row
of the image. Oh well, nobody said that massive parallelism doesn’t have its own
disadvantages.
• This problem of “useless threads” is actually exacerbated by the fact that not only are these threads useless, but every thread also has to check whether it is one of the useless threads, as shown in the line below:
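In essence, the guard is of the following form (a sketch using the MYcol variable discussed below, not the verbatim listing):

if (MYcol >= Hpixels) return;     // useless thread: past the last pixel of this row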
This line simply says “if my tid is between 7918 and 7935 I shouldn’t do anything,
because I am a useless thread.” Here is the math: We know that the image has
7918 pixels in each row. So, the threads tid = 0...7917 are useful, and because we
launched 7936 threads (tid = 0...7935), this designates threads (tid = 7918...7935) as
useless.
• Don’t worry about the fact that we do not see tid in the comparison; rather, we see
MYcol. When you calculate everything, the underlying math ends up being exactly
what I just described. The reason for using a variable named MYcol is that the code has to be parametric, so it works for any size image, not just astronaut.bmp.
• Why is it so bad if only 18 threads are useless? After all, 18 is only a very small percentage of the 7936 total threads. Well, the cost is not limited to those 18 threads. Like I said before,
what you are seeing in Code 6.7 is what every thread executes. In other words, all
7936 threads must execute the same code and must check to see if they are useless,
just to find that they aren’t useless (most of the time) or they are (only a fraction
of the time). So, with this line of code, we have introduced overhead to every thread.
How do we deal with this? We will get to it, I promise. But, not in this chapter ... For
now, just know that even with these inefficiencies — which are an artifact of massively
parallel programming — our performance is still acceptable.
• And, finally, the __global__ is the third CUDA symbol that I am introducing here,
after ≪ and ≫. If you precede any ordinary C function with __global__ the
nvcc compiler will know that it is a GPU-side function and it compiles it into PTX,
rather than the x64 machine code output. There will be a few more of these CUDA
designators, but, aside from that, CUDA looks exactly like C.
Here, the ts and te variables computed the starting and ending row numbers in the
image, respectively. Vertical flipping was achieved by two nested loops, one scanning the
columns and the other scanning the rows. Now, let’s compare this to the Vflip() function
in Code 6.7:
__global__
void Vflip(uch *ImgDst, uch *ImgSrc, ui Hpixels, ui Vpixels)
{
ui ThrPerBlk = blockDim.x;
ui MYbid = blockIdx.x;
ui MYtid = threadIdx.x;
ui MYgtid = ThrPerBlk * MYbid + MYtid;
ui BlkPerRow = (Hpixels + ThrPerBlk - 1) / ThrPerBlk; // ceil
ui RowBytes = (Hpixels * 3 + 3) & (~3);
We see that there are major similarities and differences between CPU and GPU func-
tions. The task distribution in the GPU function is completely different because of the
blocks and the number of threads in each block. So, although the GPU still calculates a
bunch of indexes, they are completely different than those in the CPU function. The kernel first wants to know how many threads were launched with each block. The answer is in a special GPU value named blockDim.x. We know that this answer will be 256 in our specific case because we specified 256 threads to be launched in each block (Vflip <<< ..., 256 >>> ( ... )). So, each block contains 256 threads, with thread IDs 0...255. The specific thread ID of this thread is in threadIdx.x. The kernel also wants to know, out of the 166,656 blocks, what its own block ID is. This answer is in another GPU value named blockIdx.x. Surprisingly, it doesn’t care about
the total number of blocks (166,656) in this case. There will be other programs that do.
It saves its block ID and thread ID in two variables named MYbid and MYtid. It then computes a global thread ID (MYgtid) using a combination of these two. This MYgtid gives a unique ID to each one of the launched GPU threads (out of the total 166,656 × 256 ≈ 41 M threads),
thereby linearizing them. This concept is very similar to how we linearized the pixel memory
locations on the disk according to Equation 6.2. However, an immediate correlation between
linear GPU thread addresses and linear pixel memory addresses is not readily available in
this case due to the existence of the useless threads in each row. Next, it computes the blocks
per row (BlkPerRow), which was 31 in our specific case. Finally, because the value of the
number of horizontal pixels (7918) was passed onto this function as the third parameter, it
can compute the total number of bytes in a row of the image (3 × 7918 = 23,754 Bytes) to
determine the byte index of each pixel.
After these computations, the kernel then moves onto computing the row and column
index of the single pixel that it is responsible for copying as follows:
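In rough form, that computation is the following (a reconstruction; MYcol, MYsrcIndex, and MYdstIndex are named in the text, while MYrow and MYmirrorrow are assumed names for the intermediate values):

ui MYrow = MYbid / BlkPerRow;                        // which image row this block works on
ui MYcol = MYgtid - MYrow * BlkPerRow * ThrPerBlk;   // which pixel within that row
if (MYcol >= Hpixels) return;                        // useless thread: past the end of the row
ui MYmirrorrow = Vpixels - 1 - MYrow;                // the row it swaps with (vertical flip)
ui MYsrcIndex  = MYrow * RowBytes + 3 * MYcol;       // byte address of the source pixel
ui MYdstIndex  = MYmirrorrow * RowBytes + 3 * MYcol; // byte address of the destination pixel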
After these lines, the source pixel memory address is in MYsrcIndex and the destination
memory address is in MYdstIndex. Because each pixel contains three bytes (RGB) starting
at that address, the kernel copies three consecutive bytes starting at that address as follows:
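That is, something along the lines of:

ImgDst[MYdstIndex]     = ImgSrc[MYsrcIndex];         // copy the three bytes of
ImgDst[MYdstIndex + 1] = ImgSrc[MYsrcIndex + 1];     // one pixel from the source row
ImgDst[MYdstIndex + 2] = ImgSrc[MYsrcIndex + 2];     // into the mirrored destination row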
Let’s now compare this to CPU Code 2.7. Because we could only launch 4–8 threads, instead
of the massive 41 M threads we just witnessed, one striking observation from the GPU kernel
is that the for loops are gone! In other words, instead of explicitly scanning over the columns
and rows, like the CPU function has to, we don’t have to loop over anything. After all, the
entire purpose of the loops in the CPU function was to scan the pixels with some sort of
two-dimensional indexing, facilitated by the row and column variables. However, in the GPU
kernel, we can achieve this functionality by using the tid and bid, because we know the
precise relationship of the coordinates and the tid and bid variables.
TABLE 6.1 CUDA keywords and symbols that we learned in this chapter.

CUDA Keyword/Symbol   Description                   Examples
__global__            precedes a device-side        __global__
                      (global) function             void PixCopy(uch *ImgDst, uch *ImgSrc, ui FS)
                      (i.e., a kernel)              {
                                                       ...
                                                    }
<<< , >>>             launch a device-side          Hflip<<<NumBlocks, ThrPerBlk>>>(..., ..., ...);
                      kernel from the               Vflip<<<NumBlocks, ThrPerBlk>>>(..., ..., ...);
                      host side                     PixCopy<<<NumBlocks, ThrPerBlk>>>(..., ..., ...);
In this case, we got lucky. However, if we had 20 more bytes in the file, we would have 236 threads
wasted out of the 256 in the very last block. This is why we still have to put the following
if statement in the kernel to check for this condition as shown below:
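In essence, the guard is (a sketch; MYgtid is the global thread ID and FS is the total byte count passed to PixCopy()):

if (MYgtid >= FS) return;     // useless thread: beyond the last byte of the image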
The if statement in Code 6.9, much like the ones in Code 6.7 and Code 6.8, checks if “it
is a useless thread” and does nothing if it is. The performance impact of this line is similar
to the previous two kernels: although this condition will be only true for a negligible number
of threads, every thread still has to execute it for every single byte they are copying. We can
improve this, but we will save all of these improvement ideas to the upcoming chapters. For
now, it is worth noting that the performance impact of this if statement in the PixCopy()
kernel is far worse than the one we saw in the other two kernels. The PixCopy() kernel has
a much finer granularity as it copies only a single byte. Because of this, there are only 6
lines of C code in PixCopy(), one of which is the if statement. In contrast, Code 6.7 and
Code 6.8 contain 16–17 lines of code, thereby making the impact of one added line much
less. Although “lines of code” clearly does not translate to “the number of cycles that it
takes for the GPU core to execute the corresponding instructions” one-to-one, we can still
get an idea about the magnitude of the problem.
The imflipG.cu program flips a given image in the (1) horizontal or (2) vertical direction, (3) copies it to another image, or (4) transposes it. The command line to run imflipG.cu is as follows:
imflipG astronaut.bmp a.bmp V 256
This vertically flips an image named astronaut.bmp and writes the flipped image into another
file named a.bmp. The ’V’ option is the flip direction (vertical) and 256 is the number of
threads in each block, which is what we will plug into the second argument of our kernel
dimensions with the launch parameters Vflip ≪ ..., 256 ≫ (...). We could choose ’H’, ’C’, or
’T’ for horizontal flip, copy, or transpose operations.
FIGURE 6.5 Creating a Visual Studio 2015 CUDA project named imflipG.cu. Assume
that the code will be in a directory named Z:\code\imflipG in this example.
• You can select the Network Installer, which will install straight from the Internet.
After spending 50 GB of your hard disk space on VS 2015, you will not be terribly
worried about another GB. Either option is fine. I always choose the network installer,
so I don’t have to worry about deleting the local installer code after the installation
is done.
• Click OK for the default extraction paths. The screen may go blank for a few seconds
while the GPU drivers are being configured. After the installation is complete, you
will see a new option in your Visual Studio, named “NSIGHT.”
FIGURE 6.6 Visual Studio 2015 source files are in the Z:\code\imflipG\imflipG di-
rectory. In this specific example, we will remove the default file, kernel.cu, that VS
2015 creates. After this, we will add an existing file, imflipG.cu, to the project.
This is where the project source files are going to be placed by VS 2015; so, the source files will be under the directory Z:\code\imflipG\imflipG. Go into Z:\code\imflipG\imflipG; you will see a file
named kernel.cu and another file we don’t care about. The kernel.cu file is created in the
source file directory automatically by VS 2015 by default.
At this point, there are three ways you can develop your CUDA project:
1. You can enter your code inside kernel.cu by using it as a template: delete the parts you don’t want from it, then compile it and run it as your only kernel code.
2. You can rename kernel.cu as something else (say, imflipG.cu) by right clicking on it
inside VS 2015. You can clean what is inside the renamed imflipG.cu and put your
own CUDA code in there. Compile it and run it.
3. You can remove the kernel.cu file from the project and add another file, imflipG.cu, to the project. This assumes that you already have this file, either by acquiring it from someone else or by editing it in a different editor.
I will choose the last option. One important thing to remember is that you should never rename/copy/delete the project files from within Windows itself (e.g., using File Explorer). You should perform any one of these operations inside Visual Studio 2015. Otherwise, you will confuse VS 2015 and it will try to use a file
that doesn’t exist. Because I intend to use the last option, the best thing to do is to actually
plop the file imflipG.cu inside the Z:\code\imflipG\imflipG directory first. The screen shot
after doing this is shown at the bottom of Figure 6.6. This is, for example, what you would
FIGURE 6.7 The default CPU platform is x86. We will change it to x64. We will also
remove the GPU debugging option.
do if you are testing the programs I am supplying as part of this book. Although you will
get only a single file, imflipG.cu, as part of this book, it must be properly added to a VS
2015 project, so you can compile it and execute it. Once the compilation is done, there will
be a lot of miscellaneous files in the project directory, however, the source file is only a
single file: imflipG.cu.
Figure 6.6 also shows the steps in deleting the kernel.cu file. You right click and choose
“Remove” first (top left). A dialog box will appear asking you whether you want to just
remove it from the project, but keep the actual file (the “Remove” option) or remove it from
the project and delete the actual file too (the “Delete” option). If you choose the “Delete”
option, the file will be gone and it will no longer be a part of the project. This is the graceful
way to get this file permanently out of your life, while also letting VS 2015 know about it
along the way. After kernel.cu is gone, you right click the project and this time add a file
to it. You can either add the file that we just dropped into the source directory (which is
what we want to do by choosing the “Add Existing Item” option), or add a new file that
doesn’t exist and you will start editing (the “Add New Item” option). After we choose “Add
Existing,” we see the new imflipG.cu file added to the project in Figure 6.6. We are now
ready to compile it and run it.
FIGURE 6.8 The default Compute Capability is 2.0. This is too old. We will change it to Compute Capability 3.0, which is done by editing Code Generation under Device and changing it to compute_30,sm_30.
The project properties dialog box will open, as shown in Figure 6.7. For the GPU, the first option you choose is Generate GPU Debug Information. If you choose “Yes” here, you will be able to run the GPU debugger,
however your code will run at half the speed because the compiler has to add all sorts of
break points inside your code. Typically, the best thing to do is to keep this at “Yes” while
you are developing your code. After your code is fully debugged, you switch it to “No” as
shown in Figure 6.7.
After you choose the GPU Debug option, you have to edit the Code Generation under
CUDA C/C++ → Device and select the Code Generation next, as shown in Figure 6.8.
The default Compute Capability is 2.0, which will not allow you to run a lot of the new
features of the modern Nvidia GPUs. You have to change this to Compute Capability 3.0.
Once the “Code Generation” dialog box opens, you have to first uncheck Inherit from parent
of project defaults. The default Compute Capability is 2.0, which the “compute_20,sm_20” string represents; you have to change it to “compute_30,sm_30” by typing this new string
into the textbox at the top of the Code Generation dialog box, as shown in Figure 6.8. Click
“OK” and the compiler knows now to generate code that will work for Compute Capability
3.0 and above. When you do this, your compiled code will no longer work with any GPU
that only supports 2.0 and below. There have been major changes starting with Compute
Capability 3.0, so it is better to compile for at least 3.0. Compute Capability of the Nvidia
GPUs is exactly like the x86 versus x64 Intel ISA, except there are quite a few more options
from Compute Capability 1.0 all the way up to 6.x (for the Pascal Family) and 7.x for the
upcoming Volta family.
The best option to choose when you are compiling your code is to set your Compute Capa-
bility to the lowest that will allow you to run your code at an acceptable speed. If you set
it too high, like 6.0, then your code will only run on Pascal GPUs, however you will have
the advantage of using some of the high-performance instructions that are only available
FIGURE 6.9 Compiling imflipG.cu to get the executable file imflipG.exe in the
Z:\code\imflipG\x64\Debug directory.
in Pascal GPUs. Alternatively, if you use a low number, like 2.0, then your code might be exposed to the severe limitations of the earlier GPU generations; just as a quick example, the grid size limitations were so restrictive that you had to launch the kernels in a loop, because each kernel launch could only have a maximum of roughly 65,000 blocks per grid dimension, rather than the multiple billions allowed starting with Compute Capability 3.0. This would be a huge problem even in our very
first CUDA program imflipG.cu; as we analyzed in Section 6.4.15, imflipG.cu required us
to launch 166,656 blocks. Using Compute Capability 2.0 would require that we somehow
chop up our code into three separate kernel launches, which would make the code messy.
However, using Compute Capability 3.0 and above, we no longer have to worry about this
because we can launch billions of blocks with each kernel. We will study this in great de-
tail in the next chapter. This is why 3.0 is a good default for your projects, and I will choose 3.0 as my default assumption for all of the code I am presenting in this book, unless otherwise stated explicitly. If 3.0 is what you will continuously use, it might be better to
change the Project defaults, rather than having to change this every time you create a new
CUDA program template.
Once you choose the Compute Capability, you can compile and run your code; go to
BUILD → Build Solution as shown in Figure 6.9. If there are no problems, your screen will
look like what I am showing in Figure 6.9 (1 succeeded and 0 failed) and your executable
file will be in the Z:\code\imflipG\x64\Debug directory. If there are errors, you can click on an error message to go to the source line of that error.
Although Visual Studio 2015 is a very nice IDE, it has a super annoying feature when it
comes to developing CUDA code. As you see in Figure 6.9 (see ebook for color version), your
kernel launch lines in main() — consisting of CUDA’s signature <<< and >>> brackets — will have squiggly red lines under them as if they were a syntax error. It gets worse; because VS 2015 sees them as dangerous aliens trying to invade this planet, any chance it gets, it will try to separate the triple brackets into double and single brackets: “<<<” will become “<< <”. It will drive you
nuts when it separates them and you connect them back together and, in a minute, they are
separated again. Don’t worry. You will figure out how to handle them in time and you will
get over it. I no longer have any issues with it. Ironically, even after being separated, nvcc
will actually compile it correctly. So, the squiggly lines are nothing more than a nuisance.
C:\> Z:
Z:\> CD Z:\code\imflipG\x64\Debug
Z:\code\imflipG\x64\Debug> imflipG Astronaut.bmp Output.bmp V 256
As seen in Figure 6.10, if you have File Explorer open, you can browse this executable
code directory and when you click the location dropdown box, the directory name will be
highlighted (Z:\code\imflipG\x64\Debug), allowing you to copy it using Ctrl-C. You can
then type “CD” inside your CMD window and paste that directory name after “CD”, which
eliminates the need to remember that long directory name. The program will require the
source file Astronaut.bmp that we are specifying in the command line. If you try to run it
without the Astronaut.bmp file in the executable directory, you will get an error message; otherwise, the program will run and place the expected resulting output file, Output.bmp in this case, in the same directory. To visually inspect this file, all you have to do is open a browser — such as Internet Explorer or Mozilla Firefox — and drop the file into the browser
window. Even simpler, you can double click on the image and Windows will open the
associated application to view it. If you want to change that default application, Windows
will typically give you an option to do so.
If everything checks out OK in this list after the execution of the program is complete,
then your program may be fine. After these checks, the only remaining issues are subtle ones. These issues do not manifest themselves as errors or crashes; they may have subtle effects that are hard to catch with the checklist above, such as the image being shifted one pixel to the right, with the leftmost column left blank (e.g., white). You wouldn’t be able to tell this problem with the simple visual check, not even when you drag and drop the file into a browser. The single column of blank pixels would be white, much like the background color of the browser, thereby making it difficult for you to distinguish the browser background from the blank image column. However, a trained eye knows to be
suspicious of everything and can spot the most subtle differences like this. In any event,
a simple file checker will clear up any doubt that you have in mind for these kinds of
problems.
Just as computer programmer trivia, I can’t stop myself from mentioning a third kind
of a problem: everything checks out fine, and the golden and output files compare fine.
However, the program gradually makes computer performance degrade. So, in a sense, al-
though your program is producing the expected output, it is not running properly. This
is the kind of problem that will really challenge an intermediate programmer, even an
experienced one. But, more than likely, an experienced programmer will not have these
types of bugs in his or her code; yeah right! Examples of these bugs include ones that
allocate memory and do not free it or ones that write a file with the wrong attributes,
preventing another program from modifying it, assuming that the intent of the program
is to produce an output that can be further modified by another program, etc. If you are
a beginner, you will develop your own experience database as time goes on and will be
proficient in spotting these bugs. I can make a suggestion for you though: be suspicious of
everything! You should be able to detect any anomalies in performance, output speed, the
difference between two different runs of the same code, and more. When it comes to com-
puter software bugs — and, for that matter, even hardware design bugs — it might be a good time to repeat the words of Intel’s former CEO and legend, the late Andy Grove: only the paranoid survive.
FIGURE 6.11 The /usr/local directory in Unix contains your CUDA directories.
Actually, in a Windows platform, this is precisely what Visual Studio does when you
click the “Build” option. You can view and edit the command line options that VS 2015
will use when compiling your CUDA code by going to PROJECT → imflipG Properties
on the menu bar. Xcode IDE is no different. Indeed, the Eclipse IDE that I will describe
when showing the Unix CUDA development environment is identical. Every IDE will have
an area to specify the command line arguments to the underlying nvcc compiler. In Xcode,
Eclipse, or VS 2015, you can completely skip the IDE and compile your code using the
command line terminal. The CMD tool of Windows also works for that.
Here, there are two different CUDA directories shown. This is because CUDA 7.5 was
installed first, followed by CUDA 8.0. So, both of the directories for 7.5 and 8.0 are there.
The /usr/local/cuda symbolic link points to the one that we are currently using. This is why
it might be a better idea to actually put this symbolic link in your PATH variable, instead
of a specific one like cuda-8.0, which I showed above.
A dialogue box opens asking you for the workspace location. Use the default or set it to
your preferred location and press OK. You can create a new CUDA project by choosing
File → New → CUDA C/C++ Project, as shown in Figure 6.12.
Build your code by clicking the hammer icon and run it. To execute a compiled program
on your local machine, run it as you would any other program. However, because we are
FIGURE 6.12 Creating a new CUDA project using the Eclipse IDE in Unix.
normally going to be passing files and command line arguments to the program, you will
probably want to put the cd into the directory with the binary and run it from there. You
could specify the command line arguments from within your IDE, but this is somewhat
tedious if you are changing them frequently. The binaries generated by IDEs generally
appear in some subfolder of your project (Eclipse puts them in the Debug and Release
folders). As an example, to run the “Release” version of an application in Linux that we
developed in Eclipse/Nsight, we may type the following commands:
cd ~/cuda-workspace/imflipG/Release
./imflipG
This will run your CUDA code and will display the results exactly like in Windows.
CHAPTER 7
CUDA Host/Device
Programming Model
GPU is a co-processor, with almost no say in how the CPU does its work. So, although we will study the GPU architecture in great detail in the following chapters, we will
focus on the CPU-GPU interaction in this chapter. The programming model of a GPU is
a host/device model, in which the host (CPU) issues co-processor like commands to the
device (GPU) and has no idea what the GPU is doing during their execution, although
it can use a rich set of query commands to determine its status during the execution. To
the CPU, all that matters is that the GPU finishes its execution and the results show up
somewhere that it can access. GPU, on the other hand, receives commands from the CPU
through a set of API functions — provided by Nvidia — and executes them. So, although
CPU programming can be learned independently, GPU programming should be learned in
conjunction with CPU programming, hence the reason for the organization of this book.
• There is no such thing as GPU programming; there is only CPU+GPU programming ... You can’t learn “just GPU programming.”
• When you were learning how to ride a bicycle, what did you do? Did you learn “just how to pedal” and ignore steering? No, you either learned both or you didn’t know how to ride a bike.
• The CPU code dictates what the GPU does. So, you need to learn their programming together; you can’t just learn one.
In this chapter, we will focus on the parts of the GPU programming that involves
both the CPU and GPU, which are the launch dimensions of a GPU kernel, PCI Express
bandwidth and its impact on the overall performance, and the memory bandwidth of the
CPU and the GPU.
With only a few threads (say, 8), we simply chopped up the image into 8 pieces and had each thread process a portion of 1/8 of the image. However, when there are millions of threads we
can launch, a slew of very different considerations come up. Let’s first conceptualize our
parallelism without writing a single line of GPU code.
The pointer TheImg points to the original CPU image array, which is then copied onto
the GPU image array, pointed to by the GPUImg pointer. When the GPU kernel exe-
cutes, it takes the image that is stored in the array that is pointed to by GPUImg, flips
it, and stores the flipped image in the GPU memory, pointed to by GPUCopyImg. This im-
age, then, is transferred back to the CPU’s second image array, pointed to by the CopyImg
pointer.
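To make this concrete, a minimal host-side sketch of the data flow just described might look like the lines below (this is a sketch, not the exact imflipG.cu code; IMAGESIZE, IPH, and IPV are placeholder names for the image size in bytes and its horizontal/vertical pixel counts, NumBlocks and ThrPerBlk are assumed to be computed elsewhere in main(), and error checking is omitted):

uch *GPUImg, *GPUCopyImg;                                           // device-side image buffers
cudaMalloc((void **)&GPUImg, IMAGESIZE);
cudaMalloc((void **)&GPUCopyImg, IMAGESIZE);
cudaMemcpy(GPUImg, TheImg, IMAGESIZE, cudaMemcpyHostToDevice);      // CPU image -> GPU global memory
Vflip<<<NumBlocks, ThrPerBlk>>>(GPUCopyImg, GPUImg, IPH, IPV);      // flip on the GPU
cudaMemcpy(CopyImg, GPUCopyImg, IMAGESIZE, cudaMemcpyDeviceToHost); // flipped image -> CPU
cudaFree(GPUImg);  cudaFree(GPUCopyImg);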
If this code were being executed using 8 CPU threads, each thread would be responsible
for copying an eighth of the image; GPU parallelism differs, despite major similarities.
Here are the basic rules for GPU parallelism:
• The GPU code is written using threads, exactly like the CPU code, so a GPU kernel is
the code for each thread. In that sense, a thread can be thought of as being a building
block of the task. So, the programmer designs the GPU program based on threads.
• Because no fewer than 32 threads (a warp) execute at any given point in time, a warp
can be thought of as being the building block of code execution. So, the GPU executes
the program based on warps.
• In many cases, a warp is too small of an execution unit for the GPU. So, we launch
our kernels in terms of blocks, which are a bunch of warps, clumped up together. In
that sense, a block can be thought of as being a building block of code launch. So,
the programmer launches his or her kernels in terms of blocks. The notion of the
warp is more like good trivia; a programmer conceptualizes everything in terms
of threads/block. In all of our GPU kernels, we will always ask ourselves the fol-
lowing question: "How many threads should our blocks have?" We will rarely (if ever)
worry about warps. The size of a warp has never changed in the past
two decades of Nvidia GPU designs, but we have to keep in mind the possibility
that Nvidia might decide to change the warp size to something other than 32 in
future GPU generations. However, the question "how many threads are in a block?" will
never change. The programmer doesn't really think of the block size in terms of
warps.
• Common block sizes are 32, 64, 128, 256, 512, or 1024 threads/block.
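In code, choosing a block size and turning it into a kernel launch boils down to a few lines. The following sketch assumes the one-pixel-per-thread Vflip() design discussed in this chapter (IPH and IPV are placeholder names for the horizontal and vertical pixel counts; GPUImg and GPUCopyImg are the GPU image pointers introduced above):

ui ThrPerBlk = 256;                                 // one of the common block sizes listed above
ui BlkPerRow = (IPH + ThrPerBlk - 1) / ThrPerBlk;   // ceiling(Hpixels / ThrPerBlk)
ui NumBlocks = IPV * BlkPerRow;                     // BlkPerRow blocks cover each pixel row
Vflip<<<NumBlocks, ThrPerBlk>>>(GPUCopyImg, GPUImg, IPH, IPV);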
TABLE 7.1 Vflip() kernel execution times (ms) for different size images on a GTX TITAN Z GPU.

                                                  astronaut.bmp (121 MB)   mars.bmp (241 MB)
                                                       7918 × 5376            12140 × 6940
ThrPerBlk    NumBlocks              BlkPerRow          time (ms)               time (ms)
   32        1,333,248 ≈ 1.27 M        248               12.04                   22.93
   64          666,624 = 651 K         124                6.58                   12.63
  128          333,312 ≈ 326 K          62                4.24                    7.90
  256          166,656 ≈ 163 K          31                4.48                    8.15
  512           86,016 = 84 K           16                4.64                    8.53
 1024           43,008 = 42 K            8                5.33                    9.22
Different block sizes (ThrPerBlk) correspond to different total number of blocks launched
(NumBlocks) and the total number of blocks needed to process each row (BlkPerRow),
as tabulated. The CPU←→GPU data transfer times are not included in this table.
The most intuitive explanation for the performance degradation at ThrPerBlk<128 is that
above a certain threshold (more than 384 bytes copied per block in this specific case) the
GPU memory’s massive bandwidth can be taken advantage of, thereby yielding the best
results. This is very similar to what we witnessed in Section 3.5.3, where the CPU DRAM
didn’t like “choppy” access; it preferred big consecutive chunks. The question is: How do
we explain the slight performance degradation when ThrPerBlk>128?
Also, how can we be sure that bigger blocks translate to “consecutive memory access” inside
the GPU memory? These questions will take a few more sections to appreciate. Let’s keep
studying the GPU to find answers to these questions.
7.2.1 Grids
We observe from Table 7.1 that the total number of GPU threads we launched is always
41 M — for the astronaut.bmp image — regardless of what the size of each block is. This is
because we conceptualized each thread as being responsible for one pixel (3 bytes) and the
size of the image is 121 MB. So, not surprisingly, we need to launch a number of threads
that is 1/3 of the image size. Actually, it is a little more than that because of the useless
threads that I mentioned in Section 7.1.2. So, although the 41 M pixels became a fact the
second we decided to adopt the “one pixel per thread” strategy, we still have a choice in
how we chop these 41 M threads up into blocks: If our blocks have 128 threads each, then we
will have ≈326 K blocks. Alternatively, a much smaller block size of 32 threads will result
in ≈1.27 M blocks, per Table 7.1.
1D Grids: Nvidia has a name for this army of blocks: a grid. A grid is a bunch
of blocks, arranged in a 1D or 2D fashion. Let’s choose the ThrPerBlk=128 as our block
size option in Table 7.1, which we know yields the optimum kernel execution time. In this
case, we are choosing to launch a grid that has 333,312 blocks. In this grid, the blocks are
numbered from Block 0 to Block 333,311. Alternatively, if you launch the same grid with
ThrPerBlk=256, you will have 166,656 blocks, numbered from Block 0 to Block 166,655.
• The grid dimension is 166,656, so block IDs will range in 0...166,655.
2D Grids: The grids do not have to be in a 1D array. They can most definitely be in
a 2D array. For example, instead of launching 166,656 blocks in a 1D fashion — which will
have block numbers 0 through 166,655 — you can launch them in a 2D fashion; in this case
you still need to choose the size of each dimension: you have options such as 256 × 651 or
768 × 217. As an example, if you chose the 768 × 217 option, your blocks will now have a
2D block numbering (i.e., x and y block IDs) as follows:
• The x grid dimension is 768, so block IDs in the x dimension will range in 0...767,
• The y grid dimension is 217, so block IDs in the y dimension will range in 0...216.
3D Grids: Starting with Compute Capability 2.x, Nvidia GPUs support 3D grids of
blocks. However, there is fine print: just because a GPU supports a 3D grid of blocks
does not mean that there is no limit on the total number of blocks that can be launched with a
single kernel launch. As we will see in Section 7.7.3, the GT630 GPU can support three dimensions,
but its x dimension is limited to 65,535 blocks and the total number of blocks that can be
launched with a single kernel launch is only ≈190 K. In other words, the product of the
three dimensions cannot exceed this number, which is an architectural limitation.
One important note here is that grids are always multidimensional; so, if you launch
a 1D grid, the only available dimension will be considered as the “x” dimension. One of
the most beautiful features of the GPUs is that grids are automatically numbered by the
hardware and passed on to the kernels. Table 7.2 shows the kernel variables that the GPU
hardware passes on to the kernel. In the example I just gave, if we launch a 1D grid of
166,656 blocks, the GPU hardware will have the gridDim.x=166656 available in every single
kernel that is launched. In the case of a 2D grid launch (say, with the 768×217 dimensions),
the GPU hardware will pass gridDim.x=768 and gridDim.y=217 to every single kernel. The
kernels can take these values and use them in their calculation. Because they are hardware-
generated values, the cost of obtaining them is zero. This means that whether you launch
a 1D or 2D grid you do not pay any performance penalty.
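As a small sketch of what a 2D launch looks like in code (the kernel name WhereAmI and the device buffer GPUOut are made-up names for illustration), packaging the dimensions in a dim3 variable is all that is needed; the hardware then supplies gridDim and blockIdx to every thread at no cost:

__global__ void WhereAmI(ui *Out)
{
	// linearize the hardware-supplied 2D block ID, just to demonstrate the built-in variables
	ui LinearBid = blockIdx.y * gridDim.x + blockIdx.x;    // 0 ... 166,655
	if (threadIdx.x == 0) Out[LinearBid] = LinearBid;
}
...
dim3 GridDim2D(768, 217);              // gridDim.x=768, gridDim.y=217
WhereAmI<<<GridDim2D, 256>>>(GPUOut);  // 166,656 blocks of 256 threads each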
7.2.2 Blocks
As I described before, blocks are the unit element of launch. The way a GPU program-
mer conceptualizes a program is that a giant task gets chopped up into blocks that can
execute independently. In other words, no block should depend on another block to
execute, because any such dependence will "serialize" the execution. Each block should
be so independent from the others in terms of resource requirements, execution, and result
reporting that you should be able to run block 10,000 and block 2 at the same time without
causing any problems. As another 2D case, block (56,125) and block (743,211) should be able
to run without requiring any of the others to have completed execution. Only then can you
take advantage of massive parallelism. If there is any dependency between, say, block
(something, something) and block (something+1, something) or in any other way, you are
hurting the very first requirement of massive parallelism:
☞ For the GPU to do its job (massive parallelism), a great responsibility
falls on the GPU programmer's shoulders: The GPU programmer
should divide the execution into a "massively independent" set of blocks.
☞ Each block should have no resource dependence with other blocks.
☞ Massively parallel execution is only possible with massively independent blocks.
☞ Any hint of dependence among blocks will "serialize" the execution.
Assuming that you launch your kernels using a 2D grid with dimensions 768 × 217, it is
as if you are launching 166,656 blocks in two nested for loops; these 166,656 blocks would
each have a unique block ID that is passed on to every single one of your kernels using
the blockIdx.x and blockIdx.y variables shown in Table 7.2. So, the for loops would
symbolically look like this:
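// A sketch of the idea in pseudo-code (these loops are performed by the GPU hardware,
// not written by the programmer):
for (blockIdx.y = 0; blockIdx.y < gridDim.y; blockIdx.y++)        // 0 ... 216
	for (blockIdx.x = 0; blockIdx.x < gridDim.x; blockIdx.x++)    // 0 ... 767
		... execute one block with this (blockIdx.x, blockIdx.y) pair ...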
Alternatively, if you decided to launch the blocks in a 1D grid, like we did in Section 7.1.3,
this would correspond to launching the blocks in a single for loop as follows:
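// A sketch of the idea in pseudo-code: a 1D grid of 166,656 blocks
for (blockIdx.x = 0; blockIdx.x < gridDim.x; blockIdx.x++)        // 0 ... 166,655
	... execute one block with this blockIdx.x ...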
7.2.3 Threads
You must have the intuition at this point that the block dimensions do not have to be
strictly 1D either. They can be 2D or 3D. For example, if you are launching 256 threads in
each block, you can launch them in a 2D thread array of size 16×16 or a 3D thread array
of size 8×8×4. In these cases, your thread IDs will range as follows:
• In a 1D thread array of size 256, blockDim.x=256, blockDim.y=1, blockDim.z=1,
Thread IDs range in threadIdx.x=0...255, threadIdx.y=0, threadIdx.z=0.
• In a 2D thread array of size 16×16, blockDim.x=16, blockDim.y=16,
Thread IDs range in threadIdx.x=0...15, threadIdx.y=0...15, threadIdx.z=0.
• In a 3D thread array of size 8×8×4, blockDim.x=8, blockDim.y=8, blockDim.z=4,
Thread IDs range in threadIdx.x=0...7, threadIdx.y=0...7, threadIdx.z=0...3.
In all three of these cases you launch 256 threads in each block. What changes — among
these three cases — is that instead of the single for loop I just showed, a 2D or 3D thread
array corresponds to executing the threads within two or three nested for loops, respec-
tively, and passing the for loop variables (threadIdx.x, threadIdx.y, and threadIdx.z) into
the kernel. This implies that the programmer doesn’t have to worry about the for loops
pertaining to the threads; this is the functionality of the GPU hardware. To put it in
different terms, you get free for loops. However, this doesn't mean that all of the work becomes
free: each kernel still has to check this big list of variables to see who it is.
Continuing our 2D grid example with x and y grid dimensions 768×217, a complete block
executes within the inner loop, having block IDs (blockIdx.x and blockIdx.y). Remember
that the execution of one block means the execution of every thread within that block. The
number of threads within each block is also in another parameter, blockDim.x, which is 256
in this specific case. Each one of these threads executes the same kernel function Vflip()
(shown in Code 6.7), so it is as if you are running the Vflip() function in, yet another, for
loop — with 256 iterations — as follows:
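// A sketch of the idea in pseudo-code: a 768 × 217 grid of blocks, each block holding
// an 8 × 8 × 4 (3D) arrangement of its 256 threads; all five loops are hardware-generated
for (blockIdx.y = 0; blockIdx.y < gridDim.y; blockIdx.y++)                      // 0 ... 216
	for (blockIdx.x = 0; blockIdx.x < gridDim.x; blockIdx.x++)                  // 0 ... 767
		for (threadIdx.z = 0; threadIdx.z < blockDim.z; threadIdx.z++)          // 0 ... 3
			for (threadIdx.y = 0; threadIdx.y < blockDim.y; threadIdx.y++)      // 0 ... 7
				for (threadIdx.x = 0; threadIdx.x < blockDim.x; threadIdx.x++)  // 0 ... 7
					Vflip(...);   // one thread executes the kernel body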
This pseudo-code demonstrates the scenario where the kernel is launched as a 2D array of
blocks, in which every block will have a 2D index. Furthermore, each block is composed
of a 3D array of 256 threads, arranged as an 8 × 8 × 4 array of 3D threads, each with
a 3D thread ID. Observe that in this scenario, we get five for() loops for free because
the Nvidia GPU hardware generates all five loop variables as it launches the blocks and
threads.
When we say our block size is 256 threads, we mean our block size is 8 warps. Warps are
the unit element of execution, as opposed to blocks, which are the unit element of launch.
Warps are always 32 threads. This argument makes it clear that just because we launched
the blocks in 256-thread clumps, it doesn't mean that they will be executed instantaneously, with all 256
threads executing and finishing in a flash. Instead, the block execution hardware inside the
GPU will execute them in 8 warps, warp0, warp1, warp2, ... warp7.
Although each warp has its own warp ID, we only worry about this ID if we are writing
low-level PTX assembly language, which makes the warp ID available to the kernel.
Otherwise, we simply worry about the blocks at the high-level
CUDA language. The significance of a warp will become clear when we go through some
simple PTX examples, however, in general, a programmer can conceptualize everything in
terms of blocks and the code will run fine.
Here are general guidelines for the structure of a GPU (or CPU) thread. We will look at
the details of the GPU kernels based on the following guidelines:
☞ Every thread of the multithreaded CPU and GPU code goes through
three stages of operation.
☞ These three stages are:
1. Who am I? This is where the kernel finds out about its own ID.
2. What is my task? This is where the kernel determines
which part of the data it is supposed to process, based on its ID.
3. Do it... and, it does what it is supposed to do.
__global__
void Vflip(uch *ImgDst, uch *ImgSrc, ui Hpixels, ui Vpixels)
{
ui ThrPerBlk = blockDim.x;              // number of threads per block, set at launch time
ui MYbid = blockIdx.x;                  // my block ID
ui MYtid = threadIdx.x;                 // my thread ID within this block
ui MYgtid = ThrPerBlk * MYbid + MYtid;  // my globally unique (linearized) thread ID
This is where the Vflip() kernel extracts its block ID, thread ID, and the ThrPerBlk value. Just
for clarity, the kernel is using the same variable name ThrPerBlk as the main(), but this is a
local variable to this kernel, so the name could have been anything else. Because our example
assumes that ≈41 M threads are being launched, the Vflip() function above represents only a
single thread out of these 41 M. Therefore, as its first task, this thread computes
which one of the 41 M threads it is; this global thread ID is placed in a variable named
MYgtid. This computation "linearizes" the thread index (or, alternatively, the thread ID),
departing from the block ID (in the MYbid variable) and the thread ID (in the MYtid variable).
2. What is my task? I am using the term linearized thread ID to refer to MYgtid.
After determining its MYgtid, Vflip() continues as follows:
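// Sketch of the "What is my task?" portion of Vflip() (reconstructed to match the
// VfCC20() variant shown later in this chapter; the original Code 6.7 may differ in details)
ui BlkPerRow = (Hpixels + ThrPerBlk - 1) / ThrPerBlk;  // ceil(Hpixels / ThrPerBlk)
ui RowBytes = (Hpixels * 3 + 3) & (~3);                // bytes in one padded image row
ui MYrow = MYbid / BlkPerRow;                          // which pixel row am I in?
ui MYcol = MYgtid - MYrow * BlkPerRow * ThrPerBlk;     // which pixel column am I in?
if (MYcol >= Hpixels) return;                          // useless thread: column out of range
ui MYmirrorrow = Vpixels - 1 - MYrow;                  // the row this pixel is flipped onto
ui MYsrcOffset = MYrow * RowBytes;
ui MYdstOffset = MYmirrorrow * RowBytes;
ui MYsrcIndex = MYsrcOffset + 3 * MYcol;               // source GPU memory address
ui MYdstIndex = MYdstOffset + 3 * MYcol;               // destination GPU memory address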
The concept of linearization refers to turning a 2D index into a 1D index here, as we saw
in Equation 6.2, where we computed the linear memory address of a pixel based on its x
and y pixel coordinates. The similarity here — in the case of MYgtid — is that the launch
pattern of the 41 M threads here is indeed 2D, with the blocks occupying the x dimension and
the threads in each block occupying the y dimension. Therefore, linearization in this case
allows the thread to determine a globally unique ID — MYgtid — among all 41 M threads,
something that cannot be determined from MYbid or MYtid alone.
In determining what its task is, Vflip() first determines which row (MYrow) and which
column (MYcol) pixel coordinates it is supposed to copy, as well as the mirroring row that
this will be copied onto (MYmirrorrow) with the same column index. After computing its
column index, it quits if it realizes that it is one of the useless threads, as I described in
Section 6.4.14. Next, it turns row and column information into source and destination GPU
memory addresses (MYsrcIndex and MYdstIndex). Note that it uses the ImgSrc and ImgDst
pointers that were passed onto this kernel after being allocated using cudaMalloc() inside
main().
3. Do it ... After computing the source and destination memory addresses, all that
is left is to copy a pixel’s three consecutive bytes from the source to the destination GPU
memory.
ImgDst[MYdstIndex] = ImgSrc[MYsrcIndex];
ImgDst[MYdstIndex + 1] = ImgSrc[MYsrcIndex + 1];
ImgDst[MYdstIndex + 2] = ImgSrc[MYsrcIndex + 2];
2. What is my task? Determining the start and end indexes is also extremely easy
for the CPU; it was just a simple formula to compute the part of the image that this
thread was responsible for, as follows:
will be launched, corresponding to gridDim.x=498,834. Each block will have IDs in the
range blockIdx.x=0...498,833. Let's look at the thread execution steps.
1. Who am I? Here is the PixCopy() kernel, originally shown in Code 6.9. The global
index is still computed, just like the other two kernels.
__global__
void PixCopy(uch *ImgDst, uch *ImgSrc, ui FS)
{
ui ThrPerBlk = blockDim.x;
ui MYbid = blockIdx.x;
ui MYtid = threadIdx.x;
ui MYgtid = ThrPerBlk * MYbid + MYtid;
2. What is my task? A range check is done in this step, although it is a little different;
this thread is useless only if its global thread ID falls outside the image. This
would only happen if the size of the image is not perfectly divisible by the number of threads
in a block. As an example, astronaut.bmp is 7918 × 5376 × 3=127,701,504 bytes and if we are
using 1024 threads per block, we would need 124,708.5 blocks; so, we would launch 124,709
blocks, wasting half a block (i.e., 512 threads). Although this is a minuscule amount as
compared to 121 MB, it still forces us to use a line of code to check whether this global thread ID
is out of range.
3. Do it ... Because each thread only copies a single pixel, the global thread ID (MYgtid)
is directly correlated to the source and destination GPU memory addresses. Actually, more
than that: it is the same address for both, making it unnecessary to compute any other index.
Therefore, the line that does the actual work is only a single line.
ImgDst[MYgtid] = ImgSrc[MYgtid];
}
All this line does is to copy a single byte from one part of GPU global memory (pointed to
by *ImgSrc) to another part of the GPU global memory (pointed to by *ImgDst).
The fact that the PixCopy() kernel was so much simpler than the other two should make
you scratch your head at this point. The questions you should be asking yourself are:
• Does the fact that PixCopy() was so much simpler mean that these simplifications can
be applied to all GPU kernels?
• In other words, was the complicated nature of the Vflip() and Hflip() kernels a fact of life
for GPUs or something that we have caused by choosing the design of parallelization
in Section 7.1?
• Does this, then, mean that we could have designed the Vflip() and Hflip() kernels much
better?
• Will the PixCopy() kernel work faster just because it looks simpler?
• If simple doesn’t mean faster, should we — ever — care about making code look
simpler or should we strictly care about the execution time?
• What would happen if we tried to copy only a single byte at a time in the Vflip() and Hflip()
kernels?
Phew! These are all excellent questions to ask. And, the answers will take a few chapters
to appreciate. For now, let me remind you of my golden rules in designing good code:
☞ A good programmer writes fast code.
☞ A really good programmer writes even faster code.
☞ An excellent programmer writes super fast code.
☞ An exceptional programmer doesn't worry about writing fast code;
he or she only worries about understanding the fundamental cause(s)
for low performance and redesigning the code to avoid them.
It is highly likely that he or she will end up writing the fastest code in the end.
The moral of the story is that your take-away from this book should be understanding
the causes of low performance by bombarding your brain with questions like the ones above.
When I showed some of the examples in this book to my students, I received reactions such
as “Wow, I could have done better, this code is really slow.” My answer is: Exactly! That’s
my point. In this book, I went to extremes to find a sequence of improvements that can
demonstrate a performance improvement at each step, which clearly implied starting with
fairly pathetic code! So, I did that. The most important thing I want to demonstrate in this
book is the incremental performance improvement achieved by adding/removing certain
lines. Just to let you know, the Vflip(), Hflip(), and PixCopy() kernels are so exceptionally
bad that I will write many improved versions of them in the upcoming chapters, each
demonstrating a healthy performance gain due to a particular architectural reason.
Your goal in reading this book should be to thoroughly — painfully and obsessively —
understand the reasons behind the improvements. Once you understand the reasons, you
have control over them and you can be an exceptional programmer per my comments above.
Otherwise, do not ever try to improve code without understanding the reasons for the im-
provement; more than likely, each one will have an architectural correlate. GPU program-
ming differs majorly from CPU programming in that nice architectural features
such as out-of-order execution and cache coherency among all of the L1$s are missing. This
pushes a lot of responsibility onto the shoulders of the programmer. So, if you do not under-
stand the underlying reasons for performance degradations, you will not be able to work
the 3000–5000 cores that you have in this beast to their fullest potential!
I plugged in the fancy Kepler GPU (GTX Titan Z, the one reported in Table 7.3 inside
Box V). Guess what? My Nvidia Control Panel still reported PCI Express 2.0. Of course, like
any other rational computer scientist or electrical engineer would, I panicked and rebooted
my PC multiple times. Nope! Still PCIe 2.0. What could be the problem? After my “denial”
period passed, I started digging through the Intel website just to find that the i7-3820
CPU [8] did not support PCIe 3.0. It simply does not have a PCIe 3.0 controller built into
it. It only supports PCIe 2.0; in other words, I got the worst of the three! I kept reading
and eventually found that only the Xeon E5-2680 or Xeon E5-2690 [13] support PCIe 3.0
on that specific motherboard. Although it is very expensive normally, I was able to find a
used Xeon E5-2690 at a quarter of the price and plugged it in. Xeon E5-2690 is an 8C/16T
CPU and I have been using it ever since.
The moral of the story here is that all three components have to support a specific PCIe
speed: (1) motherboard, (2) CPU, and (3) the GPU. If one of them supports anything less,
you get the worst of the three. Another example is my Dell laptop (shown as Box II in
Table 7.3). Although this laptop should support PCIe 3.0, my Nvidia control panel reports
PCIe 2.0. What could it be? (1) The motherboard should be PCIe 3.0 because Dell’s website
says so, (2) the CPU is i7-3740QM, which should support PCIe 3.0, and (3) the GPU, Nvidia
Quadro K3000M, is a Kepler with PCIe 3.0 support. What else could it be? There is, yet,
another possibility. The PCIe 3.0 support might be disabled in the BIOS.
Well, there will be another round of Googling on my end ...
There are many other things in the computer that you cannot control. In the end, we do not care too much about
the little differences between consecutive packets. All we care about is the "average" transfer
speed over a long term, say, hundreds of packets. Measuring a single packet's transfer speed
as we computed above (2.38 GBps) might cause a huge error over the long term. However, if
we know the total amount of time for 121 MB (say, 39.43 ms), we can calculate the transfer
throughput as (121 × 1024 × 1024 Bytes) / (39.43 × 10⁻³ seconds) ≈ 3 GBps. We observe that
we would be making a large error if we used only a single packet to measure the speed
(2.56 GBps). However, this final 3 GBps is a much more accurate throughput number because
all of the plus and minus errors average out to lose their influence over a longer interval.
Bandwidth of a data transfer medium (e.g., memory bus or PCI Express bus) is the
maximum throughput it supports. For example, the bandwidth of a PCI Express 2.0 bus is
8 GBps. This means that we cannot expect a throughput that is higher than 8 GBps when
transferring data over a PCIe 2.0 bus. However, generally the throughput we achieve will
be less because there are a lot of OS-dependent reasons that prevent reaching this peak.
Upstream bandwidth is the expected bandwidth in the CPU→GPU direction, whereas
Downstream bandwidth refers to the GPU→CPU direction. One nice feature of PCIe is
that it supports simultaneous data transfers in both directions.
TABLE 7.4 Introduction date and peak bandwidth of different bus types.

Bus Type                                  Peak Bandwidth   Introduction Date       Common Uses
Industry Standard Architecture (ISA)      < 20 MBps        1981                    8–16 b Peripherals
VESA Local Bus (VLB)                      < 150 MBps       1992                    32 b High-End Peripherals
Peripheral Component Interconnect (PCI)   266 MBps         1992                    Peripherals, Slow GPUs
Accelerated Graphics Port (AGP)           2133 MBps        1996                    GPUs
PCIe Gen1 x1                              250 MBps         2003                    Peripherals, Slow GPUs
PCIe Gen1 x16                             4 GBps           2003                    GPUs
PCIe Gen2 x1                              500 MBps         2007                    Peripherals, Slow GPUs
PCIe Gen2 x16                             8 GBps           2007                    GPUs
PCIe Gen3 x1                              985 MBps         2010                    Peripherals
PCIe Gen3 x16                             15.75 GBps       2010                    GPUs
Nvidia NVlink Bus                         80 GBps          April 2016              Nvidia GPU Supercomputers
PCIe Gen4 x1                              1.969 GBps       final specs expected    Peripherals
PCIe Gen4 x16                             31.51 GBps       in 2017                 GPUs
AGP is totally obsolete today, while legacy PCI is still provided on some motherboards.
Nvidia introduced the NVlink bus in mid-2016 for use in GPU-based supercomputers,
with almost 5× higher bandwidth than the then-available PCIe Gen3.
In general, PCIe bus is a huge bottleneck for GPUs, placing a significant limitation on the
CPU←→GPU data throughput. To alleviate this problem, PCIe standards have continuously
been improved in the past two decades, starting with the PCIe 1.0 standard in 2003, which
was intended to replace the then-standard Accelerated Graphics Port (AGP). The AGP standard
used a 32-bit bus, which transferred the data in 32-bit chunks at a time and achieved a
maximum throughput of ≈2 GBps. The PCIe standard reduced the bus width to a single
bit, which could work at 250 MBps (1 bit instead of the 32 bits found in AGP). Although this
intuitively sounds like a degradation of the standard, it is not: synchronizing 32 parallel bits is
highly susceptible to phase delays among the bits, which degrades the performance of this
parallel transfer. Alternatively, turning those 32 parallel bits into 32 individual single-bit
data entities allows us to send all 32 of them over separate PCIe lanes without worrying
about phase delays during the transfer, and to synchronize them at the receiver end. As
a result, PCI Express can achieve a much better data throughput.
A list of different bus types is provided in Table 7.4, showing their introduction dates
chronologically. As shown in Table 7.4, the PCIe concept has another huge advantage: the
PCIe x16 is downward compatible with x8, x4, and x1. Therefore, you can use different
peripherals — such as network cards or sound cards — on the same PCIe bus, without
requiring different standards for different cards. This allowed PCIe to take over all of the
previous standards such as AGP, PCI, ISA, and possibly more that I don’t even remember.
This serial-transfer structure of PCIe 1.0 allowed it to deliver a peak 4 GBps throughput on
16 lanes, beating the AGP. Sixteen PCIe lanes are denoted as “PCIe x16,” which is a typical
number of lanes used for GPUs. Slower cards, such as Gigabit NICs, use PCIe x1 or x4.
Further revisions of PCIe kept increasing the transfer throughput; PCIe 2.0 x16 was specified
at 8 GBps and PCIe 3.0 x16 at 15.75 GBps. What is coming in the future is the
PCIe 4.0 standard, which will allow a 2× higher throughput than PCIe 3.0. Furthermore,
Nvidia has just introduced its own NVlink bus, designed for high-end server boards
to eliminate the bottlenecks due to the PCIe standard. Starting with the Pascal family,
Nvidia offers both PCIe 3.0 and NVlink on high-end servers
housing Pascal GPUs.
The family names I am mentioning (Fermi, Kepler, Pascal) are different generations
of GPUs designed by Nvidia and I will discuss them in more detail in Section 7.7.1 and
Section 8.3. Within each family, there are different GPU engine designs; for example, GK
family denotes the Kepler family engines, GP is for Pascal engines, and GF is for Fermi
engines. In Table 7.3, Box I contains a Fermi architecture GPU, working on the PCIe 2.0
x16 bus, while Boxes II, III, V, and VI are Kepler engine GPUs, the former two working
on a PCIe 2.0 bus and the latter two working on a PCIe 3.0 bus. Box IV is the only Pascal
engine GPU, which works on a PCIe 3.0 bus. If we look at two different Kepler GPUs,
within Box V and Box VI, they have the GK110 and GK210 engines. Therefore, although
they belong to the same family, they could have significant performance differences, as we
will thoroughly study throughout the book.
For now, a clean observation can be made from Table 7.3 that the CPU→GPU and
GPU→CPU data transfer times are directly correlated with PCIe bus speeds. However, the
fact that there is an asymmetry between the two different directions for Box I and Box VI
is curious. Keep reading the book. The answers will eventually pop out.
• The GPU also has its own internal memory, which is called global memory and this
memory is connected to the cores with a bus that has a peak bandwidth of 336 GBps.
Vertical flipping took 4.24 ms — in its best case — to transfer 121 MB of data from
the global memory to the cores and from the cores back to a different area of global
memory; so, the total data transfer amount was 2 × 121 MB. This corresponds to a
transfer throughput of (2 × 121 MB) / (4.24 × 10⁻³ s) ≈ 57.39 GBps, which is significantly
lower than the 336 GBps peak (noted as 17% of peak in Table 7.3), suggesting that
there is major room for improvement.
One note to make here is that Box V in Table 7.3 shows a GTX Titan Z GPU, which is
really 2 GPUs in one GPU; it has a total of 2880 cores in each GPU, with a total of 5760
for the GTX Titan Z GPU card. The K80 GPU inside the Dell Server (Box VI) is designed
exactly the same way; 4992 total GPU cores, separated into two GPUs as 2×2496 cores.
The GTX Titan Z GPU card in Box V is connected to a single PCIe Gen3 slot; there are
two GPUs receiving and sending data through this single connection to the PCIe bus. To
generate the results in Table 7.3, I chose GPU ID = 0, thereby telling Nvidia that I would
like to use the first one of the two GPUs on that card. So, Table 7.3 can be interpreted as
if the results were obtained on a single GPU. One other important note from Figure 7.1
is the drastic difference in the sizes of the cache memory of the CPU versus GPU. Indeed,
GPU doesn’t even have an L3$. It only has an L2$, which is a tenth of the size of the CPU,
yet it feeds 2880 cores, rather than the six cores of the CPU. These quantities shouldn’t
be surprising. The L2$ of the GPU is significantly faster than the L3$ of the CPU and is
designed to feed the GPU cores at a speed that is much higher than the CPU’s bus speed.
Therefore, the VLSI technology can only allow Nvidia to design an architecture with a
1.5 MB of L2$.
The L3$ of the CPU and the L2$ of the GPU share the same functionality of Last
Level Cache (LLC). Typically, the LLC is the only cache that directly interfaces with the actual
memory of the device and is responsible for being the first line of defense against data
starvation. The LLC is designed for size — not for speed — because the more LLC you
have, the less likely you are to starve for data. Lower level cache memories are generally
built right into the cores. For example, in the case of the CPU, each core has a design with
FIGURE 7.1 The PCIe bus connects the host (CPU) and the device(s) (GPUs).
The host and each device have their own I/O controllers to allow transfers through
the PCIe bus, while both the host and the device have their own memory, with a
dedicated bus to it; in the GPU this memory is called global memory.
TABLE 7.5 Introduction date and peak throughput of different CPU and GPU memory types.

Memory Type                           Peak Throughput   Introduction Date   Common Uses
Synchronous DRAM (SDRAM)              < 2000 MBps       1993                CPU Main Memory,
Double Data Rate (DDR) SDRAM          3200 MBps         2000                Peripheral Card Memory,
DDR2 SDRAM                            8533 MBps         2003                Peripheral Device Memory
DDR3 SDRAM                            17066 MBps        2007
DDR4 SDRAM                            19200* MBps       2014
GDDR3                                 10–30 GBps        2004
GDDR5                                 40–350 GBps       2008                GPU Main Memory
GDDR5X                                300–500 GBps      2016
High Bandwidth Memory (HBM, HBM2)     500–2000* GBps    2016
DDRx family is used commonly in peripherals, as well as CPU main memory. Both the
DDRx memory and the GPU GDDRx family designs have advanced continuously over the
past two decades, delivering increasing peak throughputs.
32+32 KB L1$ and a 256 KB L2$. In the case of the GPU, we will see that a 64 KB or
96 KB L1$ is shared by quite a few cores, while the LLC is the L2$ you see in Figure 7.1
and an L3$ does not exist in any GPU in Table 7.3. As a summary, in both of the LLC
architectures, all that the architects care about is that when a core needs data it can find
that data without waiting for an extended period of time, which will hurt performance. We
will get deep into the details of the GPU internal architecture in Chapter 8.
A list of different CPU and GPU memory types is provided in Table 7.5, with their
introduction dates chronologically. We see that further generations of CPU memory designs
have offered increasing bandwidths, albeit at the expense of increased access latency, as I
initially pointed out in Section 4.3.4. On an alternate progression path, GDDR family GPU
memory designs have taken advantage of the advances in regular DRAM standards; for
example, GDDR5 design borrowed heavily from the DDR3 standard. Today, the advanced
GDDR5X standard is used in high-end Pascal GPUs, such as GTX1080, while the HBM2
standard is used in high-end GPU accelerators, such as the P100. Note that because the
DDR4 and HBM2 standards are still evolving, I put down tentative peak rates (indicated
with a *) in Table 7.5, which are not confirmed — but are reasonably accurate — numbers.
When you compile your code with CC 3.0, for example, you are guaranteeing that the
executable application (imflipG.exe in Windows and typically imflipG in Mac and Unix) can
only run with GPUs that support CC 3.0 or higher. A built-in feature of imflipG.cu queries
the GPU and outputs the supported highest CC. Looking at Table 7.3, we observe that
every GPU in this table supports CC 3.0 or higher (specifically, 3.0, 3.5, 3.7, and 6.1).
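On the command line, the chosen CC amounts to a compiler switch. For example, using standard nvcc flags (check the documentation of your CUDA version for the exact set of supported architectures):
nvcc -arch=sm_30 imflipG.cu -o imflipG
or, spelling out the virtual and real architectures explicitly,
nvcc -gencode arch=compute_30,code=sm_30 imflipG.cu -o imflipG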
• The I/O sub-system and the i7-4770K [9] CPU — in Box IV — do not offer anything
more than the other workstation-grade (i.e., non-Xeon) CPUs. Alternatively, the Dell
Server (Box VI) performs somewhat better in certain cases, thanks to the improved
I/O throughput of a Xeon-based system. This Xeon-driven improved I/O speed is also
noticeable in Box III, which includes a Xeon W3690 CPU [15].
• Box IV reaches a much better relative global memory throughput (≈ 40% of the band-
width), as compared to the execution on the Kepler family GPUs, which reach only
≈13–19% of their bandwidth. This is due to the architectural improvements inside
the Pascal GPU that allow legacy CC 3.0 instructions (such as byte access) to be
executed with much better efficiency, whereas the older generations did not perform
well when the data access size was not the natural 32-bits.
• Of course, there always has to be a case that requires additional explanation. Box I
was able to reach 31–37% of its bandwidth. Why? This doesn't disprove my discussion
about the GK versus GP engines. If you look carefully at the bandwidth of Box I's
GPU (GT 640), it is a mere 28.5 GBps, which is nearly a tenth of the Pascal GPU's
256 GBps. So, when it comes to the discussion about achieving a percentage of the
global memory bandwidth, we should be fair to the GPUs that have a much higher
bandwidth. So, for now, it does not make sense to focus on the percentage for Box I.
• Speaking in absolute numerical terms, running the same code using the same CC 3.0,
low-end Kepler engines (in Boxes I and II) achieved a global memory throughput of
≈10 GBps, higher-end Kepler engines (in Boxes III, V, and VI) achieved ≈30–60 GBps,
while the newest generation Pascal GPU (Box IV) achieved ≈110 GBps.
To summarize:
☞ When compiling your GPU code, choose the lowest Compute Capability (CC)
that will run at a satisfactory performance.
☞ Nvidia GPU families are: Fermi, Kepler, Maxwell, Pascal, and Volta.
☞ They support CC 2.x, 3.x, 5.x, 6.x, and 7.x, respectively.
☞ For example, if you choose CC 3.0, you are restricting your executable to:
"Kepler or higher" engines. Similarly, 6.0 means "Pascal or higher."
☞ If you choose, say, 3.0 on a Pascal GPU, you won't be able to take
advantage of the additional instructions introduced after CC 3.0 (up to 6.x).
However, the code will most probably take advantage of the
architectural improvements built into the Pascal family.
Programs such as imflipG.cu also display the maximum number of blocks supported when executed. For example, the GT630
GPU with a GF108 Fermi engine supports CC 2.1 and 65,535 blocks in the x dimension, but
does not support a total of more than ≈190 K blocks for the entire kernel launch. Running
our imflipG.cu on a GT630 would crash and quit, because, for example, from Table 7.1, we
see that we need more than 300 K blocks launched for certain options.
The simplest workaround for this is to add another loop around the kernel launch. In
other words, we can restrict the number of blocks (NumBlocks) to something like 32,768 and
launch multiple kernels to execute the same exact code. For example, if we need to launch
166,656 blocks (example in Section 7.3), we launch 6 different kernels, first 5 with 32,768
blocks (5 × 32,768 = 163,840) and the last one with 2816 blocks (163,840 + 2816 = 166,656).
So, in addition to the block IDs that Nvidia will assign each block at runtime, we will also
have to use another ID, let’s say, loop ID.
An alternative to this would be to use a “real” dimension that Nvidia supports, such as
the y dimension of the grid (controlled by the gridDim.y and blockIdx.y variables) or the
z dimension (gridDim.z and blockIdx.z variables), as I explained in Section 7.2.1. We see
from Table 7.2 that a 3D grid of blocks is allowed in CC 2.0 and above. The only problem
is that the product of these three grid dimensions, i.e., the total number of blocks that can
be launched with each kernel is limited to ≈190 K. This upper limit might differ among
different cards. So, we are, in a sense, emulating the fourth dimension of the grid with the
loop ID that surrounds the kernel launch, as we will see shortly in Section 7.7.4. The GPU
is queried in imflipG.cu to determine this upper limit as follows:
cudaGetDeviceProperties(&GPUprop, 0);
SupportedKBlocks = (ui)GPUprop.maxGridSize[0] * (ui)GPUprop.maxGridSize[1] *
                   (ui)GPUprop.maxGridSize[2] / 1024;
SupportedMBlocks = SupportedKBlocks / 1024;
☞ When I started teaching my GPU classes in 2011, I used the GTX480 cards.
GTX480 cards have a Fermi GF100 engine.
☞ Fermi block dimension limit is 2^16 − 1 = 65,535, which is a very ugly number.
65,536 (2^16) would be great, but 65,535 was a disaster!
65,535 is not a power of 2, and doesn't work well with anything!
So, my students continuously resorted to using 32,768 blocks.
Programmers of Kepler should appreciate not having that limitation.
☞ Kepler block dimension limit is 2^31 − 1 ≈ 2048 M.
You will see this reported in Figure 6.10.
With Kepler and above, you never have to emulate that extra dimension.
The newly designed program — which uses a loop to emulate the additional dimension —
is called imflipG2.cu. Because it is only for experimental purposes, I designed it to only work
with the ’V’ and ’C’ command line arguments; it does not support the ’T’ or ’H’ command
line options. As of 2017, the year of the publication of this book, every Nvidia GPU in the
market is Kepler or above, so there is no point in talking about CC 2.0 any further than
this section. For the rest of the book, I will be focusing on CC 3.0 and higher. But, let’s
run the code using CC 2.0 to satisfy our curiosity.
• The computation of the NumBlocks is left intact. For the ’C’ and ’V’ options, they
are calculated exactly the same way as Code 6.3. NumBlocks would be computed as
166,656 for the ’V’ option (as in Section 6.4.14) and 498,876 for the ’C’ option (as in
Section 6.4.18).
• However, this number — NumBlocks — is not used in the kernel launch; instead, it
is used to compute the number of loops needed (NumLoops) around the kernel launch.
The “ceiling” function (CEIL) ensures that the last loop (potentially with < 32,768
blocks) is not forgotten.
NumLoops = CEIL(NumBlocks,32768);
• Here, the PxCC20() kernel is the CC 2.0 version of the PixCopy(), shown in Code 7.2.
// Copy kernel with small block sizes (32768). Each thread copies 1 byte
__global__
void PxCC20(uch *ImgDst, uch *ImgSrc, ui FS, ui LoopID)
{
ui ThrPerBlk = blockDim.x;
ui MYbid = (LoopID * 32768) + blockIdx.x;
ui MYtid = threadIdx.x;
ui MYgtid = ThrPerBlk * MYbid + MYtid;
if (MYgtid >= FS) return;	// outside the allocated memory (valid indices run 0...FS-1)
ImgDst[MYgtid] = ImgSrc[MYgtid];
}
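On the host side, the loop that surrounds these kernel launches can be sketched as follows (a sketch, not the exact imflipG2.cu code; CEIL(a,b) is assumed to be the usual ((a)+(b)-1)/(b) ceiling-division macro, and GPUCopyImg, GPUImg, and IMAGESIZE are placeholder names):

NumLoops = CEIL(NumBlocks, 32768);
for (ui L = 0; L < NumLoops; L++) {
	// launch a fixed 32,768 blocks per iteration; the kernel's own range checks make
	// the surplus blocks of the last iteration return without doing any work
	PxCC20<<<32768, ThrPerBlk>>>(GPUCopyImg, GPUImg, IMAGESIZE, L);
}
cudaDeviceSynchronize();   // wait until every launched kernel has finished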
// Vertical flip kernel that works with small block sizes (32768)
// each thread only flips a single pixel (R,G,B)
__global__
void VfCC20(uch *ImgDst, uch *ImgSrc, ui Hpixels, ui Vpixels, ui LoopID)
{
ui ThrPerBlk = blockDim.x;
ui MYbid = (LoopID * 32768) + blockIdx.x;
ui MYtid = threadIdx.x;
ui MYgtid = ThrPerBlk * MYbid + MYtid;
ui BlkPerRow = (Hpixels + ThrPerBlk - 1) / ThrPerBlk; // ceil
ui RowBytes = (Hpixels * 3 + 3) & (~3);
ui MYrow = MYbid / BlkPerRow;
ui MYcol = MYgtid - MYrow*BlkPerRow*ThrPerBlk;
if (MYcol >= Hpixels) return; // col out of range
if (MYrow >= Vpixels) return; // row out of range
ui MYmirrorrow = Vpixels - 1 - MYrow;
ui MYsrcOffset = MYrow * RowBytes;
ui MYdstOffset = MYmirrorrow * RowBytes;
ui MYsrcIndex = MYsrcOffset + 3 * MYcol;
ui MYdstIndex = MYdstOffset + 3 * MYcol;
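// reconstructed closing lines, mirroring Vflip(): copy the pixel's 3 bytes (R,G,B)
ImgDst[MYdstIndex] = ImgSrc[MYsrcIndex];
ImgDst[MYdstIndex + 1] = ImgSrc[MYsrcIndex + 1];
ImgDst[MYdstIndex + 2] = ImgSrc[MYsrcIndex + 2];
}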
Note the computation of MYbid, which combines the LoopID-based dimension and the actual
blockIdx.x-based dimension into a single dimension; the global thread ID is then computed
from it, using the variable MYgtid.
We notice that the bottom check (on MYrow) was unnecessary in the Vflip() kernel because
we were launching precisely as many blocks as needed to cover the image (BlkPerRow blocks
for each of the Vpixels rows). However, in VfCC20(), we are launching the blocks in chunks
of 32,768, which can very well leave a residual that makes the last chunk of blocks try to
access a memory area beyond the image.
• To exemplify this, assume that we are vertically flipping the astronaut.bmp using the
VfCC20() kernel. This image requires launching 166,656 blocks, which corresponds to
⌈166,656 / 32,768⌉ = 6 loops, launching 32,768 blocks within each loop iteration. So, there
will actually be a total of 32,768 × 6 = 196,608 blocks launched. The useful
blocks will have block IDs (MYbid) in 0...166,655, while the useless blocks will have block IDs
in 166,656...196,607. The extra MYrow check line prevents the useless blocks from
executing.
• This means that the number of wasted blocks is quite high (29,952 to be exact), which
will reduce the program’s performance.
• Of course, you could have tried to launch a much smaller number of blocks within
each loop iteration, such as 4096, which would have reduced the number of wasted
blocks. I am leaving it up to the reader to pursue this if needed. I am not elaborating
on this any further because all of the existing GPUs in the market today support CC
3.0 or higher, making this a non-issue in the new Nvidia generations.
• Yet another possibility is to launch a variable number of blocks within each kernel to
reduce the number of wasted blocks down to almost zero. For example, for 166,656,
you could have used 6 loops with 27,776 blocks in each iteration, which would have
dropped the number of wasted blocks down to zero (not almost, but exactly zero).
If you can guarantee that there are no wasted blocks, there is no need for the range
check on the MYrow variable.
• The story goes on and on ... Could you have even eliminated the range check on the
MYcol variable? The answer is: YES. If you did not have any wasted blocks in the
column dimension, there wouldn’t be a need for even the first check. Say you did
that. Then, potentially, the computation of the indexes might get a little more com-
plicated. Is it worth complicating a part of the code to make another part less
complicated?
• Welcome to the world of choices, choices, choices... CUDA programming is not
all about strict rules. There are many ways you can get the same functionality
from a program. The question is which one works faster, and which one is more
readable?
☞ You will realize that in CUDA, there are many choices of parameters,
each with pros/cons. Which option should you choose?
Here are a few rules that might help:
☞ If there are two options with the same final performance,
choose the one that is easy to understand; simple is almost always better.
Difficult-to-understand code is prone to bugs.
☞ If only highly sophisticated techniques will result in high performance code,
document it extremely well, especially the non-obvious parts.
Don't do it for others, do it for yourself! You will look at it later.
TABLE 7.6 Results of the imflipG2.cu program, which uses the VfCC20() and PxCC20()
kernels and works in Compute Capability 2.0.

Feature        |----------------- Box I -----------------|   Box II      Box III
CPU                           i7-920                         i7-3740QM   W3690
C/T                           4C/8T                          4C/8T       6C/12T
Memory                        16GB                           32GB        24GB
BW GBps                       25.6                           25.6        32
GPU            GT520      GT630      GTX550Ti   GT640        K3000M      GTX 760
Engine         GF119      GF108      GF116      GK107        GK104       GK104
Cores          48         96         192        384          576         1152
Compute Cap    2.1        2.1        2.1        3.0          3.0         3.0
Global Mem     0.5GB      1GB        1GB        2GB          2GB         2GB
Peak GFLOPS    155        311        691        691          753         2258
DGFLOPS        –          –          –          29           31          94
Data transfer speeds & throughput over the PCI Express bus
CPU→GPU ms 328.29 39.19 39.18 38.89 52.28 34.38
GBps 0.37 3.11 3.11 3.13 2.33 3.54
GPU→CPU ms 319.42 39.07 39.36 39.66 52.47 36.00
GBps 0.38 3.12 3.09 3.07 2.32 3.38
PCIe Bus Gen2 x1 Gen2 Gen2 Gen2 Gen2 Gen2
BW GBps 0.5 8.00 8.00 8.00 8.0 8.0
Achieved (%) (76%) (39%) (39%) (39%) (29%) (44%)
VfCC20() kernel run time (ms) ’V’ command line option
V 32 220.55 110.99 43.34 72.67 63.98 20.98
V 64 118.88 59.29 23.37 39.25 34.22 11.30
V 128 72.68 35.43 14.55 24.04 21.02 7.04
V 256 69.19 34.61 14.70 25.88 22.54 7.56
V 512 70.66 35.03 14.77 28.52 23.80 7.94
V 768 73.31 36.04 15.20 40.77 36.39 11.86
V 1024 124.00 62.31 25.49 36.23 31.05 10.34
GM BW GBps 14.4 28 99 28.5 89 192
Achieved GBps 3.52 7.04 16.74 10.13 11.59 34.59
(%) (24%) (25%) (17%) (36%) (13%) (18%)
PxCC20() kernel run time (ms) ’C’ command line option
C 32 356.30 186.36 69.03 102.39 86.68 27.42
C 64 179.82 93.79 34.56 51.67 43.05 13.69
C 128 97.20 49.31 18.23 27.18 22.48 7.28
C 256 69.54 35.70 13.31 28.04 23.48 7.67
C 512 74.03 37.71 13.84 29.16 24.37 7.88
C 768 84.63 42.32 15.69 42.88 35.22 11.38
C 1024 116.93 60.34 22.59 31.44 25.86 8.34
GM BW GBps 14.4 28 99 28.5 89 192
Achieved GBps 3.50 6.82 18.30 8.96 10.84 33.45
(%) (24%) (24%) (18%) (31%) (12%) (17%)
All Fermi GPUs are tested only on Box I (listed in Table 7.3). The astronaut.bmp image
was used with ’V’ and ’C’ options and different block sizes (32..1024).
lines, but printf() has a different place in every old school programmer's heart. Good news:
printf() — along with a bunch of other old school concepts — will be our best friend
in the CUDA world too. What I will show you in this section actually catches a surprisingly
good number of bugs. Of course, I will show you the "new school" tools too, like the fancy
schmancy CUDA profiler named nvprof (or the even-fancier GUI version nvvp), but the
old school concepts are surprisingly powerful because they can get you to debug your code
much faster, rather than going through the entire process of running nvvp, blah blah blah.
Before we look at how to old school debug, let’s look at what type of bugs are common:
☞ If they found life on a new planet and wanted to let the new planet's
folks know about the most common type of a computer bug on earth,
and due to a low-bandwidth satellite connection, I was allowed
one — and only one — word to let them know, I would say pointers.
☞ Yes, you can survive all sorts of bugs, but bad memory pointers will kill you!
Yes, pointers are a giant problem in C; yet, they are one of the most powerful features
of C. For example, there are no explicit pointers in Python, which is a rebel against the
pointer problems in C. However, you can’t live without pointers in CUDA programming.
Aside from bad-pointer-bugs, there are other common bugs, too; I am listing a bunch of
them in the following section. I will show you how old school debugging can eliminate a good
portion of them. Remember, this section is strictly about CUDA programming. Therefore,
our focus is on CUDA bugs and how to debug them in a CUDA environment, which means
either using Nvidia’s built-in tools or our old school concepts that utilize simple functions,
etc. and work everywhere. Even more specifically, we will not focus on problems within your
CPU code in a typical CUDA program. We will strictly focus on the bugs inside the CUDA
kernels. Any bug inside the actual CPU code can be fixed with the old school debugging
tools in Section 1.7.2 or the nice CPU debugging tools like gdb and valgrind, which I showed
in Section 1.7.1 and Section 1.7.3, respectively.
// inside main()
unsigned char *ImagePtr;
cudaMalloc((void **)&ImagePtr, IMAGESIZE);   // allocate IMAGESIZE bytes in GPU global memory
// inside the GPU kernel
for(a=0; a<IMAGESIZE; a++)  { *(ImagePtr+a)=76; ... }   // OK: stays within the allocation
for(a=0; a<=IMAGESIZE; a++) { *(ImagePtr+a)=76; ... }   // BUG: steps one byte past the allocation
The main() part of the program allocates memory in the GPU global memory through
the use of the CUDA API function cudaMalloc(), and the two for loops are executed
within the CUDA kernel and attempt to access the GPU global memory. The bottom
for loop will completely crash your program because it is stepping one count beyond
the image that is stored in the GPU global memory. This will be caught by the
Nvidia runtime and will totally terminate our CUDA application. The top for loop
is perfectly within the range of the image memory area, so it will work totally fine.
• Incorrect array indexes are nothing different than bad memory pointers. An array
index is just a convenient short-hand notation for the underlying memory pointer
computation. Incorrect index computations have exactly the same effect as incorrect
memory accesses. Check out the example below:
int SomeArray[20];
for(a=0; a<20; a++) { SomeArray[a]=0; ... }
SomeArray[20]=56;
The top for loop will run perfectly fine, initializing the entire array to zero (from index
0 to index 19). However, the bottom assignment will crash, because SomeArray[20] is
outside the array, which spans indexes SomeArray[0]...SomeArray[19].
• Infinite loops are loops with messed up loop variables, having an incorrect termina-
tion condition. Here is a quick example:
int y=0;
while(y<20){
SomeArray[y]=0;
}
Where is the part that updates y? The programmer intended to create a loop that
initializes the array, but forgot to put in a line to update the y variable. There should
have been a line that reads y++; after SomeArray[y]=0; otherwise the loop condition
(y<20) stays true forever and the loop never terminates.
• Uninitialized variable values are also a common bug, resulting from declaring
a variable and not initializing it. As long as such variables are assigned a value
before they are used, you are fine. But, if you use them before initializing them, more
than likely, they will crash your program. Although the source of the problem is an
uninitialized variable, it is usually a consequence of it — such as a bad
index or a pointer — that actually causes the crash. Here is an example below:
int SomeArray[20];
int x, a=0;
int b, c;
for(x=19; x>=0; x--) { SomeArray[x-a]=0; ... }   // a is initialized: index stays in 19...0
for(x=19; x>=0; x--) { SomeArray[x-b]=0; ... }   // b is uninitialized: index may go out of range
for(x=19; x>=0; x--) { SomeArray[x]=x/c; ... }   // c is uninitialized: possible division by zero
The top for loop will never exceed the index range of [19...0], while the second for
loop might or might not. You can never assume that any variable you
declared will have a particular initial value (0 or anything else). At runtime, the memory area
for variable a is created and the value 0 is explicitly written into that memory address.
Alternatively, the memory area for variable b is created and nothing is written to that
area, thereby leaving whatever value was there in the computer memory before the
allocation of the variable. This value can be anything; if we assume that it was 50,
as an example, it is clear that it will make you exceed the index range. The third for
loop is a likely candidate to give you a division by zero error because the value of the
c variable is not initialized and can very well be zero.
int a=5;
int b=7;
int d=20;
if(a=b) d=10;
What is the final value of d? Although the programmer meant to type if(a==b), there is
now a bug. The way it is written, this translates to the following lines:
int a=5;
int b=7;
int d=20;
a=b;
if(a) d=10;
In other words, if(a=b) means "set a to b's value, and if the result is TRUE ..." In the C
language, any non-zero value is treated as TRUE and zero
is treated as FALSE. Therefore, the value that a received (7) is non-zero, i.e., TRUE, forcing
the execution of the follow-up statement d=10; and producing a wrong result. Let's
see how we can debug these using old school CUDA debugging.
Let's start with the return trick, taking the PixCopy() kernel (Code 6.8) as an example; it is
repeated below.
__global__
void PixCopy(uch *ImgDst, uch *ImgSrc, ui FS)
{
ui ThrPerBlk = blockDim.x;
ui MYbid = blockIdx.x;
ui MYtid = threadIdx.x;
ui MYgtid = ThrPerBlk * MYbid + MYtid;
if (MYgtid >= FS) return;	// outside the allocated memory (valid indices run 0...FS-1)
ImgDst[MYgtid] = ImgSrc[MYgtid];
}
Imagine now that you didn’t have the if()... line in your code. It would crash because
of accessing a tiny bit outside the image area. To analyze this precisely, let’s remember
the example numbers from Section 6.4.18. For the astronaut.bmp image, our image size
was 127,712,256 Bytes and when we launched the PixCopy() kernel with 256 threads/block,
we ended up launching 498,876 blocks. Each thread copies a single byte, so the PixCopy() kernel
threads accessed addresses from 0 to 127,712,255, staying perfectly within the global memory
address range. Therefore, we didn't even need the if()... line if we knew that this program
would be used strictly with the astronaut.bmp image and always with 256 threads/block.
What happens when we use a different image, say, one that is 1966 × 1363 pixels, i.e.,
1966 × 1363 × 3 = 8,038,974 Bytes? Let's also assume that we want to launch it with
1024 threads/block. We would need to launch ⌈8,038,974 / 1024⌉ = 7851 blocks. This would
launch 7851 × 1024 = 8,039,424 threads for your entire
CUDA application. If we didn’t have the if()... statement, threads with gtid values in the
range (0...8,038,973) would access perfectly allowed global memory areas. However, threads
with gtid values in the range (8,038,974...8,039,423) would access an unauthorized memory
range, thereby crashing your CUDA application.
How could you debug this? All of the assignments look pretty harmless, although there
is a single line that has a memory access. Remember what I said at the beginning of
Section 7.9: be suspicious of the pointers before anything else. Assuming that your initial
code looked like this without the if()... statement,
__global__
void PixCopy(uch *ImgDst, uch *ImgSrc, ui FS)
{
...
ui MYgtid = ThrPerBlk * MYbid + MYtid;
ImgDst[MYgtid] = ImgSrc[MYgtid]; // suspicious !!!
}
You could insert the return statement right before the memory access, as follows:
__global__
void PixCopy(uch *ImgDst, uch *ImgSrc, ui FS)
{
...
ui MYgtid = ThrPerBlk * MYbid + MYtid;
return; // this line skips the execution of the line below (DEBUG)
ImgDst[MYgtid] = ImgSrc[MYgtid]; // suspicious !!!
}
and determine that the problem lies with that last line of the kernel. After a few itera-
tions, you would realize that you are accessing an unauthorized memory range. Our idea
is that if we place a return statement just before the suspected line, we are skipping its
execution. If this fixes the problem, then we analyze its cause deeper. But, let’s not kid
ourselves; this problem would only happen in cases where your image size has exactly the
right value (more like exactly the wrong value) to make your number of threads go beyond
the number of bytes in the image. So, this could almost be considered in the sneaky bug
category.
A related old school trick is to comment out the suspected lines one at a time; for example, in
the Hflip() kernel below, two of the three pixel copy lines are commented out to isolate which
memory access is causing the trouble:
__global__
void Hflip(uch *ImgDst, uch *ImgSrc, ui Hpixels)
{
...
// swap pixels RGB @MYcol , @MYmirrorcol
//ImgDst[MYdstIndex] = ImgSrc[MYsrcIndex]; COMMENTED
//ImgDst[MYdstIndex + 1] = ImgSrc[MYsrcIndex + 1]; COMMENTED
ImgDst[MYdstIndex + 2] = ImgSrc[MYsrcIndex + 2];
}
You are trying to narrow down the range of bugs you could have. At this point, an experi-
enced programmer would start smelling a memory pointer bug. If this theory is correct, the
code should work for certain values of MYgtid and crash for others. The best thing to do is
to print the value of MYgtid only if it has reached a certain value. Something like this:
__global__
void PixCopy(uch *ImgDst, uch *ImgSrc, ui FS)
{
...
ui MYgtid = ThrPerBlk * MYbid + MYtid;
// this line is for debugging. Suspecting a memory pointer bug
if(MYgtid==1000000) printf("MYgtid=%u\n",MYgtid); ////DEBUG
ImgDst[MYgtid] = ImgSrc[MYgtid];
}
I inserted the conditional printf() line that will print the output “MYgtid=1000000” only
if MYgtid has reached that value. If the program crashes before printing that line, you know
that the problem is when MYgtid < 1,000,000, or vice versa. For a 121 MB image, you could
start with 100 million and if it works increase it immediately to 120 million, followed by the
size of the image minus one. At some point it will crash and will give you the gotcha clue.
A few final words of wisdom from an old school debugger:
• As you see above, when I add debug lines into my code, I make them salient, with annoying notes (like DEBUG) and comment lines. I also indent them differently to be able to tell them apart from the actual code. This is because it is common to go through a long and tedious debug session, forget some debug-only junk code in there, and realize later that some of your future bugs are due to that extra code you forgot to remove.
• The last thing you want is for your debugging to inject bugs into your code that didn't exist before you started debugging.
• Another piece of advice: DO NOT DELETE CODE during debugging. If you are debugging a line of code, simply make a copy of that line and keep modifying the copy. If the original line turns out to be fine, uncomment it and move on.
• Nothing is more frustrating than deleting a line during debugging and not being able to get it back, because you never made a backup of that line. If you made a copy of the line, you can modify the copy all you want; the original is safe.
7.10.2.1 Attention
Attention is a limited resource in your brain. Unfortunately, after millions of years of evolution, our brain still works like a single-threaded CPU: it can focus heavily on only one thing at any point in time. The conscious part of our brain, described in Section 7.10.1, is solely responsible for our attention; it performs the heavy processing during code development and especially debugging. If your attention is diverted somewhere else during code development, your programming and debugging performance will fall off a cliff.
is responsible for releasing the melatonin hormone to prepare you for sleep and increasing
cortisol levels when it is time to wake you up. During sleep, your ATP levels are replenished
and you are ready to burn them back to Adenosine again. The only remedy for sleep
deprivation is to sleep! There is no way to fight the brain! Just get a good night’s sleep and
you will be the best debugger again tomorrow.
Understanding GPU Hardware Architecture
In the previous two chapters we looked at the structure of a CUDA program, learned how to edit, compile, and run a CUDA program, and analyzed the performance of the compiled executable on different generations of GPUs, which have different Compute Capabilities
(CCs). We noted that a CPU is good for parallel computing, whose building block is the thread; it is usual to expect CPU-based parallel programs to execute 10, 20, 100, or even 1000 threads at any point in time. However, a CUDA program (more generally, a GPU
program) is suitable for problems that can take advantage of massively parallel computing,
which implies the execution of hundreds of thousands or even millions of threads at a time.
To allow the execution of such an enormous number of threads, GPUs had to add two
additional hierarchical organizations of threads:
1. A Warp is a clump of 32 threads, which is really the minimum number of threads you
can break your tasks down to; in other words, nothing less than 32 threads executes
in a GPU. If you need to execute 20 threads, too bad, you will be wasting 12 threads,
because a GPU is not designed to execute such a “small” number of threads.
2. A Block is a clump of 1 to 32 warps. In other words, each block consists of 1, 2, 3, ..., or 32 warps, corresponding to anywhere from 32 to 1024 threads. A block must be designed as an isolated set of threads that can execute independently from the other blocks. If you design your program to have such clean separation (i.e., independence) between blocks, you will achieve blazing parallelism and take full advantage of the GPU.
Clearly, not every problem is amenable to massive parallelism. Image processing, or
more generally, digital signal processing (DSP) problems are natural candidates for GPU
massive parallelism, because (1) you apply the same exact computation to different pixels
(or pixel groups), where one pixel can be computed independently from another, and (2)
there are typically hundreds of thousands or millions of pixels in an image we encounter in
today’s digital world. Following this understanding from the previous CUDA chapters, in
this chapter, we now want to understand how the GPU achieves this parallelism in hardware.
In this chapter, we will introduce the GPU edge detection program (imedgeG.cu) and
run it to observe its performance. We will relate this performance to the building blocks
of the GPU, such as the GPU cores and streaming multiprocessors (SM), which are the
execution units that house a bunch of these GPU cores. We will also study the relationship
among SM, GPU cores, and the things we have just learned, thread, warp, and block over
the course of a few chapters, during which we will learn the CUDA Occupancy Calculator,
which is a simple tool that tells us how “occupied” our GPU is, i.e., how busy we are keeping
it with our program. This will be our primary tool for crafting efficient GPU programs. The
good news is that although writing efficient GPU programs is an art as much as a science,
we are not alone! Throughout the following few chapters, we will learn how to use a few
such useful tools that will give us a good view of what is going on inside the GPU, as well
as during the transfers between the CPU and the GPU. It is only with this understanding
of the hardware that a programmer can write efficient GPU code.
FIGURE 8.1 Analogy 8.1 for executing a massively parallel program using a signifi-
cant number of GPU cores, which receive their instructions and data from different
sources. Melissa (Memory controller ) is solely responsible for bringing the coconuts
from the jungle and dumping them into the big barrel (L2$). Larry (L2$ controller )
is responsible for distributing these coconuts into the smaller barrels (L1$) of Laura,
Linda, Lilly, and Libby; eventually, these four folks distribute the coconuts (data)
to the scouts (GPU cores). On the right side, Gina (Giga-Thread Scheduler ) has the
big list of tasks (list of blocks to be executed ); she assigns each block to a school bus
(SM or streaming multiprocessor ). Inside the bus, one person — Tolga, Tony, Tom, and Tim — is responsible for assigning the tasks to the scouts (instruction schedulers).
every single block has been completed, which will be a lot later than when Gina finishes
her assignments. If another kernel is launched with a different number of blocks, Gina’s job
is to assign those to the buses, even before the execution of the previous ones is done.
A very important note here is that Gina is also responsible for assigning a block ID to each
block that she assigns (from 0 to 1999 in this specific case). It is important to note that out
of all of the variables in Table 7.2, Gina’s responsibility is to make a note of the gridDim,
blockIdx, and blockDim variables on the papers she is preparing. In other words, it is the
Giga Thread Scheduler that passes these variables onto the GPU cores that will execute the
corresponding blocks. In contrast, the assignment of the threadIdx variables has nothing to do with Gina; they will end up being the responsibility of Tolga (and his brothers) when they assign tasks to the scouts.
FIGURE 8.2 The internal architecture of the GTX550Ti GPU. A total of 192 GPU
cores are organized into six streaming multiprocessor (SM) groups of 32 GPU cores.
A single L2$ is shared among all 192 cores, while each SM has its own L1$. A ded-
icated memory controller is responsible for bringing data in and out of the GDDR5
global memory and dumping it into the shared L2$, while a dedicated host interface
is responsible for shuttling data (and code) between the CPU and GPU over the
PCIe bus.
L2$ combined. This is the reason there are only two types of cache memory in the GPU: L1$ is inside the SM and L2$ is shared among all SMs.
and make Tolga and Laura do a little more work. This might potentially make things a tiny
bit slower in each school bus, but saves a lot of resources, which in turn allows us to get
more work done as a result, and this is all we care about.
With each new generation of Nvidia GPUs, their Parallel Thread Execution (PTX)
Instruction Set Architecture (ISA) evolved. You can think of the new ISAs as follows: in
new generation GPUs, the scouts in each school bus got more experienced and they had new
instructions to peel the coconuts more efficiently, maybe using more advanced tools. This allowed them to get things done faster by doing more work with each instruction, on average.
There is a direct correlation between the PTX ISA and the Compute Capability (CC), as
we saw in Section 7.7. Each new CC effectively requires a new ISA. So, every generation of
Nvidia GPUs not only introduced a new CC, which is what the programmer cares about,
but also a new PTX ISA. Now, let us study each generation. A summary of some key
parameters for each generation is provided in Table 8.1.
compute. Some accelerators had only a single monitor output (e.g., C2075), while some of the earlier models (e.g., C1060) had no monitor output at all. Nvidia still continues to offer its products in these two categories, although its higher-end GPUs started offering higher double-precision performance.
will be written to disk as the resulting (processed) image. It is easier to refer to their sizes
by choosing an example image, e.g., astronaut.bmp, which is ≈121 MB. Therefore, both of
these image files are 121 MB and are allocated as follows, including the code to read in the
original image:
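The actual allocation code in Code 8.1 is not reproduced here; a minimal sketch, assuming the CPU-side pointers are named TheImg and CopyImg and the BMP reader is the ReadBMPlin() function used in the earlier chapters, would look something like this:

TheImg  = ReadBMPlin(InputFileName);   // read the original (~121 MB) image into CPU memory
if (TheImg == NULL) { printf("Cannot open %s\n", InputFileName); exit(EXIT_FAILURE); }
CopyImg = (uch *)malloc(IMAGESIZE);    // CPU-side buffer for the processed (~121 MB) result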
8.4.1.2 GPUImg
As shown in Table 8.2, GPUImg is the pointer to the GPU memory area where the original
image will reside. Based on our example, it takes up 121 MB and the original CPU image
is copied into this area as follows:
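The copy itself (again, a sketch rather than the verbatim Code 8.1 line) is a single host-to-device cudaMemcpy():

cudaMemcpy(GPUImg, TheImg, IMAGESIZE, cudaMemcpyHostToDevice);   // CPU -> GPU, 121 MB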
8.4.1.3 GPUBWImg
Because our algorithm needs only a B&W version of the image, the original image is imme-
diately turned into its B&W version using the BWKernel(), which is saved in the memory
area pointed to by GPUBWImg. The size of this area is one double per pixel, which stores the
B&W value of the pixel. For our example astronaut.bmp image, it is 325 MB. In Table 8.2
the “Data Move” column indicates the total amount of data that each kernel is responsible
for moving. For example, for the BWKernel(), this is indicated as 446 MB (i.e., 121+325). The reason is that this kernel has to read the entire GPUImg area (121 MB), compute the resulting B&W image, and write it into the GPUBWImg area (325 MB), therefore moving a total of 446 MB worth of data. Both of these memory areas are in global memory (GM); reading from and writing to GM do not necessarily take the same amount of time; however, for the sake of simplicity, it doesn't hurt our argument to assume that they are equal and to calculate some meaningful bandwidth metrics.
8.4.1.4 GPUGaussImg
The GaussKernel() in our GPU code will take the B&W image (GPUBWImg) and compute
its Gaussian-filtered version and write it into the memory area that is pointed to by
GPUGaussImg. For each B&W pixel (type double), one Gaussian-filtered pixel (also type
double) is computed; therefore, the size of both of these memory areas is the same (325 MB).
So, the total amount of data that the GaussKernel() has to move is 650 MB, as indicated in
Table 8.2.
TABLE 8.2 Kernels used in imedgeG.cu, along with their source and destination array names, types, and sizes, and the amount of data each kernel moves ("Data Move") for the astronaut.bmp file. All sizes are in MB; "uc" denotes unsigned char.
8.4.1.6 GPUResultImg
The ThresholdKernel() is responsible for computing the thresholded (edge-detected) version of the image and writing it into the memory area pointed to by GPUResultImg, which will eventually be copied into the CPU's CopyImage. This resulting image contains 0-0-0 for pixels that will be black (edge) and 255-255-255 for pixels that will be white (no-edge). Note that the colors used for edge versus no-edge can be changed using the #define, as described in the CPU version of the program. Sometimes you want black to indicate no-edge; this is useful when printing the edge-detected version of the image rather than displaying it on a computer monitor.
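All of the GPU-side areas above are carved out of one big cudaMalloc()'d region. The exact Code 8.1 lines are not repeated here, but the allocation is, in essence, the following (GPUptr and the cudaStatus error check are assumed names):

cudaError_t cudaStatus = cudaMalloc((void **)&GPUptr,
                                    2*sizeof(uch)*IMAGESIZE + 4*sizeof(double)*IMAGEPIX);
if (cudaStatus != cudaSuccess) { printf("cudaMalloc failed!\n"); exit(EXIT_FAILURE); }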
The quantity 2*sizeof(uch)*IMAGESIZE computes the area needed for the initial (GPUImg) and the final (GPUResultImg) images, while the intermediate results require 4*sizeof(double)*IMAGEPIX to store GPUBWImg, GPUGaussImg, GPUGradient, and GPUTheta.
In the C programming language, as long as you know the types of the variables, you can
address these areas as if they were arrays, which is what is done in imedgeG.cu. The trick
is to set the type of the pointers to the type of the variables that the pointer is pointing
to. This is why the initial cudaMalloc() is of type void*, which allows infinite flexibility
in setting each individual pointer to whatever type memory pointer our heart pleases in
Code 8.1 as follows:
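The line in question is nothing more exotic than the following (a sketch consistent with the discussion below, not a verbatim copy of Code 8.1):

GPUImg = (uch *)GPUptr;      // interpret the start of the big allocation as an uch array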
This is nothing more than copying one 64-bit variable into another; however, the big deal is the cast to uch *; this lets the compiler know that the pointer on the left side (GPUImg) is now of type uch * (pronounced "uch pointer" or "pointer to unsigned char," or, even more precisely, "pointer to an array where each array element is an unsigned char and occupies a single byte"). Remember that the right-side pointer (GPUptr) was of type
void*, which is a way of saying “it is a 64-bit integer that represents a typeless pointer.”
Performing the casting serves a crucial purpose: now that the compiler knows that the left-
side pointer is of type unsigned char, it will perform pointer arithmetic based on that. Look
at the next line:
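Again, the actual Code 8.1 line is not repeated here, but based on the discussion that follows it is simply uch * pointer arithmetic of the form:

GPUResultImg = GPUImg + IMAGESIZE;   // the result image starts IMAGESIZE bytes later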
One thing you cannot take for granted here is the meaning of the “+” operator on two
different types; GPUImg is an uch* and IMAGESIZE is some type of an integer. How would
you perform the addition in this case? The answer comes from prescribed pointer arithmetic
rules in the C language. Because of the casting, the compiler knows that GPUImg points to an array whose elements are 1 byte each. Therefore, adding an integer advances the pointer by exactly that many bytes, and GPUResultImg ends up pointing to a GPU memory area that is IMAGESIZE bytes away from GPUImg. In our example, when we are processing astronaut.bmp, each area is 121 MB; therefore, GPUImg is at the very beginning of the allocated area (call it offset=0) and GPUResultImg is at offset 121 MB. This was easy because both of these pointers are of the same type; therefore, a simple addition — without casting — suffices.
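The next pointer computation, discussed below, switches element types and therefore needs a cast (again a sketch, not the verbatim Code 8.1 line):

GPUBWImg = (double *)(GPUResultImg + IMAGESIZE);   // first double array starts after the two uch images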
which is fairly straightforward: IMAGESIZE is the amount of area we need for the GPUResultImg array, and it can be added to that pointer using uch * pointer arithmetic; however, the resulting pointer must be cast to double*, because the area that follows GPUResultImg is GPUBWImg, which points to an array of double type elements of size 64 bits (8 bytes). Once the pointer GPUBWImg is computed, the remaining three pointers are also of type double*, and computing one double* from another is simply a matter of stating how many double elements apart they are, as follows:
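A sketch of that next line (not the verbatim Code 8.1 line):

GPUGaussImg = GPUBWImg + IMAGEPIX;   // double* arithmetic: IMAGEPIX doubles later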
which means “IMAGEPIX elements apart, where each element is of type double” or alter-
natively “IMAGEPIX*8 bytes apart” or if you want to be even more technically correct,
“IMAGEPIX*sizeof(double) bytes apart.” The remaining pointer computations follow ex-
actly the same logic:
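In other words, something along the lines of (a sketch):

GPUGradient = GPUGaussImg + IMAGEPIX;
GPUTheta    = GPUGradient + IMAGEPIX;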
We use the variable time2 to time-stamp the beginning of the remaining lines, which will call
the GPU kernels to perform edge detection. The first kernel is BWKernel(), which computes
the B&W version of the original RGB image:
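The launch itself is essentially the following (a sketch: BlkPerRow, NumBlocks, ThrPerBlk, and IPH are names used in the text, while IPV, the image height in pixels, is an assumed name; the exact Code 8.2 lines may differ slightly):

BlkPerRow = CEIL(IPH, ThrPerBlk);     // blocks needed to cover one pixel row
NumBlocks = IPV * BlkPerRow;          // 166,656 blocks for astronaut.bmp at 256 threads/block
BWKernel <<< NumBlocks, ThrPerBlk >>> (GPUBWImg, GPUImg, IPH);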
Here, the BlkPerRow and ThrPerBlk variables are needed for every kernel, so they are calculated before launching any of the kernels. The BWKernel() takes the original image that just got transferred into GPU memory (GPUImg) and writes its B&W version into the GPU memory area that holds the B&W image (GPUBWImg), which consists of double elements. The time it finishes its execution is time-stamped with the time2BW variable.
The other two kernels are called and time-stamped in the same way:
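Based on the kernel signatures shown later in this chapter (Code 8.4 and Code 8.5), these launches presumably look like the following (a sketch; the time-stamping lines are omitted):

GaussKernel <<< NumBlocks, ThrPerBlk >>> (GPUGaussImg, GPUBWImg, IPH, IPV);
SobelKernel <<< NumBlocks, ThrPerBlk >>> (GPUGradient, GPUTheta, GPUGaussImg, IPH, IPV);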
The time when all kernels finish executing is time-stamped with the time3 variable. The
result resides in the GPU memory area (GPUResultImg) and is transferred into CPU memory
as follows:
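The device-to-host transfer is a mirror image of the earlier host-to-device copy (a sketch; CopyImg is the assumed CPU-side result buffer):

cudaMemcpy(CopyImg, GPUResultImg, IMAGESIZE, cudaMemcpyDeviceToHost);   // GPU -> CPU, 121 MB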
The next few lines keep track of the amount of data movement for each kernel. For the first three kernels, this is what we have:
Clearly, this is the lower estimate for how much data is moving, while the upper estimate
would require us to add another IMAGEPIX, i.e., if every single GPUTheta value is calculated,
as shown in Table 8.2. Given this uncertainty, the best course of action is to avoid making critical decisions about kernel performance based on this specific kernel. The other
three kernels will allow us to judge the performance impact of our improvements more than
sufficiently anyway, so this hurdle will not hold us back from making accurate judgments.
Finally, the following lines are needed to compute the total amount of data movement:
GPUDataTfrKernel=GPUDataTfrBW+GPUDataTfrGauss+GPUDataTfrSobel+GPUDataTfrThresh;
GPUDataTfrTotal =GPUDataTfrKernel + 2 * IMAGESIZE;
Note that the addition of 2*IMAGESIZE takes into account the original and final images that are transferred from/to the CPU.
8.5.1 BWKernel()
Code 8.3 provides a listing of the BWKernel(), which computes the B&W version of the
image according to Equation 5.1. The computations required for this kernel are so simple
that it almost looks like we shouldn’t even bother with a detailed explanation; however,
let’s definitely dig deep into Code 8.3 and see how it is implemented and whether we could
have done a better job.
B = (double)ImgGPU[MYsrcIndex];
G = (double)ImgGPU[MYsrcIndex + 1];
R = (double)ImgGPU[MYsrcIndex + 2];
ImgBW[MYpixIndex] = (R+G+B)/3.0;
}
Why not start with the very few lines at the beginning of the kernel? Let’s skip the first
three lines that are simple assignments and look at any line that involves some sort of a
computation:
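The lines in question are of the following form (a sketch consistent with the discussion that follows, not a verbatim copy of Code 8.3):

ui BlkPerRow = CEIL(Hpixels, ThrPerBlk);   // one integer division hidden inside CEIL
ui MYrow = MYbid / BlkPerRow;              // a second integer division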
Do you see anything wrong? This is the problem with writing nice and organized code.
CEIL is a macro that is defined at the very beginning of imedgeG.cu and contains the
following few lines:
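The macro is most likely of the familiar ceiling-division form (a sketch; the exact definition in imedgeG.cu may differ cosmetically):

#define CEIL(a,b)   ((a+b-1)/b)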
Now do you see anything even more wrong? I do! We stressed in Section 4.7.3 that integer
divisions can be weapons of mass destruction for kernel performance, and, yet, you have
TWO OF THEM for EACH PIXEL: one in computing MYrow and one in the CEIL macro.
So, for the astronaut.bmp image, you are forcing this kernel to perform 84 million integer divisions for 42 million pixels; worse yet, this is only to compute the B&W version of the image. We still have the Gauss, Sobel, etc. We are not done yet. Look at the way we compute
the B&W pixel from its RGB components:
B = (double)ImgGPU[MYsrcIndex];
G = (double)ImgGPU[MYsrcIndex + 1];
R = (double)ImgGPU[MYsrcIndex + 2];
ImgBW[MYpixIndex] = (R+G+B)/3.0;
Do you see anything wrong? There is yet another division and it is of type double. So,
this division is definitely slower than the previous integer type divisions. Additionally, the
RGB pixel values are originally unsigned char and they are converted to double. Each such conversion is as bad as a regular double operation, which ties up double-precision computational resources. Could we have performed this computation differently? YES! Wait until we reach
Chapter 9, when we will come up with many ideas to improve the performance of this kernel
(and the others).
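Just to hint at one obvious direction (this is a sketch, not the improved Chapter 9 version): the double division by 3.0 can be replaced with a multiplication by a precomputed constant, which is typically cheaper:

B = (double)ImgGPU[MYsrcIndex];
G = (double)ImgGPU[MYsrcIndex + 1];
R = (double)ImgGPU[MYsrcIndex + 2];
ImgBW[MYpixIndex] = (R + G + B) * 0.3333333333333333;   // multiply instead of divide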
8.5.2 GaussKernel()
Code 8.4 provides a listing of the GaussKernel(), which computes the Gaussian-filtered image from its B&W version according to Equation 5.2. A close look at this kernel shows the same two integer divisions at the top and a single division (of type double) at the bottom. However, these operations get lost in the wind when you look at the substantial number of additions and multiplications performed in the nested for loops, not to mention the fact that every single one of these additions and multiplications is of type double. In the CPU version of imedge.c, we didn't have to pay too much attention to whether a floating point operation was float or double, i.e., single-precision or double-precision, because all modern CPUs are 64-bit and the 64-bit double is their native data size. So, their double-precision performance is almost the same as their single-precision performance, at least for simple addition and multiplication operations. This contrasts substantially with GPUs; the native data size of all GPUs (at least up to the Pascal family) is 32 bits. Therefore, their performance tanks when you perform too many double-precision floating point operations. As we will investigate deeply in Chapter 9, the performance difference can be a couple of orders of magnitude! Therefore, when we do not need to, we shouldn't use double-precision floating point variables so generously.
Aside from this, there is so much more to talk about that we will go over all of it one
by one. Let’s look at the definition of the Gaussian filter constants:
__device__
double Gauss[5][5] = { { 2, 4, 5, 4, 2 },
{ 4, 9, 12, 9, 4 },
...
I didn't repeat all of it here. The __device__ prefix we put in front of the array means that this array is a device-side array; we are letting the compiler decide exactly where it goes. We just know that it is on the device side, not the host side. Of course, this has to be the case, because every addition and multiplication in the kernel requires one of these 25 values. Inside the two nested for loops, there are about 150 integer and 75 double operations to compute a single pixel. You can see why we are not really terribly concerned with the 3 divisions anymore. We have bigger problems.
__device__
double Gauss[5][5] = { { 2, 4, 5, 4, 2 },
{ 4, 9, 12, 9, 4 },
{ 5, 12, 15, 12, 5 },
{ 4, 9, 12, 9, 4 },
{ 2, 4, 5, 4, 2 } };
// Kernel that calculates a Gauss image from the B&W image
// resulting image has a double type for each pixel position
__global__
void GaussKernel(double *ImgGauss, double *ImgBW, ui Hpixels, ui Vpixels)
{
ui ThrPerBlk = blockDim.x;
ui MYbid = blockIdx.x;
ui MYtid = threadIdx.x;
ui MYgtid = ThrPerBlk * MYbid + MYtid;
int row, col, indx, i, j;
double G=0.00;
8.5.3 SobelKernel()
Code 8.5 provides a listing of the SobelKernel(), which computes the Sobel-filtered image from its Gaussian-filtered version according to Equation 5.3 and Equation 5.4. This code looks very similar to its CPU sister, Code 5.5. So, from the implementation standpoint, we can feel comfortable that once we have the CPU version of something, it is pretty easy to develop its GPU version, right? NO! Not so fast! If you learned one thing in this book, it is the fact that the code you are reading says nothing about the underlying hardware it will be mapped to. In other words, if the CPU and GPU hardware were identical, you would expect Code 5.5 (CPU Sobel) and Code 8.5 to give you identical performance; however, there are so many differences that we will dedicate an entire chapter to how GPU cores work (Chapter 9) and another full chapter to how the GPU memory structure works (Chapter 10). The biggest difference between CPU and GPU cores is that GPUs are designed to fit 1000, 2000, even 3000 cores inside them, while CPUs only have 4–12 cores. So, each CPU core is a lot faster, more capable, and executes 64-bit operations natively, while the GPU cores are much simpler and are designed to achieve their performance through sheer quantity rather than sophisticated design. So, starting with this chapter, the best practice for the readers will be to watch for the GPU's strengths and try to exploit them in their programs.
The SobelKernel() has two nested loops with 3 values in each loop, so it is 9 iterations
total. We see plenty of integer operations:
row = MYrow + i;
col = MYcol + j;
indx = row*Hpixels + col;
As you remember, we spent an entire Section 4.7 going over each mathematical operation
and trying to understand how “bad” each one of these operations was in terms of compu-
tational requirement. Operations such as division, sin(), sqrt() were nothing but bad news
for the CPU cores. Considering that GPU cores are even simpler, we do not expect these operations to be any less damaging to GPU performance. In SobelKernel(), the most alarming lines are the following:
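Based on the description that follows, these lines are of the following form (a sketch, not a verbatim copy of Code 8.5; PI is an assumed constant):

ImgGrad[MYpixIndex]  = sqrt(GX*GX + GY*GY);
ImgTheta[MYpixIndex] = atan(GX / GY) * 180.0 / PI;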
Besides the less harmful double-precision additions and multiplications, these lines contain a double-precision division, a square root, and an arc tangent. If the CPU equivalent of SobelKernel() is any indication (the Sobel() function in Code 4.7), we expect these lines to execute very slowly. The only good news is that these lines are executed only once per pixel, while the double-precision additions and multiplications inside the two for loops are executed many times. It is good to keep an eye on all of these details, but, in the end, we will be able to quantify all of them when we present runtime results for these kernels in the following pages.
__device__
double Gx[3][3] = { { -1, 0, 1 },
{ -2, 0, 2 },
{ -1, 0, 1 } };
__device__
double Gy[3][3] = { { -1, -2, -1 },
{ 0, 0, 0 },
{ 1, 2, 1 } };
// Kernel that calculates Gradient, Theta from the Gauss image
// resulting image has a double type for each pixel position
__global__
void SobelKernel(double *ImgGrad, double *ImgTheta, double *ImgGauss, ui Hpixels,
ui Vpixels)
{
ui ThrPerBlk = blockDim.x;
ui MYbid = blockIdx.x;
ui MYtid = threadIdx.x;
ui MYgtid = ThrPerBlk * MYbid + MYtid;
int row, col, indx, i, j;
double GX,GY;
8.5.4 ThresholdKernel()
Code 8.6 provides a listing of the ThresholdKernel(), which computes the thresholded (i.e.,
edge-detected) version of the image from its Sobel-filtered version according to Equation 5.5
and Equation 5.6. An interesting observation about Code 8.6 is that its performance is highly dependent on the image itself; this concept is called data dependence of the performance. There is a deep chain of if statements that can vastly vary the performance based on which of them evaluate to TRUE versus FALSE. In a situation like this, only a range of performance numbers can be given, rather than a single value. This wasn't the case for the other three kernels, because all of the pixels required the computation of exactly the same functions. Because of this data dependence, we will not use this function to gauge the success of some of the improvements we will suggest. Another interesting characteristic of this function is that it lacks nested loops (or even a single loop, for that matter); therefore, it is expected to execute substantially faster than the other kernels, especially because it doesn't contain any of the expensive functions (e.g., sin(), sqrt()).
What we observe from Table 8.3 (for the imedgeG.cu program) is pretty much a re-
peat of what we observed in Table 7.3, when we were analyzing the performance of the
imflipG.cu program. Aside from minor deviations, we see that these programs exhibit
identical PCIe throughput behavior because they use exactly the same API for transfers:
cudaMemcpy(). They both reach ≈20–45% of the peak throughput of PCIe and the per-
formance is not symmetric for GPU→CPU versus CPU→GPU transfers in certain cases.
Without getting into details, this is a more than sufficient conclusion for us.
TABLE 8.4 imedgeG.cu kernel runtime results; red numbers are the best option for
the number of threads and blue are fairly close to the best option (see ebook for
color version).
Feature Box I Box II Box III Box IV Box V Box VI
GPU GT640 K3000M GTX 760 GTX 1070 Titan Z Tesla K80
Compute Cap 3.0 3.0 3.0 6.1 3.5 3.7
GM BW GBps 28.5 89 192 256 336 240
Peak GFLOPS 691 753 2258 5783 8122 8736
DGFLOPS 29 31 94 181 2707 2912
# Threads BWKernel() kernel run time (ms)
32 82.02 72.84 23.83 5.38 12.90 16.43
64 55.69 49.33 16.35 4.95 9.69 9.23
128 52.99 48.68 16.26 5.00 7.62 6.79
256 53.28 48.97 16.36 4.97 7.02 6.76
512 56.76 51.59 17.17 5.18 7.14 7.07
768 64.02 57.30 18.96 5.74 7.46 8.60
1024 60.16 54.10 17.98 5.20 6.90 7.37
Achieved GBps 8.23 8.96 26.82 88.09 63.18 64.48
(%) (29%) (10%) (14%) (34%) (19%) (27%)
# Threads GaussKernel() kernel run time (ms)
32 311.92 246.74 81.84 30.97 48.12 63.19
64 212.85 188.18 62.74 30.71 48.14 62.34
128 208.15 186.69 62.24 30.64 39.87 62.34
256 206.87 183.94 61.26 30.28 35.71 62.34
512 207.90 185.86 62.00 30.47 35.64 62.34
768 223.05 193.67 64.55 29.52 35.67 62.37
1024 212.47 188.55 62.89 29.84 35.74 62.41
Achieved GBps 3.07 3.45 10.35 21.48 17.80 10.18
(%) (11%) (4%) (5%) (8%) (5%) (4%)
# Threads SobelKernel() kernel run time (ms)
32 289.86 255.17 84.63 31.50 43.64 44.32
64 253.82 231.04 76.77 31.29 40.37 31.36
128 254.30 231.76 77.00 31.31 33.33 31.36
256 255.11 231.13 76.76 31.32 29.28 31.36
512 260.81 232.67 77.47 31.31 30.00 31.36
1024 287.05 256.06 85.37 31.38 32.32 31.46
Achieved GBps 3.75 4.12 12.40 30.41 32.49 30.34
(%) (13%) (5%) (6%) (12%) (10%) (13%)
# Threads ThresholdKernel() kernel run time (ms)
32 99.40 85.74 27.89 4.13 12.74 20.24
64 61.13 53.45 17.54 3.46 8.37 12.00
128 46.73 42.36 14.08 3.31 6.16 7.73
256 47.27 42.73 14.20 3.45 5.77 7.75
512 53.29 47.46 15.49 3.93 5.97 9.42
768 68.52 60.80 18.65 4.90 7.12 13.01
1024 62.95 54.89 17.95 4.66 6.72 11.46
Achieved GBps 9.33 10.29 30.98 131.60 75.55 56.43
(%) (33%) (12%) (16%) (51%) (22%) (24%)
The "Achieved GBps %" entries show the relative bandwidth achieved (% of reported peak bandwidth). The "Peak GFLOPS" entry lists the peak single-precision floating point capability and "Peak DGFLOPS" lists the peak double-precision computational capability of the GPU.
in Table 8.5, under the “Peak DGFLOPS” column; the peak double-precision ca-
pability of GTX 1070 is only 181 GFLOPS, while its single-precision capability is
5783 GFLOPS. Remember from Code 8.4 that the GaussKernel() is packed with
double-precision.
• For now, it suffices for us to propose a hypothesis that the reason for this surprising
performance is the inefficiency of the code, which Pascal is able to partially compensate
for. However, in Chapter 9, we will significantly tweak these kernels and get them close
to optimal, in terms of core operation. In that case, the double-precision capable GPUs
should start shining; for example, we expect K80 to scream in this case.
• To restate the previous comments, we are making Pascal work as a "bad code fixer-upper" rather than a good "computational unit." However, when we optimize the code, we will see who the good computational heroes are. I already ruined the surprise
partially. I don’t want to talk more about this. We have an entire Chapter 9 to talk
about the GPU cores.
• Without thinking about the double-precision issue too much, almost every Kepler
GPU seems to have a comparable relative performance, with minor exceptions. This
means that the more cores Nvidia stuffs into their GPUs, the higher they make the
global memory bandwidth; otherwise, adding more cores would cause data starvation.
Although this is true when you look at these numbers at a higher level, details show
interesting hidden trends. However, we cannot pick out these trends on such inefficient
code. Currently, the inefficiencies in the kernels shown in Table 8.4 are masking the
real performance numbers; in the following chapters, when we make the code core-
and memory-friendly, much more interesting trends will emerge.
I expect the readers to loudly protest that the above lines cannot be plain simple CPU code; even the variable names say GPU-something! The fact is that they are completely CPU code and the variables are CPU-side variables. Unless you use them in conjunction with the GPU, you can call your variables whatever your heart desires.
The most important thing about these previous lines of code is that they are compiled
into CPU x64 instructions, and they use CPU memory and don’t know if a GPU even
exists.
2. Kernel launch code: This is pure CPU code, but its entire purpose is to launch
GPU code. Everything has to initiate at the CPU, because CPU is the host. So, the
only way to get access to the GPU side execution is these kernel launch lines, as shown
below (from Code 8.1):
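A representative launch line from Code 8.1 (a sketch based on the kernel parameters discussed later in this chapter):

BWKernel <<< NumBlocks, ThrPerBlk >>> (GPUBWImg, GPUImg, IPH);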
The line above is nothing but a shortcut for an API function that Nvidia provides to facilitate a kernel launch. Instead of the <<< and >>> shortcut symbols, you can simply use that API and nothing will be different, with the exception that
you will not get the annoying squiggly red lines that MS Visual Studio gives you. One
important aspect of this category is that the variables that are passed onto the kernel
are either constant values or GPU-side variables, such as GPU memory pointers.
3. GPU→CPU and CPU→GPU Data Transfer APIs: This is also pure CPU code.
These APIs are a part of the library Nvidia provides to CUDA developers and they are no different than the <<< and >>> launches; they are simply APIs that call a function on the CPU side, which lets the CUDA runtime know to do something on the GPU side. By definition, they include both CPU-side and GPU-side pointers in their function call. Here is an example from Code 8.1:
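A representative transfer line (a sketch; the error-checking wrapper around it is omitted):

cudaMemcpy(GPUImg, TheImg, IMAGESIZE, cudaMemcpyHostToDevice);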
4. Pure CUDA code: This part is totally written in C language, however, a few symbols
preceding the C code tell the compiler that what is coming is CUDA code. Here is an
example from Code 8.3:
__global__
void BWKernel(double *ImgBW, uch *ImgGPU, ui Hpixels)
{
ui ThrPerBlk = blockDim.x;
ui MYbid = blockIdx.x;
ui MYtid = threadIdx.x;
ui MYgtid = ThrPerBlk * MYbid + MYtid;
double R, G, B;
Clearly, there is no way to compile this part of the code with gcc because it is pure
GPU code and the expected compiled output is the PTX, the GPU assembly language.
There are two clues in this code that would completely give away the secret that it
is GPU code: (1) the __global__ identifier and (2) the specialized CUDA variables,
blockDim.x, etc.
cudaGetDeviceCount(&NumGPUs);
What does this code do? The OS runs this file as a usual EXE executable. As far as the OS
is concerned, this is nothing more than a function call whose compiled binary is included in the cudart64_80.dll file. If you didn't have this file accessible to the OS (by setting the
correct path), your program would crash right here and would tell you that it cannot locate
the compiled binaries of the function cudaGetDeviceCount(). So, let’s assume that you have
this file accessible.
Who handles the cudaMalloc()? This is a function that requires full cooperation from the
CPU and GPU side. So, the graphics driver has to have access to both. GPU is not a problem
because it is the driver’s own territory. However, accessing the CPU memory requires the
graphics driver to be nice and apologetic, and obey the master that controls the CPU
territory (the OS). Let us now go through a few situations where things can go wrong:
• Improper CPU-Side Memory Pointers: This is the OS's territory, and anybody who uses improper (out-of-bounds) CPU-side pointers gets slapped in the face: execution is taken away from them and they are thrown out on the street! An incorrect CPU-side memory pointer access would cause a Segmentation Fault in Unix and its equivalent in Windows.
• Improper GPU-Side Memory Pointers: It is highly likely that your program
will have improper GPU-side memory accesses, but not CPU-side. If you violated the
CPU territory, the OS already slapped you in the wrist and the execution of your
CUDA program is over. However, the OS has no way of accessing the GPU side. Who
is going to protect you from GPU-side bad memory pointers? The answer is simple:
NRE (Nvidia Runtime Engine). So, NRE is the police who rules the GPU town! Any
bad memory pointers would cause the NRE to detect this and decide to terminate
the program. But, wait... Your program is really a CPU program. So, only the OS
can terminate you. Well, the cops of the two different towns cooperate to catch the bad guy in this case. The NRE tells the OS that this program has done something that it wasn't supposed to, and the OS listens to its cop friend in GPU-town and terminates
the program’s execution. The message you get will be different than the ones you
encounter with the CPU-side pointer issues, but, still, you are caught trespassing and
can’t run anymore!
• Improper GPU-Side Parameters: Bad GPU memory pointers are not the only
reason for the NRE to terminate you. Let’s assume that you are trying to run a kernel
with 2048 threads per block and the GPU engine you are using cannot support it.
The compiler would have no idea about this at compile time. It is only when you run
the GPU code that your NRE would detect this and tell the OS to shut you down,
because there is no way to continue the execution.
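On the CPU side, such a failed launch can be caught with the CUDA runtime's standard error-reporting calls; a minimal sketch (not from the book's listings) is:

cudaError_t err = cudaGetLastError();      // picks up the failed launch (e.g., too many threads/block)
if (err != cudaSuccess) {
    printf("Kernel launch failed: %s\n", cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}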
In Figure 8.2, the Host Interface is responsible for shuttling the data back and forth between the CPU and GPU. So, clearly, the NRE uses some internal APIs to facilitate this. This requires reading the data from CPU memory and using the X99 chipset to transfer it to the GPU; the data is welcomed by the Host Interface, makes it into the L2$, and goes to its final destination in the global memory on the right side of Figure 8.2.
which translates to the following quantities at runtime, when the values of the NumBlocks
and ThrPerBlk variables are plugged in:
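With the astronaut.bmp numbers and 256 threads per block, the launch therefore effectively becomes (a sketch):

BWKernel <<< 166656, 256 >>> (GPUBWImg, GPUImg, IPH);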
My specific question is: aside from launching the kernel itself, what additional information has to go to the GPU side so that the GPU can execute this thing internally? The answer comes in two different sets of parameters, as follows:
• Launch Parameters: These parameters (166656, 256) are strictly related to the
dimensions of the blocks and grids; they have nothing to do with the internal operation
of the kernel. So, the only part of the GPU that cares about this is the Giga Thread
Scheduler (GTS). All that the GTS does is, as its name suggests, schedule the blocks; the only criterion for its scheduling is that an SM is willing to take on another block.
If it is, the GTS schedules it and its job is done. Of course, it has to communicate the
threads/block to the SM before the SM can decide whether it can handle this extra
job. If it can, it gets the job!
• Kernel Parameters: The parameters passed onto the kernel (GPUBWImg, GPUImg, IPH) are strictly needed for the execution of the kernel by the GPU's cores and SMs and are of no concern to the GTS. Once the GTS schedules a block, that block is the SM's problem from that point on.
• Every single one of these 166,656 blocks gets an exact copy of the CUDA binary code
for the CUDA kernel that it is supposed to execute. This is in cubin format, already just-in-time compiled from PTX, because cubin is the native GPU core language, whereas PTX is an intermediate representation that must be translated into cubin.
• Clearly, there is a way to make this more efficient by avoiding repetition, but, from
the standpoint of our understanding, note that every SM that receives a specific block
as a task also receives its code somehow and caches it in its own instruction cache
(inside the SM).
• No SM can have more than 8 blocks in its queue, although it can only execute a single one at any point in time. So, while one block is executing, 7 are queued up, waiting for execution.
it, it wouldn’t even consider taking another block, (2) if it can still take more, it compares
the parameters of the kernel you are advertising to its own parameters and sees if it has
“resources” to take this new block. Resources include a lot of things, such as cache memory,
register file, among many others as we will see in Chapter 9.
Back to our example: By this time, each SM absorbed the following blocks:
• SM0 =⇒ [ Block0, Block6, Block12, Block18, Block24, Block30, Block36, Block42 ]
• SM1 =⇒ [ Block1, Block7, Block13, Block19, Block25, Block31, Block37, Block43 ]
...
• SM5 =⇒ [ Block5, Block11, Block17, Block23, Block29, Block35, Block41, Block47 ]
Now that you have scheduled 48 blocks (Block0...Block47 ), you have to wait for somebody
to be free again to continue with the other 166,608 blocks (Block48 ... Block166655 ). It is
not necessarily true that the SMs will finish the execution of the blocks assigned to them
in exactly the same order of assignment. At this moment, you have each SM executing 8
blocks, but they can only execute one at a time, and put 7 to sleep temporarily. When a
block accesses resources that will take a while to get (say, some data from global memory),
it has the option to switch to another block to avoid staying idle. This is why you stuff
8 blocks to the SM and give it 8 options to choose from to keep itself busy. This concept
is identical to why assigning two threads to the CPU helped it do more work on average,
although it could not execute more than one of those threads at a time. In the case of
the SM, it grabs a bunch of blocks, so it can switch to another one when one comes to a
standstill.
Now, let’s fast forward the time a little bit. Say, SM1 got finished with Block7 before
anybody else. It would immediately raise its hand and volunteer to take in another block.
Having gotten rid of 0...47, your next block to schedule is 48. So, you would make the
following scheduling decision: Block48 →SM1. After this assignment, SM1 would clean up
all of the resources it needed for Block7 and replace it with Block48. So, SM1 ’s queue of 8
blocks is looking like this now:
• SM1 =⇒ [ Block1, Block48 , Block13, Block19, Block25, Block31, Block37, Block43 ]
Let’s say that SM5 finished Block23 next and raised its hand; you would assign your next
block (Block49 ) to it, which will change its queue to the following:
• SM5 =⇒ [ Block5, Block11, Block17, Block49 , Block29, Block35, Block41, Block47 ]
This would continue until you finally assigned Block166655. When you assign this very last
block, GTS’s responsibility is over. It might take a while to finish what is in the queue of
each SM after this very last assignment, but, as far as the GTS is concerned, job is done!
This is enough for you to understand that, out of the 166,656 blocks, you are #49. Be-
cause each block consists of 256 threads, you must execute this block using 256 threads. So,
the SM's responsibility is to execute Block49 using 256 — single-dimensional — threads,
numbered threadIdx.x=0...255. This means that it will facilitate the execution of 256
threads, where each thread gets the exact same parameters above; additionally, they also
get their threadIdx.x computed and passed onto them. So, if you are thread #75 out of the
256 threads, this is what is passed on to you when you are executing:
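Although the exact list is not reproduced here, the net effect for thread #75 of Block49 can be illustrated with the usual global thread index computation (a sketch):

ui MYgtid = blockDim.x * blockIdx.x + threadIdx.x;   // 256 * 49 + 75 = 12,619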
Understanding GPU Cores
In the previous chapters, we looked at the GPU architecture at a high level; we tweaked the threads per block parameter and observed its impact on the performance. The readers should have a clear understanding by now that the unit of kernel launch is a block. In
this chapter, we will go much deeper into how the GPU actually executes the blocks. As I
mentioned at the very beginning of Part II, the unit of execution is not a block; rather, it
is a warp. You can think of the block as the big task to do, which can be chopped down
into much smaller sub-tasks named warps. The significance of the warp is that a smaller
unit of execution than a warp makes no sense, because it is too small considering the vast
parallelism the GPU is designed for.
In this chapter, we will understand how this concept of warp ties to the design of
the GPU cores and their placement inside a streaming multiprocessor (SM). With this
understanding, we will design many different versions of the kernels inside the imflipG.cu
and imedgeG.cu programs, run them, and observe their performance. We will run these
experiments in four different GPU architecture families: Fermi, Kepler, Maxwell, and Pascal.
With each new family, a new instruction set and compute capability have been introduced; to accommodate these new instruction sets, the cores and other processing units had to
be designed differently. Because of this fact, some of the techniques we will learn in this
chapter will be broadly applicable to every family, while some of them will only work faster
in the new generations, due to the utilization of the more advanced instructions, available
only in the newer generations such as Pascal.
While we “guessed” the threads per block parameter in the previous chapters, we will
learn how to use a tool named CUDA Occupancy Calculator in the next chapter, which will
allow us to establish a formal methodology for determining this important parameter, along
with many other critical parameters that ensure optimum utilization of the SM resources
during kernel execution.
FIGURE 9.1 GF110 Fermi architecture with 16 SMs, where each SM houses 32 cores,
16 LD/ST units, and 4 Special Function Units (SFUs). The highest end Fermi GPU
contains 512 cores (e.g., GTX 580).
If we launched the same 166,656 blocks on the GTX 580 (with 16 SMs) versus the GTX 550Ti (with 6 SMs), each SM would get assigned an average of 10,416 versus 27,776 blocks, respectively. So, it is not unreasonable to expect a ≈2.7× higher performance from GTX580 for
core-intensive code. How about memory? The GTX550Ti has a global memory bandwidth
of 98.5 GBps, which corresponds to about 0.51 GBps per core. In comparison, GTX580's GM bandwidth is 192.4 GBps, which corresponds to 0.38 GBps per core. So, we might want to consider the possibility that the GTX 550Ti could beat the GTX 580 in relative terms for memory-intensive code by at least 20–30%. However, there are many more factors that affect the memory performance, such as the cache memory built into the SMs, which can ease the pressure on the GM with intelligent programming.
Figure 9.1 depicts a “PCI Express Host Interface,” which is a PCIe 2.0 controller.
Remember that the Fermi family did not support PCIe 3.0. Every family after that did.
This host interface works in concert with the I/O controller in the GPU→CPU data trans-
fers. Giga Thread Scheduler is the unit that is responsible for Block→ SM assignments, as
we detailed in Section 8.9.5. The memory controller can either be a GDDR3 or GDDR5
controller and it is composed of multiple controllers, as most of the books will show. Here,
I am only showing a single memory controller to depict the component that is responsible
for the data transfers, Global Memory ←→ L2$. The 768 KB L2$ is the Last Level Cache
(LLC) and it is coherent and shared among all of the cores. This is where the GPU caches
the contents of the GM. In contrast, the L1$ inside each SM is not coherent and is strictly used as a local cache during the processing of individual blocks. More details on the
internal structure of each SM are given in the following subsection.
FIGURE 9.2 GF110 Fermi SM structure. Each SM has a 128 KB register file that
contains 32,768 (32 K) registers, where each register is 32-bits. This register file
feeds operands to the 32 cores and 4 Special Function Units (SFU). 16 Load/Store
(LD/ST) units are used to queue memory load/store requests. A 64 KB total cache
memory is used for L1$ and shared memory.
Each SM breaks the blocks assigned to it by the Giga Thread Scheduler (GTS) into a set of warps and schedules them to be executed by the execution units inside this SM, which are the cores, Load/Store queues, and Special Function Units (SFUs). When
memory read/write instructions need to be executed, these memory requests are queued up
in the Load/Store queues and when they receive the requested data, they make it available
to the requesting instruction. Each core has a Floating Point (FP) and an Integer (INT)
execution unit to execute float or int instructions, as shown on the left side of Figure 9.2.
Instructions need to access a lot of registers; rather than giving a register file to each
individual core (like in a CPU), the cores inside an SM share a large register file. In the
SM shown in Figure 9.2, a 128 KB register file (RF) is shared among the 32 cores. Each
register is a 32-bit (4 byte) unit, therefore making the RF a 32 K-register one. The SFU is
responsible for executing transcendental functions (e.g., sin(), cos(), log()). The Instruction
Cache holds the instructions within the block, while the L1$ Cache is responsible for caching
commonly used data, which is also shared with another type of cache memory named
Shared Memory. This 64 KB cache is split between the two as either (16 KB+48 KB) or
(48 KB+16 KB).
FIGURE 9.3 GK110 Kepler architecture with 15 SMXs, where each SMX houses 192 cores, 64 double precision units (DPU), 32 LD/ST units, and 32 Special Function
Units (SFU). The highest end Kepler GPU contains 2880 cores (e.g., GTX Titan
Black); its “double” version GTX Titan Z contains 5760 cores.
• Because Kepler is designed to hold almost 6 times as many cores as Fermi (2880 vs. 512), each SMX is structured to hold a significantly higher number of cores than a Fermi SM (192 vs. 32), although there is one less SMX (15 vs. 16). Having such heavily populated SMX units has interesting performance implications, as we will detail in future sections, because the cores inside an SM (or SMX) share their L1$, register file, and a newly introduced type of cache, the Read-Only Cache.
FIGURE 9.4 GK110 Kepler SMX structure. A 256 KB (64 K-register) register file feeds
192 cores, 64 Double-Precision Units (DPU), 32 Load/Store units, and 32 SFUs.
Four warp schedulers can schedule four warps, which are dispatched as 8 half-warps.
Read-only cache is used to hold constants.
FIGURE 9.5 GM200 Maxwell architecture with 24 SMMs, housed inside 6 larger GPC units; each SMM houses 128 cores, 32 LD/ST units, and 32 Special Function Units (SFU), but does not contain double-precision units (DPUs). The highest end Maxwell GPU contains 3072 cores (e.g., GTX Titan X).
FIGURE 9.7 GP100 Pascal architecture with 60 SMs, housed inside 6 larger GPC
units, each containing 10 SMs. The highest end Pascal GPU contains 3840 cores
(e.g., P100 compute accelerator). NVLink and High Bandwidth Memory (HBM2)
allow significantly faster memory bandwidths as compared to previous generations.
FIGURE 9.8 GP100 Pascal SM structure. It consists of two identical sub-structures, each containing 32 cores, 16 DPUs, 8 LD/ST units, 8 SFUs, and 32 K registers. They share an instruction cache; however, each has its own instruction buffer.
Peak GFLOPS = 2 × n × f     (9.1)
where f is the base core clock of a CUDA core and n is the total number of CUDA cores. In the Fermi generation, the concept of a "CUDA core" was a little different; cores were called SPs (Streaming Processors) and a shader clock was defined to be 2× the base core clock. This is why the peak GFLOPS computation is different for Fermi; for example, the GTX 580's core clock is 772 MHz and its shader clock (fshader) is 1544 MHz. So, its peak output is 772 × 2 × 512 × 2 = 1581 GFLOPS.
Starting with the Kepler generation, Nvidia called the cores CUDA cores. For example,
for the GTX 780, the core clock is 863 MHz and there are 2304 CUDA cores. Therefore,
GTX 780 peak compute power is 863 × 2304 × 2 = 3977 GFLOPS (single precision).
Double-precision peak compute power is calculated as follows:

Peak DGFLOPS =  Peak GFLOPS / 24   (Kepler GPUs with no DPUs)
                Peak GFLOPS / 3    (Kepler GPUs with DPUs)
                Peak GFLOPS / 32   (Maxwell GPUs)                    (9.2)
                Peak GFLOPS / 32   (Pascal GPUs with no DPUs)
                Peak GFLOPS / 2    (Pascal GPUs with DPUs)
As an example of Equation 9.2, let us calculate the peak DGFLOPS of GTX 1070 that
we used in our previous results; GTX 1070 cores run at a clock frequency of 1506 MHz and
the GTX 1070 is a "Pascal GPU with no DPUs," so, with its 1920 cores, GTX 1070's single-precision peak output is computed as 1506 × 1920 × 2 = 5783 GFLOPS from Equation 9.1 and its double-precision peak output is computed as 5783 ÷ 32 = 181 DGFLOPS from Equation 9.2.
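As a tiny sanity check of these two equations in code (a sketch; the clock is expressed in GHz so the result comes out directly in GFLOPS):

double f_GHz = 1.506;                        // GTX 1070 base core clock
int    n     = 1920;                         // GTX 1070 CUDA cores
double peakGFLOPS  = 2.0 * n * f_GHz;        // ~5783 GFLOPS (Equation 9.1)
double peakDGFLOPS = peakGFLOPS / 32.0;      // ~181 DGFLOPS, Pascal with no DPUs (Equation 9.2)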
It is important to note here that these computations assume the case where each core is
non-stop delivering FLOPS with no inefficiencies. If we learned anything in this book, it is
the fact that such a perfect scenario only occurs if-and-only-if the programmer designs the
CUDA kernels with infinite efficiency. So far, with the kernels we saw, we barely hit 20, 30, or
maybe 40% of this peak. In this chapter, our goal is to get as close to 100% as possible. Also
note that the alternative problem is being able to saturate the global memory bandwidth. In
other words, a memory-intensive program might saturate the memory bandwidth far before
it saturates the cores. We will use a tool named CUDA Occupancy Calculator toward the
end of this chapter to give us an idea about which one might occur first (i.e., core saturation
of memory saturation), before even we launch our kernels.
The reason behind Equation 9.1 and Equation 9.2 is that single-precision versus double-
precision peaks have everything to do with the CUDA core-to-DPU ratio inside the SM,
SMX, and SMM units. For example, in a Kepler SMX unit (Figure 9.4), we can clearly see
that there is a DPU for every three CUDA cores; therefore, we divide the GFLOPS by 3 to
TABLE 9.1 Nvidia microarchitecture families and their peak computational power for
single precision (GFLOPS) and double-precision floating point (DGFLOPS).
Family    Engine    Model       #SMs   #cores   #DPUs   #SFUs   Peak float   Peak double   Ratio
Fermi     GF110     550 Ti        4      192      —       16        691           —          —
Fermi     GF110     GTX 580      16      512      —       64       1581           —          —
Kepler    GK110     GTX 780      12     2304       0      384      3977          166        24×
Kepler    GK110     Titan        14     2688     896      448      4500         1500         3×
Kepler    2×GK110   Titan Z      30     5760    1920      960      8122         2707         3×
Kepler    2×GK210   K80          26     4992    1664      832      8736         2912         3×
Maxwell   GM200     980 Ti       22     2816       0      704      5632          176        32×
Maxwell   GM200     Titan X      24     3072       0      768      6144          192        32×
Pascal    GP104     GTX 1070     15     1920       0      480      5783          181        32×
Pascal    GP102     Titan X      28     3584       0      896     10157          317        32×
Pascal    GP100     P100         56     3584    1792      896      9519         4760         2×
Volta       —          —          —        —       —        —         —            —          —
get the DGFLOPS. Similarly, for the Pascal SM units, this ratio is 2, which is reflected in
Equation 9.2. However, it is less obvious why the ratio is 32× for Pascals with no DPUs
(such as the Titan X Pascal Edition and the GTX 1070). This has to do with the design of
the CUDA cores inside Pascal: they can execute double-precision operations, but they take
32× longer to do so than single-precision operations. For Keplers with no DPUs (such as
the GTX 780), this ratio was a less dramatic 24×.
Another interesting observation from Table 9.1 is that the number of cores per SM does
not match between the GTX 1070 and the P100; while the P100 has 64 cores per SM, the
GTX 1070 and the GTX Titan X both have 128 cores per SM. This is because the P100 is
designed to be a double-precision engine, while the other two are single-precision engines
with 32× lower double-precision performance. Because of this, their SM architectures are
completely different. We will study the GP104 engine in Section 9.3.14, when we go over
the different data types that Pascal supports.
There is a catch with GPU Boost though! An Integrated Circuit (IC) needs a higher voltage
to be able to run at a higher frequency. So, to be able to clock the cores at 875 MHz, the
GPU internal voltage circuitry would have to increase the core voltage. Then, what happens
to the power consumption? The power consumption formula is as follows:
P ∝ V² · f     (9.3)
where f is the frequency of the core and V is the operating voltage. Although the details
are not very important for this qualitative argument, it is clear that when you increase the
frequency of the core by 24%, the power consumption of the GPU goes up a lot more than
24%; if you are really trying to come up with a number, call it something like 50%.
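As a rough worked example (the 10% voltage increase is an illustrative assumption, not a measured figure): boosting the clock by 24% while raising the voltage by about 10% gives, per Equation 9.3,

P_boost / P_base ≈ (V_boost / V_base)² · (f_boost / f_base) ≈ (1.10)² × 1.24 ≈ 1.5

i.e., roughly 50% more power for a 24% higher clock.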
So, altogether, there are 3 double type variables and 10 unsigned int type variables. The
compiler should map all of these variables to registers, because they are the most efficient
for the GPU cores to execute instructions with. Considering that each register is a 32-bit
value, double types eat away two registers worth of space. So, we need 10 registers for
the unsigned int types and 6 to store the double types, for a total of 16. This is clearly
not enough. The compiler needs to store at least one or two 32-bit temporary values and
some 64-bit temporary values. So, it is realistic to assume that the BWKernel() in Code 8.3
requires 20–24 registers total to execute. Let’s call it 24 to be on the safe side.
Assume that we are running BWKernel() on a Pascal GP100 GPU, whose SM structure
is shown in Figure 9.8. Also assume that we launched these kernels with 128 threads per
block. In this case, every single one of the 128 threads in each block needs 24 registers; so
each block requires 128 × 24 = 3072 = 3 K registers in the RF just to be scheduled into
that SM. Recall from Section 8.9.3 that a Pascal SM can accept up to 32 blocks. If the Giga
Thread Engine (GTE) ended up scheduling 32 blocks into this SM, the SM would require
32 × 3 K = 96 K registers to accept all of them. However, as we see in Figure 9.8, the Pascal
SM only has 32 K registers (a total storage of 128 KB in terms of bytes). So, the GTE would
actually only be able to schedule 10 blocks into this SM before the SM stops accepting
blocks due to lack of register space in the RF.
What would happen if I decided to launch these kernels with 64 threads per block instead
of 128? The answer is: each block would now require 1.5 K registers and the GTE would
now be able to schedule 20 blocks. Even if I drop the threads/block down to 32, I can still
schedule only 30 blocks, which is less than the maximum number possible.
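A minimal sketch of this register-limit arithmetic (the 32 K-register file and the 32-block cap are the figures used in the text; the function itself is ours):

#include <stdio.h>

/* How many blocks fit in one SM, considering only the register file (RF)?   */
/* rf_size = registers per SM, max_blocks = hardware block limit per SM.     */
static int reg_limited_blocks(int regs_per_thread, int threads_per_block,
                              int rf_size, int max_blocks)
{
    int regs_per_block = regs_per_thread * threads_per_block;
    int blocks = rf_size / regs_per_block;          /* integer division */
    return (blocks > max_blocks) ? max_blocks : blocks;
}

int main(void)
{
    /* BWKernel() estimate: 24 regs/thread, 128 threads/block, 32 K-register RF, 32-block cap */
    printf("%d blocks\n", reg_limited_blocks(24, 128, 32*1024, 32));   /* prints 10 */
    return 0;
}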
This highlights the fact that the programmer should write kernels that use as few registers
as possible, to avoid register starvation in the SM. Using too many registers has another
interesting implication: the maximum number of registers a thread can use is limited by
the CUDA hardware. This limit was 63 until Compute Capability 3.0; Nvidia increased it
to 255 beyond CC 3.0, and it has stayed there since. Increasing it beyond this makes no
practical sense, because kernels with a very high register count exhaust the RF so quickly
that any performance benefit from such large kernels is negated by the inability to schedule
them into the SMs in large quantities.
When deciding what to keep inside the L1$, the SM cache controller looks for very simple
patterns in data usage. However, the underlying idea is that nobody knows the data better
than the programmer himself or herself. The texture cache is where the GPU keeps the
textures of the objects used in computer games.
This is where the warp schedulers come into play; their job is to turn each block into a set of
warps and schedule them to be individually executed. Using the same parameters as in Sec-
tion 8.9.5, we launched 256 threads/block, which effectively corresponds to 8 warps/block.
Therefore, the execution of Block0 in our example requires warp0...warp7 of this block,
which corresponds to threadIdx.x values in the range 0...255. To accomplish this, here is
what the warp schedulers schedule:
Note that these warps are only scheduled, not dispatched yet. They have to wait until
resources become available that allow them to be dispatched.
Here are two new instructions that work on “vector” 8-bit data:
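The book's listing is not reproduced in this excerpt; the two lines below are a representative sketch of these PTX ISA 5.0 instructions (dp4a and dp2a, named later in this paragraph), with their effect summarized in the comments. On the CUDA C side, Compute Capability 6.1 exposes the same operations through intrinsics such as __dp4a().

dp4a.s32.s32    d, a, b, c;   // d = c + byte-wise dot product of the four 8-bit values packed in a and b
dp2a.lo.s32.s32 d, a, b, c;   // d = c + dot product of two 16-bit values in a with two 8-bit values from the low half of b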
Note that these two instructions are only available in PTX ISA 5.0 (i.e., the Pascal family);
they allow the processing of four bytes, or two 16-bit words, in one clock cycle. Because of
this, the Pascal family can achieve 4× the performance when processing byte-size data, as
long as the code is compiled to take advantage of these instructions. As you can see here,
Nvidia's architectural trend is to turn its integer cores into something more like the MMX,
SSE, and AVX units that i7 CPUs have. With the Intel AVX instruction extensions, for
example, it is possible to process a 512-bit vector as either 8 64-bit numbers, 16 32-bit
integers, 32 16-bit integers, or 64 8-bit integers. The dp4a and dp2a instructions resemble
this a little bit. My guess is that in the Volta family (the next generation after Pascal), there
will be a much wider set of these instructions, potentially applicable to other data types.
may actually fit in 32 bits; based on this, the 24-bit multiplication instructions allowed one
to save either the upper or the lower 32 bits of the result. That way, you can use the lower
32 bits if you know that your numbers are small to start with. Alternatively, you can use
the upper 32 bits if you are storing fixed-point numbers and the lower bits only add more
resolution and can be ignored. If you cannot live without all 48 bits of the result, you can
always perform both multiplications and save both results for future use.
Example PTX instructions for the 24-bit data type are as follows:
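The listing itself does not appear in this excerpt; the following is a representative sketch of the 24-bit multiply instructions (plus the saturating add discussed next), with each instruction's effect given as a comment.

mul24.lo.s32   d, a, b;      // d = lower 32 bits of the 48-bit product a*b
mul24.hi.s32   d, a, b;      // d = upper 32 bits of the 48-bit product a*b
mad24.lo.s32   d, a, b, c;   // d = lower 32 bits of (a*b) + c
add.sat.s32    d, a, b;      // d = a + b, clamped to the MININT...MAXINT range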
Note that a “saturating” addition avoids overflow by limiting the result to the
MININT...MAXINT range; it only applies to the s32 type. For example, adding 1 to the
highest number (2^31 − 1) would cause an overflow, because the result (2^31) is an out-of-range
s32 value. However, add.sat limits the result to MAXINT (2^31 − 1) and avoids the overflow.
This is perfect for digital signal processing applications, where a lot of filter coefficients and
sampled voice or image data are being multiplied. The inaccuracy caused by the saturation
is inaudible to the human ear, but avoiding overflow prevents the results from becoming
completely wrong and meaningless and producing white-noise-like garbage at the output of
the filter.
Here, the @p is the guard predicate, which executes the conditional add instruction based
on the Boolean value of the p predicate register. The reverse of the predicate can also be
used for conditional instructions, as follows:
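The corresponding listing is not reproduced here; a minimal sketch of predicated PTX, assuming a predicate register p that was set by a compare, looks like this:

setp.lt.s32   p, a, b;       // p = (a < b)
@p    add.s32  d, d, c;      // executed only when p is TRUE
@!p   sub.s32  d, d, c;      // executed only when p is FALSE (the reverse predicate)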
The CC.CF is the carry flag in the condition register, which allows the carry to be used in
the second, third, and last additions to extend the addition beyond 32 bits. You can do the
same for a 128-bit multiplication by using the madc instruction, which multiply-accumulates
and uses the carry during the accumulation.
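As a sketch of this idea (the register names are ours, not from the book), a 128-bit addition chains the carry flag through four 32-bit additions:

add.cc.u32   d0, a0, b0;     // lowest 32 bits; writes the carry-out into CC.CF
addc.cc.u32  d1, a1, b1;     // adds the carry-in, writes a new carry-out
addc.cc.u32  d2, a2, b2;     // adds the carry-in, writes a new carry-out
addc.u32     d3, a3, b3;     // highest 32 bits; consumes the final carry-in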
With the .f32 PTX data type, the smallest representable number (at full resolution) is
≈ 1.17 × 10^−38 and the highest representable number (at full resolution) is ≈ 3.4 × 10^+38.
This conforms to the IEEE 754 single-precision floating point standard, which is one of the
most commonly used data types in any computer. Although the same format allows the rep-
resentation of smaller numbers (denormalized numbers), the resolution (i.e., the number of
mantissa bits) of these numbers is lower. Every floating point number consists of three fields:
• sign bit is a single-bit value, where 0 indicates positive and 1 indicates negative.
• exponent is an 8-bit value and determines the range of the number.
• mantissa is a 23-bit value and determines the precision of the number. The effective
precision of a float is actually 24 bits, because when a float is stored, the leading “1.”
of the mantissa is not stored; a normalized mantissa always leads with this so-called
hidden 1, which effectively adds one bit of resolution to the mantissa.
The idea behind a floating point format in general — as compared to a same size inte-
ger INT32 — is that we sacrifice precision to gain range. For example, comparing FP32
to INT32, while INT32 has a 32-bit fixed precision and a fixed range, FP32 only has a
24-bit effective precision, but allows us to represent significantly larger numbers, i.e., has a
much wider range. Note that range also implies being able to represent significantly smaller
numbers. Example PTX instructions for the FP32 data type are as follows:
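The original listing is not reproduced in this excerpt; a representative sketch of FP32 PTX instructions is:

add.f32          d, a, b;     // single-precision addition
mul.f32          d, a, b;     // single-precision multiplication
fma.rn.f32       d, a, b, c;  // fused multiply-add, rounded to nearest-even
div.approx.f32   d, a, b;     // fast approximate division
sqrt.approx.f32  d, a;        // fast approximate square root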
FIGURE 9.9 IEEE 754-2008 floating point standard and the floating point data types
supported by CUDA. The half data type is supported in Compute Capability 5.3 and
above, while float has been supported since the introduction of CUDA. Support for
double types started in Compute Capability 1.3.
As the accumulation of the numbers continues, the error grows, which effectively reduces the
resolution of the result. Although using double precision does not prevent the accumulation
of the error, it drastically reduces the ratio of the error to the result. Example PTX
instructions for the FP64 data type are as follows:
Here, we see the ability of the GPU to do “packed” computations (i.e., two additions in one
instruction). This is somewhat similar to the dp4a instruction, which operates on 4 packed
bytes in one instruction.
In other words, if you executed 1 billion floating point additions, 1 billion multiplications,
and 1 billion FMAs in a second, you computed 3 GFLOPS. With the FMA, you buy one
(multiplication) and get one (addition) free! So, you don’t get additional bonus points for
executing these two operations in one instruction.
The difference is that while the MultiplyAndRound operation rounds the resulting number
to the precision of the operands, which reduces the intermediate resolution, the Multiply
operation keeps the full, unrounded intermediate result. Thus, the fma family of operations
avoids rounding twice; only the final result is rounded. In modern CPUs and GPUs, fma is
the only type of operation that makes sense to use, while the double-rounding approach is
effectively obsolete.
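As a small illustration (a sketch using CUDA intrinsics, not code from the book), the two flavors look like this; the fused version performs only one rounding:

__device__ float two_roundings(float a, float b, float c)
{
    return __fadd_rn(__fmul_rn(a, b), c);   // round after the multiply, then again after the add
}
__device__ float one_rounding(float a, float b, float c)
{
    return __fmaf_rn(a, b, c);              // a*b kept at full precision internally; rounded once
}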
• We will make the Hflip() kernel (Code 6.8) core friendly. Its core-friendly versions are
Hflip2() (Code 9.2), Hflip3() (Code 9.4), Hflip4() (Code 9.6), and Hflip5() (Code 9.8),
• We will make the PixCopy() kernel (Code 6.9) core friendly. Its core-friendly versions
are PixCopy2() and PixCopy3() (both in Code 9.10),
The reason for the interleaved numbering is that whatever idea we apply to Vflip2()
(Code 9.3) can also be applied to design Hflip2() (Code 9.2); therefore, they are introduced
sequentially. The imflipG.cu program does not contain a lot of core computation; aside
from a few exceptions, it consists mostly of data movement. Because of this, we would not
expect a large kernel performance improvement from making it core friendly; however, even
in this case, we will observe a significant performance improvement with the ideas we are
introducing in this section. Generic data manipulation is a core operation, and if we improve
it we should expect an improvement. Additionally, generic kernel-level improvements —
relating to passing arguments into the kernel — are covered in this section, too. We will
use the experience we gain in this chapter to improve a much more core-intensive program,
imedgeG.cu, in Section 9.5.
The main() function of the imflipGCM.cu program is shown in Code 9.1. The added func-
tionality, as compared to imflipG.cu, is the introduction of the multi-dimensional variables
as follows:
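The actual declarations are not reproduced in this excerpt; a plausible sketch (dimGrid2D, ThrPerBlk, and BlkPerRow appear in the text, while dimBlock and ip.Vpixels are assumptions) is:

dim3 dimBlock(ThrPerBlk);                  // 1D block: ThrPerBlk threads
dim3 dimGrid2D(BlkPerRow, ip.Vpixels);     // 2D grid: BlkPerRow blocks in x, one grid row per image row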
In this example, dimGrid2D is a 2D variable. We will use it when we are passing 2D block di-
mensions. Aside from that, the program runs the correct kernel version based on a cascaded
set of switch statements. For example, the statement below
TABLE 9.2 Comparison of kernel performances between Hflip() and Hflip2(), as well
as between Vflip() and Vflip2().
Feature Box II Box III Box IV Box VII Box VIII
GPU K3000M GTX 760 Titan Z GTX 1070 Titan X
Engine GK104 GK104 2xGK110 GP104-200 GP102-400
Cores 576 1152 2x2880 1920 3584
Compute Cap 3.0 3.0 3.5 6.1 6.1
GM BW GBps 89 192 336 256 480
Peak GFLOPS 753 2258 8122 5783 10157
DGFLOPS 31 94 2707 181 317
Kernel Performance: imflipGCM astronaut.bmp out.bmp H 128 1
Hflip (ms) 20.12 6.73 4.17 2.15 1.40
GBps 11.82 35.35 57.02 110.78 169.5
Achieved (%) (13%) (18%) (17%) (43%) (35%)
Kernel Performance: imflipGCM astronaut.bmp out.bmp H 128 2
Hflip2 (ms) 17.23 5.85 3.63 1.98 1.30
GBps 13.81 40.69 65.54 119.85 182.34
Achieved (%) (16%) (21%) (20%) (47%) (38%)
Improvement 14% 13% 13% 8% 7%
Kernel Performance: imflipGCM astronaut.bmp out.bmp V 128 1
Vflip (ms) 20.02 6.69 4.11 2.12 1.40
GBps 11.88 35.56 57.83 112.19 169.5
Achieved (%) (13%) (19%) (17%) (44%) (35%)
Kernel Performance: imflipGCM astronaut.bmp out.bmp V 128 2
Vflip2 (ms) 17.23 5.84 3.67 1.96 1.30
GBps 13.81 40.71 64.85 121.63 182.34
Achieved (%) (16%) (21%) (19%) (48%) (38%)
Improvement 14% 13% 11% 8% 7%
Looking at these computations carefully, we see the integer division, which is not something
you want to put inside every thread's computation, as we witnessed many times before. We
also see two integer additions. Ironically, two integer additions can be more expensive than
a multiplication followed by an addition (which compiles to a single mad instruction) when
the GPU cores' integer unit does not support a three-operand addition like d = a + b + c;
PTX does not seem to have one.
Table 9.2 provides a comparison between the Hflip() and Hflip2() kernels. Although
some of the boxes are the same as the ones we saw previously, a new box is added (Box
VIII), which incorporates a Pascal-series GTX Titan X GPU. So, this table includes two
Kepler and two Pascal GPUs. No Maxwell is included, but you can expect its performance
characteristics to fall somewhere between these two families. Table 9.2 also provides a com-
parison between the Vflip() and Vflip2() kernels. Because the memory access patterns of the
Hflip() and Vflip() kernels are very similar, Table 9.2 shows nearly identical behavior
for both.
Code 9.2 shows the modified kernel, Hflip2(), in which the lines that are supposed to
compute BlkPerRow and RowBytes are simply commented out and these two values are passed
as arguments to the kernel, increasing the total number of arguments passed into the kernel
to 5 (from 3).
Because the computation of these two values (BlkPerRow and RowBytes) depends only on
values that do not change throughout the execution of the program once the user enters the
command line parameters (ThrPerBlk and ip.Hpixels), these values are readily calculated
inside main() and passed into the kernel call as follows:
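The exact launch line is not shown in this excerpt; a hedged sketch of the idea (the pointer names and NumBlocks are assumptions) is:

// BlkPerRow and RowBytes are computed once in main() and passed by value,
// so no thread has to recompute them inside the kernel.
Hflip2<<<NumBlocks, ThrPerBlk>>>(GPUCopyImg, GPUImg, ip.Hpixels, RowBytes, BlkPerRow);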
The bad news is that we cannot use the same trick as before by passing these values as
function arguments; they change for every single thread and they are not fixed.
ui MYrow = blockIdx.y;
ui MYcol = MYbid*ThrPerBlk + MYtid;
This not only makes the computation of MYrow just a simple register mov operation, it makes
the computation of MYcol extremely easy too. We know from Section 9.3.4 that computing
MYcol is a single instruction mad.lo.u32 d,a,b,c; despite its complicated look. So, with this
new index mapping, we converted the computation of the x,y coordinates to a mere two
PTX instructions. It gets better ... we no longer need the MYgtid variable either.
As we see from the code above, the y coordinate of the image has a one-to-one relationship
with blockIdx.y (i.e., the second dimension of the grid of blocks), which eliminates the need
for each thread to compute the y image coordinate inside the kernel. Once the y coordinate
is known, computing the x coordinate becomes easier too. This trick allows us to use the
dimension-computation hardware to get a free integer division! This is great, considering
that we were also getting a free for loop when we used the GPU's internal hardware correctly,
as we observed before. As we see in these examples, the trick with CUDA programming is
to avoid over-programming; the more you use the internal GPU hardware to reduce the
core instructions, the faster your programs will be.
(Boxes III, IV), a little healthier 5% on the mobile Kepler (Box II), but a nice 8% on the
two Pascal GPUs (Boxes VII and VIII).
ui MYrow = blockIdx.y;
ui MYcol2 = (MYbid*ThrPerBlk + MYtid)*2;
if (MYcol2 >= Hpixels) return; // col (and col+1) are out of range
ui MYmirrorcol = Hpixels - 1 - MYcol2;
ui MYoffset = MYrow * RowBytes;
ui MYsrcIndex = MYoffset + 3 * MYcol2;
ui MYdstIndex = MYoffset + 3 * MYmirrorcol;
have the if statement after writing the first RGB to make sure that we are not going
out of bounds in address range.
• Results of this kernel are shown in Table 9.4; this kernel gives us a worse performance
than the previous Hflip3() kernel. Can this be true?
__global__
void Vflip4(uch *ImgDst, uch *ImgSrc, ui Hpixels, ui Vpixels, ui RowBytes)
{
ui ThrPerBlk = blockDim.x; ui MYbid = blockIdx.x;
ui MYtid = threadIdx.x; ui MYrow = blockIdx.y;
ui MYcol2 = (MYbid*ThrPerBlk + MYtid)*2;
if (MYcol2 >= Hpixels) return; // col is out of range
ui MYmirrorrow = Vpixels - 1 - MYrow;
ui MYsrcOffset = MYrow * RowBytes;
ui MYdstOffset = MYmirrorrow * RowBytes;
ui MYsrcIndex = MYsrcOffset + 3 * MYcol2;
ui MYdstIndex = MYdstOffset + 3 * MYcol2;
// swap pixels RGB @MYrow , @MYmirrorrow
ImgDst[MYdstIndex] = ImgSrc[MYsrcIndex];
ImgDst[MYdstIndex + 1] = ImgSrc[MYsrcIndex + 1];
ImgDst[MYdstIndex + 2] = ImgSrc[MYsrcIndex + 2];
if ((MYcol2+1) >= Hpixels) return; // only col+1 is out of range
ImgDst[MYdstIndex + 3] = ImgSrc[MYsrcIndex + 3];
ImgDst[MYdstIndex + 4] = ImgSrc[MYsrcIndex + 4];
ImgDst[MYdstIndex + 5] = ImgSrc[MYsrcIndex + 5];
}
computation inside the if statement is harmless, the if statement itself is a huge problem.
The performance penalty it introduces totally negates the performance gain. The reason
for this is thread divergence, which occurs when the threads in a warp provide different
TRUE/FALSE answers to the same if statement; the divergence hurts parallelism, because
the GPU does its best work when all 32 threads of a warp do exactly the same thing.
__global__
void Hflip5(uch *ImgDst, uch *ImgSrc, ui Hpixels, ui RowBytes)
{
ui ThrPerBlk = blockDim.x;
ui MYbid = blockIdx.x;
ui MYtid = threadIdx.x;
ui MYrow = blockIdx.y;
ui MYcol4 = (MYbid*ThrPerBlk + MYtid) * 4;
if (MYcol4 >= Hpixels) return; // col (and the next 3 columns) are out of range
ui MYmirrorcol = Hpixels - 1 - MYcol4;
ui MYoffset = MYrow * RowBytes;
ui MYsrcIndex = MYoffset + 3 * MYcol4;
ui MYdstIndex = MYoffset + 3 * MYmirrorcol;
// swap pixels RGB @MYcol , @MYmirrorcol
for (ui a = 0; a<4; a++){
ImgDst[MYdstIndex - a * 3] = ImgSrc[MYsrcIndex + a * 3];
ImgDst[MYdstIndex - a * 3 + 1] = ImgSrc[MYsrcIndex + a * 3 + 1];
ImgDst[MYdstIndex - a * 3 + 2] = ImgSrc[MYsrcIndex + a * 3 + 2];
if ((MYcol4 + a + 1) >= Hpixels) return; // next pixel is out of range
}
}
the penalty for bad ideas seems to be much higher with it. As we witnessed many times
before, this seems to be the trend with more advanced architectures.
attempt to compute multiple pixels in a kernel? The answer is absolutely No. We just have
to do it right.
eliminating a significant amount of overhead at the trailing end of the kernel code. Second,
reading so many bytes from consecutive addresses allows the memory controller to aggregate
them and issue much larger memory-region reads; we know that this makes DRAM memory
accesses much more efficient. Readers are encouraged to increase the number of bytes being
copied in each kernel to see where there will be an “inflection point,” at which the steady
performance improvement with the increased amount of memory reads slows down or stops.
B = (double)ImgGPU[MYsrcIndex];
G = (double)ImgGPU[MYsrcIndex + 1];
R = (double)ImgGPU[MYsrcIndex + 2];
ImgBW[MYpixIndex] = (R + G + B) / 3.0;
}
Clearly, the architectural improvements in Pascal made some of our suggested techniques
irrelevant to boost kernel performance because they were both attempting to cover for
some hardware inefficiency (which doesn’t exist anymore). This is very typical in GPU
development in that it is hard to come up with performance improvement techniques that
can be applied to many consecutive families. One example is atomic variables, which were
extremely slow to process in Fermi; however, they are orders-of-magnitude faster in Pascal.
For this reason, during the development of this book, I intentionally stayed away from hard
statements such as “this technique is great.” Instead, I compared its impact on multiple
families and observed which family benefited from it more. The reader should be aware that
this trend will never stop. In the upcoming Volta family, many other hardware deficiencies
will be addressed, potentially yielding very different results using our code.
Understanding GPU Memory
Remember from the previous chapters that we introduced terms such as memory friendly
and core friendly and tried to make our programs one or the other. The reality is that
they are not independent concepts. Consider our GaussKernel(), which is a core-intensive
kernel. It is so core-intensive that it is pointless to try to make it memory friendly. As
a quantitative example, assume that this kernel is spending 10% of its time in memory
accesses and 90% of its time in core computations. Let us assume that you made the kernel
much more memory friendly by making memory accesses 2× faster. Now, instead of memory
and core taking 10+90 units of time, respectively, they will take 5+90 units of time; you
just made your program 5% faster! Instead, if you tried to make the core accesses 2×
faster, your program would take 10+45=55 units of time, which would make it 45% faster.
So, does this mean that we should pick one or the other and not bother with the other
one? Not really. Let us continue the same example. Assume that your memory+core time
was 10+90 units and you applied tricks that could make core accesses 6× faster, which
would drop your execution time to 10+15=25 units and make your kernel 4× faster overall.
Now, assume that you can still apply the same memory-friendly techniques to this kernel
and make memory accesses 2× faster, which would drop your execution time to 5+15=20
units. Now, instead of a puny 5% improvement, the same memory-friendly technique can
make your program 20% faster. The moral of the story is that the initial improvement looked
weak because your core accesses were very inefficient and were masking the potential
improvements due to memory friendliness. This is why memory and core optimizations
should be viewed as a “co-optimization” problem, rather than as individual — and
unrelated — problems.
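A tiny sketch of this arithmetic (an Amdahl-style calculation; the function is ours, not from the book):

#include <stdio.h>

/* New execution time when the memory and core portions are sped up independently. */
static double new_time(double t_mem, double t_core, double mem_speedup, double core_speedup)
{
    return t_mem / mem_speedup + t_core / core_speedup;
}

int main(void)
{
    printf("%.0f\n", new_time(10, 90, 2, 1));   /* 95 units: memory-only fix, ~5% gain */
    printf("%.0f\n", new_time(10, 90, 1, 6));   /* 25 units: core-only fix, 4x faster  */
    printf("%.0f\n", new_time(10, 90, 2, 6));   /* 20 units: both fixes together       */
    return 0;
}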
In this chapter, we will study the memory architecture of different Nvidia GPU families
and improve our kernels so their access to the data in different memory regions is efficient,
i.e., we will make them memory friendly. As I mentioned in the previous chapter, we will
learn how to use a very important tool named CUDA Occupancy Calculator in this chap-
ter, which will allow us to establish a formal methodology for determining kernel launch
parameters to ensure optimum resource utilization inside the SMs of the GPU.
480 GBps memory bandwidth due to its GDDR5X memory type, the GTX 1070 only has
a 256 GBps bandwidth, due to its lower-bandwidth GDDR5 memory type. In either case,
which memory is being advertised? The answer is the global memory (abbreviated GM),
which is the main memory of the GPU.
Starting with Section 3.5, we spent quite a bit of time talking about CPU memory and
how accessing consecutive large chunks was the best way to read data from the CPU’s main
memory. The GPU memory is surprisingly similar and almost every bit of intuition you
gained about how to efficiently access CPU memory will apply to GPU memory. The only
big difference is that because a significantly higher number of cores — as compared to a
CPU — need to be fed data simultaneously from the GPU memory, GDDR5 (and the newer
GDDR5X) are designed to provide data to multiple sources a lot more efficiently.
10.2 L2 CACHE
L2$ is where all of the data read from GM is cached. L2$ is coherent, meaning that an
address in L2$ means exactly the same address to every core in the GPU. So far, every
Nvidia GPU architecture we looked at had an L2$ as its Last Level Cache (LLC): Fermi
(Figure 9.2) had a 768 KB L2$, while Kepler (Figure 9.4) had 1.5 MB. Maxwell (Figure 9.6)
increased its L2$ to 2 MB, while Pascal (Figure 9.8) enjoys a large 4 MB L2$. While
GK110, GM200, and GP100 represent the biggest architectures in their respective fami-
lies, smaller (scaled-down) versions were released also; for example, although GTX 1070 is
a Pascal family GPU, it only has a 2 MB L2$ because this is what the GP104-200 engine
includes.
Table 10.1 lists some example GPUs and their global memory and L2$ sizes. Addition-
ally, a new metric, bandwidth per core, is shown to demonstrate how much memory a GPU
has, relative to its number of cores. Bandwidth per core was ≈0.1–0.17 GBps per core in the
Kepler family and went down to about 0.08 in Maxwell and came back up to 0.13 GBps per
core in Pascal. A comparison to two CPUs is shown below the thick line in Table 10.1.
A CPU is designed to have nearly 50× more bandwidth allocated per core (e.g., 0.134 vs.
6.4 GBps per core). Similarly, a CPU-based PC enjoys nearly 1000× more main memory
per core (e.g., 4.27 vs. 4096). Comparing the LLCs (which is L3$ in the CPU), a CPU is,
again, equipped with 2000× more (e.g., 1.14 vs. 2560). This shouldn’t come as a surprise
because the CPU architecture is significantly more sophisticated, allowing the CPU cores
to do a lot more work. However, to deliver this performance the CPU cores need a lot more
resources.
TABLE 10.1 Nvidia microarchitecture families and the size of global memory, L1$, L2$,
and shared memory in each one of them. Each cell is given as total/number-of-cores, with
the resulting per-core value in parentheses; /C means per core. The last two rows show the
same parameters for two different CPUs; for these, the Sh.Mem column lists the L3$ (the
CPU's LLC) instead.

Model (Engine)              #cores    Sh.Mem KB (KB/C)          L1$ KB (KB/C)    L2$ KB (KB/C)      GM MB (MB/C)        GM BW GBps (GBps/C)
GTX 550 Ti (GF110)          192       64/32  (2.00 combined)        —            256/192  (1.33)    1024/192   (5.33)    98.5/192 (0.513)
GTX 760 (GK104)             1152      64/192 (0.33 combined)        —            768/1152 (0.67)    2048/1152  (1.78)    192/1152  (0.167)
Titan Z (2×GK110)           2×2880    64/192 (0.33 combined)        —            1536/2880 (0.53)   6144/2880  (2.13)    336/2880  (0.117)
Tesla K80 (2×GK210)         2×2496    112/192 (0.58 combined)       —            1536/2496 (0.62)   12288/2496 (4.92)    240/2496  (0.096)
GTX 980 Ti (GM200)          2816      96/192 (0.50)                 —            2048/2816 (0.73)   6144/2816  (2.19)    224/2816  (0.080)
GTX 1070 (GP104-200)        1920      96/128 (0.75)             48/128 (0.38)    2048/1920 (1.07)   8192/1920  (4.27)    256/1920  (0.134)
Titan X (GP102)             3584      64/64  (1.00)                 —            4096/3584 (1.14)   12288/3584 (3.43)    480/3584  (0.134)
Xeon E5-2690 (Sandy Br EP)  8         L3$    (2560)             64/1   (64)      256/1    (256)     32768/8    (4096)    51.2/8    (6.400)
E5-2680v4 (Broadwell)       14        L3$    (2560)             64/1   (64)      256/1    (256)     262144/14  (18724)   76.8/14   (5.486)
For example, if a kernel required 20 KB shared memory in Fermi, Fermi hardware would
automatically split this memory as (shared memory=48 KB, L1$=16 KB), which is option 2.
Alternatively, Kepler would split it as (shared memory=32 KB, L1$=32 KB), which is more
efficient since it leaves more room for the hardware cache L1$. The decision as to when
this split takes place is made at runtime by the streaming multiprocessors (SM) hardware.
Because a different block can be running in each SM at different times, the split can be
changed as new blocks are scheduled to run in the same SM.
While Kepler improved the L1$ and shared memory usage efficiency by introducing a
(32 KB, 32 KB) split option in addition to the (16 KB, 48 KB) and (48 KB, 16 KB) that
Fermi had, Nvidia decided to place the shared memory and L1$ in totally separate areas
starting with the Maxwell generation. They decided that it is more efficient to have L1$
share the same area with texture memory, which is used in computer graphics operations.
In these newer families, the shared memory area is fixed in size and is dedicated to this
software cache duty.
CUDA until now. However, the amount of constant cache has varied slightly. While constant
memory is shared by the entire GPU, constant cache is local to the SMs. We will be using
constant memory to speed up our kernels in this chapter.
which allocates 3072 elements of type unsigned char (uc). This is an allocation of 3072 bytes
total, allocated at the block level. So, any block that you launch running multiple threads
like this will only be allocated a single 3072-byte shared memory area. If you, for example,
launch your blocks with 128 threads, these 128 threads share a total of 3072 bytes of
buffer area, corresponding to 3072/128=24 bytes for each thread. The initial part of the
kernel is identical to the original Hflip() and Vflip() kernels; however, the way each thread
addresses the shared memory depends on its tid. Because of the way the Hflip6() and
Vflip6() kernels are written, each thread processes only one pixel, i.e., 3 bytes. So, if they
are launched with 128 threads per block, they will only need 384 bytes of shared memory,
leaving the remaining 2688 bytes of shared memory sitting idle during their execution. On
the other hand, if we drop the allocated shared memory to 384 bytes, then this kernel cannot
be launched with more than 128 threads. So, there is an intricate trade-off in determining
how much shared memory to declare.
Once the shared memory is declared, the SM allocates this much from its entire shared
memory before launching a block. During its execution, these lines of code
PixBuffer[MYtid3]=ImgSrc[MYsrcIndex]; ...
copy pixels from GM (pointed to by ImgSrc) into shared memory (the PixBuffer array).
These lines copy it back to GM (at its flipped position, using the pointer ImgDst)
ImgDst[MYdstIndex]=PixBuffer[MYtid3]; ...
The following line ensures that the reads into the shared memory by all of the threads
in the block are completed before each thread is allowed to proceed.
__syncthreads();
TABLE 10.2 Kernel performances: Hflip() vs. Hflip6() and Vflip() vs. Vflip6().
Kernel Box II Box III Box IV Box VII Box VIII
Hflip (ms) 20.12 6.73 4.17 2.15 1.40
Hflip6 (ms) 18.23 5.98 3.73 1.83 1.37
Vflip (ms) 20.02 6.69 4.11 2.12 1.40
Vflip6 (ms) 18.26 5.98 3.65 1.90 1.35
// Each thread copies a pixel from GM into shared memory (PixBuffer[]) and back
__global__
void Hflip6(uch *ImgDst, uch *ImgSrc, ui Hpixels, ui RowBytes)
{
__shared__ uch PixBuffer[3072]; // holds 3*1024 Bytes (1024 pixels).
Table 10.2 compares the Hflip() and Vflip() kernels to their shared memory versions,
Hflip6() and Vflip6(). The small performance improvements are due to the previous im-
provements suggested in Chapter 9. So, it is safe to say that using the shared memory —
in this specific kernel — made no difference. Now it is time to find out why.
PixBuffer[MYtid3] = ImgSrc32[MYsrcIndex];
PixBuffer[MYtid3+1] = ImgSrc32[MYsrcIndex+1];
PixBuffer[MYtid3+2] = ImgSrc32[MYsrcIndex+2];
These lines correspond to nice, consecutive, clean, and fast GM reads and should be extremely
efficient. When written into the shared memory, they are stored in Big Endian notation,
where the smaller addresses correspond to higher-valued bytes. Therefore, after this read,
the shared memory holds the following three int’s:
Unfortunately, none of the bytes are where we want them to be; to put these 12 bytes in
their desired form, 6 byte swap operations are needed as follows using a pointer that points
to the shared memory as an int *:
Flipped 4 pixels are written back to GM from shared memory as 3 consecutive int’s again.
Table 10.3 compares this kernel to the previous one; results are nothing to brag loudly
about. Although we are getting closer, something is still not right.
TABLE 10.3 Kernel performances: Hflip(), Hflip6(), and Hflip7() using mars.bmp.
Kernel Box II Box III Box IV Box VII Box VIII
Hflip (ms) — — 7.93 — —
Hflip6 (ms) — — 7.15 — —
Hflip7 (ms) — — 6.97 — —
// Each kernel uses Shared Memory (PixBuffer[]) to read in 12 Bytes (4 pixels) into
// Shared Mem. and flips 4 pixels inside Shared Mem. and writes to GM as 3 int’s
// Horizontal resolution MUST BE A POWER OF 4.
__global__
void Hflip7(ui *ImgDst32, ui *ImgSrc32, ui RowInts)
{
__shared__ ui PixBuffer[3072]; // holds 3*1024*4 Bytes (1024*4 pixels).
ui ThrPerBlk=blockDim.x; ui MYbid=blockIdx.x;
ui MYtid=threadIdx.x; ui MYtid3=MYtid*3;
ui MYrow=blockIdx.y; ui MYoffset=MYrow*RowInts;
uch SwapB; uch *SwapPtr;
ui MYcolIndex=(MYbid*ThrPerBlk+MYtid)*3; if (MYcolIndex>=RowInts) return;
ui MYmirrorcol=RowInts-1-MYcolIndex; ui MYsrcIndex=MYoffset+MYcolIndex;
ui MYdstIndex=MYoffset+MYmirrorcol-2; // -2 is to copy 3 Bytes at a time
// read 4 pixel blocks (12B = 3 int’s) into Shared Memory
// PixBuffer: [B0 G0 R0 B1] [G1 R1 B2 G2] [R2 B3 G3 R3]
// Our Target: [B3 G3 R3 B2] [G2 R2 B1 G1] [R1 B0 G0 R0]
PixBuffer[MYtid3] = ImgSrc32[MYsrcIndex];
PixBuffer[MYtid3+1] = ImgSrc32[MYsrcIndex+1];
PixBuffer[MYtid3+2] = ImgSrc32[MYsrcIndex+2];
__syncthreads();
// swap these 4 pixels inside Shared Memory
SwapPtr=(uch *)(&PixBuffer[MYtid3]); //[B0 G0 R0 B1] [G1 R1 B2 G2] [R2 B3 G3 R3]
SWAP(SwapPtr[0], SwapPtr[9] , SwapB) //[B3 G0 R0 B1] [G1 R1 B2 G2] [R2 B0 G3 R3]
SWAP(SwapPtr[1], SwapPtr[10], SwapB) //[B3 G3 R0 B1] [G1 R1 B2 G2] [R2 B0 G0 R3]
SWAP(SwapPtr[2], SwapPtr[11], SwapB) //[B3 G3 R3 B1] [G1 R1 B2 G2] [R2 B0 G0 R0]
SWAP(SwapPtr[3], SwapPtr[6] , SwapB) //[B3 G3 R3 B2] [G1 R1 B1 G2] [R2 B0 G0 R0]
SWAP(SwapPtr[4], SwapPtr[7] , SwapB) //[B3 G3 R3 B2] [G2 R1 B1 G1] [R2 B0 G0 R0]
SWAP(SwapPtr[5], SwapPtr[8] , SwapB) //[B3 G3 R3 B2] [G2 R2 B1 G1] [R1 B0 G0 R0]
__syncthreads();
//write the 4 pixels (3 int’s) from Shared Memory into Global Memory
ImgDst32[MYdstIndex] = PixBuffer[MYtid3];
ImgDst32[MYdstIndex+1] = PixBuffer[MYtid3+1];
ImgDst32[MYdstIndex+2] = PixBuffer[MYtid3+2];
}
ui A, B, C, D, E, F;
// read 4 pixel blocks (12B = 3 int’s) into 3 long registers
A = ImgSrc32[MYsrcIndex];
B = ImgSrc32[MYsrcIndex + 1];
C = ImgSrc32[MYsrcIndex + 2];
This is the only access to GM before the flipped pixels are written back again to GM.
An important note here is that the data in GM is stored in Little Endian format, contrary
to shared memory. So, the goal of Hflip8() is to turn the following values in A, B, C
What makes this method efficient is that, because the kernel requires only a small number
of variables, all of them can be easily mapped to core registers by the compiler, making
their manipulation extremely efficient. Furthermore, the core operations used in the byte
manipulations are only shift, AND, and OR operations, which are the fundamental
operations of the ALU inside the cores and can be compiled into the fastest possible
instructions by the compiler.
As an example, let us analyze the following C statement:
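The exact statement is not reproduced in this excerpt; based on the four terms analyzed below, it is presumably of the following form (a reconstruction using the same A, B, C, and E names):

E = (B << 24) | ((A >> 8) & 0x00FF0000) | ((C << 8) & 0x0000FF00) | (B >> 24);   // E = [G1, B1, R2, G2]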
Here, there are 4 barrel shift/AND operations that result in the following 32-bit values:
(B<<24) = [G1, 0, 0, 0] (B>>24) = [0, 0, 0, G2]
( (A>>8) & 0x00FF0000 ) = ([0, B1, R0, G0] & [00, FF, 00, 00]) = [0, B1, 0, 0]
( (C<<8) & 0x0000FF00 ) = ([G3, B3, R2, 0] & [00, 00, FF, 00]) = [0, 0, R2, 0]
When they are OR’ed, the following result is obtained: E = [G1, B1, R2, G2], which is
listed as our goal in the code. The remaining manipulations are very similar, extracting
different bytes from the initial 32-bit values. An important note here is that the barrel shift
(i.e., shifting a 32-bit value by 0–31 positions to the left or right) as well as the AND and
OR instructions are native instructions of a GPU's integer unit.
Table 10.3 shows the results of Hflip8(). The significant improvement is due to the fact
that although accessing memory using any format other than the natural 32-bit types causes
significant inefficiencies, the same is not true for the GPU cores, because they are designed
to execute barrel shift and bitwise AND, OR instructions in a single cycle. As shown in
Code 10.3, the unpleasant data access patterns are confined to GPU core instructions
rather than being exposed as memory accesses.
// Improved Vflip6() kernel that uses shared memory to copy 4 B (int) at a time.
// It no longer worries about the pixel RGB boundaries
__global__
void Vflip7(ui *ImgDst32, ui *ImgSrc32, ui Vpixels, ui RowInts)
{
__shared__ ui PixBuffer[1024]; // holds 1024 int = 4096B
ui ThrPerBlk=blockDim.x; ui MYbid=blockIdx.x;
ui MYtid=threadIdx.x; ui MYrow=blockIdx.y;
ui MYcolIndex=MYbid*ThrPerBlk+MYtid; if (MYcolIndex>=RowInts) return;
ui MYmirrorrow=Vpixels-1-MYrow; ui MYsrcOffset=MYrow*RowInts;
ui MYdstOffset=MYmirrorrow*RowInts; ui MYsrcIndex=MYsrcOffset+MYcolIndex;
ui MYdstIndex=MYdstOffset+MYcolIndex;
// swap pixels RGB @MYrow , @MYmirrorrow
PixBuffer[MYtid] = ImgSrc32[MYsrcIndex];
__syncthreads();
ImgDst32[MYdstIndex] = PixBuffer[MYtid];
}
The last int contains [R2 B0 G3 R3], which is information relating to three different pixels.
Because of the bytes of each pixel not aligning at 32-bit int boundaries, almost every pixel
is guaranteed to require accesses to two int elements. So, it is not unreasonable to expect
2× the performance from its aligned version, Hflip8(), which we studied in Section 10.7.3.
__global__
void Vflip8(ui *ImgDst32, ui *ImgSrc32, ui Vpixels, ui RowInts)
{
__shared__ ui PixBuffer[2048]; // holds 2048 int = 8192B
ui ThrPerBlk=blockDim.x; ui MYbid=blockIdx.x;
ui MYtid=threadIdx.x; ui MYtid2=MYtid*2;
ui MYrow=blockIdx.y;
ui MYcolIndex=(MYbid*ThrPerBlk+MYtid)*2; if(MYcolIndex>=RowInts) return;
ui MYmirrorrow=Vpixels-1-MYrow; ui MYsrcOffset=MYrow*RowInts;
ui MYdstOffset=MYmirrorrow*RowInts; ui MYsrcIndex=MYsrcOffset+MYcolIndex;
ui MYdstIndex=MYdstOffset+MYcolIndex;
// swap pixels RGB @MYrow , @MYmirrorrow
PixBuffer[MYtid2]=ImgSrc32[MYsrcIndex];
if ((MYcolIndex+1)<RowInts) PixBuffer[MYtid2+1]=ImgSrc32[MYsrcIndex+1];
__syncthreads();
ImgDst32[MYdstIndex]=PixBuffer[MYtid2];
if ((MYcolIndex+1)<RowInts) ImgDst32[MYdstIndex+1]=PixBuffer[MYtid2+1];
}
TABLE 10.6 Kernel performances: Vflip(), Vflip6(), Vflip7(), Vflip8(), and Vflip9().
Kernel Box II Box III Box IV Box VII Box VIII
Vflip (ms) 20.02 6.69 4.11 2.12 1.40
Vflip6 (ms) 18.26 5.98 3.65 1.90 1.35
Vflip7 (ms) 9.43 3.28 2.08 1.28 0.82
Vflip8 (ms) 14.83 5.00 2.91 1.32 0.84
Vflip9 (ms) 6.76 2.61 1.70 1.27 0.82
// Uses shared memory as a temporary local buffer. Copies one byte at a time
__global__
void PixCopy4(uch *ImgDst, uch *ImgSrc, ui FS)
{
__shared__ uch PixBuffer[1024]; // Shared Memory: holds 1024 Bytes.
ui ThrPerBlk=blockDim.x; ui MYbid=blockIdx.x;
ui MYtid=threadIdx.x; ui MYgtid=ThrPerBlk*MYbid+MYtid;
if(MYgtid >= FS) return; // outside the allocated memory
PixBuffer[MYtid] = ImgSrc[MYgtid];
__syncthreads();
ImgDst[MYgtid] = PixBuffer[MYtid];
}
__global__
void PixCopy5(ui *ImgDst32, ui *ImgSrc32, ui FS)
{
__shared__ ui PixBuffer[1024]; // Shared Mem: holds 1024 int (4096 Bytes)
ui ThrPerBlk=blockDim.x; ui MYbid=blockIdx.x;
ui MYtid=threadIdx.x; ui MYgtid=ThrPerBlk*MYbid+MYtid;
if((MYgtid*4) >= FS) return; // outside the allocated memory
PixBuffer[MYtid] = ImgSrc32[MYgtid];
__syncthreads();
ImgDst32[MYgtid] = PixBuffer[MYtid];
}
10.7.8 PixCopy4(), PixCopy5(): Copying One versus 4 Bytes Using Shared Memory
Code 10.7 shows two new versions of the original PixCopy() kernel: the PixCopy4() kernel
uses shared memory to copy pixels one byte at a time, while PixCopy5() copies them one int
at a time. Table 10.7 compares these two new kernels to all of the previous versions. Similar
to many other results we have seen before, PixCopy4() does not show any performance
gain, while PixCopy5() is ≈2× faster on Pascal and ≈2–3× faster on Kepler. Furthermore,
compared to the previous versions of this kernel (PixCopy2() and PixCopy3()), only
PixCopy5() shows sufficient gain to become the fastest PixCopy() kernel so far.
__global__
void PixCopy6(ui *ImgDst32, ui *ImgSrc32, ui FS)
{
ui ThrPerBlk=blockDim.x; ui MYbid=blockIdx.x;
ui MYtid=threadIdx.x; ui MYgtid=ThrPerBlk*MYbid+MYtid;
if((MYgtid*4) >= FS) return; // outside the allocated memory
ImgDst32[MYgtid] = ImgSrc32[MYgtid];
}
__global__
void PixCopy7(ui *ImgDst32, ui *ImgSrc32, ui FS)
{
ui ThrPerBlk=blockDim.x; ui MYbid=blockIdx.x;
ui MYtid=threadIdx.x; ui MYgtid=ThrPerBlk*MYbid+MYtid;
if((MYgtid*4) >= FS) return; // outside the allocated memory
ImgDst32[MYgtid] = ImgSrc32[MYgtid];
MYgtid++;
if ((MYgtid * 4) >= FS) return; // next 32 bits
ImgDst32[MYgtid] = ImgSrc32[MYgtid];
}
A = ImgGPU32[MYsrcIndex]; // A=[B1,R0,G0,B0]
B = ImgGPU32[MYsrcIndex+1]; // B=[G2,B2,R1,G1]
C = ImgGPU32[MYsrcIndex+2]; // C=[R3,G3,B3,R2]
// Pix1 = R0+G0+B0;
Pix1 = (A & 0x000000FF) + ((A >> 8) & 0x000000FF) + ((A >> 16) & 0x000000FF);
// Pix2 = R1+G1+B1;
Pix2 = ((A >> 24) & 0x000000FF) + (B & 0x000000FF) + ((B >> 8) & 0x000000FF);
// Pix3 = R2+G2+B2;
Pix3 = (C & 0x000000FF) + ((B >> 16) & 0x000000FF) + ((B >> 24) & 0x000000FF);
// Pix4 = R3+G3+B3;
Pix4=((C>>8) & 0x000000FF) + ((C>>16) & 0x000000FF) + ((C>>24) & 0x000000FF);
ImgBW[MYpixAddr] = (double)Pix1 * 0.33333333;
ImgBW[MYpixAddr + 1] = (double)Pix2 * 0.33333333;
ImgBW[MYpixAddr + 2] = (double)Pix3 * 0.33333333;
ImgBW[MYpixAddr + 3] = (double)Pix4 * 0.33333333;
}
From Table 10.9, we see that BWKernel3() is nearly 2× as fast as our previous version,
BWKernel2(), on Keplers and a little less than that on Pascal GPUs. The memory access
efficiency of Pascal is evident here too, with the Pascal GPUs reaching a much higher
percentage of their maximum bandwidth. One interesting note to make here is that
BWKernel3() is a balanced core-memory kernel; this is demonstrated by how the Pascal
GPUs reach ≈50% of their peak bandwidth with this kernel. The memory operations in
this kernel are GM reads/writes, while the core operations include index computations,
barrel shifts, AND, OR operations, and double-precision multiplications.
G=0.0;
for (i = -2; i <= 2; i++){
for (j = -2; j <= 2; j++){
row=MYrow+i; col = MYcol + j;
indx=row*Hpixels+col; G+=(ImgBW[indx]*Gauss[i+2][j+2]);
}
}
ImgGauss[MYpixIndex] = G / 159.00;
The big question is where the Gauss[] array, which holds these filter coefficients, is stored.
These coefficients are stored in an array of double elements whose values do not change
throughout the execution of these kernels, as initially shown in GaussKernel() (Code 8.4):
__device__
double Gauss[5][5] = { { 2, 4, 5, 4, 2 },
...
The problem is that multiple threads want to access the exact same value. Intuitively, assume
that we launched a block with 128 threads. In this scenario, if each thread is calculating a
single pixel, it has to access all 25 constant values. This is a total of 128×25=3200 accesses to
these constant values (Gauss[i+2][j+2]) to compute 128 pixels. Even the simplest reasoning
indicates that when all 128 threads are trying to read these 25 constants, on average,
⌈128 ÷ 25⌉ = 6 threads will want to access the same constant simultaneously. In other
words, instead of an N:N pattern, where N threads access N values, there are lots of N:1
type accesses, where N threads want to read a single value.
The constant memory and constant cache are designed precisely for this purpose in
all Nvidia hardware. In the GaussKernel3() (Code 10.10), the constant coefficient ar-
ray is declared as follows. A total of 25×8=400 bytes of storage is needed for this
constant array.
__constant__
double GaussC[5][5] = { { 2, 4, 5, 4, 2 },
...
__constant__
double GaussC[5][5] = { { 2, 4, 5, 4, 2 },
{ 4, 9, 12, 9, 4 },
{ 5, 12, 15, 12, 5 },
{ 4, 9, 12, 9, 4 },
{ 2, 4, 5, 4, 2 } };
// Improved GaussKernel2. Uses constant memory to store filter coefficients
__global__
void GaussKernel3(double *ImgGauss, double *ImgBW, ui Hpixels, ui Vpixels)
{
ui ThrPerBlk=blockDim.x; ui MYbid=blockIdx.x;
ui MYtid=threadIdx.x; int row, col, indx, i, j;
double G; ui MYrow=blockIdx.y;
ui MYcol=MYbid*ThrPerBlk+MYtid; if (MYcol>=Hpixels) return;
ui MYpixIndex=MYrow*Hpixels+MYcol;
if ((MYrow<2) || (MYrow>Vpixels - 3) || (MYcol<2) || (MYcol>Hpixels - 3)){
ImgGauss[MYpixIndex] = 0.0;
return;
}else{
G = 0.0;
for (i = -2; i <= 2; i++){
for (j = -2; j <= 2; j++){
row = MYrow + i;
col = MYcol + j;
indx = row*Hpixels + col;
G += (ImgBW[indx] * GaussC[i + 2][j + 2]); // use constant memory
}
}
ImgGauss[MYpixIndex] = G / 159.00;
}
}
Table 10.10 compares the GaussKernel3() (Code 10.10) to its previous versions. It is
clear that the impact of using the constant memory varies significantly in Kepler versus
Pascal, as well as high-end versus low-end GPUs in each generation. This fact must be
due to the major changes Nvidia made in its memory controller hardware, including sepa-
rating the shared memory from texture/L1$ memory in Pascal, as well as their improved
memory sub-system in their high-end GPUs (e.g., Titan Z), making certain parts of the
hardware more efficient (e.g., constant memory). It is hard, at this point, to draw more con-
crete conclusions. We will keep building different versions of the kernel and observe their
performance impact.
Table 10.11 shows the results of GaussKernel4(), which are possibly the most interesting
set of results we have seen so far. The K3000M and the almighty Titan Z show a 6–7×
performance degradation, while the GTX 760 shows a modest 1.6–1.7× degradation. The two
Pascal GPUs also show a degradation of 1.6–1.7×, which is very similar to the GTX 760. At
this point the reader should be puzzled, but not about the 6–7×; what is more puzzling
is the major difference between the 1.6× and the 6–7×. Let us continue formulating different
versions of the kernel and the answer should start to become obvious. For now, it suffices to
say that shared memory accesses are not as cheap as one might think. They are most definitely
better than global memory accesses, but if the read/write patterns are not friendly to the
shared memory, performance will suffer greatly.
An interesting note about GaussKernel4(): assuming that we launch our blocks with 128
threads, and considering the shared memory size of 25 KB, each thread is responsible for
reading 25,600÷128=200 bytes from GM into shared memory using one for loop, and it
uses this information later in a different for loop.
The reason for adding “+4” should be clear to the reader: the extra elements store the 4 edge
pixels, which are wasted but make the code a little cleaner. By eliminating such a large
amount of space, we can increase the maximum number of threads to 1024. With 1024
set as the maximum number of threads, the shared memory needed is 8 × 1028 × 5 = 41,120
bytes, or ≈41 KB. The reality is that the Nvidia hardware will possibly allocate 48 KB for
each block that is running this kernel, because 41 KB is simply an “ugly” number to an
architecture that loves numbers like 32 or 48. This is still a very large amount of shared
memory and will possibly put a strain on shared memory resources. A wise thing to do would
be to reduce #define MAXTHGAUSSKN67 down to 128, which would allocate 5280 bytes (≈6 KB)
for each block, leaving room for each SM to run 8 blocks inside it (i.e., 48÷6=8 blocks). Or,
on Pascal GPUs, which have 64 KB of shared memory per SM, 10 blocks could be launched
in each SM without hitting the shared memory limit.
Results of GaussKernel6() are shown in Table 10.13, which indicate that computing a
single pixel per thread is a better idea than computing four; the additional for loops required
for multiple pixels are executed with internal kernel variables rather than by the Nvidia
hardware that gives us free for loops.
ui IsEdgeThread=(MYtid==(ThrPerBlk-1));
// Read from GM to Shared Memory. Each thread will read a single pixel
indx = MYpixIndex-2*Hpixels-2; // start 2 rows above & 2 columns left
if (!IsEdgeThread) {
...
}else{
...
}
In the GaussKernel6() code above, you will notice that the code is trying to avoid going
out of bounds in the index computations. So, it populates the Neighbors[][] array differently
based on whether a thread is an edge thread or not. These cases are designated by a Boolean
variable, IsEdgeThread, which is TRUE if the thread is an edge thread.
The necessity to check for edge pixels is eliminated by two different facts. Because the
following lines of code simply write 0.00 into the Gauss image for the two leftmost and two
rightmost pixels, there is no reason to worry about the edge pixels as long as we store the
pixel at coordinates (x−2, y−2) in shared memory. Because of this (−2, −2) “shifted storage,”
we can multiply with the constants without shifting during the actual computation, as follows:
ui ThrPerBlk=blockDim.x; ui MYbid=blockIdx.x;
ui MYtid=threadIdx.x; int indx, i, j;
double G; ui MYrow = blockIdx.y;
ui MYcol=MYbid*(ThrPerBlk-4)+MYtid; if (MYcol >= Hpixels) return;
ui MYpixIndex = MYrow * Hpixels + MYcol;
if ((MYrow<2) || (MYrow>Vpixels - 3) || (MYcol<2) || (MYcol>Hpixels - 3)) {
ImgGauss[MYpixIndex] = 0.0; return;
}
// Read from GM to Shr Mem. Each threads 1 pixel and 5 neighboring rows
// Each block reads ThrPerBlk pixels starting at (2 left) location
indx = MYpixIndex - 2 * Hpixels - 2; // start 2 rows above & 2 columns left
for (j = 0; j < 5; j++) {
Neighbors[MYtid][j] = ImgBW[indx];
indx += Hpixels; // Next iteration will read next row, same column
}
__syncthreads();
if (MYtid >= ThrPerBlk - 4) return; // Each block computes ThrPerBlk-4 pixels
G = 0.0;
for (i = 0; i < 5; i++) {
for (j = 0; j < 5; j++) {
G += (Neighbors[MYtid + i][j] * GaussC[i][j]);
}
}
ImgGauss[MYpixIndex] = G / 159.00;
}
double G[8] = { 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00 };
...
// Write all computed pixels back to GM
for (j = 0; j < 8; j++) {
ImgGauss[MYpixIndex] = G[j] / 159.00;
MYpixIndex += Hpixels;
}
which requires the results of these 8 pixels to be saved in an array named G[] that takes
up 8×8=64 bytes of storage. If the compiler maps this array to registers, it consumes 16 of
the 255 native 32-bit registers that Nvidia hardware allows per thread (each double occupies
two 32-bit registers), on top of everything else the kernel needs. If the compiler instead
decides to spill this array to global memory, then this defeats the purpose of using the
shared memory in the first place. As the reader can see from this example, trying to do
more work in each kernel increases the resource profile of the kernel, which limits the
number of blocks that can be launched to run this kernel, thereby limiting the performance.
In a lot of the cases we investigated, the solutions that involved kernels with small resource
profiles almost always won, although this is not a hard rule.
Trying to compute multiple pixels has other side effects, such as having to check whether
we are computing the last few rows, as shown by the code below:
ui isLastBlockY = (blockIdx.y == (gridDim.y - 1));  // TRUE only for the bottom row of blocks (gridDim.y, not blockDim.y)
if (isLastBlockY) {
indx=(Vpixels-2)*Hpixels + MYcol;
ImgGauss[indx]=0.0; ImgGauss[indx+Hpixels]=0.0; // last row-1, last row
}
• The #define MAXTHGAUSSKN8 256 restriction in GaussKernel8() was not just some
random number we picked; any value higher than 256 would have resulted in ex-
ceeding the maximum allowed shared memory in this kernel and the kernel would
not have compiled. Because GaussKernel7() uses a lot less shared memory, we could
have pushed it to 1024, without exceeding the 48 KB shared memory limitation of
the GPU.
• The GaussKernel8() kernel in Section 10.8.8 achieved its performance by using a lot
of shared memory, so it was destined to be uncompetitive against other kernels (e.g.,
GaussKernel7()) that are not so shared memory-hungry!
• GaussKernel7() seems to always perform better than GaussKernel6(), because both
have the same restriction on the threads/block.
FIGURE 10.1 CUDA Occupancy Calculator: Choosing the Compute Capability, max.
shared memory size, registers/kernel, and kernel shared memory usage. In this spe-
cific case, the occupancy is 24 warps per SM (out of a total of 64), translating to
an occupancy of 24 ÷ 64 = 38 %.
This kernel uses a total of 1024 × 8 × 5 = 40,960 bytes (40 KB) of shared memory; note
that 8 is sizeof(double). We estimate that this kernel needs 16 registers per thread. The
number of registers can be roughly determined from the initial part of the kernel, which is
shown below:
void GaussKernel7(...)
{
ui ThrPerBlk=blockDim.x; ui MYbid=blockIdx.x;
ui MYtid=threadIdx.x; int indx, i, j;
double G; ui MYrow = blockIdx.y;
ui MYcol=MYbid*(ThrPerBlk-4)+MYtid; if (MYcol >= Hpixels) return;
ui MYpixIndex = MYrow * Hpixels + MYcol;
...
This kernel requires nine 32-b registers and one 64-b register (i.e., double, which requires
2×32-b), totaling 11 32-b registers. Conservatively, it will need a few temporary registers
to move the data around. Let us assume 5 temporary registers. So, 16 is not a bad estimate
for the number of 32-b registers.
Remember that we ran and reported GaussKernel7() with 256 threads per block in
Table 10.15. When we plug these parameters into the CUDA Occupancy Calculator, we get
the plot in Figure 10.4 for the register-induced limitation (top) and shared memory-induced
limitation (bottom). Undoubtedly, the shared memory is the worse of the two limitations. In
fact, it is so bad that it doesn't even allow us to launch more than a single block per SM.
If we launch our blocks with 256 threads/block (i.e., 8 warps/block), our occupancy will
always be 8 warps/SM. Because the SM is limited to 64 warps/SM, our occupancy is only
8÷64 ≈ 13%, which is shown in Figure 10.5.
kernels without knowing the “occupancy” concept. Now we realize that we created a kernel
that only keeps each SM 13% occupied, leaving 87% of the warps unused. There should be
no doubt in the reader’s mind at this point that we could have done better. Of course, if we
launched the blocks with 1024 threads/block (i.e., 32 warps per block), then our occupancy
would have been 50%, although we are still limited by the same 40 KB shared memory.
By simply changing the threads/block parameter, we increased our occupancy from 13%
to 50%. One would wonder how this translates to performance, when we have 4× higher
occupancy. The answer is in Table 10.16: our runtime goes from 36 down to 19 ms. In other
words, performance increased substantially (almost 2× in this case). Although it is difficult to make
generalizations such as “x amount of occupancy corresponds to y amount of performance
increase,” etc., one thing is for sure: you will never know how high you can go unless
you increase your occupancy as much as possible. Notice that even with 1024 threads/block,
we are still at 50% occupancy. This begs the question about whether GaussKernel7() could
have done much better. This requires the programmer to go back to the drawing board and
design the kernel to require a little less shared memory.
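As a complementary, hedged sketch (not part of the book's own listings), the same occupancy question can also be asked at runtime through the CUDA occupancy API available in newer CUDA releases; the snippet below assumes that GaussKernel7() is declared in the same compilation unit and that its shared memory is statically declared inside the kernel (so 0 is passed as the dynamic shared memory size):

#include <cuda_runtime.h>
#include <stdio.h>

// Sketch: report how many blocks of GaussKernel7() can be resident on one SM
// for a given threads/block value, and the resulting warp occupancy.
void ReportOccupancy(int ThrPerBlk)
{
   cudaDeviceProp prop;
   int numBlocks;

   cudaGetDeviceProperties(&prop, 0);
   // static __shared__ usage is read from the kernel image; pass 0 for dynamic shared memory
   cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, GaussKernel7, ThrPerBlk, 0);
   int activeWarps = numBlocks * (ThrPerBlk / 32);
   int maxWarps = prop.maxThreadsPerMultiProcessor / 32;
   printf("%d block(s)/SM => %d of %d warps (%.0f%% occupancy)\n",
          numBlocks, activeWarps, maxWarps, 100.0 * activeWarps / maxWarps);
}
// e.g., ReportOccupancy(256);   // the GaussKernel7() launch configuration used above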
FIGURE 10.4 Analyzing the GaussKernel7(), which uses (1) registers/thread ≈ 16,
(2) shared memory/kernel=40,960 (40 KB), and (3) threads/block=256. It is clear
that the shared memory limitation does not allow us to launch more than a single
block with 256 threads (8 warps). If you could reduce the shared memory down to
24 KB by redesigning your kernel, you could launch at least 2 blocks (16 warps, as
shown in the plot) and double the occupancy.
For the GaussKernel7() case study, threads/block and limiting conditions are shown in
Figure 10.5. The top plot shows that you could have gone up to 32 warps if you launched
1024 threads/block, although this would have made you hit the shared memory limitation,
shown in Figure 10.4. The implication of the occupancy concept is far-reaching: maybe even
a technically worse kernel can perform better, if it is not resource-hungry, because it can
“occupy” the SM better. To phrase differently: writing kernels with a high resource profile
has the danger of putting shackles on the kernel’s ankles. Maybe it is better to design kernels
with much smaller profiles, although they are not seemingly high performance. When a lot
of them are running on the SM, at a much higher occupancy, maybe they will translate
to an overall higher performance. This is the beauty of GPU programming. It is art as
much as a science. You can be sure that the resource limitations will not go away even 10
generations from today, although you might have a little more shared memory in each, a
little more registers, etc. So, this “resource-constrained thinking” will always be a part of
GPU programming.
To provide an analogy: if you have two runners, one a world-class marathon runner
and the other an ordinary guy like me, who will win the race? This is an
obvious answer. But, here is the second question: who will win the race, if you put shackles
on the marathon runner’s feet? This is what a technically excellent, but resource-heavy
kernel is like. It is a marathon runner with shackles on his feet. You are destroying his
performance due to resource limitations.
memory limitation does not let us do so. Because of the shared memory limitation, we
are stuck at 25% occupancy (16 warps out of the maximum 64 warps). What we deduce from
Figure 10.6 is that we can only reach 100% occupancy if our shared memory requirement drops
to 6 KB per block; with 256 threads (8 warps) per block, 8 resident blocks are needed to reach
64 warps, and 48 KB ÷ 8 = 6 KB.
The reader is now encouraged to go back to all of the Gauss kernels we designed and
check their CUDA occupancy to see if we missed a promising candidate: a kernel that
seemed to perform poorly but has a much smaller resource profile and, hence, a much higher
occupancy potential. It is likely that such a kernel can beat any of the others, because it can
reach 100% occupancy.
CHAPTER 11
CUDA Streams
In this book so far, we have focused on improving the kernel execution time. In a
CUDA program, first the data must be transferred from the CPU memory into the GPU
memory; it is only when the data is in GPU memory that the GPU cores can access it and
process it. When the kernel execution is done, the processed data must be transferred back
into the CPU memory. One clear exception to this sequence of events is when we are using
the GPU as a graphics machine; if the GPU is being used to render the graphics that are
required for a computer game, the output of the processed data is the monitor (or multiple
monitors), which is directly connected to the GPU card. So, there is no loss of time to
transfer data back and forth between the CPU and the GPU.
However, the types of programs we focus on in this book are GPGPU applications,
which use the GPU as a general purpose processor; these computations are, in the
end, totally under CPU control and are originated by the CPU code. Therefore, the
data must be shuttled back and forth between the CPU and GPU. What is worse, the data
transfer rate between these two dear friends (CPU and GPU) is bottlenecked by the PCIe
bus, which has a much lower bandwidth than either the CPU–main memory bus or the GPU–global
memory bus (see Sections 4.6 and 10.1 for details).
As an example, what if we are performing edge detection on the astronaut.bmp file?
Table 11.1 (top half) shows the runtime results of this operation on four different GPUs
(two Kepler and two Pascal), broken down into the three components of the execution time
listed below:
• CPU→GPU transfers take, on average, 31% of the total execution time.
• Kernel execution inside the GPU takes, on average, 39% of the execution time.
• GPU→CPU transfers take, on average, 30% of the total execution time.
For the purposes of this discussion, the details for the Kepler versus Pascal are not
important. It suffices to say that the CPU→GPU, kernel execution, and GPU→CPU compo-
nents of the runtime are approximately a third of the total time. Taking the Pascal GTX
1070 as an example, we wait to transfer the entire 121 MB astronaut.bmp image into the
GPU (which takes 25 ms); once this transfer is complete, we run the kernel and finish per-
forming edge detection (which takes another 41 ms); once the kernel execution is complete,
we transfer it back to the CPU (which is another 24 ms).
When we look at the results for the horizontal flip in Table 11.1 (bottom half), we see a
different ratio for the execution time of operations; the CPU→GPU and GPU→CPU transfers
are almost half of the total execution time (53% and 43%, respectively), while the kernel
execution time is negligible (only 4% of the total).
The striking observation we make from these two cases is that most of the execution time
goes to shuttling the data around, rather than doing the actual (and useful) work! Although
TABLE 11.1 Runtime for edge detection and horizontal flip for astronaut.bmp (in ms).
Operation    Task               Titan Z   K80   GTX 1070   Titan X   Avg %
Edge         CPU→GPU tfer            37    46         25        32     31%
Detection    Kernel execution        64    45         41        26     39%
             GPU→CPU tfer            42    18         24        51     30%
             Total                  143   109         90       109    100%
Horizontal   CPU→GPU tfer            39    48         24        32     53%
Flip         Kernel execution         4     4          2         1      4%
             GPU→CPU tfer            43    17         24        34     43%
             Total                   86    69         50        67    100%
Kernel execution times are lumped into a single number for clarity. GTX Titan Z and K80
use the Kepler architecture, while the GTX 1070 and the Titan X (Pascal) use the Pascal
architecture.
our efforts to decrease the kernel execution time were well justified so far, Table 11.1
makes it clear that we cannot just worry about the kernel execution time when trying to
improve our program’s overall performance. The data transfer times must be considered as
an integral part of the total runtime. So, here is our motivating question for this chapter:
could we have started processing a part of this image once it was in GPU memory, so we
wouldn’t have to wait for the entire image to be transferred? The answer is most definitely
YES; the idea is that the CPU←→GPU data transfers can be overlapped with the kernel
execution because they use two different pieces of hardware (PCI controller and the GPU
cores) that can work independently and, more importantly, concurrently. Here is an analogy
that helps us understand this concept:
TABLE 11.2 Execution timeline for the second team in Analogy 11.1.
Time Cindy Keith Gina
0–20 Bring 20 coconuts to Keith — —
20–40 Bring 20 more coconuts to Keith Harvest 20 coconuts —
40–60 Bring 20 more coconuts to Keith Harvest 20 coconuts Deliver 20 coconuts
60–80 Bring 20 more coconuts to Keith Harvest 20 coconuts Deliver 20 coconuts
80–100 Bring 20 more coconuts to Keith Harvest 20 coconuts Deliver 20 coconuts
100–120 — Harvest 20 coconuts Deliver 20 coconuts
120–140 — — Deliver 20 coconuts
Cindy brings 20 coconuts at a time from the jungle to Keith. Keith harvests the coconuts
immediately when he receives them. When 20 of them are harvested, Gina delivers them
from Keith’s harvesting area to the competition desk. When the last 20 coconuts are
delivered to the competition desk, the competition ends.
If any of the executions can be overlapped, as shown in Table 11.2, we can use the
following notation to express the total runtime, by breaking the individual tasks (say, Task1,
denoted as T1) into subtasks that can be partially overlapped with other subtasks (e.g.,
T1a, T1b, and T1c). Some of these subtasks might be serially executed (e.g., T1a) and some
can be overlapped with other subtasks (e.g., T1b can be overlapped with T2a). So, the total
runtime is (for the case of Team 2):
• Task1 = 100 = 20 non-overlapped + 20 overlapped with Task2 + 60 fully overlapped,
i.e., T1 = T1a + T1b + T1c = 20 + 20 + 60,
Pipelined runtime = T1a + (T1b||T2a) + (T1c||T2b||T3a) + (T2c||T3b) + T3c    (11.2)
= 20 + (20||20) + (60||60||60) + (20||20) + 20 = 140.
By dividing memory into pages, which are typically 4 KB in size, we could build the illusion that we have many more pages
available than the actual physical memory has. If a user requires 1 MB, he or she requires
256 pages of storage. The OS needs 4 MB, i.e., 1024 pages. If we used the hard disk to store
all of the pages that are much less frequently accessed and, say, we allocated a 24 MB area
on the disk as our virtual memory, we could store all of the pages there. Although a user
needs 256 pages, the way a program works is that a page is used heavily before another page
is needed; this locality of usage means that we can keep the pages that the user currently
requires in physical memory, while the rest can stay on the disk. In this scheme, only
virtual addresses are given to any application that is running; applications are not aware of the
actual physical address their data is sitting in.
void *p;
...
AllocErr=cudaMallocHost((void**)&p, IMAGESIZE);
if (AllocErr == cudaErrorMemoryAllocation){
...
}
Here, the cudaMallocHost() function works very similarly to malloc(), but allocates pinned
(page-locked) physical memory from the CPU memory. It returns a CUDA error code if it could not allocate the
requested amount of physical memory. If it is successful, it places the pointer to the
allocated physical memory in the first argument (&p). Note that cudaMallocHost() is a CUDA
API whose sole purpose is to facilitate fast CPU←→GPU transfers over
the PCIe bus. It might sound a little strange to the reader that CUDA (the GPU-side
people) is being used to allocate CPU-side resources, but it is not: this API cooperates
peacefully with the CPU side to allocate CPU-side resources for use strictly in GPU-side
functionality (i.e., transferring data from the CPU to the GPU and vice versa).
The reader, at this point, might be analyzing every character in the code above trying to
find any difference between this transfer and the ones we have seen in almost every code
listing in the past GPU chapters. There isn’t! What we saw over and over again in the
previous code listings was all synchronous transfers. We just haven’t seen anything else. In
a synchronous transfer, the cudaMemcpy() API function is called and the code continues to
the next line when cudaMemcpy() finishes execution. Until then, the execution “hangs” on
that line and cannot continue. The bad news is that nothing below this line can be executed
while the cudaMemcpy() is in progress, even if there is work that doesn’t depend on this
memory transfer.
Almost nothing is different, except that the API is named cudaMemcpyAsync() and it is
associated with a stream. When the execution hits the line with cudaMemcpyAsync(), it
doesn’t actually execute this transfer; rather, it queues this transfer task in a stream (whose
number is given in stream[0]) and immediately moves onto the next C command. While we
knew that the data transfer was complete upon reaching the next line with a synchronous
transfer, we cannot make any such assumption with an asynchronous transfer. The only
assumption we can make is that “the transfer is scheduled and will be completed sooner
or later.” The good news is that we can move onto doing other useful work, such as some
computations that do not depend on this data being transferred. While we are doing this
extra work, the transfer continues in the background and will eventually be completed.
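As a minimal, hedged illustration (not the book's full listing), an asynchronous copy differs from the synchronous one only by the extra stream argument; the buffer names below reuse those from the surrounding code, and TheImg is assumed to have been allocated with cudaMallocHost():

cudaStream_t stream0;
cudaStreamCreate(&stream0);
// queue the copy in stream0 and return immediately
cudaMemcpyAsync(GPUImg, TheImg, IMAGESIZE, cudaMemcpyHostToDevice, stream0);
// ... other, independent work can be done here while the copy proceeds ...
cudaStreamSynchronize(stream0);   // block only when the transferred data is actually needed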
The copy engine of stream[0] queues this host-to-device (CPU→GPU) memory
copy request and the execution immediately moves on to the next line. The program has no
idea about when the transfer will actually take place. Although there are API functions to
check the status of the queue, in general, the program does not need to be worried about
it. It is guaranteed that whenever the PCIe bus is available for a transfer, this transfer will
immediately initiate. Until then, it will sit in the queue, waiting for the right time to start the
transfer. In the meantime, because the program has advanced to the next line, something on
the CPU side can be executed, or something else can be queued up in stream[0]. Better yet,
something else can be queued up in a different stream. This is how a streamed program keeps
shoving the tasks into the queues of different streams and the streams execute concurrently.
cudaStream_t stream[MAXSTREAMS];
...
BWKernel2S <<< dimGrid2DSm5, ThrPerBlk, 0, stream[i] >>> (...);
The parameters are not shown intentionally because they are not relevant. The important
part is the inclusion of a stream ID, which is stream[i] in this case. This is a kernel call
that is associated with a specific stream ID. If we omit the last two parameters as follows:
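(A sketch of the omitted-parameter form, reusing the names from the call above:)

BWKernel2S <<< dimGrid2DSm5, ThrPerBlk >>> (...);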
this is exactly the same kernel launch, with the exception that it is assigned to the default
stream. The default stream is a special stream in CUDA for unstreamed operations (pretty
much everything we have done so far up until this chapter). The unstreamed kernel launches
work exactly the way the streamed ones do; however, you cannot execute them in a streamed
fashion and take advantage of the execution overlapping among different streams.
cudaGetDeviceProperties(&GPUprop, 0);
// Shows whether the device can transfer in both directions simultaneously
deviceOverlap = GPUprop.deviceOverlap;
...
printf("This device is %s capable of simultaneous CPU-to-GPU and GPU-to-CPU data
transfers\n", deviceOverlap ? "" : "NOT");
So, what would happen to our expectations in this case if we are using a low-end
GPU, which cannot overlap incoming and outgoing data transfers (in which case
GPUprop.deviceOverlap = FALSE)? If the runtimes are 25, 40, and 30 ms and we use 10 chunks,
2.5 of the incoming transfer would be exposed, leaving 22.5 to overlap with the 40; only 22.5
of the 40 is overlapped, leaving another 17.5; this 17.5 can be partially overlapped with the 30;
so, we could expect a runtime of 2.5 + (22.5||22.5) + (17.5||30) = 2.5 + 22.5 + 30 = 55.
If the incoming and outgoing transfers could be performed concurrently (i.e.,
GPUProp.deviceOverlap = TRUE), then we would expect the streamed runtime to go down
to 2.5 + (40||40||40) + 3 = 45.5. In this case, 2.5 is the non-overlapped portion of the
incoming transfer and 3 is the non-overlapped portion of the outgoing transfer.
Although this looks like the savings is minimal (i.e., 55 vs. 45.5), let us analyze a more
dramatic case: if we are performing the horizontal flip, as shown on the bottom of Table 11.1,
the CPU→GPU transfer time is 24, the kernel execution time is 2, and the GPU→CPU transfer
time is 24. The serialized version takes 24 + 2 + 24 = 50 ms, while the streamed version
on a high-end GPU is expected to take 2.4 + (21.6||2||21.6) + 0.2 + 2.4 = 26.6 ms. However, on a
low-end GPU that cannot simultaneously perform incoming and outgoing transfers, we
expect the total runtime to be (24||2) + 24 = 48 ms; in other words, only the kernel and one
of the transfers can be overlapped, and the incoming and outgoing transfers are serialized.
It clearly doesn’t sound impressive that the streaming allowed us to go from 50 to 48!
However, on a high-end GPU, the savings is drastic, going from 50 to 26.6, which is almost
a 2× improvement. It should be very clear to the reader at this point that the number of
GPU cores is hardly the only parameter to look at when buying a GPU!
cudaStream_t stream[MAXSTREAMS];
...
if(NumberOfStreams != 0){
for (i = 0; i < NumberOfStreams; i++) {
chkCUDAErr(cudaStreamCreate(&stream[i]));
}
}
if (NumberOfStreams != 0) {
for (i = 0; i < NumberOfStreams; i++) {
chkCUDAErr(cudaStreamDestroy(stream[i]));
}
}
cudaStreamSynchronize(stream[i]);
With this call, we are telling the CUDA runtime: do not do anything else until the stream
with the ID contained in stream[i] has completed all of its copy and kernel operations. This
will completely execute everything that is in this stream’s FIFO (First In First Out) buffer,
and then control will move on to the next line. If we wanted to synchronize every stream we
created, we could use something like the following:
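(The exact listing is not shown in this excerpt; a minimal sketch, reusing the stream array from the surrounding code, would be a loop such as the one below. Calling cudaDeviceSynchronize() instead would be an even broader barrier that waits for all GPU work.)

for (i = 0; i < NumberOfStreams; i++) {
   chkCUDAErr(cudaStreamSynchronize(stream[i]));   // drain stream i's FIFO completely
}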
You would use something like this if you wanted to be sure that a large batch of things
you queued up — in many streams — has all completed before moving onto another part
of the program that will queue a bunch of other stream jobs.
#define MAXSTREAMS 32
...
int main(int argc, char **argv)
{
char Operation = 'E';
float totalTime, Time12, Time23, Time34; // GPU code run times
cudaError_t cudaStatus;
cudaEvent_t time1, time2, time3, time4;
int deviceOverlap, SMcount;
ul ConstMem, GlobalMem;
ui NumberOfStreams=1,RowsPerStream;
cudaStream_t stream[MAXSTREAMS];
...
if (NumberOfStreams > 32) {
printf("Invalid NumberOfStreams (%u). Must be 0...32.\n", NumberOfStreams);
...
}
if (NumberOfStreams == 0) {
TheImg=ReadBMPlin(InputFileName); // Read the input image into a memory
if(TheImg == NULL) { ...
}else{
TheImg=ReadBMPlinPINNED(InputFileName); // Read input img into a PINNED mem
if(TheImg == NULL) { ...
}
...
cudaGetDeviceProperties(&GPUprop, 0);
...
deviceOverlap=GPUprop.deviceOverlap; // bi-directional PCIe transfers?
ConstMem = (ul) GPUprop.totalConstMem;
GlobalMem = (ul) GPUprop.totalGlobalMem;
// CREATE EVENTS
cudaEventCreate(&time1); ... cudaEventCreate(&time4);
// CREATE STREAMS
if(NumberOfStreams != 0){
for (i = 0; i < NumberOfStreams; i++) {
chkCUDAErr(cudaStreamCreate(&stream[i]));
}
}
...
// Deallocate CPU, GPU memory
cudaFree(GPUptr);
// DESTROY EVENTS
cudaEventDestroy(time1); ... cudaEventDestroy(time4);
// DESTROY STREAMS
if (NumberOfStreams != 0) for(i=0; i<NumberOfStreams; i++)
chkCUDAErr(cudaStreamDestroy(stream[i]));
...
}
When NumberOfStreams is 0, regular (virtual) memory is allocated for the image using the ReadBMPlin() function and the
code is no different than its previous version, with the exception that a different version of
the kernel is used; kernels whose names end with 'S' (e.g., BWKernel2S()) are designed
to work in a streamed environment. Even in the case of NumberOfStreams=0, the same ker-
nel is launched. This will allow us to provide a fair performance comparison between the
streamed and synchronous version of the same kernel.
As an example, the synchronous (case 0) and single-streamed (case 1) versions of the same
operation are shown below. Here is the synchronous version:
case 0: cudaMemcpy(GPUImg,TheImg,IMAGESIZE,cudaMemcpyHostToDevice);
cudaEventRecord(time2, 0); // Time stamp @ begin kernel exec
switch(Operation){
case 'E': BWKernel2S<<<dimGrid2D,ThrPerBlk>>>(GPUBWImg, ..., 0);
...
}
cudaMemcpy(CopyImg,GPUResultImg,IMAGESIZE,cudaMemcpyDeviceToHost);
case 1: cudaMemcpyAsync(GPUImg,TheImg,...,cudaMemcpyHostToDevice,stream[0]);
cudaEventRecord(time2, 0); // Time stamp @ begin kernel exec
switch(Operation){
case 'E': BWKernel2S<<<dimGrid2D,ThrPerBlk,0,stream[0]>>>(...);
...
}
cudaMemcpyAsync(CopyImg,GPU...,cudaMemcpyDeviceToHost,stream[0]);
Note that when there is only a single stream (in the streaming version of the code),
we can simply use stream[0]. In this case, if there is any improvement in performance, it
will be due to using pinned memory, not from the streaming effect; the operation is, again,
serialized, because there is no other stream that can be used for execution overlapping.
First, the CPU→GPU transfer finishes, then the kernel execution, then the GPU→CPU transfer.
However, because pinned memory makes the transfers over the PCIe bus much faster, the
execution speeds up. This is the reason why the single-stream version of the program is
analyzed thoroughly.
• For Stream 0, the bottom two rows require the data from the top two rows of Stream 1.
• For Stream 1, the top two rows require the data from the bottom two rows of Stream 0.
• For Stream 1, the bottom two rows require the data from the top two rows of Stream 2.
• ...
It should be clear that the very first stream (i.e., Stream #0) and the very last stream
(i.e., Stream #NumberOfStreams-1) only have one-sided dependence because their two edge
rows will be set to 0.00 anyway. The “inside” streams have two-sided dependence because
they must calculate their top and bottom two rows. In Code 11.4, a simple approach is
taken: the crossing rows between any two streams are calculated in a non-streamed fashion
at the very beginning; that way, when the streamed execution starts, the processed data is
ready and the streams are only required to process the remaining rows.
if (Operation == 'H') {
for (i = 0; i < NumberOfStreams; i++) {
StartRow = i*RowsPerStream;
StartByte = StartRow*IPHB;
CPUstart = TheImg + StartByte;
GPUstart = GPUImg + StartByte;
RowsThisStream = (i != (NumberOfStreams - 1)) ?
RowsPerStream : (IPV-(NumberOfStreams-1)*RowsPerStream);
cudaMemcpyAsync(GPUstart,CPUstart,RowsThisStream*IPHB,
cudaMemcpyHostToDevice,stream[i]);
cudaEventRecord(time2, 0); // begin CPU --> GPU transfer
Hflip3S<<<dimGrid2DS,ThrPerBlk,0,stream[i]>>>(GPUResultImg,
GPUImg,IPH,IPV,IPHB,StartRow);
cudaEventRecord(time3, 0); // end of kernel exec
CPUstart = CopyImg + StartByte;
GPUstart = GPUResultImg + StartByte;
cudaMemcpyAsync(CPUstart,GPUstart,RowsThisStream*IPHB,
cudaMemcpyDeviceToHost,stream[i]);
}
}
The part of the code that performs the preprocessing is shown below:
Although it looks a little weird, the list above is fairly straightforward. As an example,
between Chunk0 and Chunk1, the overlapping 10 rows were transferred and converted to
B&W during the preprocessing, using BWKernel(). Because the next kernel (GaussKernel())
requires a 5x5 matrix, which accesses a 2 pixel-wide area on each pixel’s surroundings, the
10 rows that are transferred can only be used to calculate the Gaussian version of the inner
6 rows because we lose 2 rows from the top and 2 from the bottom. Similarly, because Sobel
filtering needs a one pixel-wide access on its surroundings, 6 rows of Gaussian can only be
used to calculate 4 rows of Sobel. The threshold needs a one-pixel neighborhood access,
which means that 2 rows of threshold can be calculated from 4 rows of Sobel (i.e., one row
in one chunk, and another row in the adjacent chunk).
This means that when the preprocessing is complete, we have 10 rows of B&W, 6 rows
of Gauss, 4 rows of Sobel, and 2 rows of threshold fully computed between any adjacent
chunks. The goal of the asynchronous processing part is to compute the rest. In Code 11.4,
in many places, the code uses a variable StartRow to either launch the appropriate kernels or
perform data transfers. This is because all of the kernels are modified to accept a starting
row number, as we will see very shortly. Additionally, the variable StartByte is used to
determine the memory starting address that corresponds to this row (i.e., StartRow), as
shown below:
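(The exact lines are not reproduced here; consistent with the Code 11.4 excerpt above, the computation is essentially the following, where IPHB is the number of bytes per image row:)

StartRow  = i * RowsPerStream;    // first row assigned to stream i
StartByte = StartRow * IPHB;      // byte offset of that row inside the image buffer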
Note that, much like the synchronous version of the code, the GPU memory is allocated
only once using a single bulk cudaMalloc() and different parts of this bulk memory area
are used to store B&W, Gauss, etc. This is possible because the pinned memory is only a
relevant concept when it comes to CPU memory allocation. When pinned memory is being
used, the GPU memory allocation is exactly the same as before, using cudaMalloc(). For
example, in the memory transfer cudaMemcpyAsync(GPUstart, CPUstart,...), GPUstart is no
different than the GPU memory pointer we used in the synchronous version of the program;
however, CPUstart is pointing to pinned memory.
__global__
void Hflip3S(uch *ImgDst,uch *ImgSrc,ui Hpixels,ui Vpixels,ui RowBytes,ui StartRow)
{
ui ThrPerBlk=blockDim.x; ui MYbid=blockIdx.x;
ui MYtid=threadIdx.x; ui MYrow = StartRow + blockIdx.y;
if (MYrow >= Vpixels) return; // row out of range
ui MYcol = MYbid*ThrPerBlk + MYtid;
if (MYcol >= Hpixels) return; // col out of range
ui MYmirrorcol=Hpixels-1-MYcol; ui MYoffset=MYrow*RowBytes;
ui MYsrcIndex=MYoffset+3*MYcol; ui MYdstIndex=MYoffset+3*MYmirrorcol;
// swap pixels RGB @MYcol , @MYmirrorcol
ImgDst[MYdstIndex] = ImgSrc[MYsrcIndex];
ImgDst[MYdstIndex + 1] = ImgSrc[MYsrcIndex + 1];
ImgDst[MYdstIndex + 2] = ImgSrc[MYsrcIndex + 2];
}
ui MYrow = blockIdx.y;
Because as many blocks in the y dimension as the number of rows in the image were
launched, an error check was not necessary to see if the row number went out of range.
It sufficed to use blockIdx.y to determine the row number that is being flipped. In the new
kernel, Hflip3S(), the modification is simple: add a starting row number, as follows:
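(These are the corresponding lines from the Hflip3S() listing above:)

ui MYrow = StartRow + blockIdx.y;
if (MYrow >= Vpixels) return;   // row out of range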
Without the error checking on the MYrow variable, we would have some issues when, for
example, we have 5 streams; dividing 5376 rows by 5 and rounding up, CEIL(5376,5) gives 1076
rows per stream. So, the first four chunks would cover 1076 rows each and the last one only
1072 rows. Because we are launching the second dimension of every kernel with 1076 rows,
the last kernel launch would go beyond the image and would error out. The error checking
prevents that.
B = (ui)ImgGPU[MYsrcIndex];
G = (ui)ImgGPU[MYsrcIndex + 1];
R = (ui)ImgGPU[MYsrcIndex + 2];
ImgBW[MYpixIndex] = (double)(R + G + B) * 0.333333;
}
__constant__
double GaussC[5][5] = { { 2, 4, 5, 4, 2 },
{ 4, 9, 12, 9, 4 },
{ 5, 12, 15, 12, 5 },
{ 4, 9, 12, 9, 4 },
{ 2, 4, 5, 4, 2 } };
__global__
void GaussKernel3S(double *ImgGauss,double *ImgBW,ui Hpixels,ui Vpixels,ui
StartRow)
{
ui ThrPerBlk = blockDim.x;
ui MYbid = blockIdx.x;
ui MYtid = threadIdx.x;
int row, col, indx, i, j;
double G;
ui MYrow = StartRow+blockIdx.y;
ui MYcol = MYbid*ThrPerBlk + MYtid;
if (MYcol >= Hpixels) return; // col out of range
if (MYrow >= Vpixels) return; // row out of range
__device__
double Gx[3][3] = { { -1, 0, 1 },
{ -2, 0, 2 },
{ -1, 0, 1 } };
__device__
double Gy[3][3] = { { -1, -2, -1 },
{ 0, 0, 0 },
{ 1, 2, 1 } };
__global__
void SobelKernel2S(double *ImgGrad, double *ImgTheta, double *ImgGauss, ui
Hpixels, ui Vpixels, ui StartRow)
{
ui ThrPerBlk = blockDim.x;
ui MYbid = blockIdx.x;
ui MYtid = threadIdx.x;
int indx;
double GX,GY;
indx+=Hpixels;
GX += (-2*ImgGauss[indx-1]+2*ImgGauss[indx+1]);
indx+=Hpixels;
GX += (-ImgGauss[indx-1]+ImgGauss[indx+1]);
GY += (ImgGauss[indx-1]+2*ImgGauss[indx]+ImgGauss[indx+1]);
ImgGrad[MYpixIndex] = sqrt(GX*GX + GY*GY);
ImgTheta[MYpixIndex] = atan(GX / GY)*57.2957795; // 180.0/PI = 57.2957795;
}
}
__global__
void ThresholdKernel2S(uch *ImgResult, double *ImgGrad, double *ImgTheta, ui
Hpixels, ui Vpixels, ui RowBytes, ui ThreshLo, ui ThreshHi, ui StartRow)
{
ui ThrPerBlk = blockDim.x; ui MYbid = blockIdx.x;
ui MYtid = threadIdx.x;
ui MYrow = StartRow + blockIdx.y; if(MYrow >= Vpixels) return;
ui MYcol = MYbid*ThrPerBlk + MYtid; if(MYcol >= Hpixels) return;
unsigned char PIXVAL; double L, H, G, T;
TABLE 11.3 Streaming performance results (in ms) for imGStr, on the astronaut.bmp image.
Operation   # Streams   Titan Z        K80           GTX 1070     Titan X
Edge        SYNCH       143            109           90           109
                        (37+64+42)     (46+45+18)    (25+41+24)   (32+26+51)
            1           103            68            59           70
            2            92            66            53           60
            3            81            69            51           56
            4            79            56            50           54
            5            75            60            50           54
            6            73            66            50           53
            7            88            55            50           53
            8            82            65            47           51
Hflip       SYNCH        86            69            50           67
                        (39+4+43)      (48+4+17)     (24+2+24)    (32+1+34)
            1            44            30            22           42
            2            44            23            19           33
            3            44            22            19           30
            4            44            24            18           28
            5            44            23            18           28
            6            44            21            18           27
            7            44            23            18           27
            8            44            21            17           26
Synchronous results are repeated from Table 11.1, where the three numbers (e.g.,
37+64+42) denote the CPU→GPU transfer time, kernel execution time, and
GPU→CPU transfer time, respectively.
This ordering is true for all of the streams within themselves; however, there is absolutely no
guarantee that SobelKernel() of Stream 1 will run at any time we can predict relative
to Stream 0, or any other stream for that matter. Let us now go through a few runtime
scenarios. For readability, the names of the kernels will be shortened (e.g., BW instead of
BWKernel()); additionally, the stream names are shortened (e.g., S0 to denote Stream 0).
• Time 12–18 (Copy Engine): At Time=12, the copy engine has two options; it
can initiate a CPU→GPU transfer for either S2 or S3. Let us assume that it initiates
CPU→GPU [S2] [6 ms].
• Time 16–22 (Copy Engine): At Time=16, because S0 has finished all of its kernels,
it is now ready for its GPU→CPU transfer [S0] [6 ms], which the copy engine can
initiate.
• Time 18–24 (Copy Engine): If this is a GPU that supports bi-directional PCI
transfers simultaneously (i.e., GPUProp.deviceOverlap=TRUE, as detailed in Sec-
tion 11.4.5), it can also initiate the CPU→GPU transfer [S3], which is the final
CPU→GPU transfer. Assuming GPUProp.deviceOverlap=TRUE, all CPU→GPU trans-
fers are done at Time=24, as is the GPU→CPU transfer [S0]. Note that because
the kernel execution is still continuing non-stop, the GPU→CPU transfer time [S0] is
100% coalesced. Furthermore, the CPU→GPU transfer times for S1, S2, and S3 are
also 100% coalesced, which totals 18 ms. Only the CPU→GPU transfer time [S0] is
completely exposed (6 ms).
• Time 16–26 (Kernel Engine): Because we are studying the perfect scenario, let
us assume that the kernel engine launches and finishes BW [S1] [1 ms], Gauss [S1]
[3 ms], Sobel [S1] [5 ms], and Thresh [S1] [1 ms] next. We are at Time=26, at which
point the kernel execution for S1 is complete.
• Time 24–32 (Copy Engine): Notice that the copy engine has to sit idle between
Time 24–26, because no stream has data to transfer. At Time=26, because the kernel
execution of S1 is complete, the copy engine can initiate its GPU→CPU transfer [S1]
[6 ms]. Because kernel execution will continue during this interval, this transfer time
is also 100% coalesced.
• Time 26–36 (Kernel Engine): Let us assume the best case scenario again: BW
[S2] [1 ms], Gauss [S2] [3 ms], Sobel [S2] [5 ms], and Thresh [S2] [1 ms] next. We
are at Time=36, at which point the kernel execution for S2 is complete.
• Time 36–42 (Copy Engine): Copy engine sits idle between Time 32–36 and initiates
GPU→CPU transfer [S2] [6 ms], which will be 100% coalesced again.
• Time 36–46 (Kernel Engine): The next best case scenario is BW [S3] [1 ms], · · · ,
Thresh [S3] [1 ms]. We are at Time=46, at which point the kernel execution
for S3 is complete. Indeed, all four streams have completed their kernel execution at
this point. The only remaining work is the GPU→CPU transfer [S3].
• Time 46–52 (Copy Engine): Copy engine sits idle between Time 42–46 and initiates
GPU→CPU transfer [S3] [6 ms], which will be 100% exposed, because there is no
concurrent kernel execution that can be launched.
A total of 52 ms will be required for four streams, which is the sum of all kernel executions
(40 ms based on our approximations), the exposed CPU→GPU transfer (6 ms), and
the exposed GPU→CPU transfer (6 ms).
• Time 0–24 (Copy Engine): We know that the copy engine will complete all incom-
ing transfers until Time=24.
• Time 0–9 (Kernel Engine): Kernel engine launches and finishes BW [S0] [1 ms],
Gauss [S0] [3 ms], Sobel [S0] [5 ms].
• Time 9–18 (Kernel Engine): Kernel engine launches and finishes BW [S1] [1 ms],
Gauss [S1] [3 ms], Sobel [S1] [5ms].
• Time 18–27 (Kernel Engine): Kernel engine launches and finishes BW [S2] [1 ms],
Gauss [S2] [3 ms], Sobel [S2] [5 ms].
• Time 27–36 (Kernel Engine): Kernel engine launches and finishes BW [S3] [1 ms],
Gauss [S3] [3 ms], Sobel [S3] [5 ms].
• Time 36–37 (Kernel Engine): Kernel engine executes Thresh [S0] [1 ms].
• Time 37–43 (Copy Engine): Until Time=37, no stream had any GPU→CPU trans-
fer available. At Time=37, S0 is the first stream that has it available, so the copy
engine can schedule it. A large portion of the transfer time is exposed.
• Time 37–40 (Kernel Engine): Kernel engine executes Thresh [S1] [1 ms], Thresh
[S2] [1 ms], and Thresh [S3] [1 ms]. Because of the concurrent GPU→CPU transfers,
these execution times are 100% coalesced.
• Time 43–49 (Copy Engine): After the previous transfer, now the copy engine can
initiate the next GPU→CPU transfer [S1] [6 ms]. This transfer time is 100% exposed,
along with the next two transfer times.
• Time 49–55 (Copy Engine): The next GPU→CPU transfer [S2] [6 ms].
• Time 55–61 (Copy Engine): The next GPU→CPU transfer [S3] [6 ms].
In this worst case scenario, the runtime is 61 ms, which is not too much worse than
the best case result we analyzed, which was 52 ms. However, this 15% performance penalty
could have been prevented by careful scheduling of events.
• In Windows, double-click the executable for nvvp. In Unix, type: nvvp &
• In either case, you will have the screen in Figure 11.1.
• Once in the Profiler, go to File → New Session. Click Next.
• In the “Create New Session: Executable Properties” window, type the name of the
CUDA executable you want to profile under “File” and fill in the command line
arguments if the program needs it (for example, our imGStr program will need the
arguments we discussed in Section 11.5).
• In the “Profiling Options” window, you can change the profiling options. Usually, the
default settings suffice. You can learn more about each setting’s options through the
following link [20].
• The Nvidia Visual Profiler main window is divided into different sections. The timeline
section represents the timing results of your program. The upper part is the CPU
timeline while the lower part is the GPU timeline. You can learn more about the
timeline options, through the following link [21].
• Click Finish to close the “Profiling Options” window. The profiler will run the code
and will save the results for visualization. After this step, you are ready to view the
timing of events.
FIGURE 11.4 Nvidia NVVP results with 2 and 4 streams, on the K80 GPU.
There are so many things to try, and some of these ideas will eventually lead to the most
efficient answer. Thank goodness for the Visual Profiler; with it, you can see what kind of
runtime execution pattern each of these ideas induces.
PART III
More To Know
CHAPTER 12
CUDA Libraries
Mohamadhadi Habibzadeh
University at Albany, SUNY
Tolga Soyata
University at Albany, SUNY
In the previous chapters of this book, we learned how to create a CUDA program without
the help of any “prepackaged” library, like the ones we will see in this chapter. This was
intentional;
a deeper understanding of the inner-workings of the GPU can only be gained —
and appreciated — when you create a program with the primitives that allow you to get close
to the metal (i.e., the GPU cores). Surely, assembly language is a little too low-level, but the
libraries we will see in this chapter barely require you to know how the GPU works; so, they
are too high-level. The clear choice for the right “level” was plain and simple CUDA, which
we based our GPU programming on up to this point in the book. In real life, however, when
you are developing GPU programs, it is tedious to have to build everything from scratch.
For example, there is no way you can build and optimize a matrix multiplication code the
way Nvidia engineers can, because they spend weeks, even months, optimizing it. Because of this,
CUDA programmers typically use high-level libraries, such as cuBLAS, cuFFT, etc., and
use CUDA itself as a glue language to make everything work together. Every now and
then, you will find something that there is no library for; well, this is when you go back
to good old CUDA. Aside from that, there is nothing wrong with using the libraries,
especially because they are provided free of charge.
12.1 cuBLAS
The roots of Basic Linear Algebra Subprograms (BLAS) go back to the late 1970s; it was
initially written in Fortran. Note: the exciting programming language of the late 1950s,
Formula Translation (Fortran), provided a way for programmers to do scientific computation
without requiring assembly language. Having BLAS on top of that was a Godsend.
their programs easily. Because vector–vector, matrix–vector, and matrix–matrix operations
have very different computational costs, BLAS comes in three different flavors (levels).
The BLAS operations can be categorized into the following:
• BLAS Level 1 operations include algebraic operations on vectors. For example,
adding two vectors or dot multiplication is considered a level 1 BLAS operation.
BLAS Level 1 vector-vector operations have the generic form:
y←α·x+y
z←α·x+y
where x, y, and z are vectors and α is a scalar. Shown above are the different flavors
of BLAS Level 1, allowing two or three vectors.
However, the use of legacy APIs is not recommended by Nvidia. Creating cuBLAS
programs is similar to other CUDA implementations. Typically, every cuBLAS code can be
implemented in the following six stages.
// Number of rows
#define M 6
// Number of columns
#define N 5
// Converting column-major to row-major format
#define IDX2C(i,j,ld) (((j)*(ld))+(i))
float* devPtrA;
cublasStatus_t stat;
cudaError_t cudaStat;
// allocate the M x N matrix in device memory (this call is implied by the error check below)
cudaStat = cudaMalloc((void**)&devPtrA, M*N*sizeof(float));
if (cudaStat != cudaSuccess){
printf ("device memory allocation failed");
return EXIT_FAILURE;
}
cublasStatus_t stat;
cublasHandle_t handle;
stat = cublasCreate(&handle);
if (stat != CUBLAS_STATUS_SUCCESS){
printf ("CUBLAS initialization failed\n");
return EXIT_FAILURE;
}
cublasStatus_t
cublasSetMatrix(int rows, int cols, int elemSize, const void *A, int lda, void *B,
int ldb)
This API copies a rows × cols matrix A from host memory into a matrix B that has already
been allocated in device memory (e.g., with cudaMalloc()). It assumes that each element is of
size elemSize. lda and ldb represent the leading dimensions of matrices A and B, respectively;
for the column-major storage that cuBLAS uses, this is the number of rows of each matrix
as laid out in memory.
cublasSetMatrix() assumes that matrices are in column-major format.
cublasStatus_t
cublasSscal (cublasHandle_t handle, int n, const float *alpha, float *x, int incx)
where handle is a pointer to a context. This function multiplies each element of vector x (in
device memory) by the scalar alpha. The length of the vector is determined by n and incx
determines the stride size.
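A minimal usage sketch, reusing the handle and devPtrA from the snippets above (the scale factor 2.0 is arbitrary):

float alpha = 2.0f;
// multiply all M*N elements of the device matrix devPtrA by alpha, with stride 1
stat = cublasSscal(handle, M*N, &alpha, devPtrA, 1);
if (stat != CUBLAS_STATUS_SUCCESS) printf("cublasSscal failed\n");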
cublasStatus_t
cublasGetMatrix(int rows, int cols, int elemSize, const void *A, int lda, void *B,
int ldb)
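This is the counterpart of cublasSetMatrix(): it copies the rows × cols matrix A from device memory back into the host matrix B, with lda and ldb again being the leading dimensions. A minimal usage sketch, assuming the host array a from the surrounding example:

stat = cublasGetMatrix(M, N, sizeof(float), devPtrA, M, a, M);
if (stat != CUBLAS_STATUS_SUCCESS) printf("data download from the device failed\n");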
cudaFree (devPtrA);
cublasDestroy(handle);
free(a);
return EXIT_SUCCESS;
#define M 6
#define N 5
#define IDX2C(i,j,ld) (((j)*(ld))+(i))
stat = cublasCreate(&handle);
if (stat != CUBLAS_STATUS_SUCCESS){
printf ("CUBLAS initialization failed\n");
return EXIT_FAILURE;
}
cudaFree (devPtrA);
cublasDestroy(handle);
12.2 CUFFT
cuFFT is the CUDA Fast Fourier Transform (FFT) library, which allows working in the
frequency domain by computing the frequency components of images or audio signals. In
digital signal processing (DSP), convolution in the time domain corresponds to simple
multiplication in the frequency domain. Using the FFT allows filters to be built that
work in the frequency domain, rather than the time domain. Note that FFT operations
involve complex numbers and add transform overhead, so they are not always faster than
working directly in the time domain. However, for the right applications, they can vastly
simplify the necessary operations.
#define NX 64
#define NY 64
#define NZ 128
cufftHandle plan;
cufftComplex *data1, *data2;
cudaMalloc((void**)&data1, sizeof(cufftComplex)*NX*NY*NZ);
cudaMalloc((void**)&data2, sizeof(cufftComplex)*NX*NY*NZ);
#define NX 256
#define BATCH 1
cufftHandle plan;
cufftComplex *data;
cudaMalloc((void**)&data, sizeof(cufftComplex)*(NX/2+1)*BATCH);
if (cudaGetLastError() != cudaSuccess){
fprintf(stderr, "Cuda error: Failed to allocate\n");
return;
}
...
if (cudaDeviceSynchronize() != cudaSuccess){
fprintf(stderr, "Cuda error: Failed to synchronize\n");
return;
}
...
cufftDestroy(plan);
cudaFree(data);
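The plan creation and execution calls are elided (“...”) in the listing above. A hedged sketch of what they would typically look like is given below; given the (NX/2+1)-sized buffer, an in-place 1D real-to-complex transform is assumed:

// create a 1D real-to-complex plan for NX-point transforms, BATCH at a time
if (cufftPlan1d(&plan, NX, CUFFT_R2C, BATCH) != CUFFT_SUCCESS){
   fprintf(stderr, "CUFFT error: plan creation failed\n");
   return;
}
// execute the transform in place; the NX/2+1 complex outputs overwrite the input buffer
if (cufftExecR2C(plan, (cufftReal*)data, data) != CUFFT_SUCCESS){
   fprintf(stderr, "CUFFT error: ExecR2C failed\n");
   return;
}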
• Image Compression
• Image Filters
1D linear filter, 1D window sum, convolution, 2D fixed linear filters, rank filters,
fixed filters
• Image Geometry
Resize, remap, rotate, mirror, affine transform, perspective transform
• Image Morphological
Dilation, erode
• Image Statistic and Linear
Sum, min, max, mean, Mean StdDev, norms, DotProd, integral, histogram,
error, etc.
• Image Support and Data Exchange
Set, copy, convert, scale, transpose, etc.
• Image Threshold and Compare
• Signal
Not working on a picture, but rather a “signal”
Many subclasses, e.g., all the arithmetic and logical operations
Many more such as Cauchy, Cubrt, Arctan, etc.
Following is a sample code that implements a box filter using NPP:
//Creating the vector on the device and copying the host vector to the device
//Thrust:
thrust::device_vector<int> device_vec = host_vec;
//C:
int* device_vec_c;
cudaMalloc((void**)&device_vec_c, 1000 * sizeof(int));
cudaMemcpy(device_vec_c, host_vec_c, 1000 * sizeof(int), cudaMemcpyHostToDevice);
The listing below shows how to create a vector of length 16 and assign random numbers
to it using the Thrust library on the host side, then copy this vector to the device side and
fill all of its values with 10.
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/fill.h>
#include <cstdlib>
int main(void)
{
//Creating the host vector
thrust::host_vector<int> host_vec(16);
//Assigning random numbers to the host vector
thrust::generate(host_vec.begin(), host_vec.end(), rand);
//Copying the host vector to the device
thrust::device_vector<int> device_vec = host_vec;
//Filling all device values with 10
thrust::fill(device_vec.begin(), device_vec.end(), 10);
return 0;
}
Many algorithms are implemented efficiently in the Thrust library. Some of these algo-
rithms are
• Reductions: Reduces a vector to a single value. Examples are max, min, sum, etc. For
example, the following code computes the sum of numbers from 1 to 100:
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
int main(void)
{
thrust::host_vector<int> host_vec(100);
for (int i = 0; i < 100; i++)
host_vec[i] = i+1;
//Copying the host vector to the device
thrust::device_vector<int> device_vec = host_vec;
//reduce sums up all the values in a vector and returns the final value (5050 here)
int x = thrust::reduce(device_vec.begin(), device_vec.end(), 0,
thrust::plus<int>());
return 0;
}
#include <thrust/find.h>
thrust::device_vector<int> device_vec;
thrust::device_vector<int>::iterator iter;
...
thrust::find(device_vec.begin(), device_vec.end(), 3);
#include <thrust/sort.h>
thrust::device_vector<int> device_vec;
...
thrust::sort(device_vec.begin(), device_vec.end());
CHAPTER 13
Introduction to OpenCL
Chase Conklin
University of Rochester
Tolga Soyata
University at Albany, SUNY
In this chapter, we will be familiarized with OpenCL, which is the most popular GPU
programming language, excluding CUDA. This chapter is designed to show how OpenCL
simplifies writing multiplatform parallel programs. From previous chapters, we have become
familiar with programs such as imflip and imedge. Though OpenCL and CUDA both exist
to write highly parallel code, you will quickly see how the approach for writing programs
in each differs.
13.1.1 Multiplatform
OpenCL supports many different devices, but it is up to the device manufacturer to im-
plement the drivers that allow OpenCL to work on their devices. These different imple-
mentations are known as platforms. Depending on your hardware, your computer may have
multiple OpenCL platforms available; for example one from Intel to run on the integrated
graphics, and one from Nvidia to run on their discrete graphics.
OpenCL considers devices to be in one of three categories: (1) CPU, (2) GPU, and
(3) accelerator. Of the three, only the accelerator should be unfamiliar. Hardware acceler-
ators include FPGAs and DSPs, or devices such as Intel’s Xeon Phi (See Section 3.9 for a
detailed introduction to Xeon Phi).
13.1.2 Queue-Based
Unlike CUDA, which either operates on synchronous blocking calls or operates asyn-
chronously using streams, OpenCL’s execution is queue-based: all commands are
dispatched to a command queue and execute once they reach the head of the queue.
As mentioned previously, OpenCL can support multiple different devices. In fact, it can
support running on multiple devices simultaneously. To do this, one only needs to create a
queue for each device.
__kernel void
hflip(
__global const unsigned char *input,
__global unsigned char *output,
const int M,
const int N)
{
int idx = get_global_id(0);
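// (the computation of row, col, start, and end from idx is elided in this excerpt)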
output[start+col] = input[end-col-2];
output[start+col+1] = input[end-col-1];
output[start+col+2] = input[end-col];
output[end-col-2] = input[start+col];
output[end-col-1] = input[start+col+1];
output[end-col] = input[start+col+2];
}
The __global__ identifier is gone, and is replaced with __kernel. Pointers to global
memory must be prefaced with the __global identifier, and other values should be declared
with const. Read-only memory should also be declared with const so that attempts to
modify it will raise reasonable errors rather than crash the program. CUDA’s threadIdx and
blockIdx are replaced by a call to get_global_id(). The argument passed to get_global_id()
determines which dimension the id is in. For example, if this was a 2D kernel, we could write
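(The 2D example line is not shown in this excerpt; it would be something along these lines:)

int col = get_global_id(0);   // index in the first dimension
int row = get_global_id(1);   // index in the second dimension (valid only for a 2D NDRange)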
Otherwise, this OpenCL kernel is very similar to the comparable CUDA kernel.
Table 13.1 shows some of the comparable CUDA and OpenCL terms.
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif
cl_device_id
selectDevice(void)
{
int i, choice = -1;
char * value;
size_t valueSize;
cl_uint deviceCount;
cl_device_id * devices, selected;
This function returns a cl_device_id, which is an OpenCL type that refers to a specific
device. We call clGetDeviceIDs once to get the number of available devices (stored into
deviceCount). We then allocate an array to hold each device id, and call clGetDeviceIDs()
again to populate our newly allocated array with device ids.
The function clGetDeviceIDs() has the signature shown below:
cl_int
clGetDeviceIDs(
cl_platform_id platform,
cl_device_type device_type,
cl_uint num_entries,
cl_device_id *devices,
cl_uint *num_devices)
The first argument, platform, allows specification of an OpenCL platform; for simplicity
we will ignore this argument and pass NULL.
The second argument, device_type, allows filtering of which OpenCL devices to
show. It accepts CL_DEVICE_TYPE_CPU (only the CPU), CL_DEVICE_TYPE_GPU (any GPUs),
CL_DEVICE_TYPE_ACCELERATOR (such as Xeon Phi), CL_DEVICE_TYPE_DEFAULT (the default CL
device in the system), and CL_DEVICE_TYPE_ALL (all available OpenCL devices).
The third argument, num_entries, specifies the number of device ids that may be placed
in the array specified by devices. If devices is not NULL, this must be greater than 0.
The fourth argument, devices, is a pointer to a space of memory that will be filled with
device ids. If this value is NULL, it is ignored.
The final argument, num_devices, is the number of devices that match device_type. This
argument is ignored if it is NULL.
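(A minimal sketch of the two-call pattern described above, with the platform argument passed as NULL and error checking omitted:)

cl_uint deviceCount;
cl_device_id *devices;

// first call: only query how many devices are available
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, 0, NULL, &deviceCount);
// second call: fill a freshly allocated array with the device IDs
devices = (cl_device_id *)malloc(deviceCount * sizeof(cl_device_id));
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, deviceCount, devices, NULL);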
fp = fopen(kernel_file, "rb");
if (!fp){
printf("Failed to load kernel from %s\n", kernel_file);
exit(1);
}
fseek(fp, 0, SEEK_END);
program_size = ftell(fp);
rewind(fp);
source_str = (char*)malloc(program_size + 1);
source_str[program_size] = '\0';
fread(source_str, sizeof(char), program_size, fp);
fclose(fp);
return source_str;
}
Now that we can read the file, we need to load it into OpenCL.
This is similar to allocating memory in CUDA, with the exception that we can set
the read/write permissions of the memory. These flags give the OpenCL implementation more
information on how the memory will be used, which allows it to better optimize performance. Since
our input image will only be read and our output image will only be written to, we use
CL_MEM_READ_ONLY and CL_MEM_WRITE_ONLY, respectively.
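(The allocation calls themselves are not shown in this excerpt; a minimal sketch is given below, where context and imgBytes are placeholder names for the OpenCL context and the image size in bytes:)

cl_int err;
cl_mem input  = clCreateBuffer(context, CL_MEM_READ_ONLY,  imgBytes, NULL, &err);
cl_mem output = clCreateBuffer(context, CL_MEM_WRITE_ONLY, imgBytes, NULL, &err);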
Having allocated memory on our device, we can transfer our image to the device.
The OpenCL function clEnqueueWriteBuffer() enqueues the transfer onto the command
queue. By passing CL_TRUE as the third argument, we ensured that clEnqueueWriteBuffer()
will block until the transfer is complete, though in other applications, it may be possible to
schedule a transfer, perform other useful work in the shadow of the transfer, then execute
the kernel.
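(A sketch of the enqueue call described above; TheImg and imgBytes are placeholder names for the host image pointer and its size:)

// CL_TRUE makes this a blocking write: the call returns only after the copy completes
err = clEnqueueWriteBuffer(commands, input, CL_TRUE, 0, imgBytes, TheImg, 0, NULL, NULL);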
Now that we have the data on the device, we can finally run our kernel! The first step
to doing that is to set the kernel arguments. Unlike in CUDA, these are not set during the
kernel call (since to CUDA, the kernel is just another function), but using the OpenCL API.
err = 0;
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &input);
err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &output);
err |= clSetKernelArg(kernel, 2, sizeof(int), &M);
err |= clSetKernelArg(kernel, 3, sizeof(int), &N);
if (err != CL_SUCCESS){
printf("Error: Failed to set kernel arguments! %d\n", err);
exit(1);
}
Now, we need to get our work group sizes. This example uses a 1D work-group. Because
setting the size of the work-groups is non-trivial, we place the logic for it into another
function, getWorkGroupSizes().
This function (getWorkGroupSizes) takes a device, kernel, and desired global work-group
size, and outputs a local and global work-group size compatible with the requested values.
The call to clGetKernelWorkGroupInfo() determines the maximum local work-group size for
the device. Because the global work-group size must be an integer multiple of the local
work-group size, we “round” the size up. Finally, we can execute our kernel!
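(The launch call itself is not shown in this excerpt; a minimal sketch, where global and local are the sizes returned by getWorkGroupSizes():)

err = clEnqueueNDRangeKernel(commands, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
if (err != CL_SUCCESS){
   printf("Error: Failed to execute kernel! %d\n", err);
   exit(1);
}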
clFinish(commands);
This call blocks until all commands previously queued in commands have completed, similar to CUDA's
cudaDeviceSynchronize(). Having run our kernel, we can enqueue a transfer back from the
device of our flipped image.
clReleaseMemObject(input);
clReleaseMemObject(output);
clReleaseProgram(program);
clReleaseKernel(kernel);
clReleaseCommandQueue(commands);
clReleaseContext(context);
Our imedge.cl file starts with three array definitions. Note how they have been declared
to be __constant, which specifies them as read-only. Not only does this provide some guar-
antees that they will not be erroneously modified, but it may also allow the implementation
to optimize by placing this data in a constant cache. We also made use of __local memory,
which places the data in a special cache with L1 access times. Local memory is used in
a manner similar to __shared__ memory in CUDA. In our kernels, we first load the data
from global memory into shared memory. Because we have a 1D arrangement of threads,
we need to load the pixels in the same column as the pixel that the initial thread works
on. The edges need to load the values next to them as well. Were the threads dispatched in
a 2D arrangement, we could have made better use of thread locality with respect to local
memory; however, running with a 2D arrangement of threads runs significantly slower than
a 1D arrangement.
After loading our data into local memory, we use a barrier to ensure that the local
memory is fully populated before any threads proceed. Unlike with CUDA, the argument
CLK_LOCAL_MEM_FENCE is passed to the barrier call. This argument tells the barrier to syn-
chronize only the work-items in the current work-group before continuing, while a different
argument CLK_GLOBAL_MEM_FENCE would synchronize all work-items in all work-groups for
the current kernel. Since we are working with local memory, CLK_LOCAL_MEM_FENCE will be
sufficient.
The host-side code for imedge is similar to imflip.
As before, we select a device, create a context, a command queue, and compile our
program. Creating each kernel is the same as before as well, except we need to do it once
per kernel.
First, we set the arguments to our kernel. This is similar to how arguments were set in
imflip, but with one notable exception.
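(The line in question is not reproduced in this excerpt; it would be of the following form, where the argument index 4, the element type, and the name localSize are assumptions:)

// passing NULL with a nonzero size asks OpenCL to allocate __local memory of that size
err |= clSetKernelArg(kernel, 4, localSize * sizeof(cl_double), NULL);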
This line creates an array that will be the local memory used by our kernel. Unlike other
arguments passed to the kernel, note that the address of the value is NULL.
Next, we determine the size of our work group. Note that this is done once per kernel,
as this allows us to achieve maximum occupancy for each kernel.
We then enqueue the transfer, each kernel, and the transfer back. Because we do not
allow OpenCL to execute commands in the queue out of order, we can be assured that each
kernel will have the necessary data ready before starting.
How does our program perform? As before, we will test it using Astronaut.bmp. Results
are reported in Table 13.3.
There are a few points to note here. The first is that there is still a memory transfer
penalty on the CPU. This is because OpenCL has allocated the buffer in a different region
of memory than was allocated by malloc(). Because of this, using the CPU is not a way to
avoid transfer penalties, as this must be incurred regardless of the device.
Using local memory can greatly increase the performance of kernels, but this benefit can
be offset by the penalty of synchronizing threads. To get the most performance gain, ensure
that as much data as possible is shared among work-items, so that the single initial access to
global memory that places the data in local memory replaces what would otherwise be
multiple accesses to global memory.
CHAPTER 14
Other GPU Programming Languages
Andrew Boggio-Dandry
University at Albany, SUNY
Tolga Soyata
University at Albany, SUNY
In this chapter, we will briefly look at GPU programming languages other than OpenCL
and CUDA. Additionally, we will investigate some of the common APIs, such as OpenGL,
OpenGL ES, OpenCV, and Apple’s Metal API. Although these APIs are not programming
languages, they transform an existing language into a much more practical one.
This portion of the code (the listing below) allows the user to select which platform and
device to use. On an Apple device, there is only one platform available, since Apple writes
the drivers and OpenCL library for its own devices. If on a Windows or Linux computer,
there is the possibility for multiple platforms. For example, if you have an Nvidia GPU
and an Intel CPU, you can download and install the OpenCL library for both devices.
Each vendor is responsible for writing an OpenCL driver that properly takes your kernel
and compiles it into intermediate-level code specific to the device (CPU or GPU). In the case of
AMD, their OpenCL version can talk to both the CPU and GPU using the same drivers.
By using the get_devices method, PyOpenCL mirrors the OpenCL API calls for getting devices of a certain type. In this version of the code, the device defaults to the CPU if nothing is specified. Note that this version does not do any extensive error checking; it assumes that the requested devices exist and that there is only one platform and one device of each type. Finally, the queue is created within the context built from the selected device. Note also that the properties include turning on profiling, which allows timing of the kernel and of any data transfers.
import pyopencl as cl   # assumed to have been imported at the top of imflip.py

platform = cl.get_platforms()[0]
gpu_device = platform.get_devices(cl.device_type.GPU)
cpu_device = platform.get_devices(cl.device_type.CPU)
accel_device = platform.get_devices(cl.device_type.ACCELERATOR)
# cl_device_type holds the device type requested by the user (set earlier in the script)
if cl_device_type == 'gpu':
    dev = gpu_device
elif cl_device_type == 'accelerator':
    dev = accel_device
else:
    dev = cpu_device
ctx = cl.Context(dev)
queue = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)
The listing below shows how memory is allocated on the device. These method calls use the same syntax as their OpenCL C counterparts.
mf = cl.mem_flags
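A rough sketch of a typical allocation follows; input_image and img_g are assumed names, while res_g matches the result buffer referred to later in this section:

# Sketch only: input_image is an assumed NumPy array holding the source pixels.
img_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=input_image)   # device copy of the input
res_g = cl.Buffer(ctx, mf.WRITE_ONLY, input_image.nbytes)                      # device buffer for the result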
This portion of the code reads in the imflip.cl source code file and builds a kernel from it. It then calls the kernel function hflip() with the same arguments as the original OpenCL C version. An event, exec_evt, is returned from the kernel call; it is used to profile the runtime of the kernel. Note that the exec_evt.wait() call ensures that the kernel has finished running before any timing values are read. Extracting the runtime is then a simple arithmetic operation; a rough sketch of these steps follows.
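In the sketch below, the kernel argument list, the global and local work sizes, and the width value are assumptions for illustration; ctx, queue, img_g, and res_g refer to the objects created above:

import numpy as np
# Sketch only: the argument list and work sizes are placeholders.
kernel_src = open('imflip.cl').read()
prg = cl.Program(ctx, kernel_src).build()
exec_evt = prg.hflip(queue, global_size, local_size, img_g, res_g, np.int32(width))
exec_evt.wait()                                             # wait until the kernel has finished
elapsed_ns = exec_evt.profile.end - exec_evt.profile.start  # runtime in nanoseconds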
Finally, the flipped image buffer is copied back to the host. The array is reshaped to fit the OpenCV data format before being written to disk as a BMP file; the steps are sketched below.
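In this sketch, the height, width, and output file name are assumed values; output_image and res_g are the names used in this section:

import cv2
import numpy as np
# Sketch only: height, width, and the output file name are placeholders.
output_image = np.empty(height * width * 3, dtype=np.uint8)
cl.enqueue_copy(queue, output_image, res_g)               # copy the result back to the host
output_image = output_image.reshape((height, width, 3))   # rows x columns x color channels
cv2.imwrite('flipped.bmp', output_image)                  # write the flipped image as a BMP file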
The results of running this imflip.py program are shown in Figure 14.1. Two different image sizes were tested on five different devices. Both the Tesla and the Xeon Phi vastly outperformed the other devices, as expected from their high-end specs. The Intel Iris is a respectable middle-of-the-line integrated GPU found on most MacBook Pros. Surprisingly, the Intel i5 CPU performed better than the server Xeon E5 CPU. The disparity between CPUs and GPUs could also be lessened if more effort were spent on optimization. One of the strong points of OpenCL is that it can be run on such a wide variety of devices. This high level of flexibility also comes with a slight caveat: even though your OpenCL kernels will run just about everywhere, you will still need to think about device-specific optimization, especially when running on CPUs, where the memory layout is vastly different from that of GPUs. This leads us into the next section, where some of these hurdles are overcome by the meta-templating and runtime code generation handled by PyOpenCL.
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
As in any OpenCL code, a platform and an appropriate queue need to be set up. PyOpenCL has the convenience method create_some_context(), shown in the listing above, that will query the user at runtime to select which platform and device to run the code on. This is not the only way to select devices; as shown in the imflip.py example, you can explicitly specify which device or type of device to use.
To compute a derivative, an array of numbers, filled with random values, is created with standard NumPy methods on the CPU. A corresponding device array, f_device, is created using the to_device() method. This creates a device array exactly like the CPU version and transfers its contents to the device. An empty device array, dfdx_device, is created to hold the derivative values that the kernel will populate later in the code; a sketch of these calls is given after the dx snippet below.
# Create a dx value
dx = 2.0
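In the sketch below, the array length and the f_host name are placeholders, while f_device and dfdx_device match the names used in the text:

import numpy as np
import pyopencl.array as cl_array
# Sketch only: the array size is a made-up value.
f_host = np.random.rand(1_000_000).astype(np.float32)   # random input values on the CPU
f_device = cl_array.to_device(queue, f_host)             # copy the array to the device
dfdx_device = cl_array.empty_like(f_device)              # will hold the derivative values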
Once the device arrays have been set up, the kernel can be created. The ElementwiseKernel() method takes a set of arguments, formatted in C style, and an operation, from which it creates a kernel behind the scenes to perform that operation. The real power in this portion of the code comes from the fact that PyOpenCL uses meta-templating methods to create and analyze the kernel at runtime. This allows you to simply give it the basic operation, in this case a simple finite difference, and it will adapt it to your data. This runtime code generation and automatic tuning does a lot of the work for you, which is especially powerful when you wish to get good kernel performance on different devices; a sketch is given below.
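The sketch assumes a simple backward-difference operation; the exact expression used in the original code may differ, and the kernel name is a placeholder:

import numpy as np
from pyopencl.elementwise import ElementwiseKernel
# Sketch only: a backward difference, guarded at i == 0 to avoid reading out of bounds.
deriv_kernel = ElementwiseKernel(ctx,
    "float *f, float *dfdx, float dx",                    # C-style argument list
    "dfdx[i] = (i > 0) ? (f[i] - f[i-1]) / dx : 0.0f",    # operation applied to each element i
    "deriv_kernel")

deriv_kernel(f_device, dfdx_device, np.float32(dx))        # runs on the device
dfdx_host = dfdx_device.get()                              # copy the result back to the CPU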
Finally, once the element-wise kernel has been created and run, the data is copied
back to the CPU using the device array method get(). Compare this with the imflip.py
example where the data is transferred back to the CPU using cl.enqueue_copy(queue,
output_image, res_g). Although both do essentially the same thing, the device array version
adds yet another level of convenience.
14.2 OPENGL
Open Graphics Library (OpenGL) is a library of API functions that substantially simplifies performing 2D and 3D computer graphics operations. The functionality included in OpenGL goes far beyond rotating and scaling objects. It also includes z-buffer functionality to turn a 3D scene into a 2D projection suitable for display on computer monitors, as well as the computation of lighting effects from multiple light sources. This API is used to accelerate graphics computing by interfacing with a hardware accelerator.
OpenGL was introduced in the early 1990s by Silicon Graphics Inc. (SGI), at a time when a graphics accelerator meant vector units built into a CPU, or other specialized chips, designed strictly to accelerate graphics. OpenGL is used extensively by Computer-Aided Design (CAD) applications; for example, the AutoCAD application, which is the de facto mechanical drawing tool among architects, designers, and others, requires that a hardware graphics accelerator be available in the PC running it. Furthermore, visualization programs, such as flight simulators, and more general information visualization tools (e.g., for a tornado's travel pattern) use it to speed up the rate of information refresh (e.g., to visualize in real time). OpenGL is managed by the non-profit technology consortium Khronos Group, the same group that manages OpenCL.
The OpenGL 4.0 core specification can be found on Khronos' website:
https://www.khronos.org/registry/OpenGL/specs/gl/glspec40.core.pdf [23].
It is fairly common for OpenGL programmers to use higher-level libraries that build on top of OpenGL. Two such libraries are
• OpenGL Utility Library (GLU): A deprecated (as of 2009) library that provided
higher-level functionality on top of OpenGL, such as support for spheres, disks, and
cylinders.
• OpenGL Utility Toolkit (GLUT): Exposes a library of utilities to the programmer writing an OpenGL-based program. These library functions are typically OS-specific, covering window definition, creation, and control, as well as keyboard/mouse handling. Without this add-on library, OpenGL alone is too low-level to write windowed programs comfortably.
Installing GLUT on your computer is easy. A free version, freeglut, is available at this website: http://freeglut.sourceforge.net/ [6].
Once you install it, you will have to include the following lines in your C code (this is just an example installation on Windows with an older version of MS Visual Studio):
#define WIN32_LEAN_AND_MEAN
#include <Windows.h>
#include <gl/gl.h>
#include <gl/glu.h>
#define FREEGLUT_STATIC
#include <gl/glut.h>
14.3 OPENGL ES
OpenGL for Embedded Systems (OpenGL ES) is a subset of OpenGL targeted at embedded devices such as smartphones and tablets. It is expected that Vulkan will replace OpenGL ES. OpenGL ES, partly owing to its age, is the most widely deployed 3D graphics platform in history. More information about OpenGL ES can be found on Khronos' website: https://www.khronos.org/opengles/ [24].
14.4 VULKAN
Vulkan is another API from the Khronos Group, intended to offer a more balanced use of the CPU and GPU for high-performance 3D graphics and generic computations. It was introduced in early 2016, by which time almost every commercial CPU included an integrated GPU. For example, Apple's A10X processor includes six CPU and 12 GPU cores, Intel's i5 and i7 processors include four CPU cores and tens of GPU cores, and AMD's APUs include multiple CPU and multiple GPU cores. Vulkan is similar to Direct3D 12 in that it spreads compute tasks among the CPU and GPUs in a balanced manner, while staying loyal to its predecessor API, OpenGL. More information about Vulkan can be obtained on Khronos's website: https://www.khronos.org/registry/vulkan/specs/1.0/html/vkspec.html [27].
14.5.1 Shading
The first step in creating computer-generated visual effects is to design and build 3D objects using a set of points in 3D space (see Chapter 6). These objects are then combined with actual images taken by a camera to create a realistic scene. Raw 3D models, however, fail to blend well with the rest of the image. The result is a conspicuous 3D object that seems completely disconnected from its surrounding environment. This makes the entire image unrealistic and hence curtails the applicability of computer-generated visual effects in their target industries, such as movie and video game production. This problem has been addressed by introducing shading programs. These programs analyze the surface of a 3D object and apply shading and light reflection effects in accordance with the lighting characteristics of the scene, with the ultimate aim of blending the objects in better and creating more believable images.
Shading can be performed either off-line or on-line. The former is used in movie production, while the latter is typically employed in the video game industry. OpenGL and Microsoft Direct3D are examples of on-line shading tools. Although the initial shading tools were hard-coded in the GPU architecture, it was soon realized that a software-based assembly solution could improve the flexibility and customizability of the features. The complexity
of assembly language, however, soon fueled the demand for a high-level shading language. Nvidia introduced its own language, Nvidia Cg (C for graphics), and Microsoft introduced the High Level Shading Language (HLSL). Both languages use the same syntax; however, they are branded under two different names for commercial reasons. Unlike Cg, HLSL can only compile shaders for DirectX and is not compatible with OpenGL (not surprisingly, because Direct3D is a competitor of OpenGL). Cg is compatible with both OpenGL and DirectX.
14.8 OPENCV
The OpenCV library is one of the most important open source API libraries available today [22]. It is completely royalty-free and includes many image processing APIs, as well as high-level APIs such as face recognition. Later versions (OpenCV 3.3) include deep learning APIs. More information about OpenCV can be obtained on their website: http://opencv.org/ [22].
CHAPTER 15
Deep Learning Using CUDA
Tolga Soyata
University at Albany, SUNY
In this chapter, we will study how GPUs can be used in deep learning. Deep learning is an emerging machine intelligence technique based on artificial neural networks (ANNs). ANNs were proposed as computational models of neurological systems; they were designed to “learn” to perform a certain task by mimicking the way a brain learns.
15.1.1 Neurons
Neurons are the building blocks of ANNs. The structure of a neuron is shown in Figure 15.2,
in which the neuron is shown to take multiple inputs and create an output by passing a
weighted sum of the inputs through an activation function. These input values may come
from other neurons in the previous layer or from the primary inputs of a system.
FIGURE 15.1 Generalized architecture of a fully connected artificial neural network with n inputs, k hidden layers, and m outputs.
FIGURE 15.2 Inner structure of a neuron used in ANNs. ωij are the weights by which inputs to the neuron (x1, x2, ..., xn) are multiplied before they are summed. “Bias” is a value by which this sum is augmented, and f() is the activation function, which is used to introduce a non-linear component to the output.
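As a concrete sketch of what Figure 15.2 describes, the output of a single neuron can be computed as below; the inputs, weights, and bias are made-up numbers, and the sigmoid is used only as one example of an activation function f():

import numpy as np

def sigmoid(z):
    # one common (saturating) choice for the activation function f()
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])    # inputs x1..xn (made-up values)
w = np.array([0.8,  0.3, -0.5])   # weights w1j..wnj (made-up values)
bias = 0.1                        # bias added to the weighted sum

z = np.dot(w, x) + bias           # weighted sum of the inputs plus the bias
output = sigmoid(z)               # pass the sum through the activation function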
Fully connected networks work fine for shallower networks, but they show problems when more hidden layers are added to their structure. One problem is the use of saturating activation functions, such as the sigmoid or the hyperbolic tangent, in feedforward neural networks. In addition, the number of network parameters grows rapidly as extra hidden layers are added, which makes training computationally expensive and produces a network that is overly sensitive to even the smallest changes in the input, although those changes might not be meaningful; this is called the overfitting problem. These two problems are addressed in other deep architectures such as convolutional neural networks.
(in this case, the stride is equal to 2). This means that if a 6×6 input is given to a pooling layer with a 2×2 maxpooling function, it will create a 3×3 output.
• Activation Layer: The activation layer is a layer in which each neuron takes one input and passes it through the activation function to produce its output. In a sense, the activation functions of the neurons are decoupled from the input summation into a separate layer. The activation function most commonly used in CNNs is ReLU, since other activation functions tend to saturate due to having hard limits.
• Fully Connected Layer: Up to the late stages of a CNN, only convolutional, activation, and pooling layers are used, which detect the presence of local features in an input. At the end, there is a need to gather all the local information and have a global network make a final decision on the input. This is where a few layers of fully connected networks are useful to “mix” the results from different local feature detectors into a final global result.
• Softmax Layer: The softmax layer is usually the last layer of a convolutional neural network, and it normalizes the output. Assume that the network is supposed to do image classification among n mutually exclusive objects. We can then use the softmax function (Equation 15.1) to create n outputs that represent the probability that a given input belongs to each class. Note that the softmax outputs always sum to 1, and that the softmax function is differentiable, which helps with training the network.
f(v_i) = \frac{e^{v_i}}{\sum_j e^{v_j}}    (15.1)
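A minimal numeric sketch of Equation 15.1 (the raw scores are made-up values):

import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))   # subtracting the max improves numerical stability
    return e / e.sum()          # outputs are positive and sum to 1

scores = np.array([2.0, 1.0, 0.1])   # made-up raw network outputs
probs = softmax(scores)              # roughly [0.66, 0.24, 0.10]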
the library. The library is a host-callable C-language API and, like cuBLAS, it requires that the input and output data be resident on the GPU. The operations widely used in CNNs (and optimized for both forward and backward passes) in cuDNN are
• Convolution
• Pooling
• Softmax
• Neuron activation functions
  – Rectified linear (ReLU)
  – Sigmoid
  – Hyperbolic tangent (tanh)
  – Exponential linear unit (ELU)
• Tensor transformation functions
struct Conv_Layer
{
int inputs, outputs, kernelSize;
int inputWidth, inputHeight, outputWidth, outputHeight;
std::vector<float> convV;
std::vector<float> biasV;
...
};
struct Maxpool_Layer
{
int size, stride;
...
};
struct Fully_Connected_Layer
{
int inputs, outputs;
std::vector<float> neuronsV;
std::vector<float> biasV;
...
};
Each layer description contains the data and values related to that specific layer. For example, a convolution layer needs to store the size of the input and output, the number of inputs and outputs (since a convolution layer may take multiple inputs or create more than one output), and vectors of the biases and weights of the neurons. Depending on your application, you might want to implement functions that read these
values from a pretrained network (that is saved to a file) or write the trained values to a
new file at the end of the training phase.
struct My_Network
{
cudnnTensorDescriptor_t dataTensorDesc, convTensorDesc;
cudnnConvolutionDescriptor_t convDesc;
cudnnActivationDescriptor_t lastLayerActDesc;
cudnnFilterDescriptor_t filterDesc;
cudnnPoolingDescriptor_t poolDesc;
void createHandles()
{
//General tensors and layers used in the network.
//These need to be initialized by a descriptor.
cudnnCreateTensorDescriptor(&dataTensorDesc);
cudnnCreateTensorDescriptor(&convTensorDesc);
cudnnCreateConvolutionDescriptor(&convDesc);
cudnnCreateActivationDescriptor(&lastLayerActDesc);
cudnnCreateFilterDescriptor(&filterDesc);
cudnnCreatePoolingDescriptor(&poolDesc);
}
void destroyHandles()
{
cudnnDestroyTensorDescriptor(&dataTensorDesc);
cudnnDestroyTensorDescriptor(&convTensorDesc);
cudnnDestroyConvolutionDescriptor(&convDesc);
cudnnDestroyActivationDescriptor(&lastLayerActDesc);
cudnnDestroyFilterDescriptor(&filterDesc);
cudnnDestroyPoolingDescriptor(&poolDesc);
}
...
};
This is where the tensors and layers used in the network are described: each descriptor is created in createHandles() and released in destroyHandles().
convoluteForward(...)
{
cudnnSetTensor4dDescriptor(dataTensorDesc, ...);
cudnnSetFilter4dDescriptor(filterDesc, ...);
cudnnSetConvolution2dDescriptor(convDesc, ...);
cudnnConvolutionForward(...);
}
Note that all these functions take multiple inputs that are not shown here. These inputs
vary based on the function, but common inputs are the cuDNN handle, descriptors of the
input and output tensors, size of the data, and data types.
15.5.4 Backpropagation
For the training phase, backpropagation is done to adjust the weights and parameters in
the network.
cudnnActivationBackward(...)
cudnnPoolingBackward(...)
cudnnConvolutionBackwardBias(...)
cudnnConvolutionBackwardFilter(...)
fullyConnectedForward(...)
{
...
cublasSgemv(...);
...
}
The input arguments of cublasSgemv include the cuBLAS handle, the source and destination data, the dimensions of the data, etc.
15.6 KERAS
As shown in the previous sections, creating a CNN by using cuDNN directly is a time-consuming and confusing task. Using only cuDNN is not a good solution for quickly creating and testing a prototype. There are many deep learning frameworks that take advantage of GPU processing power and the cuDNN library to provide easy-to-develop networks that offer acceptable performance. Frameworks such as Caffe, TensorFlow, Theano, Torch, and the Microsoft Cognitive Toolkit (CNTK) are used to implement deep neural networks easily and achieve high performance.
We provide an example from the Keras framework on how to create a neural network. Keras is a Python library for deep learning; it can run on top of TensorFlow, CNTK, or Theano. Keras keeps all components of a network discrete, which makes them easy to add or remove. Keras is completely Python native, so there is no need for external file formats.
Keras provides support for different types of layers, even more than the layers that have been introduced previously in this chapter. It even provides the ability to define and write your own layer structure in the network. It also provides a variety of loss functions and performance metrics, along with many other supporting tools such as pre-existing widely used datasets, visualization support, and optimizers. The sample Keras code below shows the backbone of creating a simple network:
model = Sequential()
model.add(Dense(units=..., input_dim=...))
model.add(Activation('relu'))
model.add(Conv2D(..., activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dense(..., activation='softmax'))
model.compile(loss=losses.mean_squared_error,
              optimizer='sgd',
              metrics=['accuracy'])
In the code above, a sequential network is created, meaning that the layers are connected in a linear stack. By using the add method, different layers are created and connected to each other in the network. There are a variety of network layers implemented in Keras, and they take input arguments such as the dimensionality of the output, the type of activation, the dimensionality of the input, etc.
The compile method, which is used before the training phase, configures the learning process. It takes an optimizer, such as stochastic gradient descent (sgd), a loss function, and a metric to set up the network.
After compiling, the fit method is used to train the network. It takes the training input and output data, the batch size into which the training data are divided during training, input and output validation data if any are available, etc.
The final stage is evaluating the network by using the evaluate method, which takes input and output test data and checks the performance of the network on these data.
Other useful methods include predict, which processes a given input and generates an output, get_layer, which returns a layer in the network, and train_on_batch and test_on_batch, which train and test the network on only one batch of input data. A sketch of the most common calls is shown below.
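In the sketch below, the data arrays, batch size, and epoch count are placeholders, not values from the text:

# x_train, y_train, x_val, y_val, and x_test, y_test are assumed NumPy arrays.
model.fit(x_train, y_train,
          batch_size=32,                          # size of the chunks the training data is split into
          epochs=10,
          validation_data=(x_val, y_val))         # optional validation data

loss, accuracy = model.evaluate(x_test, y_test)   # check performance on the test data
predictions = model.predict(x_test)               # generate outputs for given inputs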
Note that if Keras is running on the TensorFlow or CNTK backends, it automatically
runs on the GPU if any GPU is detected. If the backend is Theano, there are multiple meth-
ods to use the GPU. One way is manually setting the device of the Theano configuration,
as follows:
import theano
theano.config.device = ’gpu’
theano.config.floatX = ’float32’
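Another common way, sketched below rather than taken from the text, is to set the THEANO_FLAGS environment variable before Theano is imported:

import os
# must be set before "import theano" for the flags to take effect
os.environ['THEANO_FLAGS'] = 'device=gpu,floatX=float32'
import theano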
Bibliography
[36] T. Soyata, H. Ba, W. Heinzelman, M. Kwon, and J. Shi. Accelerating Mobile Cloud
Computing: A Survey. In H. T. Mouftah and B. Kantarci, editors, Communication
infrastructures for cloud computing, chapter 8, pages 175–197. IGI Global, Sep 2013.
[37] T. Soyata, R. Muraleedharan, S. Ames, J. H. Langdon, C. Funai, M. Kwon, and W. B.
Heinzelman. COMBAT: mobile Cloud-based cOmpute/coMmunications infrastructure
for BATtlefield applications. In Proceedings of SPIE, volume 8403, pages 84030K–
84030K, May 2012.
[38] T. Soyata, R. Muraleedharan, C. Funai, M. Kwon, and W. Heinzelman. Cloud-Vision:
Real-Time Face Recognition Using a Mobile-Cloudlet-Cloud Acceleration Architec-
ture. In Proceedings of the 17th IEEE Symposium on Computers and Communications
(ISCC), pages 59–66, Cappadocia, Turkey, Jul 2012.
[39] Jane Vanderkooi. Your inner engine: An introductory course on human metabolism.
CreateSpace, 2014.