CHAPTER 1
Overview of Parallel Computing
1.1 INTRODUCTION
In the first 60 years of the electronic computer, beginning in 1940, computing performance
per dollar increased on average by 55% per year [52]. This staggering 100 billion-fold increase
hit a wall in the middle of the first decade of this century. The so-called power wall arose
when processors couldn’t work any faster because they couldn’t dissipate the heat they pro-
duced. Performance has kept increasing since then, but only by placing multiple processors
on the same chip, and limiting clock rates to a few GHz. These multicore processors are
found in devices ranging from smartphones to servers.
Before the multicore revolution, programmers could rely on a free performance increase
with each processor generation. However, the disparity between theoretical and achievable
performance kept increasing, because processing speed grew much faster than memory band-
width. Attaining peak performance required careful attention to memory access patterns in
order to maximize re-use of data in cache memory. The multicore revolution made things
much worse for programmers. Now increasing the performance of an application required
parallel execution on multiple cores.
Enabling parallel execution on a few cores isn’t too challenging, with support available
from language extensions, compilers and runtime systems. The number of cores keeps in-
creasing, and manycore processors, such as Graphics Processing Units (GPUs), can have
thousands of cores. This makes achieving good performance more challenging, and parallel
programming is required to exploit the potential of these parallel processors.
better use of them. Specialized parallel code is often essential for applications requiring high
performance. The challenge posed by the rapidly growing number of cores has meant that
more programmers than ever need to understand something about parallel programming.
Fortunately, parallel processing is natural for humans, as our brains have been described as
parallel processors, even though we have been taught to program in a sequential manner.
1.2 TERMINOLOGY
It’s important to be clear about terminology, since parallel computing, distributed computing,
and concurrency are all overlapping concepts that have been defined in different ways.
Parallel computers can also be placed in several categories.
Definition 1.1 (Parallel Computing). Parallel Computing means solving a computing prob-
lem in less time by breaking it down into parts and computing those parts simultaneously.
Parallel computers provide more computing resources and memory in order to tackle
problems that cannot be solved in a reasonable time by a single processor core. They differ
from sequential computers in that there are multiple processing elements that can execute
instructions in parallel, as directed by the parallel program. We can think of a sequential
computer as having a single stream of instructions operating on a single stream of data;
Flynn's taxonomy, summarized in the figure below, classifies computers according to whether
their instruction and data streams are single or multiple.
[Figure: Flynn's taxonomy: single/multiple instruction streams crossed with single/multiple data streams give the SISD, SIMD, MISD, and MIMD categories.]
[Figure: the overlapping relationship between concurrency, parallel computing, and distributed computing.]
the domain of enthusiasts and those whose needs required high performance computing
resources.
Not only is expert knowledge needed to write parallel programs for a cluster, but the
significant computing needs of large scale applications also require the use of shared
supercomputing facilities. Users had to master the complexities of coordinating the execution
of applications and file management, sometimes across different administrative and geographic domains.
This led to the idea of Grid computing in the late 1990s, with the dream of computing as
a utility, which would be as simple to use as the power grid. While the dream hasn’t been
realized, Grid computing has provided significant benefit to collaborative scientific data
analysis and simulation. This type of distributed computing wasn’t adopted by the wider
community until the development of cloud computing in the early 2000s.
Cloud computing has grown rapidly, thanks to improved network access and virtualiza-
tion techniques. It allows users to rent computing resources on demand. While using the
cloud doesn’t require parallel programming, it does remove financial barriers to the use of
compute clusters, as these can be assembled and configured on demand. The introduction
of frameworks based on the MapReduce programming model eliminated the difficulty of
parallel programming for a large class of data processing applications, particularly those
associated with the mining of large volumes of data.
With the emergence of multicore and manycore processors all computers are parallel
computers. A desktop computer with an attached manycore co-processor features thousands
of cores and offers performance in the trillions of operations per second. Put a number of
these computers on a network and even more performance is available, with the main
limiting factors being power consumption and heat dissipation. Parallel computing is now
relevant to all application areas. Scientific computing isn’t the only player any more in
large scale high performance computing. The need to make sense of the vast quantity of
data that cheap computing and networks have produced, so-called Big Data, has created
another important use for parallel computing.
If the input consists of two documents, one containing “The quick brown fox jumps
over a lazy dog” and the other containing “The brown dog chases the tabby cat,” then the
output would be the list: ⟨a, 1⟩, ⟨brown, 2⟩, ..., ⟨tabby, 1⟩, ⟨the, 3⟩.
Think about how you would break this algorithm into parts that could be computed in
parallel, before we discuss below how this could be done.
Once loops that have independent iterations have been identified, the parallelism is very
simple to express. Ensuring correctness is another matter, as we’ll see in Chapters 2 and 4.
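For example, here is a minimal sketch of a loop with independent iterations, annotated so
that a supporting compiler and runtime can distribute the iterations across cores. OpenMP
is used purely as an illustration of such a language extension; it is not necessarily the
notation used in the following chapters.

#include <stdio.h>
#define N 1000000

int main(void)
{
    static double a[N], b[N], c[N];   // static: keeps the large arrays off the stack

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    // Each iteration writes only c[i] and reads only a[i] and b[i],
    // so the iterations are independent and can run on different cores.
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }

    printf("c[N-1] = %f\n", c[N - 1]);
    return 0;
}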
[Figure: the book's pattern-based organization: structural patterns and computational patterns, with case studies in Chapters 6-8.]
ways of thinking [69]. It could be argued that making the transition from sequential to
parallel notional machines is another threshold concept.
[Figure: a text-processing pipeline: a stream of messages passes through stages such as language identification (keeping English messages), tokenization (producing plain text words), and metadata removal, followed by further stages.]
[Figure 1.5: Matrix-vector multiplication on four cores: core i computes the inner product of row i of A (elements Ai0, Ai1, Ai2, Ai3) with the vector x (x0, x1, x2, x3) to produce element bi of the result.]
libraries. The High Performance Linpack (HPL) benchmark used to classify the top 500
computers involves solving a dense system of linear equations. Parallelism arises naturally
in these applications. In matrix-vector multiplication, for example, the inner products that
compute each element of the result vector can be computed independently, and hence in
parallel, as seen in Figure 1.5. In practice it is more difficult to develop solutions that scale
well with matrix size and the number of processors, but the plentiful literature provides
guidance.
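As a sketch of that decomposition (illustrative C, not code from this book), each iteration of
the outer loop below forms one independent inner product, exactly the work assigned to one
core in Figure 1.5:

// y = A*x for an n-by-n matrix A stored in row-major order.
// Row i's inner product touches only row i of A and all of x,
// so the outer loop iterations are independent and can run in parallel.
void matvec(int n, const double *A, const double *x, double *y)
{
    #pragma omp parallel for      // one possible way to express the parallelism
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++) {
            sum += A[i * n + j] * x[j];
        }
        y[i] = sum;
    }
}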
Another important pattern is one where operations are performed on a grid of data. It
occurs in scientific simulations that numerically solve partial differential equations, and also
in image processing that executes operations on pixels. The solutions for each data point
can be computed independently but they require data from neighboring points. Other pat-
terns include those found in graph algorithms, optimization (backtrack/branch and bound,
dynamic programming), and sorting. It is reassuring that the lists of such patterns that have
been drawn up include fewer than twenty entries. Even though the landscape of parallel
computing is vast, most applications can be composed of a small number of well studied
computational patterns.
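A minimal sketch of one such grid operation is shown below (illustrative C, not code from
this book): every interior point can be updated independently, but each update reads the
four neighboring points of the previous grid, which is why neighboring data must be
exchanged when the grid is distributed across processors.

// One sweep of a 4-point stencil over an n-by-n grid (row-major storage).
// Each dst value depends only on src values, so all points can be updated in parallel.
void stencil_sweep(int n, const double *src, double *dst)
{
    for (int i = 1; i < n - 1; i++) {
        for (int j = 1; j < n - 1; j++) {
            dst[i * n + j] = 0.25 * (src[(i - 1) * n + j] + src[(i + 1) * n + j] +
                                     src[i * n + (j - 1)] + src[i * n + (j + 1)]);
        }
    }
}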
[Figure: MapReduce word count: the map tasks process the two documents and the reduce phase merges their counts into the final list ⟨a,1⟩, ⟨brown,2⟩, ⟨cat,1⟩, ⟨chases,1⟩, ⟨dog,2⟩, ⟨fox,1⟩, ⟨jumps,1⟩, ⟨lazy,1⟩, ⟨over,1⟩, ⟨quick,1⟩, ⟨tabby,1⟩, ⟨the,3⟩.]
algorithmic and implementation structures to create parallel software. These structures will
be explored in detail in Chapters 3 and 4.
Both distributed and shared implementations could be combined. Taking a look at the
sequential algorithm we can see that word counts could be done independently for each line
of text. The processing of the words of the sentences of each document could be accomplished
in parallel using a shared data structure, while the documents could be processed by multiple
computers.
This is a glimpse of how rich the possibilities can be in parallel programming. It can get
even richer as new computational platforms emerge, as happened in the mid 2000s with gen-
eral purpose programming on graphics processing units (GPGPU). The patterns of parallel
computing provided a guide in our example problem, as they allowed the recognition that
it fit the MapReduce structural pattern, and that implementation could be accomplished
using well established algorithmic and implementation patterns. While this book will not
follow the formalism of design patterns, it does adopt the similar view that there is a set
of parallel computing elements that can be composed in many ways to produce clear and
effective solutions to computationally demanding problems.
CHAPTER 6
Introduction to GPU Parallelism and CUDA
This story gets worse. Even if you had a 486DX CPU, the FPU inside your 486DX was
still not fast enough for most of the games. Any exciting game demanded a 20× (or even
50×) higher-than-achievable floating point computational power from its host CPU. Surely,
in every generation the CPU manufacturers kept improving their FPU performance, just to
witness a demand for FPU power that grew much faster than the improvements they could
provide. Eventually, starting with the Pentium generation, the FPU was an integral part
of a CPU, rather than an option, but this didn’t change the fact that significantly higher
FPU performance was needed for games. In an attempt to provide much higher scale FPU
performance, Intel went on a frenzy to introduce vector processing units inside their CPUs:
the first ones were called MMX, then SSE, then SSE2, and the ones in 2016 are SSE4.2.
These vector processing units were capable of processing many FPU operations in parallel
and their improvement has never stopped.
Although these vector processing units helped certain applications a lot — and they still
do — the demand for an ever-increasing amount of FPU power was insane! When Intel could
deliver a 2× performance improvement, game players demanded 10× more. When they
could eventually manage to deliver 10× more, they demanded 100× more. Game players
were just monsters that ate lots of FLOPS! And, they were always hungry! Now what? This
was the time when a paradigm shift had to happen. Late 1990s is when the manufacturers
of many plug-in boards for PCs — such as sound cards or ethernet controllers — came
up with the idea of a card that could be used to accelerate the floating point operations.
Furthermore, routine image coordinate conversions during the course of a game, such as
3D-to-2D conversions and handling of triangles, could be performed significantly faster by
dedicated hardware rather than wasting precious CPU time. Note that the actual unit
element of a monster in a game is a triangle, not a pixel. Using triangles allows the games
to associate a texture for the surface of any object, like the skin of a monster or the surface
of a tank, something that you cannot do with simple pixels.
These efforts of the PC card manufacturers to introduce products for the game market
gave birth to a type of card that would soon be called a Graphics Processing Unit. Of course,
we love acronyms: it is a GPU ... A GPU was designed to be a “plug-in card” that required
a connector such as PCI, AGP, PCI Express, etc. Early GPUs in the late 1990s strictly
focused on delivering as high a floating point performance as possible. This freed the
CPU resources and allowed a PC to perform 5× or 20× better in games (or even more if
you were willing to spend a lot of money on a fancy GPU). Someone could purchase a $100
GPU for a PC that was worth $500; for this 20% extra investment, the computer performed
5× faster in games. Not a bad deal. Alternatively, by purchasing a $200 card (i.e., a 40%
extra investment), your computer could perform 20× faster in games. The late 1990s were the
point of no return, after which the GPU was an indispensable part of every computer, not
just for games, but for a multitude of other applications explained below. Apple computers
used a different strategy to build GPU-like processing power into their computers, but
sooner or later (by 2017, the release year of this book) the PC and Mac lines
converged and started using GPUs from the same manufacturers.
FIGURE 6.1 Turning the dog picture into a 3D wire frame. Triangles are used to rep-
resent the object, rather than pixels. This representation allows us to map a texture
to each triangle. When the object moves, so does each triangle, along with their as-
sociated textures. To increase the resolution of this kind of an object representation,
we can divide triangles into smaller triangles in a process called tessellation.
or cos(), or even floating point computations of any sort. The entire game could run by
performing integer operations, thereby requiring only an ALU. Even a low-powered CPU
was perfectly sufficient to compute all of the required movements in real time. However,
for gamers of the 1990s, who had watched the Terminator 2 movie only a few years earlier, the
Pacman game was far from exciting. First of all, objects had to be 3D in any good computer
game and the movements were substantially more sophisticated than Pacman — and in
3D, requiring every transcendental operation you can think of. Furthermore, because the
result of any transcendental function involved in a sophisticated object move — such as the
rotation operation in Equation 4.1 or the scaling operation in Equation 4.3 — required
the use of floating point variables to maintain image coordinates, GPUs, by definition, had
to be computational units that incorporated significant FPU power. Another observation
that the GPU manufacturers made was that the GPUs could have a significant edge in
performance if they also included dedicated processing units that performed routine con-
versions from pixel-based image coordinates to triangle-based object coordinates, followed
by texture mapping.
To appreciate what a GPU has to do, consider Figure 6.1, in which our dog is represented
by a bunch of triangles. Such a representation is called a wire-frame. In this representation,
a 3D object is represented using triangles, rather than an image using 2D pixels. The unit
element of this representation is a triangle with an associated texture. Constructing a 3D
wire-frame of the dog will allow us to design a game in which the dog jumps up and down;
as he makes these moves, we have to apply some transformation — such as rotation, using
the 3D equivalent of Equation 4.1 — to each triangle to determine the new location of that
triangle and map the associated texture to each triangle’s new location. Much like a 2D
image, this 3D representation has the same “resolution” concept; to increase the resolution
of a triangulated object, we can use tessellation, in which a triangle is further subdivided into
smaller triangles as shown in Figure 6.1. Note: Only 11 triangles are shown in Figure 6.1
to avoid cluttering the image and make our point on a simple figure; in a real game, there
could be millions of triangles to achieve sufficient resolution to please the game players.
Now that we appreciate what it takes to create scenes in games where 3D objects
are moving freely in the 3-dimensional space, let’s turn our attention to the underlying
FIGURE 6.2 Steps to move triangulated 3D objects. Triangles contain two attributes:
their location and their texture. Objects are moved by performing mathematical
operations only on their coordinates. A final texture mapping places the texture
back on the moved object coordinates, while a 3D-to-2D transformation allows the
resulting image to be displayed on a regular 2D computer monitor.
computations to create such a game. Figure 6.2 depicts a simplified diagram of the steps
involved in moving a 3D object. The designer of a game is responsible for creating a wire-
frame of each object that will take part in the game. This wire-frame includes not only the
locations of the triangles — composed of 3 points for each triangle, having an x, y, and z
coordinate each — but also a texture for each triangle. This operation decouples the two
components of each triangle: (1) the location of the triangle, and (2) the texture that is
associated with that triangle. After this decoupling, triangles can be moved freely, requiring
only mathematical operations on the coordinates of the triangles. The texture information
— stored in a separate memory area called texture memory — doesn’t need to be taken into
account until all of the moves have been computed and it is time to display the resulting
object in its new location. Texture memory does not need to be changed at all, unless, of
course, the object is changing its texture, as in the Hulk movie, where the main character
turns green when stressed out! In this case, the texture memory also needs to be updated
in addition to the coordinates; however, this is a fairly infrequent update when compared
to the updates on the triangle coordinates. Before displaying the moved object, a texture
mapping step fills the triangles with their associated texture, turning the wire-frame back
into an object. Next, the recomputed object has to be displayed on a computer screen;
because every computer screen is composed of 2D pixels, a 3D-to-2D transformation has to
be performed to display the object as an image on the computer screen.
• Ability to convert from triangle coordinates back to image coordinates for display on
a computer screen (Box IV)
Based on this observation, right from the first day, every GPU was manufactured with
the ability to implement some sort of functionality that matched all of these boxes. GPUs
kept evolving by incorporating faster Box IIs, although the concept of Box I, III, and IV
never changed too much. Now, imagine that you are a graduate student in the late 1990s
— in a physics department — trying to write a particle simulation program that requires
an extensive amount of floating point computations. Before the introduction of the GPUs,
all you could use was a CPU that had an FPU in it and, potentially, a vector unit. However,
when you bought one of these GPUs at an affordable price and realized that they could
perform a much higher volume of FPU operations, you would naturally start thinking:
“Hmmm... I wonder if I could use one of these GPU things in my particle simulations?”
This investigation would be worth every minute you put into it because you know that these
GPUs are capable of 5× or 10× faster FPU computations. The only problem at that time
was that the functionality of Box III and Box IV couldn’t be “shut off.” In other words,
GPUs were not designed for non-gamers who were trying to do particle simulations!
Nothing can stop a determined mind! It didn’t take too long for our graduate student
to realize that if he or she mapped the location of the particles as the triangle locations
of the monsters and somehow performed particle movement operations by emulating them
as monster movements, it could be possible to “trick” the GPU into thinking that you
are actually playing a game, in which particles (monsters) are moving here and there and
smashing into each other (particle collisions). You can only imagine the massive challenges
our student had to endure: First, the native language of the games was OpenGL, in which
objects were graphics objects and computer graphics APIs had to be used to “fake” particle
movements. Second, there were major inefficiencies in the conversions from monster-to-
particle and particle-back-to-monster. Third, accuracy was not that great because the initial
cards could only support single precision FPU operations, not double precision. It is not
like our student could make a suggestion to the GPU manufacturers to incorporate double
precision to improve the particle simulation accuracy; GPUs were game cards and they were
game card manufacturers, period! None of these challenges stopped our student! Whoever
that student was, the unsung hero, created a multibillion dollar industry of GPUs that are
in almost every top supercomputer today.
Extremely proud of the success in tricking the GPU, the student published the results
... The cat was out of the bag ... This started an avalanche of interest; if this trick can be
applied to particle simulations, why not circuit simulations? So, another student applied it
to circuit simulations. Another one to astro-physics, another one to computational biology,
another ... These students invented a way to do general purpose computations using GPUs,
hence the birth of the term GPGPU.
example, oil explorers could analyze the underwater SONAR data to find oil under water, an
application that requires a substantial volume of floating point operations. Alternatively,
the academic and research market, including many universities and research institutions
such as NASA or Sandia National Labs, could use the GPGPUs for extensive scientific
simulations. For these simulations, they would actually purchase hundreds of the most
expensive versions of GPGPUs, so GPU manufacturers could make a significant amount
of money in this market and create an alternative product to the already-healthy game
products.
In the late 1990s, GPU manufacturers were small companies that saw GPUs as ordinary
add-on cards that were no different than hard disk controllers, sound cards, ethernet cards,
or modems. They had no vision of the month of September 2017, when Nvidia would become
a company that is worth $112 B (112 billion US dollars) on the Nasdaq stock market (Nasdaq
stock ticker NVDA), a pretty impressive 20-year accomplishment considering that Intel, the
biggest semiconductor manufacturer on the planet with its five-decade history, was worth
$174 B the same month (Nasdaq stock ticker INTC). The vision of the card manufacturers
changed fairly quickly when the market realized that GPUs were not in the same category
as other add-on cards; it didn’t take a genius to figure out that the GPU market was ready
for an explosion. So the gold rush started. GPU cards needed five main ingredients: (1) the
GPU chips, responsible for all of the computation, (2) GPU memory, something that could
be manufactured by the CPU DRAM manufacturers that were already making memory
chips for the CPU market, (3) interface chips to interface to the PCI bus, (4) power supply
chips that provide the required voltages to all of these chips, and (5) other semiconductors
to make all of these work together, sometimes called “glue logic.”
The market already had manufacturers for (2), (3), and (4). Many small companies
were formed to manufacture (1), the GPU “chips,” so the functionality shown in Figure 6.2
could be achieved. The idea was that GPU chip designers — such as Nvidia — would
design their chips and have them manufactured by third parties — such as TSMC — and
sell the GPU chips to contractor manufacturers such as FoxConn. FoxConn would purchase
the other components (2,3,4, and 5) and manufacture GPU add-on cards. Many GPU chip
designers entered the market, only to see a massive consolidation toward the end of the 1990s.
Some of them went bankrupt and some sold out to bigger manufacturers. As of 2016,
only three key players remain in the market (Intel, AMD, and Nvidia), two of them being
actual CPU manufacturers. Nvidia became the biggest GPU manufacturer in the world as
of 2016 and made multiple pushes to enter the GPU/CPU market by incorporating
ARM cores into their Tegra line GPUs. Intel and AMD kept incorporating GPUs into
their CPUs to provide an alternative to consumers that didn’t want to buy a discrete
GPU. Intel has gone through many generations of designs eventually incorporating Intel
HD Graphics and Intel Iris GPUs into their CPUs. Intel's GPU performance improved to
the point where, in 2016, Apple deemed the built-in Intel GPU sufficient to be
included in their MacBooks as the only GPU, instead of a discrete GPU. Additionally, Intel
introduced the Xeon Phi cards to compete with Nvidia in the high-end supercomputing
market. While this major competition was taking place in the desktop market, the mobile
market saw a completely different set of players emerge. Qualcomm and Broadcom built
GPU cores into their mobile processors by licensing them from other GPU designers. Apple
purchased processor designers to design their “A” family processors that had built-in CPUs
and GPUs with extremely low power consumption. By about 2011 or 2012, CPUs couldn’t be
thought of as the only processing unit of any computer or mobile device. CPU+GPU was
the new norm.
GPU code side ... Furthermore, a single compiler would be great to compile both sides’
code, without requiring two separate compilations.
• There is no such thing as GPU Programming ...
• GPU always interfaces to the CPU through certain APIs ...
• So, there is always CPU+GPU programming ...
Given these facts, CUDA had to be based on the C programming language (for the
CPU side) to provide high performance. The GPU side also had to be almost exactly like
the CPU side with some specific keywords to distinguish between host versus device code.
The burden to determine how the execution would take place at runtime — regarding CPU
versus GPU execution sequences — had to be determined by the CUDA compiler. GPU
parallelism had to be exposed on the GPU side with a mechanism similar to the Pthreads
we saw in Part I of this book. By taking into account all of these facts, Nvidia designed
its nvcc compiler that is capable of compiling CPU and GPU code simultaneously. CUDA,
since its inception, has gone through many version updates, incorporating an increasing set
of sophisticated features. The version I use in this book is CUDA 8.0, released in September
2016. Parallel to the progress of CUDA, Nvidia GPU architectures have gone through
massive updates as I will document shortly.
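As a minimal illustration of this single-source model (a sketch for orientation only, not a
listing from this book), the following .cu file contains both host and device code and is
compiled in one step, e.g., nvcc hello.cu -o hello:

#include <cuda_runtime.h>
#include <stdio.h>

// device code: the __global__ keyword tells nvcc to compile this function for the GPU
__global__ void GpuHello(void)
{
    printf("Hello from GPU thread %d\n", threadIdx.x);
}

// host code: plain C, also compiled by nvcc, but for the CPU
int main(void)
{
    GpuHello<<<1, 32>>>();        // launch one block of 32 threads (one warp)
    cudaDeviceSynchronize();      // wait for the kernel so its output is flushed
    return 0;
}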
FIGURE 6.3 Three farmer teams compete in Analogy 6.1: (1) Arnold competes alone
with his 2× bigger tractor and “the strongest farmer” reputation; (2) Fred and
Jim compete together in a much smaller tractor than Arnold’s; (3) Tolga, along with
32 boy and girl scouts, competes using a bus. Who wins?
Analogy 6.1 is depicted in Figure 6.3 with three alternatives: Arnold represents a single-
threaded CPU that can work at 4 GHz, while Fred and Jim together in their smaller tractor
represent a dual-core CPU in which each core works at something like 2.5 GHz. We have done
major evaluations on the performance differences between these two alternatives in Part I
of the book. The interesting third alternative in Figure 6.3 is Tolga with the 32 boy/girl
scouts. This represents a single CPU core — probably working at 2.5 GHz — and a GPU
co-processor composed of 32 small cores that each work at something like 1 GHz. How could
we compare this alternative to the first two?
the GPUs win is the fact that the GPU cores are much simpler and they work at lower
speed. This allows the GPU chip designers to build a significantly higher number of cores
into their GPU chips and the lower speed keeps the power consumption below the magic
200–250 W, which is about the peak power you can consume from any semiconductor device
(i.e., “chip”).
Note that the power that is consumed by the GPU is not proportional to the frequency
of each core; instead, the dependence is something like quadratic. In other words, a 4 GHz
CPU core is expected to consume 16× more power than the same core working at 1 GHz.
This very fact allows GPU manufacturers to pack hundreds or even thousands of cores
into their GPUs without reaching the practical power consumption limits. This is actually
exactly the same design philosophy behind multicore CPUs too. A single core CPU working
at 4 GHz versus a dual-core CPU in which both cores work at 3 GHz could consume similar
amounts of power. So, as long as the parallelization overhead is low (i.e., η is close to 1),
a dual-core 3 GHz CPU is a better alternative than a single core 4GHz CPU. GPUs are
nothing more than this philosophy taken to the ultimate extreme with one big exception:
while the CPU multicore strategy calls for using multiple sophisticated (out-of-order)
cores that work at lower frequencies, this design strategy only works if you are trying to
put 2, 4, 8, or 16 cores inside the CPU. It simply won’t work for 1000! So, the GPUs had to
go through an additional step of making each core simpler. Simpler means that each core
is in-order (see Section 3.2.1), works at a lower frequency, and has L1$ memory that is not
coherent. Many of these details are going to become clear as we go through the following
few chapters. For now, the take-away from this section should be that GPUs incorporate a
lot of architectural changes — as compared to CPUs — to provide a manageable execution
environment for such a high core count.
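As a back-of-the-envelope check of this reasoning, using only the numbers quoted above and
assuming power grows with the square of the clock frequency:

P ∝ f²  ⟹  P(4 GHz) / P(1 GHz) = 4² / 1² = 16×
single core at 4 GHz:   1 × 4² = 16 power units
dual core at 3 GHz:     2 × 3² = 18 power units

The two designs therefore sit in roughly the same power budget, while the dual-core part
offers up to (2 × 3)/4 = 1.5× the aggregate clock throughput when the parallelization
efficiency η is close to 1.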
4. The existence of the warp concept has dramatic implications on the GPU architecture.
In Figure 6.3, we never talked about how the coconuts arrive in the bus. If you brought
only 5 coconuts into the bus, 27 of the scouts would sit there doing nothing; so, the
data elements must be brought into the GPU in the same bulk amounts, although
the unit of these data chunks is a half warp, or 16 elements.
5. The fact that the data arrives into the GPU cores in half warp chunks means that the
memory sub-system that is bringing the data into the GPU cores should be bringing
in the data 16-at-a-time. This implies a parallel memory subsystem that is capable of
shuttling around data elements 16 at a time, either 16 floats or 16 integers, etc. This
is why the GPU DRAM memory is made from GDDR5, which is parallel memory.
6. Because the CPU cores and GPU cores are completely different processing units,
it is expected that they have different ISAs (instruction set architectures). In other
words, they speak a different language. So, two different sets of instructions must
be written: one for Tolga, one for the scouts. In the GPU world, a single compiler
— nvcc — compiles both the CPU instructions and GPU instructions, although there
are two separate programs that the developer must write. Thank goodness, the CUDA
language combines them together and makes them so similar that the programmer can
write both of these programs without having to learn two totally different languages.
1. First, the CPU will read the command line arguments and will parse them and place
the parsed values in the appropriate CPU-side variables. This is exactly the same story as
in the plain CPU version of the code, imflipP.c.
2. One of the command line variables will be the file name of the image file we have to
flip, like the file that contains the dog picture, dogL.bmp. The CPU will read that
file by using a CPU function that is called ReadBMP(). The resulting image will be
placed inside a CPU-side array named TheImg[]. Notice that the GPU does absolutely
nothing so far.
3. Once we have the image in memory and are ready to flip it, now it is time for the
GPU’s sun to shine! Horizontal or vertical flipping are both massively parallel tasks, so
the GPU should do it. At this point in time, because the image is in a CPU-side array
(more generally speaking, in CPU memory), it has to be transferred to the device
side. What is obvious from this discussion is that the GPU has its own memory, in
addition to the CPU’s own memory — DRAM — that we have been studying since
the first time we saw it in Section 3.5.
4. The fact that the CPU memory and the GPU memory are completely different memory
areas (or “chips”) should be pretty clear because the GPU is a different plug-in device
that shares none of the electronic components with the CPU. The CPU memory is
soldered on the motherboard and the GPU memory is soldered on the GPU plug-in
card; the only way a data transfer can happen between these two memory areas is an
explicit data transfer — using the available APIs we will see shortly in the following
pages — through the PCI Express bus that is connecting them. I would like the reader
to refresh his or her memory with Figure 4.3, where I showed how the CPU connected
to the GPU through the X99 chipset and the PCI Express bus. The X99 chip facilitates
the transfers, while the I/O portion of the CPU “chip” employs hardware to interface
to the X99 chip and shuttle the data back and forth between the GPU memory and
the DRAM of the CPU (by passing through the L3$ of the CPU along the way).
5. So, this transfer must take place from the CPU’s memory into the GPU’s memory
before the GPU cores can do anything with the image data. This transfer occurs by
using an API function that looks like an ordinary CPU function.
6. After this transfer is complete, now somebody has to tell the GPU cores what to do
with that data. It is the GPU side code that will accomplish this. Well, the reality is
that you should have transferred the code before the data, so by the time the image
data arrives at the GPU cores they are aware of what to do with it. This implies that
we are really transferring two things to the GPU side: (1) data to process, (2) code
to process the data with (i.e., compiled GPU instructions).
7. After the GPU cores are done processing the data, another GPU→CPU transfer must
transfer the results back to the CPU.
Using our Figure 6.3 analogy, it is as if Tolga is first giving a piece of paper with the
instructions to the scouts so they know what to do with the coconuts (GPU side code),
grabbing 32 coconuts at a time (read from CPU memory), dumping 32 coconuts at a
time in front of the scouts (CPU→GPU data transfer), telling the scouts to execute their
given instructions, which calls for harvesting the coconuts that just got dumped in front
of them (GPU-side execution), and grabbing what is in front of them when they are done
(GPU→CPU data transfer in the reverse direction) and putting the harvested coconuts back
in the area where he got them (write the results back to the CPU memory).
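To make these seven steps concrete, here is a heavily simplified, self-contained sketch of the
CPU-side flow. It is not the book’s imflipG.cu: error checking, timing, and file I/O are
omitted, the image size is a placeholder, and the one-byte-per-thread copy kernel below
(named PixCopy after the kernel mentioned later in this chapter) has a stand-in body rather
than the book’s code.

#include <cuda_runtime.h>
#include <stdlib.h>

// stand-in GPU kernel: each thread copies one byte of the image
__global__ void PixCopy(unsigned char *ImgDst, unsigned char *ImgSrc, unsigned int FS)
{
    unsigned int MYgtid = blockDim.x * blockIdx.x + threadIdx.x;
    if (MYgtid < FS) ImgDst[MYgtid] = ImgSrc[MYgtid];
}

int main(void)
{
    unsigned int IMAGESIZE = 1024 * 1024 * 3;     // placeholder; really set from ReadBMP()
    unsigned char *TheImg  = (unsigned char *)malloc(IMAGESIZE);  // CPU-side image (Steps 1-2)
    unsigned char *CopyImg = (unsigned char *)malloc(IMAGESIZE);  // CPU-side result buffer

    unsigned char *GPUImg, *GPUCopyImg;           // CPU-side variables holding GPU pointers
    cudaMalloc((void **)&GPUImg, IMAGESIZE);      // allocate GPU memory (Steps 3-4)
    cudaMalloc((void **)&GPUCopyImg, IMAGESIZE);

    cudaMemcpy(GPUImg, TheImg, IMAGESIZE, cudaMemcpyHostToDevice);    // Step 5: CPU to GPU

    unsigned int ThrPerBlk = 256;
    unsigned int NumBlocks = (IMAGESIZE + ThrPerBlk - 1) / ThrPerBlk;
    PixCopy<<<NumBlocks, ThrPerBlk>>>(GPUCopyImg, GPUImg, IMAGESIZE); // Step 6: launch kernel

    cudaMemcpy(CopyImg, GPUCopyImg, IMAGESIZE, cudaMemcpyDeviceToHost); // Step 7: GPU to CPU
    cudaDeviceSynchronize();

    cudaFree(GPUImg);  cudaFree(GPUCopyImg);
    free(TheImg);      free(CopyImg);
    return 0;
}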
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <iostream>
#include <ctype.h>
#include <cuda.h>
This process sounds a little inefficient due to the continuous back-and-forth data transfer,
but don’t worry. There are multiple mechanisms that Nvidia built into their GPUs to make
the process efficient and the sheer processing power of the GPU eventually partially hides
the underlying inefficiencies, resulting in a huge performance improvement.
main() function and the following five pointers that facilitate image storage in CPU and
GPU memory:
• TheImg variable is the pointer to the memory that will be malloc()’d by the
ReadBMPLin() function to hold the image that is specified in the command line (e.g.,
dogL.bmp) in the CPU’s memory. Notice that this variable, TheImg, is a pointer to the
CPU DRAM memory.
• CopyImg variable is another pointer to the CPU memory and is obtained from a sep-
arate malloc() to allocate space for a copy of the original image (the one that will be
flipped while the original is not touched). Note that we have done nothing with the
GPU memory so far.
• As we will see very shortly, there are APIs that we will use to allocate memory in the
GPU memory. When we do this, using an API called cudaMalloc(), we are asking the
GPU memory manager to allocate memory for us inside the GPU DRAM. So, what
the cudaMalloc() returns back to us is a pointer to the GPU DRAM memory. Yet, we
will take that pointer and will store it in a CPU-side variable, GPUImg. This might look
confusing at first because we are saving a pointer to the GPU side inside a CPU-side
variable. It actually isn’t confusing. Pointers are nothing more than “values” or more
specifically 64-bit integers. So, they can be stored, copied, added, and subtracted in
exactly the same way 64-bit integers can be. When do we store GPU-side pointers
on the CPU side? The rule is simple: Any pointer that you will ever use in an API
that is called by the CPU must be saved on the CPU side. Now, let’s ask ourselves
the question: will the variable GPUImg ever be used by the CPU side? The answer
is definitely yes, because we will need to transfer data from the CPU to the GPU
using cudaMemcpy(). We know that cudaMemcpy() is a CPU-side function, although its
responsibility has a lot to do with the GPU. So, we need to store the pointers to both
sides in CPU-side variables. We will most definitely use the same GPU-side pointer
on the GPU side itself as well! However, we are now making a copy of it at the host
(CPU), so the CPU has the chance of accessing it when it needs it. If we didn’t do
this, the CPU would never have access to it in the future and wouldn’t be able to
initiate memory transfers to the GPU that involved that specific pointer.
• The other GPU-side pointers, GPUCopyImg and GPUResult, have the same story. They
are pointers to the GPU memory, where the resulting “flipped” image will be stored
(GPUResult) and another temporary variable that the GPU code needs for its operation
(GPUCopyImg). These two variables are CPU-side variables that store pointers that we
will obtain from cudaMalloc(); storing GPU pointers in CPU variables shouldn’t be
confusing.
There are multiple #include directives you will see in every CUDA program, which are
<cuda_runtime.h>, <cuda.h>, and <device_launch_parameters.h> to allow us to use Nvidia
APIs. These APIs, such as cudaMalloc(), are the bridge between the CPU and the GPU
side. Nvidia engineers wrote them and they allow you to transfer data between the CPU
and the GPU side magically without worrying about the details.
Note the types that are defined here, ul, uch, and ui, to denote the unsigned long,
unsigned char, and unsigned int, respectively. They are used so often that it makes the
code cleaner to define them as user-defined types. It serves, in this case, no purpose other than
to reduce the clutter in the code. The variables to hold the file names are InputFileName
and OutputFileName, which both come from the command line. The ProgName variable is
hard-coded into the program for use in reporting as we will see later in this chapter.
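Put together, the declarations described in this section look roughly like the following sketch
(a paraphrase for orientation, not the exact lines of Code 6.1; the array sizes and the
ProgName value are assumptions):

typedef unsigned long ul;     // shorthand types, purely to reduce clutter
typedef unsigned char uch;
typedef unsigned int  ui;

uch *TheImg, *CopyImg;                  // CPU-side pointers to CPU memory (from malloc())
uch *GPUImg, *GPUCopyImg, *GPUResult;   // CPU-side variables holding GPU memory pointers (from cudaMalloc())

char InputFileName[128], OutputFileName[128];   // set from the command line (sizes assumed)
char ProgName[] = "imflipG";                    // hard-coded program name (value assumed)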
Analogy 6.2 actually has quite a bit of detail. Let’s understand it.
• The city of cocoTown is the CPU and cudaTown is the GPU. Launching the spaceship
between the two cities is equivalent to executing GPU code. The notebook they left in
the spaceship contains the function parameters for the GPU-side function (for exam-
ple, the Vflip()); without these parameters cudaTown couldn’t execute any function.
• It is clear that the data transfer from the earth (cocoTown) to the moon (cudaTown)
is a big deal; it takes a lot of time and might even marginalize the amazing execution
speed at cudaTown. The spaceship represents the data transfer engine, while the
space itself is the PCI Express bus that connects the CPU and GPU.
• The satellite phone represents the CUDA runtime API library for cocoTown and
cudaTown to communicate. One important detail is that just because the satellite
phone operator is in cudaTown, it doesn’t guarantee that a copy is also saved in
cudaTown; so, these parameters (e.g., warehouse number) must still be put inside the
spaceship (written inside the notebook).
The variables time1, time2, time3, and time4 are all CPU-side variables that store time-
stamps during the transfers between the CPU and GPU, as well as the execution of the
GPU code on the device side. A curious observation from the code above is that we only
use Nvidia APIs to time-stamp the GPU-related events. Anything that touches the GPU
must be time-stamped with the Nvidia APIs, specifically cudaEventRecord() in this case.
But, why? Why can’t we simply use the good old gettimeofday() function we saw in the
CPU code listings?
The answer is in Analogy 6.2: We totally rely on Nvidia APIs (the people from the
moon) to time anything that relates to the GPU side. If we are doing that, we might as
well let them time all of the space travel time, both forward and back. We are recording
the beginning and end of these data transfers and GPU kernel execution as events, which
allows us to use Nvidia event timing APIs to time them, such as cudaEventRecord(). To be
used in this API an event must be first created using the cudaEventCreate() API. Because
the event recording mechanism is built into Nvidia APIs, we can readily use them to time
our GPU kernels and the CPU←→GPU transfers, much like we did with our CPU code.
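A simplified sketch of this event-based timing pattern, using the variable names from the
text (the actual Code 6.3 and Code 6.4 differ in detail), looks like this:

cudaEvent_t time1, time2;            // events that will hold the time-stamps
float tfrCPUtoGPU;                   // elapsed time, reported in milliseconds

cudaEventCreate(&time1);             // events must be created before they can be recorded
cudaEventCreate(&time2);

cudaEventRecord(time1, 0);           // time-stamp: just before the CPU-to-GPU transfer
cudaMemcpy(GPUImg, TheImg, IMAGESIZE, cudaMemcpyHostToDevice);
cudaEventRecord(time2, 0);           // time-stamp: transfer complete

cudaEventSynchronize(time2);         // make sure the event has actually been recorded
cudaEventElapsedTime(&tfrCPUtoGPU, time1, time2);   // difference between the two stamps, in ms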
In Code 6.3, we use time1 to time-stamp the very beginning of the code and time2 to
time-stamp the point when the CPU→GPU transfer is complete. Similarly, time3 is when
the GPU code execution is done and time4 is when the arrival of the results to the CPU
side is complete. The difference between any two of these time-stamps will tell us how long
each one of these events took to complete. Not surprisingly, the difference must also be
calculated by using the cudaEventElapsedTime() API — shown in Code 6.4 — in the CUDA
API library, because the stored time-stamps are in a format that is also a part of the Nvidia
APIs rather than ordinary variables.
Nvidia Runtime Engine contains a mechanism — through the cudaMalloc() API — for
the CPU to “ask” Nvidia to see if it can allocate a given amount of GPU memory. The
answer is returned in a variable of type cudaError t. If the answer is cudaSuccess, we know
that the Nvidia Runtime Engine was able to create the GPU memory we asked for and
placed the starting point of this memory area in a pointer that is named GPUImg. Remember
from Code 6.1 that the GPUImg is a CPU-side variable, pointing to a GPU-side memory
address.
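In code, that request looks roughly like the following sketch (not the exact lines of the
book’s listing):

cudaError_t cudaStatus;

// ask the Nvidia Runtime Engine for IMAGESIZE bytes of GPU memory;
// on success, GPUImg (a CPU-side variable) receives the GPU-side address
cudaStatus = cudaMalloc((void **)&GPUImg, IMAGESIZE);
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed! %s", cudaGetErrorString(cudaStatus));
    exit(EXIT_FAILURE);
}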
FIGURE 6.4 The Nvidia Runtime Engine is built into your GPU drivers, shown in your
Windows 10 Pro SysTray. When you click the Nvidia symbol, you can open the
Nvidia Control Panel to see the driver version as well as the parameters of your
GPU(s).
Much like the memory allocation API cudaMalloc(), the memory transfer API
cudaMemcpy() also uses the same status type cudaError t, which returns cudaSuccess if
the transfer completes without an error. If it doesn’t, then we know that something went
wrong during the transfer.
Going back to our Analogy 6.2, the cudaMemcpy() API is a specialized function that
the spaceship has; a way to transfer 166,656 coconuts super fast in the spaceship, instead
of worrying about each coconut one by one. Fairly soon, we will see that this memory
transfer functionality will become a lot more sophisticated and the transfer time will end
up being a big problem that will face us. We will see a set of more advanced memory transfer
functions from Nvidia to ease the pain! In the end, just because the transfers take a lot of
time, cudaTown people do not want to lose business. So, they will invent ways to make the
coconut transfer a lot more efficient to avoid discouraging cocoTown people from sending
business their way.
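cudaStatus = cudaMemcpy(GPUImg, TheImg, IMAGESIZE, cudaMemcpyHostToDevice);  // CPU-to-GPU copy whose status is checked below (arguments assumed for illustration)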
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaMemcpy failed! %s",cudaGetErrorString(cudaStatus));
exit(EXIT_FAILURE);
}
It is fairly common for programmers to write a wrapper function that wraps every single
CUDA API call around some sort of error checking as shown below:
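chkCUDAErr(cudaMemcpy(GPUImg, TheImg, IMAGESIZE, cudaMemcpyHostToDevice));   // an illustrative wrapped call (arguments assumed)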
where the wrapper function — chkCUDAErr() — is one that we write within our C code,
which directly uses the error code coming out of a CUDA API. An example wrapper function
is shown below, which exits the program when a GPU runtime error code is returned by any
CUDA API:
// helper function that wraps CUDA API calls, reports any error and exits
void chkCUDAErr(cudaError_t ErrorID)
{
	if (ErrorID != cudaSuccess){
		printf("CUDA ERROR :::%s\n", cudaGetErrorString(ErrorID));
		exit(EXIT_FAILURE);
	}
}
The Flip parameter is set based on the command line argument the user enters. When
the option ’H’ is chosen by the user, the Hflip() GPU-side function is called and the three
specified arguments (GPUCopyImg, GPUImg, and IPH) are passed onto Hflip() from the CPU
side. The ’V’ option launches the Vflip() kernel with four arguments, as opposed to the
three arguments in the Hflip() kernel; GPUCopyImg, GPUImg, IPH, and IPV. Once we look at
the details of both kernels, it will be clear why we need the additional argument inside
Vflip().
The following lines show what happens when the user chooses the ’T’ (transpose) or ’C’
(copy) options in the command line. I could have implemented transpose in a more efficient
way by writing a specific kernel for it; however, my goal was to show how two kernels can
be launched, one after the other. So, to implement ’T’, I launched Hflip followed by Vflip,
which effectively transposes the image. For the implementation of the ’C’ option, though, I
designed a totally different kernel PixCopy().
switch (Flip){
	...
	case 'T':	Hflip <<< NumBlocks, ThrPerBlk >>> (GPUCopyImg, GPUImg, IPH);
				Vflip <<< NumBlocks, ThrPerBlk >>> (GPUImg, GPUCopyImg, IPH, IPV);
				GPUResult = GPUImg;      GPUDataTransfer = 4*IMAGESIZE;
				break;
	case 'C':	NumBlocks = (IMAGESIZE+ThrPerBlk-1) / ThrPerBlk;
				PixCopy <<< NumBlocks, ThrPerBlk >>> (GPUCopyImg, GPUImg, IMAGESIZE);
				GPUResult = GPUCopyImg;  GPUDataTransfer = 2*IMAGESIZE;
				break;
}
When the option ’H’ is chosen by the user, the execution of the following line is handled
by the Nvidia Runtime Engine, which involves launching the Hflip() kernel and passing the
three aforementioned arguments to it from the CPU side.
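Hflip <<< NumBlocks, ThrPerBlk >>> (GPUCopyImg, GPUImg, IPH);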
Going forward, I will use the terminology launching GPU kernels. This contrasts with the
terminology of calling CPU functions; while the CPU calls a function within its own planet,
say earth according to Analogy 6.2, this is possibly not a good terminology for GPU kernels.
Because the GPU really acts as a co-processor, plugged into the CPU using a far slower
connection than the CPU’s own internal buses, calling a function in a far location such as the
moon deserves a more dramatic term like launching. In the GPU kernel launch line above,
Hflip() is the GPU kernel name, and the two parameters that are inside the <<< and >>>
symbols (NumBlocks and ThrPerBlk) tell the Nvidia Runtime Engine what dimensions to
run this kernel with; the first argument (NumBlocks) indicates how many blocks to launch,
and the second argument (ThrPerBlk) indicates how many threads are launched in each
block. Remember from Analogy 6.2 that these two numbers are what the cudaTown people
wanted to know; the number of boxes (NumBlocks) and the number of coconuts in each box
(ThrPerBlk). The generalized kernel launch line is as follows:
GPU Kernel Name <<< dimension, dimension >>> (arg1, arg2, ...);
where arg1, arg2, ... are the parameters passed from the CPU side onto the GPU kernel. In
Code 6.3, the arguments are the two pointers (GPUCopyImg and GPUImg) that were given to us
by cudaMalloc() when we created memory areas to store images in the GPU memory and
IPH is a variable that holds the number of pixels in the horizontal dimension of the image
(ip.Hpixels). GPU kernel Hflip() will need these three parameters during its execution and
would have no way of getting them had we not passed them during the kernel launch.
Remember that the two launch dimensions in Analogy 6.2 were 166,656 and 256, effectively
corresponding to the following launch line:
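Hflip <<< 166656, 256 >>> (GPUCopyImg, GPUImg, IPH);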
This tells the Nvidia Runtime Engine to launch 166,656 blocks of the Hflip() kernel and pass
the three parameters onto every single one of these blocks. So, the following blocks will be
launched: Block 0, Block 1, Block 2, ... Block 166,655. Every single one of these blocks will
execute 256 threads (tid = 0, tid = 1, ... , tid = 255), identical to the pthreads examples we
saw in Part I of the book. What we are really saying is that we are launching a total of
166,656×256 ≈ 41 M threads with this single launch line.
It is worth noting the difference between Million and Mega: a Million threads means
1,000,000 threads, while Mega threads means 1024×1024 = 1,048,576 threads. Similarly,
Thousand is 1000 and Kilo is 1024. I will notate 41 Mega threads as 41 M threads, and
likewise 41,664 Kilo threads as 41,664 K threads. To summarize: this launch creates
166,656 × 256 = 42,663,936 threads, which is ≈ 41 M threads, or equivalently 41,664 K threads.
One important note to take here is that the GPU kernel is a bunch of GPU machine code
instructions, generated by the nvcc compiler, on the CPU side. These are the instructions
for the cudaTown people to execute in Analogy 6.2. Let’s say you wanted them to flip the
order in which the coconuts are stored in the boxes and send them right back to earth. You
then need to send them instructions about how to flip them (Hflip()), because cudaTown
people do not know what to do with the coconuts once they receive them. They need the
coconuts (data), as well as the sequence of commands to execute (instructions). So, the
compiled instructions also travel to cudaTown in the spaceship, written on a big piece of
paper. At runtime, these instructions are executed on each block independently. Clearly,
the performance of your GPU program depends on the efficiency of the kernel instructions,
i.e., the programmer.
Let’s refresh our memory with Code 2.8, which was the MTFlipH() CPU function that
accepted a single parameter named tid. By looking at the tid parameter that is passed onto
it, this CPU function knew “who it was.” Based on who it was, it processed a different part of
the image, indexed by tid in some fashion. The GPU kernel Hflip() has stark similarities to
it: This kernel acts almost exactly like its CPU sister MTFlipH() and the entire functionality
of the Hflip() kernel will be dictated by a thread ID. Let’s now compare them:
• MTFlipH() function is launched with 4–8 threads, while the Hflip() kernel is launched
with almost 40 million threads. I talked about the overhead in launching CPU threads
in Part I, which was really high. This overhead is almost negligible in the GPU world,
allowing us to launch a million times more of them.
• MTFlipH() expects the Pthread API call to pass the tid to it, while the Hflip() kernel
will receive its thread ID (0...255) directly from Nvidia Runtime Engine, at runtime.
As the GPU programmer, all we have to worry about is to tell the kernel how many
threads to launch and they will be numbered automatically.
• Due to the million-times-higher number of threads we launch, some sort of hierarchy
is necessary. This is why the thread numbering is broken down into two values: a block
ID and a thread ID within the block. Blocks are little chunks of threads, 256 threads
each in this example, and each block executes completely independently of the others.
cudaEventSynchronize(time1); cudaEventSynchronize(time2);
cudaEventSynchronize(time3); cudaEventSynchronize(time4);
cudaEventElapsedTime(&totalTime, time1, time4);
cudaEventElapsedTime(&tfrCPUtoGPU, time1, time2);
cudaEventElapsedTime(&kernelExecutionTime, time2, time3);
cudaEventElapsedTime(&tfrGPUtoCPU, time3, time4);
cudaStatus = cudaDeviceSynchronize();
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "\n Program failed after cudaDeviceSynchronize()!");
free(TheImg); free(CopyImg); exit(EXIT_FAILURE);
}
WriteBMPlin(CopyImg, OutputFileName); // Write the flipped image back to disk
...
The cudaDeviceSynchronize() function waits for every single launched kernel to complete its
execution. The result could be an error, in which case cudaDeviceSynchronize() will return
an error code. Otherwise, everything is good and we move onto reporting the results.
cudaEventRecord(time3, 0);
// Copy output (results) from GPU buffer to host (CPU) memory.
cudaStatus = cudaMemcpy(CopyImg, GPUResult, IMAGESIZE, cudaMemcpyDeviceToHost);
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaMemcpy GPU to CPU failed!");
exit(EXIT_FAILURE);
}
cudaEventRecord(time4, 0);
cudaEventSynchronize(time1); cudaEventSynchronize(time2);
cudaEventSynchronize(time3); cudaEventSynchronize(time4);
cudaEventElapsedTime(&totalTime, time1, time4);
cudaEventElapsedTime(&tfrCPUtoGPU, time1, time2);
cudaEventElapsedTime(&kernelExecutionTime, time2, time3);
cudaEventElapsedTime(&tfrGPUtoCPU, time3, time4);
PTX instructions and further half-compiles them at runtime and feeds the full-compiled
instructions into the GPU cores. In Windows, all of the “Nvidia magic code” that facil-
itates this “further-half-compiling” is built into a Dynamic Link Library (DLL) named
cudart (CUDA Run Time). There are two flavors: in modern x64 OSs, it is cudart64
and in old 32-bit OSs, it is cudart32, although the latter should never be used because
all modern Nvidia GPUs require a 64-bit OS for efficient use. In my Windows 10 Pro
PC, for example, I was using cudart64_80.dll (Runtime Dynamic Link Library for CUDA
8.0). This file is not something you explicitly have to worry about; the nvcc compiler
will put it in the executable directory for you. I am just mentioning it so you are aware
of it.
Let’s compare Code 6.7 to its CPU sister Code 2.7. Let’s assume that both of them are
trying to flip the astronaut.bmp image in Figure 5.1 vertically. astronaut.bmp is a 7918×5376
image that takes ≈ 121 MB on disk. How would their functionality be different?
• For starters, assume that Code 2.7 uses 8 threads; it will assign the flipping task of 672
lines to each thread (i.e., 672 × 8 = 5376). Each thread will, then, be responsible for
processing ≈ 15 MB of information out of the entire image, which contains ≈ 121 MB
of information in its entirety. Because the launch of more than 10–12 threads will not
help on an 8C/16T CPU, as we witnessed over and over again in Part I, we cannot
really do better than this when it comes to the CPU.
• The GPU is different though. In the GPU world, we can launch a gazillion threads
without incurring any overhead. What if we went all the way to the bitter extreme and
had each thread swap a single pixel? Let’s say that each GPU thread takes a single
pixel’s RGB value (3 bytes) from the source image GPU memory area (pointed to by
*ImgSrc) and writes it into the intended vertically flipped destination GPU memory
area (pointed to by *ImgDst).
• Remember, in the GPU world, our unit of launch is blocks, which are clumps of threads,
each clump being 32, 64, 128, 256, 512, or 1024 threads. Also remember that it cannot
be less than 32, because “32” is the smallest amount of parallelism we can have and
32 threads are called a warp, as I explained earlier in this chapter. Let’s say that each
one of our blocks will have 256 threads to flip the astronaut image. Also, assume that
we are processing one row of the image at a time using multiple blocks. This means
that we need ⌈7918/256⌉ = 31 blocks to process each row.
• Because we have 5376 rows in the image, we will need to launch 5376 × 31 = 166,656
blocks to vertically flip the astronaut.bmp image.
• We observe that 31 blocks-per-row will yield some minor loss, because 31 × 256 = 7936
and we will have 18 threads (7936 − 7918 = 18) doing nothing to process each row
of the image. Oh well, nobody said that massive parallelism doesn’t have its own
disadvantages.
• This problem of “useless threads” is actually exacerbated by the fact that not only
are these threads useless, but they also have to check to see whether they are the useless
threads, as shown in the line below:
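The guarding line itself is not reproduced on this page; the following is a minimal reconstruction of it, using the MYcol and Hpixels names from Code 6.7, and should be read as an assumption rather than the verbatim line from the book:

	if (MYcol >= Hpixels) return;		// a useless thread: no pixel left in this row to copy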
This line simply says “if my tid is between 7918 and 7935 I shouldn’t do anything,
because I am a useless thread.” Here is the math: We know that the image has
7918 pixels in each row. So, the threads tid = 0...7917 are useful, and because we
launched 7936 threads (tid = 0...7935), this designates threads (tid = 7918...7935) as
useless.
• Don’t worry about the fact that we do not see tid in the comparison; rather, we see
MYcol. When you calculate everything, the underlying math ends up being exactly
what I just described. The reason for using a variable named MYcol is that the
code has to be parametric, so it works for an image of any size, not just astronaut.bmp.
• Why is it so bad if only 18 threads check this? After all, 18 is only a very small
percentage of the 7936 total threads. Well, this is not what happens. Like I said before,
what you are seeing in Code 6.7 is what every thread executes. In other words, all
7936 threads must execute the same code and must check to see if they are useless,
just to find that they aren’t useless (most of the time) or they are (only a fraction
of the time). So, with this line of code, we have introduced overhead to every thread.
How do we deal with this? We will get to it, I promise. But, not in this chapter ... For
now, just know that even with these inefficiencies — which are an artifact of massively
parallel programming — our performance is still acceptable.
• And, finally, __global__ is the third CUDA symbol that I am introducing here,
after <<< and >>>. If you precede any ordinary C function with __global__, the
nvcc compiler will know that it is a GPU-side function and will compile it into PTX,
rather than x64 machine code. There will be a few more of these CUDA
designators, but, aside from that, CUDA looks exactly like C.
Here, the ts and te variables computed the starting and ending row numbers in the
image, respectively. Vertical flipping was achieved by two nested loops, one scanning the
columns and the other scanning the rows. Now, let’s compare this to the Vflip() function
in Code 6.7:
__global__
void Vflip(uch *ImgDst, uch *ImgSrc, ui Hpixels, ui Vpixels)
{
	ui ThrPerBlk = blockDim.x;
	ui MYbid = blockIdx.x;
	ui MYtid = threadIdx.x;
	ui MYgtid = ThrPerBlk * MYbid + MYtid;
	ui BlkPerRow = (Hpixels + ThrPerBlk - 1) / ThrPerBlk; // ceil
	ui RowBytes = (Hpixels * 3 + 3) & (~3);
We see that there are major similarities and differences between the CPU and GPU func-
tions. The task distribution in the GPU function is completely different because of the
blocks and the number of threads in each block. So, although the GPU still calculates a
bunch of indexes, they are completely different from the ones in the CPU function. The GPU
kernel first wants to know how many threads were launched with each block. The answer is
in a special GPU value named blockDim.x. We know that this answer will be 256 in our specific
case because we specified 256 threads to be launched in each block (Vflip<<<..., 256>>>(...)).
So, each block contains 256 threads, with thread IDs 0...255. The specific thread ID of this
thread is in threadIdx.x. The kernel also wants to know, out of the 166,656 blocks, what its own
block ID is. This answer is in another GPU value named blockIdx.x. Surprisingly, it doesn’t care
about the total number of blocks (166,656) in this case. There will be other programs that do.
It saves its block ID and thread ID in two variables named bid and tid. It then computes
a global thread ID (gtid) using a combination of these two. This gtid gives a unique ID
to each one of the launched GPU threads (out of the total 166,656 × 256 ≈ 42.7 M threads),
thereby linearizing them. This concept is very similar to how we linearized the pixel memory
locations on the disk according to Equation 6.2. However, an immediate correlation between
linear GPU thread addresses and linear pixel memory addresses is not readily available in
this case due to the existence of the useless threads in each row. Next, it computes the blocks
per row (BlkPerRow), which was 31 in our specific case. Finally, because the number of
horizontal pixels (7918) was passed to this function as the third parameter, it can compute
the number of bytes in a row of the image (3 × 7918 = 23,754 bytes, rounded up to 23,756 by
the RowBytes formula so that each BMP row starts on a 4-byte boundary) to
determine the byte index of each pixel.
After these computations, the kernel moves on to computing the row and column
index of the single pixel that it is responsible for copying, as follows:
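The lines themselves are not shown on this page; what follows is a plausible reconstruction under the naming convention of Code 6.7 (MYrow, MYcol, MYmirrorrow and the offsets are assumptions, not the verbatim listing):

	ui MYrow = MYbid / BlkPerRow;                              // which image row this block works on
	ui MYcol = MYgtid - MYrow * BlkPerRow * ThrPerBlk;         // which pixel within that row
	if (MYcol >= Hpixels) return;                              // useless thread: past the end of the row
	ui MYmirrorrow = Vpixels - 1 - MYrow;                      // destination row after the vertical flip
	ui MYsrcIndex = MYrow * RowBytes + 3 * MYcol;              // byte address of the source pixel
	ui MYdstIndex = MYmirrorrow * RowBytes + 3 * MYcol;        // byte address of the destination pixel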
After these lines, the source pixel memory address is in MYsrcIndex and the destination
memory address is in MYdstIndex. Because each pixel contains three bytes (RGB) starting
at that address, the kernel copies three consecutive bytes starting at that address as follows:
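The copy itself is three byte assignments; a minimal sketch (again an assumption following the same naming convention, not the verbatim lines) is:

	ImgDst[MYdstIndex]     = ImgSrc[MYsrcIndex];      // first of the 3 bytes of the pixel
	ImgDst[MYdstIndex + 1] = ImgSrc[MYsrcIndex + 1];  // second byte
	ImgDst[MYdstIndex + 2] = ImgSrc[MYsrcIndex + 2];  // third byte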
Let’s now compare this to CPU Code 2.7. Because we could launch only 4–8 CPU threads, instead
of the massive ≈ 42.7 M threads we just witnessed, one striking observation about the GPU kernel
is that the for loops are gone! In other words, instead of explicitly scanning over the columns
and rows, as the CPU function has to, we don’t have to loop over anything. After all, the
entire purpose of the loops in the CPU function was to scan the pixels with some sort of
two-dimensional indexing, facilitated by the row and column variables. In the GPU
kernel, we achieve the same functionality by using the tid and bid, because we know the
precise relationship between the pixel coordinates and the tid and bid variables.
TABLE 6.1 CUDA keywords and symbols that we learned in this chapter.

CUDA Keyword/Symbol: __global__
Description: precedes a device-side function (i.e., a kernel)
Example:
	__global__
	void PixCopy(uch *ImgDst, uch *ImgSrc, ui FS)
	{
		...
	}

CUDA Keyword/Symbol: <<< >>>
Description: launches a device-side kernel from the host side
Examples:
	Hflip<<<NumBlocks, ThrPerBlk>>>(..., ..., ...);
	Vflip<<<NumBlocks, ThrPerBlk>>>(..., ..., ...);
	PixCopy<<<NumBlocks, ThrPerBlk>>>(..., ..., ...);
case we got lucky. However, if we had 20 more bytes in the file, we would have 236 threads
wasted out of the 256 in the very last block. This is why we still have to put an
if statement in the kernel to check for this condition, as shown below:
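The guard is a single comparison against the file-size parameter; a minimal reconstruction (an assumption, using FS from the PixCopy() signature in Table 6.1 and a global thread index computed as in Vflip()) is:

	if (MYgtid >= FS) return;		// useless thread: its byte index falls past the end of the image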
The if statement in Code 6.9, much like the ones in Code 6.7 and Code 6.8, checks if “it
is a useless thread” and does nothing if it is. The performance impact of this line is similar
to the previous two kernels: although the condition will only be true for a negligible number
of threads, every thread still has to execute the check for every single byte it copies. We can
improve this, but we will save these improvement ideas for the upcoming chapters. For
now, it is worth noting that the performance impact of this if statement in the PixCopy()
kernel is far worse than in the other two kernels. The PixCopy() kernel has
a much finer granularity, as it copies only a single byte. Because of this, there are only 6
lines of C code in PixCopy(), one of which is the if statement. In contrast, Code 6.7 and
Code 6.8 contain 16–17 lines of code, making the impact of one added line much
smaller. Although “lines of code” clearly does not translate one-to-one into “the number of
cycles that it takes for the GPU cores to execute the corresponding instructions,” we can still
get an idea about the magnitude of the problem.
the (1) horizontal or (2) vertical direction, (3) copies it to another image, or (4) transposes
it. The command line to run imflipG.cu is as follows:
imflipG astronaut.bmp a.bmp V 256
This vertically flips an image named astronaut.bmp and writes the flipped image into another
file named a.bmp. The ’V’ option is the flip direction (vertical) and 256 is the number of
threads in each block, which is what we will plug into the second kernel launch parameter,
as in Vflip<<<..., 256>>>(...). We could choose ’H’, ’C’, or
’T’ for horizontal flip, copy, or transpose operations.
FIGURE 6.5 Creating a Visual Studio 2015 CUDA project named imflipG.cu. Assume
that the code will be in a directory named Z:\code\imflipG in this example.
• You can select the Network Installer, which will install straight from the Internet.
After spending 50 GB of your hard disk space on VS 2015, you will not be terribly
worried about another GB. Either option is fine. I always choose the network installer,
so I don’t have to worry about deleting the local installer code after the installation
is done.
• Click OK for the default extraction paths. The screen may go blank for a few seconds
while the GPU drivers are being configured. After the installation is complete, you
will see a new option in your Visual Studio, named “NSIGHT.”
FIGURE 6.6 Visual Studio 2015 source files are in the Z:\code\imflipG\imflipG di-
rectory. In this specific example, we will remove the default file, kernel.cu, that VS
2015 creates. After this, we will add an existing file, imflipG.cu, to the project.
project source files are going to be placed by VS 2015; so, the source files will be under
the directory Z:\code\imflipG\imflipG. Go into Z:\code\imflipG\imflipG; you will see a file
named kernel.cu and another file we don’t care about. The kernel.cu file is created in the
source file directory automatically by VS 2015 by default.
At this point, there are three ways you can develop your CUDA project:
1. You can use kernel.cu as a template: enter your code inside it, delete the parts
you don’t want, and compile and run it as your only kernel code.
2. You can rename kernel.cu as something else (say, imflipG.cu) by right clicking on it
inside VS 2015. You can clean what is inside the renamed imflipG.cu and put your
own CUDA code in there. Compile it and run it.
3. You can remove the kernel.cu file from the project and add another file, imflipG.cu,
to the project. This assumes that you already have this file, either acquired from
someone else or written in a different editor.
I will choose the last option. One important thing to remember is that you should never
rename/copy/delete the files from Windows. You should perform any one of these operations
inside Visual Studio 2015. Otherwise, you will confuse VS 2015 and it will try to use a file
that doesn’t exist. Because I intend to use the last option, the best thing to do is to actually
plop the file imflipG.cu inside the Z:\code\imflipG\imflipG directory first. The screen shot
after doing this is shown at the bottom of Figure 6.6. This is, for example, what you would
FIGURE 6.7 The default CPU platform is x86. We will change it to x64. We will also
remove the GPU debugging option.
do if you are testing the programs I am supplying as part of this book. Although you will
get only a single file, imflipG.cu, as part of this book, it must be properly added to a VS
2015 project so that you can compile and execute it. Once the compilation is done, there will
be a lot of miscellaneous files in the project directory; however, the source is still just a
single file: imflipG.cu.
Figure 6.6 also shows the steps for deleting the kernel.cu file. You right click it and choose
“Remove” first (top left). A dialog box will appear asking whether you want to just
remove it from the project but keep the actual file (the “Remove” option) or remove it from
the project and delete the actual file too (the “Delete” option). If you choose the “Delete”
option, the file will be gone and it will no longer be a part of the project. This is the graceful
way to get this file permanently out of your life, while also letting VS 2015 know about it
along the way. After kernel.cu is gone, you right click the project and this time Add a file
to it. You can either add the file that we just dropped into the source directory (which is
what we want to do by choosing the “Add Existing Item” option), or add a new file that
doesn’t exist and you will start editing (the “Add New Item” option). After we choose “Add
Existing,” we see the new imflipG.cu file added to the project in Figure 6.6. We are now
ready to compile it and run it.
FIGURE 6.8 The default Compute Capability is 2.0. This is too old. We will change it
to Compute Capability 3.0, which is done by editing Code Generation under Device
and changing it to compute_30, sm_30.
will open, as shown in Figure 6.7. For the GPU, the first option to set is Generate GPU
Debug Information. If you choose “Yes” here, you will be able to run the GPU debugger;
however, your code will run at half the speed because the compiler has to add all sorts of
breakpoints inside your code. Typically, the best thing to do is to keep this at “Yes” while
you are developing your code. After your code is fully debugged, you switch it to “No,” as
shown in Figure 6.7.
After you choose the GPU Debug option, you have to edit Code Generation under
CUDA C/C++ → Device, as shown in Figure 6.8.
The default Compute Capability is 2.0, which will not allow you to use many of the new
features of modern Nvidia GPUs; you have to change it to Compute Capability 3.0.
Once the “Code Generation” dialog box opens, first uncheck Inherit from parent
or project defaults. The default Compute Capability 2.0 is represented by the “compute_20, sm_20”
string; change it to “compute_30, sm_30” by typing this new string
into the textbox at the top of the Code Generation dialog box, as shown in Figure 6.8. Click
“OK” and the compiler now knows to generate code that will work for Compute Capability
3.0 and above. When you do this, your compiled code will no longer work on any GPU
that only supports 2.0 and below. There have been major changes starting with Compute
Capability 3.0, so it is better to compile for at least 3.0. The Compute Capability of Nvidia
GPUs is analogous to the x86 versus x64 Intel ISA distinction, except that there are quite a few
more options, from Compute Capability 1.0 all the way up to 6.x (for the Pascal family) and
7.x for the upcoming Volta family.
The best option to choose when you are compiling your code is to set your Compute Capa-
bility to the lowest that will allow you to run your code at an acceptable speed. If you set
it too high, like 6.0, then your code will only run on Pascal GPUs; however, you will have
the advantage of using some of the high-performance instructions that are only available
FIGURE 6.9 Compiling imflipG.cu to get the executable file imflipG.exe in the
Z:\code\imflipG\x64\Debug directory. (The figure highlights the annoying squiggly
lines in the editor and the resulting imflipG.exe executable file.)
in Pascal GPUs. Alternatively, if you use a low number, like 2.0, then your code might
be exposed to the severe limitations of the early days of CUDA, when, just as a quick
example, the block-count limitations were so restrictive that you had to launch kernels
in a loop because each kernel launch could only have a maximum of ≈ 60,000 blocks, rather
than the multiple billions allowed starting with Compute Capability 3.0. This would be a huge
problem even in our very first CUDA program, imflipG.cu; as we analyzed in Section 6.4.15,
imflipG.cu required us to launch 166,656 blocks. Using Compute Capability 2.0 would require
that we somehow chop up our code into three separate kernel launches, which would make the
code messy. However, using Compute Capability 3.0 and above, we no longer have to worry
about this because we can launch billions of blocks with each kernel. We will study this in
great detail in the next chapter. This is why 3.0 is a good default for your projects, and I will
assume 3.0 as the default for all of the code I present in this book, unless otherwise stated
explicitly. If you will always use 3.0, it might be better to change the project defaults, rather
than having to change this setting every time you create a new CUDA program template.
Once you choose the Compute Capability, you can compile and run your code: go to
BUILD → Build Solution as shown in Figure 6.9. If there are no problems, your screen will
look like what I show in Figure 6.9 (1 succeeded and 0 failed) and your executable
file will be in the Z:\code\imflipG\x64\Debug directory. If there are errors, you can click an
error to jump to the source line that caused it.
Although Visual Studio 2015 is a very nice IDE, it has a super annoying feature when it
comes to developing CUDA code. As you see in Figure 6.9 (see the ebook for the color version),
your kernel launch lines in main(), consisting of CUDA’s signature <<< and >>> brackets,
will have squiggly red lines under them as if they were a syntax error. It gets worse; because
VS 2015 sees them as dangerous aliens trying to invade this planet, any chance it gets, it will
try to separate them into double and single brackets: “<<<” will become “<< <”. It will drive you
nuts when it separates them, you connect them back together, and, a minute later, they are
separated again. Don’t worry. You will figure out how to handle them in time and you will
get over it. I no longer have any issues with it. Ironically, even after being separated, nvcc
will still compile the code correctly. So, the squiggly lines are nothing more than a nuisance.
C:\> Z:
Z:\> CD Z:\code\imflipG\x64\Debug
Z:\code\imflipG\x64\Debug> imflipG Astronaut.bmp Output.bmp V 256
As seen in Figure 6.10, if you have File Explorer open, you can browse this executable
code directory and when you click the location dropdown box, the directory name will be
highlighted (Z:\code\imflipG\x64\Debug), allowing you to copy it using Ctrl-C. You can
then type “CD” inside your CMD window and paste that directory name after “CD”, which
eliminates the need to remember that long directory name. The program will require the
source file Astronaut.bmp that we are specifying in the command line. If you try to run it
without the Astronaut.bmp file in the executable directory, you will get an error message;
otherwise, the program will run and place the expected output file, Output.bmp
in this case, in the same directory. To visually inspect this file, all you have to do is
open a browser (such as Internet Explorer or Mozilla Firefox) and drop the file into the browser
window. Even simpler, you can double click the image and Windows will open the
associated application to view it. If you want to change that default application, Windows
will typically give you an option to do so.
If everything checks out OK in this list after the execution of the program is complete,
then your program may be fine. After these checks, the only remaining issues are subtle
ones. These issues do not manifest themselves as errors or crashes; they may have subtle
effects that are hard to catch with the checklist above, such as the image being shifted one
pixel to the right, leaving the leftmost column blank (e.g., white). You wouldn’t be
able to tell this problem with the simple visual check, not even when you drag and drop the
file into a browser. That single column of blank pixels would be white, much like
the background color of the browser, making it difficult to distinguish the blank column
from the browser background. However, a trained eye knows to be
suspicious of everything and can spot the most subtle differences like this. In any event,
a simple file comparison will clear up any doubt about these kinds of
problems.
Just as computer programmer trivia, I can’t stop myself from mentioning a third kind
of problem: everything checks out fine, and the golden and output files compare fine.
However, the program gradually degrades the computer’s performance. So, in a sense, al-
though your program is producing the expected output, it is not running properly. This
is the kind of problem that will really challenge an intermediate programmer, even an
experienced one. But, more than likely, an experienced programmer will not have these
types of bugs in his or her code; yeah right! Examples of these bugs include ones that
allocate memory and do not free it or ones that write a file with the wrong attributes,
preventing another program from modifying it, assuming that the intent of the program
is to produce an output that can be further modified by another program, etc. If you are
a beginner, you will develop your own experience database as time goes on and will be
proficient in spotting these bugs. I can make a suggestion for you though: be suspicious of
everything! You should be able to detect any anomalies in performance, output speed, the
difference between two different runs of the same code, and more. When it comes to com-
puter software bugs, and for that matter even hardware design bugs, it might be a good
time to repeat the words of Intel's legendary former CEO, the late Andy Grove: only the paranoid
survive.
FIGURE 6.11 The /usr/local directory in Unix contains your CUDA directories.
Actually, in a Windows platform, this is precisely what Visual Studio does when you
click the “Build” option. You can view and edit the command line options that VS 2015
will use when compiling your CUDA code by going to PROJECT → imflipG Properties
on the menu bar. Xcode IDE is no different. Indeed, the Eclipse IDE that I will describe
when showing the Unix CUDA development environment is identical. Every IDE will have
an area to specify the command line arguments to the underlying nvcc compiler. In Xcode,
Eclipse, or VS 2015, you can completely skip the IDE and compile your code using the
command line terminal. The CMD tool of Windows also works for that.
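As a minimal illustration of such a command line build (assuming nvcc is on your PATH and that we target the Compute Capability 3.0 discussed earlier; the exact flags are not taken from the book):

	nvcc -arch=sm_30 -o imflipG imflipG.cu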
Here, there are two different CUDA directories shown. This is because CUDA 7.5 was
installed first, followed by CUDA 8.0. So, both of the directories for 7.5 and 8.0 are there.
The /usr/local/cuda symbolic link points to the one that we are currently using. This is why
it might be a better idea to actually put this symbolic link in your PATH variable, instead
of a specific one like cuda-8.0, which I showed above.
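For example, under bash you could put the symbolic link, rather than a versioned directory, on the relevant paths; this is a sketch of that idea, not the book's exact setup:

	export PATH=/usr/local/cuda/bin:$PATH
	export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH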
A dialog box opens asking you for the workspace location. Use the default or set it to
your preferred location and press OK. You can create a new CUDA project by choosing
File → New → CUDA C/C++ Project, as shown in Figure 6.12.
Build your code by clicking the hammer icon and run it. To execute a compiled program
on your local machine, run it as you would any other program. However, because we are
FIGURE 6.12 Creating a new CUDA project using the Eclipse IDE in Unix.
normally going to be passing files and command line arguments to the program, you will
probably want to cd into the directory with the binary and run it from there. You
could specify the command line arguments from within your IDE, but this is somewhat
tedious if you are changing them frequently. The binaries generated by IDEs generally
appear in some subfolder of your project (Eclipse puts them in the Debug and Release
folders). As an example, to run the “Release” version of an application in Linux that we
developed in Eclipse/Nsight, we may type the following commands:
cd ~/cuda-workspace/imflipG/Release
./imflipG
This will run your CUDA code and will display the results exactly like in Windows.
CHAPTER 6
Optimization
techniques and best
practices for parallel
codes
CONTENTS
6.1 Data prefetching, communication and computations overlapping and increasing computation efficiency   252
    6.1.1 MPI   253
    6.1.2 CUDA   256
6.2 Data granularity   257
6.3 Minimization of overheads   258
    6.3.1 Initialization and synchronization overheads   258
    6.3.2 Load balancing vs cost of synchronization   260
6.4 Process/thread affinity   260
6.5 Data types and accuracy   261
6.6 Data organization and arrangement   261
6.7 Checkpointing   262
6.8 Simulation of parallel application execution   264
6.9 Best practices and typical optimizations   265
    6.9.1 GPUs/CUDA   265
    6.9.2 Intel Xeon Phi   266
    6.9.3 Clusters   269
    6.9.4 Hybrid systems   270
Such idle times may show up at various levels in a parallel system, including:
6.1.1 MPI
There are at least two programming approaches in MPI that allow overlapping of
communication and computations and data prefetching.
Listing 6.1 presents an approach with non-blocking API calls, described in
detail in Section 4.1.11. The solution uses MPI_I* calls to start fetching
data. Since these are non-blocking calls, the calling process can perform compu-
tations immediately after the call, which only issues a request to start the
communication. After the computations, i.e., processing of a data packet, have
completed, the non-blocking communication needs to be finalized using MPI_Wait,
and processing of the just-received data packet can follow. If more data packets
are to be processed, a new data packet can be fetched before the computations start.
MPI_Recv(inputbuffer,...);
packet=unpack(inputbuffer);
while (shallprocess(packet)) {
  // first start receiving the next data packet (non-blocking)
  MPI_Irecv(inputbuffer,...,&mpirequest);
  // process the previously received packet while the transfer proceeds
  process(packet);
  // finalize the non-blocking receive before reusing the buffer
  MPI_Wait(&mpirequest,MPI_STATUS_IGNORE);
  packet=unpack(inputbuffer);
}
...
Alternatively, the code without unpacking of data from a buffer and using
two buffers instead is shown in Listing 6.2.
while (shallprocess(buffer)) {
  // swap roles: prevbuffer points to the buffer holding the packet just received
  // (to be processed now), while buffer will receive the next packet
  if (buffer==inputbuffer1) {
    buffer=inputbuffer0;
    prevbuffer=inputbuffer1;
  } else {
    buffer=inputbuffer1;
    prevbuffer=inputbuffer0;
  }
}
...
In fact, a slave process would normally send its results back to the parent
process. Overlapping sends with the processing of subsequent data packets can
also be arranged. Such a solution is shown in Listing 6.3.
while (shallprocess(buffer)) {
  // swap roles: prevbuffer holds the packet just received, prevresultbuffer the buffer
  // for its results (to be sent back), while buffer will receive the next packet
  if (buffer==inputbuffer1) {
    buffer=inputbuffer0;
    prevbuffer=inputbuffer1;
    prevresultbuffer=outputbuffer1;
  } else {
    buffer=inputbuffer1;
    prevbuffer=inputbuffer0;
    prevresultbuffer=outputbuffer0;
  }
  ...
6.1.2 CUDA
Implementation of overlapping of communication between the host and a GPU
with processing on the GPU can be done using streams, described in Section
4.4.5. Specifically, using two streams potentially allows overlapping of communi-
cation between page-locked host memory and the device (in one stream), com-
putations on the device (launched in another stream), as well as processing on
the host:
cudaStream_t streamS1, streamS2;
cudaStreamCreate(&streamS1);
cudaStreamCreate(&streamS2);
// stream S1: copy input to the device, launch kernelA, copy the results back
cudaMemcpyAsync(devicebuffer1,sourcebuffer1,copysize1,
                cudaMemcpyHostToDevice,streamS1);
kernelA<<<gridsizeA,blocksizeA,0,streamS1>>>(...);
cudaMemcpyAsync(hostresultbuffer1,deviceresultbuffer1,copyresultsize1,
                cudaMemcpyDeviceToHost,streamS1);
// stream S2: the same pattern for the second batch of data; S1 and S2 may overlap
cudaMemcpyAsync(devicebuffer2,sourcebuffer2,copysize2,
                cudaMemcpyHostToDevice,streamS2);
kernelB<<<gridsizeB,blocksizeB,0,streamS2>>>(...);
cudaMemcpyAsync(hostresultbuffer2,deviceresultbuffer2,copyresultsize2,
                cudaMemcpyDeviceToHost,streamS2);
// meanwhile the host can do useful work, then wait for both streams to finish
processdataonCPU();
cudaDeviceSynchronize();
3.5+ and a 64-bit application running under Linux [119]. An MPS daemon
can be started as follows:
nvidia-cuda-mps-control -d
The following experiment tests scenarios without MPS and with MPS
for the geometric SPMD application implemented with MPI and CUDA and
shown in Section 5.2.4. In every scenario the code was run with various num-
bers of MPI processes. For every configuration, the execution time of the best out
of three runs is presented in Table 6.1. The application parameters used were 384
384 960 10 2. Tests were performed on a workstation with 2 x Intel Xeon
E5-2620v4 CPUs, 2 x NVIDIA GTX 1070 GPUs and 128 GB RAM. Two GPUs were
used.
1. Small data packets would allow good balancing of data among comput-
ing nodes/processors. This might be especially useful if there are pro-
cessors of various computing speeds. On the other hand, too many small
data packets would result in considerable overhead for communication.
2. Large data packets might result in poor load balancing among comput-
ing nodes/processors. Also, in case of really large data packets, compu-
tations might start with a delay.
with 2 x Intel Xeon E5-2620v4 and 128 GB RAM. For each configuration, the
best out of 3 runs are shown. Testbed results are shown for 4 and 16 processes
of an application.
FIGURE: Execution time [s] versus the number of data packets (from 10 to 1x10^6), shown for 4 and 16 processes of the application.
It should be noted that proper distribution of such data among e.g. nodes in
a cluster requires efficient load balancing, as discussed in Section 3.1.4.
Such approaches are also discussed in [23] in the context of OpenMP ap-
plications run on an Intel Xeon Phi. In the case of a hybrid MPI+OpenMP
application discussed here, a master thread of a process would be involved in
communication with other processes. Furthermore, processing of subdomain
data assigned to a process must be partitioned into several threads within the
process, in this case using OpenMP.
Two functionally identical implementations are as follows:
1. The main loop is executed by the master thread with some parts paral-
lelized using OpenMP constructs. Specifically:
(a) Exchange of data by the master thread (might be needed first if
each process initializes its own domain independently) without any
special constructs.
(b) Parallel update of subdomain cells – parallelization performed using
#pragma omp parallel for.
(c) Substitution of pointers for source domain and target domain per-
formed by the master thread without any special constructs.
2. Entering a parallel region outside of the loop. Then each thread executes
loop iterations independently, which results in the need for syn-
chronization (a minimal sketch is shown after this list). Specifically, steps within a loop iteration include:
(a) Exchange of data by the master thread (might be needed first
if each process initializes its own domain independently) in code
within #pragma omp master.
(b) Synchronization using #pragma omp barrier.
(c) Parallel update of subdomain cells – parallelization performed using
#pragma omp for.
(d) Substitution of pointers for source domain and target domain per-
formed by the master thread in code within #pragma omp master.
TABLE 6.2 Execution times [s] for two versions of MPI+OpenMP SPMD code

version                                        | minimum execution time [s] | average execution time out of 10 runs [s]
#pragma omp parallel for inside main loop      | 161.362                    | 163.750
#pragma omp parallel outside of the main loop  | 160.349                    | 163.010
1. Input data packets are divided into the number of groups equal to the
number of thread groups.
2. Instead of one critical section, a number of critical sections equal to
the number of thread groups is used, one per group. There are then fewer
threads per critical section, which potentially reduces the time a thread needs
to wait to fetch a new data packet (a lock-based sketch of this idea follows this list).
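One way to realize "one critical section per thread group" is an array of OpenMP locks, one per group, so that a thread only contends with the other threads of its own group. The sketch below is an illustration of that idea rather than code from the book; NGROUPS, fetch_packet_from_group() and process_packet() are invented placeholders.

#include <omp.h>

#define NGROUPS 4

void *fetch_packet_from_group(int group);   /* placeholder: returns NULL when the group is exhausted */
void process_packet(void *packet);          /* placeholder: computations on one data packet */

omp_lock_t group_lock[NGROUPS];

void process_all_packets(void)
{
    for (int g = 0; g < NGROUPS; g++) omp_init_lock(&group_lock[g]);

    #pragma omp parallel
    {
        int group = omp_get_thread_num() % NGROUPS;  /* assign this thread to a group */
        void *packet;
        do {
            omp_set_lock(&group_lock[group]);        /* contend only within the group */
            packet = fetch_packet_from_group(group);
            omp_unset_lock(&group_lock[group]);
            if (packet != NULL) process_packet(packet);
        } while (packet != NULL);
    }

    for (int g = 0; g < NGROUPS; g++) omp_destroy_lock(&group_lock[g]);
}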
float sinf(float a)
double sin(double a)
long double sinl(long double a)
The resulting code may offer better performance and, in the case of smaller data types,
open more potential for vectorization, e.g., on Intel Xeon Phi.
Similarly, the precision of floating point operations can be controlled with
various compiler switches, as discussed in [34]. As a result, various trade-offs
between execution time and accuracy can be obtained.
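As a hedged illustration (the specific switches depend on the compiler and are discussed in [34], not reproduced here), with GCC one could compare a strictly IEEE-compliant build against one with relaxed floating point semantics:

	gcc -O2 -o app app.c               # default, IEEE-compliant floating point
	gcc -O2 -ffast-math -o app app.c   # relaxed semantics, usually faster but potentially less accurate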
In certain applications, smaller data types can be used effectively, such as
16-bit fixed point instead of 32-bit floating point for training deep networks [79].
that data cells that are fetched are located next to each other in memory. Let
us call this configuration A.
If, for the sake of a test, the order of the loops is reversed, i.e., the indices are
traversed in the x, y and z dimensions, this is no longer the case. Let us call this
configuration B.
Table 6.3 presents a comparison of the execution times of the two configurations
for selected numbers of processes of an MPI application, run on a workstation
with 2 x Intel Xeon E5-2620v4 CPUs and 128 GB RAM. For each configuration, the
best out of 3 runs is shown. The codes can be compiled as follows:
Tests were run using 16 processes as follows, for a domain of size 600x600x600:
Often, tiling or blocking data [110] allows reuse of data that has been
loaded into cache or registers. For instance, if a loop stores data in a large
array (one that does not fit into cache) such that in each iteration a successive
element is stored, and this data is then used for subsequent updates of other
data, such a large loop can be tiled into a few passes, each of which reuses data
from the cache.
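A minimal sketch of this idea in C is given below; the array names, the tile size and the update formulas are illustrative assumptions, not taken from the book:

#define TILE 4096                     /* chosen so that one tile fits comfortably in cache */

void tiled_update(double *a, const double *b, long n)
{
    for (long t = 0; t < n; t += TILE) {
        long end = (t + TILE < n) ? (t + TILE) : n;
        /* pass 1: store results for this tile */
        for (long i = t; i < end; i++)
            a[i] = 2.0 * b[i];
        /* pass 2: reuse a[i] for further updates while the tile is still in cache */
        for (long i = t; i < end; i++)
            a[i] = a[i] * a[i] + b[i];
    }
}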
6.7 CHECKPOINTING
Checkpointing is a mechanism that allows saving the state of an application
and resuming processing at a later time. Saving the state of a parallel application
is not easy because it requires saving a globally consistent state of many pro-
cesses and/or many threads, possibly running on various computing devices,
either within a node or on a cluster, with consideration of communication and
synchronization.
Checkpointing might be useful for maintenance of the hardware. In some
cases, it may also allow moving the state of an application or individual ap-
plication processes to other locations. This effectively implements migration.
The latter might be useful for minimization of application execution times
Examples of such systems and use cases applicable in the context of this
book include:
threads, there is a risk of false sharing. There are a few ways of dealing
with false sharing, including:
In the case of the Intel Xeon Phi x200, efficient execution of parallel codes will
also involve decisions on the following (article [74] discusses these modes in
more detail along with potential use cases):
1. memory mode that defines how the on-package MCDRAM is used:
• Flat – both DRAM and MCDRAM are available for memory al-
location (as NUMA nodes), the latter preferably for bandwidth
critical data.
• Cache – in this case MCDRAM acts as an L3 cache.
• Hybrid – a part of MCDRAM is configured as cache and a part as
memory that can be allocated by an application.
• MCDRAM – only MCDRAM is available.
2. cluster mode that defines how requests to memory are served through
memory controllers:
• All2All (default in case of an irregular DIMM configuration) – a core, a
tag directory (to which a memory request is routed) and a memory
channel (to which a request is sent in case the data is not in cache) can
be located in various parts of the chip.
• Quadrant (in case of a symmetric DIMM configuration) – a tag di-
rectory and a memory channel are located in the same region.
• Sub-NUMA clustering – in this mode regions (quarter – SNC4,
half – SNC2) will be visible as separate NUMA nodes. A core, a
tag directory and a memory channel are located in the same region.
6.9.3 Clusters
For clusters, which consist of many nodes, the following techniques should be
employed during the development of parallel programs:
export MIC_ENV_PREFIX=PHI
export PHI_KMP_AFFINITY=balanced
export PHI_KMP_PLACE_THREADS=60c,3t
export PHI_OMP_NUM_THREADS=180
CHAPTER 2
Determining an Exaflop
Strategy
CONTENTS
2.1 Foreword by John Levesque   7
2.2 Introduction   8
2.3 Looking at the Application   9
2.4 Degree of Hybridization Required   13
2.5 Decomposition and I/O   15
2.6 Parallel and Vector Lengths   15
2.7 Productivity and Performance Portability   15
2.8 Conclusion   19
the computation. If things looked good, we would go over to the Officers club
and have a couple beers and shoot pool. The operator would give us a call if
anything crashed the system.
There was another Captain “X” (who will remain anonymous) who was
looking at the material properties we would use in our runs. There are two
funny stories about Captain X. First, this was in the days of card readers, and
Captain X would take two trays of cards to the input room when submitting
his job. My operator friend, who was rewarded yearly with a half gallon of
Seagram’s VO at Christmas time, asked if I could talk to Captain X. I told the
operator to call me the next time Captain X submitted his job, and I would
come down and copy the data on the cards to a tape that could be accessed,
instead of the two trays of cards. The next time he submitted the job, X and
I would be watching from the window looking into the computer room. I told
the operator to drop the cards while we were looking so X would understand
the hazards of using trays of cards. Well, it worked – X got the shock of his
life and wouldn’t speak to me for some time, but he did start using tapes.
The other story is that X made a little mistake. The sites were numbered
with Roman numerals, and X confused IV with VI. He ended up giving us the
wrong data for a large number of runs.
Some Captains, like John Thompson, were brilliant, while others, like X,
were less so. John and I had an opportunity to demonstrate the accuracy
of our computations. A high explosive test was scheduled for observing the
ground motions from the test. We had to postmark our results prior to the
test. We worked long hours for several weeks, trying our best to predict the
outcome of the test. The day before the test we mailed in our results, and I
finally was able to go home for dinner with the family. The next day the test
was cancelled and the test was never conducted. So much for experimental
justification of your results.
2.2 INTRODUCTION
It is very important to understand what architectures you expect to see moving
forward. There are several trends in the high performance computing industry
that point to a very different system architecture than we have seen in the
past 30 years. Due to the limitations placed on power utilization and the new
memory designs, we will be seeing nodes that look a lot like IBM’s Blue Gene
on a chip and several of these chips on a node sharing memory. Like IBM’s
Blue Gene, the amount of memory will be much less than what is desirable
(primarily due to cost and energy consumption).
Looking at the system as a collection of nodes communicating across an
interconnect, there will remain the need to have the application communicate
effectively – the same as the last 20 years. The real challenges will be on
the node. How will the application be able to take advantage of thousands of
degrees of parallelism on the node? Some of the parallelism will be in the form
of a MIMD collection of processors, and the rest will be more powerful SIMD
instructions that the processor can employ to generate more flops per clock
cycle. On systems like the NVIDIA GPU, many more active threads will be
required to amortize the latency to memory. While some threads are waiting
for operands to reach the registers, other threads can utilize the functional
units. Given the lack of registers and cache on the Nvidia GPU and Intel
KNL, latency to memory is more critical since the reuse of operands within
the cache is less likely. Xeon systems do not have as much of an issue as
they have more cache. The recent Xeon and KNL systems also have hyper-
threads or hardware threads – also called simultaneous multithreading (SMT).
These threads share the processing unit and the hardware can context switch
between the hyper-threads in a single clock cycle. Hyper-threads are very
useful for hiding latency associated with the fetching of operands.
While the NVIDIA GPU uses less than 20 MIMD processors, one wants
thousands of threads to be scheduled to utilize those processors. More than
ever before, the application must take advantage of the MIMD parallel units,
not only with MPI tasks on the node, but also with shared memory threads.
On the NVIDIA GPU there should be thousands of shared memory threads,
and on the Xeon Phi there should be hundreds of threads.
The SIMD length also becomes an issue since an order of magnitude of
performance can be lost if the application cannot utilize the SIMD (or vector
if you wish) instructions. On the NVIDIA GPU, the SIMD length is 32 eight-
byte words, and on the Xeon Phi it is 8 eight-byte words. Even on the new
generations of Xeons (starting with Skylake), it is 8 eight-byte words.
Thus, there are three important dimensions of the parallelism: (1) message
passing between the nodes, (2) message passing and threading within the node,
and (3) vectorization to utilize the SIMD instructions.
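As a minimal sketch of how these three levels appear in code (the daxpy-style loop and all names below are illustrative assumptions, not taken from the book):

#include <mpi.h>
#include <omp.h>

/* level 3 inside level 2: OpenMP threads across cores, SIMD lanes within each thread */
void scale_add(double *y, const double *x, double a, long n)
{
    #pragma omp parallel for simd
    for (long i = 0; i < n; i++)
        y[i] += a * x[i];
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);        /* level 1: message passing between (and within) nodes */
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* ... each rank calls scale_add() on its own slice of the global data ... */
    MPI_Finalize();
    return 0;
}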
Going forward, moving data is much more expensive than doing computation.
Application developers should strive to avoid data motion as much as possible.
Part of minimizing data movement is attempting to utilize the caches associated
with the processor as much as possible.
Designing the application to minimize data motion is the most important issue
when moving to the next generation of supercomputers. For this reason we
have dedicated a large portion of the book to this topic. Chapters 3 and
6, as well as Appendix A will discuss the cache architectures of the leading
supercomputers in detail and how best to utilize them.
and the threading should be at a high level. Both have relatively high overhead,
so granularity must be large enough to overcome the overhead of the parallel
region and benefit from the parallelization. The low level parallel structures
could take advantage of the SIMD instructions and hyper-threading.
If the application works on a 3D grid there may be several ways of dividing
the grid across the distributed nodes on the inter-connect. Recently, applica-
tion developers have seen the advantage of dividing the grid into cubes rather
than planes to increase the locality within the nodes and reduce the surface
area of the portion of the grid that resides on the node. Communication off
the node is directly proportional to the surface area of the domain contained
on the node. Within the node, the subdomain may be subdivided into tiles,
once again to increase the locality and attempt to utilize the caches as well
as possible.
A very well designed major application is the Weather Research and Fore-
casting (WRF) Model, a next-generation mesoscale numerical weather predic-
tion system designed for both atmospheric research and operational forecast-
ing needs. Recent modifications for the optimization of the WRF application
are covered in a presentation by John Michalakes given at Los Alamos Scien-
tific Laboratories in June, 2016 [22]. These techniques were further refined for
running on KNL, which will be covered in Chapter 8. The salient points are:
significantly different strategy for the threaded domains. While the Xeon and
KNL can benefit from having multiple MPI tasks running within the node,
within the GPU, all of the threads must be shared memory threads as MPI
cannot be employed within the GPU.
The decomposition is also coupled with the data layout utilized in the
program. While the MPI domain is contained within the MPI task’s memory
space, the application developer has some flexibility in organizing the data
within the MPI task. Once again, irregular, unstructured grids tend to intro-
duce unavoidable indirect addressing. Indirect addressing is extremely difficult
and inefficient on today’s architectures. Not only does operand fetching require
fetching the address prior to the operand, but cache utilization can also be destroyed
by randomly accessing a large amount of memory with the indirect address-
ing. The previously discussed tile structure can be helpful, if and only if the
memory is allocated so that the indirect addresses are accessing data within
the caches. If the data is stored without consideration of the tile structure,
then cache thrashing can result.
Without a doubt, the most important aspect of devising a strategy for
moving an existing application to these new architectures is to design a mem-
ory layout that supplies addressing flexibility without destroying the locality
required to effectively utilize the cache architecture.
FIGURE: Performance as a function of the number of OpenMP Threads per MPI Task (1 to 64).
The two primary reasons for the superior performance of MPI on these
systems are the locality forced by MPI and the fact that MPI allows the tasks
to run asynchronously, which allows for better utilization of available memory
bandwidth. When MPI is run across all the cores on a node, the MPI task
is restricted to using the closest memory and cache structure to its cores. On
the other hand, threading across cores allows the threads to access memory
that may be further away, and multiple threads running across multiple cores
have a chance to interfere with each other’s caches. On KNL, running MPI
across all the cores and threading for employing the hyper-threads seems to be
a good starting point in a hybrid MPI+OpenMP approach. The performance
of a hybrid application is directly proportional to the quality of the thread-
ing. In fact, as seen in the OpenMP chapter, the SPMD threading approach,
which mimics the operation of MPI tasks, performs very well. OpenMP has
one tremendous advantage over MPI: it can redistribute work within a group
of threads more efficiently than MPI can, since OpenMP does not have to
move the data. There is and will always be a place for well-written OpenMP
threading. Of course, threading is a necessity on the GPU; one cannot run
separate MPI tasks on each of the symmetric processors within the GPU.
Niklaus Wirth
Chief Designer of Pascal, 1984 Turing Award Winner
“C makes it easy to shoot yourself in the foot; C++ makes it harder, but
when you do it blows your whole leg off.”
Bjarne Stroustrup
Chief Designer of C++
The productivity argument is that the cost of talented labor is greater than
the cost of the high performance computing system being utilized, and it is
FIGURE: Speedup (0x to 3x) on XE6 (Nov. 2011), XK7 (GPU) (Nov. 2012), XC30 (Nov. 2012), and XC30 (GPU) (Nov. 2013).
This work does show that a well-planned design can benefit from C++ high-
level abstraction. However, there has to be significant thought put into the
performance of the generated code.
FIGURE: Energy (kWh / Ensemble Member) for the Original versus Optimized code on XE6 (Nov. 2011), XK7 (GPU) (Nov. 2012), XC30 (Nov. 2012), and XC30 (GPU) (Nov. 2013); the labeled improvement factors range from 1.41x to 6.89x.
committee really does not care about how difficult it might be to optimize
the language extensions and that their principal goal is to make programmers
more productive. When users see an interesting new feature in the language,
they assume that the feature will be efficient; after all, why would the language
committee put the feature in the language if it wouldn’t run efficiently on the
target systems?
At some point this trend to productivity at the expense of performance has
to stop. Most, if not all, of the large applications that have taken the productiv-
ity lane have implemented MPI and messaging outside of the abstraction, and
they have obtained an increase in performance from the parallelism across
nodes. Increased parallelism must be obtained on the node in the form of
threading and/or vectorization, with special attention paid to minimizing the
movement of data within the memory hierarchy of the node. At some point,
application developers have to put in extra work to ensure that data motion
on the node is minimized and that threading and vectorization are being well
utilized.
2.8 CONCLUSION
The primary strategy for designing a well-performing application for these
systems is to first get the memory decomposition correct. We must have good
locality to achieve decent performance on the target system. This memory
layout imposes a threading decomposition on the node, which could also be
employed for the MPI decomposition on the node. Ideally, a legacy code may
already have found a good decomposition for MPI usage across all the nodes.
In review, we would like the vector dimension to be contiguous, and we would
like the threads to operate on a working set that fits into the cache structure
for a single core. Then we want to ensure that the tasks do not interfere with
each other, by either doing OpenMP right or by using MPI.
2.9 EXERCISES
2.1 Compare and contrast weak and strong scaling.
2.2 When choosing a parallel decomposition which equalizes computational
load, what other aspect of the decomposition is of critical importance?
2.3 What is a good decomposition strategy to use with multidimensional
(e.g., 3D) problems to minimize communication between MPI domains?
2.4 Construct a case study: Consider an application and a target system
with respect to the amount of parallelism available (the critical issue of
data motion will be covered in more detail in later chapters).
2.5 Consider the execution of HPL on the KNL system and assume with
full optimization it sustains 80% of peak performance on the system:
Chapter 1
Contemporary High Performance Computing
Jeffrey S. Vetter
Oak Ridge National Laboratory and Georgia Institute of Technology
TABLE 1.1: Significant systems in HPC.

System        | Type                | Organization                                                           | Location              | Country
Blacklight    | SGI UV              | Pittsburgh Supercomputing Center                                       | Pittsburgh            | USA
Blue Waters   | Cray XE6, XK6       | National Center for Supercomputing Applications                        | Urbana                | USA
JUGENE        | Blue Gene/P         | Jülich Research Centre, Forschungszentrum Jülich                       | Jülich                | Germany
Gordon        | x86/IB Cluster      | San Diego Supercomputing Center                                        | San Diego             | USA
HA-PACS       | x86/IB/GPU cluster  | University of Tsukuba                                                  | Tsukuba               | Japan
Keeneland     | x86/IB/GPU cluster  | Georgia Institute of Technology                                        | Atlanta               | USA
Kraken        | Cray XT5            | National Institute for Computational Science                           | Knoxville             | USA
Lomonosov     | T-Platforms         | Moscow State University                                                | Moscow                | Russia
Mole-8.5      | x86/IB/GPU cluster  | Chinese Academy of Sciences, Institute of Process Engineering          | Beijing               | China
Monte Rosa    | Cray XE6            | Swiss National Supercomputing Centre                                   | Lugano                | Switzerland
Numerous      | Cray XE6            | DOD High Performance Modernization Project                             | Numerous              | USA
Pleiades      | x86/IB Cluster      | NASA Ames                                                              | Mountain View         | USA
Roadrunner    | x86/Cell/IB cluster | Los Alamos National Laboratory                                         | Los Alamos            | USA
Sequoia, Mira | Blue Gene/Q         | Lawrence Livermore National Laboratory and Argonne National Laboratory | Livermore and Argonne | USA
TERA 100      | x86/IB Cluster      | CEA                                                                    | Arpajon               | France
Tianhe-1A     | x86/IB/GPU cluster  | National University of Defense Technology                              | Tianjin               | China
Titan         | Cray XK6            | Oak Ridge National Laboratory                                          | Oak Ridge             | USA
Tsubame 2.0   | x86/IB/GPU cluster  | Tokyo Institute of Technology                                          | Tokyo                 | Japan
8. Data center/facility
1.3 Performance
The most prominent HPC benchmark is the TOP500 Linpack benchmark. TOP500 has
qualities that make it valuable and successful: easily scaled problem size, straightforward
validation, an open source implementation, almost 20 years of historical data, a large au-
dience familiar with the software, and well managed guidelines and submission procedures.
As shown in Table 1.2, the TOP500 prize for the #1 system has been awarded since 1993.
In that time, performance for this prize has grown from 124 GFlops to 17,590,000 GFlops.
This increase is 5 orders of magnitude!
Although TOP500 Linpack is a formidable and long-lived benchmark, it does not fully
capture the spectrum of applications across HPC as is often pointed out. TOP500 Lin-
pack aside, the HPC community has created many metrics and benchmarks for tracking
the success of different HPC solutions. These alternatives include the Gordon Bell Prize,
the HPC Challenge benchmark (cf. Ch. 2.1), the NAS Parallel Benchmarks, the Green500
benchmark (cf. Ch. 3.1), the SHOC benchmarks (cf. Ch. 7.8), along with a large number of
procurement benchmarks from DoE, DoD, NASA, and many other organizations.
TABLE 1.2: Gordon Bell Prize winners for sustained performance compared with the #1 system on the TOP500 ranking since their inception.

Year | Type | Application           | System            | Cores   | Cores Increase (Log10) | GB Prize (Gflop/s) | GB Prize Increase (Log10) | TOP500 #1 (Gflop/s) | TOP500 #1 Increase (Log10)
1987 | PDE  | Structures            | N-CUBE            | 1,024   |       | 0.45      |      |            |
1988 | PDE  | Structures            | Cray Y-MP         | 8       |       | 1         |      |            |
1989 | PDE  | Seismic               | CM-2              | 2,048   |       | 5.6       |      |            |
1990 | PDE  | Seismic               | CM-2              | 2,048   |       | 14        |      |            |
1991 | NO PRIZE AWARDED
1992 | NB   | Gravitation           | Delta             | 512     |       | 5.4       |      |            |
1993 | MC   | Boltzmann             | CM-5              | 1,024   | -     | 60        | -    | 124        | -
1994 | IE   | Structures            | Paragon           | 1,904   | 0.27  | 143       | 0.38 | 170        | 0.14
1995 | MC   | QCD                   | NWT               | 128     | -0.90 | 179       | 0.47 | 170        | 0.14
1996 | PDE  | CFD                   | NWT               | 160     | -0.81 | 111       | 0.27 | 368        | 0.47
1997 | NB   | Gravitation           | ASCI Red          | 4,096   | 0.60  | 170       | 0.45 | 1,338      | 1.03
1998 | DFT  | Magnetism             | T3E-1200          | 1,536   | 0.18  | 1,020     | 1.23 | 1,338      | 1.03
1999 | PDE  | CFD                   | ASCI Blue Pacific | 5,832   | 0.76  | 627       | 1.02 | 2,379      | 1.28
2000 | NB   | Gravitation           | Grape-6           | 96      | -1.03 | 1,349     | 1.35 | 4,938      | 1.60
2001 | NB   | Gravitation           | Grape-6           | 1,024   | 0.00  | 11,550    | 2.28 | 7,226      | 1.77
2002 | PDE  | Climate               | Earth Simulator   | 5,120   | 0.70  | 26,500    | 2.65 | 35,860     | 2.46
2003 | PDE  | Seismic               | Earth Simulator   | 1,944   | 0.28  | 5,000     | 1.92 | 35,860     | 2.46
2004 | PDE  | CFD                   | Earth Simulator   | 4,096   | 0.60  | 15,200    | 2.40 | 70,720     | 2.76
2005 | MD   | Solidification        | BG/L              | 131,072 | 2.11  | 101,700   | 3.23 | 280,600    | 3.35
2006 | DFT  | Electronic structures | BG/L              | 131,072 | 2.11  | 207,000   | 3.54 | 280,600    | 3.35
2007 | MD   | Kelvin-Helmholtz      | BG/L              | 131,072 | 2.11  | 115,000   | 3.28 | 478,200    | 3.59
2008 | DFT  | Crystal structures    | Jaguar/XT-5       | 150,000 | 2.17  | 1,352,000 | 4.35 | 1,105,000  | 3.95
2009 | DFT  | Nanoscale systems     | Jaguar/XT-5       | 147,464 | 2.16  | 1,030,000 | 4.23 | 1,759,000  | 4.15
2010 | FMM  | Blood flow            | Jaguar/XT-5       | 196,608 | 2.28  | 780,000   | 4.11 | 2,566,000  | 4.32
2011 | RSDFT| Nanowires             | K/Fujitsu         | 442,368 | 2.64  | 3,080,000 | 4.71 | 10,510,000 | 4.93
2012 | NB   | Astrophysics          | K/Fujitsu         | 663,552 | 2.81  | 4,450,000 | 4.87 | 17,590,000 | 5.15
Note: For a specific year, the system occupying the TOP500 #1 rank may be different from the system listed as the Gordon Bell Prize winner.
Source: This table was compiled from a number of sources including ACM and IEEE documents and including conversations with the following scientists: Jack
Dongarra, David Keyes, Alan Karp, John Gustafson, and Bill Gropp.
1.3.3 Green500
In 2007, the Green500 benchmark and list (cf. Ch. 3.1) were created in order to recog-
nize the growing importance of energy efficiency in HPC. The list requires submitters to
submit both the performance of their system on the TOP500 HPL benchmark and the em-
pirically measured power consumption during the benchmark test. Using this information,
Green500 calculates and ranks each system by their megaFLOPS/Watt metric. All recent
HPC reports predict that energy efficiency will continue to drive the design of HPC systems
in the foreseeable future [KBB+ 08, FPT+ 07, HTD11], so the Green500 list will allow the
community to continue to track progress on this important topic.
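A minimal sketch of that ranking metric, using made-up submission numbers rather than real Green500 entries:

# Each submission pairs measured HPL performance with measured power draw.
# The values below are illustrative, not real Green500 data.
submissions = {
    "SystemA": {"rmax_mflops": 1.2e9, "power_watts": 4.0e6},
    "SystemB": {"rmax_mflops": 8.0e8, "power_watts": 1.9e6},
}

# Green500 ranks by energy efficiency (MFLOPS per watt); higher is better.
ranked = sorted(submissions.items(),
                key=lambda kv: kv[1]["rmax_mflops"] / kv[1]["power_watts"],
                reverse=True)

for name, sub in ranked:
    print(f"{name}: {sub['rmax_mflops'] / sub['power_watts']:.0f} MFLOPS/W")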
1.3.4 SHOC
Most recently, heterogeneous systems have become a viable commodity option for pro-
viding high performance in this new era of limited power and facility budgets. During this
time, several new programming models (e.g., CUDA, OpenCL, OpenACC) and architectural
features, such as accelerators attached via PCIe, have emerged that have made it difficult
to use existing benchmarks effectively. The Scalable Heterogeneous Computing benchmark
suite (cf. Ch. 7.8) was created to facilitate benchmarking of scalable heterogeneous clusters
for computational and data-intensive computing. In contrast to most existing GPU and con-
sumer benchmarks, SHOC focuses on scalability so that it can run on 1 or 1000s of nodes,
and it focuses on scientific kernels prioritized by their importance in existing applications.
1.4 Trends
Looking at the contributions in this book and at industry more broadly, it is clear
that dominant trends have emerged over the past 15 years that have directly impacted
contemporary HPC. These trends span hardware, software, and business models.
For example, Linux was used only as a research operating system in HPC in the 1990s,
while it is now the operating system running on nearly all HPC systems. In another example,
in the late 1990s, MPI (Message Passing Interface) was just emerging as a new de facto
standard; it is now ubiquitous. Meanwhile, other trends, including multicore processors
and graphics processors, were scarcely imagined outside of a few research
communities. Finally, and perhaps most important of all, open-source software has grown to be
a very strong component of HPC, even resulting in international planning exercises for the
path toward Exascale [KBB+ 08, FPT+ 07, HTD11]. In the following sections, we examine
these trends in more detail.
1.4.1 Architectures
Recent architectures for HPC can be categorized into a few classes. First, commodity-based
clusters dominate the TOP500 list. These clusters typically have an x86 commodity
processor from Intel or AMD, and a commodity interconnect, which today is InfiniBand.
These clusters offer significant capability at a very competitive price because they are
high-volume products for vendors. Standard HPC software stacks, largely open-source,
make these clusters easy to install, run, and maintain.
Second, GPU-accelerated commodity clusters have quickly emerged over the past
three years as viable solutions for HPC applications [OLG+ 05b]. Two important aspects of
these systems often go understated. First, because the GPU leverages multiple markets, such
as gaming and professional graphics, these GPGPU architectures are commodity solutions.
Second, the very quick adoption of CUDA and OpenCL for programming these architectures
lowered the switching costs for users porting their applications. Recently, the move toward
directive-based compilation, with tools like PGI Accelerate, CAPS HMPP, and OpenACC,
demonstrates further support and interest in easing this transition.
Third, customized architectures represent a significant fraction of the top systems. Take,
for example, the K Computer [ASS09] and the Blue Gene/Q systems [bgp]. These systems
have customized logic for both their compute nodes and interconnection networks, and they
have demonstrated excellent scalability, performance, and energy efficiency.
Finally, even more specialized systems, such as D. E. Shaw's Anton [SDD+ 07], have been
designed; they show excellent performance on specialized problems like protein folding, but
are so inflexible that they cannot run any of the aforementioned benchmarks, such as
TOP500 or HPCC.
1.4.2 Software
Although HPC systems share many hardware components with servers in enterprise
and data centers, the HPC software stack is dramatically different from an enterprise or
cloud software stack and is unique to HPC. Generally speaking, an HPC software stack has
multiple levels: system software, development environments, system management software,
and scientific data management and visualization systems. Nearest to the hardware, system
software typically includes operating systems, runtime systems, and low level I/O software,
like filesystems. Next, the development environment is a broad area that facilitates application
design and development. In our framework, it includes programming models, compilers,
scientific frameworks and libraries, and correctness and performance tools. Then, system
management software coordinates, schedules, and monitors the system and the applications
running on that system. Finally, scientific data management and visualization software
provides users with domain specific tools for generating, managing, and exploring data for
their science. This data may include empirically measured data from sensors in the real
world that is used to calibrate and validate simulation models, or output from simulations
per se. As Table 1.3 shows, the systems described in this book have a tremendous amount of
common software, even though some of the systems are very diverse in terms of hardware.
Moreover, a considerable amount of this software is open-source, and is funded by a wide
array of sponsors.
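One way to picture the four levels just described is as a simple mapping from layer to representative components; the entries below are illustrative examples chosen for this sketch, not the contents of Table 1.3:

# Illustrative sketch of the HPC software stack levels described above.
hpc_software_stack = {
    "system software": ["operating system", "runtime system", "parallel filesystem"],
    "development environment": ["programming models", "compilers",
                                "scientific libraries", "correctness and performance tools"],
    "system management": ["job scheduler", "system monitor"],
    "data management and visualization": ["I/O and data libraries", "visualization tools"],
}

for layer, components in hpc_software_stack.items():
    print(f"{layer}: {', '.join(components)}")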
Over the past 15 years, HPC software has had to adapt and respond to several challenges.
First, concurrency in applications and systems has grown by over three orders of magnitude,
and the primary programming model, MPI, has had to grow and change to support this scale.
Second, the increase in concurrency has driven down per-core memory and I/O capacity, as
well as per-core memory, I/O, and interconnect bandwidth. Third, in the last five
years, heterogeneity and architectural diversity have placed a new emphasis on application
and software portability.
EC2, and internal corporate clouds continue to grow dramatically. IDC indicates that total
worldwide revenue from public IT cloud services exceeded $21.5 billion in 2010
(http://www.idc.com/prodserv/idc cloud.jsp), and predicts that it will reach $72.9 billion in
2015. With this tremendous growth rate, a compound annual growth rate (CAGR) of 27.6%,
Clouds and Grids will most likely influence the HPC marketplace, even if indirectly, so we
include three chapters on Cloud and Grid systems being used and tested for scientific
computing markets. These chapters highlight both the strengths and weaknesses of existing
cloud and grid systems.
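The quoted growth rate is consistent with the two IDC figures; a quick check:

start_revenue_billion = 21.5   # public IT cloud services revenue in 2010 (IDC)
cagr = 0.276                   # 27.6% compound annual growth rate
years = 5                      # 2010 -> 2015

projected_2015 = start_revenue_billion * (1 + cagr) ** years
print(f"Projected 2015 revenue: ${projected_2015:.1f} billion")  # about $72.7B, close to IDC's $72.9B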
Chapter 1
Introduction to Computational Modeling
TABLE 1.1 Timeline of Advances in Computer Power and Scientific Modeling (Part 1)
Example Hardware | Max. Speed | Date | Weather and Climate Modeling
ENIAC | 400 Flops | 1945 |
 | | 1950 | First automatic weather forecasts
UNIVAC | | 1951 |
IBM 704 | 12 KFLOP | 1956 |
 | | 1959 | Ed Lorenz discovers the chaotic behavior of meteorological processes
IBM 7030 Stretch; UNIVAC LARC | 500-500 KFLOP | ~1960 |
 | | 1965 | Global climate modeling underway
CDC 6600 | 1 MFLOP | 1966 |
CDC 7600 | 10 MFLOP | 1975 |
CRAY 1 | 100 MFLOP | 1976 |
CRAY X-MP | 400 MFLOP | |
 | | 1979 | Jule Charney report to NAS
CRAY Y-MP | 2.67 GFLOP | |
 | | 1988 | Intergovernmental Panel on Climate Change
 | | 1992 | UNFCCC in Rio
IBM SP2 | 10 GFLOP | 1994 |
ASCI Red | 2.15 TFLOP | 1995 | Coupled Model Intercomparison Project (CMIP)
 | | 2005 | Earth system models
Blue Waters | 13.34 PFLOP | 2014 |
Sources: Bell, G., Supercomputers: The amazing race (a history of supercomputing, 1960–2020), 2015, http://research.microsoft.com/en-us/um/people/gbell/MSR-TR-2015-2_Supercomputers-The_Amazing_Race_Bell.pdf (accessed December 15, 2016); Bell, T., Supercomputer timeline, 2016, https://mason.gmu.edu/~tbell5/page2.html (accessed December 15, 2016); Esterbrook, S., Timeline of climate modeling, 2015, https://prezi.com/pakaaiek3nol/timeline-of-climate-modeling/ (accessed December 15, 2016).
TABLE 1.2 Timeline of Advances in Computer Power and Scientific Modeling (Part 2)
Date | Theoretical Chemistry | Aeronautics and Structures | Software and Algorithms
1950 | Electronic wave functions | |
1951 | Molecular orbital theory (Roothaan) | |
1953 | One of the first molecular simulations (Metropolis et al.) | |
1954 | | | Vector processing directives
1956 | First calculation of multiple electronic states of a molecule on EDSAC (Boys) | |
1957 | | | FORTRAN created
1965 | Creation of ab initio molecular modeling (Pople) | |
1966 | | 2D Navier-Stokes simulations; FLO22; transonic flow over a swept wing |
1969 | | | UNIX created
1970 | | 2D Inviscid Flow Models; design of regional jet |
1971 | | Nastran (NASA Structural Analysis) |
1972 | | | C programming language created
1973 | | | Matrix computations and errors (Wilkinson)
1975 | | 3D Inviscid Flow Models; complete airplane solution |
1976 | First calculation of a chemical reaction (Warshel) | DYNA3D, which became LS-DYNA (mid-70s) |
1977 | First molecular dynamics of proteins (Karplus); first calculation of a reaction transition state (Chandler) | Boeing design of 737-500 |
Modeling and simulation has also become a key part of the process of
designing, testing, and producing products and services. Where the building
of physical prototypes or the completion of laboratory experiments
may take weeks or months and cost millions of dollars, industry is instead
creating virtual experiments that can be completed in a short time at
greatly reduced cost. Procter & Gamble uses computer modeling to
improve many of its products. One example is the use of molecular modeling
to test the interactions among surfactants in its cleaning products,
with the goal of producing products that are environmentally friendly and
continue to perform as desired (Council on Competitiveness, 2009).
Automobile manufacturers have substituted modeling for the building
of physical prototypes of their cars to save time and money. Building
physical prototypes, called mules, is expensive, costing approximately
$500,000 per vehicle, with 60 prototypes required before a car goes
into production (Mayne, 2005). The design of the 2005 Toyota
Avalon required no mules at all; computer modeling was used to design and
test the car. Similarly, all of the automobile manufacturers are using modeling
to reduce costs and get new products to market faster (Mayne, 2005).
These examples should illustrate the benefits of using modeling and
simulation as part of the research, development, and design processes for
scientists and engineers. Of course, students new to modeling and simulation
cannot be expected to effectively use complex, large-scale simulation
models on supercomputers at the outset of their modeling efforts. They
must first understand the basic principles for creating, testing, and using
models as well as some of the approaches to approximating physical reality
in computer code. We begin to define those principles in Section 1.3
and continue through subsequent chapters.
One of the most ambitious physical models ever built was a costly 200-acre
model of the Mississippi River Basin used to simulate flooding in the
watershed (U.S. Army Corps of Engineers, 2006). A photo of a portion
of this model is shown in Figure 1.1. It included replicas of urban areas,
the stream bed, the underlying topography, levees, and other physical
characteristics (Fatherree, 2006). Special materials were used to allow flood
simulations to be tested and instrumented.
Through theory and experimentation, scientists and engineers also
developed mathematical models representing aspects of physical behaviors.
These became the basis of computer models by translating the mathematics
into computer codes. Over time, mathematical models that started
as very simplistic representations of complex systems have evolved into
systems of equations that more closely approximate real-world phenomena,
such as the large-scale models discussed earlier in this chapter.
Creating, testing, and applying mathematical models using computation
require an iterative process. The process starts with an initial set of
simplifying assumptions and is followed by testing, alteration, and application
of the model. Those steps are discussed in Section 1.3.1.
(Figure 1.2: diagram of the modeling cycle, showing the steps: choose variables; define relationships; define equations and functions; implement the computer model; verify and refine the model; interpret results; draw conclusions; maintain and refine the model.)
FIGURE 1.3 Partial concept map of model to calculate travel time using Cmap.
FIGURE 1.4 Partial mind map of model to calculate travel time using mind
map maker.
The average speed across a road segment is reduced by parked cars and
traffic control devices, while wider lanes and higher speed limits allow
higher speeds and thus shorter traversal times. The total time for a trip
is obtained by adding the average times for traversing each road segment,
so data on each segment will be needed as input to the model. Simple
versions of such estimates are provided by global positioning system (GPS)
devices and the Internet mapping services available online. Many other
conditions would also impact this system; modeling traffic conditions
is the topic of one of the exercises at the end of the chapter.
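A minimal sketch of this calculation, using made-up road segments: each segment contributes its length divided by its average speed, and the trip time is the sum over segments.

# Each segment's average speed reflects its conditions (parked cars,
# traffic controls, lane width, speed limit). The numbers are illustrative only.
segments = [
    {"name": "residential street", "length_km": 1.2,  "avg_speed_kmh": 25.0},
    {"name": "arterial road",      "length_km": 4.5,  "avg_speed_kmh": 45.0},
    {"name": "highway",            "length_km": 12.0, "avg_speed_kmh": 95.0},
]

total_hours = sum(seg["length_km"] / seg["avg_speed_kmh"] for seg in segments)
print(f"Estimated trip time: {total_hours * 60:.1f} minutes")   # about 16.5 minutes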
Going back to Figure 1.2, one must choose which simplifying assumptions
can be made in a model. This, in turn, leads to a selection of the
data that will be needed, the variables that will drive the model, and the
equations and mathematical functions that will comprise the model.
Once these items have been defined, a computer version of the model
can be created and tested. The results must be verified to ascertain that the
code is working properly. If the model gives unexpected results even though
the code is working properly, the simplifying assumptions may need to be
reexamined and the model reformulated. Thus, one may go through
several iterations until the model provides sufficiently accurate results.
The model can then be validated against available experimental or field data
to provide a quantitative assessment of its accuracy. Finally, the model can
be used to undertake more detailed analysis and the results reported.
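One common quantitative check in the validation step is an error measure between model output and field data; a minimal sketch with invented numbers:

import math

# Invented example data: model predictions vs. field measurements.
predicted = [10.2, 11.8, 13.1, 15.0]
observed  = [10.0, 12.1, 12.7, 15.6]

# Root-mean-square error as one measure of model accuracy.
rmse = math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed))
print(f"RMSE: {rmse:.3f}")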
As time goes on, the model must be maintained and may be improved
by relaxing more of the assumptions and/or improving the input data. It
should be noted that the judgment of whether a model is giving reasonable
There are several different ways to classify models. Models can be deterministic
or probabilistic. Another term for probabilistic is stochastic, meaning
a random process or a process that occurs by chance. A probabilistic
model includes one or more elements that might occur by chance or at random,
while a deterministic model does not. A deterministic model applies a
set of inputs or initial conditions and uses one or more equations to produce
model outputs. The outputs of a deterministic model will be the same for
each execution of the computer code with the same inputs. A probabilistic
model will exhibit random effects that produce different outputs for
each model run.
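A minimal sketch of the distinction, using a hypothetical one-equation model: the deterministic version always returns the same output for the same input, while the probabilistic version adds a random disturbance and differs on every run.

import random

def deterministic_model(x):
    # Same input -> same output on every run.
    return 2.0 * x + 1.0

def probabilistic_model(x):
    # Adds a random (stochastic) disturbance, so repeated runs differ.
    return 2.0 * x + 1.0 + random.gauss(0.0, 0.5)

print([deterministic_model(3.0) for _ in range(3)])            # identical values
print([round(probabilistic_model(3.0), 2) for _ in range(3)])  # values vary run to run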
Models can also be characterized as static or dynamic. A dynamic
model considers the state of a system over time, while a static model does
not. For example, one could have a model of a material such as a steel beam
that considers its ability to bear weight without bending under a set of
standard environmental conditions. This would be a static model of that
system. A dynamic model of the same structure would simulate how the
bearing strength and possible deformation of the beam change over time
under stresses such as high temperatures, vibration, and chemical corrosion.
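A minimal sketch contrasting the two, with invented numbers and a deliberately simplified strength rule: the static version evaluates the beam once under fixed conditions, while the dynamic version steps the same beam through time as sustained stress degrades its capacity.

# Illustrative only: a toy degradation rule, not real structural mechanics.
def static_capacity(base_capacity_kn):
    # Static model: one evaluation under standard conditions.
    return base_capacity_kn

def dynamic_capacity(base_capacity_kn, hours, degradation_per_hour=0.01):
    # Dynamic model: capacity evolves over time under sustained stress or heat.
    capacity = base_capacity_kn
    history = []
    for _ in range(hours):
        capacity *= (1.0 - degradation_per_hour)
        history.append(round(capacity, 1))
    return history

print(static_capacity(500.0))             # single value, e.g. 500.0 kN
print(dynamic_capacity(500.0, hours=5))   # declining capacity hour by hour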
EXERCISES
1. Using a graphics program or one of the free concept-mapping or
mind-mapping tools, create a complete conceptual map of the traffic
model introduced earlier in the chapter. You should include all of
the other factors you can think of that would contribute either to the
increase or decrease in the traffic speed that might occur in a real
situation.
2. Insert another concept mapping example here.
REFERENCES
Bartlett, B. N. 1990. The contributions of J. H. Wilkinson to numerical analysis. In A History of Scientific Computing, ed. S. G. Nash, pp. 17–30. New York: ACM Press.
Bell, G. 2015. Supercomputers: The amazing race (A history of supercomputing, 1960–2020). http://research.microsoft.com/en-us/um/people/gbell/MSR-TR-2015-2_Supercomputers-The_Amazing_Race_Bell.pdf (accessed December 15, 2016).
Bell, T. 2016. Supercomputer timeline. https://mason.gmu.edu/~tbell5/page2.html (accessed December 15, 2016).
Biesiada, J., A. Porollo, and J. Meller. 2012. On setting up and assessing docking simulations for virtual screening. In Rational Drug Design: Methods and Protocols, Methods in Molecular Biology, ed. Yi Zheng, pp. 1–16. New York: Springer Science and Business Media.
Cmap website. http://cmap.ihmc.us/ (accessed February 22, 2016).
Computer History Museum. 2017. Timeline of computer history, software and languages. http://www.computerhistory.org/timeline/software-languages/ (accessed January 2, 2017).
Council on Competitiveness. 2004. First Annual High Performance Computing Users Conference. http://www.compete.org/storage/images/uploads/File/PDF%20Files/2004%20HPC%2004%20Users%20Conference%20Final.pdf.
Council on Competitiveness. 2009. Procter & Gamble's story of suds, soaps, simulations and supercomputers. http://www.compete.org/publications/all/1279 (accessed January 2, 2017).
Dorzolamide. 2016. https://en.wikipedia.org/wiki/Dorzolamide (accessed December 15, 2016).
Esterbrook, S. 2015. Timeline of climate modeling. https://prezi.com/pakaaiek3nol/timeline-of-climate-modeling/ (accessed December 15, 2016).