Optimizing HPC Applications With Intel Cluster Tools
Introduction
Let’s optimize some programs. We have been doing this for years, and we still love doing it.
One day we thought, Why not share this fun with the world? And just a year later, here we are.
Oh, you just need your program to run faster NOW? We understand. Go to Chapter 1
and get quick tuning advice. You can return later to see how the magic works.
Are you a student? Perfect. This book may help you pass that “Software Optimization
101” exam. Talking seriously about programming is a cool party trick, too. Try it.
Are you a professional? Good. You have hit the one-stop-shopping point for Intel’s
proven top-down optimization methodology and Intel Cluster Studio that includes
Message Passing Interface* (MPI), OpenMP, math libraries, compilers, and more.
Or are you just curious? Read on. You will learn how high-performance computing
makes your life safer, your car faster, and your day brighter.
And, by the way: You will find all you need to carry on, including free trial
software, code snippets, checklists, expert advice, fellow readers, and more at
www.apress.com/source-code.
* Here and elsewhere, certain product names may be the property of their respective third parties.
Computing has progressed a lot since those heady days. There is hardly a better
illustration of this than the famous TOP500 list.2 Twice a year, the teams running the
most powerful non-classified computers on earth report their performance. This
data is then collated and published in time for two major annual trade shows: the
International Supercomputing Conference (ISC), typically held in Europe in June; and the
Supercomputing (SC), traditionally held in the United States in November.
Figure 1 shows how certain aspects of this list have changed over time.
Why Optimize?
Optimization is probably the most profitable time investment an engineer can make, as
far as programming is concerned. Indeed, a day spent optimizing a program that takes an
hour to complete may decrease the program turn-around time by half. This means that
after 48 runs, you will recover the time invested in optimization, and then move into
the black.
Optimization is also a measure of software maturity. Donald Knuth famously said,
“Premature optimization is the root of all evil,”5 and he was right in some sense. We will
deal with how far this goes when we get closer to the end of this book. In any case, no one
should start optimizing what has not been proven to work correctly in the first place. And
a correct program is still a very rare and very satisfying piece of art.
Yes, this is not a typo: art. Despite zillions of thick volumes that have been written
and the conferences held on a daily basis, programming is still more art than science.
Likewise, for the process of program optimization. It is somewhat akin to architecture: it
must include flight of fantasy, forensic attention to detail, deep knowledge of underlying
materials, and wide expertise in the prior art. Only this combination—and something
else, something intangible and exciting, something we call “talent”—makes a good
programmer in general and a good optimizer in particular.
Finally, optimization is fun. Some 25 years later, one of us still cherishes the
memories of a day when he made a certain graphical program run 300 times faster than
it used to. A screen update that had been taking half a minute in the morning became
almost instantaneous by midnight. It felt almost like love.
The good news is that once you know what you want to achieve, the methodology is
roughly the same. We will look into those details in Chapter 3. Briefly, you proceed in
top-down fashion from the higher levels of the problem under analysis (platform,
distributed memory, shared memory, microarchitecture) and iterate in a closed-loop manner
until you exhaust optimization opportunities at each of these levels. Keep in mind that
a problem fixed at one level may expose a problem somewhere else, so you may need to
revisit those higher levels once more.
This approach crystallized quite a while ago. An earlier incarnation of it was
formulated by Intel application engineers working in Intel’s application solution centers
in the 1990s.6 Our book builds on that solid foundation, certainly taking some things a tad
further to account for the time passed.
Now, what happens when top-down optimization meets the closed-loop approach?
Well, this is a happy marriage. Every single level of the top-down method can be handled
by the closed-loop approach. Moreover, the top-down method itself can be enclosed
in another, bigger closed loop where every iteration addresses the biggest remaining
problem at any level where it has been detected. This way, you keep your priorities
straight and stay focused.
References
1. “Bomba_(cryptography),” [Online]. Available:
http://en.wikipedia.org/wiki/Bomba_(cryptography).
2. Top500.Org, “TOP500 Supercomputer Sites,” [Online]. Available:
http://www.top500.org/.
3. Top500.Org, “Performance Development TOP500 Supercomputer
Sites,” [Online]. Available: http://www.top500.org/statistics/
perfdevel/.
4. G. E. Moore, “Cramming More Components onto Integrated
Circuits,” Electronics, p. 114–117, 19 April 1965.
5. “Knuth,” [Online]. Available: http://en.wikiquote.org/wiki/
Donald_Knuth.
6. Intel Corporation, “ASC Performance Methodology - Top-Down/
Closed Loop Approach,” 1999. [Online]. Available:
http://smartdata.usbid.com/datasheets/usbid/2001/2001-q1/
asc_methodology.pdf.
7. Intel Corporation, “Intel Cluster Studio XE,” [Online]. Available:
http://software.intel.com/en-us/intel-cluster-studio-xe.
CHAPTER 1
No Time to Read This Book?
We know what it feels like to be under pressure. Try out a few quick and proven optimization
stunts described below. They may provide a good enough performance gain right away.
There are several parameters that can be adjusted with relative ease. Here are the
steps we follow when hard pressed:
• Use Intel MPI Library1 and Intel Composer XE2
• Got more time? Tune Intel MPI:
• Collect built-in statistics data
• Tune Intel MPI process placement and pinning
• Tune OpenMP thread pinning
• Got still more time? Tune Intel Composer XE:
• Analyze optimization and vectorization reports
• Use interprocedural optimization
If the Intel MPI Library is installed on your cluster and your application comes as a
prebuilt binary, you can switch to Intel MPI right away by sourcing its environment and
relaunching the application (here, the HPL executable xhpl on 16 ranks, two per node):
$ source /opt/intel/impi_latest/bin64/mpivars.sh
$ mpirun -np 16 -ppn 2 xhpl
If you use another MPI and have access to the application source code, you can
rebuild your application using Intel MPI compiler scripts:
• Use mpicc (for C), mpicxx (for C++), and mpifc/mpif77/mpif90
(for Fortran) if you target GNU compilers.
• Use mpiicc, mpiicpc, and mpiifort if you target Intel Composer XE.
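For illustration, here is a minimal MPI program in C (our own example, not taken from any benchmark) that you can use to verify the toolchain after switching to the Intel MPI compiler scripts; the file and binary names are arbitrary:

/* hello_mpi.c -- minimal check that the toolchain builds and runs MPI programs.
   Build:  mpiicc -o hello_mpi hello_mpi.c     (Intel Composer XE)
       or  mpicc  -o hello_mpi hello_mpi.c     (GNU compilers)
   Run:    mpirun -np 16 -ppn 2 ./hello_mpi                                   */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    printf("Rank %d of %d runs on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}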
Revisit the compiler flags you used before the switch; you may have to remove some
of them. Make sure that Intel Composer XE is invoked with the flags that give the best
performance for your application (see Table 1-1). More information can be found in the
Intel Composer XE documentation.6
For most applications, the default optimization level of -O2 will suffice. It runs fast
and gives reasonable performance. If you feel adventurous, try -O3. It is more aggressive
but it also increases the compilation time.
To collect the built-in Intel MPI statistics in the IPM format, set the I_MPI_STATS
environment variable and rerun the application:
$ export I_MPI_STATS=ipm
$ mpirun -np 16 xhpl
By default, this will generate a file called stats.ipm. Listing 1-1 shows an example
of the MPI statistics gathered for the well-known High Performance Linpack (HPL)
benchmark.8 (We will return to this benchmark throughout this book, by the way.)
Listing 1-1. MPI Statistics for the HPL Benchmark with the Most Interesting Fields
Highlighted
############################################################################
#
# command : /home/book/hpl/./xhpl_hybrid_intel64_dynamic (completed)
# host : esg066/x86_64_Linux mpi_tasks : 16 on 8 nodes
# start : 02/14/14/12:43:33 wallclock : 2502.401419 sec
# stop : 02/14/14/13:25:16 %comm : 8.43
# gbytes : 0.00000e+00 total gflop/sec : NA
#
############################################################################
# region : * [ntasks] = 16
#
# [total] <avg> min max
# entries 16 1 1 1
# wallclock 40034.7 2502.17 2502.13 2502.4
# user 446800 27925 27768.4 28192.7
# system 1971.27 123.205 102.103 145.241
# mpi 3375.05 210.941 132.327 282.462
# %comm 8.43032 5.28855 11.2888
# gflop/sec NA NA NA NA
# gbytes 0 0 0 0
#
#
# [time] [calls] <%mpi> <%wall>
# MPI_Send 2737.24 1.93777e+06 81.10 6.84
# MPI_Recv 394.827 16919 11.70 0.99
# MPI_Wait 236.568 1.92085e+06 7.01 0.59
# MPI_Iprobe 3.2257 6.57506e+06 0.10 0.01
# MPI_Init_thread 1.55628 16 0.05 0.00
# MPI_Irecv 1.31957 1.92085e+06 0.04 0.00
# MPI_Type_commit 0.212124 14720 0.01 0.00
# MPI_Type_free 0.0963376 14720 0.00 0.00
# MPI_Comm_split 0.0065608 48 0.00 0.00
# MPI_Comm_free 0.000276804 48 0.00 0.00
# MPI_Wtime 9.67979e-05 48 0.00 0.00
# MPI_Comm_size 9.13143e-05 452 0.00 0.00
# MPI_Comm_rank 7.77245e-05 452 0.00 0.00
# MPI_Finalize 6.91414e-06 16 0.00 0.00
# MPI_TOTAL 3375.05 1.2402e+07 100.00 8.43
############################################################################
From Listing 1-1 you can deduce that MPI communication occupies between 5.3
and 11.3 percent of the total runtime, and that the MPI_Send, MPI_Recv, and MPI_Wait
operations take about 81, 12, and 7 percent, respectively, of the total MPI time. With
this data at hand, you can see that there are potential load imbalances between the job
processes, and that you should focus on making the MPI_Send operation as fast as it can
go to achieve a noticeable performance hike.
Note that if you use the full IPM package instead of the built-in statistics, you will also
get data on the total communication volume and floating point performance that are not
measured by the Intel MPI Library.
Intel MPI supports process pinning to restrict the MPI ranks to parts of the system
so as to optimize process layout (for example, to avoid NUMA effects or to reduce latency
to the InfiniBand adapter). Many relevant settings are described in the Intel MPI Library
Reference Manual.9
Briefly, if you want to run a pure MPI program only on the physical processor cores,
enter the following commands:
$ export I_MPI_PIN_PROCESSOR_LIST=allcores
$ mpirun -np 2 your_MPI_app
If you want to run a hybrid MPI/OpenMP program, don’t change the default Intel
MPI settings, and see the next section for the OpenMP ones.
If you want to analyze Intel MPI process layout and pinning, set the following
environment variable:
$ export I_MPI_DEBUG=4
To pin the OpenMP threads of a program built with the Intel compilers, use the
KMP_AFFINITY environment variable. To place the threads as close to each other as
possible, enter:
$ export KMP_AFFINITY=granularity=thread,compact
To spread the threads across the available cores instead, enter:
$ export KMP_AFFINITY=granularity=thread,scatter
Programs that use the OpenMP API version 4.0 can use the equivalent OpenMP
affinity settings instead of the KMP_AFFINITY environment variable:
$ export OMP_PROC_BIND=close
$ export OMP_PROC_BIND=spread
If you use I_MPI_PIN_DOMAIN, MPI will confine the OpenMP threads of an MPI
process on a single socket. Then you can use the following setting to avoid thread
movement between the logical cores of the socket:
$ export KMP_AFFINITY=granularity=thread
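To verify where the OpenMP threads actually land, a small probe like the following can help (our own sketch; it assumes a Linux system, since it relies on the GNU sched_getcpu() call):

/* pinning_check.c -- print the logical CPU each OpenMP thread runs on.
   Build with OpenMP enabled, e.g.: gcc -fopenmp pinning_check.c -o pinning_check
   Run it under the KMP_AFFINITY or OMP_PROC_BIND settings you want to verify.   */
#define _GNU_SOURCE
#include <sched.h>      /* sched_getcpu() */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        /* Each thread reports its number and the logical CPU it currently runs on. */
        printf("Thread %d of %d on logical CPU %d\n",
               omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
    }
    return 0;
}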
Listing 1-2. Example Optimization Report with the Most Interesting Fields Highlighted
[The loop-by-loop report output of Listing 1-2 is not reproduced here.]
Listing 1-3 shows the vectorization report for the example in Listing 1-2. As you can
see, the vectorization report contains the same information about vectorization as the
optimization report.
Listing 1-3. Example Vectorization Report with the Most Interesting Fields Highlighted
[The loop-by-loop report output of Listing 1-3 is not reproduced here.]
Summary
Switching to Intel MPI and Intel Composer XE can help improve performance because
the two strive to optimally support Intel platforms and deliver good out-of-the-box (OOB)
performance. Tuning measures can further improve the situation. The next chapters will
reiterate the quick and dirty examples of this chapter and show you how to push the limits.
References
1. Intel Corporation, “Intel(R) MPI Library,” http://software.intel.com/en-us/
intel-mpi-library.
2. Intel Corporation, “Intel(R) Composer XE Suites,”
http://software.intel.com/en-us/intel-composer-xe.
3. Argonne National Laboratory, “MPICH: High-Performance Portable MPI,” www.mpich.
org.
4. Ohio State University, “MVAPICH: MPI over InfiniBand, 10GigE/iWARP and RoCE,”
http://mvapich.cse.ohio-state.edu/overview/mvapich2/.
5. International Business Machines Corporation, “IBM Parallel
Environment,” www-03.ibm.com/systems/software/parallel/.
6. Intel Corporation, “Intel Fortran Composer XE 2013 - Documentation,”
http://software.intel.com/articles/intel-fortran-composer-xe-
documentation/.
7. The IPM Developers, “Integrated Performance Monitoring - IPM,” http://ipm-hpc.
sourceforge.net/.
8. A. Petitet, R. C. Whaley, J. Dongarra, and A. Cleary, “HPL : A Portable
Implementation of the High-Performance Linpack Benchmark for Distributed-
Memory Computers,” 10 September 2008, www.netlib.org/benchmark/hpl/.
9. Intel Corporation, “Intel MPI Library Reference Manual,” http://software.intel.
com/en-us/node/500285.
CHAPTER 2
Overview of Platform Architectures
In order to optimize software you need to understand hardware. In this chapter we give
you a brief overview of the typical system architectures found in high-performance
computing (HPC) today. We also introduce terminology that will be used throughout
the book.
Figure 2-1. Runtime: observed time interval between the start and the finish of a car on a
race track
The runtime, or the period of time from the start to the completion of an application,
is important because it tells you how long you need to wait for the results. In networking,
latency is the amount of time it takes a data packet to travel from the source to the
destination; it can also be referred to as the response time. For measurements inside the
processor, we often use the term instruction latency for the time from a machine
instruction entering the execution unit until the results of that instruction are available—that
is, written to the register file and ready to be used by subsequent instructions. In more
general terms, latency can be defined as the observed time interval between the start of a
process and its completion.
We can generalize this class of metrics to cover a broader class of
consumable resources. Time is one kind of consumable resource, such as the time
allocated for your job on a supercomputer. Another important example of a consumable
resource is the amount of electrical energy required to complete your job, called energy to
solution. The official unit in which energy is measured is the joule, while in everyday life
we more often use watt-hours. One watt-hour is equal to 3600 joules.
The amount of energy consumption defines your electricity bill and is a very visible
item among operating expenses of major, high-performance computing facilities. It drives
demand for optimization of the energy to solution, in addition to the traditional efforts
to reduce the runtime, improve parallel efficiency, and so on. Energy optimization work
has different scales; going from giga-joules (GJ, or 10^9 joules) consumed at the application
level, to pico-joules (pJ, or 10^-12 joules) per instruction.
One of the specific properties of the latency metrics is that they are additive, so that
they can be viewed as a cumulative sum of several latencies of subtasks. This means that
if the application has three subtasks following one after another, and these subtasks take
times T1, T2 and T3, respectively, then the total application runtime is Tapp = T1 + T2 + T3.
Other types of metrics describe the amount of work that can be completed by the
system per unit of time, or per unit of another consumable resource. One example of car
performance would be its speed, defined as the distance covered per unit of time, or its
fuel efficiency, defined as the distance covered per unit of fuel, such as miles per gallon.
We call these metrics throughput metrics. For example, the number of instructions per
second (IPS) executed by the processor, or the number of floating point operations per
second (FLOPS) are both throughput metrics. Other widely used metrics of this class are
memory bandwidth (reaching tens and hundreds of gigabytes per second these days),
and network interconnection throughput (in either bits per second or bytes per second).
The unit of power (watt) is also a throughput metric that is defined as energy flow per unit
of time, and is equal exactly to 1 joule per second.
It may be depressing to realize that the maximum possible speedup will be limited
by something you can’t improve by adding more resources. Even so, consider the same
speedup problem from another angle: what happens if the amount of work in the
parallelizable part of the execution can be increased?
If the relative share of time taken by the serial portion of the application remains
unchanged with the increase of the workload size, there is no inherent speedup factor
available, and as illustrated in Figure 2-4 (left), Amdahl’s Law still works. However, John
Gustafson observed that there was significant speedup opportunity available if the serial
component shrank in size relative to the parallel part as the amount of data processed by
the application (and consequently the amount of computation) expanded.5
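For reference, the standard textbook formulations of the two laws are given below, where p denotes the parallelizable fraction of the work, s = 1 - p the serial fraction, and N the number of processors:

\[
S_{\mathrm{Amdahl}}(N) = \frac{1}{(1-p) + p/N} \;\xrightarrow{\;N\to\infty\;}\; \frac{1}{1-p},
\qquad
S_{\mathrm{Gustafson}}(N) = s + (1-s)\,N = N - (N-1)\,s .
\]

In Gustafson's formulation, s is the serial share of the time observed on the parallel system, so the achievable (scaled) speedup keeps growing with N as long as the parallel part grows with the problem size.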
Figure 2-5. Bottlenecks on the road are commonly known as traffic jams
As shown in Figure 2-5, bottlenecks can create traffic jams on the highway. Using
the terminology of queuing theory,6 we are talking about the toll gate as a single service
center. Customers (here, cars) arrive at this service center at a certain rate, called arrival
rate or workload intensity. There is also a certain duration of time required to collect money
from each car, which is referred to as the service demand. For specific parameter values of
the workload intensity and the service demand, it is possible to analytically evaluate this
model and produce performance metrics, such as utilization (proportion of time when
the server point is busy), residence time (average time spent at the service center by a
customer), length of the queue (average number of customers waiting at the service center),
and throughput (rate at which customers depart from the service center).
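To make these quantities concrete, the basic relations for a single service center can be written as follows (standard queueing-theory results; the residence-time formula applies to the M/M/1 case). Here λ is the arrival rate, S the service demand per customer, U the utilization, R the residence time, X the throughput, and N the average number of customers at the center:

\[
U = \lambda S, \qquad X = \lambda \;(\text{for } U < 1), \qquad
R = \frac{S}{1-U} \;(\text{M/M/1}), \qquad N = X R \;(\text{Little's law}).
\]

For example, if cars arrive at the toll gate at λ = 300 cars per hour (one every 12 seconds) and paying takes S = 10 seconds, then U ≈ 0.83 and the average time spent at the gate is R ≈ 60 seconds.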
Roofline Model
Amdahl’s law and the queuing network models both offer “bound and bottleneck
analysis,” and they work quite well in many cases. However, both complexity and the
level of concurrency of modern high-performance systems keep increasing. Indeed, even
smartphones today have complex multicore chips with pipelines, caches, superscalar
instruction issue, and out-of-order execution, while the applications increasingly use
tasks and threads with asynchronous communication between them. Quantitative
queuing network models that simulate behavior of very complex applications on modern
multicore and heterogeneous systems have become very complex. At the same time, the
speed of microprocessor development has outpaced the speed of the memory evolution;
and in most cases, specifically in high-performance computing, the bandwidth of the
memory subsystem is often the main bottleneck.
In search of a simplified model that would relate processor performance to the
off-chip memory traffic, Williams, Waterman, and Patterson observed that “the
Roofline [model] sets an upper bound on performance of a kernel depending on the
kernel’s operational intensity.”7 The Roofline model subsumes two platform specific
ceilings in one single graph: floating-point performance and memory bandwidth. The
model, despite its apparent simplicity, provides an insightful visualization of the system
bottlenecks. Peak floating point and memory throughput performances can usually be
found from the architecture specifications. Alternatively, it is possible to find sustained
memory performance by running the STREAM benchmark.8
Figure 2-6 shows a roofline plot for a platform with peak performance P = 518.4
GFLOPS (such as a dual-socket server with Intel Xeon E5-2697 v2 processors) and
bandwidth B = 101 GB/s (gigabytes per second) attainable with the STREAM TRIAD
benchmark on this system.
Figure 2-6. Roofline model for dual Intel Xeon E5-2697 v2 server with DDR3-1866 memory
The horizontal line shows peak performance of the computer. This is a hardware
limit for this server. The X-axis represents the amount of work (in number of floating point
operations, or Flops) done for every byte of data coming from memory: Flops/byte (here,
“Flops” stands for the plural of “Flop,” the number of floating point operations, rather
than FLOPS, which is Flops per second). The Y-axis represents gigaFLOPS (10^9
FLOPS), which is a throughput metric showing the number of floating point operations
executed every second (Flops/second, or FLOPS). With that, taking into account that
bytes/second = (Flops/second) / (Flops/byte), the memory throughput metric gigabytes/second is
represented by a line of unit slope in Figure 2-6. Thus, the slanted line shows the
maximum floating point performance that the memory subsystem can support for the
given operational intensity. The following formula drives the two performance limits in
the graph shown in Figure 2-6:
Attainable performance [GFLOPS] =
    min { Peak floating point performance,
          Peak memory bandwidth × Operational intensity }
The horizontal and diagonal lines form a kind of roofline, and this gives the
model its name. The roofline sets an upper bound on performance of a computational
kernel depending on its operational intensity. Improving performance of a kernel with
operational intensity of 6 Flops/byte (shown as the dotted line marked by “O” in the
plot) will hit the flat part of the roof, meaning that the kernel performance is ultimately
compute-bound. For another kernel (the one marked by “X”), any improvement will
eventually hit the slanted part of the roof, meaning its performance is ultimately memory
bound. The roofline found for a specific system can be reused repeatedly for classifying
different kernels.
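The same bound is easy to evaluate programmatically; here is a minimal sketch (our own, reusing the peak and bandwidth figures from the example above):

/* roofline.c -- evaluate the roofline bound for a given operational intensity. */
#include <stdio.h>

/* Attainable GFLOPS = min(peak GFLOPS, bandwidth [GB/s] * intensity [Flops/byte]). */
static double roofline(double peak_gflops, double bw_gbs, double intensity)
{
    double memory_bound = bw_gbs * intensity;
    return (memory_bound < peak_gflops) ? memory_bound : peak_gflops;
}

int main(void)
{
    const double peak = 518.4;   /* GFLOPS: dual Intel Xeon E5-2697 v2 example above */
    const double bw   = 101.0;   /* GB/s: STREAM TRIAD result quoted above           */

    /* A 6 Flops/byte kernel hits the flat roof; a 1 Flops/byte kernel is memory bound. */
    printf("intensity 6.0: %.1f GFLOPS\n", roofline(peak, bw, 6.0));   /* 518.4 */
    printf("intensity 1.0: %.1f GFLOPS\n", roofline(peak, bw, 1.0));   /* 101.0 */
    return 0;
}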
Figure 2-7. SIMD approach: single instruction produces results for several data elements
simultaneously
Following this principle, the SIMD vector instruction sets implement not only basic
arithmetic operations (such as additions, multiplications, absolute values, shifts, and
divisions) but also many other useful instructions present in nonvectorized instruction
sets. They also implement special operations to deal with the contents of the vector
registers—for example, any-to-any permutations—and gather instructions that are useful
for vectorized code that accesses nonadjacent data elements.
SIMD extensions for the x86 instruction set were first brought into the Intel
architecture under the Intel MMX brand in 1996 and were used in Pentium processors.
MMX had a SIMD width of 64 bits and focused on integer arithmetic. Thus, two 32-bit
integers, or four 16-bit integers (as type short in C), or eight 8-bit integer numbers (C
type char), could be processed simultaneously. Note also that the MMX instruction set
extensions for x86 supported both signed and unsigned integers.
New SIMD instruction sets for x86 processors added support for new operations on
the vectors, increased the SIMD data width, and added vector instructions to process
floating point numbers much demanded in HPC. In 1999, SIMD data width was increased
to 128 bits with SSE (Streaming SIMD Extensions), and each SSE register (called xmm) was
able to hold two double precision floating point numbers or two 64-bit integers, four single
precision floats or four 32-bit integers, eight 16-bit integers or 16 single-byte elements.
In 2008, Intel announced a doubling of the vector width to 256 bits in the Intel AVX
(Advanced Vector eXtensions) instruction set. The extended register was called ymm. The
ymm registers can hold twice as much data as the SSE’s xmm registers. They support packed
data types for modern x86 processor cores (for instance, in the fourth-generation Intel
Core processors with the microarchitecture codenamed Haswell), as shown in Figure 2-8.
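As a small illustration of what several results per instruction looks like in code, the following sketch (our own, not reproduced from the book's figures) adds two arrays of doubles four elements at a time using AVX intrinsics; in practice, the compiler usually generates such instructions automatically:

/* avx_add.c -- add two arrays four doubles at a time using 256-bit ymm registers.
   Build, e.g.: gcc -mavx avx_add.c -o avx_add                                      */
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    double a[4] = {1.0, 2.0, 3.0, 4.0};
    double b[4] = {10.0, 20.0, 30.0, 40.0};
    double c[4];

    __m256d va = _mm256_loadu_pd(a);      /* load four doubles into a ymm register */
    __m256d vb = _mm256_loadu_pd(b);
    __m256d vc = _mm256_add_pd(va, vb);   /* one instruction, four additions       */
    _mm256_storeu_pd(c, vc);

    printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);   /* 11 22 33 44 */
    return 0;
}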
The latest addition to Intel AVX, announced in 2013, includes the definition of the Intel
Advanced Vector Extensions 512 (or AVX-512) instructions. These instructions represent a
leap ahead to 512-bit SIMD support (and guess what? the registers are now called zmm).
Consequently, up to eight double precision or 16 single precision floating point numbers,
or eight 64-bit integers, or 16 32-bit integers can be packed within the 512-bit vectors.
Figure 2-9 shows the relative sizes of SSE, AVX, and AVX-512 SIMD registers with highlighted
packed 64-bit data types (for instance, double precision floats).
Figure 2-9. SSE, AVX, and AVX-512 vector registers with packed 64-bit numbers
Now that you’re familiar with the important concepts of SIMD processing and
superscalar microarchitectures, the time has come to discuss in greater detail the FLOPS
(floating point operations per second) metric, one of the most cited HPC performance
metrics. This measure is widely used in the field
of scientific computing, where heavy use of calculations with floating point numbers
is very common. The final “S” designates not the plural form of FLOP but the ratio “per
second,” and it is historically written without a slash (/) and without doubling the “S” (i.e., FLOPS
instead of Flops/s). In this book we will stick to that common practice. In some situations,
we will need to refer to floating point operations themselves, so we abbreviate them as Flops and
produce the required ratios as needed. For example, we will write Flops/cycle when there is a need
to count the number of floating point operations per cycle of the processor core.
One of the most often quoted metrics for individual processors or complete
HPC systems is their peak performance. This is the theoretically maximum possible
performance that could be delivered by the system. It is defined as follows:
• Peak performance of a system is a sum of peak performances of
all computing elements (namely, cores) in the system.
• Peak performance for a vectorized superscalar core is calculated
as the number of independent floating point arithmetic operations
that the core can execute in parallel, multiplied by the number of
vector elements that are processed in parallel by these operations.
As an example, if you have a cluster of 16 nodes, each with a single Intel Xeon E3-
1285 v3 processor that has four cores with Haswell microarchitecture running at 3.6 GHz,
it will have peak performance of 3686.4 gigaFLOPS (1 gigaFLOPS = 10^9 FLOPS). Using the FMA
(fused multiply add, which is b = a × b + c) instruction, a Haswell core can generate
four Flops/cycle (via execution of two FMAs per cycle) with a SIMD vector putting out
four results per cycle, thus delivering peak performance of 57.6 gigaFLOPS at the frequency
of 3.6 GHz: 4 Flops/cycle × 4 SIMD × 3.6 GHz = 57.6 GFLOPS. Multiplying this by the
total number of cores in the cluster (64 = 16 nodes × 1 processor/node × 4 cores/processor) gives
3686.4 gigaFLOPS, or 3.68 teraFLOPS.
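The same arithmetic in code form (a trivial sketch using the example cluster parameters given above):

/* peak_flops.c -- peak performance estimate for the example cluster above. */
#include <stdio.h>

int main(void)
{
    const double flops_per_cycle = 4.0;  /* two FMAs per cycle, two Flops each       */
    const double simd_lanes      = 4.0;  /* 256-bit AVX: four double precision lanes */
    const double frequency_ghz   = 3.6;
    const int    cores_per_node  = 4;
    const int    nodes           = 16;

    double per_core = flops_per_cycle * simd_lanes * frequency_ghz;   /* 57.6 GFLOPS   */
    double cluster  = per_core * cores_per_node * nodes;              /* 3686.4 GFLOPS */

    printf("per core: %.1f GFLOPS, cluster: %.1f GFLOPS\n", per_core, cluster);
    return 0;
}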
Peak performance usually cannot be reached, but it serves as a guideline for
the optimization work. Actual application performance (often referred to as sustained
performance) can be obtained by counting the total number of floating point operations
executed by the application (either by analyzing the algorithm or using special processor
counters), and then dividing this number by the application runtime in seconds. The ratio
between measured application performance (in FLOPS) and the peak performance of
the system it was run on, multiplied by 100 percent, is often referred to as computational
efficiency, which demonstrates what share of theoretically possible performance of the
system was actually used by the application. The best efficiencies close to 95 percent
are usually obtained by highly tuned computational kernels, such as BLAS (Basic Linear
Algebra Subprograms), while mainstream HPC applications often achieve efficiencies of
10 percent and lower.
To get details on the NUMA topology of your system, use the numactl tool that is
available for all major Linux distributions. On our workstation, executing the numactl
tool with the --hardware argument displays the following information (see Listing 2-1):
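The original Listing 2-1 is not reproduced here; on a comparable two-socket machine, the output of numactl --hardware looks roughly like the following (illustrative only; the CPU numbering and the free-memory figures will differ on your system):

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 24 25 26 27 28 29 30 31 32 33 34 35
node 0 size: 65457 MB
node 0 free: 59328 MB
node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 65536 MB
node 1 free: 60785 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10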
The output of the numactl tool shows two NUMA nodes, each with 24 logical processors
(just a hint: these are twelve physically independent cores with two hardware threads each)
and 64 GB of RAM per NUMA node, or 128 GB in the server in total.
In a similar manner to physical memory, the Input/Output subsystem and the I/O
controllers are shared inside the multiprocessor systems, so that any processor can access
any I/O device. Similarly to memory controllers, the I/O controllers are often integrated
into the processors, and latency to access local and remote devices may differ. However,
since latency associated with getting data from or to external I/O devices is significantly
higher than the latency added by crossing the inter-processor network (such as QPI), this
additional inter-processor network latency can be ignored in most cases. We will discuss
specific I/O related issues in greater detail in Chapter 4.
To achieve higher performance than a single shared memory system could offer, it is
more beneficial to put together several smaller shared memory systems, and interconnect
those with a fast network. Such interconnection does not make the memory from
different boxes look like a single address space. This leads to the need for software to take
care of copying data from one server to another implicitly or explicitly. Figure 2-11 shows
an example system.
Figure 2-11 shows a computer with four nodes, N0–N3, interconnected by a network,
also called interconnect or fabric. Processors in each node have their own dedicated
private memory and their own private I/O. In fact, these nodes are likely to be shared
memory systems like those we have reviewed earlier. Before any processor can access
data residing in another node’s private memory, that data should be copied to the private
memory of the node that is requesting the data. This hardware approach to building a
parallel machine is called distributed memory. The additional data copy step, of course,
has additional penalty associated with it, and the performance impact greatly depends on
characteristics of the interconnect between the nodes and on the way it is programmed.
Figure 2-12. The complexity of a modern cluster with multi-processor, multicore systems
and application software on these two. However, internal implementations of these two
processor cores are very different.
We refer to the internal implementations as microarchitecture. Thus, the Haswell
microarchitecture that is the basis for Intel Xeon E3-1200 v3 processors is very different
from the Silvermont microarchitecture used to build cores for Intel Atom C2000
processors. Detailed microarchitecture differences and specific optimization techniques
are described in the Intel 64 and IA-32 Architectures Optimization Reference Manual.12
This 600-page document describes a large number of Intel x86 cores and explains how to
optimize software for IA-32 and Intel 64 architecture processors.
The addendum to the aforementioned Intel 64 and IA-32 Architectures Optimization
Reference Manual contains data useful for quantitative analysis of the typical latencies
and throughputs of the individual processor instructions. The primary objective of this
information is to help the programmer with the selection of the instruction sequences
(to minimize chain latency) and in the arrangement of the instructions (to assist in
hardware processing).
However, this information also provides an understanding of the scale of
performance impact from various instruction choices. For instance, typical arithmetic
instruction latencies (reported in the number of clock cycles that are required for the
execution core to complete the execution of the instruction) are one to five cycles
(or 0.4-2 ns when running at 2.5 GHz) for simple instructions such as addition,
multiplication, taking maximum or minimum value. Latency can reach up to 45 cycles
(or 18 ns at 2.5 GHz) for division of double precision floating point numbers.
Instruction throughput is reported as the number of clock cycles that need to pass
before the issue ports can accept the same instruction again. This helps to estimate
the time it would take, for example, for a loop iteration to complete in presence of a
cross-loop dependency. For many instructions, throughput of an instruction can be
significantly smaller than its latency. Sometimes latency is given as just one half of
the clock cycle. This occurs only for the double-speed execution units found in some
microprocessors.
The same manual provides estimates for the best-case latencies and throughput
of the dedicated caches: the first (L1) and the second (L2) level caches, as well as the
translation lookaside buffers (TLBs). Particularly, on the latest Haswell cores, the load
latency from L1 data cache may vary from four to seven cycles (or 1.6-2.8 ns at 2.5 GHz),
and the peak bandwidth for data is equal to 64 (Load) + 32 (Store) bytes per cycle, or up to
240 GB/s aggregate bandwidth (160 GB/s to load data and 80 GB/s to store the data).
The architecture of modern Intel processors supports flexible integration of multiple
processor cores with a shared uncore subsystem. Uncores usually contain integrated
DRAM (Dynamic Random Access Memory) controllers, PCI Express I/O, Quick Path
Interconnect (QPI) links, and the integrated graphics processing units (GPUs) in some
models, as well as a shared cache (L2 or L3, depending on the processor, which is often
called the Last Level Cache, or LLC). An example of the system integration view of four
cores with uncore components is shown in Figure 2-13.
Uncore resources typically reside farther away from the cores on the processor die,
so that typical latencies to access uncore resources (such as LLC) are normally higher
than that for a core’s own resources (such as L1 and L2 caches). Also, since the uncore
resources are shared, the cores compete for uncore bandwidth. The latency of accessing
uncore resources is not as deterministic as the latency inside the core. For example, the
latency of loading data from LLC may vary from 26 to 60 cycles (or from 10.4 to 24 ns for a
2.5 GHz processor), compared to the typical best case of 12 cycles (or 4.8 ns) load latency
for the L2 cache.
Cache bandwidth improvements in the Haswell microarchitecture over the older
Sandy Bridge/Ivy Bridge microarchitectures doubled the number of bytes loaded
and stored per clock cycle from 32 and 16 to 64 and 32, respectively. Last Level Cache
bandwidth also jumped from 32 bytes per cycle to 64 bytes. At the same time, typical
access latencies stayed unchanged between the microarchitecture generations. This
confirms the earlier observation related to the bandwidth vs. latency development.
As for the next level in the memory hierarchy, the computer main memory, its
latency further increases and its bandwidth drops. Figure 2-14 shows schematically the
relative latency and bandwidth capabilities in the memory hierarchy of a quad-core
Haswell-based Intel Xeon Processor E3-1265L v3 processor.
Another important aspect of the memory latency is that the effective time to load or
store data goes up with higher utilization of the memory busses. Figure 2-15 shows the
results of the latency measurement performed as a function of intensity of the memory
traffic for a dual-socket server. Here, two generations of server processors are compared,
with cores based on the Sandy Bridge and Ivy Bridge. The newer Ivy Bridge-based
processors (specifically Intel Xeon E5-2697 v2) support faster memory running at 1866 MHz
and contain improvements in the efficiency of the memory controller implementation over
the previous generation, Intel Xeon E5-2690 processor built with eight Sandy Bridge cores
and memory running at 1600 MHz. Despite the increase of the core count and faster DRAM
speed, latency is about the same in both cases when the concurrency of memory requests is
low (and thus the consumed memory bandwidth is far below the physical limits).
bandwidth reaching 350 GB/s. However, great throughput comes at the expense of
latency: the coprocessors usually run at frequencies around 1 GHz (2.5-3x slower than
standalone processors), and GDDR5 access latency is at least a factor of two times higher
versus DDR3 in a standard server. However, for a subset of applications, where higher
latency can be hidden by much higher concurrency, noticeable performance benefit
comes from the significantly higher throughput in hardware.
One important performance and programmability aspect of coprocessors is that they
are attached to the main processor(s) over the PCI Express (PCIe) bus. Often they have to
involve the host processor to execute I/O operations or perform other tasks. The second-
generation PCIe bus that is used in Intel Xeon Phi coprocessors can deliver up to 80 Gbps
(gigabits per second) of peak wire bandwidth in each direction via a x16 connector. This
translates into approximately 7 GB/s of sustained bandwidth, since the overhead includes
the 8b/10b encoding scheme used to increase reliability of data transfers over the bus. This also
adds latency between the host processor and the coprocessor, on the order of 200-300 ns,
or more if the bus is heavily loaded. In heterogeneous applications that use both the
central processors and the coprocessors in a system, it is important to hide this added
latency of communications over the PCIe bus.
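The arithmetic behind these numbers is a useful back-of-the-envelope check (ours, not a formula from the text):

\[
80\ \text{Gbps} \times \tfrac{8}{10} = 64\ \text{Gbps} = 8\ \text{GB/s per direction},
\]

with the remaining PCIe packet (header and acknowledgment) overhead bringing the sustained rate down to roughly 7 GB/s.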
All of the most popular interconnects use the PCIe bus to connect to the processors
in the system, and thus they inherit all the latency and bandwidth limitations specific to
PCIe. On top of this, since both Ethernet and InfiniBand are designed to scale to a much
larger number of communicating agents in the network (at least tens of thousands)
than PCIe, their protocol overheads and cost of packet routing are significantly higher
compared to the PCIe bus used inside the server.
It is worth noting that job schedulers take their portion of time for every job
execution, and this time can reach seconds per job submission. The good news is that
scheduling takes place only before the application starts and may add some time after the
job ends (for the clean-up). So, if your job takes several days to run on a cluster, these few
seconds have a small relative impact.
However, sometimes people need to run a large number of smaller jobs. For
example, if each job takes a couple of minutes to run, but there are many jobs (up to tens
of thousands have been observed in real life), the relative time taken by the job scheduler
becomes very visible. Most job schedulers offer special support for large numbers of
smaller jobs with identical parameters via so-called job arrays. If your application is
of that kind, please take some time to study how to make effective use of your cluster’s
scheduling software.
Summary
This chapter briefly overviewed the main terms and concepts pertaining to
performance analysis and gave an overview of modern high-performance computing
platforms. Certainly, this is the minimum information needed to help you get started on
the subject or to refresh your existing knowledge.
If you are interested in computer architecture, you may enjoy the book Computer
Architecture: A Quantitative Approach.13 In the fourth edition of this book, the authors
increase their coverage of multiprocessors and explore the most effective ways of
achieving parallelism as the key to unlocking the power of modern architectures.
We also found an easy-to-read guide in the book Introduction to High Performance
Computing for Scientists and Engineers, written by Georg Hager and Gerhard Wellein.14
It contains a great overview of platforms architectures, as well as recommendations for
application optimization specific to the serial, multi-threaded, and clustered execution.
In his article Latency Lags Bandwidth, David Patterson presents an interesting study
that illustrates a chronic imbalance between bandwidth and latency.15 He lists half a
dozen performance milestones to document this observation, highlights many reasons
why this happens, and proposes a few ways to cope with the problem, as well as gives
a rule of thumb to quantify it, plus an example of how to design systems based on this
observation.
For readers interested in the queuing network modeling, we recommend the book
Quantitative System Performance: Computer System Analysis Using Queuing Network
Models.16 It contains an in-depth description of the methodology and a practical guide
to and case studies of system performance analysis. It also provides great insight into the
major factors affecting the performance of computer systems and quantifies the influence
of the system bottlenecks.
The fundamentals and practical methods of the queuing theory are described in the
book Queueing Systems: Theory.17 Step-by-step derivations with detailed explanation and
lists of the most important results make this treatise useful as a handbook.
References
1. Merriam-Webster Collegiate Dictionary, 11th ed. (Springfield, MA:
Merriam-Webster, 2003).
2. A. S. Tanenbaum, Computer Networks (Englewood Cliffs, NJ: Prentice-Hall, 2003).
3. B. Cantrill and J. Bonwick, “Real-World Concurrency,” ACM Queue 6, no. 5
(September 2008): 16–25.
4. G. M. Amdahl, “Validity of the Single Processor Approach to Achieving
Large-Scale Computing Capabilities,” AFIPS ’67 (Spring) Proceedings of the 18–20
April 1967, spring joint computer conference, 483–85.
5. J. L. Gustafson, “Reevaluating Amdahl’s law,” Communications of the ACM 31, no. 5
(May 1988): 532–33.
6. “Queueing theory,” Wikipedia, http://en.wikipedia.org/wiki/Queueing_theory.
7. S. Williams, A. Waterman, and D. Patterson, “Roofline: An Insightful Visual
Performance Model for Multicore Architectures,” Communications of the ACM -
A Direct Path to Dependable Software 52, no. 4 (April 2009): 65–76.
8. J. McCalpin, “Memory Bandwidth and Machine Balance in Current High
Performance Computers,” IEEE Computer Society Technical Committee on Computer
Architecture (TCCA) Newsletter, December 1995.
9. IDC (International Data Corporation), “HPC Market Update: 2012,” September 2012,
www.hpcuserforum.com/presentations/dearborn2012/IDCmarketslidesChirag-
Steve.pdf.
10. Intel Corporation, “Intel VTune Amplifier XE 2013,” http://software.intel.com/
en-us/intel-vtune-amplifier-xe.
11. M. J. Flynn, “Very High-speed Computing Systems,” Proceedings of IEEE
54 (1966): 1901–909.
12. Intel Corporation, Intel 64 and IA-32 Architectures Optimization Reference Manual,
www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-
architectures-optimization-manual.html.
13. J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach,
4th ed. (Burlington, MA: Morgan Kaufmann, 2006).
14. G. Hager and G. Wellein, Introduction to High Performance Computing for Scientists
and Engineers (Boca Raton, FL: CRC Press, 2010).
15. D. A. Patterson, “Latency Lags Bandwidth,” Communications of the
ACM - Voting Systems, January 2004, pp. 71–75.
16. E. D. Lazowska, J. Zahorjan, G. S. Graham, and K. C. Sevcik, Quantitative System
Performance: Computer System Analysis Using Queueing Network Models (Upper
Saddle River, NJ: Prentice-Hall, 1984).
17. L. Kleinrock, Queueing Systems: Theory, vol. 1 (Hoboken, NJ: John Wiley, 1976).
CHAPTER 3
Top-Down Software Optimization
Performance follows the weakest-link paradigm: if one stage of the pipeline does
not work according to expectations, the rest of the pipeline will starve. While optimizing
this pipeline, we should start with the biggest potential bottlenecks first—at the top of
this list, working our way down, as shown in Figure 3-1. Indeed, it makes little sense to
start working on the branch misprediction impact while the application spends most of
its time in the network communication or cache misses. Once we have made sure data is
available in the cache, a continuously occurring branch misprediction does have a huge
relative impact.
Table 3-1. Memory Technologies and Their Latency and Throughput (to the Order of
Magnitude)
*QPI remote connection latency is hardly observable on the backdrop of the remote
memory latency mentioned above.
System Level
Before worrying about the code of your application, the most important and impactful
tuning can be achieved by looking at the system components: the compute node, the
interconnect, and the storage. No matter how advanced and skillfully implemented an
algorithm is, a wrongly configured network, a forgotten file I/O, or a misplaced memory
module in a NUMA system can undo all the effort you put into careful programming.
In many cases, you will be using a system that is in good shape. Particularly if you
are a user of an HPC compute center, your system administrators will have taken care in
choosing the components and their sound setup. Still, not even the most adept system
administrator is immune to a hard disk failure, the cooling deficits of an open rack door,
or a bug in a freshly installed network driver. No matter how well your system seems to be
maintained, you want to make sure it really does perform to its specification.
You’ll find a detailed description of the system tuning in Chapter 4, but here we give
an overview of the components and tools. For an HPC system, the hardware components
affecting performance at a system level are mostly as follows:
• Storage and file systems: As most HPC problems deal with
large amounts of data, an effective, scalable storage hierarchy
is critical for application performance and scaling. If storage
is inadequate in terms of bandwidth or access latency, it may
introduce serialization into the entire application. Taking into
account Amdahl’s Law (discussed in Chapter 2), this should be
considered as the first optimization opportunity.
• Cluster interconnection hardware and software: HPC applications
do not only demand high bandwidth and low latency for
point-to-point exchanges. They also demand advanced
capabilities to support collective communications between very
large numbers of nodes. A single parameter set wrongly here may
completely change the relevant performance characteristics of
the network.
• Random access memory (RAM): The RAM attached to the
integrated memory controller of a CPU comes in packages called
dual in-line memory modules (DIMMs). The memory controller
supports a number of channels that can be populated with several
DIMMs. At the same time, different specifications of DIMMs may
be supported by the memory controller, such as DIMMs of different
sizes in the same channel. Asymmetry in either size or placement
of the DIMMs may result in substantial performance degradation.
• Platform compute/memory balance: As discussed in Chapter 2,
each system has its compute/memory performance balance that
can be visualized by the Roofline model. Depending on the specific
platform configuration (including the number of cores, their speed
and capabilities, and the memory type and speed), the application
may end up being memory or compute bound, and these specific
platform characteristics will define the application performance.
Application Level
After the bottlenecks at the system level are successfully cleared, the next category
we enter is the application level: we are actually getting our hands on the code here!
Application-level tuning is more complicated than system level because it requires a
certain degree of understanding of algorithmic details. At the system level, we dealt
with standard components—CPUs, OS, network cards, and so on. We rarely can change
anything about them, but they need to be carefully chosen and correctly set up. At the
application level, things change. Software is seldom made from standard components:
most of its functionality is different from all other software. The essential part causing this
differentiation is the algorithm(s) used and the implementation thereof.
Note that optimization should not mean a major rewrite. You don’t want to change
the general algorithm as such. A finite difference program should remain that way, even
if finite elements might be more suitable. We are, rather, talking about optimizing the
algorithm at hand and the plethora of smaller algorithms that it is built from.
S = L_RAM / L_cache
Figure 3-2. Example for an automatic vectorization by the compiler in C source code, and
the resulting assembly instructions
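As an illustration in the spirit of Figure 3-2 (our own example, not the exact code from the figure), consider a loop of the kind that current compilers vectorize automatically:

/* saxpy.c -- a simple loop that compilers typically vectorize automatically;
   compiling with the vectorization report enabled shows whether SIMD code was generated. */
#include <stddef.h>

void saxpy(size_t n, float a, const float *x, float *y)
{
    /* Independent iterations over contiguous data: a textbook candidate for SIMD. */
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}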
As in the shared-memory approach (discussed in the next section), there is a need for
a robust library that would abstract all the low-level details and hide from the user the
differences between various interconnects available on the market. So, back in the early
1990s, a group of researchers designed and standardized the Message Passing Interface
(MPI).3 The MPI standard defines a language-independent communications protocol
as well as syntax and semantics of the routines required for writing portable message-
passing programs in Fortran or C/C++; nonstandard bindings are available for many
other languages, including C++, Perl, Python, R, and Ruby. The MPI standard is managed
by the MPI Forum4 and is implemented by many commercial and open-source libraries.
The MPI standard was widely used as a programming model for distributed memory
systems that were becoming increasingly popular in the early 1990s. As the shared
memory architecture of individual systems became more popular, the MPI library
evolved as well. The latest MPI-3 standard was issued in September 2012. It added fast
remote memory access routines, nonblocking and sparse collective operations, and some
other performance-relevant extensions, especially in the shared memory and threading
area. However, the programming model clearly remains the distributed memory one with
explicit parallelism: the developer is responsible for correctly identifying parallelism and
implementing parallel algorithms using MPI primitives.
The performance improvement that can be gained from distributed memory
parallelization is roughly proportional to the number of compute nodes available, which
ranges between 10x and 1000x for the usual compute clusters.
The most recent development of the OpenMP moved the OpenMP API beyond
traditional management of pools of threads. In the OpenMP specification version 4.0,
released in July 2013, you find support for SIMD optimizations, as well as support for
accelerators and coprocessors that architecturally better fit into the distributed memory
system type discussed earlier in this chapter. Chapter 5 discusses OpenMP and other
threading-related optimization topics, including how to deal with the application-level
bottlenecks specific to the shared memory systems programming.
The performance improvement for shared memory parallelization is roughly
proportional to the number of cores available per compute node, which is from 10x to 20x
in modern server architectures.
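As a small illustration of the shared memory model (our own sketch using the OpenMP 4.0 features mentioned above), the following loop is split across the threads of a node and vectorized within each thread at the same time:

/* dot.c -- OpenMP worksharing plus SIMD (OpenMP 4.0) for a dot product.
   Build with OpenMP enabled, e.g.: gcc -fopenmp -O2 dot.c -o dot          */
#include <stdio.h>

int main(void)
{
    enum { N = 1000000 };
    static double x[N], y[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    /* Iterations are shared among the threads and vectorized inside each thread. */
    #pragma omp parallel for simd reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += x[i] * y[i];

    printf("sum = %.1f\n", sum);   /* expected: 2000000.0 */
    return 0;
}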
One thing to keep in mind when working at the algorithm level is that you do not
need to reinvent the wheel. If there is a library available that supports the features of
the system under consideration, you should use it. A good example is the standard
linear algebra operations. Nobody should program a matrix-matrix multiplication or an
eigenvalue solver if it is not absolutely necessary and known to deliver a great benefit.
The vector-vector, matrix-vector, and matrix-matrix operations are standardized in the
so-called Basic Linear Algebra Subprograms (BLAS),14 while the solvers can be addressed via
the Linear Algebra Package (LAPACK)15 interfaces, for which many implementations are
available. One of them is Intel Math Kernel Library (Intel MKL), which is, of course, fully
vectorized for all available Intel architectures and additionally offers shared memory
parallelization.16
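For example, instead of hand-coding a triple loop, a matrix-matrix multiplication can be delegated to the BLAS routine dgemm through the standard CBLAS interface (a minimal sketch; Intel MKL, as well as other BLAS implementations, provides this interface):

/* gemm_example.c -- C = alpha*A*B + beta*C via the standard CBLAS interface.
   Link against a BLAS implementation such as Intel MKL (include mkl.h there) or OpenBLAS. */
#include <stdio.h>
#include <cblas.h>

int main(void)
{
    /* 2x2 matrices in row-major order. */
    double A[4] = {1, 2, 3, 4};
    double B[4] = {5, 6, 7, 8};
    double C[4] = {0, 0, 0, 0};

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2,       /* M, N, K       */
                1.0, A, 2,     /* alpha, A, lda */
                B, 2,          /* B, ldb        */
                0.0, C, 2);    /* beta, C, ldc  */

    printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);   /* 19 22 / 43 50 */
    return 0;
}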
Microarchitecture Level
Having optimized the system and the algorithmic levels, let’s turn now to the problem
of how the actual machine instructions are executed by the CPU. According to
Table 3-1, microarchitectural changes have the least individual impact in absolute
numbers, but when they are accumulated, their impact on performance may be large.
Microarchitectural tuning requires a certain understanding of the operation of the
individual components of a CPU (discussed in detail in Chapter 7). Here, we restrict
ourselves to a very limited overview.
Closed-Loop Methodology
One of the most critical factors in the tuning process is the way you load the system. There
is some ambiguity in the use of the terms workload and application. Very often, they
are used interchangeably. In general, application means the actual code that is executed,
whereas workload is the task and data that you give to the application. For instance, the
application might be sort.exe, and the workload might be some data file that contains
the names of persons.
1. Collect performance data: Run a representative workload and
gather timing and profiling data at the level under investigation.
2. Analyze the data and identify issues: Focus on the most time-
consuming part(s). Begin by looking for unexpected results or
numbers that are out of tolerance. Try to fully understand the
issue by using appropriate tools. Make sure the analysis does
not affect the results.
3. Generate alternatives to resolve the issue: Remove the
identified bottlenecks. Try to keep focused on one step at
a time. Rate the solutions on how difficult they are to be
implemented and on their potential payback.
4. Implement the enhancement: Change only one thing at a
time in order to estimate the magnitude of the individual
improvement. Make sure none of the changes causes a
slowdown and negated other improvements. Keep track of the
changes so you can roll them back, if necessary.
5. Test the results: Check whether performance improvements
are up to your expectations and that they remove the
identified bottleneck.
After the last step, you restart the cycle to identify the next bottleneck (see Figure 3-3).
In principle, this loop is infinite; in practice, the time to stop is determined by the amount
of time left to do the job.
Figure 3-3. Left: The closed-loop iterative performance optimization cycle. Right: Example
performance gains by tuning through various levels
The right graph in Figure 3-3 shows an artificial performance optimization across
different levels. Note that at each level the performance saturates to some degree and that
we switch levels when other bottlenecks become dominant. This can also mean going up
a level again, since successful tuning at the application level might expose a bottleneck
at the system level.
For example, consider an improvement in the OpenMP threading that suddenly
increases the demand for memory bandwidth. This might very well expose a previously
undiscovered bottleneck in the system's memory setup, such as DIMMs of different sizes
populating the channels of the memory controllers, with the resulting decrease in memory
bandwidth.
Summary
The methodology presented in this chapter provides a solid process to tune a system
consistently with the top-down/closed-loop approach. The main things to remember are
to investigate and tune your system through the following different levels:
1. System level (see Chapter 4)
2. Application level, including distributed and shared memory
parallelization (see Chapters 5 and 6)
3. Microarchitecture level (see Chapter 7)
Keep iterating at each level until convergence, and proceed to the next level with the
biggest impact as long as there is time left.
References
1. G. E. Moore, “Cramming More Components onto Integrated Circuits” Electronics 38,
no. 8 (19 April 1965): 114–17.
2. W. A. Wulf, and S. A. McKee, “Hitting the Memory Wall: Implications of the Obvious,”
1994, www.eecs.ucf.edu/~lboloni/Teaching/EEL5708_2006/slides/wulf94.pdf.
3. MPI Forum, “MPI Documents,” www.mpi-forum.org/docs/docs.html.
4. “Message Passing Interface Forum,” www.mpi-forum.org.
5. The Open Group, “Single UNIX Specification, Version 4, 2013 Edition,” 2013,
www2.opengroup.org/ogsys/jsp/publications/PublicationDetails.
jsp?publicationid=12310.
6. OpenMP.Org, “OpenMP,” http://openmp.org/wp.
7. UPC-Lang.Org., “Unified Parallel C,” http://upc-lang.org.
8. Co-Array.Org, “Co-Array Fortran,” www.co-array.org.
9. “SHMEM,” Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/SHMEM.
CHAPTER 4
Addressing System Bottlenecks
We start with a bold statement: every application has a bottleneck. By that, we mean that
there is always something that limits the performance of a given application in a system. Even
if the application is well optimized and it may seem that no additional improvements are
possible by tuning it at the other levels, it still has a bottleneck, and that bottleneck is in
the system the program runs on. The tuning starts and ends at the system level.
When you improve your application to take advantage of all the features provided by
the hardware and to use all the available concurrency, the application's execution
approaches peak performance, and it is that peak performance which ultimately limits how
fast the program can run. A trivial solution to improve performance in such cases is to buy
a new and better piece of hardware. But to make an informed selection of new hardware,
you would need to find the specific system-level bottleneck that restricts the application
performance.
This chapter starts with a discussion of the typical system-level tweaks and checks
that one can implement before considering purchasing new hardware. These should be
seen as sanity checks of the available hardware before you invest in tuning your
applications at the other levels. The following chapters describe the tools and techniques
for application optimization, yet the optimization work in the application must rely on a
clean and sane system. This chapter covers what can go wrong with the system and how
to find out whether a specific system limitation is impacting your application performance.
The root causes of system bottlenecks can be split into two major categories,
depending on their origin:
1. Conditions of the runtime environment
2. Configuration of hardware, firmware, or software
The first category may be present temporarily, appear from time to time, and may go
away and come back unless the underlying issues are identified and fixed. The second
includes issues that are static in time unless the system is reconfigured; they are caused
by limitations in how the system was built (for instance, component selection, assembly,
and configuration).
#!/bin/sh
# Sum up the power limit and thermal throttling event counters over all CPUs
let CPU_power_limit_count=0
let CPU_throttle_count=0
for cpu in $(ls -d /sys/devices/system/cpu/cpu[0-9]*); do
    if [ -f $cpu/thermal_throttle/package_power_limit_count ]; then
        let CPU_power_limit_count+=$(cat $cpu/thermal_throttle/package_power_limit_count)
    fi
    if [ -f $cpu/thermal_throttle/package_throttle_count ]; then
        let CPU_throttle_count+=$(cat $cpu/thermal_throttle/package_throttle_count)
    fi
done
echo CPU power limit events: $CPU_power_limit_count
echo CPU thermal throttling events: $CPU_throttle_count
The expected good values for both printed counters are zero, as shown in Listing 4-2.
$ ./check_throttle.sh
CPU power limit events: 0
CPU thermal throttling events: 0
$ sudo powertop
■ Note Here, prior to running the powertop command, we used another command: sudo.
This addition instructs the operating system to execute the following command with
administrator (superuser) privileges, without asking for the superuser's password. In order to
be able to use sudo, the system administrator has to delegate the appropriate permission
to you, usually by including you in a specific group (such as wheel or adm) and editing the
system configuration file /etc/sudoers. Depending on the configuration, you may be asked
to enter your own password.
The alsa driver belongs to the sound subsystem, which is not
usually used in HPC systems and can safely be removed. In
general, it is good practice to remove all unused software that may
be installed by default with your favorite OS distribution (for
instance, mail servers such as sendmail, or the Bluetooth
subsystem, which are generally not needed on HPC cluster compute
nodes); such software should either be uninstalled from the OS or
disabled in the startup process.
The kipmi0 kernel thread is the part of the OS kernel responsible
for the work of the intelligent platform management interface (IPMI)
subsystem,3 which is often used to monitor various platform
sensors, such as those for the CPU temperature and voltage.
This may be required by the in-band monitoring agents of
monitoring tools like lm_sensors,4 Ganglia,5 or Nagios.6 Although
kipmid is supposed to use only idle cycles, it does wake up the
system and can affect application performance. The good news is
that it is possible to limit the time taken by this kernel module:
$ sudo bash -c \
'echo 1 > /sys/module/ipmi_si/parameters/kipmid_max_busy_us'
This will limit the kipmid CPU time and the number of times it
wakes up the OS.
After these two changes to the OS configuration, the number of
wakeups per second on our test system was reduced and, more
important, the CPU usage went down. The CPU usage in the OS's
idle state is around 1.5 percent (of which at least 1 percent is taken
by PowerTOP itself), and the processor usage by kipmi0
is reduced to around 1 ms per second. The next biggest cause
of OS wakeups is the timer process, which takes only a third of a
millisecond every second.
Other configurations or versions of operating systems may have a
very different set of hardware and software components, but using
tools like the Linux top and PowerTOP will help you identify the
time-consumers and serve as a guide to reducing the system's
idle-state overhead.
3. Check BIOS settings: As the next step after OS improvement,
it is worth checking important parameters of the basic
input/output system (BIOS). Unlike BIOS setups in client
platforms, BIOS in a server provides a lot more options to
tune the system characteristics for different application
workloads. While the available choice of settings is a very
good way to support multiple different usages of the server
platforms, it may also lead to inefficient settings for your
specific applications. And although the OEMs delivering
high-performance computing solutions try to configure their
servers in a proper way, it may still be good to follow some
basic recommendations for several important BIOS settings.
The BIOS provides a summary of the detected hardware: processor
type and speed, memory capacity, and memory frequency. For instance,
if a memory module fails, it will be excluded from the configuration
the BIOS presents to the operating system, and the server will still
boot. If the memory module failure goes unnoticed, the system
will continue working, but the memory capacity and performance
will be lower than expected.
■ Note For the best possible performance, check that the memory is configured in the
channel interleave or the independent mode in the BIOS.
You will find the scripts for the C-shell under the same folder as the scripts for
the Bash-compatible shells. The source code of the nodeperf application can be
found in the source folder of the HPL benchmark that comes with the Intel MKL. If
Intel Parallel Studio XE 2015 Cluster Edition is installed into its default folder, you can
copy the nodeperf.c source file to your current directory, where the benchmark will
be compiled and later executed (it is expected that this directory is shared among all
cluster nodes):
$ cp $MKLROOT/benchmarks/mp_linpack/nodeperf.c ./
Compile the nodeperf program using the Intel MPI compile wrapper script for the Intel
compiler, with optimizations enabled to at least the -O2 level, tuning for the instruction set
supported on the build machine (either by using -xHost or by directly specifying another
instruction set target as described in Table 1-1), enabling OpenMP support (-qopenmp), and
linking with the Intel MKL library for the optimized version of the DGEMM routine (-mkl):
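The exact command line depends on your installation; a sketch that follows these
recommendations could look like this (mpiicc is the Intel MPI wrapper for the Intel C
compiler, and the output name nodeperf is our choice):
$ mpiicc -O2 -xHost -qopenmp -mkl nodeperf.c -o nodeperf
Before launching the benchmark, set the OpenMP environment variables: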
$ export OMP_NUM_THREADS=24
$ export OMP_PLACES=cores
The first of these export commands requests 24 OpenMP threads. This is equal to the
number of physical processor cores in every node of our cluster: each of the two CPUs has
12 cores, so we ask for exactly one thread per physical core. If OMP_NUM_THREADS is not set
explicitly, the OpenMP runtime library will use all processors visible to the OS, which, with
Hyper-Threading enabled, will lead to the assignment of two OpenMP threads per
physical core. The OMP_PLACES environment variable instructs the Intel OpenMP runtime to
distribute the threads among the cores in the system, so that two different threads will not
run on one physical core. Now we're ready to start the test using the mpirun command,
requesting eight ranks (using the -np 8 option) with only one rank per node (-ppn 1):
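A sketch of the launch command follows; the host names are placeholders for the eight
nodes allocated to the job:
$ mpirun -np 8 -ppn 1 -hosts node1,node2,node3,node4,node5,node6,node7,node8 ./nodeperf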
■ Note The -hosts option for the mpirun command explicitly lists the names of hosts
allocated for our job by the scheduler, and we provide it here for illustration purposes. More
convenient is to provide the list of hosts the program will run on in a separate file; or in a
majority of cases, Intel MPI can pick it up automatically from the resource manager of the
job scheduling system. Please consult your cluster documentation to see how the MPI jobs
are to be run on your cluster.
The last two columns show the achieved performance of the DGEMM routine in
MFLOPS and the respective host name where nodeperf was run. We immediately
see that node esg004 is the slowest one and esg009 the fastest, and that the performance
difference between the fastest and the slowest nodes is about 4.8 percent.
EXERCISE 4-1
Taking into account that our computational nodes are built using two Intel Xeon
E5-2697 v2 processors (each with 12 cores and a 2700 MHz nominal clock frequency),
with support for the Intel AVX instruction set (so each core is capable of delivering up
to 8 Flops per cycle in double precision), compute for every node the ratio between the
achieved performance and the theoretical peak performance.
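As a reference point for this exercise: under the stated assumptions, the nominal per-node
peak works out to 2 sockets × 12 cores × 2.7 GHz × 8 Flops/cycle ≈ 518.4 GFLOPS, or
518,400 MFLOPS in the units reported by nodeperf.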
After completing Exercise 4-1, you should have found that three nodes demonstrate
performance higher than the theoretical peak. How is this possible? This happens
because the peak performance was calculated using the nominal rated frequency of the
processor (2.7 GHz). But by default, Intel Turbo Boost technology is enabled. This allows
the processors to run at a higher frequency than the nominal one, as long as the CPU power
consumption stays within the specification and the cooling system can keep the processor
package below its critical temperature. An Intel Xeon E5-2697 v2 processor can run at
up to 300 MHz above the nominal clock speed in Turbo Boost mode, thus reaching up to 3 GHz.
When Turbo Boost is disabled in the BIOS settings, though, the processor clock frequency
cannot exceed the nominal 2.7 GHz, and consequently the performance reported by nodeperf
is lower, while still above 90 percent of the peak performance, as shown in Listing 4-4.
Listing 4-4. Output of nodeperf with Turbo Boost Disabled (Cluster, 8 Nodes)
One final observation: this result also shows that the performance difference
between the fastest and the slowest nodes in the cluster is only 2.8 percent, lower
than in the case when Turbo Boost is enabled. Turbo Boost can help achieve better
performance results; however, the observed performance variation between the nodes
will be higher, depending on each node's conditions.
Out of the entire output of the STREAM benchmark you may pick
just the values of the TRIAD component, which executes the
computational kernel a(i) = b(i) + q*c(i). For convenience, we
have put the stream command and the necessary parsing of the results
into a shell script named 2run.sh, shown in Listing 4-5.
In that script we assume that the script with the environment variable
settings and the compiled stream binary reside in the same folder as
the 2run.sh script file. To keep the script portable, we also
use a dynamic way to calculate the number of physical cores in
the server and set the OMP_NUM_THREADS variable to that value.
#!/bin/sh
# Load the environment settings from the same folder as this script
. `dirname $0`/0env.sh
# Set the number of OpenMP threads to the number of physical cores
export OMP_NUM_THREADS=$(cat /proc/cpuinfo | awk 'BEGIN{cpus=0}
/processor/{cpus++}
/cpu cores/{cpu_cores=$4}
/siblings/{siblings=$3}
END{print cpus*cpu_cores/siblings}')
export OMP_PLACES=cores
# Run STREAM and print only the Triad result, prefixed by the hostname
`dirname $0`/stream | awk -v host=$(hostname) '/Triad:/{printf "%s: %s\n",host,$2}'
You can run this script using the following command, assuming
the current working directory where the script resides is accessible
from all nodes.
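A launch command along these lines should do (a sketch; one MPI rank per node drives
the OpenMP threads on each of the eight nodes shown below):
$ mpirun -np 8 -ppn 1 ./2run.sh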
■ Note In this command, we omitted the hostnames of the nodes to run the test on; depending
on your cluster setup, you may or may not have to list them explicitly in the mpirun command.
esg145: 84845.0
esg215: 85768.2
esg281: 85984.2
esg078: 85965.3
esg150: 85990.6
esg084: 86068.5
esg187: 86006.9
esg171: 85789.7
EXERCISE 4-2
Run the STREAM benchmark and the Intel Memory Latency Checker on your system to
determine the maximum achievable memory bandwidth and access latency. What share
of the peak memory throughput is achieved in the STREAM test? If your system has a
NUMA architecture, what NUMA factor does it have?
For instance, the software stack coming along with the InfiniBand network drivers
usually contains a package called perftest with a set of low-level performance tests.
These tests allow for measuring bandwidth and latency for typical RDMA commands,
such as
• The read and write operations between two nodes: ib_read_bw,
ib_write_bw, ib_read_lat, ib_write_lat,
• The send command and atomic transactions: ib_send_bw,
ib_atomic_bw, ib_send_lat, ib_atomic_lat.
The tests require a server to be started on one of the nodes and the client on another.
Let’s take two nodes in our cluster (say, esg012 and esg013), and start a server on the
node named esg012 as follows:
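The commands below are a sketch of how such a test is typically started with the perftest
tools, using the RDMA read latency test as an example:
$ ib_read_lat -d mlx4_0          # on the server node esg012
$ ib_read_lat -d mlx4_0 esg012   # on the client node esg013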
Here, the option -d will request a specific InfiniBand device to be used (alternatively,
the tool will select the first device found). The list of available RDMA devices can be
obtained using the ibv_devices command, and for our cluster this list looks like this:
$ ibv_devices
device node GUID
------ ----------------
scif0 001b68fffe3d7f7a
mlx4_0 0002c903002f18b0
Listing 4-8. Example Output of the RDMA Read Latency Test (Cluster)
local address: LID 0x130 QPN 0x0733 PSN 0xe892fa OUT 0x10 RKey 0x3010939
VAddr 0x00000000f60000
remote address: LID 0x179 QPN 0x06fe PSN 0x5afd2 OUT 0x10 RKey 0x010933
VAddr 0x000000015c8000
----------------------------------------------------------------------------
#bytes #iterations t_min[usec] t_max[usec] t_typical[usec]
2 1000 1.85 20.96 1.91
----------------------------------------------------------------------------
Here, the measured latency for 2-byte messages is equal to 1.91 microseconds.
Note that remote direct memory access to a different node takes approximately 15 times
longer than access to the memory of the remote socket, and 27 times longer than
access to the local memory.
The bandwidth tests are run in a similar way. It may be interesting to see the bandwidth
for different message sizes, which can be measured by adding the -a command-line
option. Listing 4-9 contains typical output produced on a client compute node for a
unidirectional write bandwidth test against the server on the node esg012.
Listing 4-9. Output of the Unidirectional RDMA Read Bandwidth Test (Cluster)
----------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
2 1000 8.11 7.82 4.098211
4 1000 16.56 16.55 4.337238
8 1000 33.12 33.07 4.334843
16 1000 66.13 66.07 4.329837
32 1000 132.47 132.43 4.339469
64 1000 264.09 254.07 4.162761
128 1000 517.41 509.61 4.174733
256 1000 1059.77 1059.05 4.337851
512 1000 2119.55 2116.35 4.334286
1024 1000 4205.29 4203.23 4.304106
2048 1000 5414.21 5410.27 2.770060
4096 1000 4939.99 4939.20 1.264436
8192 1000 5251.12 5250.99 0.672127
16384 1000 5267.51 5267.00 0.337088
32768 1000 5256.68 5256.65 0.168213
65536 1000 5263.24 5262.24 0.084196
131072 1000 5264.72 5262.55 0.042100
262144 1000 5271.87 5271.63 0.021087
524288 1000 5272.98 5270.55 0.010541
1048576 1000 5274.57 5273.48 0.005273
2097152 1000 5272.47 5271.77 0.002636
4194304 1000 5271.72 5270.81 0.001318
8388608 1000 5272.14 5270.38 0.000659
----------------------------------------------------------------------------
Using the default settings for the benchmark, we see that the maximum bandwidth
achieved in this test is equal to 5.4 GB/s, which represents approximately 80 percent
of the peak bandwidth of the InfiniBand link of around 6.8 GB/s.
EXERCISE 4-3
Run the perftest benchmarks on your favorite system to determine the maximum
achievable interconnect bandwidth and the latency for RDMA network traffic. What
share of the peak network speed is achieved on the bandwidth tests?
So how do we find out which system component is limiting the application performance?
And what knowledge will this investigation give us?
The second question is rather easy to answer: finding out that an application
is memory bandwidth or I/O bandwidth bound (that is, it spends most of the time
transferring data to and from memory or over interconnect) will lead to decisions on
how to improve performance. Taking into account Amdahl’s Law and the roofline model
discussed in Chapter 2, an understanding of the share of I/O or memory dependent
execution time may guide you to a quantitative assessment of potential improvements
and provide ideas for improving the algorithms.
The Intel Performance Counter Monitor (PCM) tools read the built-in hardware performance
counters, print the results to the screen, and output them to comma-separated values (CSV)
files for subsequent analysis using Excel or other tools. PCM is written in C++ and is also
available as a library that can be used to instrument third-party applications to generate
a detailed summary covering all important parts of the application code. The use of the
Open Source Initiative (OSI) BSD license makes it usable even inside commercial
closed-source products. The tools are available on the Linux, Microsoft Windows, FreeBSD,
and Apple MacOS operating systems running on Intel Xeon, Core, and Atom processors.
Let’s review a typical example of the I/O and memory utilization analysis done using
Intel PCM tools. After downloading the source code of the tool from the Intel’s website,
you need to unpack the archive using the unzip command and then compile it. A simple
make command will do all that is needed, based on the supplied Makefile, to produce
several PCM tools:
• pcm.x: A command line PCM utility for monitoring core
utilization, including counting the number of executed
instructions, cache misses, and core temperature and per-core
energy consumption.
• pcm-memory.x: A tool for reading memory throughput utilization.
• pcm-numa.x: A performance counter monitoring utility for NUMA,
providing results for remote and local memory accesses.
• pcm-pcie.x: A command-line utility for monitoring PCIe bus
utilization (for processors with an integrated PCIe I/O controller).
• pcm-power.x: A power-monitoring utility, reporting the power
drawn by the cores and the memory, frequency residencies and
transition statistics, and the number of cycles the processor was
throttled, as well as many other power-related events.
• pcm-tsx.x: A performance-monitoring utility for Intel
Transactional Synchronization Extensions, available in processors
with the Haswell microarchitecture.
In addition to the command-line monitoring tools, you will find a utility to view and
change the values of the processor's model-specific registers (MSRs), called pcm-msr.x,
and pcm-sensor.x, which produces visual plots using the KSysGuard utility from KDE. All
PCM tools require direct access to the processor's MSRs and thus call for administrative
privileges; for that reason, the PCM tools should be started using the sudo command.
Intel VTune Amplifier XE11 (we refer to it as “VTune” throughout this chapter)
provides access to the processor performance counters and has a rich user interface
for data visualization and analysis. We will talk more about VTune for multithreaded
application analysis in Chapter 6 and microarchitecture-level tuning in Chapter 7, but
it is also useful for some system-level bottleneck characterization that we outline in this
chapter. (In the following chapters of this book we will provide examples of how VTune
could help you to extract knowledge about your application behavior and to pinpoint
potential areas for performance improvement.)
Listing 4-10. Contents of the test_script File for the IOR Benchmark
IOR START
# MPIIO shared file test
reordertasksconstant=1 # defeat buffer cache for read after write by reordering tasks
fsync=1 # call fsync for POSIX I/O before close
intraTestBarriers=1 # use barriers between open/read/write/close
repetitions=2
verbose=2
keepFile=0
segmentCount=10000
blockSize=1000000
fsync=0
filePerProc=0
api=MPIIO # Compare MPIIO to POSIX shared
collective=1 # enables data shipping optimization
testFile = IOR_MPIIO_Test # File name
transferSize=100000 # I/O call size
RUN
IOR STOP
You should adjust the segmentCount parameter to ensure that the IOR benchmark
files are not cached in memory. The value of segmentCount should be chosen
so that the total amount of data written is greater than 1.5 times the available physical
memory of the compute clients involved in the test, so as to avoid OS caching effects. The
total size of the produced file, filesize, is given by the following formula:
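filesize = segmentCount × blockSize × (number of IOR client processes)
With the parameters above and the 24 client processes used below, this amounts to
10,000 × 1,000,000 bytes × 24 ≈ 240 GB; this is a back-of-the-envelope figure that assumes
the usual IOR convention whereby every process writes segmentCount segments of
blockSize bytes into the shared file.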
Now you are ready to run the IOR benchmark. Let's start it on a workstation with two
local disks set up in a mirror. Launch the following command:
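Here is a sketch of such a launch, assuming the IOR binary and the test_script file reside
in the current directory; the -f option points IOR at the configuration file:
$ mpirun -np 24 ./IOR -f test_script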
It will execute the IOR benchmark using 24 client processes with the input
configuration file called test_script. IOR will execute both read and write tests for each
run, doing two repetitions of each and calculating the maximum values. The bandwidth
numbers of interest are the results listed as the Max Write and Max Read measured in MB/s.
While the benchmark is running, let's look at a couple of tools for measuring the bandwidth
stress that the IOR benchmark puts on the I/O subsystem. A very rough
idea can be obtained by running the following command:
$ vmstat 1
It will print statistics every second (or at whatever interval you provide on the command
line), as shown in Listing 4-11.
The most interesting columns are labeled bi (blocks read from the I/O
devices) and bo (blocks written to the I/O devices), showing the I/O load in units of
1024 bytes per second. The above example shows that around 90 megabytes
were written in the last second. Another key system statistic is the number of interrupts
(labeled in) processed every second, which in this example is over 10,000 and
accounts for the 6 percent wait time reported in the column wa.
A very nice tool to capture I/O utilization statistics and see them per process in a
top-like output is iotop.15 It has to be run as root, as follows:
$ sudo iotop
The output of iotop is presented in Figure 4-2. It shows both the total
disk bandwidth (around 60 MB/s) and the sustained I/O bandwidth per application
process, such as 2.3 MB/s for the IOR instances.
Figure 4-2. Output of the iotop utility obtained while running the IOR write performance test
The IOR disk test takes almost an hour on our workstation and completes with the
performance summary showing an average sustained write and read bandwidth of
82.93 MiB/s and 60.76 MiB/s, respectively:
Let us now look at a more complex example, where vmstat and other tools that read the
operating system state will not help. Specifically, when a parallel file system is connected
using a high-speed fabric such as InfiniBand, the file system often provides its own
set of drivers and tools to manage the mount points. One such example is the
high-performance IBM General Parallel File System (GPFS).16 In our next example, we run the
same IOR benchmark and measure performance during its read test.
The parameter testFile in the test_script file is changed to a new location residing
in the GPFS storage. The vmstat command executed on one of the compute nodes
participating in the IOR test showed no I/O activity at all. However, the presence of large I/O
happening on the node is indicated by around 20,000 interrupts processed every second:
$ vmstat 1
procs ----------memory----------- --swap-- --io-- ---system-- ------cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
17 0 0 62642488 4692 118004 0 0 0 0 20701 73598 69 7 25 0 0
15 0 0 62642244 4692 118004 0 0 0 0 19221 66678 70 7 23 0 0
14 0 0 62642932 4692 118004 0 0 0 0 21165 77092 69 6 25 0 0
To quantify the rate of data transfer, better tools are needed. We know that on this
cluster, the compute nodes use the InfiniBand network to access the storage servers, and
that the InfiniBand adapter is connected to the Intel Xeon processor via the PCI Express
bus. By analyzing the amount of traffic going via the PCIe bus, we could estimate the
bandwidth achieved by each node. Using the pcm-pcie.x tool from the Intel PCM toolset
will help us quantify the I/O load on the client nodes. The tool uses performance counters
to report the number of cache lines read and written by the I/O controller integrated into
the processors. So, to see the PCIe bus load, execute the following command:
$ sudo ./pcm-pcie.x -B 1
In this command, the option -B instructs the tool not only to report the number of
written cache lines but also to estimate the bandwidth, while the number provided as the
command-line option indicates the required refresh interval. The tool produced output
every second, as shown in Figure 4-3:
In Figure 4-3, we clearly see an estimated read bandwidth of 820 MB/s, delivered via over
12 million 64-byte cache-line transfers done every second through the integrated I/O
controller of one of the sockets (specifically, processor socket 0). This observation
gives us a clear indication of the system I/O utilization by the IOR benchmark. Previously
we found that the InfiniBand network can transfer data at rates exceeding 5.4 GB/s,
so the IOR benchmark consumes approximately 15 percent of the available bandwidth;
the bottleneck in this particular case is not in the network but, rather, likely in the storage
servers or the disk shelves. As another observation, the bandwidth for reading data from the
shared parallel file system is over 16 times higher than that of a standard SATA spinning disk.
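As a quick cross-check of these numbers: 12.8 million cache-line transfers per second
times 64 bytes per cache line gives roughly 820 MB/s, and 0.82 GB/s divided by the
5.4 GB/s link bandwidth measured earlier with the perftest tools is indeed about 15 percent.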
EXERCISE 4-4
Characterize your favorite application to measure the consumed I/O bandwidth using
the different tools described above. How much bandwidth is used from the peak
throughput? Note: you may need to additionally consult the datasheets for your local
disks if the local file I/O is used.
To see the memory traffic generated while the STREAM benchmark is running, start the
memory monitoring tool:
$ sudo ./pcm-memory.x
The output produced with the default refresh interval of 1 second is shown in Listing 4-12.
The tool measures the memory bandwidth observed on every channel (four per socket, in
the case of the Intel Xeon E5-2600 series), reporting separately the throughput of reads
from memory and of writes to memory. We see that each memory channel is utilized at
approximately 11 GB/s and that the total memory utilization is around 87,832 MB/s, which is
close to the benchmark's own report of around 86,000 MB/s (as presented in Listing 4-3). The
PCM tool tends to report values slightly higher than the application's own measurement
because PCM measures all memory traffic, not only the traffic specific to the application or
to the arrays being monitored inside the benchmark.
Listing 4-12. Output of PCM Memory Monitoring Tool while Characterizing the STREAM
Benchmark
---------------------------------------||---------------------------------------
-- Socket 0 --||-- Socket 1 --
---------------------------------------||---------------------------------------
---------------------------------------||---------------------------------------
---------------------------------------||---------------------------------------
-- Memory Performance Monitoring --||-- Memory Performance Monitoring --
---------------------------------------||---------------------------------------
-- Mem Ch 0: Reads (MB/s): 6847.91 --||-- Mem Ch 0: Reads (MB/s): 6829.25 --
-- Writes(MB/s): 4137.15 --||-- Writes(MB/s): 4133.35 --
-- Mem Ch 1: Reads (MB/s): 6855.36 --||-- Mem Ch 1: Reads (MB/s): 6834.53 --
-- Writes(MB/s): 4136.61 --||-- Writes(MB/s): 4128.56 --
-- Mem Ch 4: Reads (MB/s): 6847.00 --||-- Mem Ch 4: Reads (MB/s): 6828.56 --
-- Writes(MB/s): 4138.83 --||-- Writes(MB/s): 4134.27 --
-- Mem Ch 5: Reads (MB/s): 6864.27 --||-- Mem Ch 5: Reads (MB/s): 6844.44 --
-- Writes(MB/s): 4139.95 --||-- Writes(MB/s): 4132.56 --
-- NODE0 Mem Read (MB/s): 27414.54 --||-- NODE1 Mem Read (MB/s): 27336.79 --
-- NODE0 Mem Write (MB/s): 16552.54 --||-- NODE1 Mem Write (MB/s): 16528.75 --
-- NODE0 P. Write (T/s): 49936444 --||-- NODE1 P. Write (T/s): 49307477 --
-- NODE0 Memory (MB/s): 43967.07 --||-- NODE1 Memory (MB/s): 43865.54 --
---------------------------------------||---------------------------------------
-- System Read Throughput(MB/s): 54751.32 --
-- System Write Throughput(MB/s): 33081.29 --
-- System Memory Throughput(MB/s): 87832.62 --
---------------------------------------||---------------------------------------
Another way to observe sustained memory utilization is to use Intel VTune Amplifier
XE. VTune provides a graphical interface and, among many other things, reports memory
bandwidth utilization of the application. Let’s consider a quick example of the memory
bandwidth analysis using the graphical interface of VTune.
First, in a terminal window under the X Window System, source the VTune environment
settings. For Bash-compatible shells, use:
$ source /opt/intel/vtune_amplifier_xe/amplxe-vars.sh
For C-shell-compatible shells, use instead:
$ source /opt/intel/vtune_amplifier_xe/amplxe-vars.csh
These scripts will update all the necessary environment variables. For convenience,
you can change your working directory to the folder where the STREAM benchmark
is located. Now, start the VTune graphical user interface (GUI) using the amplxe-gui
command, create a project, and specify the path to the STREAM benchmark script that
was presented earlier. Go to the New Analysis, select Bandwidth analysis, and click the
Start button. VTune will start the benchmark and will wait until the application finishes.
You can always press the Stop button to interrupt the benchmark and proceed to analysis;
VTune will terminate the application then. After VTune finishes parsing the performance
profile, you will be able to find the consumed bandwidth timeline per processor package
in the Bottom-up tab, as shown in Figure 4-4.
Figure 4-4. Memory bandwidth analysis for the STREAM benchmark with Intel VTune
Amplifier XE 2015
The values of the sustained memory bandwidth observed in VTune are similar to
those presented by PCM: around 44 GB/s for each processor socket, where approximately
30 GB/s are taken by the memory read traffic. The graphical representation of the
timeline in VTune provides additional information, such as a clearly defined memory
allocation phase, followed by 10 iterations of four benchmark kernels (called COPY,
SCALE, ADD, and TRIAD), and finally the verification stage. The visual representation
of the memory bandwidth over time helps you to see immediately what part of the
application is memory bandwidth bound.
Of course, VTune provides a great set of additional features beyond counting certain
event occurrences, such as finding that the average number of processor clocks per
instruction (CPI) for STREAM is over 9.5. Based on statistical event sampling,
VTune allows you to drill down to specific parts of your code and correlate the
performance of many simultaneously collected events with the time taken by a specific
code path.
In the following chapters you will find many more examples of using VTune to
analyze applications performance, with a detailed introduction to VTune in Chapter 6.
EXERCISE 4-5
Analyze the memory bandwidth consumption of your favorite program using the
different tools described above. How much bandwidth is used from the peak
throughput? What part of the application consumes over 80 percent of the peak
memory bandwidth available in your specific system, found as the result of doing
Exercise 4-2?
Summary
In this chapter we have looked at the main types of bottlenecks and have classified
potential issues at the system level that are related to environment conditions or
the configuration of the system operations. Prior to fine-tuning any application
performance, you want the system to achieve a known good condition and deliver
expected performance on basic kernel benchmarks. These microkernels should cover at
least computational performance, memory bandwidth and latency, as well as external
interconnect bandwidth and latency. Good candidates for such tests are:
• DGEMM, with the nodeperf program coming with Intel Math
Kernel Library, to test computational performance.
• STREAM benchmark to measure memory bandwidth, and Intel
Memory Latency Checker to assess memory latency.
• For RDMA-capable high-performance interconnects such as
InfiniBand, the perftest to find maximum achievable bandwidth
and minimum latency.
The application performance dependency on the system-level characteristics can
be understood by monitoring the resource utilization using tools that rely on the software
performance counters (top, vmstat, iostat) and the utilities that collect data from the
built-in hardware performance counters (Intel VTune, Intel PCM, perf).
References
1. “MCElog project,” http://freecode.com/projects/mcelog.
2. Intel Open Source Technology Center, “PowerTOP Home,” http://01.org/powertop.
3. Intel Corporation, “Intelligent Platform Management Interface,” http://www.intel.com/
content/www/us/en/servers/ipmi/ipmi-home.html.
4. “lm_sensors - Linux hardware monitoring,” http://lm-sensors.org.
5. “Ganglia Monitoring System,” http://ganglia.sourceforge.net.
6. “The Industry Standard In IT Infrastructure Monitoring,” http://www.nagios.org.
7. John D. McCalpin, “Memory Bandwidth and Machine Balance in Current High
Performance Computers”, IEEE Computer Society Technical Committee on Computer
Architecture (TCCA) Newsletter, p. 19-25, December 1995.
8. “Orders of Magnitude (data),” http://en.wikipedia.org/wiki/Orders_of_
magnitude_(data).
9. Intel Corporation, “Intel Memory Latency Checker,” https://software.intel.com/
en-us/articles/intelr-memory-latency-checker.
10. Intel Corporation, “Intel Performance Counter Monitor: A better way to measure
CPU utilization,” https://software.intel.com/en-us/articles/intel-
performance-counter-monitor-a-better-way-to-measure-cpu-utilization.
11. Intel Corporation, “Intel VTune Amplifier XE,” https://software.intel.com/en-us/
intel-vtune-amplifier-xe.
12. “IOR HPC benchmark,” http://sourceforge.net/projects/ior-sio/.
13. HDF Group, “HDF5 Home Page,” http://www.hdfgroup.org/HDF5/.
14. “POSIX,” http://en.wikipedia.org/wiki/POSIX.
15. G. Chazarain, “Iotop,” http://guichaz.free.fr/iotop/.
16. IBM, “IBM Platform Computing Elastic Storage,” www-03.ibm.com/systems/
platformcomputing/products/gpfs/.
CHAPTER 5
Addressing Application Bottlenecks: Distributed Memory
The first application optimization level accessible to the ever-busy performance analyst
is the distributed memory one, normally expressed in terms of the Message Passing
Interface (MPI).1 By its very nature, the distributed memory paradigm is concerned
with communication. Some people consider all communication as overhead—that
is, something intrinsically harmful that needs to be eliminated. We tend to call it
“investment.” Indeed, by moving data around in the right manner, you hope to get more
computational power in return. The main point, then, is to optimize this investment so
that your returns are maximized.
The time spent on the problem analysis and solution is an integral part of the
overall investment. Hence, it is important to detect quickly what direction may be
successful and what is going to be a waste of time, and to focus on the most promising
leads. Following this pragmatic approach, in this chapter we will show how to detect
and exploit optimization opportunities in the realm of communication patterns. Further
chapters will step deeper into the increasingly local optimization levels. “And where
are the algorithms?” you may ask. Well, we will deal with them as we go along, because
algorithms will cross our path at every possible level. If you have ever tried to optimize
bubble sort and then compared the result with the quick sort, you will easily appreciate
the importance of algorithmic optimization.
Once you have set up the environment as described in Chapter 4, you can run this MPI
program in the way you probably know better than we do. On a typical system with the
Intel MPI library installed, this would look as follows:
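The command below is a sketch following the standard Intel MPI Benchmarks conventions
(the IMB-MPI1 binary name and the PingPong benchmark selection):
$ mpirun -np 2 ./IMB-MPI1 PingPong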
By default, the Intel MPI library will try to select the fastest possible communication
path for any particular runtime configuration. Here, the most likely candidate is
the shared memory channel. On our workstation, this leads to the output (skipping
unessential parts) shown in Listing 5-1:
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 1.16 0.00
1 1000 0.78 1.22
2 1000 0.75 2.53
4 1000 0.78 4.89
8 1000 0.78 9.77
16 1000 0.78 19.55
32 1000 0.88 34.50
64 1000 0.89 68.65
128 1000 0.99 123.30
256 1000 1.04 234.54
512 1000 1.16 420.02
1024 1000 1.38 706.15
2048 1000 1.63 1199.68
4096 1000 2.48 1574.10
8192 1000 3.74 2090.00
16384 1000 7.05 2214.91
32768 1000 12.95 2412.56
65536 640 14.93 4184.94
131072 320 25.40 4921.88
262144 160 44.55 5611.30
524288 80 91.16 5485.08
1048576 40 208.15 4804.20
2097152 20 444.45 4499.96
4194304 10 916.46 4364.63
message size. If you want to reduce this to just two numbers for the whole message
range, take the zero-byte message latency and the peak bandwidth at whatever message
size it is achieved. Note, however, that IMB performance may differ from what you see in
a real application.
From the output shown here we can deduce that zero-byte message latency is
equal to 1.16 microseconds, while the maximum bandwidth of 5.6 GB/s is achieved
on messages of 256 KiB. This is what the shared memory communication channel,
possibly with some extra help from the networking card and other MPI implementor
tricks, is capable of achieving in the default Intel MPI configuration. Note that the
default intranode MPI latency in particular is 7 to 20 times the memory access latency,
depending on the exact communication path taken (compare Listing 5-4). This is the
price you pay for the MPI flexibility, and this is why people call all communication
“overhead.” This overhead is what may make threading a viable option in some cases.
■ Note The Intel MPI Library is tuned by default for better bandwidth rather than for
lower latency, so that the latency can easily be improved by playing a bit with the process
pinning. We will look into this in due time.
The general picture of the bandwidth values (the last column in Listing 5-1) is almost
normal: they start small, grow to the L2 cache peak, and then go down stepwise, basically
reaching the main memory bandwidth on very long messages (most likely, well beyond
the 4 MiB cutoff selected by default).
However, looking a little closer at the latency numbers (third column), we notice
an interesting anomaly: zero-byte latency is substantially larger than that for 1-byte
messages. Something is fishy here. After a couple of extra runs we can be sure of this
(anomalous values are highlighted in italic; see Table 5-1):
This may be a measurement artifact, but it may as well be something worth keeping
in mind if your application is strongly latency bound. Note that doing at least three runs is
a good idea, even though your Statistics 101 course told you that this is not enough to get
anywhere close to certainty. Practically speaking, if you indeed have to deal with outliers,
you will be extremely unlucky to get two or all three of them in a row. And if just one
outlier is there, you will easily detect its presence and eliminate it by comparison with the
other two results. If you still feel unsafe after this rather unscientific passage, do the necessary
calculations and increase the number of trials accordingly.
Let us try to eliminate the artifact as a factor by increasing the number of iterations
done per message size tenfold from its default value of 1000:
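A sketch of the modified invocation:
$ mpirun -np 2 ./IMB-MPI1 -iter 10000 PingPong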
The option -iter 10000 requests 10,000 iterations to be done for each message size.
This is what we get this time (again, skipping unessential output); see Listing 5-2.
Listing 5-2. Modified IMB-MPI1 Output (Workstation, Intranode, with 10,000 Iterations)
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 10000 0.97 0.00
1 10000 0.80 1.20
2 10000 0.80 2.39
4 10000 0.78 4.87
8 10000 0.79 9.69
16 10000 0.79 19.33
32 10000 0.93 32.99
64 10000 0.95 64.06
128 10000 1.06 115.61
256 10000 1.05 232.74
512 10000 1.19 412.04
1024 10000 1.40 697.15
2048 10000 1.55 1261.09
4096 10000 1.98 1967.93
8192 5120 3.21 2437.08
16384 2560 6.27 2493.14
32768 1280 11.38 2747.05
65536 640 13.35 4680.56
131072 320 24.89 5021.92
262144 160 44.77 5584.68
524288 80 91.44 5467.92
1048576 40 208.23 4802.48
2097152 20 445.75 4486.85
4194304 10 917.90 4357.78
From this, it does look like we get a measurement artifact at the lower message sizes,
just because the machine is lightning fast. We can increase the iteration count even more
and check that out.
EXERCISE 5-1
Verify the existence of the IMB short message anomaly on your favorite platform. If it
is observable, file an issue report via Intel Premier Support.4
As before, the peak intranode bandwidth of 5.6 GB/s at 256 KiB is confirmed, and we
can deduce that the intranode bandwidth stabilizes at about 4.4 GB/s for large messages.
These are quite reasonable numbers, and now we can proceed to investigate other
aspects of the baseline MPI performance.
Before we do this, just to be sure, we will do two extra runs (oh, how important it is to
be diligent during benchmarking!) and put the new data into a new table (anomalous
values are again highlighted in italic); see Table 5-2:
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 10000 0.56 0.00
1 10000 0.56 1.69
2 10000 0.57 3.37
4 10000 0.57 6.73
You can see not only that now the anomaly is gone but also that the numbers have
changed quite substantially. This is in part why an application may behave differently
from the most carefully designed benchmark. It is arguable whether doing special
preconditioning of the benchmark like the one described earlier is valid all the time,
so we will refrain from this approach further on.
Of course, we will keep all the log files, clearly named, safe and sound for future
reference. The names like IMB-MPI1-n1p2-PingPong.logN, where N stands for the run
number, will do just fine in this case. The notation n1p2 tells us that the results have been
obtained on one node using two MPI processes.
Since in this case we are going to use more than one node, MPI startup will of
necessity be a bit more complicated, because the MPI library should be made aware of
the identity of the nodes we intend to run on. By far the easiest way that also leaves a
clear log trace of what exactly was done is to specify those nodes explicitly in the IMB
invocation command. For instance, on our example system:
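A sketch of such an invocation, placing one rank on each of the two nodes:
$ mpirun -np 2 -ppn 1 -hosts esg054,esg055 ./IMB-MPI1 PingPong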
Here, esg054 and esg055 stand for the respective node hostnames. They are very
likely to be rather different in your installation. If you’re in doubt, ask your friendly
systems administrator.
■ Note There are certainly more elegant and powerful ways of selecting the target nodes
for an Intel MPI run. Do not worry; we will learn them one by one in due time. This precise
inline method is just what we need right now.
Of course, your cluster may be controlled by a job-management system like PBS Pro,
LSF, Torque, or one of half a dozen other alternative products. The chances are that
mpirun will recognize any of them and allow a job to be started anyway, but this is a topic
we would need to devote a whole chapter to. Just ask one of the local experts you know,
and he or she will tell you what is needed to submit multiple node jobs.
Another conceptual complication that we will deal with is the way in which both
nodes will communicate with each other. Normally, as in the intranode case, Intel MPI
library will automatically try to select the fastest available communication path. Most
likely, this will be InfiniBand on a dedicated HPC cluster and some Gigabit Ethernet on a
general purpose cluster. In the case of InfiniBand, we get the following output on our test
cluster introduced in Chapter 4; see Listings 5-4 and 5-5:
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 0.67 0.00
1 1000 0.67 1.42
2 1000 0.68 2.82
4 1000 0.68 5.62
8 1000 0.70 10.85
16 1000 0.71 21.54
32 1000 0.86 35.63
64 1000 0.88 69.40
128 1000 0.98 124.95
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 1.09 0.00
1 1000 1.09 0.88
2 1000 1.09 1.75
4 1000 1.10 3.47
8 1000 1.10 6.91
16 1000 1.11 13.74
32 1000 1.15 26.44
64 1000 1.16 52.71
128 1000 1.23 98.97
256 1000 1.87 130.55
512 1000 1.98 246.30
1024 1000 2.30 425.25
2048 1000 2.85 685.90
4096 1000 3.42 1140.67
8192 1000 4.77 1639.06
16384 1000 7.28 2145.56
32768 1000 10.34 3021.38
65536 1000 16.76 3728.35
131072 1000 28.36 4407.30
262144 800 45.51 5493.00
524288 400 89.05 5614.98
1048576 200 171.75 5822.49
2097152 100 338.53 5907.97
4194304 50 671.06 5960.72
Several interesting differences between the shared memory and the InfiniBand paths
are worth contemplating. Let’s compare these results graphically; see Figures 5-1 and 5-2.
Now, let’s enumerate the differences that may be important when we later start
optimizing our application on the target cluster:
1. Intranode latency is substantially better than internode
latency on smaller message sizes, with the crossover occurring
at around 8 KiB. Hence, we should try to put onto the same
node as many processes that send smaller messages to each
other as possible.
2. Internode bandwidth is considerably higher than intranode
bandwidth on larger messages above 8 KiB, with the
exception of roughly 64 KiB, where the curves touch again.
Hence, we may want to put onto different nodes those MPI
ranks that send messages larger than 8 KiB, and surely larger
than 64 KiB, to each other.
3. It is just possible that InfiniBand might be beating the shared
memory path on the intranode bandwidth, as well. Since Intel
MPI is capable of exploiting this after a minor adjustment,
another small investigation is warranted to ascertain whether
there is any potential for performance improvement in using
InfiniBand for larger message intranode transfers.
45 1 11 1
46 1 12 1
47 1 13 1
===== Placement on packages =====
Package Id. Core Id. Processors
0 0,1,2,3,4,5,8,9,10,11,12,13
(0,24)(1,25)(2,26)(3,27)(4,28)(5,29)(6,30)
(7,31)(8,32)(9,33)(10,34)(11,35)
1 0,1,2,3,4,5,8,9,10,11,12,13
(12,36)(13,37)(14,38)(15,39)(16,40)(17,41)
(18,42)(19,43)(20,44)(21,45)(22,46)(23,47)
This utility outputs detailed information about the Intel processors involved. On
our example workstation we have two processor packages (sockets) with 12 physical cores
apiece, each core in turn running two hardware threads, for a total of 48 hardware
threads for the whole machine. Disregarding gaps in the core numbering, they look well
organized. It is important to notice that all cores of a socket share that socket's 30 MB L3
cache, while the much smaller L1 and L2 caches are shared only by the virtual cores
(OS processors) that are closest to each other in the processor hierarchy. This may have
interesting performance implications.
Now, let’s see how Intel MPI puts processes onto the cores by default. Recalling
Chapter 1, for this we can use any MPI program, setting the environment variable
I_MPI_DEBUG to 4 in order to get the process mapping output. If you use a simple
start/stop program containing only calls to the MPI_Init and MPI_Finalize, you will get
output comparable to Listing 5-7, once unnecessary data is culled from it:
Comparing Listings 5-6 and 5-7, we can see that the first eight MPI processes
occupy the first processor package, while the remaining eight MPI processes occupy
the other package. This is good if we require as much bandwidth as we can get, for
two parts of the job will be using separate memory paths. This may be bad, however,
if the relatively slower intersocket link is crossed by very short messages that clamor
for the lowest possibly latency. That situation would normally favor co-location of the
intensively interacting processes on the cores that share the highest possible cache level,
up to and including L1.
To see what changes when the MPI processes are pinned to the physical cores only, set the
respective Intel MPI environment variable:
$ export I_MPI_PIN_PROCESSOR_LIST=allcores
If you wonder what effect this will have upon performance, compare Listing 5-1 with
Listing 5-8:
Listing 5-8. Example IMB-MPI1 Output (Workstation, Intranode, Physical Cores Only)
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 0.58 0.00
1 1000 0.61 1.56
2 1000 0.62 3.08
4 1000 0.27 14.21
8 1000 0.28 27.65
16 1000 0.32 48.05
32 1000 0.37 81.48
64 1000 0.38 161.67
Note that we can still observe the small message latency anomaly in some form.
This becomes outright intriguing. For the rest of it, latency is down by up to three times
and bandwidth is up by 40 to 50 percent, with bandwidth in particular still going up,
whereas it would sharply drop in prior tests. This is natural: in the absence of necessity
to share both the core internals and the off-core resources typical of the virtual cores,
MPI performance will normally go up. This is why pure MPI programs may experience a
substantial performance hike when run on the physical cores.
Note also that the performance hike observed here has to do as well with the
change in the process layout with respect to the processor sockets. If you investigate the
process layout and pinning in both cases (not shown), you will see that in the default
configuration, MPI ranks 0 and 1 occupy different processor sockets, while in the
configuration illustrated by Listing 5-8, these ranks sit on adjacent physical cores of the
same processor socket. That is, the observed difference is also the difference between the
intersocket and intrasocket performance, respectively.
At this point we have discovered about 90 percent of what needs to be known about
the underlying MPI performance. You might want to run more complicated IMB sessions
and see how particular collective operations behave on more than two processes and
so on. Resist this temptation. Before we go there, we need to learn a bit more about the
target application.
EXERCISE 5-2
Compare the virtual and physical core performance of your favorite platform using
the procedure described here. Try the -cache_off option of the IMB to assess the
influence of the cache vs. memory performance at the MPI level. Consider how
relevant these results may be to your favorite application.
We do not have to select the workload because HPL generates it automatically during
startup. What we need to change are a few workload parameters; see Listing 5-9:
Listing 5-9. HPL Input File (HPL.dat) with the Most Important Parameters Highlighted
Some of the points to note from the input file in Listing 5-9 are (a worked sizing example follows this list):
• Problem size (N) is normally chosen to take about 80 percent of
the available physical memory, following the formula memory = 8N² for
double-precision calculations.
• Block size (NB) usually ranges between 32 and 256, with
the higher values promoting higher computational efficiency
while creating more communication.
• Process grid dimensions (P and Q), where both P and Q are
typically greater than 1, P is equal to or slightly smaller than
Q, and the product of P and Q is the total number of processes
involved in the computation.
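To illustrate the sizing rule for the cluster used below (eight nodes with 64 GiB of memory each), a quick back-of-the-envelope calculation rounded down to a multiple of NB = 256 might look like this (the 80 percent factor is the convention mentioned above, not an HPL requirement):
$ awk 'BEGIN { mem = 8 * 64 * 2^30; n = sqrt(0.8 * mem / 8); print int(n / 256) * 256 }'
234240
This is close to the N = 235,520 actually used in Listing 5-9.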
This and further details are well explained in the HPL FAQ.8 As can be seen,
Listing 5-9 was generated for a matrix size of 235,520, yielding total occupied
memory of about 413 GiB. We used a block size of 256 and the process grid dimensions 4 x 4.
A quick look into the built-in statistics output given in Listing 1-1 that was obtained for
this input data shows that MPI communication occupied between 5.3 and 11.3 percent of
the total run time, and that the MPI_Send, MPI_Recv, and MPI_Wait operations took about
81, 12, and 7 percent of the total MPI time, respectively. The truncated HPL output file
(see Listing 5-10) reveals that the run completed correctly, took about 40 minutes, and
achieved about 3.7 TFLOPS.
Listing 5-10. HPL Report with the Most Important Data Highlighted (Cluster, 16 MPI
Processes)
============================================================================
HPLinpack 2.1 -- High-Performance Linpack benchmark -- October 26, 2012
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
============================================================================
N : 235520
NB : 256
PMAP : Column-major process mapping
P : 4
Q : 4
PFACT : Right
NBMIN : 4
NDIV : 2
RFACT : Crout
BCAST : 1ring
DEPTH : 1
SWAP : Binary-exchange
L1 : no-transposed form
U : no-transposed form
EQUIL : no
ALIGN : 8 double precision words
----------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
Column=001280 Fraction=0.005 Mflops=4809238.67
Column=002560 Fraction=0.010 Mflops=4314045.98
...
Column=210944 Fraction=0.895 Mflops=3710381.21
Column=234496 Fraction=0.995 Mflops=3706630.12
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WC10C2R4 235520 256 4 4 2350.76 3.70500e+03
HPL_pdgesv() start time Fri Feb 14 05:44:48 2014
Now, let’s document what we have found. The input and output files form the basis
of this dataset that needs to be securely stored. In addition to this, we should note that
this run was done on eight nodes with two Ivy Bridge processors, with 12 physical cores in
turbo mode per processor and 64 GiB of memory per node.
The following tools were used for this run:
• Intel MPI 5.0.1
• Intel MKL 11.2.0 (including MP_LINPACK binary precompiled by
Intel Corporation)
• Intel Composer XE 2015
The following environment settings were used:
export I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1
export I_MPI_PIN=enable
export I_MPI_PIN_DOMAIN=socket
export OMP_NUM_THREADS=12
export KMP_AFFINITY=verbose,granularity=fine,physical
export I_MPI_STATS=ipm
Some of these variables are set by default. However, setting them explicitly increases
the chances that we truly know what is being done by the library. The first line indicates
a particular communication fabric to be used by Intel MPI. The next four lines control
the Intel MPI and OpenMP process and thread pinning. (We will look into why and how
here, and in Chapter 6.) The last line requests the built-in, IPM-style statistics output to be
produced by the Intel MPI Library.
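With these variables exported, the launch itself might have looked roughly as follows (a sketch only; the host file name and the path to the HPL binary are placeholders):
$ mpirun -f ./hosts -ppn 2 -np 16 ./xhpl
Two processes per node on eight nodes give the 16 MPI processes of the 4 x 4 grid, one process per socket with 12 OpenMP threads each.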
This dataset complements the lower-level data about the platform involved that we
collected and documented in Chapter 4. Taken together, they allow us to reproduce this
result if necessary, or to root-cause any deviation that may be observed in the future
(or in the past).
Since this program has not been designed to run on small problem sizes or small
numbers of processes, it does not make much sense to continue with further runs before we
draw a preliminary conclusion. One data point will be sufficient for deciding what
to do next. If we compute the efficiency achieved during this run, we see it comes to about
90 percent. This is not far from the expected top efficiency of about 95 percent. From this
observation, as well as the MPI communication percentages shown here and Amdahl’s
Law explained earlier, we can deduce that there is possibly a 2 percent, at most a 3 percent, overall
performance upside in tuning MPI. In other words, it makes sense to spend more time
tuning MPI for this particular application only if you are after those last few extra drops of
efficiency. This may very well be the case if you want to break a record, by the way.
Just for comparison, we took the stock HPL from netlib.org and compared it to
the optimized version presented here. The only tool in common was the Intel MPI library.
We used the GNU compiler, the reference BLAS library from netlib.org, and whatever default
settings were included in the provided Makefile.9 First, we were not able to run the
full-size problem owing to a segmentation fault. Second, the matrix size of 100,000 was
taking so much time that it was impractical to wait for its completion. Third, on the very
modest matrix size of 10,000, with the rest of the aforementioned HPL.dat file unchanged
(see Listing 5-9), we got 35.66 GFLOPS for the stock HPL vs. 152.07 GFLOPS for the
optimized HPL, or a factor of more than four in favor of the optimized HPL. As we
know from the estimates given, and a comparison of the communication statistics (not
shown), most of this improvement does not seem to be coming from the MPI side of the
equation. We will revisit this example in the coming chapters dedicated to other levels of
optimization to find out how to get this fabulous acceleration.
All this may look to you like a very costly exercise in finding out the painfully obvious.
Of course, we know that Intel teams involved in the development of the respective tools
have done a good job optimizing them for one of the most influential HPC benchmarks.
We also know what parameters to set and how, both for the application and for the Intel
MPI Library. However, all this misses the point. Even if you had a different application at
hand, you would be well advised to follow this simple and efficient routine before going
any further. By the way, if you miss beautiful graphs here, you will do well to get used to
this right away. Normally you will have no time to produce any pictures, unless you want
to unduly impress your clients or managers. Well-organized textual information will
often be required if you take part in the formal benchmarking efforts. If you would rather
analyze data visually, you will have to find something better than the plain text tables and
Excel graphing capabilities we have gotten used to.
EXERCISE 5-3
There is also a corresponding report file with the file extension .yaml, shown in
Listing 5-12:
CG solve:
Iterations: 200
Final Resid Norm: 0.00404725
WAXPY Time: 21.2859
WAXPY Flops: 2.2575e+11
WAXPY Mflops: 10605.6
DOT Time: 6.72744
DOT Flops: 1e+11
DOT Mflops: 14864.5
MATVEC Time: 98.8167
MATVEC Flops: 1.35947e+12
MATVEC Mflops: 13757.4
Total:
Total CG Time: 126.929
Total CG Flops: 1.68522e+12
Total CG Mflops: 13276.9
Time per iteration: 0.634643
Total Program Time: 185.796
From the last few lines of Listing 5-12, you can see that we achieve about
13.3 GFLOPS during the conjugate gradient (CG) solution stage, taking 185.8 seconds for
the whole job. Now we will look into whether this is the optimum we are after with respect
to the problem size, the number of the MPI processes, and the number of OpenMP
threads that are used implicitly by the Intel MKL. For comparison, we achieved only
10.72 MFLOPS for the problem size of 10 and 12.72 GFLOPS for the problem size of 100,
so that there is some dependency here.
For now, let’s do a quick investigation of the MPI usage along the lines mentioned.
If we collect the Intel MPI built-in statistics, we get the output seen in Listing 5-13:
############################################################################
#
# command : ./miniFE.x (completed)
# host : book/x86_64_Linux mpi_tasks : 16 on 1 nodes
# start : 05/27/14/17:21:30 wallclock : 185.912397 sec
# stop : 05/27/14/17:24:35 %comm : 7.34
# gbytes : 0.00000e+00 total gflop/sec : NA
#
############################################################################
# region : * [ntasks] = 16
#
# [total] <avg> min max
# entries 16 1 1 1
# wallclock 2974.58 185.911 185.91 185.912
# user 3402.7 212.668 211.283 213.969
# system 20.6389 1.28993 0.977852 1.56376
# mpi 218.361 13.6475 4.97802 20.179
# %comm 7.3409 2.67765 10.8541
# gflop/sec NA NA NA NA
# gbytes 0 0 0 0
#
#
# [time] [calls] <%mpi> <%wall>
# MPI_Allreduce 212.649 6512 97.38 7.15
# MPI_Send 2.89075 29376 1.32 0.10
# MPI_Init 1.81538 16 0.83 0.06
# MPI_Wait 0.686448 29376 0.31 0.02
# MPI_Allgather 0.269436 48 0.12 0.01
# MPI_Irecv 0.0444376 29376 0.02 0.00
# MPI_Comm_size 0.00278258 3360 0.00 0.00
# MPI_Bcast 0.00242162 32 0.00 0.00
# MPI_Comm_rank 1.62125e-05 176 0.00 0.00
# MPI_Finalize 5.24521e-06 16 0.00 0.00
# MPI_TOTAL 218.361 98288 100.00 7.34
############################################################################
will probably play a more noticeable role here than at higher memory loads. Finally, the
program may even break if each of its computational units, be they processes or threads,
receives less than the minimum amount of data that it can sensibly handle.
As the memory load continues to grow, the workload will start occupying the
LLC pretty regularly. This is where you are likely to observe the maximum possible
computational performance of a particular computational node. This point in the
performance curve is very important because it may show to what degree the overall
problem needs to be split into smaller parts, so that those parts can be computed with
maximum efficiency by separate cluster nodes.
Further growth of the memory load will lead to part of the data spilling over into the
main system memory. At this point the application may become memory bound, unless
clever techniques or the built-in facilities of the platform, like prefetching, are able to
alleviate the detrimental effects of the spilling.
Eventually, when the size of the workload exceeds the size of the physical memory
available to the current process’s working set, the virtual memory mechanism of the
operating system will kick in and, depending on its quality and the speed of the offline
storage (like hard disk drives [HDD] or solid state disks [SSD]), this may depress
performance further.
Finally, the growing memory load will cause a job to exceed the limits of the virtual
memory subsystem, and the job will start extensively swapping data in and out of the
main memory. This effect is called thrashing. The program will effectively become
strongly I/O bound. At this point, unless the program was designed to handle out-of-core data
gracefully (like so many out-of-core solvers of yore), all bets are off.
Another, no less important aspect of the workload selection is the choice of the
typical target problem class that the benchmarking will address. For example, if the target
application is intended for computing car-to-car collisions, it may not make
much sense to benchmark it on a test case that leads to no contact and no deformation of the
objects involved.
Benchmarking and a bit of back-of-the-envelope calculations can help in choosing
the right workload size. Only your experience and knowledge of the standards, traditions,
and expectations of the target area are going to help you to choose the right workload
class. Fortunately, both selections are more often than not resolved by the clients, who tell
you upfront what they are interested in.
few experiments as possible, we find that four extra data points appear warranted, namely
50, 250, 375, and 750. If we do these extra measurements in the 16 MPI process, three
OpenMP thread configuration used so far, we can add the new data to the data already obtained,
and thus save a bit of time.
Table 5-3 shows what we get once we bring all the data together:
Recalling the characteristics of the workstation at hand, we can deduce that the
problem size of 250 is probably the last one to fit into the physical memory, although the
virtual memory mechanism will kick in anyway long before that. It does not look as if the
size of 500 was overloading the system unduly, so we can safely keep using it.
Being a proper benchmark, this program outputs a lot of useful data that can be
analyzed graphically with relative ease. For example, Figure 5-3 illustrates the absolute
contribution of various stages of the computation to the total execution time.
Figure 5-3. MiniFE stage cumulative timing dependency on the problem size
(16 MPI processes)
These curves look like some power dependency to the trained eye, and this is what
they should look like, given that the total number of mesh nodes grows as the cube of the
problem size, while the number of nonzero matrix elements grows as the square of the
problem size owing to the two-dimensional nature of the finite element interaction. This,
however, is only a speculation until you can prove it (to be continued).
Table 5-4. MiniFE CG Performance Dependency on the Process to Thread Ratio (GFLOPS,
Size 500, Workstation)
MPI proc. OpenMP thr. Run 1 Run 2 Run 3 Mean Std. dev, %
12 4 13.24 13.24 13.21 13.23 0.13
16 3 13.27 13.26 13.26 13.26 0.02
24 2 13.26 13.25 13.26 13.26 0.04
Table 5-5. MiniFE Total Time Dependency on the Process to Thread Ratio (Seconds, Size 500,
Workstation)
MPI proc. OpenMP thr. Run 1 Run 2 Run 3 Mean Std. dev, %
12 4 210.39 210.93 210.58 210.64 0.13
16 3 187.77 186.39 185.94 186.70 0.51
24 2 174.82 175.19 174.55 174.85 0.18
Table 5-6. MiniFE CG Performance Dependency on the Process Number (GFLOPS, Size 500,
Workstation, No OpenMP Threads)
MPI proc. OpenMP thr. Run 1 Run 2 Run 3 Mean Std. dev, %
8 undefined 11.95 11.83 12.04 11.94 0.90
12 undefined 13.02 13.00 13.00 13.00 0.11
16 undefined 13.32 13.32 13.32 13.32 0.01
24 undefined 13.36 13.36 13.36 13.36 0.03
48 undefined 13.19 13.20 13.20 13.19 0.04
Table 5-7. MiniFE CG Total Time Dependency on the Process Number (Seconds, Size 500,
Workstation, No OpenMP Threads)
MPI proc. OpenMP thr. Run 1 Run 2 Run 3 Mean Std. dev, %
8 undefined 257.41 254.92 257.63 256.65 0.59
12 undefined 212.77 212.80 212.28 212.61 0.14
16 undefined 185.80 185.72 185.43 185.76 0.03
24 undefined 173.37 173.71 173.65 173.58 0.11
48 undefined 160.92 160.91 160.69 160.84 0.08
By setting the environment variable KMP_AFFINITY to verbose, you can verify that
more than one OpenMP thread is started even if its number is not specified.
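For example (the problem size and process count are just the ones used above):
$ export KMP_AFFINITY=verbose
$ mpirun -np 16 ./miniFE.x -nx=500
The OpenMP runtime inside each MPI process will then report the threads it starts and their placement.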
It is interesting that we get about 100 MFLOPS extra by not setting the OpenMP thread
number explicitly, and that the total time drops still further if each of the 48 cores
runs its own MPI process. Moreover, the total time keeps dropping noticeably between the
process counts of 16 and 24. This indicates that the application has substantial scaling potential in this
strong scaling scenario.
The tendency toward performance growth with the number of MPI processes
suggests that it might be interesting to see what happens if we use only the physical cores.
Employing the recipe described earlier, we get 13.38 GFLOPS on 24 MPI processes put
on the physical cores, taking 176.47 seconds for the whole job versus 173.58 for 24 MPI
processes placed by default. So, there is no big and apparent benefit in using the physical
cores explicitly.
Thus, we are faced with the question of what configuration is most appropriate for
the following investigation. From Tables 5-4 through 5-7, it looks like 16 MPI processes
running three threads each combine reasonable overall runtime, high CG block
performance, and potential for further tweaks at the OpenMP level. One possible issue
in the 16 MPI process, three OpenMP thread configuration is that every second core will
contain parts of two different MPI processes, which may detrimentally affect the caching.
Keeping this in mind, we will focus on this configuration from now on, and count on the
24 process, two thread configuration as plan B.
EXERCISE 5-4
Perform a focused sampling around the point that we consider as the optimum for
miniFE, to verify it is indeed at least the local maximum of performance we are after.
Replace miniFE by your favorite application and repeat the investigation. If intranode
scalability results warrant this, go beyond one node.
$ source /opt/intel/itac_latest/bin/itacvars.sh
$ mpirun -trace -np 16 ./miniFE.x -nx=500
The first command establishes the necessary environment. As usual, we added this
command to the script 0env.sh included in the respective example archive, so if you
have sourced that file already, you do not need to source the specific ITAC environment
script. The -trace option in the mpirun invocation instructs the Intel MPI library to link
the executable at runtime against the ITAC dynamic library, which in turn creates the
requested trace file. Application rebuilding is not required.
If you work on a cluster or another remote computer, you will have to ship all the
files associated with the main trace file miniFE.x.stf (most of them are covered by the
file mask miniFE.x.stf*) to a computer where you have the ITAC installed. To make
this process a little easier, you can ask ITAC to produce a single trace file if you use the
following command instead of the earlier mpirun invocation:
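From what we recall, the single-file STF format can be requested through the trace collector configuration, roughly like this (treat the variable value as an assumption and verify it against the ITAC documentation referenced below):
$ mpirun -trace -genv VT_LOGFILE_FORMAT SINGLESTF -np 16 ./miniFE.x -nx=500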
You can learn more about the ways to control ITAC runtime configuration in the
product online documentation.12
Now you can run the ITAC:
$ traceanalyzer miniFE.x.stf
One way or another, after a few splash screens, the ITAC summary chart shows up
(see Figure 5-4; note that we maximized the respective view inside the ITAC window).
Figure 5-4. MiniFE trace file in ITAC summary chart (Workstation, 16 MPI processes)
This view basically confirms what we already know from the built-in statistics output.
Press the Continue button at the upper right corner, and you will see the default ITAC
screen that includes the function profile (left) and the performance assistant (right), with
the rest of the screen occupied by the main program and view menus (very top), as well as
the handy icons and the schematic timeline (below the top); see Figure 5-5, and note that
we maximized the respective window once again.
Figure 5-5. MiniFE trace file in ITAC default view (Workstation, 16 MPI processes)
The function profile at this point is basically a reiteration of the statistics output,
while the performance assistant points out an issue we may want to deal with once
we have performed the initial trace file review. To that end, let us restore the historical
ITAC trace file view. Go to the Charts item in the main chart menu and select the Event
Timeline item there. This chart will occupy the top of the screen. Again in the main view
menu, deselect the Performance Assistant item, then select the Message Profile item. Also,
hide the schematic timeline by right-clicking it and selecting the Hide item in the popup
menu. This will display the historical ITAC analysis view; see Figure 5-6.
Figure 5-6. MiniFE trace file in ITAC historical view (Workstation, 16 MPI processes)
Nothing can beat this view in the eyes of an experienced analyst. The event timeline
shows pretty clearly that the program is busy computing and communicating most of the
time after setup. However, during setup there are substantial issues concerning one of the
MPI_Allreduce calls that may need to be addressed. The message profile illustrates the
neighbor exchanges between the adjacent processes that possess the respective adjacent
slabs of the overall computation domain. These relatively short exchanges still differ in
duration by approximately four times. To make sure this is indeed the case, you can scroll
this view up and down using the scrollbar on the right. If you right-click on the Group MPI
in the function profile, and select Ungroup MPI in the popup menu, this will show how
MPI time is split between the calls. Again, this information is known to you from the
built-in statistics. Some scrolling may be required here as well, depending on the size of
your display. Alternatively, click on any column header (like TSelf) to sort the list.
Now, zoom in on a piece of the event timeline around the offending MPI_Allreduce;
move the mouse cursor where you see fit, hold and drag to highlight the selected
rectangle, and release to see the result. All charts will automatically adjust themselves to
the selected time range (see Figure 5-7).
Figure 5-7. MiniFE trace file in ITAC zoomed in upon the offending MPI_Allreduce operation
(Workstation, 16 MPI processes)
Well, this is exactly what we need to see if we want to understand what is ostensibly the main
MPI-related performance issue in this program. The updated Function Profile chart
confirms that it is indeed this MPI_Allreduce operation that takes the lion's share of MPI
communication time. On the other hand, the time spent for the actual data exchange is
very low, as can be seen in the Message Profile chart, so the volume of communication
cannot be the reason for the huge overhead observed. Therefore, we must assume this is
load imbalance. Let us take this as a working hypothesis (to be continued).
EXERCISE 5-5
Analyze the behavior of your favorite application using the process described here.
What operations consume most of the MPI communication time? Is this really
communication time or load imbalance?
Figure 5-8. MiniFE trace file in ITAC imbalance diagram (C version, Workstation, 16 MPI
processes, 3 OpenMP threads)
Of course, this may be an artifact of the model used to compute the load imbalance.
However, this certainly indicates we should look into the load imbalance first, and only
then look into further MPI communication details. Depending on this, we will decide
where to go.
Referring back to the Performance Assistant chart (see Figure 5-5), we can conclude
that the MPI_Wait issue is most likely related to the internal workings of the MPI_Allreduce
operation, which might issue a call to MPI_Wait behind the scenes. However, taken at
face value, this indication itself is somewhat misleading until we understand what stands
behind the reported issue. Indeed, if you switch to the Breakdown Mode in the view of
Figure 5-8 (not shown), you will see that small message performance of the MPI_Wait call
is the sole major contributor to the load imbalance observed (to be continued).
EXERCISE 5-6
Choose the primary optimization objective for your favorite application using the
method described in this section. Is this load imbalance or MPI tuning? How can you
justify your choice?
the nodes, there are so many ways to get things wrong that, most likely, only proper
application design upfront can “guarantee” success of the undertaking. Indeed, in such a
cluster, you will have several effects uniting their forces to cause trouble:
1. Differences in the clock rate and functionality of the
processors involved. These differences may go up to several
times, especially as far as the clock rate is concerned. You
will have to allocate proportionally less data to the weaker
components.
2. Differences between the intranode communication over the
shared memory and over the PCI Express bus. Again, the
latency and bandwidth will vary greatly, and sometimes
the relationship will not be linear. For example, PCI Express
will normally lose to the shared memory on latency but may
overtake it on bandwidth on certain message sizes, depending
on the way in which the bus is programmed.
3. Differences between the intranode communication of any
kind, on one hand, and internode communication over
the fast network, on the other. In addition to this normal
situation typical of any cluster, in a heterogeneous cluster with
accelerated nodes, there may be the need to tunnel data from
accelerator to accelerator via the PCI Express bus, over the
network, and then over the PCI Express bus on the other side.
Of course, a properly implemented MPI library will handle all of this transfer
transparently to your benefit, but you may still see big differences in the performance of
the various communication links involved, and you will have to take this into account
when partitioning the data.
On top of this, there is an interesting interaction between the component’s
computing capacity and its ability to push data to other components. Because of this
interaction, in an ideal situation, it is possible that a relatively slower component sitting
on a relatively slower interface may be loaded 100 percent of the time and cause no
trouble across the job, provided the relatively faster components get larger pieces of data
to deal with and direct the bulk data exchanges to the fastest available communication
paths. However, it may be difficult to arrive at this ideal situation. This consideration
applies, of course, to both explicit and implicit data-movement mechanisms.
Now, set the environment variable VT_PCTRACE to 5 and rerun the miniFE, asking
for the trace file to be produced. (You know how to do this.) Note, however, that the call
stack tracing requested this time is a relatively expensive procedure that will slow down the
execution, so it may make sense to take a rather low problem size, hoping that the
program execution path does not depend on it. We used the size of 50.
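Putting these pieces together, the run might look like this (the stack depth of five and the problem size of 50 are the values just mentioned; the process count is the one used so far):
$ export VT_PCTRACE=5
$ mpirun -trace -np 16 ./miniFE.x -nx=50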
Open the resulting file miniFE.x.stf in the ITAC, go to the offending MPI_Allreduce
operation in the event timeline, right-click on it, and ask for details. When you click on the
View Source Code item in the resulting popup window, you will see where the offending
MPI_Allreduce was called from (see Figure 5-9).
If you browse the source code in this window, you will see that immediately prior to
this MPI_Allreduce call, the program imposes Dirichlet boundary conditions. Very likely,
the imbalance is coming from that piece of code. This is only a guess for now, so you will
have to do more work before you can be sure of having found the culprit. However, if this
guess is correct, and given that the program itself reports very low data imbalance as far
as the distribution of nonzero matrix elements across the processes is concerned, it looks
like an algorithmic issue. If you want to address it now, you know where to start. Later on,
we will cover advanced techniques that will allow you to pinpoint the exact problematic
code location in any situation, not only in the presence of conveniently placed and easily
identifiable MPI operational brackets (to be continued in Chapter 6).
EXERCISE 5-7
Narrow down the search area by recalling from Figure 5-6 that, prior to the
problematic MPI_Allreduce operation, there was another MPI_Allreduce operation
that also synchronized the processes that were almost perfectly aligned at that
moment. What remains to be done is to repeat this procedure and find out the other
code location. Did you find the culprit?
The only hitch happens around 32 processes. It is probably caused by half of the
MPI processes running on eight physical cores, with the other half occupying the other
16 cores for themselves. This is effectively a heterogeneous situation. Indeed, Listing 5-14
shows what the process pinning looks like:
Comparing this to Listing 5-5, we get the distribution of the MPI processes among
the virtual processors, as shown in Figure 5-10.
Figure 5-10. Default process pinning (workstation, 32 MPI processes): MPI ranks
(gray upper numbers) mapped upon processor identifiers (black lower numbers)
You could argue that this may not be the fairest mapping of all, but whatever you do,
you will end up with some of the 32 MPI processes running two apiece on some physical
cores. This probably explains the hitch we observed in Table 5-9.
Next, you will observe that this application suffers from noticeable load imbalance
and MPI overhead (called “interconnect” in the imbalance diagram; see Figure 5-11).
Figure 5-11. MiniMD trace file in ITAC imbalance diagram (Workstation, 16 MPI processes)
There is something to be gained on the MPI side of the equation, at least on the default
workload in.lj.miniMD. We can find out what exactly is contributing to this by
comparing the real and ideal traces, ungrouping the Group MPI and sorting the list by
TSelf (see Figure 5-12).
Figure 5-12. MiniMD ideal and real traces compared (Workstation, 16 MPI processes)
Compare the ideal trace in the upper left corner with the real trace in the lower right
corner. The biggest part of the improvement comes from halving the time spent in the
MPI_Wait. Most of the remaining improvement can be attributed to the reduction of
the MPI_Send and MPI_Irecv durations to zero in the ideal trace, not to mention the
MPI_Finalize. Contrary to this, the time spent in the MPI_Allreduce changes only slightly.
By the looks of it, the MPI issues might be induced by load imbalance rather than by
intrinsic communication overhead, but we cannot be sure of this right now. Hence, we
should look into the communication pattern first. This is even more the case because the
relative portion of the MPI time is noticeable on this workload, and the increase of the
time step parameter in the input file to the more representative value of 1000 drives this
portion from 7 percent down to only 5.5 percent, on average (to be continued).
EXERCISE 5-8
Analyze and address the load imbalance in miniMD. What causes it? Replace
miniMD with your favorite application and address the load imbalance there,
provided this is necessary. What causes the imbalance?
interactions make it relatively difficult to give fast and ready advice for all possible
situations. As a rule of thumb, keep in mind the following priorities:
1. Map the application upon the target platform. This includes,
in particular, selection of the fastest communication fabrics,
proper process layout and pinning, and other settings that affect
the way application and platform interact via MPI mediation.
2. Tune the Intel MPI library for the platform and/or the
application involved. If your application is bandwidth bound,
you are likely to do well with the platform-specific tuning. If your
application differs, you may need to cater to its particular needs.
3. Optimize the application for Intel MPI library. This includes
typical MPI optimizations valid for any MPI implementation
and specific Intel MPI tricks for varying levels of complexity
and expected return on investment.
As usual, you will have to iterate until convergence or timeout. We will go through
these steps one by one in the following sections. However, if in a particular case you
perceive the need for bypassing some steps in favor of others, feel free to do so, but
beware of spending a lot of time addressing the wrong problem first.
You will notice that we differentiate between optimization and tuning. Optimization
is a wider term that may include tuning. Tuning normally concerns changing certain
environment settings that affect performance of the target application. In other words,
optimization may be more intrusive than tuning because deep optimization may
necessitate source code modifications.
Beyond that, you have already seen several examples of one simple pinning setting’s
dramatically changing the behavior of certain benchmarks and applications. Generally
speaking, if your application is latency bound, you will want its processes to share as
much of the memory and I/O subsystem paths as possible. This means, in part, that
you will try to put your processes onto adjacent cores, possibly even virtual ones. If
your application is bandwidth bound, you will do better sharing as little of the memory
subsystem paths as possible. This means, in part, putting your MPI processes on different
processor sockets, and possibly even different nodes, if you use a cluster.
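As a hedged illustration of these two extremes on a dual-socket node (the logical processor numbers and the rank count are placeholders for your own topology):
$ export I_MPI_PIN_PROCESSOR_LIST=0,1      # latency bound: keep the two ranks close together
$ mpirun -np 2 ./your_app
$ export I_MPI_PIN_DOMAIN=socket           # bandwidth bound: give each rank its own socket
$ mpirun -np 2 ./your_app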
Using IP over IB
Another trick is to switch over to IP over IB (IPoIB) when using the TCP transport over
InfiniBand. Here is how you can do this:
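A minimal sketch, assuming the IPoIB interfaces are configured on the nodes (on some systems you may have to name the concrete interface, such as ib0, instead of the ib alias):
$ export I_MPI_FABRICS=shm:tcp
$ export I_MPI_TCP_NETMASK=ib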
Similarly, if your nodes have more than one InfiniBand adapter or port, you can try the multirail capability over the OFA fabric:
$ export I_MPI_FABRICS=shm:ofa
$ export I_MPI_OFA_NUM_ADAPTERS=<n> # e.g. 2 (1 by default)
$ export I_MPI_OFA_NUM_PORTS=<n> # e.g. 2 (1 by default)
For example, this will put only two processes on each node:
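One way to express this with Intel MPI (the total process count and the executable name are placeholders) is:
$ mpirun -ppn 2 -np 16 ./your_app
Alternatively, the I_MPI_PERHOST environment variable can be set to 2 to the same effect.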
You will normally want to use the default process layout for pure MPI applications.
For hybrid programs, you may want to decrease the number of processes per node
accordingly, so as to leave enough cores for the OpenMP or another threading library
to use. Finally, and especially in benchmarking the internode rather than the intranode
communication, you will need to go down to one process per node.
In normal operational mode, you will probably use the long notation more often
when dealing with the Intel Xeon Phi coprocessor than otherwise, so let's demonstrate it
in that case (here and elsewhere we split overly long run strings across several lines by
using the shell backslash/newline notation):
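The run string may look roughly as follows (a sketch; the executable names and domain sizes are placeholders chosen to match the description below):
$ mpirun -genv I_MPI_MIC enable \
    -host `hostname` -np 2 -env I_MPI_PIN_DOMAIN 4 ./your_app.host : \
    -host `hostname`-mic0 -np 16 -env I_MPI_PIN_DOMAIN 16 ./your_app.mic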
You can see that the run string is separated into two parts by the colon (:). The first
half prescribes two MPI processes to be started on the host CPU. The second half puts
16 MPI processes upon the Intel Xeon Phi coprocessor connected to this CPU. This
coprocessor conventionally bears the name of the host node plus the extension -mic0.
This particular command will turn on the Intel Xeon Phi coprocessor support, and then
create OpenMP domains of four cores on the host processes and 16 cores on the Intel Xeon
Phi coprocessor.
I_MPI_PIN=on
I_MPI_PIN_MODE=pm
I_MPI_PIN_DOMAIN=auto,compact
I_MPI_PIN_RESPECT_CPUSET=on
I_MPI_PIN_RESPECT_HCA=on
I_MPI_PIN_CELL=unit
I_MPI_PIN_ORDER=compact
When you start playing with exact process placement upon specific cores, both
I_MPI_PIN_DOMAIN and I_MPI_PIN_PROCESSOR_LIST will help you by providing the
list-oriented, bit mask–based, and symbolic capabilities to cut the cake exactly the way
you want, and if you wish, by using more than one method. You will find them all fully
described in the Intel MPI Library Reference Manual.
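For instance (a hypothetical placement; the logical processor numbers refer to your own machine), the following
$ export I_MPI_PIN_PROCESSOR_LIST=0,2,4,6
$ mpirun -np 4 ./your_app
pins rank 0 to logical processor 0, rank 1 to logical processor 2, and so on.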
Here, we renamed both executables to keep them separate and distinguishable from
each other and from the plain Xeon executable we may need to rebuild later on. This way
we cannot spoil our executable programs by accident.
Running the program is similar to running it on the workstation:
$ export I_MPI_MIC=enable
$ mpiexec.hydra \
-env LD_LIBRARY_PATH /opt/intel/impi_latest/mic/lib:$MIC_LD_LIBRARY_PATH \
-host `hostname`-mic0 -np 16 ./miniMD_intel.mic
These environment settings make sure that the Intel Xeon Phi coprocessor is found
and that the path settings there are correct. If we compare performance of the programs
on Intel Xeon and Intel Xeon Phi at different process counts, we get the results shown
in Table 5-10:
Table 5-10. MiniMD Execution Time on Intel Xeon or Intel Xeon Phi (Seconds, Cluster)
As usual, we performed three runs at each process count and analyzed the results
for variability, which was all below 1 percent in this case. You have certainly gotten used
to this procedure by now, so that we can skip the details. From this table we can derive
that a Xeon is roughly 6.5 to 6.9 times faster than Xeon Phi for the same number of MPI
processes, as long as Xeon cores are not saturated. Note that this relationship holds both for
the core-to-core comparison (the one MPI process results) and for the MPI-to-MPI comparison
(two through 24 MPI processes). So, you will need between 6.5 and 12 times more Xeon
Phi processes to beat Xeon. Note that although Xeon Phi saturates later, at around
64 to 96 MPI processes, it never reaches Xeon execution times on 48 MPI processes.
The difference is again around six times.
It may be interesting to see how speedup and efficiency compare to each other in the
case of Xeon and Xeon Phi platforms; see Figure 5-13.
Figure 5-13. MiniMD speedup and efficiency on Xeon and Xeon Phi platforms (cluster)
Here, speedup is measured by the left-hand vertical axis, while efficiency goes by the
right-hand one. Looking at this graph, we can draw a number of conclusions:
1. We can see that Xeon efficiency surpasses Xeon Phi’s and goes
very much along the ideal speedup curve until Xeon efficiency
drops dramatically when we go beyond 24 MPI processes and
start using virtual rather than physical cores. It then recovers
somewhat by sheer weight of the resources applied.
2. Since this Xeon Phi unit has 61 physical cores, we observe a
comparable effect at around 61 MPI processes as well.
3. Xeon surpasses Xeon Phi on efficiency until the
aforementioned drop, when Xeon Phi takes over.
4. Xeon Phi becomes really inefficient and stops delivering
speedup growth on a large number of MPI processes. It is
possible that OpenMP threads might alleviate this somewhat.
5. There is an interesting dip in the Xeon Phi efficiency curve at
around 16 MPI processes. What causes it may require
extra investigation.
If you try to use both Xeon and Xeon Phi at once, you will have to not only balance
their respective numbers but also keep in mind that the data traversing the PCI Express
bus may move slower than inside Xeon and Xeon Phi, and most likely will move slower
most of the time, apart from large messages inside Xeon Phi. So, if you start with the
aforementioned proportion, you will have to play around a bit before you get to the nearly
ideal distribution, not to mention doing the process pinning and other tricks we have
explored. A good spot to start from would probably be 16 to 24 MPI processes on Xeon
and 64 to 96 MPI processes on Xeon Phi.
The required command will look as follows:
$ export I_MPI_MIC=1
$ export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1u,ofa-v2-scif0
$ mpiexec.hydra -host `hostname` -np 16 ./miniMD_intel.host : \
-env LD_LIBRARY_PATH /opt/intel/impi_latest/mic/lib:$MIC_LD_LIBRARY_PATH \
-host `hostname`-mic0 -np 96 ./miniMD_intel.mic
Table 5-11 shows a result of our quick testing on the same platform:
Table 5-11. MiniMD Execution Time on Intel Xeon and Intel Xeon Phi with Local Minima
Highlighted (Seconds, Cluster)
Xeon/Phi 48 64 96 128
8 1.396 1.349 1.140 1.233
16 1.281 1.324 1.133 1.134
24 1.190 1.256 1.137 1.222
48 0.959 1.219 1.157 1.093
We placed Xeon process counts along the vertical axis and Xeon Phi process counts
along the horizontal axis. This way we could obtain a sort of two-dimensional data-
distribution picture represented by numbers. Also, note that we prudently under- and
overshot the guesstimated optimal process count ranges, just in case our intuition was
wrong. And as it happens, it was wrong! We can observe two local minima: one at
the expected 16:96 Xeon to Xeon Phi process count ratio, and a better, global one
in the 48:48 corner of the table. And if we compare the latter to the best we
can get on 48 Xeon-based MPI processes alone, we see that Xeon Phi's presence drags
the result down by more than three times.
One can use ITAC to see what exactly is happening: is this imbalance induced by
the aforementioned Xeon to Xeon Phi core-to-core performance ratio that has not been
taken into account during the data distribution? Or is it by the communication overhead
basically caused by the PCI Express bus? It may be that both effects are pronounced to
the point of needing a fix. In particular, if the load imbalance is a factor, which it most
likely is because the data is likely split between the MPI processes proportional to their
total number, without accounting for the relative processor speed, one way to fight back
would be to create a bigger number of OpenMP threads on the Xeon Phi part of the
system. Quite unusually, you can control the number of threads using the program’s
own -t option. For example, the following command uses one of the better miniMD
configurations while generating a valid ITAC trace file:
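A sketch assembled from the pieces shown earlier (the thread counts after the -t option are placeholders, with the larger value going to the Xeon Phi side):
$ export I_MPI_MIC=1
$ export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1u,ofa-v2-scif0
$ mpiexec.hydra -trace -host `hostname` -np 2 ./miniMD_intel.host -t 12 : \
    -env LD_LIBRARY_PATH /opt/intel/impi_latest/mic/lib:$MIC_LD_LIBRARY_PATH \
    -host `hostname`-mic0 -np 6 ./miniMD_intel.mic -t 30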
Even a quick look at the resulting trace file shows that load imbalance caused by the
platform heterogeneity is indeed the root cause of all the evil here, as shown in Figure 5-14.
Figure 5-14. MiniMD trace file in ITAC (cluster, 2 Xeon processes, 6 Xeon Phi processes)
Here, processes P0 and P1 sit on the Xeon, while the rest of them sit on the Xeon Phi.
The difference in their relative speed is very clear from the direct visual comparison of the
corresponding (blue) computing sections. We can discount the MPI_Finalize duration
because it is most likely caused by the ITAC data post-processing. However, the MPI_Send
and MPI_Wait times are out of all proportion.
Further analysis of the data-exchange pattern reveals that two closely knit groups take
four processes each, with somewhat lower exchange volumes between the groups (not
shown). Moreover, a comparison of the transfer rates that can be done by clicking on the
Message Profile and selecting Attribute to show/Maximum Transfer Rate shows that the
PCI Express links achieve at most 0.2 bytes per tick while up to 2 bytes per tick are possible
inside Xeon and up to 1.1 bytes per tick inside Xeon Phi (not shown). This translates to
about 0.23 GiB/s, 2.3 GiB/s, and 1.1 GiB/s, respectively, with some odd outliers.
EXERCISE 5-9
Find out the optimal MPI process to the OpenMP thread ratio for miniMD using a
heterogeneous platform. Quantify this ratio in comparison to the relative component
speeds. How much of the effect can be attributed to the computation and
communication parts of the heterogeneity?
respectively. By the way, for the program to build, you will have to change the Makefile
to reference the Intel compilers, add the -qopenmp flag to the OPT_F variable, and add the
-lifcore library to the LIBS variable there. It is quite usual that some minor adjustments
are necessary.
Long story made short, here is the run string we used for the workstation launch:
$ export OMP_NUM_THREADS=4
$ mpirun -np 12 ./miniGhost.x --scaling 1 --nx 200 --ny 200 --nz 200 --num_vars 40 \
    --num_spikes 1 --debug_grid 1 --report_diffusion 21 --percent_sum 100 \
    --num_tsteps 20 --stencil 24 --comm_method 10 --report_perf 1 \
    --npx 1 --npy 3 --npz 4 --error_tol 8
Built-in statistics output shows the role distribution among the top three MPI calls,
as illustrated in Listing 5-15:
High relative cost of the MPI_Allreduce makes it a very attractive tuning target.
However, let us try the full-size workload first. When we proceed to run this benchmark
in its “small” configuration on eight cluster nodes and 96 MPI processes, we will use the
following run string, inspired in part by the one we used on the workstation (here, we
highlighted deviations from the original script run_small.sh):
$ export OMP_NUM_THREADS=4
$ export I_MPI_PERHOST=12
$ mpirun -np 96 ./miniGhost.x --scaling 1 --nx 672 --ny 672 --nz 672 --num_vars 40 \
    --num_spikes 1 --debug_grid 1 --report_diffusion 21 --percent_sum 100 \
    --num_tsteps 20 --stencil 24 --comm_method 10 --report_perf 1 \
    --npx 4 --npy 4 --npz 6 --error_tol 8
The irony of benchmarking in the context of a request for proposals (RFP) like
NERSC-8 Trinity is that we cannot change the parameters of the benchmarks and may
not be allowed to change the run string, either. This means that we will probably have to
go along with the possibly suboptimal data split between the MPI processes this time;
although looking at the workstation results, we would prefer to leave as few layers along
the X axis as possible. However, setting a couple of environment variables upfront to ask
for four instead of one OpenMP threads, and placing 12 MPI processes per node, might
be allowed. Thus, our initial investigation did influence the mapping of the application
to the platform, and we know that we may be shooting below the optimum in the data-
distribution sense.
Further, it is interesting to see what is taking most of the MPI time now. The built-in
statistics show a slightly different distribution; see Listing 5-16:
Listing 5-16. MiniGhost Statistics (Cluster, 8 Nodes, 12 MPI Processes per Node,
4 OpenMP Threads per Process)
The sharp hike in relative MPI_Init cost is probably explained by the presence of the
relatively slower network. It may also be explained by all the threads being busy when the
network stack itself needs some of them to process the connection requests. Whatever the
reason, this overhead looks abnormally high and certainly deserves further investigation.
One way or another, the MPI_Init, MPI_Allreduce, and MPI_Waitany calls take about 99
percent of all MPI time between them. At least the first two calls may be amenable to the
MPI-level tuning, while the last one may indicate some load imbalance (to be continued).
EXERCISE 5-10
Find the best possible mapping of your favorite application on your favorite platform.
Do you do better with the virtual or the physical cores? Why?
If you elect to use the mpitune utility, run it once after installation and each time after
changes in cluster configuration. The best configuration of the automatically selected
Intel MPI tuning parameters is recorded for each combination of the communication
device, the number of nodes, the number of MPI ranks, and the process layout. The
invocation string is simple in this case:
$ mpitune
Be aware that this can take a lot of time, so it may make sense to run this job
overnight. Note also that for this mode to work, you should have write permission
for the etc subfolder of the Intel MPI Library installation directory, or use the -od option
to select a different output directory.
Once the mpitune finishes, you can reuse the recorded values in any run by adding
the -tune option to the normal mpirun invocation string; for example:
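For instance (the process count and the executable name are placeholders):
$ mpirun -tune -np 32 ./your_app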
You can learn more about the mpitune utility in the Tutorial: MPI Tuner for Intel MPI
Library for Linux* OS.18 If you elect to do the tuning manually, you will have to dig into the
MPI internals quite a bit. There are several groups of tuning parameters that you will need
to deal with for every target fabric, number of processes, their layout, and the pinning.
They can be split into point-to-point, collective, and other magical settings.
■ Note You can output some variable settings using the I_MPI_DEBUG value of 5.
$ export I_MPI_SHM_CACHE_BYPASS_THRESHOLDS=16384,16384,-1,16384,-1,16384
$ mpirun -np 2 -genv I_MPI_FABRICS shm IMB-MPI1 PingPong
■ Note You can output default collective settings using the I_MPI_DEBUG value of 6.
You can use the environment variables named after the pattern
I_MPI_ADJUST_<opname>, where <opname> is the name of the respective collective operation.
This way you arrive at variable names like I_MPI_ADJUST_ALLREDUCE.
If we consider the case of the MPI_Allreduce a little further, we will see that there are
no less than eight different algorithms available for this operation alone. Once again, the
Intel MPI Library Reference Manual is your friend. Here, we will only be able to give some
rules of thumb as to the algorithm selection by their class. To see how this general advice
fits your practical situation, you will have to run a lot of benchmarking jobs to determine
where to change the algorithm, if at all. A typical invocation string looks as follows:
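For instance (a sketch; the algorithm number and the process count are placeholders to be varied during the search):
$ mpirun -genv I_MPI_ADJUST_ALLREDUCE 5 -np 16 IMB-MPI1 Allreduce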
You can certainly use any other benchmark, or even application, you want for this
tuning. We will stick to the IMB here, out of sheer weight of experience. One way or
another, you will end up with pretty fancy settings of the following kind that will have to
be put somewhere (most likely, a configuration file):
$ export I_MPI_ADJUST_ALLGATHER= \
'1:4-11;4:11-15;1:15-27;4:27-31;1:31-32;2:32-51;3:51-5988;4:5988-13320'
Well, it’s your choice. Now, going through the most important collective operations
in alphabetical order, in Table 5-12, we issue general recommendations based on
extensive research done by Intel engineers.19 You should take these recommendations
with a grain of salt, for nothing can beat your own benchmarking.
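To tune for a specific application, you point mpitune at the complete run command and give it a file to store the resulting settings in; as far as we recall, the invocation looks roughly like this (treat the -a and -of option names as assumptions to be verified against the mpitune documentation):
$ mpitune -a "mpirun -np 16 ./your_app" -of ./your_app.conf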
This way you can tune Intel MPI for any kind of MPI application by specifying its
command line. By default, performance is measured as the inverse of the program
execution time. To reduce the overall tuning time, use the shortest representative
application workload (if applicable). Again, this process may take quite a while to
complete.
Once you get the configuration file, you can reuse it any time in the
following manner:
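For instance (the configuration file name, process count, and executable are placeholders):
$ mpirun -tune ./your_app.conf -np 16 ./your_app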
Note that here you not only mention the file name but also use the same number of
processes and generally the same run configuration as in the tuning session. (You can
learn more about this tuning mode in the aforementioned tuning tutorial.)
If you elect to tune Intel MPI manually, you will basically have to repeat all that you
did for the platform-specific tuning described in the previous section, with the exception
of using your application or a set of representative kernels instead of the IMB for the
procedure. Certainly, you will do better instead by addressing only those point-to-point
patterns and collective operations at the number of processes, their layout and pinning,
and message sizes that are actually used by the target application. The built-in statistics
and ITAC output will help you in finding out what to go for first.
What Else?
Here is an assorted mix of tips and tricks you may try in your spare time:
• I_MPI_SSHM=1 Turns on the scalable shared memory path, which
might be useful on the latest multicore Intel Xeon processors and
especially on the many-core Intel Xeon Phi coprocessor.
• I_MPI_OFA_USE_XRC=1 Turns on the extensible reliable
connection (XRC) capability that may improve scalability for
several thousand nodes.
• I_MPI_DAPL_UD_RDMA_MIXED=1 Makes DAPL UD use
connectionless datagrams for short messages and connection-
oriented RDMA for long messages.
• I_MPI_DAPL_TRANSLATION_CACHE_AVL_TREE=1 May be useful for
applications sending a lot of long messages over DAPL.
• I_MPI_DAPL_UD_TRANSLATION_CACHE_AVL_TREE=1 Same for DAPL UD.
Of course, even this does not exhaust the versatile toolkit of tuning methods
available. Read the Intel MPI documentation, talk to experts, and be creative. This is what
this work is all about, right?
Figure 5-15. MiniGhost trace file in ITAC imbalance diagram breakdown mode
(Workstation, 12 MPI processes, 4 OpenMP threads)
$ export I_MPI_ADJUST_ALLREDUCE=6
A quick trial we performed confirmed that algorithm number 6 was among the best
for this workload. However, algorithms 1 and 2 fared just as well and were only 0.2 seconds
below the default one. Hence, most likely, optimization of the program source code
aimed at reduction of the irregularity of the exchange pattern will bring more value if
done upfront here. That may include both load imbalance correction and tuning of the
communication per se, because they may be interacting detrimentally with each other
(to be continued).
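Such a quick trial is easy to script; a sketch for the workstation run string used above might look like this (any other collective operation can be swept in the same manner):
$ export OMP_NUM_THREADS=4
$ for alg in 1 2 3 4 5 6 7 8; do \
    mpirun -genv I_MPI_ADJUST_ALLREDUCE $alg -np 12 ./miniGhost.x \
        --scaling 1 --nx 200 --ny 200 --nz 200 --num_vars 40 \
        --num_spikes 1 --debug_grid 1 --report_diffusion 21 --percent_sum 100 \
        --num_tsteps 20 --stencil 24 --comm_method 10 --report_perf 1 \
        --npx 1 --npy 3 --npz 4 --error_tol 8; \
  done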
EXERCISE 5-11
Try your hand at both platform- and application-specific Intel MPI tuning, using your
favorite platform and application. Gauge the overall performance improvement.
Identify the cases where platform-specific tuning goes against the application-
specific one.
Avoiding MPI_ANY_SOURCE
Try to make your exchanges deterministic. If you have to use the MPI_ANY_SOURCE, be
aware that you may be paying quite a bit on top for every message you get. Indeed,
instead of waiting on a particular communication channel, as prescribed by a specific
receive operation, in the case of MPI_ANY_SOURCE the MPI Library has to poll all existing
connections to see whether there is anything matching on input. This means extensive
looping and polling, unless you went for the wait mode described earlier. Note that using
different message tags is not going to help here, because the polling will still be done.
Generally, all kinds of nondeterminism are detrimental and should be avoided,
if possible. One case where this cannot be done is when a server process distributes some
work among the slave processes and waits for them to report back. However the work is
apportioned, some results will come back earlier than others, and enforcing a particular order
in this situation might slow down the overall job. In all other cases, though, try to see
whether you can induce order and benefit from doing that.
skip this synchronization most of the time. If you are still afraid of missing things or
mixing them up, start using the MPI message tags to instill the desired order, or create a
communicator that will ensure all messages sent within it will stay there.
Another aspect to keep in mind is that, although collective operations are not
required to synchronize processes by the MPI standard (with the exception of the
aforementioned MPI_Barrier, of course), some of them may do this, depending on the
algorithm they use. This may be a boon in some cases, because you can exploit this
side effect to your advantage. You should not rely on it, however, because if the algorithm
selection is changed for some reason, you may end up with no synchronization point
where you implied one, or vice versa.
About the only time when you may want to introduce extra synchronization points is
in the search for the load imbalance and its sources. In that case, having every iteration or
program stage start at approximately the same time across all the nodes involved may be
beneficial. However, this may also tilt the scale so that you will fail to see the real effect of
the load imbalance.
Again, this is mostly a thing of the past. Unless you know a brilliant new algorithm
that beats, hands down, all that can be extracted by the MPI tuning described earlier, you
should try to avoid going for the point-to-point substitute. Moreover, you may actually
win big by replacing the existing homegrown implementations with an equivalent MPI
collective operation. There may be exceptions to this recommendation, but you will have
to justify any efforts very carefully in this case.
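As a minimal example of such a replacement (our own sketch, not code from the benchmarks discussed here), a hand-rolled global sum built from point-to-point messages and a broadcast collapses into a single collective call:

#include <mpi.h>

/* Replaces a homegrown "send everything to rank 0, add it up, broadcast the
   result back" scheme with the equivalent collective operation.            */
double global_sum(double local_value, MPI_Comm comm)
{
    double sum = 0.0;
    MPI_Allreduce(&local_value, &sum, 1, MPI_DOUBLE, MPI_SUM, comm);
    return sum;
}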
Likewise, large messages will probably be sent using the rendezvous protocol
mentioned above. In other words, the standard send operation will effectively become
a synchronous one. Depending on the MPI implementation details, this may or may
not be equivalent to just calling the MPI_Ssend. Once again, in absence of a noticeable
computation/communication overlap, you will not see any improvement if you replace
this operation with a nonblocking equivalent.
More often than not, what does make sense is trying to do bilateral exchanges
by replacing a pair of sends and receives that cross each other by the MPI_Sendrecv
operation. It may happen to be implemented so that it exploits the underlying hardware
in a way that you will not be able to reach yourself unless you let MPI handle this transfer
explicitly. Note, however, that a careless switch to nonblocking communication may
actually introduce extra serialization into the program, which is well explained in the
aforementioned tutorial.
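A minimal sketch of such a replacement (names and data types are illustrative): the two crossing point-to-point calls are folded into one MPI_Sendrecv, so the library can schedule the transfer and no deadlock-avoidance ordering is needed on your side.

#include <mpi.h>

/* Exchange 'count' doubles with one neighbor in a single call. */
void exchange_with_neighbor(const double *sendbuf, double *recvbuf, int count,
                            int neighbor, int tag, MPI_Comm comm)
{
    MPI_Sendrecv(sendbuf, count, MPI_DOUBLE, neighbor, tag,
                 recvbuf, count, MPI_DOUBLE, neighbor, tag,
                 comm, MPI_STATUS_IGNORE);
}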
Another aspect to keep in mind is that for the data to move across, something or
someone (in the latter case, you) will need to give the MPI library a chance to help you.
If you rely on asynchronous progress, you may feel that this matter has been dealt with.
Actually, it may or may not have been, so it still pays to make some relevant MPI call in
between. Be aware that even something apparently pointless, like an MPI_Iprobe for a
message that never comes, may speed things up considerably. This happens because
synchronous progress is normally less expensive than asynchronous progress.
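In practice, "giving MPI a chance" can be as simple as an occasional, seemingly useless probe inside a long computational phase; a hedged sketch:

#include <mpi.h>

/* Call this from time to time inside long computational phases to let the
   MPI progress engine run; the probe result itself can be ignored.        */
static inline void poke_mpi_progress(void)
{
    int flag;
    MPI_Status status;
    MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
}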
Once again, here the MPI implementation faces a dilemma, trading latency for
guarantee. Synchronous progress is better for latency, but it cannot guarantee progress
unless the program issues MPI calls relatively often. Asynchronous progress can provide
the necessary guarantee, especially if there are extra cores or cards in the system doing
just this. However, the context switch involved may kill the latency. It is possible that in
the future, Intel MPI will provide more controls to influence this kind of behavior. Stay
tuned; until then, be careful about your assumptions and measure everything before you
dive into chaos.
Finally, believe it or not, blocking transfers may actually help application processes
self-organize during the runtime, provided you took into account their natural desires.
If your interprocess exchanges are highly regular, it may make sense to do them in a certain
order (like north-south, then east-west, and so on). After initial shaking in, the processes
will fall into lockstep with each other, and they will proceed in a beautifully synchronized
fashion across the computation, like an army column marching to battle.
EXERCISE 5-12
If your application fares better with the MPI-3 nonblocking collectives inside, let us
know; we are looking for good application examples to justify further tuning of this
advanced MPI-3 standard feature.
$ export I_MPI_EXTRA_FILE_SYSTEM=on
$ export I_MPI_EXTRA_FILE_SYSTEM_LIST=panfs,pvfs2,lustre
In the second line, you can of course list only those file systems that interest you.
Load imbalance aside, we may have to deal with the less than optimal process layout
(4x4x6) prescribed by the benchmark formulation. Indeed, when we tried other process
layouts within the same job manager session, we observed that the communication along
the X axis was stumbling—and more so as more MPI processes were placed along it;
see Table 5-13:
Table 5-13. MiniGhost Performance Dependency on the Process Layout (Cluster, 8 Nodes,
96 MPI Processes, 4 OpenMP Threads per Process)
Let’s try to understand what exactly is happening here. If you view a typical
problematic patch of the miniGhost trace file in ITAC, you will notice the following
picture replicated many times across the whole event timeline, at various moments and at
different time scales, as shown in Figure 5-16.
This patch corresponds to the very first and most expensive exchange during the
program execution. Rather small per se, it becomes a burden due to endless replication;
all smaller MPI communication segments after this one follow the same pattern or at
least have a pretty imbalanced MPI_Allreduce inside (not shown). It is clear that the
first order of the day is to understand why the MPI_Waitany has to work in such irregular
circumstances, and then try to correct this. It is also possible that the MPI_Allreduce will
recover its dignity when acting in a better environment.
By the looks of it, the pattern in Figure 5-16 resembles a typical neighbor exchange
implemented by nonblocking MPI calls. Since the very first MPI_Allreduce is a
representative one, we have no problem identifying where the prior nonblocking
exchange comes from: a bit of source code and log file browsing lead us to the file called
MG_UNPACK_BSPMA.F, where the waiting is done using the MPI_Waitany on all
MPI_Request items filled by the prior calls to MPI_Isend and MPI_Irecv that indeed
represent a neighbor data exchange. In addition to this, as the name of the file suggests
and the code review confirms, the data is packed and unpacked using the respective MPI
calls. From this, at least three optimization ideas of different complexity emerge:
1. Relatively easy: Use the MPI_Waitall or MPI_Waitsome instead
of the fussy MPI_Waitany (see the sketch after this list). These
calls can complete all, or at least more than one, request per
invocation, and do so in whatever order the MPI
implementation finds most appropriate. However, there is
some internal application statistics collection that is geared
toward the use of MPI_Waitany, so technically more than just
a one-call replacement may be necessary.
2. Relatively hard: Try to replace the nonblocking exchange with
the properly ordered blocking MPI_Sendrecv pairs. A code
review shows that the exchanges are aligned along the three
spatial dimensions, so that a more regular messaging order
might actually help smooth the data flow and reduce the
observed level of irregularity. If this sounds too hard, even
making sure that all MPI_Irecv are posted shortly before the
respective MPI_Isend might be a good first step.
3. Probably impossible: Use the MPI derived datatypes instead
of the packing/unpacking. Before this deep modification
is attempted, it should be verified that packing/unpacking
indeed matters.
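Here is the sketch promised in item 1, written in C for brevity (miniGhost itself is Fortran, and the unpacking routine below is a placeholder for the application's own buffer handling):

#include <mpi.h>

void unpack_buffer(int which);   /* placeholder for the application's unpacking */

/* Complete the whole batch of posted halo-exchange requests at once, in the
   order the MPI implementation prefers, instead of one request at a time
   with MPI_Waitany.                                                          */
void complete_exchange(MPI_Request *req, int nreq)
{
    MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);
    for (int k = 0; k < nreq; k++)
        unpack_buffer(k);
}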
This coding exercise is only sensible once the MPI_Allreduce issue has been dealt
with. For that we need to look into the node-level details in the later chapters of this book,
and then return to this issue. This is a good example of the back-and-forth transition
between optimization levels. Remember that once you introduce any change, you will
have to redo the measurements and verify that the change was indeed beneficial. After
that is done, you can repeat this cycle or proceed to the node optimization level we
will consider in the following chapters, once we’ve covered more about advanced MPI
analysis techniques (to be continued in Chapter 6).
EXERCISE 5-13
EXERCISE 5-14
Return here once the MPI_Allreduce load imbalance has been dealt with, and
implement one of the proposed source code optimizations. Gauge its effect on the
miniGhost benchmark, especially at scale. Was it worth the trouble?
Figure 5-17. MiniMD real and ideal traces compared side by side (Workstation, 16 MPI
processes)
This view confirms the earlier observation that, although there may be up to a
2.5-times improvement to be gained in the MPI area, the overall effect on the total program's
execution time will be marginal. Another interesting view is the Breakdown
Mode in the imbalance diagram shown in Figure 5-18 (here we again changed the default
colors to roughly match those in the event timeline).
Figure 5-18. MiniMD trace file in ITAC imbalance diagram breakdown mode (Workstation,
16 MPI processes)
From this view you can conclude that MPI_Wait is probably the call to investigate
as far as pure MPI performance is concerned. The rest of the overhead comes from the
load imbalance. If you want to learn more about comparing trace files, follow up with the
aforementioned serialization tutorial.
Summary
We presented the MPI optimization methodology in this chapter as applied to the Intel
MPI Library and the Intel Trace Analyzer. However, you can easily reuse this procedure with
other tools of your choice.
It is (not so) surprising that the literature on MPI optimization in particular is rather
scarce. This was one of our primary reasons for writing this book. To get the most out of
it, you need to know quite a bit about MPI programming. There is probably no better
way to get started than by reading the classic Using MPI by Bill Gropp, Ewing Lusk, and
Anthony Skjellum30 and Using MPI-2 by William Gropp, Ewing Lusk, and Rajeev Thakur.31
If you want to learn more about the Intel Xeon Phi platform, you may want to read Intel
Xeon Phi Coprocessor High-Performance Programming by Jim Jeffers and James Reinders
that we mentioned earlier. Ultimately, nothing will replace reading the MPI standard,
asking questions in the respective mailing lists, and getting your hands dirty.
We cannot recommend any specific book on the parallel algorithms because they are
quite dependent on the domain area you are going to explore. Most likely, you know all
the most important publications and periodicals in that area anyway. Just keep an eye on
them; algorithms rule this realm.
References
1. MPI Forum, “MPI Documents,” www.mpi-forum.org/docs/docs.html.
2. H. Bockhorst and M. Lubin, “Performance Analysis of a Poisson Solver Using Intel
VTune Amplifier XE and Intel Trace Analyzer and Collector,” to be published in TBD.
3. Intel Corporation, “Intel MPI Benchmarks,”
http://software.intel.com/en-us/articles/intel-mpi-benchmarks/.
4. Intel Corporation, “Intel(R) Premier Support,”
www.intel.com/software/products/support.
5. D. Akin, “Akin’s Laws of Spacecraft Design,”
http://spacecraft.ssl.umd.edu/old_site/academics/akins_laws.html.
6. A. Petitet, R. C. Whaley, J. Dongarra, and A. Cleary, “HPL - A Portable
Implementation of the High-Performance Linpack Benchmark for
Distributed-Memory Computers,” www.netlib.org/benchmark/hpl/.
7. Intel Corporation, “Intel Math Kernel Library – LINPACK Download,”
http://software.intel.com/en-us/articles/intel-math-kernel-library-linpack-download.
8. A. Petitet, R. C. Whaley, J. Dongarra, and A. Cleary, “HPL FAQs,”
www.netlib.org/benchmark/hpl/faqs.html.
9. “BLAS (Basic Linear Algebra Subprograms),” www.netlib.org/blas/.
10. Sandia National Laboratory, “HPCG - Home,”
https://software.sandia.gov/hpcg/.
11. “Home of the Mantevo project,” http://mantevo.org/.
12. Intel Corporation, “Configuring Intel Trace Collector,”
https://software.intel.com/de-de/node/508066.
13. Sandia National Laboratory, “LAMMPS Molecular Dynamics Simulator,”
http://lammps.sandia.gov/.
14. Ohio State University, “OSU Micro-Benchmarks,”
http://mvapich.cse.ohio-state.edu/benchmarks/.
15. Intel Corporation, “Intel MPI Library - Documentation,”
https://software.intel.com/en-us/articles/intel-mpi-library-documentation.
16. J. Jeffers and J. Reinders, Intel Xeon Phi Coprocessor High-Performance Programming
(Waltham, MA: Morgan Kaufman Publ. Inc., 2013).
17. “MiniGhost,” www.nersc.gov/users/computational-systems/nersc-8-system-cori/nersc-8-procurement/trinity-nersc-8-rfp/nersc-8-trinity-benchmarks/minighost/.
18. Intel Corporation, “Tutorial: MPI Tuner for Intel MPI Library for Linux OS,”
https://software.intel.com/en-us/mpi-tuner-tutorial-lin-5.0-pdf.
CHAPTER 6
Addressing Application Bottlenecks: Shared Memory
The previous chapters talked about the potential bottlenecks in your application and the
system it runs on. In this chapter, we will have a close look at how the application code
performs on the level of an individual cluster node. It is a fair assumption that there will also
be bottlenecks on this level. Removing these bottlenecks will usually translate directly to
increased performance, in addition to the optimizations discussed in the previous chapters.
In line with our top-down strategy, we will investigate how to improve your
application code on the threading level. On this level, you will find several potential
bottlenecks that can dramatically affect the performance of your application code;
some of them are hardware related, some of them are related to your algorithm. The
bottlenecks we discuss all come down to how the threads of your code interact with the
underlying hardware. From the past chapters you already have an understanding of how
this hardware works and what the important metrics and optimization goals are.
We will start with an introduction that covers how to apply Intel VTune Amplifier
XE and a loop profiler to your application to gain a better understanding of the code’s
execution profile. The next topic is that of detecting sequential execution and load
imbalances. Then, we will investigate how thread synchronization may affect the
performance of the application code.
During the optimization work you will focus on the so-called hotspots that
contribute most to the application runtime, because improving their performance will be
most beneficial to overall runtime.
You have already seen a tool called PowerTOP in Chapter 4 that gives insight into
what is currently running on the system. However, it does not show what exactly the
running applications are executing. That is what the Linux tool suite perf is for.1 It
contains several tools to record and show performance data. One useful command is
perf top, which continuously presents the currently active processes and the function
they are currently executing. Figure 6-1 shows how the output of the interactive tool might
look for a run of the HPCG benchmark.2 The first column indicates what percentage of
CPU time the function (listed in column 4 of a line) has consumed since the last update
of the output. The second column shows in which process or shared library image the
function is located. The perf tool also supports the recording of performance data and
analyzing it offline with a command-line interface. Have a look at its documentation for a
more detailed explanation.
Figure 6-1. Output of the perf top command with functions active in the HPCG
application
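For offline analysis, the typical workflow looks roughly like this (the binary name is a placeholder):

$ perf record -g ./xhpcg     # sample the run, including call stacks
$ perf report                # browse the recorded profile afterwards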
the data. VTune Amplifier XE supports both event-based sampling using the processor’s
built-in performance monitoring units (PMUs) and sampling based on instrumentation
of the binary code. In contrast to the Intel Trace Analyzer and Collector (see Chapter 5),
the focus of VTune Amplifier XE is on shared-memory and intra-node analysis. The
performance data is associated with the source code at all times, so you can easily
determine which source line of the application contributed to the performance data.
The most important place to start is with the hotspots analysis to dissect the compute
time of the application and relate that information to the application code. This gives a
good overview of where the application spends the most compute time. The individual
hotspots will be the focus areas of the optimization work to get the biggest bang for the
buck. As a side benefit, the hotspots analysis also provides a first insight into how well the
code executes on the machine. (We revisit this topic in Chapter 7.)
■ Note On most clusters it may not be possible to run the GUI. VTune Amplifier XE also
supports data collection and analysis on remote systems and from the command line.
If Remote (SSH) collection is selected in the project configuration, you can add the hostname
and credentials for a remote system. You can also use the Get Command Line button in
the GUI to get a command line that is ready for cut-and-paste to the cluster console or
job script. After the collection has finished, you can copy the resulting data to your local
machine for analysis within the GUI. For a command-line analysis, you do not need to create
a project. You will see examples of how to use this feature later on in this section. You can
find out more about collecting performance data and analyzing it with the command-line
interface in the VTune Amplifier XE user’s guide.3
Running this on our example machine gives us the result shown in Figure 6-2. The
code executed for 383 seconds and consumed about 18,330 seconds of CPU time, out
of which 10,991 seconds (almost 60 percent) is attributed to execution of a function
called ComputeSYMGS_ref. Function ComputeSPMV_ref contributes another 5,538
seconds (30 percent) to the compute time. That makes up about 90 percent of the total
CPU compute time. Thus, these two functions will be of interest when we’re looking for
optimization opportunities.
So, the next step is to dig deeper into these functions to find out more about what
they do and how they do it. We click on one of the hotspots or the Bottom-up button and
VTune will show a screen similar to the one in Figure 6-3. Here all relevant functions
are shown in more detail, together with their relevant execution time, their containing
module (i.e., executable file, shared object, etc.), and the call stack that leads to the
invocation of a hotspot. Of course, we will find our two suspect functions listed first and
second, as in the Summary screen. As we are interested in finding out more about the
hotspot, we change the filter to the Loops and Functions mode to let the tool also show
hot loops. You can enable this mode by changing the Loop Mode filter to Loops and
Functions in the filter area at the bottom of the GUI.
Figure 6-3. Hotspots (loops and functions) for the HPCG benchmark
You might be surprised to see that the order of the hotspots now seems to have
changed. The functions ComputeSYMGS_ref and ComputeSPMV_ref are now at the tail of
the ranking, which can be seen by scrolling down to the bottom of the upper pane of the
screen shot in Figure 6-3. The new top hotspots are loops at several locations in these
functions. The hottest loop is at line 67 in the function ComputeSPMV_ref and consumes
13 percent of the total compute time. This is a good candidate for parallelization, isn’t it?
We cannot tell without reading the source code, so we open the source code of the loop
by double-clicking the line noting this loop within the VTune Amplifier XE GUI.
Listing 6-1 shows the pertinent code of this hotspot.
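As a rough orientation while reading the discussion of Listing 6-1, the loop nest in ComputeSPMV_ref has the following shape (simplified and renamed from the HPCG reference code; treat it as a sketch, not the verbatim listing):

/* Sparse matrix-vector product, one matrix row per outer iteration. */
for (int i = 0; i < nrow; i++) {
    double sum = 0.0;
    /* Inner loop over the nonzeros of row i: the hot loop found above. */
    for (int j = 0; j < nonzeros_in_row[i]; j++)
        sum += values[i][j] * x[col_index[i][j]];
    y[i] = sum;
}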
As you can see, the code consists of two nested loops. VTune Amplifier XE identified
the inner loop as the hotspot. Which loop should we select as the target for OpenMP
parallelization? In this case, as in many others, the solution will be to parallelize the outer
loop. But how do we know how many iterations these loops are executing?
With these settings, the application will record runtime information for functions
and loops, including trip counts for all loops. There are several caveats to keep in mind
when using this feature, though. First, it only works with single-threaded, single-process
applications. Second, it may add considerable overhead to the runtime of the application.
The penalty depends on the code structure; many fine-grained functions and loops in
the code will add more overhead than fewer large functions and loops. To reduce the
overhead, you may try one or more of the command options listed in Table 6-1.
Flag                        Effect
-profile-loops=inner        Only profile inner loops
-profile-loops=outer        Only profile outer loops
-profile-loops-report=1     Report execution of loops, but no trip count
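The "settings" mentioned above are the loop profiling switches of the Intel compiler; a typical invocation looks roughly like this (the exact flag set may vary between compiler versions, so check the compiler documentation; the source file name is a placeholder):

$ icc -profile-functions -profile-loops=all -profile-loops-report=2 \
      -o my_app my_app.c
$ ./my_app      # typically writes loop_prof_* output files next to the binary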
The loop profile for the HPCG example is given in Figure 6-4. When we compare
Figure 6-4 with the hotspot profile shown in Figure 6-3, we can see that the hotspots and
the loop profile do not match. This is no surprise; the loop profile was collected in
single-rank mode—that is, with only one MPI process executing. In addition, a loop with
a small trip count can consume more time than loops with large numbers of iterations if the loop body
is large and demands a lot of compute time. Nevertheless, the loop profile contains an
accurate itemization of the loops and their trip counts.
Figure 6-4. Function and loop profile for the HPCG benchmark
With the loop hotspots and the loop profile, we can now make an informed decision
about which of the two loops in ComputeSPMV_ref to parallelize. The hotspot analysis
told us that the inner loop is the hot loop. However, the loop profile tells us that the loop
in line 67 has been encountered 429 million times with a minimum and maximum trip
count of 1. It is easy to see that any parallelization would have done a very poor job on
this loop. But there is also the highlighted outer loop showing up in the loop profile. It has
been encountered 687 times with minimum and maximum trip count of 17,576 and 1.1
million, respectively. Also, the average trip count of about 625,572 iterations tells us that
this loop will be an interesting candidate for parallelization. Of course, one still needs to
check that there are no loop dependencies that would prevent parallelization. Inspecting
the loop body, we can see that this loop can be executed in parallel. It is always a good
idea to check for loop-carried dependencies and data dependencies (Chapter 7) instead
of blindly adding OpenMP parallelization pragmas to loops; tools such as Intel Inspector
XE4 or Valgrind5 can be a great help in detecting and resolving issues introduced by
multithreading.
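A minimal sketch of the resulting change, reusing the illustrative names from the SpMV sketch above (not the verbatim HPCG patch):

/* Parallelize the outer loop over the rows; 'sum' is declared inside the
   loop body and is therefore private to each iteration and each thread.  */
#pragma omp parallel for
for (int i = 0; i < nrow; i++) {
    double sum = 0.0;
    for (int j = 0; j < nonzeros_in_row[i]; j++)
        sum += values[i][j] * x[col_index[i][j]];
    y[i] = sum;
}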
EXERCISE 6-1
Run a hotspot analysis for your application(s) and determine the minimum,
maximum, and average trip counts of its loops. Can you find candidates for
parallelization?
$ export OMP_NUM_THREADS=6
$ mpirun -np 8 amplxe-cl -collect hotspots --result-dir miniFE-8x6 -- \
miniFE.x -nx=500
Figure 6-5. Hotspot profile of the miniFE application to determine potential load
imbalances
Using this command line to collect performance data, VTune Amplifier XE produced
eight different results databases (miniFE-8x6.0 to miniFE-8x6.7), each of which
contains the performance data for one of the eight MPI ranks. Figure 6-5 only shows the
performance data for the first MPI rank. The other seven MPI ranks expose the same
performance characteristics, and thus we can restrict ourselves to the one MPI rank
in this case. For other applications, it will be required to check all MPI ranks and their
performance data individually to make sure there are no outliers in the runtime profile.
Let us have a look at the timeline view at the bottom of Figure 6-5. The timeline
shows several threads active over time. There are some particular areas of interest.
First, we can observe that only one thread is executing for about 40 seconds before
multithreading kicks in. We can also spot a second sequential part ranging for about
56 seconds in total, from 54 seconds to 110 seconds in the timeline. Zooming in and
filtering the timeline, we can find out that the code is doing a matrix initialization in the
first 40 seconds of its execution. About one-third of the compute time in this part is also
attributed to an MPI_Allreduce operation. A similar issue leads to the sequential part that
begins at 54 seconds of the execution. While this is not a true load imbalance in the code,
because OpenMP is not active in these parts of the application, it manifests itself like one:
from a timeline perspective, a load imbalance will look similar to what we see in Figure 6-5.
In our example, finding a parallelization scheme that also covers the sequential fractions
may boost application performance, owing to the amount of time spent in these parts of
the application.
The general approach to solving a load imbalance is to first try to modify the loop
scheduling of the code in question. Typically, OpenMP implementations prefer static
scheduling that assigns equally large numbers of loop iterations to individual worker
threads. While it is a good solution for loops with equal compute time per iteration, any
unbalanced loop will cause problems. OpenMP defines several loop scheduling types
that you can use to resolve the load imbalance. Although switching to fully dynamic
schedules such as dynamic or guided appears to be a good idea, these scheduling
schemes tend to increase contention between many OpenMP threads, because they rely
on a shared variable that maintains the work distribution. Static scheduling can still be used despite
the load imbalance it introduces if the chunk size is adjusted down so that round-robin
scheduling kicks in. Because each of the threads then receives a sequence of smaller
blocks, there is a good chance that, on average, all the threads will receive compute-
intensive and less compute-intensive loop chunks. At the same time, it ensures that each
thread can compute all iterations it has to process, without synchronizing with the other
threads through a shared variable.
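In the source code, these choices are expressed through the schedule clause; a few illustrative variants (the chunk size of 16 is made up and has to be tuned for the application at hand):

void work(int i);   /* hypothetical per-iteration workload */

void process_all(int n)
{
    /* Static round-robin with a small chunk: spreads expensive and cheap
       iterations over all threads without a shared iteration counter.    */
    #pragma omp parallel for schedule(static, 16)
    for (int i = 0; i < n; i++)
        work(i);

    /* Fully dynamic alternatives, which adapt better to irregular loops
       but make all threads compete for a shared chunk counter:
         #pragma omp parallel for schedule(dynamic, 16)
         #pragma omp parallel for schedule(guided)                        */
}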
Figure 6-6. Speedup graph (lines) and absolute runtime (bars) for the MiniMD
benchmark
$ source /opt/intel/vtune_amplifier_xe/amplxe-vars.sh
$ amplxe-cl -collect advanced-hotspots -r omp -- \
./miniMD_intel --num_threads 24
$ amplxe-cl -collect advanced-hotspots -r mpi -- \
mpirun -np 24 ./miniMD_intel
In typical applications these functions show up in the hotspots profile from time to time
and sometimes, as in this case, they are the culprit. To find out, we need to take a closer
look at what the __kmp_test_then_add_real64 function does.
Figure 6-7. Hotspots profiles for the MPI version (top) and OpenMP version (bottom) of
MiniMD
Let’s have a closer look at it by double-clicking its line in the tabular view. This
takes us to the assembly code of the function, because runtime libraries shipped with
Intel Composer XE usually do not ship with full debugging symbols and source code, for
obvious reasons. If you inspect the machine code, you will find that the main
time-consuming operation is the machine instruction lock cmpxchg. This instruction is an
atomic compare-and-exchange operation, which is frequently used to implement an
atomic add operation.
Functions like __kmp_test_then_add_real64 and similar ones that implement
OpenMP locks are hints that the code issues too many fine-grained atomic instructions.
In the case of MiniMD, the culprit is an atomic directive that protects the force update
and that causes slowdown compared to the MPI version. It is also responsible for the
limited scalability of the OpenMP version because it quickly becomes a bottleneck for an
increased number of threads.
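The kind of construct to look for resembles the following illustrative fragment (not MiniMD's exact code): every force update goes through an atomic read-modify-write, which maps to exactly the lock cmpxchg pattern observed above and serializes the threads.

/* Per-pair force accumulation protected by OpenMP atomics. */
void accumulate_forces(double *f, const int *pair_i, const int *pair_j,
                       const double *fx, int npairs)
{
    #pragma omp parallel for
    for (int k = 0; k < npairs; k++) {
        #pragma omp atomic
        f[pair_i[k]] += fx[k];
        #pragma omp atomic
        f[pair_j[k]] -= fx[k];
    }
}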
EXERCISE 6-2
Browse through the MiniMD code and try to find the OpenMP atomic constructs that
cause the overhead in the OpenMP version. Can you find similar synchronization
constructs in your application?
Figure 6-8. Comparison of MiniMD with and without OpenMP atomic construct
APIs to implement user-space locks. For each lock operation in the code, the analyzer
helps you browse through the participating threads (lock owner and waiting threads), the lock
object involved, and the respective source code locations where the lock was acquired
and released.
■ Note The numactl command introduced in Chapter 2 can also change the default
allocation strategy of the Linux kernel. The argument --localalloc enables the standard
Linux allocation strategy. With --preferred you can ask to place physical pages on a
specific NUMA region, whereas --membind enforces placement on the given NUMA regions. Finally,
the --interleave option interleaves the physical pages on several NUMA regions in a
round-robin fashion. You can find additional details about this in the man page of the
numactl command.
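For example (the binary name is a placeholder):

$ numactl --cpunodebind=0 --membind=0 ./my_app     # threads and memory on NUMA node 0
$ numactl --interleave=all ./my_app                # pages spread over all NUMA nodes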
The gray line shows the bandwidth we have obtained by forcing memory allocation
to the second NUMA region, while keeping the threads on the first socket:
It is easy to see how much available memory bandwidth we lost by choosing a wrong
placement for data and computation. It is key to tie data and computation together
on the same NUMA region whenever possible. This will greatly improve application
performance. If the application is too complex to improve its NUMA awareness, you can
still investigate whether interleaved page allocation or switching off the NUMA mode in the
BIOS improves overall performance. With these settings, the memory allocations are then
distributed across the whole machine and thus all accesses are going equally to local and
remote memory, on average.
If you wish to optimize the application and improve its NUMA awareness, then there
are several ways to accomplish this mission. First, there are ways to bind threads and
processes to individual NUMA regions so that they stay close to their data. We used the
numactl command earlier to do this, but Linux offers several other APIs (for instance,
sched_setaffinity) or tools (for example, taskset) to control processes and threads in a
machine-dependent manner. You may also recall the I_MPI_PIN environment variable
and its friends (see Chapter 5) that enable a more convenient way of controlling process
placement for MPI applications. Of course, typical OpenMP implementations also
provide similar environment variables. (We will revisit this topic later in this chapter,
when we look at hybrid MPI/OpenMP applications.)
Second, you can exploit the first-touch policy of the operating system in a threaded
application. The key idea here is to use the same parallelization scheme to initialize data
and to make sure that the same parallelization scheme is also used for computation.
Listing 6-2 shows an example of a (very) naïve matrix-vector multiplication code that
uses OpenMP for multithreading. Apart from the compute function, which computes the
result of the matrix-vector multiplication, the code contains two initialization functions
(init and init_numa_aware). In the init function, the master thread allocates all data
structures and then initializes the data sequentially. With the first-touch policy of the
Linux kernel, all physical pages will therefore reside on the NUMA region that executed
the master thread. The init_numa_aware function still uses the master thread to allocate
the data through malloc. However, the code then runs the initialization in an OpenMP
parallel for loop with the same loop schedule as the accesses in the compute function
happen for the A and c arrays. Because each OpenMP thread now touches the same data
for A and c it is supposed to work on, the physical pages are distributed across the NUMA
regions of the machine and locality is improved.
#include <stdlib.h>   /* malloc, rand; added for completeness */

double *A, *b, *c;    /* global data, as implied by the surrounding text */
int n;

void init() {
    A = (double*) malloc(sizeof(*A) * n * n);
    b = (double*) malloc(sizeof(*b) * n);
    c = (double*) malloc(sizeof(*c) * n);
    /* Sequential first touch: all pages of A end up on the NUMA region
       of the master thread. */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            A[i*n+j] = ((double) rand())/((double) RAND_MAX);
    /* initialization of b and c omitted for brevity */
}

void init_numa_aware() {
    A = (double*) malloc(sizeof(*A) * n * n);
    b = (double*) malloc(sizeof(*b) * n);
    c = (double*) malloc(sizeof(*c) * n);
    /* Parallel first touch with the same loop schedule as compute():
       each thread touches the rows of A it will later work on. */
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                A[i*n+j] = ((double) rand())/((double) RAND_MAX);
    }
    /* initialization of b and c omitted for brevity */
}
void compute() {
    /* Matrix-vector product with the same static loop schedule as
       init_numa_aware(), so threads access the rows they first touched. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            c[i] += A[i*n+j] * b[j];
}
The array b is a special case in this example. If you consider the compute function,
you will see that b is read equally by all threads. So at first glance it does not seem to
make a real difference whether we use a NUMA-aware allocation or just allocate it in a single
NUMA region. Unless the matrix size becomes unreasonably large, b will likely fit in
the last-level cache of the individual sockets, so that no NUMA effects can
be measured.
Of course, all this only happens if the working set size of the application requires
allocation of several physical pages so that they can be distributed across the different
NUMA regions. The data also needs to be large enough so that the caches are not effective
and that out-of-cache data accesses happen. For a perfectly cache-optimized code, the
effect of this optimization may be low or even negligible. If threads frequently access a
large, shared, but read-only data structure (like b) that does not fit the LLC of the sockets,
then distributing it across several NUMA regions will still likely benefit performance.
In this case, distributing the data helps avoid overloading a single NUMA region with
memory accesses from other NUMA regions.
The effect of parallel data allocation in Listing 6-2 can be visualized nicely with the
STREAM Triad benchmark. Figure 6-10 summarizes different thread placements and the
effect of NUMA-aware allocation on memory bandwidth. The compact (gray solid and
dashed line) in the chart indicates that the OpenMP runtime was instructed to first fill a
socket with threads before placing threads on the second socket. “Scatter” (black solid
and dashed line) distributes the threads in round-robin fashion. (We will have a closer
look at these distribution schemes in the next section).
Figure 6-10. STREAM Triad bandwidth with NUMA-aware allocation across multiple
NUMA regions
What you can observe from Figure 6-10 is that NUMA awareness always provides the best
results, as it fully exploits the capabilities of the memory subsystem. If threads are kept
close to each other (compact), adding the second NUMA region contributes additional
memory bandwidth, which is expected. For the scatter distribution, the memory
bandwidth of the two NUMA regions of the system contributes to the aggregate memory
bandwidth when at least two threads are executing. However, memory bandwidth will be
up to a factor of two less if memory is allocated in only one NUMA region.
Unfortunately, NUMA-aware data allocation is not possible in all cases. One
peculiar example is MPI applications that employ OpenMP threads. In many cases, these
applications use the MPI_THREAD_FUNNELED or MPI_THREAD_SERIALIZED modes in which
only one thread performs the MPI operations. If messages are received into a newly
allocated buffer, then the first-touch policy automatically allocates the backing store of
the buffer on a single NUMA region in which the communicating thread was executing.
If you wish to run OpenMP threads across multiple NUMA regions and still maintain
NUMA awareness, things tend to become complex and require a lot of thought and fine-
tuning. Depending on how long the data will be live in the buffer and how many accesses
the threads will make, it might be beneficial to either make a multithreaded copy of the
buffer so that the accessing threads also perform the first touch, or use the Linux kernel’s
interface for page migration to move the physical pages into the right NUMA domain.
However, these will be costly operations that need to be amortized by enough data
accesses. Plus, implementing the migration strategies adds a lot of boilerplate code to the
application. The easiest way to solve this is to use one MPI rank per NUMA region and
restrict OpenMP threading to that region only. In this case, there are no changes required
to the application code, but you will need to properly bind threads and processes to the
NUMA regions and their corresponding cores.
Listing 6-3 shows the effect of different values for the KMP_AFFINITY variable on the
thread placement. It shows how 18 threads are mapped to the cores of our two-socket
example machine. For the compact placement, all 18 threads will be assigned to the
first socket. The scatter strategy assigns the threads to the sockets of the machine in a
round-robin fashion; even thread IDs are assigned to the first socket, threads with odd
ID execute on the second socket. We can check this allocation by adding the verbose
modifier to the KMP_AFFINITY environment variable, which requests printing of information
about the machine structure and how the threads are assigned to the (logical) cores of
the system (Listing 6-3). To make sense of the different IDs and the underlying machine
structure, you may use the cpuinfo tool introduced in Chapter 5.
Listing 6-3. OpenMP Thread Pinning with Additional Information Printed for Each
OpenMP Thread
$ OMP_NUM_THREADS=18 KMP_AFFINITY=granularity=thread,compact,verbose \
./my_app
OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11
info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47}
OMP: Info #156: KMP_AFFINITY: 48 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 12 cores/pkg x 2 threads/core (24
total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 24 maps to package 0 core 0 thread 1
[...]
OMP: Info #144: KMP_AFFINITY: Threads may migrate across 1 innermost levels
of machine
OMP: Info #242: KMP_AFFINITY: pid 85939 thread 0 bound to OS proc set {0}
OMP: Info #242: KMP_AFFINITY: pid 85939 thread 1 bound to OS proc set {24}
OMP: Info #242: KMP_AFFINITY: pid 85939 thread 2 bound to OS proc set {1}
OMP: Info #242: KMP_AFFINITY: pid 85939 thread 3 bound to OS proc set {25}
OMP: Info #242: KMP_AFFINITY: pid 85939 thread 4 bound to OS proc set {2}
OMP: Info #242: KMP_AFFINITY: pid 85939 thread 5 bound to OS proc set {26}
OMP: Info #242: KMP_AFFINITY: pid 85939 thread 6 bound to OS proc set {3}
OMP: Info #242: KMP_AFFINITY: pid 85939 thread 7 bound to OS proc set {27}
OMP: Info #242: KMP_AFFINITY: pid 85939 thread 8 bound to OS proc set {4}
OMP: Info #242: KMP_AFFINITY: pid 85939 thread 9 bound to OS proc set {28}
OMP: Info #242: KMP_AFFINITY: pid 85939 thread 10 bound to OS proc set {5}
OMP: Info #242: KMP_AFFINITY: pid 85939 thread 11 bound to OS proc set {29}
OMP: Info #242: KMP_AFFINITY: pid 85939 thread 12 bound to OS proc set {6}
OMP: Info #242: KMP_AFFINITY: pid 85939 thread 13 bound to OS proc set {30}
OMP: Info #242: KMP_AFFINITY: pid 85939 thread 14 bound to OS proc set {7}
OMP: Info #242: KMP_AFFINITY: pid 85939 thread 16 bound to OS proc set {8}
OMP: Info #242: KMP_AFFINITY: pid 85939 thread 15 bound to OS proc set {31}
OMP: Info #242: KMP_AFFINITY: pid 85939 thread 17 bound to OS proc set {32}
[...]
$ OMP_NUM_THREADS=18 KMP_AFFINITY=granularity=thread,scatter,verbose \
./my_app
OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11
info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47}
OMP: Info #156: KMP_AFFINITY: 48 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 12 cores/pkg x 2 threads/core (24
total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 24 maps to package 0 core 0 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 1 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 25 maps to package 0 core 1 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 2 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 26 maps to package 0 core 2 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 3 thread 0
[...]
OMP: Info #242: KMP_AFFINITY: pid 85979 thread 0 bound to OS proc set {0}
OMP: Info #242: KMP_AFFINITY: pid 85979 thread 1 bound to OS proc set {12}
OMP: Info #242: KMP_AFFINITY: pid 85979 thread 2 bound to OS proc set {1}
OMP: Info #242: KMP_AFFINITY: pid 85979 thread 3 bound to OS proc set {13}
OMP: Info #242: KMP_AFFINITY: pid 85979 thread 4 bound to OS proc set {2}
OMP: Info #242: KMP_AFFINITY: pid 85979 thread 5 bound to OS proc set {14}
OMP: Info #242: KMP_AFFINITY: pid 85979 thread 6 bound to OS proc set {3}
OMP: Info #242: KMP_AFFINITY: pid 85979 thread 7 bound to OS proc set {15}
OMP: Info #242: KMP_AFFINITY: pid 85979 thread 8 bound to OS proc set {4}
OMP: Info #242: KMP_AFFINITY: pid 85979 thread 9 bound to OS proc set {16}
OMP: Info #242: KMP_AFFINITY: pid 85979 thread 10 bound to OS proc set {5}
OMP: Info #242: KMP_AFFINITY: pid 85979 thread 11 bound to OS proc set {17}
OMP: Info #242: KMP_AFFINITY: pid 85979 thread 12 bound to OS proc set {6}
OMP: Info #242: KMP_AFFINITY: pid 85979 thread 13 bound to OS proc set {18}
OMP: Info #242: KMP_AFFINITY: pid 85979 thread 14 bound to OS proc set {7}
OMP: Info #242: KMP_AFFINITY: pid 85979 thread 16 bound to OS proc set {8}
OMP: Info #242: KMP_AFFINITY: pid 85979 thread 15 bound to OS proc set {19}
OMP: Info #242: KMP_AFFINITY: pid 85979 thread 17 bound to OS proc set {20}
[...]
If you carefully inspect the printout of Listing 6-3, it appears that the OpenMP
runtime system has assigned the threads in a way that we did not expect in the first place.
The compact policy assigned multiple OpenMP threads to the same physical core
(e.g., threads 0 and 1 to logical cores 0 and 24, respectively, which belong to the same
physical core), whereas for scatter, it assigned different physical cores. Due to SMT, each
physical core appears as two logical cores that may execute threads. With compact, we
have asked the OpenMP runtime to fill one socket first, before utilizing the second socket.
The most compact thread placement is to put thread 0 on logical core 0 and thread 1 on
logical core 24, and so on. This might not be what we intended; you might have expected
something along the lines of placing 12 threads on the first socket and deploying the
remaining six threads on the other socket.
The syntax for KMP_AFFINITY provides modifiers to further control its behavior. We
already silently used granularity in Listing 6-3. You can use it to tell the Intel OpenMP
implementation whether an OpenMP thread is to be assigned to a single logical core
(granularity=thread) or to the hardware threads of a physical core (granularity=core).
Once you have played a bit with these two settings, you will see that neither will deploy
the 18 threads of our example to two sockets. The solution is to use compact,1 as the
policy. The effect is shown in Listing 6-4, in which 12 threads have been deployed to the
first socket, and the remaining six threads have been assigned to the second socket. The
documentation of Intel Composer XE6 can give you more information on what compact,1
means and what other affinity settings you can use.
Listing 6-4. Compact KMP_AFFINITY Policy Across Two Sockets of the Example
Machine
$ OMP_NUM_THREADS=18 KMP_AFFINITY=granularity=thread,compact,1,verbose \
./my_app
[...]
OMP: Info #242: KMP_AFFINITY: pid 86271 thread 0 bound to OS proc set {0}
OMP: Info #242: KMP_AFFINITY: pid 86271 thread 1 bound to OS proc set {1}
OMP: Info #242: KMP_AFFINITY: pid 86271 thread 3 bound to OS proc set {3}
OMP: Info #242: KMP_AFFINITY: pid 86271 thread 2 bound to OS proc set {2}
OMP: Info #242: KMP_AFFINITY: pid 86271 thread 4 bound to OS proc set {4}
OMP: Info #242: KMP_AFFINITY: pid 86271 thread 5 bound to OS proc set {5}
OMP: Info #242: KMP_AFFINITY: pid 86271 thread 6 bound to OS proc set {6}
OMP: Info #242: KMP_AFFINITY: pid 86271 thread 7 bound to OS proc set {7}
OMP: Info #242: KMP_AFFINITY: pid 86271 thread 8 bound to OS proc set {8}
OMP: Info #242: KMP_AFFINITY: pid 86271 thread 9 bound to OS proc set {9}
OMP: Info #242: KMP_AFFINITY: pid 86271 thread 10 bound to OS proc set {10}
OMP: Info #242: KMP_AFFINITY: pid 86271 thread 11 bound to OS proc set {11}
OMP: Info #242: KMP_AFFINITY: pid 86271 thread 12 bound to OS proc set {12}
OMP: Info #242: KMP_AFFINITY: pid 86271 thread 13 bound to OS proc set {13}
OMP: Info #242: KMP_AFFINITY: pid 86271 thread 14 bound to OS proc set {14}
OMP: Info #242: KMP_AFFINITY: pid 86271 thread 16 bound to OS proc set {16}
OMP: Info #242: KMP_AFFINITY: pid 86271 thread 15 bound to OS proc set {15}
OMP: Info #242: KMP_AFFINITY: pid 86271 thread 17 bound to OS proc set {17}
With version 4.0 of the OpenMP API specification, OpenMP now defines a common
way to deal with thread placement in OpenMP applications. In OpenMP terms, a place
denotes an entity that is capable of executing an OpenMP thread and is described as an
unordered list of numerical IDs that match the processing elements of the underlying
hardware. For Intel processors, these IDs are the core IDs as they appear in the operating
system (e.g., as reported in /proc/cpuinfo or by KMP_AFFINITY=verbose). A place list
contains an ordered list of places and is defined through the OMP_PLACES environment
variable. The place list can also contain abstract names for places, such as threads
(logical cores), cores (physical cores), or sockets (the sockets in the machine).
OpenMP also defines three placement policies with respect to an existing place list:
• master: Assign all threads of a team to the same place as the
master thread of the team.
• close: Assign OpenMP threads to places such that they are close
to their parent thread.
• spread: Sparsely distribute the OpenMP threads in the place list,
dividing the place list into sublists.
In contrast to KMP_AFFINITY, the OpenMP placement policies can be used on a per-
region basis by using the proc_bind clause at a parallel construct in the OpenMP code.
It also supports nested parallelism through a list of policies separated by commas for the
OMP_PROC_BIND variable. For each nesting level, one can specify a particular policy that
becomes active, once a parallel region on that level starts executing. This is especially
useful for applications that either use nested parallelism or that need to modify the thread
placement on a per-region basis.
Listing 6-5 contains a few examples of different thread placements using OMP_PLACES
and OMP_PROC_BIND. The first example has the same effect as the compact placement in
Listing 6-4, whereas the second example assigns the threads in a similar fashion as the
scatter policy of KMP_AFFINITY.
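Settings along the following lines produce the two placements described for Listing 6-5 (a reconstruction on our part; the original listing may have used different place definitions, for example threads instead of cores):

$ OMP_NUM_THREADS=18 OMP_PLACES=cores OMP_PROC_BIND=close ./my_app
$ OMP_NUM_THREADS=18 OMP_PLACES=cores OMP_PROC_BIND=spread ./my_app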
OMP: Info #242: OMP_PROC_BIND: pid 86565 thread 15 bound to OS proc set {31}
OMP: Info #242: OMP_PROC_BIND: pid 86565 thread 2 bound to OS proc set {1}
[...]
For more information on how to use KMP_AFFINITY and the OpenMP interface
for threaded applications, see the user’s guide of Intel Composer XE. For more
advanced usage scenarios, the documentation also contains useful information on how
programmers can use special runtime functions that allow for specific control of all
aspects of thread pinning.
EXERCISE 6-3
If you configure the application to run only a single MPI rank per node, so that the
remaining cores of the node are used to execute OpenMP threads, you’ll need to place the
threads appropriately to avoid NUMA issues and to make sure that the operating system
keeps the threads where their data has been allocated.
If the application runs with one or more MPI ranks per socket, thread placement will
be less of an issue. If the MPI rank is bound to a certain socket (the default for Intel MPI),
the threads of each MPI process are automatically confined to execute on the same set
of cores (or socket) that are available for their parent process (see Listing 6-6). Since now
the MPI ranks’ threads cannot move away from their executing socket, the NUMA issue is
automatically solved. Data allocation and computation will always be performed on the
same NUMA region. Pinning threads to specific cores might still lead to improvements,
since it effectively avoids cache invalidations of the L1 and L2 caches that may happen
owing to the threads’ wandering around on different cores of the same socket.
In Listing 6-6, we instruct both the Intel MPI Library and the Intel OpenMP runtime
to print their respective process and thread placements for MiniMD on a single node with
two MPI ranks. As you can see, the Intel MPI Library automatically deploys one MPI rank
per socket and restricts execution of the OpenMP threads to the cores of each socket. We
can use this as a starting point and apply what we saw earlier in this section. Adding the
appropriate KMP_AFFINITY settings, we can now make sure that each OpenMP thread is
pinned to the same core during execution (shown in Listing 6-7).
Listing 6-6. Default Process and Thread Placement for an MPI/OpenMP Hybrid
Application
[0] OMP: Info #242: KMP_AFFINITY: pid 87135 thread 6 bound to OS proc set
{0,1,2,3,4,5,6,7,8,9,10,11,24,25,26,27,28,29,30,31,32,33,34,35}
[0] OMP: Info #242: KMP_AFFINITY: pid 87135 thread 8 bound to OS proc set
{0,1,2,3,4,5,6,7,8,9,10,11,24,25,26,27,28,29,30,31,32,33,34,35}
[0] OMP: Info #242: KMP_AFFINITY: pid 87135 thread 7 bound to OS proc set
{0,1,2,3,4,5,6,7,8,9,10,11,24,25,26,27,28,29,30,31,32,33,34,35}
[0] OMP: Info #242: KMP_AFFINITY: pid 87135 thread 9 bound to OS proc set
{0,1,2,3,4,5,6,7,8,9,10,11,24,25,26,27,28,29,30,31,32,33,34,35}
[0] OMP: Info #242: KMP_AFFINITY: pid 87135 thread 10 bound to OS proc set
{0,1,2,3,4,5,6,7,8,9,10,11,24,25,26,27,28,29,30,31,32,33,34,35}
[0] OMP: Info #242: KMP_AFFINITY: pid 87135 thread 11 bound to OS proc set
{0,1,2,3,4,5,6,7,8,9,10,11,24,25,26,27,28,29,30,31,32,33,34,35}
[1] OMP: Info #242: KMP_AFFINITY: pid 87136 thread 0 bound to OS proc set
{12,13,14,15,16,17,18,19,20,21,22,23,36,37,38,39,40,41,42,43,44,45,46,47}
[1] OMP: Info #242: KMP_AFFINITY: pid 87136 thread 1 bound to OS proc set
{12,13,14,15,16,17,18,19,20,21,22,23,36,37,38,39,40,41,42,43,44,45,46,47}
[1] OMP: Info #242: KMP_AFFINITY: pid 87136 thread 2 bound to OS proc set
{12,13,14,15,16,17,18,19,20,21,22,23,36,37,38,39,40,41,42,43,44,45,46,47}
[1] OMP: Info #242: KMP_AFFINITY: pid 87136 thread 3 bound to OS proc set
{12,13,14,15,16,17,18,19,20,21,22,23,36,37,38,39,40,41,42,43,44,45,46,47}
[1] OMP: Info #242: KMP_AFFINITY: pid 87136 thread 4 bound to OS proc set
{12,13,14,15,16,17,18,19,20,21,22,23,36,37,38,39,40,41,42,43,44,45,46,47}
[1] OMP: Info #242: KMP_AFFINITY: pid 87136 thread 5 bound to OS proc set
{12,13,14,15,16,17,18,19,20,21,22,23,36,37,38,39,40,41,42,43,44,45,46,47}
[1] OMP: Info #242: KMP_AFFINITY: pid 87136 thread 6 bound to OS proc set
{12,13,14,15,16,17,18,19,20,21,22,23,36,37,38,39,40,41,42,43,44,45,46,47}
[1] OMP: Info #242: KMP_AFFINITY: pid 87136 thread 7 bound to OS proc set
{12,13,14,15,16,17,18,19,20,21,22,23,36,37,38,39,40,41,42,43,44,45,46,47}
[1] OMP: Info #242: KMP_AFFINITY: pid 87136 thread 8 bound to OS proc set
{12,13,14,15,16,17,18,19,20,21,22,23,36,37,38,39,40,41,42,43,44,45,46,47}
[1] OMP: Info #242: KMP_AFFINITY: pid 87136 thread 9 bound to OS proc set
{12,13,14,15,16,17,18,19,20,21,22,23,36,37,38,39,40,41,42,43,44,45,46,47}
[1] OMP: Info #242: KMP_AFFINITY: pid 87136 thread 10 bound to OS proc set
{12,13,14,15,16,17,18,19,20,21,22,23,36,37,38,39,40,41,42,43,44,45,46,47}
[1] OMP: Info #242: KMP_AFFINITY: pid 87136 thread 11 bound to OS proc set
{12,13,14,15,16,17,18,19,20,21,22,23,36,37,38,39,40,41,42,43,44,45,46,47}
[...]
$ I_MPI_DEBUG=4 KMP_AFFINITY=granularity=thread,compact,1,verbose \
mpirun -prepend-rank -np 2 ./miniMD_intel --num_threads 12
[0] [0] MPI startup(): Single-threaded optimized library
[0] [0] MPI startup(): shm data transfer mode
[1] [1] MPI startup(): shm data transfer mode
[0] [0] MPI startup(): Rank Pid Node name Pin cpu
Summary
This chapter was all about optimizations on the threading level of the application to
achieve better performance on a single node.
If your application is using only MPI to exchange messages on the process level
and you are thinking about multithreading, this chapter showed how you can create a
hotspot and loop profile to get a better understanding of the application behavior. This is
your foundation for making informed decisions about where to apply OpenMP (or other
threading models) to your code to move it to a hybrid MPI/OpenMP solution.
The hotspot profile is the tool for identifying optimization and parallelization
candidates. The hotspots are the places you will investigate closely and in depth to find
the bottlenecks in those parts of your code. We have presented some of the most common
application bottlenecks, such as sequential and load-imbalanced parts of the code,
excessive thread synchronization, and issues introduced by NUMA.
References
1. “Perf: Linux profiling with performance counters,”
https://perf.wiki.kernel.org/index.php/Main_Page.
2. J. Dongarra and M. A. Heroux, Toward a New Metric for Ranking
High Performance Computing Systems (Albuquerque, NM: Sandia
National Laboratories, 2013).
3. Intel VTune Amplifier XE User’s Guide (Santa Clara, CA: Intel
Corporation, 2014).
4. “Intel® Inspector XE 2015,”
https://software.intel.com/intel-inspector-xe.
5. Valgrind Developers, “Valgrind,” http://valgrind.org/.
6. User and Reference Guide for the Intel C++ Compiler 15.0
(Santa Clara, CA: Intel Corporation, 2014).
200
CHAPTER 7
Addressing Application Bottlenecks: Microarchitecture
Pipelined Execution
Pipelines are the computer-science version of an industrial assembly line, and they are
the overarching design principle of a modern CPU core. Thus, all techniques introduced
here need to be seen in light of this concept. In a pipeline, throughput performance is
gained by exploiting parallelism in the execution of a stream of instructions. Pipeline
stages are each executing specialized tasks.
A classic pipeline model used by Tanenbaum3 looks like this:
• Instruction fetch (IF): Loads the instruction indicated by the
instruction pointer. The instruction pointer is a CPU register
containing the memory address of the next instruction to be
executed.
• Instruction decode (ID): The processor parses the instruction and
associates it with the functionality to be executed.
• Load operands (LO): Data from the argument(s) are loaded. In
general this is data thought to be contained in a CPU register, but
it might also be a memory address.
• Execution (EX): The instruction is executed. In the case of an
arithmetic instruction, the actual computation will be done.
• Write-back results (WB): The computed result is committed to the
specified destination, usually a register.
In a non-pipelined approach, each of these steps needs to be completed before
another instruction can be fetched. While one of these stages is active, the others are
idle, which leaves most of the processor's capacity unutilized. In a pipelined approach,
however, each pipeline stage is active at the same time, leading to a throughput of one
instruction per clock cycle (assuming that completion of a stage requires one clock cycle).
If you assume that there is no dependence between individual instructions, you can fetch
the third instruction while the second one is being decoded; you can load the operands
for the first one at the same time, and so on. In this way, no pipeline stage is ever idle, as
shown in Figure 7-1.
In a pipeline, although the latency of the individual instruction (the time during
which the instruction is individually executed) does not change, the overall throughput
increases dramatically.
Let's estimate the time it takes to execute a number of instructions N_instructions in a
pipeline of N_stages stages, where we assume each stage takes the same time T_stage to complete:
T_no-pipeline = N_instructions × N_stages × T_stage,
T_pipeline = N_stages × T_stage + (N_instructions − 1) × T_stage.
The ideal speedup is
S = T_no-pipeline / T_pipeline = N_stages,
assuming an infinite number of instructions. For example, with N_stages = 5 and
N_instructions = 1,000, the non-pipelined execution takes 5,000 stage times, while the
pipelined one takes only 1,004, a speedup of about 5. This estimation is highly idealized. If, for
some reason, one stage of the pipeline fails to complete in time, the whole pipeline will
come to a halt—that's what we call a pipeline stall. Without claiming completeness, some
common reasons for pipeline stalls are as follows.
Data Conflicts
Data conflicts arise from the parallel execution of instructions within a pipeline when the
results of one instruction are an input argument to another instruction. Data conflicts
play an important role when we speak about vectorization, and we will come back to
them when we deal with this topic later in the chapter.
There are three types of data conflicts:
1. Read after write (RAW) or flow dependence: A variable must be
read after it has been written by a previous instruction.
Example in C:
var1=5;
var2=var1+2;
Clearly, the first line needs to complete before the second can
be executed, since otherwise the variable var1 might contain
an arbitrary value.
2. Write after read (WAR) or anti-dependence: This is just the
opposite of RAW: a variable must be written after it has been
read previously. Example in C:
var2=var1+2;
var1=5;
3. Write after write (WAW) or output dependence: The same
variable is written by two different instructions; the order of
the writes must be preserved to obtain the correct final value.
Example in C:
var1=1;
<other instructions>
var1=2;
The WAW case is not a problem in a simple in-order pipeline. However, in the
context of out-of-order and superscalar execution discussed later, this conflict is possible
when the order of instructions can be changed and instructions might even be executed
concurrently.
Control Conflicts
The instruction flow often reaches a point where a decision needs to be made about
which further instructions will be executed. Such points are unmistakably indicated by
explicit conditional constructs (e.g., if-statements) in the source code. Other language
constructs require decisions to be made, as well. A common example is loops where a
counter variable is compared with the upper limit of the loop. If the upper limit is not
reached, the instruction pointer will be set to the beginning of the loop body and the next
iteration is executed. If the upper limit is reached, the execution will continue with the
next instruction after the loop body.
Consider, for example, a simple counting loop:
for(i=0;i<100;i++)
s=s+i;
With the result of the comparison not yet available, the pipeline cannot reliably
execute the jump, and it will stall until the comparison has written back its result.
A control conflict can partly be resolved by branch prediction, as discussed later.
Structural Conflicts
A structural conflict appears when more hardware functionality is required than is
available for a section of the instruction flow. If you have, say, only four registers and two
instructions in the pipeline that each use two registers to copy data, then a further
instruction that requires one of these registers will have to wait until the resources are
freed up.
Superscalar Pipelines
With out-of-order execution in place, there is a straightforward way to increase
performance: instead of executing out-of-order in a single pipeline, you could execute
the micro-instructions in parallel, in two or more independent pipelines, because their
execution is independent from the beginning (see Figure 7-2). The retirement buffer will
then take care of the proper ordering across all pipelines after the execution. The level
of parallelism that can be achieved in this approach is, of course, limited by the inherent
parallelism in the flow of instructions.
SIMD Execution
In practice, we often apply the same instructions to each element of a large dataset.
This gives rise to an additional level of parallelism to be exploited. The potential speedup
is proportional to the number of data elements we can process in parallel. In hardware,
this is implemented by making registers available that are as wide as the number of
elements that we want to treat in parallel, which is a significant hardware investment
limiting the vector length.
Current Intel CPUs support three types of vector extensions: multimedia extensions
(MMX, 64 bit), various versions of Streaming SIMD Extensions (SSE, 128 bit), and Advanced
Vector Extensions (AVX, 256 bit). Chapter 2 discussed the benefits of SIMD execution in
detail; see especially Figures 2-2 through 2-8 for AVX.
Branch Prediction
When a conditional branch appears in the instruction flow, the pipeline stalls at this
position until the condition has been calculated and the next instruction can be fetched,
which can mean a big hit to performance. To alleviate this problem, the processor could
predict the target of the branch and continue feeding the pipeline with instructions from
this point in the code. For this purpose, the processor has a branch target buffer that
stores the last branch target taken from this point in the code, along with information
about the success of the last predictions.
A very simple implementation of a branch prediction would be to predict the last
observed behavior; that is, we predict “branch taken” if we took it the last time and “not
taken” if we didn’t take it. This is easy to store in a single bit associated with the position
of the branch in the code. Figure 7-3 shows a more advanced branch predictor using
two bits. If a branch is taken, the predictor will enter the state 11 “predict taken.” While
the prediction is true, it will stay in this state. If the prediction is wrong once, it will not
immediately predict "not taken" but, rather, go to state 10, still "predict taken." Only if
the prediction is wrong a second time will the state change to "00" and "predict not taken."
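A minimal C sketch of such a predictor, implemented as a saturating two-bit counter, may make the scheme more concrete (the state encoding 0 through 3 is an assumption; the exact transitions of the predictor in Figure 7-3 may differ slightly):
// two-bit saturating counter: states 2 and 3 predict "taken", 0 and 1 "not taken"
typedef struct { unsigned state; } two_bit_predictor;   // state in {0,1,2,3}

int predict_taken(const two_bit_predictor* p) {
    return p->state >= 2;            // 10 and 11 still predict "taken"
}

void update_predictor(two_bit_predictor* p, int taken) {
    if (taken  && p->state < 3) p->state++;   // move toward "predict taken"
    if (!taken && p->state > 0) p->state--;   // move toward "predict not taken"
}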
These branch prediction schemes are called dynamic predictions. The branch
predictor compares the current position in the code with information it already has
stored for this particular branch. At the first encounter of a particular branch, this
solution will not work because there is no information available in the branch target
buffer. For this instance, the branch predictor has a default behavior, called the static
prediction. The rules for static branch prediction are as follows:
• A forward conditional branch (an if-statement) is predicted not to
be taken
• A backward conditional branch (a loop) is predicted to be taken
• An unconditional branch (a call to or return from a subroutine) is
predicted to be taken
Branch predictors can have a very high hit rate, but at one point a prediction will fail.
If you consider a loop, the prediction will be wrong when the loop counter has reached
the limit. In this case, the processor speculatively executes the wrong code path and
the pipeline will be cleared (called a pipeline flush). The computation is then restarted
at the point before the prediction, using the now-known correct branch target. A pipeline
flush can have a serious performance impact. The minimum impact is the time it takes to
refill the pipeline—which is, at best, the number of pipeline stages.
Memory Subsystem
A problem in the last decade of CPU design has been the growing divergence between memory
and CPU performance. While the CPU follows Moore's Law and doubles the
number of components (translating directly into performance) every 18 months, memory
performance (that is, the number of bytes delivered per second) grows much more slowly. To
have the data readily available when needed by the execution units, fast but small and
expensive storage is directly built into the CPU, called a cache. The idea of a cache was
inspired by the temporal principle of locality: data that you have used once you will likely
use again in the near future. The cache memory, then, stores intermediate copies of data
that actually reside in the main memory. Often, more than one cache is present, which is
then called a cache hierarchy.
Three different cache implementations are used:
• Direct-mapped cache: Each memory address can be stored
only in a specific cache line. If a memory address is loaded in a
cache line that is already occupied, the previous content of this
line is evicted. This approach allows for a much leaner logic to
determine a hit and has relatively low access latency. The hit
ratios are lower than with the fully associative cache, but this
technique is cheaper.
• Fully associative cache: The memory address of the data within
the cache is stored alongside. To find where a memory location
has been stored in the cache requires a compare operation across
the memory addresses stored for each line. Fully associative
caches have high hit rates but relatively long access latencies.
Also, they require a larger chip space to incorporate the extensive
hit logic and are therefore more expensive.
• Set associative cache: This is a compromise between the two
aforementioned alternatives. The set associative cache divides the
cache into a number of sets (say eight). A cache line is placed into
a given set, based on its memory address. Searching within a set,
then, is internally fully associative. While cost and chip space stay
reasonable, this technique offers a high hit ratio.
Current product lines of Intel processor cores feature 32 Kbyte instruction and data
Level 1 (L1) caches and a 256 Kbyte unified Level 2 cache (L2), both eight-way set
associative. (Cache latencies and bandwidth have been discussed in Chapter 2.)
Even if we cache data entries for fast access by the execution units, many programs
stream data through the processor, which exceeds the cache capacity. The cache then
becomes useless because you will not find the entry you loaded in the past, as it was
already evicted. The cache also adds latency to the loading of such streaming data into
the processor. In modern CPU design, this problem is attacked by preloading data in
the caches that is likely to be used next, so that it is readily available. This technique is
called prefetching and is extensively used in the Sandy Bridge architecture, which has four
different hardware prefetchers.6
Figure 7-5. Hierarchical top-down analysis method (Source: Intel 64 and IA-32 Architectures
Optimization Reference Manual)
Ideally we would like to see all compute cycles spent in the retired category, although
this doesn’t mean there is no room for improvement. As for the other categories, let’s
discuss some common reasons they will appear:
• Front-end bound: This is caused by misses in the instruction
cache (ICache) or the instruction translation lookaside buffer
(ITLB), owing to a large code footprint, excessive inlining, or loop
unrolling. Inefficiencies in the decoder, such as length-changing
prefixes,7 can also be the reason. (See inlining in the later sections
"Dealing with Branching" and "Basic Usage and Optimization.")
• Back-end memory bound: Cache misses at all levels, caused by
irregular data access, streaming data access, large datasets, cache
conflicts, and unaligned data. (See the later section "Optimizing
for Vectorization.")
• Back-end core bound: Long-latency instructions (such as divide),
chains of dependent instructions, and code that does not vectorize.
(See the later section "Optimizing for Vectorization.")
• Bad speculation: Mispredicted branches and the resulting pipeline
flushes, as well as short loops. (See the later section
"Dealing with Branching.")
will create a binary that can execute on all Intel CPUs supporting SSE4.1, but it will still
run the highest performing code path on the Sandy Bridge and Ivy Bridge processors.
The next basic choice is the optimization level. There are four optimization levels,
which can be controlled with the -On switch, where n is between 0 and 3. The -O0 switch
turns off optimization completely. The -O1 optimizes for speed, but doesn’t increase the
code size. The default optimization level is -O2, which optimizes for speed, but increases
code size through unrolling and inlining. The -O3 level performs optimizations similar to -O2,
but more aggressively. When using -O3, you'll find that increased inlining and loop
unrolling can sometimes lead to slower code because of front-end stalls. It is worth playing
with this switch and measuring performance before deciding on the final level to use.
Let's look at the optimization report of the compiler, generated with -opt-report5.
First, there is a report header summarizing the settings for IPO and inlining, as well as
the inlined functions. After this header information, the optimization report for the code
starts, usually with the main routine.
The function squaregemm that we have defined is inlined; all other functions (not shown
in Listing 7-1) for which no code could be found are marked extern, such as rand() or
printf(). The numbers behind the inlined function summarize the increase of the code size.
In this case, the size of the calling function plus the called function is 68 = 33 + 35, whereas the
size of the inlined function is only 57, owing to further optimizations.
The next interesting point is the optimization report for the squaregemm function:
...
Begin optimization report for: squaregemm(int, double*, double*, double*)
...
Report from: Loop nest, Vector & Auto-parallelization optimizations
[loop, vec, par]
The compiler has changed the order of the loops from i,j,k to i,k,j to provide
better conditions for vectorization, as sketched below.
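In a simplified form, matching the squaregemm(int, double*, double*, double*) signature from the report, the interchange corresponds to the following transformation (a sketch, not the actual Listing 7-1 code):
// before: loop order i, j, k
void gemm_ijk(int n, double* a, double* b, double* c){
    for(int i=0;i<n;i++)
        for(int j=0;j<n;j++)
            for(int k=0;k<n;k++)
                c[i*n+j] += a[i*n+k] * b[k*n+j];
}

// after the interchange: loop order i, k, j; the innermost loop now runs over
// contiguous elements of b and c, which is much friendlier to the vectorizer
void gemm_ikj(int n, double* a, double* b, double* c){
    for(int i=0;i<n;i++)
        for(int k=0;k<n;k++)
            for(int j=0;j<n;j++)
                c[i*n+j] += a[i*n+k] * b[k*n+j];
}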
...
remark #15145: vectorization support: unroll factor set to 4
...
The last line indicates that the compiler has unrolled the loop by four iterations.
Checking the assembly output of objdump -d, we indeed find a vectorized version of the
loop that is fourfold unrolled (four AVX vector multiplies and four AVX vector adds).
So you should now have an idea of what type of information the report creates. We
have deliberately left out quite a number of lines so as to keep this readable. The original
report for this very short program with a single function call is about 200 lines at report
level 5. Very often this is too much detail, as you might be interested in only one function
in a file or in a particular phase of the report. In this case, you can specify a filter for a
function, or restrict the report to a particular phase—for instance:
-opt-report-filter="squaregemm"
-opt-report-phase=vec
The phase must be one of CG, IPO, LOOP, OFFLOAD, OPENMP, PAR, PGO,
TCOLLECT, VEC, or all, as described earlier.
This allows for non-destructive operations (none of the sources are altered) and avoids
frequent save operations necessary to preserve the contents of a vector register, as well as
reduces register pressure (the shortage of registers). Examples of AVX functionality for 256-bit
vectors include the following (see illustrations of some vector functions in Figure 7-6):
• Loading and storing of aligned and unaligned data. The
operations may be masked, so that a load ranging into an
unallocated memory range does not cause a fault.
• Broadcasting of a memory element in all elements of a vector.
• Elementary arithmetic operations of addition, subtraction,
multiplication, and division, as well as addsub (alternating
addition/subtraction), (inverse) square root, and reciprocal.
• Comparison, minimum, maximum, and rounding.
• Permutation of elements within a lane and permutation of lanes.
Figure 7-6. Examples of AVX functionality: simple vector addition (top left), in-lane
permutation (top right), broadcasting (bottom left), and mask loading (bottom right)
We will consider the direct programming of AVX later, in the section “Understanding
AVX: Intrinsic Programming.”
Data Dependences
In regard to pipeline conflicts, we covered data dependences that prevent instructions
from being executed in parallel in a pipelined, out-of-order, or superscalar fashion.
As this pertains to the pipeline, where arbitrary instructions might act on the same data
at different times, it applies even more so to vectors. Here, only a single instruction is
executed on multiple, possibly dependent data elements at exactly the same time. In this
sense, vector dependences are more controllable and more easily solved.
Recall the data conflicts discussed earlier: flow dependence (read after write, or RAW),
anti-dependence (write after read, or WAR), and output dependence (write after write,
or WAW). It is important to realize how dependences affect vectorization. Let’s look at a
simple example. When a variable is written in one iteration and read in a subsequent one,
we have a RAW dependence, as we can see within the loop code:
for(int i=0;i<length-1;i++){
a[i+1]=a[i];
}
After correct execution of the loop, all the elements should be set to the value in a[0]
(see Figure 7-7, left panel). Now, consider a two-element vectorized version of the loop.
At one time, two successive values will be loaded from the array (the parentheses indicate
the vector):
(a[1],a[2])=(a[0],a[1]);
Figure 7-7. Flow (RAW) data dependence-analysis of a shift-copy loop executed sequentially
and with a two-element vector
The second value is already wrong, according to the original algorithm. In the next
iteration, you get:
(a[3],a[4])=(a[2],a[3])
a[2] has already been changed to a[1] in the previous iteration and the corresponding
values are loaded. Carrying this on, you get, as the final result:
a[0],a[1],a[1],a[3],a[3],a[5] ...
This is obviously wrong according to the original algorithm (see Figure 7-7, right
panel). Clearly, the compiler must prevent this loop from being vectorized. Very often,
however, the compiler assumes an unproven vector dependence, although you will know
better that this will never occur; we will treat this case extensively later.
EXERCISE 7-1
You can enforce vectorization by placing a pragma simd (to be explained below)
before the loop.
#pragma simd
for(int i=0;i<length-1;i++){
a[i+1]=a[i];
}
Can you confirm the results? For which shifts i+1, i+2,... do you get correct
results? For which do you get wrong results?
Data Aliasing
Another, related reason why code does not vectorize is aliasing. By aliasing, we
mean that two variables (pointers or references) are associated with the same memory
region. Consider a simple copy function:
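A minimal version, consistent with the call mycopy(a, b, copylength) in the main function below (the exact listing is an assumption), could be:
void mycopy(double* a, double* b, int length){
    for(int i=0; i<length; i++){
        a[i] = b[i];   // a is the destination, b the source
    }
}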
In principle, the compiler should be able to vectorize this easily. But wait—can the
compiler be sure that the arrays a and b do not overlap? It cannot. And C/C++ explicitly
allows for this situation! Call the above function from the main function like this:
int main(void){
int length=100;
int copylength=50;
double* a;
double* b;
double* data = (double*) malloc(sizeof(double)*length);
a=&data[1];
b=&data[0];
mycopy(a,b,copylength);
}
You will get the same situation as with the earlier code showing an explicit vector
dependence. Consequently, the compiler must assume that there is a dependence.
Array Notations
The array notation (AN) introduced with Intel Cilk Plus is an Intel-specific language
extension of C/C++ that allows for direct expression of data-level parallelism (in contrast
to loops, which have the abovementioned problems). AN relieves the compiler of the
dependence and aliasing analysis to a degree, and provides an easier way to correct,
well-performing code.
AN introduces an array section notation that allows the specification of particular
elements, compact or regularly strided:
array[start:length:stride]
The syntax resembles the Fortran syntax, but Fortran programmers beware: the
semantics require start:length and not start:end!
Examples for the array section notation are:
a[0:10][0:10]=b[10:10][10:10];
a[0:10][0:10]=b[10:10][2][10:10];
The only requirement is that the number of ranks and rank sizes must match. AN
provides reducer intrinsics to perform all-element reductions. The expression
__sec_reduce_add(a[:]);
will return the sum of all elements. Of course, this can also be used with more complex
expressions as arguments, so that
__sec_reduce_add(a[:]*b[:]);
will return the dot product of a and b.
Let’s look at an example using AN. A problem often encountered in scientific codes is
partial differential equations. Consider a 1D acoustic wave equation,
\frac{\partial^2}{\partial t^2}\,\varphi(x,t) = c^2\,\frac{\partial^2}{\partial x^2}\,\varphi(x,t),
where x and t are continuous variables. This translates into a second-order finite
difference equation,
\frac{\varphi_x^{t+1} + \varphi_x^{t-1} - 2\varphi_x^{t}}{(\Delta t)^2} = c^2\,\frac{\varphi_{x+1}^{t} + \varphi_{x-1}^{t} - 2\varphi_x^{t}}{(\Delta x)^2},
where x and t are now discrete space and time indices with spacings Δx and Δt. We want
to know the strength of the field at time t+1 at position x. Solving the above equation
for the field element \varphi_x^{t+1} yields
\varphi_x^{t+1} = c^2\,\frac{(\Delta t)^2}{(\Delta x)^2}\left(\varphi_{x+1}^{t} + \varphi_{x-1}^{t} - 2\varphi_x^{t}\right) - \varphi_x^{t-1} + 2\varphi_x^{t}.
With prefac denoting the constant factor c²(Δt)²/(Δx)², the time-stepping loop in array
notation reads:
for(int i=0;i<iterations;i++){
f_next[1:size-2]=prefac*(f_curr[0:size-2]+f_curr[2:size-2]
-2.0*f_curr[1:size-2])-f_prev[1:size-2]+2.0*f_curr[1:size-2];
tmp=f_prev;
f_prev=f_curr;
f_curr=f_next;
f_next=tmp;
}
Although the compiler might vectorize this simple example even in straight C/C++,
more complex problems—for example, a three-dimensional wave equation or a finite
difference scheme solved to a higher order—might not vectorize, or not vectorize fully.
With AN, the vector code becomes explicit.
Vectorization Directives
Pragmas are an annotation technique you already learned about in the context of
OpenMP. They allow you to hint information to the compiler, for which other means of
expressing it in C/C++ or Fortran are not available.
If the compiler does not know a pragma, it treats it like a comment or an unknown
preprocessor directive and simply ignores it. Consequently, the resulting code remains
portable to compilers that don't support a certain feature, but it has the desired effect if
the compiler does understand the meaning of the pragma.
ivdep
The #pragma ivdep tells the compiler that assumed vector dependences in the following
loop body are to be ignored. Note that proven vector dependences are not affected. The
pragma has no further arguments. The #pragma ivdep is available in most compilers,
though its implementation might differ.
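A hedged sketch (the function and the offset k are assumptions): if the programmer knows that k never takes a value that creates a real dependence, the pragma lets the compiler vectorize the loop despite the assumed one.
void add_with_offset(double* a, int k, int n){
    #pragma ivdep
    for(int i=0; i<n; i++){
        a[i] = a[i+k] + 1.0;   // assumed (unproven) dependence on a is ignored
    }
}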
vector
The #pragma vector is similar in its effect to #pragma ivdep in the sense that it will ignore
assumed dependences but not proven ones, but it has additional optional clauses:
always: This overrides the heuristics on efficiency, alignment,
and stride.
aligned/unaligned: This tells the compiler to use aligned or
unaligned data movement for memory references.
temporal/nontemporal: This tells the compiler to use streaming
stores in the case of nontemporal, or to avoid them in the case of
temporal. Streaming stores write directly to memory, bypassing
the cache, which saves the read for ownership (RFO) that is
otherwise required to modify the data in the cache. The nontemporal
clause can take a comma-separated list of variables that should
be stored nontemporally.
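A brief, hedged illustration (the function is an assumption): streaming stores can be requested for a large, store-only loop.
void stream_init(double* a, int n){
    #pragma vector nontemporal
    for(int i=0; i<n; i++){
        a[i] = 0.0;   // stores bypass the cache, avoiding the read for ownership
    }
}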
simd
The #pragma simd is the most powerful of the three vectorization pragmas: it tells the
compiler to ignore any heuristics and any dependences, proven or not; you are fully
responsible for making sure the result is correct. #pragma simd is a powerful
construct that can be extended by further arguments and subclauses, some similar in spirit
to the OpenMP parallel for pragma discussed earlier. We discuss them briefly here:
• vectorlength(arg1): Tells the compiler to use the specified vector
length; arg1 must be a power of 2. Ideally, the vector length
is the maximum vector length supported by the underlying
hardware, such as 2 for SSE2 double vectors or 4 for AVX double
vectors. For example:
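A minimal sketch of such an example (array and function names are assumptions), forcing a vector length of 4 to match AVX double-precision vectors:
void scale_add(double* a, double* b, int n){
    #pragma simd vectorlength(4)
    for(int i=0; i<n; i++){
        a[i] = a[i] + b[i];
    }
}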
The following clauses of pragma simd resemble the interface of the OpenMP
data-sharing clauses and may be thought of in the same manner—just that a vector lane
corresponds to a thread:
• private(var1[,var2,...]): With this clause the compiler will assume
that the scalar variable var1 can take different values in each loop
iteration (see also the OpenMP private clause). The initial (at
start of the loop) and final (after the loop) values are undefined,
so make sure you set the value in the loop and after the loop
completion. For example:
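A sketch of such an example (names are assumptions), in which the scalar x is private to each vector lane:
void squared_difference(double* a, double* b, double* c, int n){
    double x;
    #pragma simd private(x)
    for(int i=0; i<n; i++){
        x = b[i] - c[i];   // set the private value inside the loop
        a[i] = x * x;      // x is undefined before and after the loop
    }
}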
Understanding AVX: Intrinsic Programming
Intrinsics mostly operate on and return vector types. For AVX, those are 256-bit
vectors and the types are as follows:
Eight 32-bit integer elements or four 64-bit integer
elements: __m256i
Eight single precision elements: __m256
Four double precision elements: __m256d
We will focus here on the double-precision types for the sake of brevity; everything
we present applies to single-precision types in a similar fashion. A listing of all intrinsics
can be found in the “Intel Intrinsics Guide.”8
The 256-bit floating-point intrinsics for AVX start with _mm256_, followed by a
meaningful description of the functionality—say, add—and then two letters encoding
packed (p) or scalar (s) operation, as well as single (s) or double (d) precision. Packed and
scalar in this context mean executing the operation on all elements of the vector, or only
on the first element (see Figure 7-8). For example:
c=_mm256_add_pd(a,b);
This will add the elements of the vectors a and b and write the results into the
elements of c. The a, b, and c are of type __m256d. See also Figure 7-6.
Intrinsics are only available for C/C++ and can be used in the source code freely.
All AVX intrinsics are listed in the file immintrin.h that you will find in the include
directory of Intel Composer XE 2015. There are hundreds of intrinsics; we will restrict
discussion to the most important ones.
Sometimes you want to have one value in all of the vector elements. This is called
broadcasting:
• __m256d a = _mm256_broadcast_sd(double* memptr): Copies the
double-precision value contained in the 64 bits following memptr
into all four elements of a vector register. No alignment is
required.
Arithmetic
Now that we can load data into registers, we can start computing something. The four
basic arithmetic instructions are as follows:
• __m256d c = _mm256_add_pd(__m256d a, __m256d b): Adds the
four elements in the registers a and b element-wise and puts
the result into c.
• __m256d c = _mm256_sub_pd(__m256d a, __m256d b): Subtracts the
four elements in the register b from a element-wise and puts
the result into c.
• __m256d c = _mm256_mul_pd(__m256d a, __m256d b): Multiplies the
four elements in the registers a and b element-wise and puts
the result into c.
• __m256d c = _mm256_div_pd(__m256d a, __m256d b): Divides the
four elements in the register a by b element-wise and puts
the result into c.
These four are already sufficient to do some important computation. Very often,
multiplication of very small matrices is required—for instance, of 4x4 matrices in a
scenario covering three-dimensional space and time. (We will revisit this example later in
this chapter.) There are highly optimized libraries providing the BLAS9 (Basic Linear Algebra
Subprograms) functionality, such as matrix-matrix multiplication. Those libraries, such
as Intel MKL, are powerful and feature rich, but we might require less functionality. Say
we don't need the multiplicative factors of DGEMM (double-precision general matrix-matrix
multiplication), just a straight multiplication of the matrices. In this case, a special matrix-matrix
multiplication like the following would do the trick:
#include <immintrin.h>
void dmm_4_4_4(double* a, double* b, double* c){
int i;
__m256d xa0;
__m256d xa1;
__m256d xa2;
__m256d xa3;
__m256d xb0;
__m256d xb1;
__m256d xb2;
__m256d xb3;
__m256d xc0;
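// load the four rows of the 4x4 matrix b once; they are reused for every row of c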
xb0 = _mm256_loadu_pd(&b[0]);
xb1 = _mm256_loadu_pd(&b[4]);
xb2 = _mm256_loadu_pd(&b[8]);
xb3 = _mm256_loadu_pd(&b[12]);
for(i=0;i<4;i+=1){
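// row i of c: c[i][:] += a[i][0]*b[0][:] + a[i][1]*b[1][:] + a[i][2]*b[2][:] + a[i][3]*b[3][:]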
xc0 = _mm256_loadu_pd(&c[i*4]);
xa0=_mm256_broadcast_sd(&a[i*4]);
xa1=_mm256_broadcast_sd(&a[i*4+1]);
xa2=_mm256_broadcast_sd(&a[i*4+2]);
xa3=_mm256_broadcast_sd(&a[i*4+3]);
xc0=_mm256_add_pd(_mm256_mul_pd(xa0,xb0),xc0);
xc0=_mm256_add_pd(_mm256_mul_pd(xa1,xb1),xc0);
xc0=_mm256_add_pd(_mm256_mul_pd(xa2,xb2),xc0);
xc0=_mm256_add_pd(_mm256_mul_pd(xa3,xb3),xc0);
_mm256_storeu_pd(&c[i*4],xc0);
}
}
EXERCISE 7-2
Data Rearrangement
Of course, those few intrinsics are not all there are; only rarely do we get data presented
as readily usable as with a matrix multiplication. More frequently, data needs to be
rearranged within one vector or between different vectors. The intrinsic functions specialized
in data rearrangement are often difficult to configure, as you will see. Still, these intrinsics
are highly important, because this is exactly what the compiler has the biggest problems
with. We provide some examples for configurations that are useful, but we don’t claim
completeness:
• __m256d b = _mm256_permute_pd(__m256d a, int m): Permutes
the elements within each lane according to the bits in m from left
to right. If the bit is 0, take the first element; if the bit is 1, take the
second.
Let’s look at the functionality with an example: Consider a vector containing the
values a0-a3: (a3,a2,a1,a0). Then, _mm256_permute_pd allows you to move the elements
of the vector within each lane (remember—a lane is half the vector); see Table 7-1 for
examples.
Let’s see what we can do with all this. Consider a cyclic rotation of a vector by one
element (see also Figure 7-9):
(a3,a2,a1,a0) → (a0,a3,a2,a1)
Figure 7-9. Right panel: The lane concept of AVX. Left panel: The construction of a cycle
rotate of a double vector with in-lane and cross-lane permutes
1. Permute the elements within each lane, swapping the two
neighboring elements:
b=_mm256_permute_pd(a,5);
2. Swap the two 128-bit lanes with a cross-lane permute:
c=_mm256_permute2f128_pd(b,b,1);
3. Blend the vectors b and c, taking the first and third elements
from the second source and the second and fourth elements
from the first source:
d=_mm256_blend_pd(c,b,5);
EXERCISE 7-3
Create a version of the cyclic rotate that shifts by two and three elements.
double* data = (double*) malloc(sizeof(double)*128);
double* ref1 = &data[0];
double* ref2 = &data[50];
Of course, ref1[50] and ref2[0] are referring to the same memory address:
for(int i=0;i<50;i++){
ref1[i+m]=2*ref2[i];
}
If we assume m to be known at compile time, we can easily observe what the compiler
is doing. For 0<=m<=50, the above code is vectorizable. For m=51, we obviously have a RAW
dependence and the vectorization fails. If m is not known at compile time, the compiler
will assume both a RAW and WAR dependence.
In practice, you will often encounter exactly these situations, in which the compiler
has to make conservative assumptions to guarantee the correct execution of the program.
In most cases, this is related to a function signature like that of our earlier copy function:
void mycopy(double* a, double* b, int length);
The compiler has to assume that a and b may reference the same memory region. It will
therefore suspect that there might be a dependence. In most cases, we will know that the
assumed vector dependence can never happen, and so we need ways to hint to the
compiler that it should not interfere. The following are various methods that can be used
to allow the compiler to vectorize our code.
• Compiler switches: The switch -no-ansi-alias disables the
use of ANSI aliasing rules and allows the compiler to optimize
more aggressively. Notice that this is the default. The opposite is
-ansi-alias, which will enforce the ANSI aliasing rules. A more
aggressive version is -fno-alias, where no aliasing is assumed
altogether. Both compiler switches are effective for the whole file
that is currently compiled; you want to be careful applying those
switches when more than the function under consideration is
contained in the file.
• The restrict keyword: A more comfortable and precise way to
instruct the compiler to ignore assumed dependences is to use
the restrict keyword defined in the C99 standard. Placed right in
front of the variable, it indicates that this pointer is not referencing
data referenced by other pointers. For example:
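Applied to the copy function from before, a sketch would be (the qualifier placement follows C99):
void mycopy(double* restrict a, double* restrict b, int length){
    for(int i=0; i<length; i++){
        a[i] = b[i];   // restrict promises the compiler that a and b do not alias
    }
}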
■ Note This section addresses the bad speculation category of the top-down
method for pipeline performance.
__builtin_expect
If you observe bad speculation at a particular conditional branch, and you have a good
estimate of what the expected value should be, it is quite easy to let the compiler know
about it explicitly by using the built-in function __builtin_expect. The syntax is quite
straightforward: instead of writing the condition in the argument of the if statement, you
write if(__builtin_expect(condition,expectation)), where expectation is either 0 for
false or 1 for true. For example:
if(__builtin_expect(x<0,1)){
somefunction(x);
} else {
someotherfunction(x);
}
Profile-Guided Optimization
If you don't have a good clue as to which conditions to put first into if statements, then
profile-guided optimization (PGO) might help. The virtue of this technique is that it gives
you a way to check for exactly such cases as wrongly predicted conditions and to
correct them without touching the source code. PGO is a three-step process:
1. Create an instrumented binary with the compiler option
-prof-gen.
2. Run this binary with one or more representative workloads.
This will create profile files containing the desired
information.
3. Compile once more with the compiler option -prof-use.
unroll/nounroll
The #pragma unroll allows you to control the unroll factor of loops. Unrolling a loop can
speed up the loop execution considerably, because the loop condition doesn't need to
be checked as often and additional optimizations might become possible in the unrolled
code. Loop unrolling does, however, increase the code size and the pressure on the
instruction cache, the decode unit, and the registers.
The #pragma unroll can take an additional argument indicating the unroll factor.
For example:
#pragma unroll(2)
for(int i=0;i<size;i++){
a[i]=b[i]*c[i];
}
// unrolled loop
for(int i=0;i<size-(size%2);i+=2){
a[i]=b[i]*c[i];
a[i+1]=b[i+1]*c[i+1];
}
// remainder loop - deals with remaining iterations
// for sizes not divisible by 2
for(int i=size-(size%2);i<size;i++){
a[i]=b[i]*c[i];
}
EXERCISE 7-4
Write a program with a simple loop, such as the above one, for a different loop
length. Compile it with -xAVX -opt-report5. Do you get unrolling? Try placing a
#pragma unroll in front of the loop; can you change the unrolling behavior of the
compiler for this loop? Have a look at the earlier discussion on optimization reports.
unroll_and_jam/nounroll_and_jam
The #pragma unroll_and_jam performs a nested loop transformation in which
the outer loop is unrolled and the resulting inner loops are then fused. Consider our
squaregemm function used in the earlier optimization report example. If you put a
#pragma unroll_and_jam in front of the middle loop, the compiler unrolls that loop and
fuses the resulting copies of the innermost loop.
A common technique is to provide specialized versions of a hot function for the most
frequent parameter values and to dispatch to them with a switch statement, falling back
to the generic version otherwise:
void myswitchedpolynomial(int n, ...){
switch(n){
case 4:
polynomial_4(...);
break;
case 8:
polynomial_8(...);
break;
case 16:
polynomial_16(...);
break;
default:
polynomial_n(n, ...);
}
}
int main(void){
int msize=4; // the matrix side length
int msize2=msize*msize; // the number of elements in each matrix
int nmatrices=100000; // how many c-matrices are there
int nab=100; // how many a and b matrix multiplication
// per c-matrix
double** b = (double**) _mm_malloc(sizeof(double*)*nmatrices,32);
double** a = (double**) _mm_malloc(sizeof(double*)*nmatrices,32);
double* c = (double*) _mm_malloc(sizeof(double)*nmatrices*msize2,32);
// allocate matrices
for(int i=0;i<nmatrices*msize2;i++){
c[i]=((double) rand())/((double) RAND_MAX);
}
for(int i=0;i<nmatrices;i++){
b[i] = (double*) _mm_malloc(sizeof(double)*msize2*nab,32);
a[i] = (double*) _mm_malloc(sizeof(double)*msize2*nab,32);
}
// init matrices
for(int i=0;i<nmatrices;i++){
for(int n=0;n<nab*msize2;n++){
b[i][n]=((double) rand())/((double) RAND_MAX);
a[i][n]=((double) rand())/((double) RAND_MAX);
}
}
We will first use Intel MKL’s DGEMM method to perform the matrix-matrix
multiplication:
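A sketch of the call sequence, continuing the main() listing above and consistent with the cblas_dgemm() call shown later in this chapter (the loop structure is an assumption; mkl.h provides the prototype):
// accumulate the nab products a*b into the corresponding c matrix
for(int i=0; i<nmatrices; i++){
    for(int n=0; n<nab; n++){
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    msize, msize, msize, 1.0, &a[i][n*msize2], msize,
                    &b[i][n*msize2], msize, 1.0, &c[i*msize2], msize);
    }
}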
DGEMM is too powerful a method, since we are asking for a lot fewer features than
it has to offer (it can also transpose the matrices and multiply with scalar factors, all of
which we don’t need here).
Figure 7-10 shows the output of a basic hotspot analysis done with VTune Amplifier XE.
As expected, we have the majority of time in the DGEMM method, with some fraction coming
from the initialization (functions main and rand).
A simple first improvement in the initialization is to replace the division by RAND_MAX
with a multiplication by a precomputed factor randnorm = 1.0/RAND_MAX:
for(int i=0;i<nmatrices*msize2;i++){
c[i]=((double) rand())*randnorm;
}
Similarly, for the initialization of a and b. More severe seems to be the low
floating-point utilization of 0.034. This goes hand in hand with our observation of low
GFLOP rates. In the next section, we will make a first attempt to tackle this.
Since we are exclusively using 4x4 matrix multiplications, the additional if statement
will not cause pipeline flushes due to branch mispredictions; still, the method remains valid
for all square matrix sizes.
Let's now create a new binary with the code changes outlined above and rerun the general
exploration analysis of VTune. Figure 7-12 shows the output of VTune. What a big leap:
we improved from 17.474s to 6.166s in execution time. The CPI dropped, though,
but this reinforces the point that it is important which instructions we retire, not how
many. The FP arithmetic ratio took a big step from 0.034 to 0.213 (see output). This result
is going in the right direction. Now let's estimate the floating-point performance. A basic
hotspot analysis shows that we spend 4.04s in main; we'll assume that this is all compute,
for the time being. Following the above considerations, we do 12.80 GFLOP/4.04s = 3.17
GFLOP/s. That's a lot better, but it is still not enough! If you look at Figure 7-12 again,
you will see that all the FP arithmetic is spent in "FP scalar." The code doesn't vectorize
properly. Now, let's use the compiler directives to help the compiler vectorize the code.
Figure 7-12. Summary page of general exploration analysis after changing to an explicit c
expression
} else {
cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
m, m, m, 1.0, a, m, b, m, 1.0, c, m);
}
}
Here, we unrolled the inner loop four times and vectorized the outer loop.
Figure 7-13 shows the result of a general exploration analysis. We got even better
results, but—no surprise—the "FP arithmetic" fraction nearly vanished, because now
everything is executed as vector operations. Compared to the memory operations,
comparatively little time is spent computing. The pressure is now almost fully on the
memory subsystem, so the execution side performs perfectly.
EXERCISE 7-5
Try to insert the intrinsics method for a 4x4 matrix multiplication developed earlier
in this chapter into our sample problem. Can you get better than the result achieved
with the compiler only?
Summary
We presented a brief account of the microarchitecture of modern Intel processors, and
discussed how to detect microarchitectural issues and how to solve them, ranging from
hinting to the compiler via directives to programming brute-force solutions using intrinsics.
As for microarchitectural design, the scope of this chapter is certainly too tight to
look much deeper. Many good textbooks are available, such as the standards by Hennessy
and Patterson13 or Tanenbaum.14 Hager and Wellein15 focus particularly on tuning and
performance aspects. Intel’s software developer manual16 and its optimization reference
manual17 are always good, if extensive, reads.
Innovative use of the techniques presented here can be found in the
open-source space, particularly in software with confined hotspots, as in quantum
chromodynamics, molecular dynamics, or quantum chemistry applications such as
CP2K18 or Gromacs.19
References
1. S. P. Dandamudi, Introduction to Assembly Language Programming
(Springer, 2005).
2. Intel, “Intel 64 and IA-32 Architectures Software Developer Manuals,”
2014, www.intel.com/products/processor/manuals.
3. A. S. Tanenbaum, Structured Computer Organization, 5th ed.
(Pearson, 2006).
4. Ibid.
5. J. L. Hennessy and D. A. Patterson, Computer Architecture
(Morgan Kaufmann, 2007).
6. “Intel 64 and IA-32 Architectures Software Developer Manual.”
7. Intel, “Intel 64 and IA-32 Architectures Optimization Reference
Manual,” 2014, www.intel.com/products/processor/manuals.
8. "Intel Intrinsics Guide," https://software.intel.com/sites/
landingpage/IntrinsicsGuide.
9. “BLAS (Basic Linear Algebra Subprograms),”
http://www.netlib.org/blas.
10. "Intel Intrinsics Guide."
11. J. Hutter, M. Krack, T. Laino, and J. VandeVondele, “CP2K Open
Source Molecular Dynamics,” www.cp2k.org.
12. "BLAS (Basic Linear Algebra Subprograms)."
13. Hennessy and Patterson, Computer Architecture.
CHAPTER 8
Application Design Considerations
In Chapters 5 to 7 we reviewed the methods, tools, and techniques for application tuning,
explained by using examples of HPC applications and benchmarks. The whole process
followed the top-down software optimization framework explained in Chapter 3. The
general approach to the tuning process is based on a quantitative analysis of execution
resources required by an application and how these match the capabilities of the platform
the application is run on. The blueprint analysis of platform capabilities and system-level
tuning considerations were provided in Chapter 4, based on several system architecture
metrics discussed in Chapter 2.
In this final chapter we would like to generalize the approach to application
performance analysis, and offer a different and higher level view of application and
system bottlenecks. The higher level view is needed to see potentially new, undiscovered
performance limitations caused by invisible details inside the internal implementations
of software, firmware, and hardware components.
Types of Abstractions
An abstraction is a technique used to separate conceptualized ideas from specific
instances and implementations of those at hand. These conceptualizations are used to
hide the internal complexity of the hardware, allow portability of software, and increase
the productivity of development via better reuse of components. Abstractions that are
implemented in software, middleware, or firmware also allow for fixing hardware bugs
with software that results in a reduced time to market for very complex systems, such
as supercomputers. We believe it is generally good to have the right level of abstraction.
Abstractions today are generally an unavoidable thing: we have to use different kinds
of APIs because an interaction with the raw hardware is (almost) impossible. During
performance optimization work, any performance overhead must be quantified to
judge whether there is need to consider a lower level of abstraction that could gain more
performance and increase efficiency.
Abstractions apply to both control flow and data structures. Control abstraction hides
the order in which the individual statements, instructions, or function calls of a program
are executed. The data abstraction allows us to use high-level types, classes, and complex
structures without the need to know the details about how they are stored in a computer
memory or disk, or are transferred over the network. One can regard the notion of an
object in object-oriented programming as an attempt to combine abstractions of data and
code, and to deal with instances of objects through their specific properties and methods.
Object-oriented programming is sometimes a convenient approach that improves code
modularity, reuses software components, and increases productivity of development and
support of the application.
Some examples of control flow abstractions that a typical developer in
high-performance computing will face include the following:
• Decoding of processor instruction set into microcode. These are
specific for a microarchitecture implementation of different
processors. The details of the mapping between processor
instructions and microcode operations are discussed in Chapter 7.
The mapping is not a simple one-to-one or one-to-many
relation. With technologies like macro fusion,1 the number of
internal micro-operations may end up smaller than the number
of incoming instructions. This abstraction allows processor
designers to preserve a common instruction set architecture (ISA)
across different implementations and to extend the ISA while
preserving backwards compatibility. The decoding of processor
instructions into micro-operations is a pipeline process, and it
usually does not cause performance penalties in HPC codes.
• Virtual machine, involving just-in-time compilation (JIT, widely
used, for example, in Java or in the Microsoft Common Language
Runtime [CLR] virtual machines) or dynamic translation (such
as in scripting or interpreted languages, such as Python or Perl).
Here, compilation is done during execution of a program, rather
than prior to execution. With JIT, the program can be stored in
a higher level compressed byte-code that is usually a portable
representation, and a virtual machine translates it into processor instructions at run time.
The details of virtual-to-physical memory translation are invisible to the application. However, this abstraction has a hidden cost, which
may be seen as a performance cost associated with the loading of page tables. Some
measurements (such as one reported by Linus Torvalds)5 provide an estimate of over
1000 processor cycles required for handling a page fault in the Linux operating system on
modern processors.
of the operating system kernel. The active development of several commercial (like
ones by VMWare, Parallels, etc.) and open-source (Xen, KVM, etc.) hypervisors helped
establish hardware virtualization as a base technology for enterprise data center and
cloud computing applications. It promoted the development of such popular directions
these days as software-defined storage (SDS) and software-defined networks (SDN),
and finally brought the concept of the software-defined data center (SDDC) that extends
virtualization concepts such as abstraction, pooling, and automation to all of the data
center’s resources and services to achieve IT as a service.
Complete system virtualization brings certain operational advantages, such as
simplified provisioning (through a higher level of integration of application software with
the operating system environment) and a stable software image presented to applications
(with emulation of newer or obsolete hardware handled at the VMM level), which is
beneficial for making legacy software work on modern hardware without software
modifications. For enterprise and cloud applications, virtualization offers additional
value, as a hypervisor allows for the implementation of several reliability techniques
(virtual machine migration from one machine to another, system-level memory
snapshotting, etc.) and for utilization improvements via consolidation (i.e., putting several
underutilized virtual machines on one physical server).
However, hardware virtualization has not progressed at the same pace within the
HPC user community. Though the main quoted reason for not adopting hardware
virtualization is the performance overhead caused by the hypervisor, it is probably the most
debatable one. There are studies showing that the overhead is rather small for certain
workloads, and that running jobs using a pay-per-use model can be more cost-effective than
buying and managing your own cluster.10 We tend to believe there are other reasons; for
example, that the values of virtualization recognized by enterprise and cloud application
customers are not compelling for HPC users. Consolidation is almost of no use (though it
is possible to implement it using popular HPC batch job schedulers), and live migration
and snapshotting are not more scalable than the checkpointing techniques used in HPC.
However, the cost reduction of virtualized hardware, predominantly hosted by large cloud
providers, in some sense already generates demand for exploring high-performance
computing applications in hosted cloud services.
This trend will drive a need for optimization of HPC applications (which are tightly
coupled, distributed memory applications) for execution in the hosted virtualized
environments, and we see a great need for the tools and techniques to evolve to efficiently
carry out this job.
252
CHAPTER 8 ■ APPLICATION DESIGN CONSIDERATIONS
253
CHAPTER 8 ■ APPLICATION DESIGN CONSIDERATIONS
254
CHAPTER 8 ■ APPLICATION DESIGN CONSIDERATIONS
best working approach for achieving maintainability and successful evolution of the
applications. Again, the levels of abstractions in nonperformance-critical parts of the
program are of no importance; choose whatever abstraction you find suitable and keep it
as flexible as possible to ensure smooth code evolution.
However, ask yourself a couple of questions about parts of the programs contributing
most to overall runtime:
• What are the predominant data layouts and the data access
patterns inside the critical execution path?
• How is the parallelism exposed in the critical execution path?
Sometimes the use of specialized, highly optimized libraries to implement
time-consuming algorithms in the program will help achieve flexibility and portability,
and will define the answers to these questions. As discussed earlier, software libraries,
such as Intel MKL, will offer you a useful abstraction and will hide the complexity of the
implementation. But let us discuss these questions in greater details, in case you are
working on an algorithm yourself.
Data Layout
The first question above is about data abstractions. Most, if not all, computer
architectures benefit from sequential streaming of data accesses, and the ideal situation
happens when the amount of repeatedly accessed data fits into the processor caches that
are roughly 2.5MiB per core in modern Intel Core-based processors. Such behavior is a
consequence of the double-data rate (DDR) dynamic random access memory (DRAM)
module architecture used by modern computers. If the data access is wrapped into
special containers (as often observed in C++ programs), frequent access to that data can
add overhead from the “envelope” around data bits that may be higher than the actual
time of computing with the values.
The data layout is very important to consider when ensuring efficient use of SIMD
processing, as discussed in Chapter 7. Let’s consider an example where an assemblage
of three values is defined within a single structure and corresponding values from each
set are to be processed simultaneously, where the pointers to that enclosing structure
are passed around as function arguments. This can be, for instance, a collection of three
coordinates of points in space, x, y, and z; and our application has to deal with N of such
points. To store the coordinates in memory we could consider two possible definitions for
structures (using C language notation) presented in Listings 8-1 and 8-2.
Listing 8-1. Coordinates stored as a structure of arrays (SoA)
#define N 1024
typedef struct _SoA {
double x[N];
double y[N];
double z[N];
} SoA_coordinates;
SoA_coordinates foo;
// access i'th element of array as foo.x[i], foo.y[i], and foo.z[i]
Listing 8-2. Coordinates stored as an array of structures (AoS)
#define N 1024
typedef struct _AoS {
double x;
double y;
double z;
} AoS_coordinates;
AoS_coordinates bar[N];
// access i'th element of array as bar[i].x, bar[i].y, and bar[i].z
The layouts in memory for each of the options are shown in Figure 8-1.
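To make the difference concrete, here is a hedged sketch of the same simple operation on both layouts, reusing the structures defined above (the kernels themselves are assumptions): with SoA the inner loop walks unit-stride arrays that the compiler can load as full vectors, whereas with AoS each coordinate is accessed with a stride of three doubles.
void scale_soa(SoA_coordinates* p, double s){
    for(int i=0; i<N; i++){    // unit-stride accesses: x[], y[], z[] are contiguous
        p->x[i] *= s;
        p->y[i] *= s;
        p->z[i] *= s;
    }
}

void scale_aos(AoS_coordinates* q, double s){
    for(int i=0; i<N; i++){    // stride-3 accesses: x, y, z are interleaved in memory
        q[i].x *= s;
        q[i].y *= s;
        q[i].z *= s;
    }
}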
Figure 8-2. Data layouts in memory for the arrays of complex numbers available in BQCD
The BQCD build system provides simple selection of the storage layouts and also permits choosing a different code path for the performance-critical sections of the application that deal with that data. The developers of BQCD have invested great effort in developing highly optimized code for the several computer architectures on which BQCD is typically run.
The results obtained by the BQCD developers14 on a server with two Intel Xeon E5-2695 v2 processors are summarized in Table 8-1. They show that the SIMD, or AoSoA, layout with the optimized code path delivers the best performance for the hot loops, ahead of the vector and standard layouts.
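The AoSoA (array of structures of arrays) layout mentioned above blends the two earlier definitions: short, fixed-length arrays sized to the SIMD width are kept inside every structure element, so that vector loads stay unit-stride while related values remain close together in memory. The following fragment is our own sketch of the idea, with an assumed vector length of four doubles, and not the actual BQCD data structures:

#define VL 4                 /* assumed SIMD width: 4 doubles per 256-bit register */
#define NBLK (N / VL)        /* N from Listings 8-1 and 8-2, assumed divisible by VL */

typedef struct _AoSoA {
    double x[VL];
    double y[VL];
    double z[VL];
} AoSoA_coordinates;

AoSoA_coordinates baz[NBLK];
// access the i'th element as baz[i / VL].x[i % VL], and likewise for y and z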
■ Note Often, for memory bandwidth-bound kernels, the performance of the compute kernels can be 10 times higher when the dataset fits into the Level 2 cache than when the data resides in main memory.
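One common way to act on this observation is to block (tile) the computation so that a working set small enough for the cache is reused before the code moves on. The following sketch is our own illustration with an assumed tile size and a placeholder element-wise update; because each element is updated independently of the others, applying all passes to one cache-resident block at a time produces the same result as sweeping over the whole array once per pass, but with far fewer trips to main memory:

#define BLOCK (16 * 1024)    /* assumed tile: 16 Ki doubles = 128 KiB, fits a typical 256 KiB L2 */
#define NPASS 8              /* number of sweeps over the data */

void update_blocked(double *a, long n)
{
    for (long start = 0; start < n; start += BLOCK) {
        long end = (start + BLOCK < n) ? start + BLOCK : n;
        for (int pass = 0; pass < NPASS; pass++)    /* reuse the cache-resident tile */
            for (long i = start; i < end; i++)
                a[i] = 0.5 * (a[i] + a[i] * a[i]);  /* placeholder per-element update */
    }
}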
Figure 8-3. Overview of parallel control and data management patterns (Source: Structured
Parallel Programming: Patterns for Efficient Computation)
Often found among HPC application patterns is the partition pattern. It provides a simple way to ensure both data locality and isolation: the data is broken down into a set of nonoverlapping regions, and multiple instances of the same compute task are assigned to operate on these partitions. Since the partitions do not overlap, the computational tasks can operate in parallel, each on its own working set, without interfering with one another. While selecting the control flow and the data layouts, one specific issue to watch for is load imbalance. The best application performance is achieved when all computing elements are fully loaded and the computational load is evenly distributed among them, as the sketch after this paragraph illustrates.
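As a minimal sketch of the partition pattern, and of one simple way to counter load imbalance, consider the following OpenMP fragment (our own illustration with a placeholder work function). The index space is split into nonoverlapping partitions, each task works only on its own partition, and the dynamic schedule hands partitions to whichever thread becomes idle, which helps when the partitions carry unequal amounts of work:

#include <omp.h>

#define NPART 64             /* assumed number of partitions */

/* placeholder for the real computation on one nonoverlapping partition */
void process_partition(double *data, long begin, long end);

void run_partitions(double *data, long n)
{
    long chunk = (n + NPART - 1) / NPART;

    /* disjoint partitions: the tasks do not interfere with each other */
    #pragma omp parallel for schedule(dynamic)
    for (int p = 0; p < NPART; p++) {
        long begin = (long)p * chunk;
        long end = (begin + chunk < n) ? begin + chunk : n;
        if (begin < end)
            process_partition(data, begin, end);
    }
}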
support the standard or specification. And since standards and open specifications are supported by multiple vendors of hardware and middleware, it is much easier to protect the investments made in your program development.
Summary
We discussed the data and control abstractions used on computer systems today, across all hardware and software layers. Layered implementations are used to enable component-level design, increase code portability, and protect investments. However, increased levels of abstraction add complexity and may impact performance. Very often the abstractions are unavoidable, as they are hidden inside the implementation of components that are outside of your control. At the same time, developers can often choose the coding abstractions they use when implementing a program or improving the performance of an existing application.
There is no universal way to write the best and fastest-performing application. Usually the achieved performance is a compromise that involves many points of view. To find the best balance, we suggest analyzing the abstractions involved and then judging whether the tradeoffs are reasonable and acceptable. We suggested several questions to ask when weighing scaling against performance, flexibility against specialization, and recomputing data against storing it in memory or transferring it over the network, as well as when determining the bounds and bottlenecks and obtaining a total productivity assessment. Answering these questions will deepen your understanding of the program internals and of the ecosystem around it, and may lead to new ideas about how to achieve even higher performance for your application.
References
1. Intel Corporation, “Intel 64 and IA-32 Architectures Optimization Reference
Manual,” www.intel.com/content/www/us/en/architecture-and-technology/
64-ia-32-architectures-optimization-manual.html.
2. C. L. Lawson, J. R. Hanson, D. R. Kincaid, and F. T. Krogh, “Basic Linear Algebra
Subprograms for Fortran Usage,” ACM Transactions on Mathematical Software
(TOMS) 5, no. 3 (1979): 308–23.
3. E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, et al., LAPACK Users’ Guide,
3rd ed. (Philadelphia: Society for Industrial and Applied Mathematics, 1999).
4. M. Frigo and S. Johnson, "FFTW: An Adaptive Software Architecture for the FFT," in Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 3 (Seattle: IEEE, 1998).
5. L. Torvalds, “Linus Torvalds Blog,” https://plus.google.com/+LinusTorvalds/
posts/YDKRFDwHwr6.
6. Intel Corporation, “An Introduction to the Intel QuickPath Interconnect, Document
Number: 320412,” January 2009, www.intel.com/content/dam/doc/white-paper/
quick-path-interconnect-introduction-paper.pdf.
Optimizing HPC
Applications with
Intel® Cluster Tools
Alexander Supalov
Andrey Semin
Michael Klemm
Christopher Dahnken
Optimizing HPC Applications with Intel® Cluster Tools
Alexander Supalov, Andrey Semin, Michael Klemm, and Christopher Dahnken
Copyright © 2014 by Apress Media, LLC, all rights reserved
ApressOpen Rights: You have the right to copy, use and distribute this Work in its entirety, electronically
without modification, for non-commercial purposes only. However, you have the additional right to use
or alter any source code in this Work for any commercial or non-commercial purpose which must be
accompanied by the licenses in (2) and (3) below to distribute the source code for instances of greater than
5 lines of code. Licenses (1), (2) and (3) below and the intervening text must be provided in any use of the
text of the Work and fully describes the license granted herein to the Work.
(1) License for Distribution of the Work: This Work is copyrighted by Apress Media, LLC, all rights reserved. Use of this Work other than as provided for in this license is prohibited. By exercising any of the rights herein, you are accepting the terms of this license. You have the non-exclusive right to copy, use and distribute this English language Work in its entirety, electronically without modification except for those modifications necessary for formatting on specific devices, for all non-commercial purposes, in all media and formats known now or hereafter. While the advice and information in this Work are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.
If your distribution is solely Apress source code or uses Apress source code intact, the following licenses
(2) and (3) must accompany the source code. If your use is an adaptation of the source code provided by
Apress in this Work, then you must use only license (3).
(2) License for Direct Reproduction of Apress Source Code: This source code, from Optimizing HPC Applications with Intel® Cluster Tools, ISBN 978-1-4302-6496-5, is copyrighted by Apress Media, LLC, all rights reserved. Any direct reproduction of this Apress source code is permitted but must contain this license. The following license must be provided for any use of the source code from this product of greater than 5 lines wherein the code is adapted or altered from its original Apress form. This Apress code is presented AS IS and Apress makes no claims to, representations or warrantees as to the function, usability, accuracy or usefulness of this code.
(3) License for Distribution of Adaptation of Apress Source Code: Portions of the source code provided are used or adapted from Optimizing HPC Applications with Intel® Cluster Tools, ISBN 978-1-4302-6496-5, copyright Apress Media LLC. Any use or reuse of this Apress source code must contain this License. This Apress code is made available at Apress.com/9781430264965 as is and Apress makes no claims to, representations or warrantees as to the function, usability, accuracy or usefulness of this code.
ISBN-13 (pbk): 978-1-4302-6496-5
ISBN-13 (electronic): 978-1-4302-6497-2
Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.
Publisher: Heinz Weinheimer
Associate Publisher: Jeffrey Pepper
Lead Editors: Steve Weiss (Apress); Stuart Douglas (Intel)
Coordinating Editor: Melissa Maldonado
Cover Designer: Anna Ishchenko
Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring
Street, 6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail
[email protected], or visit www.springeronline.com.
For information on translations, please e-mail [email protected], or visit www.apress.com.
About ApressOpen
What Is ApressOpen?
• ApressOpen is an open access book program that publishes
high-quality technical and business information.
• ApressOpen eBooks are available for global, free,
noncommercial use.
• ApressOpen eBooks are available in PDF, ePub, and Mobi formats.
• The user-friendly ApressOpen free eBook license is presented on
the copyright page of this book.
To Irina, Vladislav, and Anton, with all my love.
—Alexander Supalov
For my beautiful wife, Nadine, and for my daughters—Eva, Viktoria, and Alice.
I’m so proud of you!
—Andrey Semin
To my family.
—Michael Klemm
Summary ............................................................................................... 10
References ............................................................................................ 10
■ Chapter 2: Overview of Platform Architectures ............................ 11
Performance Metrics and Targets ......................................................... 11
Latency, Throughput, Energy, and Power ................................................................ 11
Peak Performance as the Ultimate Limit ................................................................ 14
Scalability and Maximum Parallel Speedup ........................................................... 15
Summary ............................................................................................... 36
References ............................................................................................ 37
■ Chapter 3: Top-Down Software Optimization ............................... 39
The Three Levels and Their Impact on Performance ............................. 39
System Level .......................................................................................................... 42
Application Level .................................................................................................... 43
Microarchitecture Level .......................................................................................... 48
Summary ............................................................................................... 52
References ............................................................................................ 52
■ Chapter 4: Addressing System Bottlenecks ................................. 55
Classifying System-Level Bottlenecks .................................................. 55
Identifying Issues Related to System Condition ..................................................... 56
Characterizing Problems Caused by System Configuration.................................... 59
Summary ............................................................................................... 84
References ............................................................................................ 85
■ Chapter 5: Addressing Application Bottlenecks:
Distributed Memory ...................................................................... 87
Algorithm for Optimizing MPI Performance ........................................... 87
Comprehending the Underlying MPI Performance ................................ 88
Recalling Some Benchmarking Basics ................................................................... 88
Gauging Default Intranode Communication Performance ...................................... 88
Gauging Default Internode Communication Performance ...................................... 93
Discovering Default Process Layout and Pinning Details ....................................... 97
Gauging Physical Core Performance .................................................................... 100
About the Authors
About the Technical Reviewers
Heinz Bast has more than 20 years of experience in the areas of application tuning,
benchmarking, and developer support. Since joining Intel’s Supercomputer Systems
Division in 1993, Heinz has worked with multiple Intel software enabling teams to
support software developers throughout Europe. Heinz Bast has a broad array of
applications experience, including computer games, enterprise applications, and
high-performance computing environments. Currently Heinz Bast is part of the Intel
Developer Products Division, where he focuses on training and supporting customers
with development tools and benchmarks.
Dr. Heinrich Bockhorst is a Senior HPC Technical Consulting Engineer for high-performance
computing in Europe. He is a member of the Developer Products Division (DPD) within the Software & Services Group. Currently his work is focused on manycore enabling and high-scaling hybrid programming targeting Top30 accounts. He conducts four to five customer trainings on cluster tools per year and is in charge of developing new training
materials for Europe. Heinrich Bockhorst received his doctoral degree in theoretical solid
state physics from Göttingen University, Germany.
Dr. Clay Breshears is currently a Life Science Software Architect for Intel’s Health
Strategy and Solutions group. During the 30 years he has been involved with parallel computation and programming, he has worked in academia (teaching multiprocessing,
multi-core, and multithreaded programming), as a contractor for the U.S. Department of
Defense (programming HPC platforms), and at several jobs at Intel Corporation involved
with parallel computation, training, and programming tools.
Dr. Alejandro Duran has been an Application Engineer for Intel Corporation for the past
two years, with a focus on HPC enabling. Previously, Alex was a senior researcher at the
Barcelona Supercomputing Center in the Programming Models Group. He holds a Ph.D.
from the Polytechnic University of Catalonia, Spain, in computer architecture. He has
been part of the OpenMP Language committee for the past nine years.
Acknowledgments
Many people contributed to this book over a long period of time, so even though we
will try to mention all of them, we may miss someone owing to no other reason than the
fallibility of human memory. In what we hope are only rare cases, we want to apologize
upfront to any who may have been inadvertently missed.
We would like to thank first and foremost our Intel lead editor Stuart Douglas,
whose sharp eye selected our book proposal among so many others, and thus gave birth
to this project.
The wonderfully helpful and professional staff at Apress made this publication possible. Our special thanks are due to the lead editor Steve Weiss, coordinating editor Melissa Maldonado, development editor Corbin Collins, copyeditor Carole Berglie, and their colleagues: Nyomi Anderson, Patrick Hauke, Anna Ishchenko, Dhaneesh Kumar, Jeffrey Pepper, and many others.
We would like to thank most heartily Dr. Bronis de Supinski, CTO, Livermore Computing, LLNL, who graciously agreed to write the foreword for our book, and took his part in the effort of pressing it through the many clearance steps required by our respective employers.
Our deepest gratitude goes to our indomitable reviewers: Heinz Bast,
Heinrich Bockhorst, Clay Breshears, Alejandro Duran, and Klaus-Dieter Oertel (all of Intel
Corporation). They spent uncounted hours in a sea of electronic ink pondering multiple
early chapter drafts and helping us stay on track.
Many examples in the book were discussed with leading HPC application experts
and users. We are especially grateful to Dr. Georg Hager (Regional Computing Center
Erlangen), Hinnerk Stüben (University of Hamburg), and Prof. Dr. Gerhard Wellein
(University of Erlangen) for their availability and willingness to explain the complexity of
their applications and research.
Finally, and by no means lastly, we would like to thank so many colleagues at Intel
and elsewhere whose advice and opinions have been helpful to us, both in direct relation
to this project and as a general guidance in our professional lives. Here are those whom
we can recall, with the names sorted alphabetically in a vain attempt to be fair to all:
Alexey Alexandrov, Pavan Balaji, Michael Brinskiy, Michael Chuvelev, Jim Cownie,
Jim Dinan, Dmitry Dontsov, Dmitry Durnov, Craig Garland, Rich Graham, Bill Gropp,
Evgeny Gvozdev, Thorsten Hoefler, Jay Hoeflinger, Hans-Christian Hoppe, Sergey Krylov,
Oleg Loginov, Mark Lubin, Bill Magro, Larry Meadows, Susan Milbrandt, Scott McMillan,
Wolfgang Petersen, Dave Poulsen, Sergey Sapronov, Gergana Slavova, Sanjiv Shah,
Michael Steyer, Sayantan Sur, Andrew Tananakin, Rajeev Thakur, Joe Throop, Xinmin Tian, Vladmir Truschin, German Voronov, Thomas Willhalm, Dmitry Yulov, and Marina Zaytseva.
Foreword
performance and helps the application scientist identify bottlenecks between and within
threads. Several other tools, again including TAU and Paraver, provide similar capabilities.
A particularly useful tool in addition to those already mentioned is HPCToolkit from Rice
University, which offers many useful synthesized measurements that indicate how well
the node’s capabilities are being used and where performance is being lost.
This book is organized in the way the successful application scientist approaches the problem of performance optimization. It starts with a brief overview of the performance optimization process. It then provides immediate assistance in addressing the most pressing optimization problems at the MPI and OpenMP levels. The following chapters take the reader on a detailed tour of performance optimization on large-scale systems, starting with an overview of the best approach for today's architectures. Next, it surveys the top-down optimization approach, which starts with identifying and addressing the most performance-limiting aspects of the application and repeats the process until sufficient performance is achieved. Then, the book discusses how to handle high-level bottlenecks, including file I/O, that are common in large-scale applications. The concluding chapters provide similar coverage of MPI, OpenMP, and SIMD bottlenecks.
At the end, the authors provide general guidelines for application design that are derived
from the top-down approach.
Overall, this text will prove a useful addition to the toolbox of any application scientist who understands that the goal of significant scientific achievements can be reached only with highly optimized code.
—Dr. Bronis R. de Supinski, CTO, Livermore Computing, LLNL