Software Optimization For High-Performance Computing
www.hp.com/go/retailbooks
Prentice Hall books are widely used by corporations and government agencies for training, marketing,
and resale.
The publisher offers discounts on this book when ordered in bulk quantities. For more information,
contact Corporate Sales Department, Phone: 800-382-3419; FAX: 201-236-7141;
E-mail: [email protected]
Or write: Prentice Hall PTR, Corporate Sales Dept., One Lake Street, Upper Saddle River, NJ 07458.
HP and HP-UX are registered trademarks of Hewlett-Packard Company. IBM, IBM Power2
Architecture, RS/6000, System/360, System/370, and System/390 are trademarks or registered
trademarks of International Business Machines Corporation. Intel is a registered trademark of Intel
Corporation. MIPS is a registered trademark of MIPS Technologies, Inc. Other product or company
names mentioned herein are the trademarks or registered trademarks of their respective owners.
ISBN 0-13-017008-9
We use many trademarked terms in this book, so we’d like to recognize the following:
CYDRA is a trademark of Cydrome, Inc. CoSORT is a trademark of Innovative Routines Inter-
national, Inc. Cray, T3E, T3D, T90, and Y-MP are trademarks of Cray Research, Inc. Alpha
AXP, DEC, VAX, are trademarks of Digital Equipment Corp. Convex, HP-UX, PA-RISC, and
MLIB are trademarks of the Hewlett-Packard Company. Intel, Itanium, Pentium, and Pentium
Pro are trademarks of Intel, Inc. AIX, ESSL, IBM, PowerPC are trademarks of International
Business Machines Corp. Linux is a trademark of Linus Torvalds. Windows, Windows NT,
Visual Studio, Visual C++, and Microsoft are trademarks of Microsoft Corporation. Maspar is a
trademark of Maspar Corporation. MIPS and MIPS R10000 are trademarks of MIPS Technolo-
gies, Inc. Nastran is a registered trademark of NASA. MSC and MSC.Nastran are trademarks of
the MSC.Software Corporation. NAg is a trademark of the Numerical Algorithms Group, Ltd.
OpenMP is a trademark of the OpenMP Architecture Review Board. Quantify is a trademark of
Rational Software Corporation. IRIX, MIPS, SCSL, and SGI are trademarks of Silicon Graph-
ics, Inc. SPECint95 and SPECfp95 are trademarks of the Standard Performance Evaluation
Council. Solaris, NFS, Network File System, Performance Library, and Sun are trademarks of
Sun Microsystems, Inc. Syncsort is a trademark of Syncsort, Inc. All SPARC trademarks are
trademarks or registered trademarks of SPARC International, Inc. Products bearing SPARC
trademarks are based on an architecture developed by Sun Microsystems, Inc. TMC is a trade-
mark of Thinking Machines Corporation. UNIX is a trademark of X/Open Company, Ltd. Ether-
net is a trademark of Xerox Corp. IMSL is a trademark of Visual Numerics, Inc. FrameMaker is
a trademark of Adobe Systems, Inc.
AIX is used as a short form for AIX operating system. AIX/6000 V3 is used as a short
form of AIX Version 3 for RISC System/6000. UNIX is used as a short form for UNIX operat-
ing system.
All other product names mentioned herein may be trademarks or registered trademarks of
other manufacturers. We respectfully acknowledge any such that have not been included above.
Contents
2.2 Types
2.3 Pipelining
2.4 Instruction Length
2.5 Registers
2.5.1 General Registers
2.5.2 Floating-Point Registers
2.5.3 Vector Registers
2.5.4 Data Dependence
2.5.5 Control Dependence
2.6 Functional Units
2.7 CISC and RISC Processors
2.8 Vector Processors
2.9 VLIW
2.10 Summary
Chapter 3 Data Storage
3.1 Introduction
3.2 Caches
3.2.1 Cache Line
3.2.2 Cache Organization
3.2.3 Cache Mechanisms
3.2.4 Cache Thrashing
3.2.5 Caching Instructions Versus Data
3.2.6 Multiple Levels of Caches
3.3 Virtual Memory Issues
3.3.1 Overview
3.3.2 The Translation Lookaside Buffer
3.4 Memory
3.4.1 Basics of Memory Technology
3.4.2 Interleaving
3.4.3 Hiding Latency
3.5 Input/Output Devices
3.5.1 Magnetic Storage Device Issues
3.5.2 Buffer Cache
3.6 I/O Performance Tips for Application Writers
3.6.1 The I/O Routines You Use Will Make a Difference
3.6.2 Asynchronous I/O
3.7 Summary
Foreword

Computation. Is this truly the third scientific method? To be a peer to theory and experi-
mentation, any method must be pervasive, productive, long-standing and important. It is the
opinion of many that, in fact, computation is the third branch of scientific study. It is clear that
large numbers of people, large sums of money (in the form of computers), large numbers of
peer-reviewed papers, and large numbers of organizations are involved in computation in essen-
tially all fields of scientific, and non-scientific, endeavor.
It is not possible to design any significant product today without prodigious amounts of
computation. This covers the spectrum from aluminum beverage cans to jet aircraft. If you are
convinced that computation ranks as a valid scientific method, then the follow-on question is:
“Exactly what is it that constitutes computation?” It is my view that it embodies the algorithm,
the code, and the computer architecture. Yes, even the computer architecture.
There was a time when a numerical analyst would write code to implement an algorithm.
The code was myopic in that it was written with only the view of implementing the algorithm.
There was some consideration given to performance, but generally it was more the view of sav-
ing computer memory. Memory was the precious commodity in the early days of computation.
“Tuning” the code for the underlying architecture was not a first-order consideration.
As computer architectures evolved, the coding task had to be enlarged to encompass the
exploitation of the architecture. This was necessary in order to get the performance that is
demanded from application codes. This trend continues today.
The evolution of demands on computation has grown at least exponentially. While a
one-dimensional analysis was the status quo in the 1970s, today a three-dimensional time-vary-
ing solution, with mixed physics, is the norm. More is demanded of computation in the form of
accuracy, precision, dimensionality, parametric analysis, and time-scale. Even the “Moore’s
Law”-like growth of computing power is inadequate in the face of the growing expectation—no,
demand—on computation. Thus the performance of scientific codes today is necessarily depen-
dent upon and intimately tied to the computer architecture, including single-processor perfor-
mance and parallelism.
The theme in the above paragraphs leads us to the point of this foreword and, more impor-
tantly, this book: The numerical analysts of today must know computer architectures. Perfor-
mance gains through the knowledge and exploitation of the architecture are significant and
essential. It is this knowledge that will hopefully be the “take away” for readers of this book.
There are many books on mathematical algorithms and numerical analysis. This book pur-
posely and skillfully avoids these topics. Rather, it provides the bridge between code and useful
modifications to the code to gain performance. To keep the size of the exposition in manageable
proportions, it is not a book on architectures. It is a book on code modifications and the reason
why they are successful on today’s pervasive architecture type: multiprocessor RISC systems.
In this book the reader is exposed at a moderate level to the “art” of programming to gain
the best performance from a computer architecture. Proficiency is gained from repeated expo-
sure to the principles and techniques the authors provide. It can be argued that the best approach
to programming is to consider performance issues as every line of code is written. This is the
computer scientist’s perspective. A physicist, engineer, computational chemist, or mathemati-
cian tends to write code to solve the problem at hand in order to achieve a solution. Code modi-
fications to gain better performance come later. This book will help with either approach.
There are frequently occurring algorithms that have become pervasive throughout much
of numerical analysis, in particular, the BLAS and FFTs. Both of these are covered in sufficient
detail so that the reader will understand the types of optimizations that matter.
In the formative years of scientific computation the lingua franca was Fortran. At this
time C is being used increasingly as the language of choice for scientific applications. While
there are strong reasons for continuing to use Fortran as the programming language, this book is
language-agnostic. The examples provided alternate between Fortran and C.
The authors are well-qualified to write this book. Each has many years of direct experi-
ence in code development. Their experience spans classic vector computers, through clusters
and “MPPs” to today’s scalable architectures. To this day, each of the authors is active in the
area that he writes about in this book.
Readers are advised to read closely and then, in practice, to apply what is described. It is
likely the result will be code that performs much better than the original.
Greg Astfalk
Chief Scientist
Technical Computing Division
Hewlett-Packard Company
Preface
The purpose of this book is to document many of the techniques used by people who
implement applications on modern computers and want their programs to execute as quickly as
possible.
There are four major components that determine the speed of an application: the architec-
ture, the compiler, the source code, and the algorithm. You usually don’t have control over the
architecture you use, but you need to understand it so you’ll know what it is capable of achiev-
ing. You do have control over your source code and how compilers are used on it. This book dis-
cusses how to perform source code modifications and use the compiler to generate better
performing applications. The final and arguably the most important part is the algorithms used.
By replacing the algorithms you have or were given with better performing ones, or even tweak-
ing the existing ones, you can reap huge performance gains and solve problems that had pre-
viously been unachievable.
There are many reasons to want applications to execute quickly. Sometimes it is the only
way to make sure that a program finishes execution in a reasonable amount of time. For exam-
ple, the decision to bid or no-bid an oil lease is often determined by whether a seismic image can
be completed before the bid deadline. A new automotive body design may or may not appear in
next year’s model depending on whether the structural and aerodynamic analysis can be com-
pleted in time. Since developers of applications would like an advantage over their competitors,
speed can sometimes be the differentiator between two similar products. Thus, writing programs
to run quickly can be a good investment.
Figure: Theoretical performance versus actual performance as a function of effort.

As more energy, or time, is expended, the theoretical peak is approached, but never quite achieved.
Before optimizing applications, it is prudent to consider how much time you can, or should,
commit to optimization.
In the past, one of the problems with tuning code was that even with a large investment of
time the optimizations quickly became outdated. For example, there were many applications that
had been optimized for vector computers which subsequently had to be completely reoptimized
for massively parallel computers. This sometimes took many person-years of effort. Since mas-
sively parallel computers never became plentiful, much of this effort had very short-term bene-
fit.
In the 1990s, many computer companies either went bankrupt or were purchased by other
companies as the cost of designing and manufacturing computers skyrocketed. As a result, there
are very few computer vendors left today and most of today’s processors have similar character-
istics. For example, they nearly all have high-speed caches. Thus, making sure that code is struc-
tured to run well on cache-based systems ensures that the code runs well across almost all
modern platforms.
The examples in this book are biased in favor of the UNIX operating system and RISC
processors. This is because they are most characteristic of modern high performance computing.
The recent EPIC (IA-64) processors have cache structures identical to those of RISC processors,
so the examples also apply to them.
A typical example is the loop

DO I = 1,N
Y(I) = Y(I) + A * X(I)
ENDDO

which takes a scalar A, multiplies it by a vector X of length N, and adds the result to a vector Y of length N. Languages such as Fortran 90/95 and C++ are very powerful and allow vector or matrix notation.
For example, if X and Y are two-dimensional arrays and A is a scalar, writing
Y=Y+A*X
means to multiply the array X by A and add the result to the matrix Y. This notation has been avoided since it can obscure the analysis performed. The notation may also make it more difficult for compilers to optimize the source code.
There is an entire chapter devoted to language specifics, but pseudo-code and Fortran
examples assume that multidimensional arrays such as Y(200,100) have the data stored in mem-
ory in column-major order. Thus the elements of Y(200,100) are stored in the order Y(1,1), Y(2,1), Y(3,1), ..., Y(200,1), Y(1,2), Y(2,2), ..., Y(200,100). This is the opposite of C data storage, where data is stored in row-major order.
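To make the two storage orders concrete, the following C sketch (a simple illustration; the array name and dimensions are arbitrary) initializes a two-dimensional array in the order its elements are laid out in memory, and the equivalent Fortran loop nest appears in the comments.

#include <stdio.h>

#define ROWS 100
#define COLS 200

int main(void)
{
    static double y[ROWS][COLS];   /* C: row-major, same data volume as Fortran Y(200,100) */

    /* In C, y[i][j] and y[i][j+1] are adjacent in memory, so the     */
    /* rightmost index should vary fastest when sweeping through it.  */
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            y[i][j] = (double)(i + j);

    /* The Fortran equivalent for Y(200,100), which is column-major,  */
    /* walks memory in order when the leftmost index varies fastest:  */
    /*     DO J = 1,100                                               */
    /*        DO I = 1,200                                            */
    /*           Y(I,J) = ...                                         */
    /*        ENDDO                                                   */
    /*     ENDDO                                                      */

    printf("%g\n", y[ROWS - 1][COLS - 1]);
    return 0;
}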
P.3 Notation
When terms are defined, we’ll use italics to set the term apart from other text. Courier font
will be used for all examples. Mathematical terms and equations use italic font. We’ll use lots of
prefixes for the magnitude of measurements, so the standard ones are defined in the following
table.
Table P-1 Standard Prefixes.

Prefix     Power of ten     Power of two
kilo       10^3             2^10
mega       10^6             2^20
giga       10^9             2^30
tera       10^12            2^40
milli      10^-3
micro      10^-6
nano       10^-9

Note that some prefixes are defined using both powers of 10 and powers of two. The exact
arithmetic values are somewhat different. Observe that 10^6 = 1,000,000 while 2^20 = 1,048,576.
This can be confusing, but when quantifying memory, cache, or data in general, associate the
prefixes with powers of two. Otherwise, use the more common powers of 10.
Finally, optimizing applications should be fun. It’s really a contest between you and the
computer. Computers sometimes give up performance grudgingly, so understand what the com-
puter is realistically capable of and see that you get it. Enjoy the challenge!
About the Authors
Acknowledgments
It takes a lot of people to produce a technical book, so thanks to all of the folks who made
this possible. Our sincere appreciation to Susan Wright, editor at HP Press, Jill Pisoni, executive
editor at Prentice-Hall PTR, and Justin Somma, editorial assistant at Prentice-Hall PTR, for shepherding this
book from its inception. To Kathleen Caren and our copy editor at Prentice-Hall PTR for their
polishing of the manuscript.
Several people at Hewlett-Packard were instrumental in creating this book. Thanks to
Paco Romero and Joe Green, our book champion and book sponsor respectively. Several of our
HP colleagues were particularly helpful. Thanks to Adam Schwartz for his assistance with our
initial proposal and Brent Henderson for profiling information. We’re grateful to Lee Killough
and Norman Lindsey for last-minute proofreading. Thanks also to Raja Daoud for contributing
much of the MPI performance information. We especially appreciate Camille Krug for answer-
ing our numerous FrameMaker questions.
Finally, a special thanks to our technical reviewer. He provided the initial inspiration for
the book and many insightful suggestions along the way.
C H A P T E R 1
Introduction
Every day computer centers upgrade systems with faster processors, more processors,
more memory, and improved I/O subsystems, only to discover application performance
improves little, if at all. After some analysis by the system or software vendors, they find that
their application simply wasn’t designed to exploit improvements in computer architecture.
Although the developers had read texts on high performance computing and had learned the
meaning of the associated buzz words, acronyms, benchmarks and abstract concepts, they were
never given the details on how to actually design or modify software that can benefit from com-
puter architecture improvements. This book provides the details necessary to understand and
improve the performance of your applications.
Each year, users see new models of high performance computers that are significantly
faster than last year’s models. The number of high performance computers has exploded over the
last two decades, but the typical application achieves only a small fraction of a computer’s peak
performance. Reasons for this are:
• programs are written without any knowledge of the computers they will run on
• programmers don’t know how to use compilers effectively
• programmers don’t know how to modify code to improve performance
It’s unfortunate how little the average programmer knows about his computer hardware.
You don’t have to be a hardware architect to write fast code, but having a basic understanding of
architectures allows you to be a more proficient programmer. This also enables you to under-
stand why certain software optimizations and compiler techniques help performance dramati-
cally. Once armed with these tools, you can apply them to your favorite application to improve
its performance. To address these issues, this book is divided into three parts: hardware over-
view, software techniques, and applications.
Computer hardware continues to improve in several ways:

• Faster processors
• Larger caches, but cache sizes are not scaling as fast as processor speed
• Larger memory, but memory speed improvements are not scaling as fast as processor
speed
• More chip level parallelism
• More processors
Each of the major components of a computer has an effect on the overall speed of an appli-
cation. To ensure that an application runs well, it is necessary to know which part, or parts, of a
computer system determines its performance.
The processor (Chapter 2) is the most important part of a computer. Some processor fea-
tures, such as pipelining, are found on all high performance processors, while others are charac-
teristic of a specific processor family. Different processor families are defined and discussed. Of
course, the processor is only one part of a computer. For example, if an application is completely
dependent on the rate at which data can be delivered from memory, it may not matter how fast a
processor is.
Data storage (Chapter 3) exists as a hierarchy on computers, with cache being “close” to
the processor in terms of access time. Sometimes the cache is even part of the processor chip.
This is in contrast to the large system memory, which is far away from the processor in terms of
access time. The way data storage is accessed frequently determines the speed of a program.
Most high performance computers contain more than one processor. Designing computers
for parallelism greatly complicates computer design. Chapter 4 discusses the various types of
connections that tie processors and memory together, as well as distributed-memory and
shared-memory paradigms.
The final two chapters investigate some common algorithms. Many mechanical design and
analysis applications simulate structures such as automobiles using numerical linear algebra.
This frequently involves solving systems of equations (Chapter 11). Some signal processing
applications generate large amounts of data which are interpreted using convolutions and Fast
Fourier Transforms (Chapter 12). Even if you have limited interest in these algorithms, you
should at least skim the material to see how the techniques of previous chapters can be applied to
improve application performance.
P A R T 1
Hardware Overview —
Your Work Area
C H A P T E R 2
Processors: The Core of High Performance Computing
Computers are like Old Testament gods; lots of rules and no mercy.
Joseph Campbell
2.1 Introduction
Computers are complex machines with many parts that affect application performance.
Undoubtedly the most important component is the processor, since it is responsible for perform-
ing arithmetic and logical calculations. The processor loads data from memory, processes it, and
sends the processed data back to memory. The processor performs these tasks by executing
instructions. The lowest-level instructions which are visible to users are assembly language
instructions. Each processor has an internal clock that determines the speed of operations. The
amount of time to execute any instruction is an integer multiple of this clock period.
Over the last three decades there have been many different types of processors. Computer
architects group processors into different categories according to the instruction set architecture
(ISA) that a processor executes. This is the set of assembly language instructions defined for a
given processor. Many different types of processors may use the same ISA. For example,
Hewlett-Packard’s PA-8000, PA-8200, PA-8500, and PA-8600 processors all execute the same
assembly code and use the PA-RISC 2.0 ISA. The most common method to generate the assem-
bly language instructions is for a user to write her programs in a high-level language such as
Fortran, C or Java and use a compiler to translate these programs into the assembly language
instructions.
Processors also belong to processor families that have certain high-level design features in
common. High performance computing uses several different processor families, so an under-
standing of their design is important. The PA-8x processors mentioned above all belong to the
RISC processor family. Processor families are a reflection of the technology available when they
were designed. A processor family may be appropriate for many years until advances in hard-
ware, software, or even the cost to fabricate it make it less attractive. They evolve over time and
conform to a survival-of-the-fittest policy. Most of this chapter consists of defining processor
features. Different processor families are created by integrating many of these features.
2.2 Types
High performance processors of the last three decades belong to one of four families: CISC, RISC, vector, and VLIW.
Table 2-1 shows the different processor families we’ll discuss and representative architec-
tures and processors in each family.
Before describing the characteristics of the processor families, certain processor features
need to be defined and discussed. Some of the features exist in multiple families while others are
unique to a particular family or a specific subset.
2.3 Pipelining
For an instruction to be executed, there are several steps that must be performed. For
example, execution of a single instruction may contain the following stages:
1. Instruction fetch and decode (IF). Bring the instruction from memory into the processor
and interpret it.
2. Read data (RD). Read the data from memory to prepare for execution.
3. Execution (EX). Execute operation.
4. Write-back (WB). Write the results back to where they came from.
Each of the four stages has to perform a similar amount of work so that instructions can move from one stage to the next at a steady rate.
Note that we haven’t said how long (i.e., how many clocks) it takes to perform each of the
stages. However, let’s assume each stage takes one cycle. Suppose there is a sequence of n
instructions to perform, that all instructions are independent and we can start one instruction per
cycle. In clock cycle one, an instruction fetch and decode is performed on instruction one. In
clock cycle two, the data necessary for instruction one is read and instruction two is fetched and
decoded. In clock cycle three, operations with instruction three begin and instructions one and
two move to their next stages. This sequence of events is illustrated in Figure 2-1. At cycle four
one instruction can be completed with every subsequent clock cycle. This process is known as
pipelining instructions and is a central concept to high performance computing. Pipelining
makes the processor operate more efficiently and is similar to how a factory assembly line oper-
ates.
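For example, under the assumptions above (four stages, one cycle per stage, one independent instruction started per cycle), 100 instructions complete in 4 + 99 = 103 cycles, whereas executing them one after another with no overlap would take 4 × 100 = 400 cycles.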
Each instruction has a latency associated with it. This is the amount of time it takes before
the result of the operation can be used. Obviously a small latency is desired. The instructions
shown in Figure 2-1 have a four-cycle latency (each stage takes one cycle and there are four
stages). If an instruction starts executing in clock cycle one, its result is available in clock cycle
five.
Figure 2-1: Pipelined execution of six independent instructions, each passing through the IF, RD, EX, and WB stages, with a new instruction starting every clock cycle.
2.4 Instruction Length

Each instruction must contain the information necessary to identify it and to provide exactly what the instruction needs to
accomplish. Simple instructions don’t need as many bits as complicated instructions. For exam-
ple, suppose an instruction tells the processor to initialize a value to zero. The only information
necessary is the name or operation code of the instruction, the quantity zero, and the register or
address that should be modified. Another instruction might load two values, multiply them
together and store the result to a third quantity. The first instruction could be much shorter than
the second instruction since it doesn’t need to convey as much information.
Having a variable number of bits for instructions allows computer designers to load the
minimum number of bits necessary to execute the instruction. However, this makes it difficult to
predict how many instructions are in a stream of instruction bits. This lack of predictability affects
how well instructions can be pipelined.
For example, assume that the processor knows where the next instruction begins in mem-
ory. The processor must load some number of bits (it doesn’t know how many) to determine
what the following instruction is. If the instruction is long, it may have to make multiple loads
from memory to get the complete instruction. Having to make multiple loads for an instruction
is especially bad. Remember that for pipelining to occur as illustrated above, the instruction
fetch stage must take only one cycle. If all instructions have a constant length of n bits, the logic
is much simpler since the processor can make a single load of n bits from memory and know it
has a complete instruction.
So there’s a trade-off in processor design. Is it better to load fewer bits and have compli-
cated logic to find the instruction, or load more bits and know exactly how many instructions it
contains? Because of the performance benefits of pipelining, computer designers today prefer
instructions of constant length.
2.5 Registers
The performance of a processor is closely tied to how data is accessed. The closest (and
smallest) storage areas are the registers that are contained in modern processors. Computers can
be designed without registers, though. Some early computers had every instruction load data
from memory, process it, and store the results back to memory. These are known as mem-
ory-memory architectures. Computers with registers fall into two groups: processors that allow
every instruction to access memory (register-memory architectures) and processors that allow
only load or store instructions to access memory (load-store architectures). All recently designed
processors have load-store architectures. General trends regarding registers are:
• more registers
• wider registers
• more types of registers
2.5.1 General Registers

General registers hold integer data. The most common way to interpret integer data is to have the left-most bit be a sign bit and the rest of the
bits determine the magnitude of the number. Let the zero’s bit (the right-most bit) be a0, the
one’s bit be a1, and so on. For a positive number (sign bit equal to zero), the magnitude is
a0 × 2^0 + a1 × 2^1 + a2 × 2^2 + ...
There are multiple methods for representing negative numbers. In the one’s-complement
system, the negative of a number is obtained by complementing each bit (i.e., all zeros are
changed to ones and all ones are changed to zeros). Most computers use the two’s-complement
method, which is found by taking the one’s complement and adding one. Thus the numbers 1
and −1 are represented (using 32 bits) by

 1 = 0000 0000 0000 0000 0000 0000 0000 0001
−1 = 1111 1111 1111 1111 1111 1111 1111 1111
Using a two's-complement format for 32-bit integer data, numbers in the range −2^31 to 2^31−1 (−2,147,483,648 to 2,147,483,647) can be represented. 32-bit integers are not large
enough for many applications today, so most high performance processors have general registers
that are 64 bits wide.
When discussing the computer representation of numbers, it is often necessary to discuss
the binary representation of numbers instead of their more familiar decimal representation. This
quickly becomes cumbersome due to the large number of digits in most binary representations.
A more succinct way to represent binary numbers is by using hexadecimal (base 16) notation.
Binary notation is converted to hexadecimal notation by combining four binary digits into a single hexadecimal digit. Hexadecimal digits are 0, 1, ..., 9, A, B, C, D, E, F, where F represents the binary number 1111 (or the decimal number 15). Representations of two numbers using
two's-complement forms follow:

 1,610,612,738 = 0110 0000 0000 0000 0000 0000 0000 0010 (binary) = 6000 0002 (hexadecimal)
−1,610,612,738 = 1001 1111 1111 1111 1111 1111 1111 1110 (binary) = 9FFF FFFE (hexadecimal)
Hexadecimal numbers are frequently denoted by prepending them with a 0x, so the above num-
bers may be written as 0x60000002 and 0x9FFFFFFE.
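The following C sketch (an added illustration that assumes the common case of a 32-bit int) prints the two's-complement bit patterns of the numbers above in hexadecimal; casting to unsigned exposes the raw bit pattern.

#include <stdio.h>
#include <limits.h>

int main(void)
{
    int values[] = { 1, -1, 1610612738, -1610612738 };
    int n = sizeof(values) / sizeof(values[0]);

    for (int i = 0; i < n; i++)
        /* The cast to unsigned shows the underlying two's-complement pattern. */
        printf("%12d = 0x%08X\n", values[i], (unsigned int)values[i]);

    /* Range of a 32-bit two's-complement integer: */
    printf("INT_MIN = %d, INT_MAX = %d\n", INT_MIN, INT_MAX);
    return 0;
}

Running it should print 0x00000001, 0xFFFFFFFF, 0x60000002, and 0x9FFFFFFE for the four values.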
2.5.2 Floating-Point Registers

Floating-point numbers consist of three parts:

• the fractional part (significand) that contains the number's significant digits,
• an exponent to indicate the power of two that is multiplied by the significand, and
• a sign bit to indicate whether the number is positive or negative.
Due to the widespread use of floating-point data, most processors have separate registers
for integers (general registers) and floating-point data (floating-point registers). Today most ven-
dors use 64-bit floating-point registers which can contain 32-bit or 64-bit data. There was a time
when vendors used different sizes for the significand, but now virtually all conform to the
ANSI/IEEE 754-1985 standards for 32-bit and 64-bit floating-point data, as shown in
Figure 2-3.
Figure 2-3: ANSI/IEEE 754 floating-point formats. 32 bits: sign (bit 31), exponent (bits 30-23), significand (bits 22-0); 64 bits: sign (bit 63), exponent (bits 62-52), significand (bits 51-0); 80 bits: sign (bit 79), exponent (bits 78-64), significand (bits 63-0).
For 32-bit floating-point data, the sign bit is zero for positive numbers and one for negative
numbers. The exponent is eight bits long and the significand is 23 bits long. The exponent uses a
bias of 127 and the significand contains an implied one to the left of the binary point. If s is the
sign bit, e is the exponent and m is the significand, then a 32-bit floating-point number corre-
sponds to

(−1)^s × 1.m × 2^(e−127)

For example, to represent 3/4 = 1.5 × 2^(126−127), set e = 126 and set the significand field to 100...0 (i.e., m = .5) for the binary representation 0 01111110 100...0. The hexadecimal representation for 3/4 is 3F400000 and the representa-
tion of −3/4 is BF400000. Adding two floating-point values is more complicated than adding
two integers. Let Y and Z be two floating-point numbers:
Y = (1.my) × 2^(ey−127)
Z = (1.mz) × 2^(ez−127)
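As a small added illustration (it assumes float is the IEEE 32-bit format, which is true on essentially all current machines), the following C fragment pulls the sign, exponent, and significand fields out of 0.75 and reproduces the 3F400000 pattern discussed above.

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void)
{
    float x = 0.75f;
    uint32_t bits;

    memcpy(&bits, &x, sizeof(bits));             /* reinterpret the float's bits */

    unsigned sign        = bits >> 31;           /* bit 31     */
    unsigned exponent    = (bits >> 23) & 0xFF;  /* bits 30-23 */
    unsigned significand = bits & 0x7FFFFF;      /* bits 22-0  */

    printf("0x%08X: sign=%u exponent=%u significand=0x%06X\n",
           (unsigned)bits, sign, exponent, significand);
    /* Expected output: 0x3F400000: sign=0 exponent=126 significand=0x400000 */
    return 0;
}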
2.5.3 Vector Registers

Consider an operation performed on two vectors, x and y, each containing 50 elements. Vector operations are well-suited to pipelining since the operations on the individual ele-
ments of the vector are independent. Therefore, some processors have vector registers which
contain many data elements. In the above example, a single vector register, v1, could hold all the
elements of x and another vector register, v2, could contain the elements of y. Vector registers
can be thought of as a collection of general or floating-point registers tightly bound together. So,
in the above example, two vectors of length 50 hold an amount of data that is equivalent to using
100 floating-point registers. One advantage of vector registers is that they allow a large amount
of data to be located close to the processor functional units which manipulate them. Vector reg-
isters are usually a power of two in size with 128 elements per vector register being representa-
tive. There is also a vector length register associated with vector registers that determines how
many elements in the vector are to be used in each operation.
2.5.4 Data Dependence

An instruction may be dependent on data from a previous instruction and therefore cannot be moved before the earlier
instruction. So the code,
y = foo(x);
z = y;
has a data dependence since the second expression cannot be moved before the first one.
A control dependence is when an instruction occurs after a conditional branch and there-
fore it is not known whether the instruction will be executed at all. Thus
if (n == 0)
x = y;
has a control dependence. Some dependencies can be eliminated by software techniques dis-
cussed in Chapter 5. Some processors have hardware features that lessen the effect of dependen-
cies. This section discusses features related to data dependence while the next discusses control
dependence.
For example, consider the instruction sequence

add register 8 to register 9, putting the result in register 10
load X into register 8

The second instruction does not depend on the result of the first, but it cannot begin exe-
cution until the first instruction finishes using register 8. If another register were available, then
the processor could use that register in place of register 8 in the second instruction. It could then
move the result of the second instruction into register 8 at its convenience. Then these instruc-
tions could be pipelined.
This use of another dummy register in place of a register specified by the instruction is
generally referred to as register renaming. Some processors have a set of hidden registers used
for just this purpose. These processors dynamically rename a register to one of these hidden reg-
isters so that register dependence is removed and better instruction scheduling is facilitated.
Consider the following copy loop:

DO I = 1,N
Y(I) = X(I)
ENDDO
The naive way to implement this data copy using pseudo-assembly language is
DO I = 1,N
load X(i) into register 1
store register 1 to Y(i)
ENDDO
If the load instruction has a four-cycle latency, then the store cannot start until four cycles
after the load begins. This causes the processor to stall or sit idle for four cycles.
Rotating registers are those which rotate or are renumbered with each iteration of a loop.
For example, if registers rotate, the data in register 1 in the first iteration of a loop would appear
in register 2 in the second iteration.
If registers rotate, then the above loop may be implemented as

(prologue: load the first four elements of X into rotating registers)
DO I = 5,N
load X(I) into register 1
store register 5 to Y(I-4)
ENDDO
(epilogue: store the last four elements to Y)
With this implementation, the store in the loop can start immediately since it is storing
data that was loaded four iterations before (and hence at least four cycles earlier). Note, how-
ever, a prologue is necessary before the loop body to load the first values of X. There is also an
epilogue after the loop for the last store operations.
For example, consider the instruction sequence

store register 4 to x
load y to register 5
It would probably help performance to move the load before the store since the load would
be started earlier. However, unless it is known that the addresses of x and y are different, this instruction movement is not legal, since the store may modify the value of y.
One way to attack this problem is to define a special type of load, called an advanced load,
and a special address table to keep track of the advanced load addresses. The above sequence
could be converted to appear as follows:

advanced load y to register 5 (the address of y is recorded in the table)
store register 4 to x (the table is checked for a matching address)
check the advanced load of y (reload y if the table entry was invalidated)
Thus, if the address of y had never been updated by a store, the advanced load was valid
and performed earlier than would have been normally been possible. If the address of y was
modified by the store, the load is repeated and the correct value of y is produced.
2.5.5 Control Dependence

Control dependence arises from branches. For example, consider the following loop, which repeatedly adds 3.0 to s:

i = -1
loop:
s = s + 3.0
i = i + 1
if (i < n), branch to loop
When a branch is taken, the instruction stream jumps from one location to another. The
first time in the instruction sequence where a branch can be recognized is during instruction
decode. Many processors will have the next instruction being fetched before the branch can be
interpreted. For a taken branch, the processor has to undo any work that has been done for that
next instruction and jump to the new location. This causes a gap in the instruction pipeline that
degrades performance. The branch delay slot is the instruction located immediately after a
branch instruction. Some processors always execute the branch delay slot to improve pipelining.
The loop above could then appear as
i = -1
loop:
s = s + 3.0
i = i + 1
if (i < n), branch to loop
nop ! branch delay slot, always execute
The instruction shown after the loop is a dummy instruction called a no-operation instruction, or nop. It doesn't do work, but acts as a placeholder so that pipelining can continue until
the loop is exited. Of course, the code would be smaller (and more efficient) if there was real
work in the branch delay slot, so it would be better for the code to appear as
i = -1
loop:
i = i + 1
if (i < n), branch to loop
s = s + 3.0 ! branch delay slot, always execute
Ideally, the instruction stream should contain as few jumps as possible. This makes it eas-
ier for pipelining to continue. Even simple if-tests are a problem, though. For example, the
code
IF (N > 0) then
A = 0.3
ELSE
A = 0.7
ENDIF
is interpreted as

if (N <= 0), branch to else_part
A = 0.3
branch to end_if
else_part:
A = 0.7
end_if:
This contains two branches. On architectures which always execute the branch delay slot, the delay slots of both branches must also be filled, with useful work or with nops, which lengthens the sequence further.
When the instruction fetch stage takes more than one cycle, or a processor can fetch more than
one instruction at a time, the multiple branches above interfere with pipelining.
Some processors have predicate registers, which hold a true or false value that controls whether an instruction takes effect. For example, the instruction

y = 2.0

can be predicated on the register p1 and written as

(p1) y = 2.0
If the predicate register p1 is one, i.e., true, y is set to 2.0. However, if p1 is zero or false, the
instruction is interpreted as a nop. Predicate registers allow some control dependencies intro-
duced by branches to be turned into data dependencies. If p1 and p2 are predicate registers, then
the original if-test in the previous section results in the sequence

compare (N > 0), setting p1 to the result and p2 to its complement
(p1) A = 0.3
(p2) A = 0.7
This takes only three instructions since the if-else test is a single instruction. Note that one of
the predicated instructions is not needed (but we don’t know which one) and will not update A.
However, since this instruction must be loaded from the instruction stream, it is not completely
free. The cost for this instruction is the same as a nop.
Predicate registers allow architectures with rotating registers to be even more effective.
Recall how rotating registers were used on the loop
DO I = 1,N
Y(I) = X(I)
ENDDO
and the prologue and epilogue they generated. Predicate registers allow the prologue and epi-
logue to be moved into the loop body as shown below.
DO I = 1, N+4
if (I <= N) set p1=true; else p1=false;
if (I >= 5) set p2=true; else p2=false;
(p1) load X(I) into register 1
(p2) store register 5 to Y(I-4)
ENDDO
Thus, rotating registers and predicate registers used together reduce the number of instructions
required.
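Predication is not something the programmer writes directly, but compilers for architectures with predicate registers (or conditional-move instructions) can often remove branches from simple conditionals. Below is a C sketch of the earlier if-else example (the function name is arbitrary), written as a conditional expression that is easy for a compiler to if-convert; whether it actually is depends on the compiler and target.

#include <stdio.h>

/* Returns 0.3 when n is positive and 0.7 otherwise. A compiler for a
   predicated architecture can evaluate both assignments under opposite
   predicates, or use a conditional move, instead of branching.        */
static double select_value(int n)
{
    return (n > 0) ? 0.3 : 0.7;
}

int main(void)
{
    printf("%g %g\n", select_value(5), select_value(-5));
    return 0;
}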
A compiler may want to move a load above a branch so that the data arrives earlier, but if the branch is taken, the load may reference an address that is not even valid. Therefore, a normal load might generate exceptions if the data is loaded and the program may
abort.
There is a special type of load, a speculative load, that allows these situations to be han-
dled efficiently. Any exceptions are held in a special buffer area and the processor checks these
at the location where the original load would have appeared in the code stream. The above code
could be converted to appear as follows:

speculative load y to register 5 (any exception is deferred)
branch that may skip the use of y
check the speculative load of y (deferred exceptions are processed here)
Thus, if no exceptions had appeared, the speculative load was valid and was performed earlier
than would have normally been possible.
2.6 Functional Units
Memory and registers contain the data that gets processed, but the parts of the processor
that actually do the work are the functional units. Most processors have functional units for the
following:
• memory operations
• integer arithmetic
• floating-point arithmetic
In load-store architectures, the memory functional units execute instructions that contain
memory addresses and registers. Other functional units execute instructions that operate exclu-
sively on registers. The examples so far have been simplified by assuming that all instructions
spend one cycle in the execution stage. The number of cycles in the execution stage is deter-
mined by the amount of time spent in the corresponding functional unit.
Memory functional units have varying degrees of sophistication. Load instructions refer-
ence an address in memory where a datum is to be loaded from and a register where the datum is
put. As always, an important goal is to ensure efficient pipelining. Simple processors stall until
the loaded data moves from memory into a register. This behavior is known as stall-on-load.
Processors that stall-on-use allow the load and subsequent instructions to be executed until the
processor actually tries to use the register that the load referenced. If the data is not in the regis-
ter, then the processor stalls until it arrives from memory. Some processors support having sev-
eral outstanding loads occurring simultaneously. Processors may also allow advanced and
speculative loads as discussed above.
Most integer instructions have a lower latency than floating-point instructions since inte-
ger operations are easy to implement in hardware. Examples of integer operations are adding
integers, shifting bits, Boolean operations, and bitwise operations such as extracting or deposit-
ing a string of bits. Since computers use a binary representation, multiplication and division by a
power of two is easily accomplished by shifting bits. For example, multiplying a positive value
by eight is accomplished by shifting the binary representation to the left by three bits. Multipli-
cation and division by arbitrary integer values is much more difficult and time-consuming.
These are sometimes accomplished by converting the integer data to floating-point data, per-
forming floating-point operations and converting the result back to an integer.
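A small C illustration of these bit operations (the values are arbitrary): shifting left by three bits multiplies by eight, shifting right by three divides by eight for non-negative values, and masking with 0x7 gives the remainder modulo eight.

#include <stdio.h>

int main(void)
{
    unsigned int x = 13;

    unsigned int times8 = x << 3;    /* 13 * 8 = 104 */
    unsigned int div8   = x >> 3;    /* 13 / 8 = 1   */
    unsigned int mod8   = x & 0x7;   /* 13 % 8 = 5   */

    printf("%u %u %u\n", times8, div8, mod8);
    return 0;
}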
Floating-point data contains an exponent and significand, so adding two floating-point
numbers involves much more logic and hence more complex circuitry. Actual latencies vary
among processors, but floating-point additions may take three times longer than integer addi-
tions, while floating-point divides may take several times longer than floating-point additions.
Table 2-2 shows representative latencies on one modern architecture.
Table 2-2 Representative instruction latencies.

Instruction                      Latency (clock cycles)
Integer addition                 1
Floating-point addition          3
Floating-point multiplication    3
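The latencies in Table 2-2 matter most when one operation must wait for the result of another. In the C sketch below (an added illustration; the function names are arbitrary and the effect depends on the compiler and processor), the first loop forms a chain of dependent additions, so each one must wait roughly a full floating-point add latency for the previous result, while the second uses four independent partial sums so a pipelined adder can have several additions in flight at once.

/* One accumulator: every addition depends on the previous one.      */
double sum_chain(const double *x, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Four accumulators: the additions are independent and can overlap. */
double sum_partial(const double *x, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;

    for (i = 0; i + 3 < n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)           /* handle any leftover elements */
        s0 += x[i];

    return (s0 + s1) + (s2 + s3);
}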
Many processors have single instructions that perform multiple operations using a single
functional unit. It is usually advantageous to use these since they result in getting more work
done in a unit of time. Below are multiple operation instructions implemented in some contem-
porary processors:
• Shift and Mask—shift an integer a specified number of bits and perform a Boolean AND
operation. These instructions can be used for integer arithmetic (especially multiplication
by powers of two) involved in common tasks such as address calculation and modular
arithmetic.
• Floating-point Fused Multiply and Add (fma)—multiply two floating-point registers and
add the result to another register. This instruction is also known as Floating-point Multiply
and Accumulate. This instruction brings up several interesting issues. This instruction is
implemented a couple of different ways. The most efficient processors have one or more
functional units that can perform an fma. The fma instruction is a big advantage on these
processors since two floating-point operations are produced by the functional unit. Less
efficient processors have separate functional units for multiplication and addition, and an fma instruction is split into two parts that use both functional units. There's no advantage to
using an fma on these processors. If an fma instruction is executed by a single functional
unit, the result may be different than if the multiply and addition are done separately. This
is because the intermediate multiply result can be more accurate since it doesn’t need to be
stored. This has caused problems because some users believe any answer that differs from what they've obtained in the past must be a wrong answer. (A short demonstration of this effect follows the list.)
• Some processors have instructions that take 64-bit integer registers and operate on the four
16-bit components contained in the register. An add of this type performs four 16-bit addi-
tions with a single instruction. These instructions are used in multimedia applications.
These types of instructions are known as Single Instruction Multiple Data (SIMD) instruc-
tions and are a form of instruction level parallelism.
• Applications which use 32-bit floating-point numbers can use another type of SIMD
instruction that operates on the two 32-bit halves of a 64-bit floating-point register. An
fma of this type performs four 32-bit floating-point operations with a single instruction.
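To see the accuracy effect described in the fma bullet above, C99 provides an fma() function that is required to round only once. In this added example, subtracting the rounded product from itself gives exactly zero, while the fused form recovers the tiny rounding error of a*a:

#include <stdio.h>
#include <math.h>
#include <float.h>

int main(void)
{
    double a = 1.0 + DBL_EPSILON;      /* a*a is not exactly representable   */
    double p = a * a;                  /* product rounded to double          */

    double separate = p - p;           /* multiply, then subtract: exactly 0 */
    double fused    = fma(a, a, -p);   /* single rounding: the error of a*a  */

    printf("separate = %g, fused = %g\n", separate, fused);
    return 0;
}

Whether a hardware fma instruction is actually used for fma() depends on the processor and compiler, but the single rounding is what the function is defined to provide.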
Now that we’ve defined lots of processor features, we’ll put them together to build the pro-
cessor families we discussed earlier.
2.7 CISC and RISC Processors

One advantage of having a relatively small number of assembly language instruction types is that it makes it easier for all instructions
to have the same instruction length.
Most RISC designs also execute one or more branch delay slots to aid pipelining. Reduc-
ing the number of memory accesses is also important, so they also contain a large number of
general and floating-point registers. Therefore, some features that encourage pipelining and
hence help define RISC processors are
• no microcode
• relatively few instructions
• only load and store instructions access memory
• a common instruction word length
• execution of branch delay slots
• more registers than CISC processors
The first RISC processors functioned very much as shown in Figure 2-1. In each clock
cycle, after the pipeline is full, a new instruction is dispatched and an instruction completes exe-
cution. So, ideally, there is one clock cycle per instruction (a CPI of one).
It should be noted that on RISC processors there are still some instructions that cannot be
pipelined. One example is the floating-point divide instruction. As mentioned above, division is
much more difficult to perform than addition or multiplication. When divide is implemented in
hardware, it takes so many clocks to execute that it cannot be pipelined. Fortunately, the vast
majority of instructions can be pipelined on RISC processors.
In the quest to increase performance on RISC processors, many new design features
appeared. On the pipelined processor shown in Figure 2-1, several instructions are executed in
parallel in each cycle. However, at any given point in time each instruction being executed is in a
different stage. One way to increase performance is to have multiple instances of each stage, so
that more than one instruction is at the same stage at the same time. This is known as Instruction
Level Parallelism (ILP) and is the idea behind superscalar RISC processors. A superscalar RISC
processor can issue or dispatch more than one instruction per clock cycle. For example, a
two-way superscalar processor is one that can dispatch two instructions every clock period.
Figure 2-4 shows a four-way superscalar RISC pipeline.
Another way to increase performance is out-of-order execution. Sometimes a processor
will attempt to execute an instruction only to find out that the instruction cannot be executed due
to a dependency between it and previous instructions. Out-of-order execution processors attempt
to locate and dispatch only instructions that are ready for execution regardless of their order in
the instruction sequence.
As one would expect, the logic to optimally perform this analysis is very complicated.
Also, out-of-order execution introduces a randomness into the instruction execution ordering
that is not present in simpler RISC processors. This can make it more difficult for optimization
experts or compiler writers to optimize code, because some of the determinism may vanish at dispatch time. Another feature that improves instruction scheduling is register renaming, so many RISC processors also support this functionality.

Figure 2-4: A four-way superscalar RISC pipeline in which four instructions are fetched, decoded, executed, and written back together in each clock cycle.
When RISC processors first appeared, their clock periods were slower than CISC proces-
sors. By the early 1990s, RISC processor clock periods had improved so much that they were
faster than their CISC counterparts and RISC superseded CISC in most high performance com-
puters. Examples of RISC processor ISAs include Alpha, MIPS, PA-RISC, PowerPC, and SPARC.
What has happened to CISC processors since the RISC revolution? Intel’s Pentium Pro
architecture is a recent CISC processor. While it is difficult to overcome all of CISC’s design
constraints, the Intel processor has very good performance because the microcode has been
designed to be very RISC-like. The Pentium Pro also has many more registers than previous
generations of the processor family and uses register renaming. So CISC processors are copying
the best attributes of RISC processors.
2.8 Vector Processors

The premier designer of vector computers was the late Seymour Cray. Working at Cray Research
Corporation and Cray Computer Corporation, Cray’s goal was to build the fastest computer in
existence. For many years he did just that. Cray’s computers were so much faster than other
computers that they were called supercomputers. In the 1980s, one definition of supercomputer
was the latest computer Seymour Cray designed. These supercomputers were all vector comput-
ers. Like RISC computers, the high performance of vector computers is obtained by pipelining
instructions. Some argue that Cray’s computers were early RISC computers since the instruction
set was simple, instructions were all the same size, and they did not use microcode.
Vector computers use a single instruction to repeat an operation on many pieces of data
(i.e., a vector) so that they execute far fewer instructions than scalar (CISC or RISC) processors.
There is always a hardware imposed maximum vector length associated with a vector processor.
This is the number of data items in the longest vector register. For vector operations in excess of
this length, the compiler generates a loop where each iteration of the loop executes vector
instructions using the maximum vector length. This is referred to as strip-mining the loop. On
vector processors, the branch that occurs at the end of a loop is much less important than it is on
a scalar processor, since the branch is executed far fewer times.
Suppose that two vectors of data, x and y, each have 512 elements and these vectors are to
be added and stored to another vector z as shown below:

DO I = 1,512
Z(I) = X(I) + Y(I)
ENDDO
For this example, assume that vector registers can hold 128 elements. Table 2-3 shows a
comparison between the instructions that scalar and vector processors produce.
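Written out explicitly in C (an added sketch; MAX_VL and the function name are arbitrary, with MAX_VL standing for the hardware maximum vector length), strip-mining the 512-element example looks like the following, where each inner loop corresponds to one set of vector instructions and the outer loop executes only 512/128 = 4 times:

#define MAX_VL 128   /* maximum vector length of the hardware */

/* z = x + y, strip-mined into chunks of at most MAX_VL elements. */
void vadd(double *z, const double *x, const double *y, int n)
{
    for (int i = 0; i < n; i += MAX_VL) {
        int vl = (n - i < MAX_VL) ? (n - i) : MAX_VL;   /* length of this strip */
        for (int j = 0; j < vl; j++)                    /* one vector add       */
            z[i + j] = x[i + j] + y[i + j];
    }
}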
Processors using scalar or vector instructions may both have a goal of one operation per
clock period. At a fixed clock period, the two approaches may achieve similar performance
results on ideal code. Not all code can be vectorized, though. Codes that are not vectorizable are
called scalar codes. In general, if codes are vectorizable, they are also capable of being pipe-
lined on a RISC processor. However, there are many scalar codes that can also be pipelined on a
RISC processor.
An additional level of pipelining available on most vector processors is chaining. This is
when the result of a vector instruction is fed into, or chained with, another vector instruction
without having to wait for the first instruction to complete execution. A common form of chain-
ing is when a vector multiply instruction chains with a vector addition. This is analogous to the
fma instruction discussed earlier.
Just as there are superscalar RISC processors that dispatch multiple instructions per clock
period, there are also vector architectures that allow multiple independent vector instructions to
be initiated and executed in parallel. Figure 2-5 shows a vector pipeline with a vector length of
128 that executes two independent vector load instructions chained to a vector addition instruc-
tion, which is chained to a vector store.
Figure 2-5: A chained vector pipeline: two vector loads feed a vector add, which is chained to a vector store, on vectors of 128 elements.
The selling price of a processor is inversely related to the volume produced. The number of
vector processors has always been very small compared to the number of other types of proces-
sors, so they cost much more. However, in the 1970s and 1980s their performance was so supe-
rior to that of other processors that companies were willing to pay exorbitant prices for vector
computers. This was due to their much faster clock periods, the inherent pipelining of vectoriza-
tion, and advanced memory systems.
When Seymour Cray introduced the Cray 1 in 1976, its performance was over 10 times
faster than the fastest CISC processor. Throughout the 1980s and 1990s, RISC processor tech-
nology matured and the performance of these relatively inexpensive processors improved more
rapidly than that of vector processors.
The LINPACK benchmarks are used to measure the performance of linear algebra soft-
ware (see Chapter 9). The LINPACK 1000x1000 benchmark solves a system of 1000 equations
for 1000 unknowns. One of the creators of the LINPACK software and benchmark, Jack Don-
garra, maintains a database of benchmark results at http://www.netlib.org/benchmark/performance.ps. Hardware vendors are encouraged to send results to Dongarra. The
database contains entries for hundreds of computers. Figure 2-6 shows single processor LIN-
PACK 1000x1000 performance comparing Cray vector processors and Hewlett-Packard RISC
processors. The data points represent the number of million floating-point operations per second
(Mflop/s) in the year of introduction of the respective processors and show how the processor
families have improved over time.
This is a very good benchmark to show the benefits of vector processors and a poor bench-
mark for CISC processors. On the other hand, RISC processor performance is rapidly approach-
ing vector processor performance. There are many other benchmarks where the speed of RISC
processors exceeds the speed of vector processors. In fact, for today’s general-purpose comput-
ing, RISC processors are faster than vector processors. Since the cost of RISC processors is
much less than the cost of vector processors, the number of vector computers continues to
shrink.
2.9 VLIW
Most RISC instructions are 32-bits in length. Using long instruction words that contain
multiple tightly bound instructions is another way to increase the amount of parallelism and
hence performance. This is the rationale for long instruction word (LIW) and very long instruc-
tion word (VLIW) processors. The difference between LIW and VLIW processors is somewhat
arbitrary, so the two terms will be used interchangeably. The performance of processors is lim-
ited by their clock periods, which are ultimately limited by the laws of physics. That is, electric-
ity travels at a significant fraction of the speed of light, which is a constant. If clock speed is held
constant, the way to gain performance is to increase the amount of work done in a clock cycle.
RISC superseded CISC in high performance computing because RISC processors could execute
more instructions in a clock cycle than a CISC processor could. As RISC processors matured,
they were improved and made more complicated by designing them to dispatch multiple instruc-
tions in a clock cycle. This works fine for some applications, but it is sometimes difficult for the
hardware to determine which instructions can be executed in parallel. LIW processors are
explicitly designed for instruction parallelism. With LIW processors, software determines which
instructions can be performed in parallel, bundles this information and the instructions and
passes them to the hardware. As processor clock speed increases become more difficult to
obtain, LIW architectures facilitate increased performance without increasing the clock period.
Early LIW designs in the 1980s such as the Multiflow Trace computers had little commer-
cial success. This was partly due to long compile times. A characteristic of the Multiflow com-
puters was a variable word length. One type of Multiflow processor had seven instructions
packed in a long instruction word, while another one had 14 instructions in a long instruction
word. A great deal of research went into producing compilers to generate instructions that were
optimal for each word length. The resulting complexity resulted in compilers that took a long
time to generate even very simple executables. Another financially unsuccessful computer was
Cydrome’s Cydra 5. It was more sophisticated than the Trace computer and included rotating
and predicate registers.
Intel and Hewlett-Packard have announced Explicitly Parallel Instruction Computing
(EPIC) processors. These represent an update to the LIW concept. Two features that make EPIC
attractive are that processors have a common instruction word length and compiler technology
and processor speed has advanced to allow quick compilation time. The EPIC instruction word
or bundle is 128 bits long and consists of three 41-bit instruction slots, as shown in Figure 2-7.
The five-bit template field defines the mapping of instructions contained in the bundle to func-
tional units and whether the instructions in the bundle can be executed in parallel with the next
instruction. Since the bundle length is constant, objects that are created on one processor will be
compatible with future processors that have the same bundle length. Some features that help
define EPIC processors include large register files, predicated execution, and compiler-controlled speculation.
(Figure: pipelined execution of instruction bundles. In each cycle a new bundle is fetched (IF), and its three instructions A, B, and C proceed through the read (RD), execute (EX), and write-back (WB) stages together; three bundles are shown over cycles 1 through 6.)
The first ISA in the EPIC processor family is Intel Architecture - 64 (IA-64) [2]. It con-
tains a large number of general and floating-point registers (128 of each). Itanium is the first
IA-64 processor and it can dispatch two long word instructions in a clock cycle to achieve addi-
tional parallelism. To demonstrate improvements in instruction level parallelism, Figure 2-9
compares representative numbers of instructions per clock cycle for some of the processors we've
discussed.
2.10 Summary
Trends in processor design in the last twenty years include increased use of pipelining and
increased instruction level parallelism. As a result, there is increased reliance on software to
generate efficient instruction scheduling. These trends have resulted in processor evolution from
CISC through vector and RISC architectures to LIW. While it’s impossible to predict all the
characteristics of processors ten or twenty years from now, the trends will likely continue, so
processors of the near future will contain many of the features described in this chapter.
Processor design is crucial for high performance, but other components of a computer sys-
tem such as memory systems can cripple even the best designed processor. The next chapters
will discuss some of the other system components.
References:
Data Storage
Some day, on the corporate balance sheet, there will be an entry which reads, “Information”;
for in most cases, the information is more valuable than the hardware which processes it.
Grace Murray Hopper
Many applications perform relatively simple operations on vast amounts of data. In such
cases, the performance of a computer's data storage devices impacts overall application
performance more than processor performance does. Data storage devices include, but are not limited to,
processor registers, caches, main memory, disk (hard, compact disk, etc.) and magnetic tape. In
this chapter we will discuss performance aspects of memory systems and caches and how appli-
cation developers can avoid common performance problems. This will be followed by a brief
overview of disk file system performance issues.
3.1 Introduction
Suppose for a moment that you are a carpenter. You have a tool belt, a lightweight tool
box, a tool chest permanently attached to your vehicle, and a shop that contains more tools and
larger machinery. For any particular carpentry job you put a different set of tools into your tool
belt, tool box, and tool chest to reduce the number of trips you have to make: up and down the
ladder to the tool box, back and forth to the tool chest, and to and from the shop.
The combination of tools in your tool belt varies with the job, simply because it isn’t practical to
carry everything on your belt. The same applies to the tool box, chest, and even the shop. The
things you need most often are kept in closer proximity to you.
Computer architectures have adopted an analogous strategy of keeping data close to the
processor. Moreover, the distance, measured in processor clocks, to storage devices increases as
their capacity increases. The processor's registers are, of course, the closest storage
devices. The next closest storage devices are referred to as caches and usually vary in size from
a few hundred bytes to several megabytes (MB). Caches are usually made with static random
access memory (SRAM) chips. Beyond caches lies the main memory system. Most computer
main memory systems are built from dynamic random access memory (DRAM) chips. Some
memory systems are built with SRAMs (e.g., the Cray T90), rendering them faster than DRAM,
but expensive. At the next level of storage hierarchy is the magnetic disk. Magnetic disks are
truly the workhorses of data storage, playing important roles in virtual memory and file systems.
As storage devices become larger, they typically are farther away from the processor and the
path to them becomes narrower and sometimes more complicated. The typical memory hierar-
chy and its basic components are illustrated in Figure 3-1.
(Figure 3-1 shows the typical memory hierarchy: the CPU with its L1 and L2 caches attached to the memory bus, main memory on the memory bus, and magnetic disk reached over the I/O bus.)
3.2 Caches
Registers are the closest user accessible data storage areas to the functional units. Typi-
cally, CISC and RISC processors have fewer than 100 registers and hence these don’t contain
much data. Memory is very far from the processor in terms of processor clocks, so intermediate
storage areas are desired. Cache is a type of memory located between the processor and main
memory.
The largest data caches on high performance processors are usually a few megabytes in
size. Cache may be on-chip (physically part of the processor chip) or off-chip.
Consider a simple loop that adds two arrays element by element, storing the result back
into the second array: Y(I) = X(I) + Y(I). Each iteration (each value of I) requires loading
X(I), loading Y(I), adding X(I) and Y(I), and storing Y(I).
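A minimal sketch of this access pattern in C (the array names, the bound n, and the use of a function are assumptions; the book's example uses Fortran-style notation):

void add_arrays(double *x, double *y, int n)
{
    int i;
    for (i = 0; i < n; i++)
        y[i] = x[i] + y[i];   /* load x[i], load y[i], add, store y[i] */
}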
Suppose we are executing this loop on a machine with a single, direct mapped, 1 MB
cache and that a cache line is 32 bytes in length. If X and Y are arrays of eight-byte data (e.g.,
data declared as REAL*8 in Fortran) then X and Y could be allocated as illustrated in Figure 3-2.
That is, X and Y are a multiple of the cache size apart in memory. On the first iteration elements
X(1) through X(4) are loaded into cache and X(1) is loaded into a register. Note that opera-
tions in Figure 3-2 that force cache lines to be moved between the cache and memory are high-
lighted in bold. Then Y(1) through Y(4) are loaded into cache from memory and Y(1) is
loaded into a register. Note that the cache line containing Y(1) displaces the cache line containing X(1) through X(4).
(Figure 3-2 shows the two arrays mapping to the same cache line address, 0x011A0: each Load X(2), Load Y(2), Add X(2) to Y(2), Store Y(2) sequence forces cache lines to be moved between the cache and memory.)
Completing the iteration, X(1) and Y(1) are added and Y(1) is stored.
On the second iteration, X(2) must be loaded. This requires that elements X(1) through X(4)
be again loaded into cache. But this displaces the cache line containing Y(1) and hence forces
that line to be stored to memory (because Y(1) has been modified) before the line containing
X(1) through X(4) can be loaded into the cache. Each load of X and each load of Y requires
that the data be moved to and from memory. In this case, the cache is nearly useless. This pro-
cess of repeatedly displacing and loading cache lines is referred to as cache thrashing.
If the cache happened to be a two-way set associative cache, then most, if not all, of this
thrashing would be eliminated. This is because the cache line for X would likely reside in one set
of the cache and the cache line for Y in the other. Both round-robin and least-recently-used
replacement strategies would address this situation very well. A random replacement approach
might take a few iterations before getting it right. That is, since the set number to be replaced is
generated randomly, the result could easily be that the first set is replaced multiple times before
selecting the second set (or vice-versa).
The previous example was derived from an example where the arrays X and Y were
declared as follows:
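The Fortran declarations themselves are not reproduced in this excerpt. Assuming (from the padded version discussed next, where the padding is four elements) that 7340032 was the declared number of elements in the first array, an equivalent layout sketched in C would look roughly like this (the struct, which forces the two arrays to be adjacent in memory, and the names are illustrative):

struct layout {
    double X[7340032];   /* 7340032 * 8 bytes: an integral multiple of the 1 MB cache size */
    double Y[7340032];   /* Y therefore begins a multiple of the cache size past X         */
} arrays;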
Note that 7340032 is 0x700000 in hexadecimal, an integral multiple of the cache size, which is
why X and Y end up a multiple of the cache size apart in memory.
If the first array had been “padded” so that X and Y were no longer a multiple of the cache size apart,
then this cache thrashing could have been avoided. Suppose X and Y were declared as follows:
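Again the declarations are not reproduced here; the padded layout presumably differed only in the size of the first array, for example (same assumptions as the sketch above):

struct layout_padded {
    double X[7340036];   /* padded by four elements, i.e., one 32-byte cache line */
    double Y[7340032];   /* Y's cache lines no longer collide with X's            */
} arrays;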
Then many memory transactions will be eliminated. Note that the padding is for four elements.
Since each element is eight bytes, this is effectively padding the array by one cache line (32
bytes = 4 × 8 bytes). With the arrays declared as above, we then have a sequence of operations as
outlined in Figure 3-3. Not only does the second iteration not require any memory transactions,
neither does iteration 3 or 4! At the beginning of iteration five, the cache line containing Y(1)
through Y(4) will have to be stored to memory, so we will have roughly three memory transac-
tions for every four iterations. Compare this to the previous example where we had three mem-
ory transactions for every single iteration.
(Figure 3-3 shows the first iteration for the padded arrays: X(1) through X(4), at memory address 0x400011A0, now occupy the cache line at 0x011A0 while the line holding Y(1) through Y(4) maps to 0x011C0, so loads of X and Y no longer displace one another.)
(Measured results for the two declarations: 7.21 with the original value of 7340032 versus 3.52 with the padded value of 7340036, roughly a factor of two improvement.)
There is so much money involved in cache design that one sometimes wonders if caches should be spelled c-a-s-h! Given the advantages of
cache to today’s processor performance, it is easy to see how an intermediate level of cache
might be beneficial and economically feasible. Consider the Compaq AlphaServer DS20
machine which has a 500 MHz Alpha 21264 processor with a 64 KB instruction cache and 64
KB data cache on chip and an intermediate (or secondary) cache that is four MB in size. The pri-
mary benefit of such caches is, of course, to reduce the average time per access to memory.
Suppose we have a workload which we have studied and we are able to produce a rough
estimate of how often memory accesses occur as a function of the data storage available.
Table 3-2 is a representative listing of this information.
Table 3-2 Estimated Fraction of Memory Accesses as a Function of Available Data Storage.
Data Storage Size     Accesses Within This Size (Cumulative)     Accesses in This Range
0 - 4 KB              72%                                        72%
4 - 16 KB             83%                                        11%
16 - 64 KB            87%                                        4%
64 - 256 KB           90%                                        3%
256 KB - 1 MB         92%                                        2%
1 - 4 MB              95%                                        3%
4 - 16 MB             99%                                        4%
16 - 64 MB            100%                                       1%
Now consider three different cache architectures with the same processor and memory
system. Suppose they have cache structure and attributes as shown in Table 3-3. Note that only
Architecture 3 has two caches.
Table 3-3 Data Storage Sizes and Associated Access Times for Hypothetical Architectures.
Architecture     L1 Cache Size     L1 Access Time (Cycles)     L2 Cache Size     L2 Access Time (Cycles)
1                1 MB              3                           None              N/A
2                4 MB              5                           None              N/A
3                16 KB             1                           4 MB              6
If we assume it is 100 clocks from the processor to main memory for all three architec-
tures, then, using the data in Table 3-2 and Table 3-3, we can produce the expected number of
clock cycles for an average data access for all three architectures.
For architecture 1, 92% of the accesses are in one MB of data storage or less. So:
Expected latency for architecture 1 = 92% × 3 cycles + 8% × 100 cycles = 10.76 cycles.
Now, architecture 2 has a four MB cache and so 95% of the accesses are estimated to
occur within it. Thus:
Expected latency for architecture 2 = 95% × 5 cycles + 5% × 100 cycles = 9.75 cycles.
So, all else being equal, architecture 2 is likely to be faster for this application than
architecture 1, even though its cache has a longer access time: the additional hits in the larger
cache more than make up for the slower hit time. The remaining cache architecture has the following:
Expected latency for architecture 3 = 83% × 1 cycle + 12% × 6 cycles + 5% × 100 cycles
= 6.55 cycles
What an improvement! Even though the primary cache is very small and the secondary
cache is the same size and slower than that of architecture 2, it is much faster than either of the
other cache architectures.
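As a sketch of the arithmetic (the hit fractions come from Table 3-2 and the access times from Table 3-3; everything else here is assumed for illustration):

#include <stdio.h>

/* Expected memory-access latency = sum over levels of
 * (fraction of accesses satisfied at that level) * (its latency in cycles). */
int main(void)
{
    double mem = 100.0;                        /* cycles to main memory       */

    double a1 = 0.92 * 3 + 0.08 * mem;         /* 1 MB cache, 3-cycle hits    */
    double a2 = 0.95 * 5 + 0.05 * mem;         /* 4 MB cache, 5-cycle hits    */
    double a3 = 0.83 * 1 + 0.12 * 6            /* 16 KB L1, then 4 MB L2      */
              + 0.05 * mem;

    printf("expected cycles: %.2f %.2f %.2f\n", a1, a2, a3);  /* 10.76 9.75 6.55 */
    return 0;
}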
3.3.1 Overview
Most of today’s computers are virtual memory machines. Such machines translate logical
memory addresses in a user’s program to physical memory addresses. There are several advan-
tages that virtual memory machines have over those machines that do not have virtual memory.
They allow programs that have logical address space requirements larger than physical memory
to execute, albeit slowly. Virtual memory machines are capable of executing multiple processes
which, when combined, occupy several times more memory than is actually in the machine.
Even though a process' address space appears to be contiguous in virtual memory, it is
likely not contiguous in physical memory. This is because virtual memory systems break up a
process’ memory into blocks, usually referred to as pages. Page sizes are typically four KB or
larger in size and do not always have to be uniform, i.e., a process may have several pages that
are four KB in size and yet several more pages which are one MB or larger. By breaking virtual
memory into pages, the operating system can place some of the pages on magnetic disk. In this
way, the operating system can keep only those pages currently being accessed by the process in
physical memory, with the remainder residing on disk. When memory is accessed in a page that
is not currently in physical memory, the operating system copies the page in question from disk
and places it in physical memory. This may cause another page to be displaced from physical
memory, which forces the operating system to place it on disk. The location on magnetic disk
that is used for virtual memory pages is called swap space since it is used to swap pages in and
out of physical memory.
The operating system needs a map to translate virtual memory addresses to physical mem-
ory addresses. This is done by mapping pages in virtual memory to those in physical memory. A
common way to accomplish this is with a lookup table, generally referred to as a page table.
Processes typically have multiple page tables, including page tables for text, data areas, etc.
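As an illustration of the translation step, here is a deliberately simplified sketch (not how any particular operating system implements it; the table contents are made up):

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096u     /* 4 KB pages assumed */

/* made-up page table: virtual page number -> physical frame number */
static uint32_t page_table[16] = { 7, 3, 12, 5 };

static uint32_t translate(uint32_t vaddr)
{
    uint32_t vpage  = vaddr / PAGE_SIZE;   /* which page the address falls in */
    uint32_t offset = vaddr % PAGE_SIZE;   /* position within that page       */
    return page_table[vpage] * PAGE_SIZE + offset;
}

int main(void)
{
    printf("0x%x -> 0x%x\n", 0x1234u, (unsigned) translate(0x1234u));
    return 0;
}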
A stride of seven, for example, corresponds to an access pattern where every seventh element in a list is accessed, skipping over six elements at a time. Multidi-
mensional arrays have inherently long strides, depending on how they are accessed. For exam-
ple, an array declared as follows,
REAL*8 X(1000,2000)
has a stride of 1000 between columns, meaning that X(N,1) and X(N,2) are 1000 elements
apart, regardless of the value of N. Consider the following sequence of code:
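The loop itself is not reproduced in this excerpt. A sketch of the same access pattern in C (with the array dimensions transposed to match C's row-major layout; the bounds and the arithmetic performed are assumptions):

#define M 1000
#define N 2000

static double X[N][M], Y[N][M];   /* C analogue of REAL*8 X(1000,2000), Y(1000,2000) */

void update(void)
{
    int i, j;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)            /* inner loop: X[j][i] and X[j+1][i] are 1000 elements apart */
            Y[j][i] = Y[j][i] + X[j][i];   /* both arrays accessed with a stride of 1000 (8000 bytes)   */
}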
Assume that both X and Y are declared as above. Then, regardless of the values of M or N, the
execution of this sequence of code results in both X and Y being accessed with strides of 1000.
Note that this means that X(I,1) and X(I,2) are 8000 bytes apart. Many computers use a page
size of only four KB. If this code were executed on such a machine, then each iteration of the
inner loop requires two different TLB entries (one for X() and one for Y()).
To illustrate the severity of TLB misses and the benefit of being able to use larger page
sizes, the following sequence of code was executed on a HP N-4000 computer:
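The test code is not shown in this excerpt; based on the description here and on the loop quoted later in the memory bank discussion, it presumably had roughly this form (the function wrapper and the reduction into sum are assumptions; x must point to at least (imax - 1) * 516 + 256 doubles):

double tlb_test(double *x, int imax)
{
    int i, j, jmax = 256, stride = 516;   /* 516 doubles = a 4128-byte stride */
    double sum = 0.0;

    /* imax * jmax distinct elements (0.5 MB to 4 MB of data) are touched */
    for (j = 0; j < jmax; j++)
        for (i = 0; i < imax; i++)
            sum += x[i*stride + j];
    return sum;
}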
This is clearly a deliberate attempt to cause TLB misses, because the array x is a double
precision array and hence each element is eight bytes in size. The value of stride is set to 516,
which translates to a byte stride of 4128; the value of jmax is fixed at 256 while imax is varied
from 256 to 2048. The result is a test which accessed from 0.5 MB to 4 MB of data. The HP
N-4000 has a 1 MB, four-way set associative cache which, when combined with a stride of 516,
keeps the cache misses to a minimum. The results are astounding, as illustrated in Table 3-4.
Note the time per access for the 0.5 MB problem size drops sharply after the page size is
increased from 4 KB to 16 KB. This implies that a TLB miss takes roughly 160 ns since the data
(0.5 MB) resides entirely in cache. More importantly, it’s interesting that TLB misses can cause
your code to run over 30 times slower!
3.4 Memory
To the casual observer, memory systems appear to be the most elementary part of the com-
puter. It seems to be simply a large number of identical computer chips which store data. This is,
of course, far from reality.
The memory system or main memory, as it is typically described, is the crossroads of data
movement in the computer. Data moving to and from caches moves through the memory system.
Input and output devices such as magnetic disk, network devices, and magnetic tape all have the
main memory system as the target of their output and the source of their input.
Computer hardware vendors have, by increasing address space capabilities, themselves
caused memory capacity demands to increase. During the 70’s almost all computer vendors
moved from 16-bit addresses to 32-bit. As we move into the next millennium, most computers
will have 64-bit address spaces. As a result, memory systems that used to be measured in kilo-
bytes (KB) are now measured in gigabytes (GB).
With the economies of scale in processors, multiprocessor computers have evolved from
simply having two processors in the 80’s to configurations with literally thousands of proces-
sors. At the same time, memory systems are expected to be capable of “feeding” all these
data-consuming processors.
So, while complexity in processor design has increased at a dramatic rate, memory sys-
tems have been driven to keep up with them—albeit unsuccessfully. Added to this pressure is the
tremendous increase in multiprocessor systems that require not just faster memory systems, but
larger capacity systems that allow overall performance to scale with the number of processors.
3.4.2 Interleaving
Simple memory system organizations using DRAM (which is most likely since it is less
expensive than SRAM) result in each memory transaction’s requiring the sum of access time
plus cycle time. One way to improve this is to construct the memory system so that it consists of
multiple banks of memory organized so that sequential words of memory are located in different
banks. Addresses can be sent to multiple banks simultaneously and multiple words can then be
retrieved simultaneously. This will improve performance substantially. Having multiple, inde-
pendent memory banks benefits single processor performance as well as multiprocessor perfor-
mance because, with enough banks, different processors can be accessing different sets of banks
simultaneously. This practice of having multiple memory banks with sequential words distrib-
uted across them in a round-robin fashion is referred to as interleaving. Interleaving reduces the
effective cycle time by enabling multiple memory requests to be performed simultaneously.
The benefits of interleaving can be defeated when the memory access pattern is such that
the same banks are accessed repeatedly. Let us assume that we have a memory system with 16
banks and that the computer uses cache lines which contain 4 words, with each word being 8
bytes in size. If the following loop is executed on this computer, then the same set of banks is
being accessed repeatedly (since stride = 64 words = 16 * 4 words):
double *x;
...
stride = 64;
sum = 0.0;
for( j = 0; j < jmax; j++ )
{
for( i = 0; i < imax; i++ )
sum += x[i*stride + j];
}
As a result, each successive memory access has to wait until the previous one completes
(the sum of the access time and cycle time). This causes the processor (and hence the user’s pro-
gram) to stall on each memory access. This predicament is referred to as a bank stall or bank
contention.
Let's revisit the strided summation loop used in the previous TLB discussion.
First, fix imax at 16384 and jmax at 512 so that the problem size is 64 MB in size. In
Figure 3-4, we show the average access time for various hardware platforms using several values
of stride. Two sets of data are charted for the HP N-4000, one using 4 KB pages and another
using 1 MB pages. Note the tremendous difference in performance for the N-4000 caused by
TLB misses, as illustrated by the divergence in the graphs after a stride of 16. This data was gen-
erated using 8 KB pages on the SUN UE3500 and 16 KB pages on the SGI Origin 2000. Note
that, even with 1 MB pages, the N-4000’s performance decreases after a stride of 8, indicating
that memory bank contention is hindering performance.
Larger page sizes do not just benefit applications that use a lot of data; they also benefit those that
have a large text (i.e., instructions) segment. Electronic design simulations, relational database
engines, and operating systems are all examples of applications whose performance is sensitive
to text size.
How does one alter the page size for an executable? HP-UX provides a mechanism to
change the attributes of an executable file. This can be accomplished by a system utility, chatr.
For the tests discussed here, the executable was modified to request four KB data pages with the
following command:
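The command itself is not reproduced in this excerpt; on HP-UX, a chatr request for four KB data pages is typically of the form shown below (a.out is a placeholder for the executable name):

chatr +pd 4K a.out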
Figure 3-4 Memory access time as a function of word stride for various computers.
Similar chatr commands were used to request the larger page sizes.
Multiple disks are often chained from a single I/O slot (that is, a single card/controller). Frequently I/O cards have a peak
bandwidth inherent in their design which ultimately limits realizable I/O performance. To illus-
trate the problem, consider an I/O card that is capable of only 40 MB/sec. Building a logical
device using a single card with the 8 disks mentioned above limits performance to only half of
the disks’ aggregate capabilities. However, if two cards are used with the 8 disks (4 disks on
each card), then the logical device is capable of up to 80 MB/sec of I/O bandwidth.
Most memory systems of server or mainframe class computers today are capable of deliv-
ering over 400 MB/sec of memory bandwidth per processor. In order to construct a logical
device with magnetic disks capable of providing data at half this rate, one would need 20 of the
disks discussed above. The capacity of these disks, using 10 GB disks, is a whopping 200
GB—for just one processor! Such is the dilemma of system configuration with regard to mag-
netic disk storage: High performance magnetic disk will go hand-in-hand with a tremendous
amount of storage, perhaps far more than is necessary.
Consider making a sandwich: you would not drive to the supermarket separately for each ingredient of the
sandwich. The time and economics involved in travelling to the supermarket for individual items
are not practical.
In an effort to duplicate this efficiency of getting several things with each trip to the more
remote storage devices, buffered I/O mechanisms were developed. The most common is that
found in the C programming language with the fread() and fwrite() subroutines.
This mechanism, described as buffered binary I/O to a stream file, works roughly as fol-
lows. A file is opened with the fopen() subroutine call instead of the open() system call. In
the process of opening this file, fopen() also initializes data structures to maintain a buffer,
allocated in the user’s address space (as opposed to the kernel’s address space). This buffer var-
ies in size and the user can actually manipulate how this buffer is allocated and what its size will
be. Subsequent to the fopen() call, transfers to and from the file are accomplished with calls to
the subroutines fwrite() and fread(), respectively. These subroutines first check to see if the
data being requested is already in the buffer. If it is, then the data is simply copied from the
buffer into the destination specified by the procedure call. Thus, no system call is made and no
transfer of data is made to or from the kernel’s address space (i.e., buffer cache). If the data is not
already in the buffer, then the subroutines make the necessary system call to move blocks of data
into and out of the buffer from the file.
Some implementations of these buffered binary I/O routines use relatively small buffers.
As a result, these routines can be slower than system calls for large transfers. One way to
improve this situation is to make use of the setvbuf() routine. This routine allows the user to
specify another buffer that can be much larger in size. For example, to change the internal buffer
used by buffered I/O, one might use the following C code:
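The code is not reproduced in this excerpt, but a sketch consistent with the description that follows (the file name and open mode are assumptions) is:

#include <stdio.h>

int main(void)
{
    static char work_buffer[65536];        /* 64 KB buffer supplied by the application */
    FILE *fp = fopen("datafile", "r");     /* "datafile" is a placeholder              */

    if (fp == NULL)
        return 1;
    setvbuf(fp, work_buffer, _IOFBF, sizeof(work_buffer));   /* fully buffered, 64 KB  */
    /* subsequent fread() calls on fp now fill and drain work_buffer */
    fclose(fp);
    return 0;
}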
This sequence assigns the 64 KB array work_buffer to the file pointer fp to be used for
its buffering. The performance benefit of doing this will be shown below.
Using language-specific I/O routines such as the Fortran I/O statements (e.g., open(),
read(), write(), etc.) can be extremely detrimental to performance. Implementations vary,
but in many cases these routines are built on top of the buffered binary I/O mechanisms
(fopen(), fread(), fwrite(), etc.) discussed above. Fortran provides for some additional
functionality in these I/O routines, which leads to yet more overhead for doing the actual I/O
transfers. Fortran read performance is typically worse than either using the system call interface
or the buffered binary I/O mechanisms.
To illustrate this, a comparison of reading a file using the read() system call, fread(),
fread() with a 64 KB transfer size, and Fortran I/O was made on a HP N-4000 server using
the HP-UX 11.0 operating system. The Fortran I/O was performed with an implied DO loop in
the READ statement.
Note that the default buffer size for HP-UX’s buffered binary I/O is eight KB. The time to
make transfers of various sizes is given in Table 3-5. The times shown are the milliseconds per
transfer for using various I/O interfaces. A 100 MB file was read sequentially starting from the
beginning of the file to produce these times. The file system buffer cache was configured to be
roughly 500 MB in size so that the file would fit entirely in the buffer cache. So, the times reflect
transfers from the buffer cache and not from magnetic disk.
From the table, one can draw many conclusions. Note that the buffered binary I/O mecha-
nism is very efficient for small (less than four KB) transfers. For transfers of eight KB or more,
the binary I/O interface (fread) benefits from a larger I/O buffer; in this case, 64 KB was used.
However, note that using a system call to perform the I/O is substantially faster than the other
methods for transfers of eight KB or more. Under no circumstances is the Fortran I/O faster than
any of the other interfaces.
One word of warning with regard to the use of large (more than 64 KB) transfer sizes and
system calls. Some file systems are capable of being configured so that transfers exceeding a
certain size will bypass the file system buffer cache. This is referred to as direct I/O or buffer
cache bypass. The intent is to keep extremely large files from occupying the entire buffer cache
and/or to reduce the load on the memory system by eliminating the additional copy from buffer
cache memory to user memory. The downside to this approach is that the operating system can-
not perform read-ahead. Hence, the user will not benefit from memory-to-memory copy speeds.
There’s actually another lesson in Table 3-5. If we divide the transfer sizes by the time, we
can get bandwidth performance. Table 3-6 contains a variation of Table 3-5 modified to reflect
bandwidth rather than the amount of time per transfer. Note that the read system call delivers
roughly twice the performance of other common interfaces. Thus if an application can do its
own buffering, that is, read in 64 KB of data at a time, then the benefits can be tremendous.
The poor performance of Fortran I/O is now clear. Note that the read system call using
large transfers delivers the best bandwidth. So, if your application can do its own buffering and
use system calls to transfer data between your buffer and files, then that could well be the best
performance option.
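A sketch of this kind of application-level buffering using the read() system call (the file name and the processing step are placeholders):

#include <fcntl.h>
#include <unistd.h>

#define BUFSIZE (64 * 1024)              /* 64 KB transfers */

int main(void)
{
    char buffer[BUFSIZE];
    ssize_t nbytes;
    int fd = open("datafile", O_RDONLY); /* "datafile" is a placeholder */

    while ((nbytes = read(fd, buffer, BUFSIZE)) > 0) {
        /* process the nbytes bytes now sitting in buffer */
    }
    close(fd);
    return 0;
}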
An asynchronous read is illustrated in Figure 3-5 below. Note the use of the aio_suspend() routine to cause
the calling program to suspend processing until the read completes.
(The figure's timeline: the aio_read() call creates a new flow of control and the OS initiates the I/O transfer; the calling program continues processing; the aio_suspend() call suspends execution until the data transfer completes, after which processing resumes.)
Figure 3-5 Asynchronous I/O enables processing to continue without waiting for file
transfers to complete.
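A minimal sketch of the sequence in Figure 3-5 using the POSIX asynchronous I/O interface (the file name and buffer size are placeholders; error checking is omitted):

#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    static char buf[65536];
    struct aiocb cb;
    const struct aiocb *list[1];

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = open("datafile", O_RDONLY);   /* "datafile" is a placeholder */
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;

    aio_read(&cb);                /* the OS initiates the transfer    */
    /* ... the program is free to continue processing here ...        */
    list[0] = &cb;
    aio_suspend(list, 1, NULL);   /* suspend until the read completes */
    printf("read %ld bytes\n", (long) aio_return(&cb));
    return 0;
}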
Asynchronous I/O provides the benefits of system call performance with parallelism.
Details of the POSIX asynchronous I/O facility will be discussed in Chapter 8, and it should be
considered for use in I/O intensive applications. There are some downsides to use of asynchro-
nous I/O functions. Some implementations of asynchronous I/O actually create an additional
flow of control in the operating system and then destroy it upon the function’s completion. This
can add a tremendous amount of time to that required to actually transfer the data. SGI’s IRIX
operating system provides the aio_sgi_init() function to alleviate this problem. The user
can identify the number of asynchronous operations to be used by the program, allowing the
operating system to more efficiently service those operations.
3.7 Summary
This chapter has provided many tips to the application writer on how to avoid performance
problems related to storage devices ranging from caches to file systems. The important issues
are:
• Avoid cache thrashing and memory bank contention by dimensioning multidimensional
arrays so that the dimensions are not powers of two.
• Eliminate TLB misses and memory bank contention by accessing arrays in unit stride.
• Reduce TLB misses by using large pages on applications with large memory (data or text)
usage.
• Avoid Fortran I/O interfaces.
• Do your own buffering for I/O and use system calls to transfer large blocks of data to and
from files.
References:
The following publications are excellent resources for the parallel programming and other
topics discussed in this section:
An Overview of Parallel Processing
If one ox could not do the job they did not try to grow a bigger ox, but used two oxen.
Grace Murray Hopper
4.1 Introduction
In a sense, we all perform parallel processing every day. Vision can be accomplished with
one eye, but two eyes give additional peripheral vision as well as depth perception.
Washing dishes is much faster when using two hands instead of one (especially in the absence of
a dishwasher!). Getting work done faster by performing multiple tasks simultaneously is the
driving force behind parallel processing.
In this chapter, the basic concepts of parallel processing are highlighted from both a soft-
ware and hardware perspective. Details of efficient algorithm approaches and software imple-
mentation are discussed in Chapter 8.
Many years ago, Michael Flynn proposed a simple, common model of categorizing all
computers that continues to be useful and applicable. The categories are determined by the
instruction stream and the data stream that the computer can process at any given instant. All
computers can be placed in one of these four categories:
• Single Instruction, Single Data (SISD)—This is the most common computer in the mar-
ket-place today, the single processor system with one instruction stream and one data
stream. Most PCs fall into this category.
• Multiple Instruction, Single Data (MISD)—The same data point is processed by multiple
processors. No system of this type has ever been made commercially available.
• Single Instruction, Multiple Data (SIMD)—Multiple data streams are processed by multi-
ple processors, each of which is executing the same (single) stream of instructions.
• Multiple Instruction, Multiple Data (MIMD)—Multiple processors execute their own, independent instruction streams, each processing its own data stream. Most multiprocessor systems fall into this category.
SISD and MIMD machines are the primary focus of this book. They also happen to
account for about 99% of the computers sold today.
There’s another term that sounds like it is another category, but it isn’t. The acronym is
SPMD and it stands for Single Program Multiple Data. This is a description of one approach to
parallel programming. In SPMD there is one program and multiple processors execute the same
program on multiple data sets. This is discussed again in Chapter 8.
From an operating system perspective, there are two important means of accomplishing
parallel processing: multiple processes and multiple threads.
When one executes a program on a computer, the operating system creates an entity called
a process which has a set of resources associated with it. These resources include, but are not
limited to, data structures containing information about the process, a virtual address space
which contains the program’s text (instructions) and data, and at least one thread. This begs the
question, “What is a thread?”
A thread is an independent flow of control within a process, composed of a context (which
includes a register set) and a sequence of instructions to execute. By independent flow of con-
trol, we mean an execution path through the program.
There are different levels of parallelism in computer systems today. As discussed in
Chapter 2, LIW and superscalar RISC processors achieve parallelism at the instruction level.
However, in the context of this book, we use the term parallel processing to describe the use of
more than one thread of execution in a single program. Note that this definition also allows more than
one process to accomplish the parallel processing. Parallel processing can thus be grouped into three
general categories, depending on whether the parallelism comes from multiple threads within a single
process, from multiple single-threaded processes, or from a combination of the two.
Since parallelism can be achieved with multiple processes, why bother with thread-paral-
lelism? There are at least two potential reasons: conservation of system resources and faster exe-
cution.
Threads share access to process data, open files, and other process attributes. For the fol-
lowing discussion, define a job as a set of tasks to be executed. Sharing data and text can dramat-
ically reduce the resource requirements for a particular job. Contrast this to a collection of
processes which will often duplicate the text and data areas in memory required for the job.
POSIX 1003.1c is the portion of the overall POSIX standard covering threads. Included are the
functions and Application Programming Interfaces (APIs) that support multiple flows of control
within a process. Threads created and manipulated via this standard are generally referred to as
pthreads. Previous to the establishment of pthreads, thread APIs were hardware vendor-specific,
which made portability of thread-parallel applications an oxymoron. This, combined with the
complexity of rewriting applications to use (and benefit from!) explicit thread control, resulted
in very few thread-parallel applications.
#include <pthread.h>
#include <stdio.h>
void *foobar(void *arg);              /* the work routine; its signature is assumed */
main()
{
    pthread_t tid[4];
    int i, retval, iarg[4];
    for (i = 0; i < 4; i++) {         /* create four threads, each calling foobar() */
        iarg[i] = i;
        retval = pthread_create(&tid[i], NULL, foobar, (void *) &iarg[i]);
    }
    for (i = 0; i < 4; i++)           /* wait for all four threads to finish        */
        pthread_join(tid[i], NULL);
    exit(0);
}
Now consider the same effective result—four threads call foobar()—using an OpenMP
compiler directive.
#include <stdio.h>
void *foobar(void *arg);              /* same work routine as above */
main()
{
    int i, iarg[4];
#pragma omp parallel for
    for (i = 0; i < 4; i++) {         /* four iterations, executed by four threads  */
        iarg[i] = i;
        foobar((void *) &iarg[i]);
    }
    exit(0);
}
With regard to programming complexity, the examples speak for themselves, don’t they?
Both explicit thread-parallel implementations (e.g., pthreads) and directive based parallel-
ism (e.g., OpenMP) benefit from what is loosely referred to as “shared-memory.” In both mod-
els, the threads can access the same virtual memory locations allocated before the threads are
created. That is, thread 0 and 1 can both access x in the example below. Moreover, if thread 0
modifies the value of x, then thread 1 can subsequently retrieve this new value of x. Computer
hardware manages the complexity of keeping such shared-memory values current. This feat is
generally referred to as coherency and will be discussed later in Section 4.3.
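A minimal sketch of what such an example might look like with OpenMP (the variable name x follows the text; everything else is assumed):

#include <stdio.h>

int x = 0;                           /* allocated before the threads are created */

main()
{
    #pragma omp parallel num_threads(2)
    {
        #pragma omp master
        x = 42;                      /* thread 0 modifies the shared location    */
        #pragma omp barrier
        printf("x = %d\n", x);       /* every thread now sees the new value      */
    }
    exit(0);
}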
4.2.5 Message-passing
The fork/exec model does not imply the existence of shared-memory. Quite the contrary!
Processes can communicate through I/O interfaces such as the read() and write() system
calls. This communication can occur through a typical file or via sockets.
Communication via a file is easily done between processes which share a file system. This
can be achieved on multiple systems via a shared file system such as NFS. Typically, communi-
cation is accomplished by creating a file lock (commonly a separate file with the suffix .lck) to
establish exclusive access to the communication file.
Sockets are usually a more efficient means of communication between processes since
they remove a lot of the overhead inherent in performing operations on the file system.
Both of these common variations, file system and sockets, rely on the process sending the
data to be communicated to the file or socket. This data can be described as a message, that is,
the sending process is passing a message to a receiving process. Hence the name for this model:
message-passing.
There have been many different implementations of message-passing libraries. PAR-
MACS (for parallel macros) and PVM (Parallel Virtual Machine) are two early examples that
were successful. In an attempt to bring about a standard message-passing API, the Mes-
sage-passing Interface (MPI) Forum put together a specification which was published in May
1994. See http://www.mpi-forum.org/ for more details. MPI soon eclipsed PVM and
some of the advantages of PVM, such as dynamic process creation, are slowly being adopted by
the MPI standard. While it was intended primarily for distributed memory machines, it has the
advantage that it can be used for parallel applications on shared-memory machines as well! MPI
is intended for process parallelism, not thread-parallelism. This actually worked to MPI’s benefit
in its adoption by parallel software developers.
It is worth noting that the first specification for MPI occurred in 1994, a full four years
before POSIX defined a thread-parallel API standard. The combination of early definition with
its more general purpose utility resulted in more highly parallel applications being implemented
with MPI than any other parallel programming model at the turn of the millennium.
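As a flavor of the API, here is a minimal sketch (not an example from the book) in which process 0 sends a single integer to process 1:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* to rank 1, tag 0 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("process 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}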
4.3.1 Clusters
A cluster is an interconnected collection of stand-alone computers that are used as a single
computing resource. One extreme, yet common, example of a cluster is simply a set of several
workstations that are placed in a room and interconnected by a low bandwidth connection like
Ethernet. Since the workstations are just sitting on the floor somewhere with no special cabinet
or rack, such a “system” is often referred to as a carpet cluster.
We’ll loosely define a computing node as a collection of processors that share the lowest
memory latency. By its very definition, a cluster’s node is a single stand-alone computer. One of
the advantages of a cluster is that each node is usually well-balanced in terms of processor,
memory system, and I/O capabilities (because each node is a computer). Another of its advan-
tages is cost; it usually consists of individual off-the-shelf workstations. Interconnect technology
can be purchased off-the-shelf as well in the form of Ethernet, FDDI, etc. There are also propri-
etary interconnect technologies that offer higher performance but also have the inevitably higher
price tags. Clusters are also very scalable since you can continue to add nodes to your parallel
system by simply adding another workstation. The principal limiting factor is, of course, the
capacity and performance of the interconnect.
The capacity and performance of interconnects are two disadvantages of clusters. Access
to data that resides on the same node on which the application is running will be fast (as fast as
the workstation, anyway). Data that exists on other nodes is another matter. Since the other node
is a whole computer by itself, the data will likely have to be transferred via an I/O system call as
discussed above in message-passing. That means that the data must travel across the wire (and
protocol) from the remote node to the node that needs the data. This can be very slow, anywhere
from 1 to 3 orders of magnitude slower than accessing local main memory.
Note that there is the issue of address space for an application. Clusters have multiple,
independent address spaces, one for each node.
There’s also the problem of system management. Without special cluster management
software, it is very difficult to manage the system. Software must be installed on each individual
node which can be a very time-consuming and expensive process (e.g., you may need a software
license for each node!).
There’s also the issue of the system’s giving the impression of being a single system rather
than a bunch of computers. Any user would like to log onto a system and find his data (e.g., files)
as he left it when he was last working on the system. Without sophisticated cluster system soft-
ware, this is not the case with clusters. There are extremes in this experience. On the one hand,
the user may log on to the same workstation in the cluster on every occasion. This may actually
be a good thing; at least the data will look like it did last time he worked on the system. How-
ever, if every user is placed on the same workstation, they are likely to see poor performance
while they all contend for that single workstation’s processing resources to address their basic
login requirements. The other extreme is that users may actually log on to a different worksta-
tion within the cluster every time. In this scenario, one’s environment may look different every
time, causing immense confusion for the cluster neophyte.
To summarize, the advantages of clusters include truly scalable systems at a relatively
inexpensive cost. The disadvantages of clusters include system administration difficulties, lack
of a single system image, and poor interconnect performance.
From the user's perspective, a symmetric multiprocessor (SMP) behaves much like a (faster)
single processor system! We'll include the requirement that an SMP must also be capable of
having all of its processors execute in kernel mode.
SMPs do provide a single address space for applications. This can make application devel-
opment much easier than it would be on a system with multiple, independent address spaces
such as clusters.
An SMP will have multiple processors but it doesn’t really have multiple I/O systems or
multiple memory systems. Since SMPs have equal or uniform access to memory, they are Uni-
form Memory Access (UMA) machines. UMA carries the implication that all processors can
access all of memory with the same latency. In any computer (not just parallel computers), the
various resources must have interfaces to each other. For example, the processor must be able to
communicate with the memory system.
(The figure shows two processors, each with its own cache, sharing a single bus that connects them to two memory boards and the I/O subsystem.)
Figure 4-1 Computers with bus interconnect have few paths between peripherals.
In some system designs, the processors and memory connected by a (fast) bus collectively share an interconnect path. This gives processors
the opportunity to realize low latency, high performance local memory access while
keeping the number of crossbar paths relatively low.
Before we go any further, it’s probably a good time to dig a little deeper into shared-mem-
ory functionality and put interconnects on hold for the moment. If a data item is referenced by a
particular processor on a multiprocessor system, the data is copied into that processor’s cache
and is updated there if the processor modifies the data. If another processor references the data
while a copy is still in the first processor’s cache, a mechanism is needed to ensure that the sec-
ond processor does not use the data from memory which is now out of date. The state that is
achieved when both processors always use the latest value for the data is called cache coherency.
Not all shared-memory machines are cache-coherent. Those that are not typically rely on
software to achieve coherence at any given time. This can be achieved with two mechanisms.
The first mechanism, generally referred to as a cache line flush, immediately forces a cache line
from the processor’s cache to main memory. The second mechanism simply marks a particular
cache line as being invalid but does not write it back to memory. This is called a cache line purge
and basically forces the processor to explicitly load a “fresh” copy of the data from memory the
next time it is accessed. An example of how these mechanisms work is illustrated in Figure 4-3.
(The figure shows two processors, each with its own cache, connected through a crossbar that provides separate, independent paths to each of two memory boards.)
Figure 4-2 Computers with crossbar interconnect usually enjoy multiple, independent
paths between peripherals.
Recall that a cache line typically contains multiple data elements. Let’s suppose that x and
y reside in the same cache line as shown in Figure 4-3. Assume we are processing on processor
0 and change the value of x. Later on, processor zero needs to get a “fresh” copy of y and we
perform a cache line purge. Then finally we realize that it’s a good time to send the new value of
x back to memory so that all the other processors can see what we did to it. So, a cache line flush
is performed. But, hey! That cache line was purged, so we lost the value of x, and since the
cache line was marked invalid by the earlier purge, who knows what was actually flushed back
to memory? Worse yet, and this is the point we were striving to make, we may have no clue that
x and y were in the same cache line. This is an example of false cache line sharing which we’ll
discuss in more detail in Chapter 8.
As the previous discussion shows, performing cache coherence in software is awkward,
since the application has to do purges and flushes all the time. However, performing coherence
in software means that the hardware doesn’t have to do it. As a result, such machines are likely
to have hardware that is much less complicated and, hence, is likely to be less expensive. The
Cray T3D is a good example of a shared-memory machine that doesn’t perform cache coherence
with hardware.
So, rather than continuously purging and flushing caches ourselves, we may be interested
in a shared-memory machine which takes care of the cache coherency for us (i.e., in hardware).
Roughly speaking, there are two means of accomplishing coherency: bus snooping and directory
based coherency.
(Figure 4-3 traces the scenario just described: both processors load the cache line holding x and y, initially 0 and 0; one processor stores x = 1 while the other stores y = 3; a purge on the first processor invalidates its line, discarding the new value of x, while a flush on the second writes its line back to memory; a subsequent load therefore sees x = 0 and y = 3.)
Figure 4-3 Example of false cache line sharing complicated by coherency controlled with software.
Let's refer to buses that connect caches to main memory as memory buses. Coherency via bus snooping is accomplished by having every memory bus continuously communicate what's happening to it. Coherency is achieved because all the other memory buses are listening, and
when another memory bus asks for a particular cache line, they immediately look in their caches
to see if they have it. The memory bus whose cache had the cache line will send it to the request-
ing bus and send a message to the memory system indicating that the request has been resolved.
Roughly speaking, this works because moving data from one cache to another is a lot faster than
pulling the cache line from main memory. Note that this approach requires every memory bus
to be listening to every other memory bus, that is, it requires all the memory
buses to be snooping.
One important thing to note about bus snooping is that it requires a lot of bus traffic (the
snooping). Clearly, adding more processors (and their caches) adds that much more snoopy traf-
fic. As a result, the overall bus bandwidth of the machine can be overwhelmed with just a few
processors. This is one reason why cache snooping is typically used on SMPs with fewer than
twenty processors.
Now, let’s go back to the basic issue of directory-based cache coherency. If another pro-
cessor has the data you want, then you need to get it from its cache rather than main memory.
This means that you need to check all of the other caches before going to memory to get the
data. We want this to happen quickly, too! One way to do this is to have hardware support to
maintain a set of tables which shows which cache has what data. These tables are, of course,
large; their combined size certainly approaches that of a metropolitan phone book or
directory! This allows a processor to request its data from the directory, and it will find it for you
whether it be in another cache or in memory. This device should have a lot of data paths because
data really needs to be transferred directly from one cache to another as well as between caches
and main memory. This sounds familiar, doesn’t it? That’s because this directory mechanism is a
lot like an intelligent crossbar. It’s intelligent because it has to locate data, not just move it.
The advantage of this approach, generally referred to as directory-based coherency, is that
it is scalable. It should be fast because of all the direct paths between caches. However, we dis-
cussed why a crossbar-based system can actually degrade single processor performance above,
and this certainly applies to smart crossbars as well. The advantage of using directories for
coherency instead of bus snooping is scalability. Having a directory eliminates the need for all
the memory buses to communicate their every move to the entire system. Furthermore, every
memory bus doesn’t have to be snooping either. This reduces the amount of traffic across the
memory buses and, when combined with the multiple data paths in a crossbar, makes for a lot
more bandwidth available to move real data. This allows directory-based SMPs to scale much
better than those which use bus snooping for coherency.
The disadvantage of directory-based coherency is that it is expensive. As mentioned previ-
ously, crossbars are expensive. So, adding directory-based coherency to the crossbar only makes
it that much more expensive.
A design in which memory access time depends on which node the memory resides in is called a Non-Uniform Memory Access (NUMA) architecture. NUMA is not tied to crossbar technology; any intercon-
nect could be used in the discussion above and still result in different memory access times.
NUMA machines which have hardware-based cache coherency are simply referred to as
ccNUMA machines.
(Figure 4-4 shows a two-node system: in each node, two processors with their caches, a memory, and an I/O connection share a local bus, and each node's bus attaches through a switch to the interconnect joining the nodes.)
Two examples of systems that use a directory-based coherency protocol to implement highly parallel ccNUMA machines are Sequent's NUMA-Q and
Hewlett-Packard’s Scalable Computing Architecture (SCA).
Now compare the previous scenario to a more sophisticated operating system’s approach.
It will place all the memory that a thread would need to access as close to it as possible. One
approach is to allocate all the memory on one node and execute all four threads on that node.
Another is to allocate the memory that each thread will access on the node it will execute on.
This latter scenario is non-trivial because the operating system will need some hints as to just
what memory the thread plans to access. Moreover, it’s not likely to be that simple because there
will be data that is shared between threads.
There are other issues besides memory latency that are important. Suppose you have an
application that is two-way parallel and is very memory-intensive. Assume that the nodes in
Figure 4-4 have a memory system that can deliver only enough bandwidth to sustain one of the
two threads (or processes). Then the application is likely to execute much faster if exactly one
thread executes on each node so they won’t be competing for a single memory system.
Another example is best illustrated with a message-passing application. Assume that you
have an application which doesn’t exhibit a balanced message-passing load. That is, suppose
that all the processes send about 1 MB of data in messages to each other. However, processes 0
and 2 send an additional 500 MB of data to each other. If the operating system schedules the
processes so that processes 0 and 1 execute on node 0 and processes 2 and 3 execute on node 1,
then 500 MB of messages will be passed across the interconnect between processes 0 and 2.
Compare this to a scenario in which processes 0 and 2 are scheduled on node 0 and processes 1
and 3 are scheduled on node 1. In this second scenario, the 500 MB of messaging activity is
done within the node. Therefore, it doesn’t incur the additional latency and reduced bandwidth
that will result from passing through the two switches and the interconnect.
There is yet another aspect of the previous example that underscores the importance of
proper placement. Many hardware vendors have implemented their message-passing APIs in a
way that exploits the advantages of shared-memory machines. Let’s assume that the machine in
Figure 4-4 is actually a cluster of two SMPs with a fast interconnect (but the computers have
multiple, independent address spaces).
Since processors within SMPs share memory, there is a short cut that message-passing can
take. Generally speaking, messages are passed via a third “holding” buffer, as shown in
Figure 4-5. So, there are actually two copy operations involved in passing a message, one from
the sending process to the intermediate buffer, and a second from the intermediate buffer to the
receiving process. An advantage of cache-coherent shared-memory machines is that the holding
buffer can be eliminated, as illustrated in Figure 4-6. In this case, the message can be copied
directly from the sending process’ address space to that of the receiving process. The result is
that the operation takes only half the time it normally would have required! This type of data
transfer is referred to as process-to-process-bcopy. For those vendors that support pro-
cess-to-process-bcopy, it is typically accomplished without any intervention by the user, as it is
built into the message-passing library.
(In Figure 4-5, the data moves from process 0's address space into an intermediate MPI buffer, and from there into process 1's address space.)
Figure 4-5 Typical MPI message-passing. Process 0 sending data to process 1 actually results in two transfers.
(Figure 4-6 shows the same send and receive copying the data directly between the two processes' address spaces, with no intermediate buffer.)
Now back to our message-passing application example. If the processes are scheduled so that processes 0 and 2 are on the same node, then the 500 MB of messages can be passed directly from one process to the other, causing only 500 MB to actually be copied instead of
almost a gigabyte (2 × 500 MB) of data being copied to and from the intermediate buffer.
It’s worth noting that some operating systems act naively. That is, a process’ memory may
be allocated on a node where no thread is executing. The result is that every memory access is
across the interconnect. Memory references that go across the ccNUMA interconnect are also
referred to as a remote memory access.
So, by defining some attributes of processes or threads a priori, a user can take advantage
of knowledge of the machine’s configuration and provide information to the operating system so
that it can efficiently allocate resources. The term topology is used to describe the layout of pro-
cessors or nodes in a parallel architecture. It is especially useful in NUMA configurations and
clusters of SMPs so that users can determine the number of processors in each node as well as
how many nodes are in the overall system. Topology is also used to describe the placement of
a parallel application's processes and threads onto nodes and processors. Some systems provide a
utility that lets the user control this placement with options such as the following:
• node # Specify which node to start the first thread of a process on.
• locality Force child processes to be created on the same node as the parent.
• maxth # Specify maximum number of threads to be created per node.
• min Specify minimum number of total threads to create.
• max Specify maximum number of total threads that can be created.
• private Specify type of memory that a process’ data area should be allocated in.
• stacktype Specify type of memory that a process’ stack area should be allocated in.
• specific Specify type of memory that a process’ thread-specific memory should be
allocated in.
• spin Create all threads (up to maximum) so that they are spin-waiting.
Another useful utility, dplace, is available on many SGI systems. It uses memory, pro-
cess, and thread placement specifications from a file (created by the user), allowing the user to control where the application's threads and processes execute and where their memory is placed.
Recall the parallel application discussed above in which two processes passed the bulk of
the total messages (500 MB between thread 0 and thread 2). Then an appropriate dplace file for
this application is as follows:
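The placement file itself is not reproduced in this excerpt; given the description that follows and the fragment quoted a little later, it was presumably of the form:

memories 2
threads 4
distribute threads block 1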
This dplace file specifies that there are two memory systems and a total of 4 threads to be
used by the application. It also specifies the distribution of threads to be 1 per memory system
(which is a node in the context of this chapter). The effect is a round-robin type scheduling
approach, that is, thread 0 is scheduled on node 0, thread 1 on node 1, thread 2 on node 0, and
finally thread 3 on node 1. This is exactly the scheduling that we desire for best performance!
Note that we could have achieved the other scheduling scenario (processes 0 and 1 on node 0 with processes 2 and 3 on node 1) by changing the block specification line so that two consecutive threads are placed on each memory system.
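Assuming the same syntax as the example below, the changed line would read something like

distribute threads block 2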
This same utility can be used to handle the two-way parallel application discussed above that is memory intensive. The following accomplishes the desired result:
memories 2
threads 2
distribute threads block 1
or an equivalent specification.
Well, neither of these interfaces is anything close to a standard. Moreover, just as with
threads before POSIX got its act together, every computer system vendor will have its own util-
ity with a different set of options or file semantics. But even if there were a standard, it would
require Independent Software Vendors (ISVs) to modify their applications to use these com-
mand line interfaces. Why is that? Because almost all parallel applications that are commercially
available today are launched by a “wrapper” application. Thus, the user could launch the appli-
cation with a perfectly good topology specification, but it would apply only to the wrapper appli-
cation and have no effect on the parallel executable.
It should be noted that most of the functionality outlined above could be achieved through
a library of procedures provided by both vendors. But, again, these procedures are very
non-standard and would require application developers to extensively modify their parallel
application in very non-portable ways to make use of these procedures.
There are other approaches for the user to provide information to the operating system
about their parallel applications. Among them are environmental controls, application registries,
and application resource tools.
Application resource tools or managers fall into roughly two categories: ones that require
a command line interface and those that are system level tools. The former has been discussed
above. The latter is a powerful means of configuring virtual machines within a single system.
Such virtual machines have many aliases, including subcomplex and protection domains.
SPP-UX’s subcomplex manager and HP-UX’s Process Resource Manager (PRM) are examples
of such tools. Generally speaking, this is a means of redefining the system for multiple users
rather than the user providing information about her application to the operating system.
Application registries seem like a good place for parallel attributes to reside. However, a
registry defines these attributes for all users of that application on that system. So, all users
would have the same set of attributes applied to them unless there were multiple application reg-
istries, say, one for each user. In any case, the application provider would have to provide these
registries for each operating system. Detailed utilities and instructions on how to modify these
registries on a per-session basis are needed as well. Today there are no commercially available
parallel applications that make thread, process, or topology controls available through registries.
One of the easiest ways to provide such information is through the user’s environment.
Environmental variables are used to control parallelism in most operating systems and parallel
APIs. One example is MPI on HP-UX. Applications built with HP-UX’s MPI provide the
MPI_TOPOLOGY environment variable. It controls the application’s placement of processes by
enabling the user to specify the virtual machine (subcomplex), initial node, and process topol-
ogy to use in the execution of the application. For example, an MPI_TOPOLOGY value of
rdbms/3:4,0,4,4 specifies that the application is to run on the subcomplex named rdbms, that node 3 is the initial node, and that the per-node process counts for nodes 0 through 3 are 4, 0, 4, and 4, with placement beginning at the initial node. Thus, processes 0, 1, 2, and 3 will execute on node 3, processes 4, 5, 6, and 7 will run on node 0, and processes 8, 9, 10, and 11 will run on node 2.
Many vendors provide environment variables to control parallelism. While this is a very
convenient means of controlling parallelism, it has some drawbacks:
• The current and/or default setting is not easily identifiable by the user unless she has set it
in her login environment. For example, HP-UX uses the MP_NUMBER_OF_THREADS
environment variable to control the number of threads to be used by parallel applications
that were implemented with compiler directives. But its default value, or even its exist-
ence, is nontrivial to identify.
• There’s no protection against misspellings until after the fact. For example, in the
MPI_TOPOLOGY environment variable, what if the subcomplex name was actually rdb?
The user won’t find a problem until the application begins to execute and then it may not
be obvious what the problem is. Worse yet, what if the user sets the environment variable
MPNUMBEROFTHREADS (forgetting the underscores)? He certainly won’t get the
number of threads he was asking for.
• Applications can redefine environment variables via system calls such as putenv().
Thus a user may set the maximum number of threads he wants to use to be one number
while the application, upon execution, promptly changes that environment variable to
another value.
One alternative to environment variables is the concept of limits. In general, this enables
the user to set or get limitations on the system resources available to the user. Historically, this
has included parameters indicating the maximum size of a process’ stack, text, and data areas, as
well as specifications for things like the maximum cpu time that a process can consume or the
maximum number of file descriptors a process can have open.
The limits interface, or one like it, has also been applied to parallelism. Such use, however,
is usually limited (pardon the pun) to the maximum number of threads. It does avoid the prob-
lems that environment variables have because it requires an interface to modify the parameters.
It also reports the current settings for all parameters, and the settings cannot be reset through a procedure call.
4.5 Summary
This chapter has given an overview of the computer hardware and system software issues
that apply to parallel processing. Many of the topics are informative only. For example, there’s
not a whole lot of benefit in knowing that a machine’s cache coherency is done by bus snooping
or use of directories.
There is one important message though: know your environment. To make the most of the
system you are executing a parallel application on, you should know the topology of your sys-
tem and what the desired topology is for your application's threads and/or processes. This can
make a huge difference in how fast your application runs, regardless of how well you optimized
it.
The parallel controls discussed here are not covered in detail. Details of their functionality
are beyond the scope of this book. However, the reader should know that such controls exist.
That is, the reader now knows enough to ask the hardware vendor whether such tools exist (and
if they don’t, why don’t they?).
References:
The following publications are excellent resources for the parallel systems and other top-
ics discussed in this section.
1. Dagum, L. and Menon, R. OpenMP: An Industry-Standard API for Shared Memory Pro-
gramming, IEEE Computational Science and Engineering, Vol. 5, No. 1, January/March
1998. http://www.openmp.org/.
2. Message-Passing Interface Forum. MPI: A Message-Passing Interface Standard, Univer-
sity of Tennessee, 1994.
3. Norton, S. J.; DiPasquale, M. D. Thread Time: The MultiThreaded Programming Guide,
Prentice-Hall Professional Technical Reference, 1996.
4. Hennessy, J. L.; Patterson, D. A. Computer Architecture, A Quantitative Approach, Mor-
gan Kaufmann, 1996. ISBN 1-55860-329-8.
5. Pfister, G. F. In Search of Clusters, Prentice Hall PTR, 1998. ISBN 0-13-899709-8.
6. Hwang, K. Advanced Computer Architecture, McGraw-Hill, 1993. ISBN 0-07-031622-8.
P A R T 2
Software Techniques —
The Tools
C H A P T E R 5
How the Compiler Can Help and Hinder Performance
The three most important factors in selling are location, location, location.
Realtor’s creed
[Figure: the compilation process. The compiler translates high-level source into assembly code, the assembler converts the assembly code into object code, and the linker combines object code into an executable.]
system or computer can be used on a different one. The pinnacle of compatibility is binary com-
patibility which is when executables created for one version of an operating system or computer
can be used on a different one.
Of course, users don’t have to write high-level code. They can write assembly code
directly. However, assembly code takes longer to write, is harder to debug, and is less portable,
so very few people write programs in assembly code anymore. Still, some programmers enjoy
writing assembly code since it allows them to control the instruction sequence exactly and max-
imize processor performance. For relatively simple routines, code by a proficient assembly code
programmer will always outperform a program written in a high-level language. One goal for
any compiler is to convert a routine written in a high-level language to the most efficient assem-
bly code possible.
Producing efficient assembly code isn’t easy. The simplest thing for the compiler to do is a
one-to-many translation. Each line of high-level code is converted to one or more lines of assem-
bly code. This usually results in poorly performing assembly code since there are probably many
more instructions than necessary. Compilers must apply sophisticated (and time-consuming)
mathematical techniques to generate good assembly code.
Compilers are given a file or files from which to create object code. In each file there may
be multiple procedures. In Fortran 90, procedures are called subroutines, functions or modules,
while in C and C++ they are called functions or the unique main function. The compiler ana-
lyzes these by breaking them into language-independent pseudo-code called intermediate lan-
guage code. This code is never seen by the user and is deleted once the compiler has completed
the compile. This intermediate language is analyzed by breaking it into smaller pieces named
basic blocks. Assembly code is then generated from the intermediate code.
A basic block is a sequence of code that has a single entry point at the top and a single exit at the bottom. For example, the sequence

i = 1;
j = 3;
go to location_1;

is a basic block, while

i = 1;
go to location_1;
j = 3;

isn't, since it has a branch in its interior. One of the most important constructs to optimize is the loop. Although a loop from Fortran or C is converted to multiple basic blocks, the meat of the
loop may be only a single basic block. Consider the following routine:
INTEGER*8 IX(N)
DO I = 1,N
IX(I) = 0
ENDDO
Pseudo-Code Comments
Basic Block 1:
r1 = addr ix(1) load address of ix(1) into r1
r2 = n load n into r2
r3 = 0 load 0 into r3
if (r2 <= r3) exit if (n <= 0) exit
Basic Block 2:
r4 = 1 for loop counter
r5 = 8 increment for address of ix in bytes
r6 = 0 load 0 into r6
Basic Block 3:
loop:
store r6 at r1 store 0 to the address of ix(i)
r1 = r1 + r5 address for next ix(i)
r3 = r3 + r4 increment loop counter
if (r3 <= r2) go to loop if (i <= n) loop
OpenMP directives:
C$OMP PARALLEL PRIVATE(J) SHARED(N)
C$OMP DO
DO J = 1, M
CALL INIT(A(1,J),N)
ENDDO
C$OMP END PARALLEL
Hewlett-Packard directives:
C$DIR LOOP_PARALLEL
DO J = 1, M
CALL INIT(A(1,J),N)
ENDDO
SGI directives:
C$DOACROSS LOCAL(J), SHARED(A,N)
DO J = 1, M
CALL INIT(A(1,J),N)
ENDDO
In the C language, compiler directives are usually implemented through #pragma control
lines. Thus a line of the form
#pragma directive-name
level dependency. In these situations, you can expect to get wrong answers some of the time, so be careful.
5.4 Metrics
In order to discuss the theoretical benefits of optimizations, some basic metrics must be
defined. Most routines are constrained by the ability to move data from cache or memory to the
processor.
5.4.1 In-Cache
When data is in-cache and the code is performing floating-point calculations, the goal of
optimization is usually to increase the number of floating-point operations per memory opera-
tion. We use the term F to M ratio, or F:M, to quantify this relationship. So the loop
DO I = 1,M
Z(I) = X(I) + Y(I)
ENDDO
has an F:M ratio of 1:3 since there is one floating-point operation for three memory operations.
5.4.2 Out-of-Cache
Quantifying the amount of data moved is very important when the data is out-of-cache.
One useful metric is the number of floating-point operations per byte of data. In the above exam-
ple, if the arrays used 64-bit floating-point data, then there is one floating-point operation per 24
bytes of data. If the code could use 32-bit data, then the performance of out-of-cache problems
improves by a factor of two, since only half as much data is needed per floating-point operation
or one floating-point operation per 12 bytes of data.
So, if the data is being loaded and stored from memory most of the time, the floating-point
operations themselves don’t matter much. All the time is spent performing the memory opera-
tions. The amount of data movement from memory to cache can be cumbersome to discuss. The
size of the cache line varies from one processor to another and the data type also varies from one
application to another. We’re usually dealing with a vector containing some number of points,
say, n, in any discussion of data movement.
Recall that when data is needed by the processor and it is not in cache, a cache miss
occurs. When a cache miss occurs, data moves from memory to cache. Generally the processor
modifies the data and puts the result back in cache. At some point later in time, a write back
occurs and the data moves from cache back to memory. So for both a cache miss and a write
back, data moves between memory and cache. To help quantify the data movement, we'll define a new term, memory_transfer, to be the movement of one n-element vector's worth of data between memory and cache.
Using this definition allows routines to be compared independently of data type and the
processor cache line size. When n loads occur and the data is not in cache, cache misses occur
and data must be loaded from memory. This causes one memory_transfer. When n stores occur
and the data is already in cache, one memory_transfer occurs for the write backs. When n stores
occur and the data is not in cache, there are two memory_transfers, one caused by the loads and
one from the write backs.
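For example, for the earlier loop Z(I) = X(I) + Y(I) with all three arrays out-of-cache, the loads of X cause one memory_transfer, the loads of Y cause a second, and the stores of Z cause two more (one from the misses on Z and one from the write backs), for a total of four memory_transfers.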
[Table: compiler optimization techniques and the goals they address: reducing the number of floating-point, integer, and branch instructions, and reducing the effects of instruction latencies and memory bandwidth. The techniques listed are register allocation, the C/C++ register data type, the C/C++ asm macro, uniqueness of memory addresses, common subexpression elimination, strength reductions, increasing fma use, and filling branch delay slots.]
...
register double b1, b2, b3, b4, b5, b6, b7, b8, b9, b10, b11, b12;
...
for (j = 0; j < n; j += 4)
{
...
for (i = 0; i < m; i++)
{
c1[i] += b1 * a[i+j*lda] + b4 * a[i+(j+1)*lda]
+ b7 * a[i+(j+2)*lda] + b10 * a[i+(j+3)*lda];
c2[i] += b2 * a[i+j*lda] + b5 * a[i+(j+1)*lda]
+ b8 * a[i+(j+2)*lda] + b11 * a[i+(j+3)*lda];
c3[i] += b3 * a[i+j*lda] + b6 * a[i+(j+1)*lda]
+ b9 * a[i+(j+2)*lda] + b12 * a[i+(j+3)*lda];
}
}
even if a compiler supports asm, it may not support all assembly code instructions. Usage also
varies among hardware vendors. The following example shows how an integer multiplication by
two using a shift left and add (SHLADD) instruction may be accomplished using an asm macro:
#include <machine/reg.h>
#include <machine/inline.h>
void scale(n, ix, iy)
int n, *ix, *iy;
{
int i;
register int reg1;
for (i = 0; i < n; i++)
{
/* want to perform iy[i] = (ix[i] << 1); */
reg1 = ix[i];
_asm("SHLADD",reg1,1,GR0,reg1);
iy[i] = reg1;
}
}
DO I = 1,N
Y(I) = X(I) + Z
ENDDO
Fortran assumes that Z and all of the values of X and Y have unique memory locations.
Thus Z and several values of X and Y could be loaded before the first location for Y is stored to.
The C language is more complicated. If variables are defined locally, the compiler knows they
don’t overlap, but globally defined variables may be aliased. So if x, y, and z are defined glo-
bally, the compiler must assume that they could overlap in the following loop:
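The loop in question is the C analogue of the Fortran loop above, i.e., something like

for (i = 0; i < n; i++)
   y[i] = x[i] + z;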
To see how important this can be, suppose y[2] and z are aliased and x[1] = 1,
x[2] = 2, x[3]= 3, z = 4. In C the loop would proceed as
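a sequence along the lines of the following, since z must be reloaded after every store:

load z = 4, load x[1] = 1, add z + x[1], store y[1] = 5
load z = 4, load x[2] = 2, add z + x[2], store y[2] = 6 (y[2] and z are aliased, so z is now 6)
load z = 6, load x[3] = 3, add z + x[3], store y[3] = 9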
In Fortran, it would be perfectly legal to load Z outside the loop and load two values of X
before any additions and stores, such as
load Z = 4
load X(1) = 1, load X(2) = 2, load X(3) = 3
add Z + X(1), add Z + X(2), add Z + X(3)
store Y(1) = 5, store Y(2) = 6, store Y(3) = 7
So different answers are obtained. Aliasing greatly restricts compiler optimizations. Fre-
quently, programmers know that none of the arrays overlap in memory, but, as discussed above,
the C language assumes that some aliasing occurs. Therefore C compilers usually have a com-
piler option such as +Onoparmsoverlap that causes the compiler to assume none of the
addresses overlap. This allows the compiler to make more aggressive optimizations. This topic
will be discussed more in Chapter 7.
i = 0;
if (i != 0) deadcode(i);
a = 1 + 2
b = a + 3
can be replaced by
a = 3
b = 6
a = b + (c + d)
f = e + (c + d)
the expression c + d is used by both a and f. Thus c + d can be calculated first and this result
used in the calculations for both a and f.
integer values to floating-point values, perform the division, and convert the result back to inte-
ger. Since the IEEE 64-bit floating-point mantissa is 52 bits wide, the answers are guaranteed to
be exact.
Replace floating-point multiplication with floating-point additions On some
poorly performing processors, floating-point multiplication operations are more expensive than
additions, so some compilers convert floating-point multiplications by small constants to addi-
tions. For example,
y = 2 * x
can be replaced by
y = x + x
a = y / x
b = z / x
both y and z are divided by x. Some compilers have an optimization that replaces the expres-
sions with the equivalent of
c = 1 / x
a = y * c
b = z * c
x = y ** 3
can be replaced by
x = y * y * y
(a * b) + c
-(a * b) + c
These instructions bring up interesting issues, since the result may be more accurate than
using two separate instructions. When two instructions are used, the result of the multiplication
is stored to a register before being loaded for the addition. This intermediate result is limited by
the precision of the operations being performed (64-bits, for example). Some rounding may
occur when forming the intermediate result. When a single instruction is used, the intermediate
result is not limited by the precision of the multiplication, so it may use more bits and hence be
more accurate. Compilers may make use of these instructions, but usually require a compiler
option to enable them.
Codes can sometimes be altered to reduce the number of floating-point instructions by
maximizing the number of these compound instructions. For example, let a, b and c be complex
numbers written in terms of their real and imaginary components as a = (ar, ai), b = (br,
bi), c = (cr, ci). The sequence c = c + a * b is usually performed as
multiplication f1 = ar*br
multiplication f2 = ar*bi
fnma f3 = -ai*bi + f1
fma f4 = ai*br + f2
addition f5 = cr + f3
addition f6 = ci + f4
which uses two multiplications, a fma, a fnma and two additions. By altering the order of the
instructions to
fma f1 = ar*br + cr
fma f2 = ar*bi + ci
fnma f3 = -ai*bi + f1
fma f4 = ai*br + f2
As with many optimizations, this reordering of instructions may produce slightly different
results than the original code.
[Table: single-loop optimization techniques and the goals they address: reducing the number of floating-point, integer, and branch instructions, and reducing the effects of instruction latencies and memory bandwidth. The techniques listed are induction variable optimization, prefetching, test promotion in loops, loop peeling, fusion, fission, copying, block and copy, unrolling, software pipelining, loop invariant code motion, array padding, and optimizing reductions.]
5.5.2.1 Induction Variable Optimization
In the loop
for (i = 0; i < n; i += 2)
ia[i] = i * k + m;
the variable i is known as the induction variable and n is the loop stop variable. When values in
the loop are a linear function of the induction variable (a multiple of the induction variable
added to a constant), the code can be simplified by replacing the expressions with a counter and
replacing the multiplication by an addition. Thus this is also a strength reduction. The above
code can be replaced by
ic = m;
for (i = 0; i < n; i += 2)
{
   ia[i] = ic;
   ic = ic + 2 * k;   /* i advances by 2 each iteration, so i*k advances by 2*k */
}
5.5.2.2 Prefetching
Chapter 3 discussed using prefetch operations to decrease the effective memory latency.
When prefetches are under software control, they are implemented by special prefetch instruc-
tions that must be inserted into the instruction stream. Prefetch instructions can be inserted any-
where in the instruction stream, but in practice they usually occur in loop structures, since the
need for prefetching in a loop is obvious.
The compiler may require a compiler option or a compiler directive/pragma to enable the
inserting of prefetch instructions. One valid area for concern is what happens when the compiler
prefetches off the end of an array. For example, in
DO I = 1,N
X(I) = 0
ENDDO
suppose the prefetch instruction requests the data four elements in advance of X(I). For X(N),
the hardware will attempt to prefetch X(N+4). What if this element doesn’t exist? A regular load
of location X(N+4) can cause a program to abort if the address of X(N+4) is not valid. Fortu-
nately, processors are designed to ignore prefetch instructions to illegal addresses. In fact, pro-
cessors also ignore prefetch instructions to locations that have not been mapped to virtual
memory, i.e., that would cause page faults. So, it’s important to ensure that prefetch instructions
are to locations that have been mapped to virtual memory to ensure you get a benefit from insert-
ing them. Also, there doesn’t always need to be a prefetch instruction for every original memory
instruction. One prefetch per cache line is sufficient to preload the data. The following three
examples show how the accesses of the X array affect the generation of prefetch instructions:
REAL*8 X(N)
...
DO I = 1,N
X(I) = 0
ENDDO
This requires storing the value zero to the addresses X(1) through X(N). Using compiler
prefetches requires inserting at least one prefetch per cache line of X. Therefore, every 16 stores
requires one prefetch instruction. Adding a prefetch instruction greatly improves out-of-cache
performance, but if the data for X is already in-cache, the loop may take longer by a factor of 17/16, since 17 instructions are now issued where 16 would have sufficed. The smaller the cache line size, the larger this effect.
REAL*8 X(N)
...
DO I = 1,N,16
X(I) = 0
ENDDO
In this example, only one point in each cache line is modified owing to the stride of 16 on
I; therefore, a compiler should insert one prefetch instruction for each store of X. If the data were already in-cache, this loop would take about twice as long to execute as the code without prefetch instructions. Lessons:
1. Unit stride code is best for prefetching (and nearly everything else related to perfor-
mance).
2. Prefetching data helps performance when the data is out-of-cache, but can hurt perfor-
mance when the data is already in-cache. For large problems, the data is frequently
out-of-cache, so using data prefetch instructions is probably the right thing to do.
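The next example involves a gather and a scatter through an index array IA. A sketch of the loop discussed below (the gather Y(I) = X(IA(I)) appears in the code that follows; the scatter's right-hand side is an assumption):

DO I = 1,N
   Y(I) = X(IA(I))
   Z(IA(I)) = Y(I)
ENDDO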
The elements of the array X are said to be gathered and the elements of Z are said to be
scattered in the above code. Let’s examine the gather in more detail and consider data prefetch
instructions. The operations are
load IA(I)
load X(IA(I))
Suppose the prefetch distance is the constant M. To prefetch for X(IA(I)) requires two
levels of prefetching. First, IA(I) must be prefetched. Next, a future value of IA(I) must be
loaded into a hardware register; this value is used to calculate the address of the future
X(IA(I)) value, and then the final prefetch for X(IA(I)) is performed. The naive way to
insert prefetches for the gather would be to include both prefetches directly in the loop.
To clarify this, suppose the existence of a compiler directive C$DIR PREFETCH EXPRES-
SION where EXPRESSION is the explicit element to be prefetched. The code could appear as
DO I = 1,M
C$DIR PREFETCH IA(I+2*M)
C$DIR PREFETCH X(IA(I+M))
Y(I) = X(IA(I))
ENDDO
Note that the second PREFETCH directive forces IA(I+M) to be loaded into an address
register. This is wrong! If IA(I+M) is not defined, then the loading from this location may cause
the code to abort.
To ensure nothing illegal happens requires making two loops. The first loop includes the
prefetches and the load to X(IA(I+M)) and the second cleanup loop omits them. The code
could look like
NEND = MAX(0,N-M)
DO I = 1,NEND
C$DIR PREFETCH IA(I+2*M)
C$DIR PREFETCH X(IA(I+M))
Y(I) = X(IA(I))
ENDDO
DO I = NEND+1,N
Y(I) = X(IA(I))
ENDDO
Note that if a processor supports prefetch instructions, and the compiler does not support a
prefetch directive, one might still be able to support this functionality in C using the asm macro.
For example, the C for loop
can be modified to insert gather prefetches using the HP C compiler and asm. On the HP
PA-RISC processors, prefetch instructions are implemented by loading an address to general
register 0. The above C code can be modified to insert prefetch instructions as follows:
#if defined(PREFETCH)
#include <machine/reg.h>
#define PREFETCH_MACRO(x) { register void *hpux_dp_target; \
hpux_dp_target = (void*)&(x); \
_asm("LDW", 0, 0, hpux_dp_target, R0); }
#else
#define PREFETCH_MACRO(x)
#endif
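A sketch of the testpref() function referred to below, assuming the same structure and prefetch distance m as the Fortran version above (the argument list and local variable names are assumptions):

void testpref(int n, int m, int *ia, double *x, double *y)
{
   int i, nend;
   nend = (n - m > 0) ? n - m : 0;
   /* main loop: prefetch ia[] two distances ahead and x[ia[]] one distance ahead */
   for (i = 0; i < nend; i++)
   {
      PREFETCH_MACRO(ia[i+2*m]);
      PREFETCH_MACRO(x[ia[i+m]]);
      y[i] = x[ia[i]];
   }
   /* cleanup loop: no prefetches, so ia[] is never referenced out of bounds */
   for (i = nend; i < n; i++)
      y[i] = x[ia[i]];
}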
The above testpref() function provides the same functionality as the preceding Fortran
code. All prefetch operations are in the first loop, while the second loop performs the cleanup
operations. If the routine is compiled with -DPREFETCH, then the object code for the first loop
contains prefetch instructions. Otherwise, these instructions are not generated.
5.5.2.3 Test Promotion in Loops
Consider a loop containing an if test whose outcome does not change from one iteration to the next:
DO I = 1,N
IF (A .GT. 0) THEN
X(I) = X(I) + 1
ELSE
X(I) = 0.0
ENDIF
ENDDO
Exchanging the if and do constructs results in much better code since the if test is eval-
uated only once instead of every time through the loop.
IF (A .GT. 0) THEN
DO I = 1,N
X(I) = X(I) + 1
ENDDO
ELSE
DO I = 1,N
X(I) = 0.0
ENDDO
ENDIF
Few compilers perform this optimization since it’s fairly complex. If your application uses
this construct, you probably should modify your source code instead of depending on the com-
piler to do it for you.
5.5.2.4 Loop Peeling
Loop peeling removes special-case iterations, such as the first and the last, from a loop. Consider

DO I = 1,N
IF (I .EQ. 1) THEN
X(I) = 0
ELSEIF (I .EQ. N) THEN
X(I) = N
ELSE
X(I) = X(I) + Y(I)
ENDIF
ENDDO
The above example can be rewritten to eliminate the if tests by peeling off the edge val-
ues as
X(1) = 0
DO I = 2,N-1
X(I) = X(I) + Y(I)
ENDDO
X(N) = N
5.5.2.5 Fusion
Fusion merges two adjacent loops that run over the same iteration range into a single loop. Suppose one loop stores its results into a temporary array temp, a second loop reads them back, and the only uses of the temp array appear in those two loops. Fusing the two loops eliminates all references to temp. This optimization is especially important when n is large and there are many cache misses.
This is another difficult optimization for compilers to perform. They must look across
multiple loops and check that there’s not too much register pressure before performing the
fusion. For example, if fusing loops causes the compiler to have to spill and restore data from
memory, the fusion may be detrimental to performance.
5.5.2.6 Fission
Fission is the opposite of fusion: a single loop is split into two. Suppose a loop references both x[i] and x[i+m], and suppose x[i] and x[i+m] map to the same cache location in a direct mapped cache, e.g., m is a large power of two. Since cache thrashing will occur between x[i] and x[i+m], the loop should be split so that the references to x[i] and the references to x[i+m] occur in separate loops.
What if the cache replacement scheme is associative? Will the above fission still help?
Maybe. Suppose the cache is two-way associative and a random cache line replacement scheme
is used. Suppose x[0] is loaded into one side of the cache. x[m] has a 50% chance of being
loaded into the other side of the cache. Even if it is, it probably won’t take very long before some
x[i+m] displaces x[i] in the cache due to random replacement. Of course, higher associativity
and more sophisticated cache replacement strategies help get around the pathological situations,
but there are many chances for poor performance when cache is involved.
5.5.2.7 Copying
The copying optimization can be thought of as performing loop fission using dynamically
allocated memory. Suppose that a vector addition
DO I = 1,N
Y(I) = Y(I) + X(I)
ENDDO
is to be performed and that X and Y are larger than the cache size. Further suppose the cache is
direct mapped and X(1), Y(1) map to the same location in cache. Performance is improved by
copying X to a “safe” location and then copying from this safe vector to Y. Let NC be the size of
the cache in bytes. To implement this approach dynamically allocate an array XTEMP whose size
is the same as the size of X and Y plus 2*NC. The starting location from XTEMP should not map to
a location in the cache near X and Y. In the example below, the addresses used for XTEMP are half the cache size away from those of X and Y.
The original code missed cache for each load of X and twice for each store of Y. The mod-
ified code misses on a cache line basis: once for X, twice for Y, and three times for XTEMP. This
removes pathologically bad cache misses.
What if the cache is associative? The discussion is similar to the one from the previous
section. Associative caches help performance, but there is still the strong possibility that too
many cache misses will occur.
5.5.2.8 Block and Copy
Block and copy combines blocking with the copying optimization above: the copy through XTEMP is performed in blocks, and the XTEMP locations being copied to and from can be chosen to be half a cache size away from X and Y. Let NC be the size of the cache in bytes.
The XTEMP array accesses half of the data cache at a time in the inner loop, which is a
slight improvement over having XTEMP be the same size as X and Y. If the size of the inner loop
is further reduced, the XTEMP array would not need to be written to memory as often and the
number of data misses would be reduced. The optimal size for XTEMP is processor dependent,
but choosing a value of around one tenth the cache size is sufficient to make the XTEMP accesses
minimal, as shown in Table 5-3.
                                      X                  Y                  XTEMP
large XTEMP                           1 per cache line   2 per cache line   3 per cache line
access XTEMP using half the cache     1 per cache line   2 per cache line   2 per cache line
5.5.2.9 Unrolling
Consider the loop

DO I = 1,N
Y(I) = X(I)
ENDDO
Without unrolling, for each iteration, the loop index must be incremented and checked to
see if the loop should terminate. If there are more iterations to perform, a branch is taken to the
top of the loop nest. In general, branches are expensive and interfere with pipelining, so they
should be avoided. Most compilers have a compiler optimization that takes the original code and
make two code sections from it: the new unrolled loop and any cleanup code necessary to finish
processing. For example, if the loop is unrolled by four, it appears as
NEND = 4*(N/4)
DO I = 1,NEND,4
Y(I) = X(I)
Y(I+1) = X(I+1)
Y(I+2) = X(I+2)
Y(I+3) = X(I+3)
ENDDO
DO I = NEND+1,N
Y(I) = X(I)
ENDDO
For large N, most of the time is spent in the unrolled loop, which has one branch for every
four iterations of the original loop. The compiler can also rearrange more instructions in the new
loop to reduce the effect of instruction latencies.
Suppose a processor has the characteristics shown in Table 5-4.
Table 5-5 shows the execution of the original loop (the branch will be ignored) and the
execution of the unrolled loop where the instructions have been reordered to reduce instruction
latencies.
[Table 5-5 Clock Cycles in an Unrolled Loop: for each instruction, the clock cycle number in the original order and the clock cycle number in the modified (reordered) order.]
The number of clock cycles has been reduced by over half. Note, however, there is a gap in
the processing pipeline. The first store still can’t start until clock seven. It would be better if it
could start execution at clock five.
Loading all the values of X before the values of Y reduces the possibility of cache thrash-
ing. Another benefit of unrolling is that the amount of unrolling can decrease the number of soft-
ware prefetch instructions. Ideally, the amount of unrolling should be such that only one
prefetch instruction is inserted per cache line. For example, the compiler on a processor with a
64-bytes cache line that is accessing eight-byte unit stride data should unroll a loop eight-way to
minimize the number of prefetch instructions. Some compilers have an option that allows users
to specify the unrolling depth. Thus users can also mix their own hand unrolled code with the
compiler’s default unrolling depth to increase overall unrolling. This can also further reduce the
possibility of cache line thrashing discussed above. However, excessive unrolling will cause data
to be spilled from registers to memory, so too much unrolling will hurt performance. It can also
have the undesirable effect of making the size of the object so large than more instruction cache
misses occur.
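For example, on a processor with a 64-byte cache line and eight-byte data, an eight-way unrolled version of the loop that zeroes X needs only one prefetch per pass. A sketch, reusing the hypothetical C$DIR PREFETCH directive introduced earlier, with an assumed prefetch distance of one cache line and the cleanup code omitted:

DO I = 1,N,8
C$DIR PREFETCH X(I+8)
   X(I) = 0
   X(I+1) = 0
   X(I+2) = 0
   X(I+3) = 0
   X(I+4) = 0
   X(I+5) = 0
   X(I+6) = 0
   X(I+7) = 0
ENDDO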
5.5.2.10 Software Pipelining

[Figure 5-2 Software pipelining contrasted with two-way unrolling: each iteration is split into parts a and b; part 1a forms a prologue, parts 1b+2a, 2b+3a, 3b+4a, and 4b+5a form the new loop iterations, and part 5b (plus any cleanup) forms an epilogue.]

Suppose each iteration of the loop is split into two parts, a and b. Part 1a, the first part of iteration one,
may be moved above the loop body to comprise a prologue to the loop. Parts 1b and 2a can then
be merged to form the first iteration of the new loop. The creation of new iterations continues
until parts 4b and 5a have been merged to create the new iteration four. Finally, an epilogue to the
loop body is created from part 5b. This approach is shown in Figure 5-2 and is contrasted with
two-way unrolling.
Note that unrolling and software pipelining are independent concepts, although in practice
they are usually combined. There are multiple reasons for this. First, there must be enough work
to hide all instruction latencies. We’ve discussed that unrolling by itself is frequently not suffi-
cient to hide all latencies. Very simple loops that use software pipelining on RISC processors
may not contain enough work to hide all instruction latencies either. On both superscalar RISC
and VLIW processors there should also be enough work for good instruction level parallelism.
This also leads to merging the two techniques. Finally, when prefetch instructions are used, there
needs to be only one prefetch instruction for every cache line. Unrolling helps minimize the
number of these instructions.
The example from the previous section showed how unrolling was used to reduce the
number of clock cycles from 28 to 10. Suppose the compiler unrolls the loop so that four itera-
tions are done at a time and the loop is reordered for software pipelining as follows:
NEND = 0
IF (N .GT. 2) THEN
NEND = 4*((N-2)/4)
load X(1)
load X(2)
J = -3
DO I = 1,NEND,4
J = I
load X(I+2)
store Y(I)
load X(I+3)
store Y(I+1)
load X(I+4)
store Y(I+2)
load X(I+5)
store Y(I+3)
ENDDO
store Y(J+4)
store Y(J+5)
NEND = J+5
ENDIF
DO I = NEND+1,N
load X(I)
store Y(I)
ENDDO
Instruction      Register    Clock cycle
Load X(I+2)      1           1
Store Y(I)       3           2
Load X(I+3)      2           3
Store Y(I+1)     4           4
Load X(I+4)      3           5
Store Y(I+2)     1           6
Load X(I+5)      4           7
Store Y(I+3)     2           8
While there must be a prologue of code to set up the registers and an epilogue to finish
execution, the software pipelined code takes only eight cycles and the instruction latency is com-
pletely hidden.
Software pipelining is an optimization that is impossible to duplicate with high level code
since the multiple assembly language instructions that a single line of high level language cre-
ates are moved around extensively. Due to the complexities involved, compilers implement soft-
ware pipelining only at high optimization levels. However, a judicious choice of unrolling can
help the compiler generate better software pipelined code. For example, if your compiler per-
forms software pipelining, you can hand unroll a loop with varying amounts of unrolling to see
which has the best performance. If you write efficient assembly code for loops, you will use
unrolling and software pipelining.
5.5.2.11 Loop Invariant Code Motion
Consider the loop

DO I = 1,N
X(I) = X(I) * Y
ENDDO
The naive way to interpret this is to load Y and X(I) inside the loop. This is clearly inefficient since Y is loop invariant. Therefore, the load of Y may be performed once before the loop. Similarly, in the loop
DO I = 1,N
S = S + X(I)
ENDDO
The load for S may be hoisted from the loop and the store to S may be sunk. Hoisting and
sinking S removes a load and a store from the inner loop. Compilers usually perform simple
hoist/sink operations.
A slight complication of the previous loop is
DO I = 1,N
Y(J) = Y(J) + X(I)
ENDDO
Y(J) should be hoisted and sunk from the loop. However, Y is a function of the index J and
some compilers have difficulty making this optimization. Changing the loop to
S = Y(J)
DO I = 1,N
S = S + X(I)
ENDDO
Y(J) = S
usually helps the compiler hoist and sink the reference. Some compilers return information
describing the hoist and sink optimizations performed.
5.5.2.12 Array Padding
Consider the code

REAL X(8,N)
DO J = 1,8
DO I = 1,N
X(J,I) = 0.0
CALL SUB1(X)
...
ENDDO
ENDDO
and suppose that the loops cannot be exchanged due to other work in the inner loop. Further sup-
pose that the memory system has eight banks and that the data is out-of-cache. The code might
be improved by increasing the leading dimension of X to nine, as discussed in Chapter 3. This
requires more memory, but may perform better.
5.5.2.13 Optimizing Reductions
A loop that repeatedly adds the elements of an array into one scalar (sumx) while multiplying them into another creates a sum reduction and a product reduction. The first is a sum reduction since the scalar sumx is repeatedly added to, while the second is a product reduction, since multiplication is used.
this constrains the rate that can be achieved to that of the floating-point add instruction latency. If
the latency is four clock cycles, then one iteration of the above loop cannot take less than four
cycles per iteration. If the sum reduction in the loop above is rewritten as
then the instruction latency is hidden. This executes much faster than the original, but may pro-
duce different results since the order of operations is different.
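A sketch of such a rewrite, using four partial sums (the names SUM1 through SUM4 and the four-way split are assumptions; cleanup code for N not divisible by four is omitted):

SUM1 = 0.0
SUM2 = 0.0
SUM3 = 0.0
SUM4 = 0.0
DO I = 1,N,4
   SUM1 = SUM1 + X(I)
   SUM2 = SUM2 + X(I+1)
   SUM3 = SUM3 + X(I+2)
   SUM4 = SUM4 + X(I+3)
ENDDO
SUMX = SUM1 + SUM2 + SUM3 + SUM4

With four independent accumulators, consecutive additions no longer depend on one another, so a four-cycle add latency can be hidden.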
Some applications are very concerned about numerical accuracy even when only one sum
reduction is used, so they perform sum reductions at a higher precision than the rest of the appli-
cation. They convert each loaded value to a higher precision floating-point value and perform the
addition at this new precision. At the conclusion of the loop, the reduction value is converted
back to the original precision. This leads to more accuracy at the expense of reduced perfor-
mance.
The next set of optimizations operates on nested loops. These can be especially tricky (and
time-consuming) for a compiler to get right. Most of these optimizations operate on matrices
and there are many chances for a compiler to make an optimization that slows down perfor-
mance. The flip side is that many of these optimizations can significantly speed up performance.
Due to the inconsistent nature of these optimizations, most compilers do not perform them by
default. Users must usually turn on special compiler options to obtain them. Due to the possibil-
ity of slowing down the code, users should time code more carefully than usual to ensure that the
optimizations help performance.

[Table: nested loop optimization techniques and the goals they address: reducing the number of integer and branch instructions and the effects of instruction latencies and memory bandwidth. The techniques listed are loop interchange, outer loop unrolling, unroll and jam, blocking, and block and copy.]

The examples shown are all written in Fortran, which stores multidimensional arrays in column major order. Corresponding C examples would reverse the
order of the indices. See Chapter 7 for more details on programming language issues.
5.5.3.1 Loop Interchange
Consider the loop nest

DO I = 1,N
DO J = 1,N
X(I,J) = 0.0
ENDDO
ENDDO
Interchanging the loops helps performance since it makes the array references unit stride.
For large N, this helps performance due to cache line reuse (fewer cache line misses), and vir-
tual memory page reuse (fewer TLB misses).
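The interchanged version simply swaps the two loops:

DO J = 1,N
   DO I = 1,N
      X(I,J) = 0.0
   ENDDO
ENDDO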
5.5.3.2 Outer Loop Unrolling
Consider the loop nest

DO J = 1,N
DO I = 1,N
A(I,J) = A(I,J) + X(I) * Y(J)
ENDDO
ENDDO
Suppose the compiler hoists the reference to Y(J) outside the inner loop. If the outer loop
is unrolled by two, the resulting code becomes
DO J = 1,N,2
DO I = 1,N
A(I,J) = A(I,J) + X(I) * Y(J)
A(I,J+1) = A(I,J+1) + X(I) * Y(J+1)
ENDDO
ENDDO
Note that this requires half as many loads of X, since each load of X is used twice.
5.5.3.3 Unroll and Jam
Unroll and jam unrolls one or more of the outer loops of a loop nest and then jams (fuses) the resulting inner loops back together. Consider the matrix-matrix multiplication

DO K = 1,N
DO J = 1,N
DO I = 1,N
C(I,K) = C(I,K) + A(I,J) * B(J,K)
ENDDO
ENDDO
ENDDO
Step 1: Unroll. Suppose the two outermost loops are unrolled by two. The loops then
appear as
DO K = 1,N,2
DO J = 1,N,2
DO I = 1,N
C(I,K) = C(I,K) + A(I,J) * B(J,K)
ENDDO
DO I = 1,N
C(I,K) = C(I,K) + A(I,J+1) * B(J+1,K)
ENDDO
ENDDO
DO J = 1,N,2
DO I = 1,N
C(I,K+1) = C(I,K+1) + A(I,J) * B(J,K+1)
ENDDO
DO I = 1,N
C(I,K+1) = C(I,K+1) + A(I,J+1) * B(J+1,K+1)
ENDDO
ENDDO
ENDDO
Step 2: Jam. All four DO I loops can be jammed together into a single DO I loop as fol-
lows.
DO K = 1,N,2
DO J = 1,N,2
DO I = 1,N
C(I,K) = C(I,K) + A(I,J)*B(J,K) + A(I,J+1)*B(J+1,K)
C(I,K+1) = C(I,K+1) + A(I,J)*B(J,K+1) + A(I,J+1)*B(J+1,K+1)
ENDDO
ENDDO
ENDDO
The benefit of unroll and jam is apparent by comparing the ratio of memory to float-
ing-point operations. Assume that the references to B are hoisted outside the innermost loop. By
unrolling the outer two loops by two, the floating-point operation to memory operation ratio is
increased from 2:3 to 4:3, thereby halving the number of memory operations. Some processors
require a ratio of at least 2:1 for peak performance, so unroll and jam is an extremely important
tool. Of course, you don’t have to unroll both outer loops by only two. By increasing the unroll
factors to three or four, even better memory reductions are obtained. The unroll factor on each
loop need not be the same either.
A limit to the amount of unrolling is the number of hardware registers available. For
example, in the above loop nest, the number of values of the B array to hoist into registers is the
product of the unrolling factors. Once the number of registers required exceeds the number of
hardware registers, some B values must be reloaded in the inner loop. This defeats the whole
point of the unroll and jam optimization. Thus, as the amount of unroll and jam is increased,
there will be a point of maximal performance, after which performance will decrease. Table 5-8
shows the number of operations for various unrolling factors.
Table 5-8 Unroll and Jam Results.

Outer by middle                            Floating-point
unrolling factors     Loads     Stores     operations        F:M ratio
1x1                   2         1          2                 0.67
2x2                   4         2          8                 1.33
3x3                   6         3          18                2.00
3x4                   7         3          24                2.40
4x4                   8         4          32                2.67
5.5.3.4 Blocking
Blocking is an important optimization for decreasing the number of cache misses in nested
loops. The most common blocking takes the inner loop, splits it into two loops, and then
exchanges the newly created non-innermost loop with one of the original outer loops. In the
code
REAL*8 A(N,N)
DO J = 1,N
DO I = 1,N
Y(I) = Y(I) + A(I,J)
ENDDO
ENDDO
suppose the length of the inner loop is very long and that the amount of data in the Y array is
many times the size of the data cache. Each time Y is referenced, the data must be reloaded from
memory. Rewrite the code as follows:
NBLOCK = 1000
DO IOUTER = 1,N,NBLOCK
DO J = 1,N
DO I = IOUTER,MIN(N,IOUTER+NBLOCK-1)
Y(I) = Y(I) + A(I,J)
ENDDO
ENDDO
ENDDO
[Figure 5-3 Blocking: the Y array and the A matrix are processed in strips of NBLOCK rows so that the block of Y remains in cache while the corresponding rows of each column of A stream through.]
Now there are many uses of the Y array in cache before a column of A displaces it. The
value of NBLOCK is a function of the cache size and page size and should be carefully chosen. If
NBLOCK is too small, the prefetches on the A array are ineffectual, since the prefetched data will
probably have been displaced from cache by the time the actual loads of A occur. If NBLOCK is
too large, the values of Y don’t get reused. For caches that are on the order of a megabyte, a
blocking factor of approximately one thousand is usually sufficient to get the benefits of
prefetching on A and reuse of Y.
SUBROUTINE MATMUL(N,LD,A,B,C)
REAL*8 A(LD,*), B(LD,*), C(LD,*)
DO L = 1,N
DO J = 1,N
DO I = 1,N
C(I,J) = C(I,J) + A(I,L) * B(L,J)
ENDDO
ENDDO
ENDDO
END
Suppose the goal is to operate on blocks of data whose collective size is less than the
cache size. Since the code uses eight-byte data and three arrays (A, B and C) are used, this block-
ing factor, NBLOCK, should be chosen such that 3 × 8 × NBLOCK × NBLOCK bytes is less than the cache size, and the code blocked as follows (the cleanup steps will be ignored):
DO I = 1,N,NBLOCK
DO J = 1,N,NBLOCK
DO L = 1,N,NBLOCK
CALL MATMUL(NBLOCK,LD,A(I,L),B(L,J),C(I,J))
ENDDO
ENDDO
ENDDO
Now the individual matrix-matrix multiplications are performed on a region that is less
than the size of the cache. There can still be a large number of cache misses in the individual
matrix-matrix multiplications since some of the pieces of A, B and C may map to the same loca-
tion in cache. It is sometimes beneficial to copy the individual pieces of data. This is treated in
more detail in Chapter 10. Suppose we define a routine, MATCOPY, to copy the individual subma-
trices. The blocked and copied code follows and this concept is illustrated in Figure 5-4:
REAL*8 TEMP(NBLOCK,NBLOCK,3)
...
DO I = 1,N,NBLOCK
DO J = 1,N,NBLOCK
CALL MATCOPY(NBLOCK,C(I,J),N,TEMP(1,1,3),NBLOCK)
DO L = 1,N,NBLOCK
CALL MATCOPY(NBLOCK,A(I,L),N,TEMP(1,1,1),NBLOCK)
CALL MATCOPY(NBLOCK,B(L,J),N,TEMP(1,1,2),NBLOCK)
CALL MATMUL(NBLOCK,NBLOCK,TEMP(1,1,1),
$ TEMP(1,1,2),TEMP(1,1,3))
ENDDO
CALL MATCOPY(NBLOCK,TEMP(1,1,3),NBLOCK,C(I,J),N)
ENDDO
ENDDO
...
SUBROUTINE MATCOPY(N,A,LDA,B,LDB)
IMPLICIT NONE
INTEGER*4 N, LDA, LDB
INTEGER*4 I, J
REAL*8 A(LDA,*), B(LDB,*)
DO J = 1,N
DO I = 1,N
B(I,J) = A(I,J)
ENDDO
ENDDO
END
[Figure 5-4 Block and copy: submatrices of A, B, and C are copied into the contiguous TEMP arrays before each call to MATMUL.]
5.6 Interprocedural Optimization

[Table: interprocedural optimization techniques. Both inlining and cloning can reduce the number of floating-point, integer, and branch instructions and reduce the effects of instruction latencies and memory bandwidth.]
5.6.1 Inlining
Many codes are written to be modular and may call routines that are only a few lines long,
as in
...
j = 1;
for (i = 0; i < n; i++)
j = inlineit(j);
...
int inlineit(int j)
{
j *= 2;
return(j);
}
The cost of making the function call is much higher than the cost of doing the work in the
routine. Most compilers allow the user to inline routines. The only drawback is that this can
cause the size of the object code to increase. Profilers (discussed in Chapter 6) can help to point
out situations where inlining would be beneficial. If a routine is called many times and each call
does very little work, then it is a good candidate for inlining. In the example above, the code can
be rewritten as
...
j = 1;
for (i = 0; i < n; i++)
j *= 2;
...
Another example is
S = DDOT( 10, X, 1, Y, 1 )
Inlining, constant propagation, and dead code elimination cause the pertinent code to be
condensed and the useless code removed so that the final code appears as
S = 0.0
DO I = 0,9
S = S + X(I) * Y(I)
ENDDO
5.6.2 Cloning
Cloning is closely related to inlining. Inlining takes a routine and incorporates it into a
subroutine that calls it. But what if the routine is too large to profitably inline? Cloning takes the
routine, makes a clone (copy) of it, performs interprocedural analysis on the cloned routine for
each call to it, and optimizes the logic in the cloned routine. Thus there may be many clones of a
routine, each generating slightly different object code.
Suppose a routine makes multiple calls to the MATMUL routine from an earlier section as
CALL MATMUL(2,LD,A,B,C)
CALL MATMUL(1000,LD,A,B,C)
CALL MATMUL(40,LD,A,B,C)
CALL MATMUL(N,LD,A,B,C)
The first call to MATMUL might inline the code completely. The second call could make a
clone of MATMUL that contains only the code for the large case of N=1000. The third call could
make a completely different clone of MATMUL, which is optimized for the N=40 case. Finally, the
last call could call the original MATMUL routine.
5.7 Change of Algorithm
Most of the time when an algorithm is changed, the order stays the same but the number of
operations decreases. An example of this is the Fast Fourier Transform (FFT) discussed in
Chapter 12, where a radix-4 algorithm which has 4.25n log2 n floating-point operations is shown
to be superior to a radix-2 algorithm which uses 5n log2 n operations. For this algorithm change,
the number of operations drops by 15%.
For example, when computing the sum

1 + 2 + 3 + 4 + ... + n

it becomes apparent that the solution is (n/2)(n + 1), which is O(1) since it contains only three operations. Thus an O(n) algorithm has been replaced by an O(1) algorithm.
Two famous algorithms that we'll discuss in detail are FFTs (Chapter 12) and Strassen's matrix-matrix multiplication (Chapter 10). FFT algorithms take the O(n^2) 1-d Discrete Fourier Transform (DFT) algorithm and implement it as an O(n log n) algorithm. Strassen's multiplication takes the O(n^3) standard matrix-matrix multiplication and implements it as an algorithm whose order is n^(log2 7), or approximately an O(n^2.8) algorithm.
If you can lower the order of computations, the performance gains can be nearly unbelievable. In the case of an FFT, the DFT algorithm takes 8n^2 operations while the FFT algorithm takes 5n log2 n operations. For a problem of size 1024, the number of operations is reduced from about eight million to 50 thousand operations, for a reduction factor of about 160x!
5.8 Summary
This chapter covered a lot of very important material. You must use a compiler to generate
your application and if you care about performance (which you do or you wouldn’t be reading
this), you need to be able to use the compiler proficiently. More importantly, we’ve discussed
lots of source code modifications that allow you to create applications whose performance far
exceeds that of unoptimized code.
References:
1. Parallel Programming Guide for HP-UX Systems, K-Class and V-Class Servers, 2nd ed.
Hewlett-Packard, document number B3909-90003, March 2000.
2. Dowd, K.; Severance, C.R. High Performance Computing, 2nd ed. Sebastopol, CA:
O’Reilly & Associates, Inc., 1998, ISBN 1-56592-312-X.
3. Origin2000 and Onyx2 Performance Tuning and Optimization Guide. Silicon Graphics
Incorporated, document number 007-3430-002, 1998.
4. Daydé, M. J.; Duff, I. S. The RISC BLAS: A Blocked Implementation of Level 3 BLAS for
RISC Processors. ACM Transactions on Mathematical Software, Vol. 25, 316-340, 1999.
C H A P T E R 6
Predicting and Measuring Performance

6.1 Introduction
Timers and profilers are tools used to measure performance and determine if optimizations
help or hinder performance. They allow users to determine bottlenecks in applications to direct
where optimization efforts should be expended. Once the bottlenecks are determined, they can
be attacked with the optimization techniques discussed in this book. This chapter also discusses
how to predict the performance of small kernels to determine if the measured results are accept-
able.
A timer is a function, subroutine, or program that can be used to return the amount of time
spent in a section of code. As in the carpenter’s motto, it’s important to make multiple measure-
ments to ensure that results are consistent. A profiler is a tool that automatically inserts timer
calls into applications. By using a profiler on an application, information is generated that sum-
marizes timings about subroutines, functions, or even loops that were used.
Timers and profilers operate at various levels of granularity. A stop watch is a good exam-
ple of a coarse timer. It can be started when a program begins and stopped when it finishes. If the
program takes several minutes to run, this timer can be used to determine whether various com-
piler options improve performance. Some timing programs are like the stop watch in that they
time only the entire program. Of course, if the program takes a tiny fraction of a second, or if a
user is trying to tune part of the program that takes a small percentage of the total run time, this
type of timer isn’t too useful. Therefore, users can insert timers around the code they’re inter-
ested in, or instruct a profiler to return this information. However, if timers are called too often,
the amount of time spent executing the timers can substantially increase the run time of an appli-
cation.
6.2 Timers
Before evaluating the effect of any optimization, a computer user must have a baseline
measurement against which to compare all subsequent results. Computer systems have several
ways to measure performance, but at the level of measuring only a few lines of code, a user
needs a timer to measure performance.
Timers are usually called in one of two ways:
t0 = timer();
...
< code segment being timed >
...
t1 = timer();
time = t1 - t0;
or
zero = 0.0;
t0 = timer(&zero);
...
< code segment being timed >
...
t1 = timer(&t0);
In the first case, the timer can be something as basic as returning the time of day and the
result of the two calls must be subtracted to return the total time. The second timer passes an
argument which is subtracted inside the timer function. We will build timers that are accessed
assuming the calling sequence shown in the second case.
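A sketch of a timer with the second calling sequence, built here on the UNIX gettimeofday() routine (an assumption; any sufficiently accurate clock routine could be substituted):

#include <sys/time.h>

/* returns the current wall-clock time in seconds minus the value passed in;
   calling it again with the first result as the argument yields the elapsed time */
double timer(double *t0)
{
   struct timeval tv;
   gettimeofday(&tv, (void *) 0);
   return ((double) tv.tv_sec + 1.0e-6 * (double) tv.tv_usec) - *t0;
}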
Whenever we use a timer, we’d like it to have the following properties:
• be highly accurate
• have a low access overhead
• reset (roll over) infrequently
Many timers measure multiples of processor clock ticks and so they are integer quantities.
Ideally, a timer should be accurate to the clock cycle, since this is the finest granularity possible
on a computer. This is not possible to obtain on most systems. Most timers are accurate only to
milliseconds, although users should use, if available, a timer that is accurate to at least microsec-
onds.
Ideally, a timer should take no time to call; otherwise, it could distort the timing measure-
ments. This is not possible since accessing a timer requires a call to a system routine to access a
clock. The amount of time to access different system clocks varies. In Section 6.2.5, there is a
routine that measures the accuracy of timers. If this code shows a highly accurate timer, then, by
definition, this timer must have a low overhead since its accuracy could not be measured other-
wise.
Many timers are like a car’s odometer. At some point they will reset to zero, or roll over.
This can also render some timers useless.
• user time
• system time
• CPU time (sum of user time and system time)
• elapsed time or wall clock time
The operating system is involved any time a program is run. It must arbitrate system
resources such as I/O and swap processes to ensure that everyone gets access to the processors
on a system. So when a job is executed, some amount of time is spent performing the work of a
program (the user time) and some amount is spent in the operating system supporting the execu-
tion of the job (the system time). Most runs will have the bulk of their time spent in user time and
very little in system time. CPU time is defined to be the sum of the user time and the system
time. Many timers keep track of the user time and system time separately, while other timers just
report the CPU time. Some timers return separate amounts for the parent process and its child
processes. So the most general of these timers return four quantities: the user and system time for the parent process, and the user and system time for its child processes.
A very useful timer is the wall-clock time or the wall time. This time is frequently called
elapsed time, although some documents confusingly call CPU time the elapsed time. Wall time
is like using a stop watch. The stop watch is started when execution starts and stopped when exe-
cution ends. What could be simpler?
Most work done by a computer shows up as user time or system time. However, when a
system is waiting for a remote device access, the time waiting is not counted against the user
time or the system time. It will, of course, be a component of the wall time. On a system execut-
ing a large number of programs, the wall time for a particular program may be large, but the
CPU time is small since each job competes against all other jobs for computing resources.
Parallel processing jobs represent another challenge to timers. The total time to execute is
the most important time, since there are multiple processors, each accruing CPU time. So for
parallel processing, you may not care what the CPU time is. However, you’ll want to know how
well your job is using multiple processors. The parallel efficiency of a job is found by measuring
the wall time using one processor, measuring the wall time with n processors, and calculating
their ratio. If this number is close to n, then you have a high degree of parallelism. Note that you
must run these jobs stand-alone to ensure that the results are meaningful.
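For example (using made-up numbers purely for illustration), if a job takes 100 seconds of wall
time on one processor and 27 seconds of wall time on four processors, the ratio is 100/27, or about
3.7. Since 3.7 is close to 4, the job is making good use of the four processors.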
The following examples illustrate the differences between the timing quantities:
Example 1: Single processor system. The only job running on the system is your job.
It executes the following code which contains an infinite loop
PROGRAM MAIN
N = 0
I = 0
DO WHILE (N .EQ. 0)
I = I + 1
IF (I .EQ. 1000) I = 0
ENDDO
END
Suppose you terminate the job after it runs several minutes. This job has minimal operat-
ing system requirements, so the amount of system time is small. The wall time result should be
nearly the same as the CPU time result since you are not contending against other users.
Example 2: Same system and code as shown in Example 1. You and four of your col-
leagues start running your codes at the same time. You terminate your copy of the job after sev-
eral minutes. The system time in this example is higher than in Example 1 since the operating
system must distribute processor time between the five users. The system time is still small com-
pared to the CPU and wall time. The CPU time is probably about 1/5 the wall time since the pro-
cessor gets shared among the users.
Example 3: You have a two processor system all to yourself. You’ve written a pro-
gram that consumes both processors all the time and requires few system calls. The CPU time
should be about double the wall time because the program does not require much work from the
O/S to distribute work among processors.
Example 4: The following routine performs a large number of system calls, creating and
then immediately removing temporary files:
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#define FILENAME "/tmp/$$temp.test"
#define MIN(x,y) ((x)>(y)?(y):(x))
foo(n)
int n;
{
   int i, j, k;
   static int fildes[20];
   static char buffer[20][64];
   static int call_flag= 0;
   k = 0;
   for( i = 0; i < n; i += 20 )
   {
      /* build up to 20 file names, open them, then remove and close them */
      for( j = 0; j < MIN(n-i,20); j++ )
         sprintf(buffer[j],"%s.%d",FILENAME,i+j);
      for( j = 0; j < MIN(n-i,20); j++ )
         fildes[j] = open( buffer[j], O_CREAT|O_RDWR, 0777 );
      for( j = 0; j < MIN(n-i,20); j++ )
      {
         unlink( buffer[j] );
         close( fildes[j] );
         k++;
      }
   }
   return(0);
}
This routine performs lots of system calls and therefore spends nearly all of its time in sys-
tem time and very little in user time. A timer that measures system time separately from user
time is valuable here since CPU time won’t distinguish between the two.
At this point it should be obvious that the authors have a strong bias in favor of bench-
marking stand-alone systems and using wall time.
6.2.3.1 timex
Executing timex a.out causes the elapsed time (called real below), user and system
(sys) time to be written as follows for a sample executable:
real 0.14
user 0.12
sys 0.02
6.2.3.2 time
Executing time a.out is more useful than using timex. It returns the user time, the sys-
tem time, the wall time, and the percentage of a processor used. This command may also return
information on the memory space usage.
In the calling sequence t1 = timer(&t0) described earlier, t1 contains the amount of time in
seconds spent doing the work. The timers below return
CPU time or wall time. Timers that return CPU time can be easily modified to return user time
or system time.
6.2.4.2 clock
clock() is supported by the C language and returns the amount of CPU time measured in
multiples of some fraction of a second. Check the include file <time.h> to see the usage on
your system. The time includes the times of the child processes that have terminated. To get the
number of seconds requires dividing by the system-defined constant CLOCKS_PER_SEC. If
clock() uses 32-bit integers and CLOCKS_PER_SEC is 10^6 or larger, this timer is not
recommended since the roll over time is 1.2 hours or less.
/* Name: cputime.c
   Description: Return CPU time = user time + system time */
# include <time.h>
double cputime(t0)
double *t0;
{
   double time;
   static double recip;
   static long clock_ret;
   static long base_sec = 0;
   clock_ret = clock();
   if ( base_sec == 0 )
   {
      recip = 1.0 / (double) CLOCKS_PER_SEC;   /* seconds per clock tick */
      base_sec = clock_ret;
   }
   time = (double) (clock_ret - base_sec) * recip - *t0;
   return(time);
}
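As a usage sketch, consistent with the calling sequence shown at the beginning of this section,
a code segment would be timed as follows:
double zero, t0, t1;

zero = 0.0;
t0 = cputime(&zero);
/* ... code segment being timed ... */
t1 = cputime(&t0);   /* t1 = CPU seconds consumed by the code segment */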
6.2.4.3 times
times() returns separate values for user time and system time. It also allows users to
check the user and system time for child processes. It requires dividing by the system defined
constant CLK_TCK to obtain seconds. Since the user and system time are returned separately, it
allows the user to gain more insight into the program than clock(). If times() uses 32-bit
integers and CLK_TCK is 10^6 or larger, this timer is not recommended since the roll over time is
1.2 hours or less.
/* Name: cputime.c
Description: Return CPU time = user time + system time */
# include <time.h>
# include <sys/times.h>
double cputime(t0)
double *t0;
{
double time;
static double recip;
struct tms buffer;
static long base_sec = 0;
(void) times(&buffer);
if ( base_sec == 0 )
{
recip = 1.0 / (double) CLK_TCK;
base_sec = buffer.tms_utime + buffer.tms_stime;
}
time = ((double)(buffer.tms_utime + buffer.tms_stime -
base_sec)) * recip - *t0;
return(time);
}
6.2.4.4 getrusage
getrusage() returns the number of seconds and microseconds for user time and the
number of seconds and microseconds for system time. It also returns parent and child process
times in separate structures. The microsecond values must be multiplied by 10^-6 and summed
with the seconds values to obtain the time. Since the user and system time are returned sepa-
rately, it allows the user to gain more insight into the program than clock(). This can be a very
useful timer.
/* Name: cputime.c
Description: Return CPU time = user time + system time */
# include <sys/resource.h>
double cputime(t0)
double *t0;
{
double time, mic, mega;
int who;
static long base_sec = 0;
static long base_usec = 0;
struct rusage buffer;
who = RUSAGE_SELF;
   mega = 1.0e-6;
   /* the remainder of this routine is reconstructed: gather resource usage
      and return user + system time in seconds */
   (void) getrusage(who, &buffer);
   if ( base_sec == 0 )
   {
      base_sec  = buffer.ru_utime.tv_sec  + buffer.ru_stime.tv_sec;
      base_usec = buffer.ru_utime.tv_usec + buffer.ru_stime.tv_usec;
   }
   time = (double)(buffer.ru_utime.tv_sec  + buffer.ru_stime.tv_sec  - base_sec);
   mic  = (double)(buffer.ru_utime.tv_usec + buffer.ru_stime.tv_usec - base_usec);
   time = time + mic * mega - *t0;
   return(time);
}
Obtaining separate child-process times can be particularly expensive in threaded programs,
because the operating system can spend a lot of time determining which threads have child pro-
cesses and which ones don’t. Moreover, the data structures providing such information are often
accessible by only a single thread at a time. The end result may be a timer that takes longer to
execute than the rest of the program!
6.2.4.7 gettimeofday
gettimeofday() is a most useful timer. It allows a wall clock timer to be built around it.
gettimeofday() returns four different values, two of which are important for timing. These
two items are the number of seconds since Jan. 1, 1970, and the number of microseconds since
the beginning of the last second. gettimeofday() can be used to build the general purpose
walltime routine shown below.
walltime using gettimeofday
/* Name: walltime.c
Description: Return walltime */
# include <sys/time.h>
double walltime(t0)
double *t0;
{
double mic, time;
double mega = 0.000001;
struct timeval tp;
struct timezone tzp;
static long base_sec = 0;
static long base_usec = 0;
(void) gettimeofday(&tp,&tzp);
if (base_sec == 0) {
   base_sec = tp.tv_sec;
   base_usec = tp.tv_usec;
}
/* remainder reconstructed: combine seconds and microseconds, subtract *t0 */
time = (double) (tp.tv_sec - base_sec);
mic  = (double) (tp.tv_usec - base_usec);
time = time + mic * mega - *t0;
return(time);
}
6.2.4.8 SYSTEM_CLOCK
SYSTEM_CLOCK is supported by Fortran 90 and returns three values: the current value of a
system clock, the number of clocks per second, and the maximum system clock value. It allows
a wall clock timer to be built around it as shown below.
walltime using SYSTEM_CLOCK
#include <stdio.h>
main()
{
t1 = 0.0;
j = 0;
foo(n)
int n;
{
int i, j;
i = 0;
for (j = 0; j < n; j++)
i++;
return (i);
}
If only one iteration is needed, the program prints an upper bound for the resolution of the
timer. Otherwise, it prints the timer resolution. Be sure to run the routine a few times to ensure
that results are consistent. On one RISC system, the following output was produced:
Using clock():
Using times():
Using getrusage():
So the tested timer with the highest resolution for CPU time on this system is
getrusage().
ZERO = 0.0D0
T2 = 100000.
DO J= 1,5
T0 = TIMER(ZERO)
CALL CODE_TO_TIME
T1 = TIMER(T0)
T2 = MIN(T2,T1)
ENDDO
T2 = T2 / N
PRINT *,'THE MINIMUM TIME IS',T2
The problem with spin loops is that sometimes sophisticated compilers can optimize them
out of the code, thereby making the measurement useless. If the code to be timed is a subroutine
call and interprocedural optimization is not enabled, then the code to be timed will not be
removed by the compiler.
6.3 Profilers
Manually inserting calls to a timer is practical only if you’re working with a small piece of
code and know that the code is important for the performance of your application. Often you’re
given a large application to optimize without any idea where most of the time is spent. So you’d
like a tool that automatically inserts timing calls, and you’d like to be able to use it on large
applications to determine the critical areas to optimize. This is what profilers do. They insert
calls into applications to generate timings about subroutines, functions, or even loops. The ideal
profiler collects information without altering the performance of the code, but this is relatively
rare. Different profilers introduce varying degrees of intrusion into the timing of a program’s
execution.
There are many different types of profilers used on RISC systems. Some profilers take the
standard executable and use the operating system to extract timing information as the job is exe-
cuted. Others require relinking the source code with special profiled libraries. Some require all
source code to be recompiled and relinked to extract profile information. Profilers that require
recompiling can be frustrating since sometimes you’re given only object routines for the applica-
tion you wish to profile. (Begging for source is an important skill to cultivate!) The granularity
of profilers also varies a lot. Some profilers can perform only routine-level profiling, while oth-
ers can profile down to the loop level.
When using a profiler, optimizing code becomes an iterative process: profile the application,
optimize the routines that dominate the profile, and then profile again to measure the effect.
Ideally, a few routines will dominate the profile with most of the time spent in a few key
loops. The most difficult programs to optimize are ones that have lots of routines that each take a
very small percentage of the time. Applications like this are said to have a flat profile, since a
histogram showing the time spent in these routines is flat. These applications are difficult to opti-
mize since many routines have to be examined to improve performance significantly. Using pro-
filers in conjunction with independent timers is a powerful technique. The profiler can narrow
the field of routines to optimize. Timers allow these routines to be finely tuned.
The goals for profiler timers are the same as for independent timers. For example, the tim-
ing routines used by profilers must be highly accurate. The user is at the mercy of the creators
of the profiler for this, so be aware of this dependency.
Profiling works especially well on vector computers, and there are a couple of reasons why it
works so well. First, hardware designers realized the importance of profilers and
designed special profiling registers into the processors. Second, vector computers can produce
very meaningful profile information with relatively small amounts of data.
Suppose a vector computer processes a loop that consists of adding two vectors of length
n. So the vectors need to be loaded, added together, and their result stored. The profiler needs to
keep very little information for each vector operation: only the type of operation and its vector
length. The profiler also needs to know the number of clock ticks to perform the calculations. If
n is very large, the time to collect this information is small compared to the number of clock
cycles it takes to actually do the calculations. The processing of the vector units can also occur
simultaneously with the updating of the profile registers. Contrast this to a RISC processor
where any collection of data for profiling has to compete with the normal processing and may
interfere with pipelining. So profilers on vector computers have a natural advantage over other
types of processors.
Most computer vendors have their own unique profilers, but there are some common ones.
The Message Passing Interface (MPI), which will be discussed in Chapter 8, is a widely used
standard for parallel programming using message passing. XMPI is an event-based, trace-based
profiler for programs that use MPI. It is invoked by linking with a special MPI library that has
been created for profiling.
The most common UNIX profilers for single processor execution are prof and gprof. At
least one of these is included on most UNIX computers. These are the most common type of
profiler and require code to be relinked and executed to create an output file with timing results
in it. An additional routine is then called to collect these results into meaningful output. prof is
a reductionist profiler that may be event-based or sampling-based. gprof is an event-based pro-
filer and returns more information than prof.
6.3.3 gprof
To use gprof requires creating an executable that contains timing instrumentation. This
can be done by relinking, but may require recompiling source. Check your Fortran and C com-
piler documentation to find the exact compiler options.
The LINPACK 100x100 benchmark is a good code to study for profiling. It is available
from http://www.netlib.org/. This code solves a system of equations of size 100x100.
The source code was compiled to use gprof (the -pg option below) on a CISC processor using
the GNU version of gprof by the following:
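For illustration only (the file names and compiler driver here are placeholders rather than the
exact commands used for the run being described), a GNU toolchain sequence looks like:

g77 -O -pg linpackd.f -o linpack
./linpack
gprof linpack gmon.out > linpack.profile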
gprof produces a profile that shows the amount of time spent in each routine. For this
system, gprof determined that the subroutine DAXPY was the most time-consuming routine.
(This system appends an underscore to the routine names which will be ignored in this section.)
Columns to note are self seconds and the number of calls. The self seconds column
shows the amount of time in a routine minus the amount of time in child routines, while the
number of calls is important for inlining. If the number of calls is large and the amount of time
spent in each call (us/call) is small, a routine should be inlined. (Note that us/call is the
number of microseconds per call while ms/call is the number of milliseconds per call.) An
extract of the gprof output is
gprof also produces a call graph that shows which routines call other routines. The part
of the call graph relating to DAXPY is
-----------------------------------------------
Each entry in this table consists of several lines. The line with the
index number at the left hand margin lists the current function.
The lines above it list the functions that called this function,
and the lines below it list the functions this one called.
This line lists:
index A unique number given to each element of the table.
Index numbers are sorted numerically.
The index number is printed next to every function name so
it is easier to look up where the function is in the table.
% time This is the percentage of the 'total' time that was spent
in this function and its children. Note that due to
different viewpoints, functions excluded by options, etc,
these numbers will NOT add up to 100%.
By studying the output, it is apparent that DAXPY is called by both DGEFA and DGESL.
DAXPY was called 133874 times, with most of the calls (128700) from DGEFA. One question to
ask is whether DAXPY is running as fast as it should. Also note the large number of calls to
DAXPY and the small amount of time spent in each call. This makes DAXPY a good candidate for
inlining.
6.3.4 CXperf
The advantage of gprof is that it is available on many different hardware platforms. It
doesn’t provide all of the profile information we’d like, though. The ideal profiler would provide
many additional features, including
The Hewlett-Packard profiler, CXperf, is an event-based profiler with both trace and
reductionist components. It provides most of the above information (everything except the
Mflop/s values). Advanced profilers like CXperf are dependent on hardware to provide support
that allow them to obtain some of the information. Some HP hardware provides more profiling
information than others, so CXperf may provide different types of information depending on the
type of hardware used. CXperf is available for the HP C (cc), C++ (aCC), and Fortran 90 (f90)
compilers. It also supports profiling routines that use MPI and compiler based parallelism.
Users move through four steps using CXperf:
• compilation
• instrumentation
• execution
• analysis
6.3.4.1 Compilation
This consists of compiling source with +pa or +pal to instrument the object code for
CXperf. +pa is used for routine level analysis (like gprof), while +pal instruments for the finer
loop level analysis. We’ll use +pal in the following examples. Since you may not have access to
all the source, CXperf contains a utility, cxoi, that allows users to take an HP-UX object file
and modify it to include routine level instrumentation for use by CXperf.
Compiling and linking the previous LINPACK code with +O2 +pal produces an efficient
executable that is ready for analysis by CXperf.
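For instance (the source file name is a placeholder), the compile line would be along the lines of:

f90 +O2 +pal linpackd.f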
6.3.4.2 Instrumentation
Typing cxperf a.out starts the profiler and produces the GUI shown in Figure 6-1.
Users can select the routines for analysis and whether to profile at the loop level. Various metrics
such as wall time and data cache misses can also be selected. One approach is to make multiple
profiles. The first produces a routine-level profile so the user can select the few most important
routines for more analysis. Then a second profile may be created using loop-level profiling on
these routines. Selecting “Next” on the GUI places the user on an execution page where the rou-
tine may be initiated by selecting a start button.
6.3.4.3 Analysis
After the routine has been executed, the GUI shown in Figure 6-2 appears. It contains a
histogram of the timing results showing the amount of time spent in each routine excluding chil-
dren. On this processor, the routines DGEFA and DAXPY take most of the time. Users can reprofile
the executable to see the individual loops in the routines that take the most time. There are many
other views of the data, including a three-dimensional view for threaded, parallel results. CXperf
also has the ability to combine the output from separate processes generated by MPI applica-
tions into a single file. This profiler can display the call graph as shown in Figure 6-3. Note the
thickness of the arcs in the graphical call graph. These show the importance of the various pro-
cessing paths. So, for example, while DAXPY is called by both DGEFA and DGESL, the call from
DGEFA is much more important as shown by the thickness of the arc connecting DGEFA and
DAXPY. A summary report for the whole application or for individual regions can also be created.
6.3.5 Quantify
Another profiler is the Quantify product produced by Rational software
(http://www.rational.com/). The main advantage of this product is that it is available for
both HP and Sun computers. The current versions are Quantify 4.5.1 for Sun and Quantify 5.0.1
for HP. It is available for the C and C++ programming languages, but not Fortran. It is similar to
CXperf, but, as would be expected, it is inferior in some areas and superior in others. Like
CXperf, it is an event-based profiler with a GUI. After it is installed on a system, Quantify can
be accessed by replacing “cc” with “quantify cc”. Like CXperf, Quantify needs to use
objects that have been modified for profiling in order to generate the maximum amount of infor-
mation. If you link only with Quantify, as in
then Quantify creates new objects that are instrumented for profiling. When a.out is executed,
it displays the normal output, some additional Quantify information, and opens a GUI for profil-
ing.
followed by the normal output from the executable. Then an epilogue gives more information.
It should be noted that Quantify thinks the clock rate of this computer is 200 MHz, although it
is actually 440 MHz. When the code is executed, the GUI shown in Figure 6-4 appears:
Users can then request analyses similar to those of CXperf. The Function List shows the
amount of time spent in the most time-consuming routines.
One nice feature of Quantify is that it automatically includes results for system library
routines such as fabs(). Users can also select more information on individual functions.
Figure 6-5 shows the Function List and the Function Details for the function daxpy. Although
Quantify can report results in seconds and millions of instructions per second, it doesn’t collect
separate data on cache or TLB misses.
Figure 6-5 Quantify function list and function details for daxpy.
Quantify also contains a well-designed call graph that allows users to see the relationship
between functions, as shown in Figure 6-6.
So which profiler is best? It depends on what you want from a profiler. Table 6-1 contrasts
CXperf and Quantify.
CXperf                              Quantify                             Advantage
Many metrics, including cache       A few basic metrics                  CXperf collects more
and TLB results                                                          information
Can combine performance data        Combines multithreaded results       CXperf has better support
for each process in an MPI          in a single file unless the user     for parallelism
application                         requests multiple files
SUBROUTINE DAXPY(N,DA,DX,INCX,DY,INCY)
C
C CONSTANT TIMES A VECTOR PLUS A VECTOR.
C JACK DONGARRA, LINPACK, 3/11/78.
C
DOUBLE PRECISION DX(1),DY(1),DA
INTEGER I,INCX,INCY,IX,IY,M,MP1,N
C
IF(N.LE.0)RETURN
IF (DA .EQ. 0.0D0) RETURN
IF(INCX.EQ.1.AND.INCY.EQ.1)GO TO 20
C
C CODE FOR UNEQUAL INCREMENTS OR EQUAL INCREMENTS
C NOT EQUAL TO 1
C
IX = 1
IY = 1
IF(INCX.LT.0)IX = (-N+1)*INCX + 1
IF(INCY.LT.0)IY = (-N+1)*INCY + 1
DO 10 I = 1,N
DY(IY) = DY(IY) + DA*DX(IX)
IX = IX + INCX
IY = IY + INCY
10 CONTINUE
RETURN
C
C CODE FOR BOTH INCREMENTS EQUAL TO 1
C
20 CONTINUE
DO 30 I = 1,N
DY(I) = DY(I) + DA*DX(I)
30 CONTINUE
RETURN
END
It’s helpful to determine where the time is spent in a subroutine. The call graph from
gprof showed that the only routines that call DAXPY are DGEFA and DGESL, with most of the
time spent in the call from DGEFA. The calling sequence for DAXPY in DGEFA is
CALL DAXPY(N-K,T,A(K+1,K),1,A(K+1,J),1)
so INCX=INCY=1, and therefore the DO 30 loop is the loop to analyze. By analyzing the source
code or using a loop level profiler, it can be determined that the size of N is between 1 and 99.
Given the high percentage of time spent in DAXPY, it is worth analyzing in detail. A good tech-
nique is to create a timing driver just for DAXPY.
PROGRAM DAXPYT
C TIMER FOR DAXPY ROUTINE
IMPLICIT NONE
INTEGER*4 NSTART, NEND, NINCR, N, NSPIN, I, J, INCX, INCY
REAL*8 T0, T1, TIME, WALLTIME, RATE, AMOPS, AMB, SIZE, ZERO
PARAMETER (AMOPS = 1000000., AMB = 1048576.)
C NSTART, NEND, NINCR ARE THE BEGINNING, END, AND INCREMENT
C FOR THE SIZES OF DAXPY TO TIME
PARAMETER (NSTART=10000, NEND=800000, NINCR=10000)
C NSPIN IS THE SPIN LOOP VALUE
PARAMETER (NSPIN=100)
C MAKE X TWICE THE SIZE OF NEND TO ENCOMPASS BOTH VECTORS IN
C DAXPY CALL AND ENSURE THAT X AND Y ARE CONTIGUOUS
REAL*8 A,X(2*NEND)
C
C INCX=INCY=1 TESTS
INCX = 1
INCY = INCX
WRITE (UNIT=6,FMT='(6X,A)') ' DAXPY RESULTS'
WRITE (UNIT=6,FMT='(6X,A,4X,A,5X,A,6X,A)')
*'N','SIZE','TIME','RATE'
WRITE (UNIT=6,FMT='(11X,A,4X,A,3X,A)')
*'(MB)','(SEC)','(MFLOP/S)'
DO N = NSTART, NEND, NINCR
C INITIALIZE THE DATA
A = 1.0D0
DO I = 1,2*N
X(I) = 1.0D0
ENDDO
C SET TIME TO A LARGE NUMBER OF SECONDS
TIME = 1000000.
DO J = 1,5
ZERO = 0.0D0
T0 = WALLTIME(ZERO)
C SPIN LOOP TO OVERCOME TIMER RESOLUTION
DO I = 1,NSPIN
CALL DAXPY(N, A, X, INCX, X(N+1), INCY)
ENDDO
T1 = WALLTIME(T0)
TIME = MIN(TIME,T1)
ENDDO
TIME = TIME / NSPIN
RATE = (2 * N / TIME) / AMOPS
SIZE = (2 * 8 * N) / AMB
WRITE (UNIT=6,FMT='(I8,F8.2,F10.7,F10.4)') N,SIZE,TIME,RATE
ENDDO
END
The code was run on different processors. By looking at the output, the size of the proces-
sor’s cache is usually apparent. This is because once the problem size exceeds two times the
cache size, the performance should be steady state, since it is likely that two points map to every
location in the cache. This is easiest to observe in a direct mapped cache. Associative caches are
more complex. If the cache is n-way associative, then degradations may start occurring when the
problem size exceeds the size of one of the n-way portions. This is usually more pronounced in
caches with a random or round-robin replacement scheme than ones that employ strategies such
as least recently used (LRU), as discussed in Chapter 3.
DAXPY RESULTS
N SIZE TIME RATE
(MB) (SEC) (MFLOP/S)
10000 0.15 0.0000456 438.8852
20000 0.31 0.0000916 436.7766
30000 0.46 0.0001370 437.8284
40000 0.61 0.0001830 437.0629
50000 0.76 0.0002286 437.3879
60000 0.92 0.0002750 436.3636
70000 1.07 0.0004187 334.3763
80000 1.22 0.0006969 229.5816
90000 1.37 0.0010175 176.9111
100000 1.53 0.0013442 148.7885
110000 1.68 0.0016554 132.8944
This system has consistent performance before one MB and after two MB. The processor
contains a one MB, four-way associative cache with a round-robin cache line replacement
scheme, so we’d expect performance to be good when the problem size is less than one MB.
After two MB, each cache line has to be loaded from memory and performance is low. Between
these sizes, performance is complicated and depends on how the cache lines are replaced.
DAXPY RESULTS
N SIZE TIME RATE
(MB) (SEC) (MFLOP/S)
40000 0.61 0.0007735 103.4313
80000 1.22 0.0015611 102.4898
120000 1.83 0.0021688 110.6608
160000 2.44 0.0029216 109.5287
200000 3.05 0.0038347 104.3095
240000 3.66 0.0046173 103.9566
280000 4.27 0.0065335 85.7117
320000 4.88 0.0091919 69.6266
360000 5.49 0.0124616 57.7776
400000 6.10 0.0155816 51.3425
440000 6.71 0.0194310 45.2884
480000 7.32 0.0226135 42.4525
520000 7.93 0.0246022 42.2727
560000 8.54 0.0266907 41.9621
600000 9.16 0.0281908 42.5671
640000 9.77 0.0304189 42.0791
This processor has consistent performance for problems less than four MB and for prob-
lems greater than seven MB in size. This is as expected, since this processor contains a four MB,
two-way associative cache that uses an LRU replacement scheme.
                          Theoretical peak     Measured
                          (Mflop/s)            (Mflop/s)     Percent
R10000: Out-of-cache          58                   50            86
Since the actual performance is over 75% of the theoretical performance in all cases, the
compilers are doing a good job of code generation.
6.5 Summary
This chapter developed the timing tools necessary to determine the effectiveness of the
optimization techniques defined in the previous chapter and showed how to predict the perfor-
mance of simple kernels on high performance processors. By combining all of these tools, pro-
cessor optimization can be analyzed in great detail. If your measured performance is over 80%
of the theoretical peak, you should be pretty happy and move on to the next most important part
of your application. If you understand the performance of more than 80% of your application,
it’s time to declare victory!
Is High Performance Computing Language Dependent?
7.1 Introduction
If you were going to paint a house, you’d be more likely to reach for a paintbrush than
a hammer. More generally, there are very few general-purpose tools. This was also the case with
early generations of programming languages. Originally, Fortran was designed for scientific,
number-crunching applications (hence the name FORmula TRANslation). The common busi-
ness-oriented language (COBOL) was designed specifically for accounting applications, and the
C programming language grew from system programming needs. There are many other exam-
ples, but you get the idea. Given the environments from which these languages derived, each had
some underlying assumptions built into the language. Originally, this made the compiler tech-
nology for each of them simple (compared to today’s descendants). Over the years, languages
have grown to include functionality found in other programming languages.
There are many examples of how each language has picked up features from others. For-
tran 90 provides features such as pointers, dynamic memory allocation, and select/case con-
structs, all of which were previously in other languages, such as C. The same is true of C, with
modern compilers supplying loop optimizations and compiler directives for parallelism origi-
nally designed for Fortran.
So, are we at a point where any programming language provides the same performance for
any applications? Hardly. Is there one language that gives better performance for all programs?
Again, hardly. Many of today’s large software applications are implemented in two, three, and
sometimes four programming languages. For example, a seismic imaging application is likely to
implement its graphics and user interface with C++, memory management and I/O in C, and
number crunching signal processing routines in Fortran (and even assembler!). Why? Because
each of these languages is well-suited to the application area mentioned. This same type of
application is just as likely to be implemented entirely in a single programming language such as
C. While some performance may be lost in doing so, it removes the reliance on multilingual pro-
grammers and exposure to portability issues due to using multiple languages. So, while one may
not write a graphical user interface in Fortran or a simple linear algebra subroutine in C, there
are many things that one can do to get good performance using any given language.
For example, compiling to assembly and then assembling the result, as in
f90 -S myprog.f
as myprog.s
produces an object file just as
f90 -c myprog.f
does. This is accomplished in an analogous way with C compilers as well. To illustrate this, let
us consider the following C code, contained in a file called addabs.c:
#include <math.h>
double addabs( double x, double y)
{
double temp;
temp= fabs(x) + fabs(y);
return( temp );
}
.LEVEL 2.0N
.SPACE$TEXT$,SORT=8
.SUBSPA$CODE$,QUAD=0,ALIGN=4,ACCESS=0x2c,CODE_ONLY,SORT=24
addabs
.PROC
.CALLINFO
CALLER,FRAME=16,ENTRY_FR=%fr13,SAVE_RP,ARGS_SAVED,ORDERING_AWARE
.ENTRY
STW %r2,-20(%r30) ;offset 0x0
FSTD,MA %fr12,8(%r30) ;offset 0x4
FSTD,MA %fr13,8(%r30) ;offset 0x8
LDO 48(%r30),%r30 ;offset 0xc
.CALL ARGW0=FR,ARGW1=FU,RTNVAL=FU ;fpin=105;fpout=104;
B,L fabs,%r2 ;offset 0x10
FCPY,DBL %fr7,%fr12 ;offset 0x14
FCPY,DBL %fr4,%fr13 ;offset 0x18
.CALL ARGW0=FR,ARGW1=FU,RTNVAL=FU ;fpin=105;fpout=104;
B,L fabs,%r2 ;offset 0x1c
FCPY,DBL %fr12,%fr5 ;offset 0x20
FADD,DBL %fr13,%fr4,%fr4 ;offset 0x24
LDW -84(%r30),%r2 ;offset 0x28
LDO -48(%r30),%r30 ;offset 0x2c
FLDD,MB -8(%r30),%fr13 ;offset 0x30
BVE (%r2) ;offset 0x34
.EXIT
FLDD,MB -8(%r30),%fr12 ;offset 0x38
.PROCEND;fpin=105,107;fpout=104;
.SPACE$TEXT$
.SUBSPA$CODE$
.SPACE$PRIVATE$,SORT=16
.SUBSPA$DATA$,QUAD=1,ALIGN=64,ACCESS=0x1f,SORT=16
.SPACE$TEXT$
.SUBSPA$CODE$
.EXPORT addabs,ENTRY,PRIV_LEV=3,ARGW0=FR,ARGW1=FU,ARGW2=FR,ARGW3=FU
.IMPORT fabs,CODE
.END
In the listing, general registers are written as %rn and floating-point registers as %frn, where n
is a number from 0 to 31. Instructions that begin with F are typically floating point instructions,
e.g., an instruction that starts with FLDD is a floating point load instruction which loads data into
a given register. Stores are performed with FSTD instructions, copies from one floating point reg-
ister to another are performed with FCPY, adds are performed with FADD, etc. As discussed in
Chapter 2, most RISC processors have what is called a branch delay slot in the execution
sequence, i.e., the instruction following the branch instruction is executed before the branch is
actually taken.
PROGRAM MAIN
REAL X(10)
CALL FOO( X, X, 9 )
END
SUBROUTINE FOO( A, B, N )
REAL A(*), B(*)
INTEGER N
...
In the subroutine FOO(), the arrays A and B are most certainly aliased!
The Fortran 90 standard defines pointers. Previous to this, pointers were not defined in
Fortran standards. However, Cray Research provided a convenient implementation of pointers in
Fortran (this implementation is generally referred to as “Cray pointers”). Generally speaking, if
you need to use pointers in your application, then use C, C++, or hope that your Fortran com-
piler supports “Cray” pointers (which is usually the case).
So, what’s the big deal with aliasing? Consider the following sequence of C statements:
void copy( a, b, n )
double *a, *b;
int n;
{
int i;
for( i = 0; i < n; i++)
b[i] = a[i];
return;
}
Since the compiler must assume aliasing, the pseudo-code generated for the loop in copy()
moves a single element per iteration: each a[i] is loaded and stored into b[i] before the next
element is touched. To see what can happen, suppose copy() is called with b aliased to a + 1
(so that b[i] and a[i+1] refer to the same memory location) and with the array a containing the
values
a[0] = 7
a[1] = -3
a[2] = 44
a[3] = 8
a[4] = 1000
before the call to copy() and afterward, due to the aliasing of a with b, we have
a[0] = 7
a[1] = 7
a[2] = 7
a[3] = 7
a[4] = 7
If, on the other hand, the compiler knows that no aliasing occurs, it can unroll the loop by a
factor of four, performing all four loads of a before any of the stores into b, roughly as in the
sketch that follows. Of course, there’s a “cleanup” loop to copy the remaining elements if n is not
a multiple of four, which follows the unrolled pseudo-code.
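A minimal sketch of that unrolled pseudo-code (illustrative only):

for( i = 0; i < n - 3; i += 4 )
{
   t0 = a[i];   t1 = a[i+1];  t2 = a[i+2];  t3 = a[i+3];   /* all four loads first */
   b[i] = t0;   b[i+1] = t1;  b[i+2] = t2;  b[i+3] = t3;   /* then all four stores */
}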
In this case, the array a, after the copy, looks like this:
a[0] = 7
a[1] = 7
a[2] = -3
a[3] = 44
a[4] = 8
This is a lot different from what one might expect! So, if you plan to allow aliasing in your
application, then you should avoid using Fortran subroutines for starters. Does this mean you
should use only Fortran subroutines if you don’t allow aliasing of array or pointer arguments and
want optimal performance? Roughly speaking, the answer is no. But you’re going to have to use
a special flag or two to get the C compiler to generate optimal code.
Compiler designers have taken a serious look at C aliasing issues and most, if not all, pro-
vide compile flags that tell the compiler that you “promise none of the pointer or array argu-
ments are aliased.” That is, they do not “overlap” in terms of the memory they address.
Flags to accomplish this do not, unfortunately, have a common syntax. For example:
HP: +Onoparmsoverlap
SGI: -OPT:alias=restrict
Sun: -xrestrict
There are a lot of subtle issues with aliasing, so let’s consider several variants on the fol-
lowing loop:
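The loop itself is simple; a sketch, reconstructed from the routine name and the assembly listings
that follow (so treat the exact signature as an assumption), is:

void fooreg( double *a, double *b, double *c, int n )
{
   int i;
   for( i = 0; i < n; i++ )
      c[i] += a[i] * b[i];
   return;
}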
The following analyses will all be for the Hewlett-Packard C compiler unless specified
otherwise. When compiled with cc +O2, the loop in fooreg() is unrolled, but scheduling is
poor, as is demonstrated by the resulting sequence of instructions (generated by cc +O2):
$D0
FLDD -8(%r24),%fr7 ; Load b[i]
FLDD -8(%r25),%fr8 ; Load a[i]
FLDD -8(%r23),%fr9 ; Load c[i]
FLDD 0(%r23),%fr6 ; Load c[i+1]
FLDD 8(%r23),%fr5 ; Load c[i+2]
FLDD 16(%r23),%fr4 ; Load c[i+3]
FMPYFADD,DBL %fr7,%fr8,%fr9,%fr24 ; fr24= a[i]*b[i] + c[i]
FSTD %fr24,-8(%r23) ; Store fr24 at c[i]
FLDD 0(%r24),%fr10 ; Load b[i+1] (only after store of c[i])
FLDD 0(%r25),%fr11 ; Load a[i+1] (only after store of c[i])
FMPYFADD,DBL %fr10,%fr11,%fr6,%fr26 ; fr26= a[i+1]*b[i+1]+c[i+1]
FSTD %fr26,0(%r23) ; Store fr26 at c[i+1]
The important thing to note is how the loads of a and b cannot be done until c from the
previous iteration is calculated and stored.
Is this architecture dependent? Absolutely not. The same routine compiled on an SGI Ori-
gin 2000 machine results in the following assembly, which corresponds to the loop having been
unrolled by a factor of two, but the scheduling suffers because the compiler is constrained on
what it can do. Note that comments are preceded by “#” rather than “;” on this architecture.
.BB8.foreg:
ldc1 $f3,0($4) # load a[i]
ldc1 $f0,0($5) # load b[i]
ldc1 $f2,0($6) # load c[i]
madd.d $f2,$f2,$f3,$f0 # f2= c[i]+a[i]*b[i]
sdc1 $f2,0($6) # store f2 into c[i]
ldc1 $f0,8($4) # load a[i+1]
ldc1 $f1,8($5) # load b[i+1]
ldc1 $f3,8($6) # load c[i+1]
madd.d $f3,$f3,$f0,$f1 # f3= c[i+1]+a[i+1]*b[i+1]
addiu $6,$6,16 # increment address register for c
addiu $4,$4,16 # increment address register for a
addiu $5,$5,16 # increment address register for b
bne $6,$9,.BB8.foreg # if more work to do goto .BB8.foreg
sdc1 $f3,-8($6) # store f3 into c[i+1]
Let’s now consider the instructions generated with the HP C compiler, using
cc +O2 +Onoparmsoverlap:
$D0
FLDD -8(%r24),%fr9 ; Load b[i]
FLDD -8(%r25),%fr10 ; Load a[i]
FLDD -8(%r23),%fr11 ; Load c[i]
FLDD 0(%r24),%fr6 ; Load b[i+1]
FLDD 0(%r25),%fr7 ; Load a[i+1]
FLDD 0(%r23),%fr8 ; Load c[i+1]
FLDD 8(%r24),%fr4 ; Load b[i+2]
FLDD 8(%r25),%fr5 ; Load a[i+2]
FMPYFADD,DBL %fr9,%fr10,%fr11,%fr24 ; fr24= a[i]*b[i]+c[i]
FSTD %fr24,-8(%r23) ; Store c[i]= fr24
FLDD 8(%r23),%fr22 ; Load c[i+2]
FMPYFADD,DBL %fr6,%fr7,%fr8,%fr25 ; fr25= a[i+1]*b[i+1]+c[i+1]
FSTD %fr25,0(%r23) ; Store c[i+1]
FLDD 16(%r24),%fr23 ; Load b[i+3]
LDO 32(%r24),%r24 ; Increment address register for b
FMPYFADD,DBL %fr4,%fr5,%fr22,%fr28 ; fr28= a[i+2]*b[i+2]+c[i+2]
FSTD %fr28,8(%r23) ; Store c[i+2]
FLDD 16(%r25),%fr26 ; Load a[i+3]
LDO 32(%r25),%r25 ; Increment address register for a
FLDD 16(%r23),%fr27 ; Load c[i+3]
FMPYFADD,DBL %fr23,%fr26,%fr27,%fr9 ; fr9= a[i+3]*b[i+3]+c[i+3]
FSTD %fr9,16(%r23) ; Store c[i+3]
ADDIB,> -4,%r31,$D0 ; If more work to do goto $D0
LDO 32(%r23),%r23 ; Increment address register for c
Observe how the load instructions are grouped at the beginning of the loop so that as many
operands as possible are retrieved before the calculations are performed. In this way, the effect
of instruction latency is decreased so that fewer FMPYFADD instructions are delayed.
Aliasing issues don’t apply only to array arguments used in the body of the loop. The com-
piler is also on the alert for any of the loop control variables to be aliased as well. To illustrate
this point, consider the following subroutine and recall that external or global variables can also
be the “victims” of aliasing:
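A sketch of such a subroutine, reconstructed from the assembly that follows (the routine and
argument names, other than globn, are assumptions), is:

int globn;                             /* global loop bound */

void fooglobal( double *a, double *b, double *c )
{
   int i;
   for( i = 0; i < globn; i++ )
      c[i] += a[i] * b[i];
   return;
}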
Now, not only do we have an issue with a, b, and c overlapping, but they may alias with
globn as well—which could have a serious impact on what the answer is. This is true even
though globn and the arrays are of different types! As we would expect, the compiler generates
some pretty conservative code:
$D1
FLDD,S %r31(%r25),%fr4 ; Load a[i]
FLDD,S %r31(%r26),%fr5 ; Load b[i]
LDO 1(%r31),%r31 ; i++
FLDD 0(%r23),%fr6 ; Load c[i]
FMPYFADD,DBL %fr4,%fr5,%fr6,%fr7 ; fr7= a[i]*b[i]+c[i]
FSTD %fr7,0(%r23) ; Store fr7 at c[i]
LDW RR’globn-$global$(%r1),%r29 ; Load globn
CMPB,>,N %r29,%r31,$D1 ; If ( i < globn ) goto $D1
SHLADD,L %r31,3,%r24,%r23 ; Increment address register for c
This is going to be brutally slow. Not only is the loop not unrolled, disabling efficient
scheduling, but the value of the loop stop variable, globn, is loaded in every iteration!
The good news is that by compiling with +Onoparmsoverlap, the loop is unrolled and
globn is not loaded inside the loop, so the generated instructions are analogous to that discussed
above.
This simply multiplies the second and third arguments and stores the result in the first argument.
A word of caution when building one’s own macros to perform such calculations—be on
the lookout for floating point overflow/underflow issues. A good example is complex division. A
straightforward approach to dividing a complex number n by a complex number d is as follows:
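Sketched in C (the function and variable names here are illustrative, not taken from any
particular library), the straightforward formula and a more robust scaled variant look like this:

#include <math.h>

/* naive complex division: (a + b*i) / (c + d*i); c*c + d*d can overflow */
void cdiv_simple( double a, double b, double c, double d,
                  double *qr, double *qi )
{
   double denom = c*c + d*d;
   *qr = (a*c + b*d) / denom;
   *qi = (b*c - a*d) / denom;
}

/* scaled version: divide through by whichever component of the denominator
   has the larger magnitude, so intermediate results stay in range          */
void cdiv_scaled( double a, double b, double c, double d,
                  double *qr, double *qi )
{
   double r, denom;
   if( fabs(c) >= fabs(d) )
   {
      r = d / c;
      denom = c + d * r;
      *qr = (a + b * r) / denom;
      *qi = (b - a * r) / denom;
   }
   else
   {
      r = c / d;
      denom = d + c * r;
      *qr = (a * r + b) / denom;
      *qi = (b * r - a) / denom;
   }
}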
This algorithm is actually used by many Fortran compilers to perform complex division
(and fortunately it is hidden from the user). This algorithm enables complex division to be
accomplished for a greater range of values (without causing floating point overflow) than the
simpler form presented earlier.
Programming languages have different approaches to how they call functions or subrou-
tines. In an effort to make things a little more concise, let’s expand the term function to describe
routines which return values (traditionally called functions) as well as those that do not return
values (subroutines).
Some languages pass arguments by value, while others pass them by address. C and C++
use a pass by value convention. For example, suppose the function foo() is defined as follows:
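Based on the register usage described below, the definition looks something like this (the return
type and argument names are assumptions):

void foo( int n, double *x, double y );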
Then the function foo will actually receive the value of all three arguments. Typically, this is
accomplished through registers. Many RISC architectures use certain registers for arguments to
function calls. When the function foo is executed on PA-RISC machines, the value of n will be
in general register 26, the value of x (which is actually an address!) will be in general register
25, and the value of y will be in floating point register 7.
With a pass by address language such as Fortran, addresses are passed for each argument.
To illustrate this, consider the function FOO defined in Fortran:
FUNCTION FOO( N, X, Y )
REAL*8 FOO
INTEGER N
REAL*8 X(*), Y
In this case, general register 26 contains the address where the value of N resides. General regis-
ter 25 contains the address of the first element of the array X, and general register 24 contains the
address of the memory location containing the value of Y. Thus, in Fortran, the value of N must
be loaded from the address in general register 26, and the value of Y will have to be loaded from
the address in general register 24. It’s interesting to note that one doesn’t have to load the
address X from the address in general register 25; it’s already there! A slight inconsistency, but
we’ll let it slide.
The subject of this book is performance, so one should be asking why we’re going down
this path. Generally speaking, the pass by value convention is more efficient. The reason is
pretty simple: It eliminates the storing and loading of arguments. Consider the simple example
of integration using the rectangle method, as follows:
double rectangle_method( x, n )   /* opening reconstructed; argument types assumed */
double *x; int n;
{
   double myfunc(), t;
   int i;
   t = 0.0;
   for( i = 1; i < n; i++ )
      t += (x[i] - x[i-1]) * myfunc( (x[i] + x[i-1]) * 0.5 );
   return( t );
}
So, we’re integrating the function x^2 over n points, x[0] through x[n-1]. These same
two functions can be analogously defined in Fortran:
FUNCTION MYFUNC( X )
REAL*8 MYFUNC, X
MYFUNC = X * X
RETURN
END
FUNCTION RECTANGLE_METHOD( X, N )
EXTERNAL MYFUNC
REAL*8 RECTANGLE_METHOD, MYFUNC
REAL*8 T, X(*)
INTEGER I, N
T = 0.0
DO I=2, N
T = T + ( X(I) - X(I-1) ) * MYFUNC( (X(I) + X(I-1)) * 0.5 )
END DO
RECTANGLE_METHOD = T
RETURN
END
Now, let’s take a look at the code generated by each and compare the number of instructions.
For the C example, the code generated for myfunc() is simply a return branch (BVE) with a
floating-point multiply of the argument register by itself placed in the branch’s delay slot.
The function consists of just two instructions! Note again that the instruction following a branch
(the delay slot) is executed.
The sequence of instructions from the C code for the loop in rectangle_method() is as
follows:
$00000006
FLDD 8(%r3),%fr6 ; Load x[i]
FLDD 0(%r3),%fr7 ; Load x[i-1]
LDO 8(%r3),%r3 ; i++
FADD,DBL %fr6,%fr7,%fr4 ; x[i] + x[i-1]
FSUB,DBL %fr6,%fr7,%fr14 ; x[i] - x[i-1]
B,L myfunc,%r2 ; call myfunc
FMPY,DBL %fr4,%fr13,%fr5 ; arg (fr5) = 0.5*(x[i] +x[i-1])
ADDIB,< 1,%r4,$00000006 ; goto $00000006 if more to do
FMPYFADD,DBL%fr14,%fr4,%fr12,%fr12; accumulate rectangle area
The Fortran version of MYFUNC, on the other hand, generates
FLDD 0(%r26),%fr5
BVE (%r2)
FMPY,DBL %fr5,%fr5,%fr4
So, Fortran has to load the argument from the address in register 26 and then calculate the return
value. Not only is there an additional instruction, there is also the instruction latency impact.
That is, the multiply cannot execute until the load completes.
The instructions generated for the loop in the Fortran version of RECTANGLE_METHOD are:
$0000000C
FLDD 8(%r3),%fr5 ; Load x(i)
FLDD 0(%r3),%fr6 ; Load x(i-1)
LDO -96(%r30),%r26 ; Put address for argument in r26
LDO 8(%r3),%r3 ; i = i + 1
FADD,DBL %fr5,%fr6,%fr4 ; x(i) + x(i-1)
FSUB,DBL %fr5,%fr6,%fr14 ; x(i) - x(i-1)
FMPY,DBL %fr4,%fr13,%fr7 ; 0.5 * (x(i) + x(i-1) )
B,L myfunc,%r2 ; call myfunc
FSTD %fr7,-96(%r30) ; Store function argument (fr7)
ADDIB,<= 1,%r4,$0000000C ; goto $0000000C if more to do
FMPYFADD,DBL %fr14,%fr4,%fr12,%fr12; accumulate rectangle area
There are 11 instructions in this version of the loop, two more than in the C version. Both
additional instructions are caused by the store to memory required for passing the argument to
MYFUNC. That is, after the value is calculated, it must then be stored to the address contained in
general register 26.
This example has shown three potential performance problems that the pass by address
convention can cause:
1. Additional instructions are usually required for function calls. The caller must store the
arguments to memory and then the callee must load them.
2. For simple functions, the additional memory operations incur instruction latencies that
delay operations being performed on the arguments.
3. Additional loads and stores cause additional memory access, risking cache misses and
thrashing.
#include <math.h>
...
float x, y;
...
x = fabs( y );
If you compile with cc -O, then the following sequence of instructions can be expected:
Note that fr5 is used to pass the first (and only, in this case) floating point argument to the func-
tion call. Recall that the instruction immediately following the subroutine call branch (B,L) is
executed in the delay slot and hence is actually performed before the first instruction in the func-
tion fabs(). So, the first instruction executed after the return from fabs() is the
FCNV,DBL,SGL instruction (the last one in the sequence). At first glance, this looks pretty good,
just three instructions. But keep in mind that we just want the absolute value of the number and
it took three instructions just to call the subroutine fabs(). In addition to those are the instruc-
tions in the fabs() subroutine.
So, what’s a poor user to do? It seems like there ought to be a single instruction to perform
this operation; after all, it amounts to just clearing the sign bit. In fact, there is a single
instruction that does exactly that on most architectures. On SPARC processors, it’s FABSS()
and on PA-RISC, it’s FABS(). But, how does one get the compiler to generate the single
instruction instead of the subroutine call? Luckily, most compilers have a handy flag that
optimizes common subroutine calls such as fabs(). For Hewlett-Packard’s ANSI C compiler,
the flag is +Olibcalls (which roughly translates to optimize library calls). So, if we were to
compile the sequence of C code above with cc -O +Olibcalls, then we’ll get the following:
Note that we got very lucky here. Not only did the compiler remove the function call to
fabs(), it also eliminated the conversion to and from double. This is not always true, as we
shall see with sqrt() later.
Here’s a short list of flags that accomplish this with different vendors’ compilers (and usu-
ally these are available only with the “premium” compilers):
Sun: -xlibmil
HP: +Olibcalls
Note that SGI C compilers inline absolute value and square root instructions by default!
Before you go off and start using these flags on your application, there are two important
points that should be made. First, it is imperative that you include the appropriate include file for
the subroutine you’re trying to optimize (you should do this in any case). Note that math.h is
included in the example above because that’s where fabs() is prototyped. The second thing
you must keep in mind is that the calling code must not expect to access the errno variable after
the function’s return. This variable, defined in errno.h, is set to a nonzero code value that more
specifically identifies the particular error condition that was encountered. In many cases, the
function call is replaced by an instruction or sequence of instructions, eliminating the opportu-
nity to access error codes that might otherwise be set by a function call. So, use the flags above
to improve the performance of selected library routines only when you are not performing error
checking for these routines.
Another common routine that behaves similar to fabs() is sqrt(). This function is diffi-
cult to generalize because some architectures (including IA-64) implement this in software
using multiple instructions. Yet other architectures implement sqrt() in hardware and hence as
a single instruction. Suppose we have the following C code:
#include <math.h>
...
float x, y;
...
x = sqrt( y );
Well, the call to sqrt() was replaced by the single FSQRT instruction. Unfortunately, we
didn’t get as lucky with the conversions as we did with fabs(). In this case, we have to be a lit-
tle more careful about data type. More precisely, if we replace sqrt() with sqrtf() (note the
f suffix), then we’ll get just a single instruction:
Note that the SGL part of the instruction causes the square root to be calculated in 32-bit
precision, rather than 64-bit (double), and it removes the conversions, just as we’d hoped.
For a moment, let’s go back to the absolute value topic. In particular, the integer version of
absolute value, abs(), sometimes gets overlooked by compiler optimization. That is, abs()
will sometimes result in a function call even with the handy flags mentioned above. Moreover,
be mindful that abs() is usually prototyped in stdlib.h, not math.h. If you find that you
need abs() to be done a lot faster than the system’s function call, then consider redefining with
a macro such as the following:
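A common version of such a macro (shown as an illustration; note that the argument is evaluated
twice, so avoid side effects in it) is:

#define abs(x)  ( (x) < 0 ? -(x) : (x) )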
For example, on Hewlett-Packard machines, the above macro will cause a call to abs() to be
replaced by a short inline sequence of instructions.
It’s not as clean as it might be, but it replaces a function call with two instructions.
As it turns out, Fortran has a distinct advantage over C here. If one uses the “general pur-
pose” version of the intrinsic (either abs() or sqrt()), then the +Olibcalls flag (or its analogs)
is not needed. That is, the intrinsic calls are replaced with the appropriate instruction(s). It is
interesting to note that, for integer absolute value, Fortran uses the same two instructions shown
above!
Consider a small routine that multiplies two complex numbers x and y and stores the result in t.
Its opening is sketched here (the structure type name dcomplex is an assumption; the re and im
fields are taken from the body):
typedef struct { double re, im; } dcomplex;
void cmult( t, x, y )
dcomplex *t, *x, *y;
{
   double xreal, ximag, yreal, yimag, treal, timag;
   xreal = x->re;
ximag = x->im;
yreal = y->re;
yimag = y->im;
treal = xreal * yreal - ximag * yimag;
timag = xreal * yimag + ximag * yreal;
t->re = treal;
t->im = timag;
return;
}
Before we proceed any further, it needs to be stressed that complex multiplication is usu-
ally inlined by Fortran compilers and, if the intrinsic mentioned earlier in this section is used, it
will be inlined by C compilers. We’re using this subroutine solely as an example to demonstrate
the efficiency of vector intrinsics. The instructions generated for accomplishing this are roughly
as follows, with the approximate clock cycle that the operation will begin on with a PA-RISC
PA-8500 processor (annotated in comments):
So, there’s a lot of stalling going on here. It’s easy to imagine that complex multiplication would
be used inside a loop. If that loop were unrolled by a factor of two, then the subroutine cmult()
would be called twice in each iteration of the unrolled loop. After examining the sequence of
instructions above, it seems like there’s some opportunity to get some additional instructions
executed during some of the delays. So, let’s consider using a routine to do two complex multi-
plications in one call:
void cmult2( t0, x0, y0, t1, x1, y1 )   /* opening reconstructed; same dcomplex type as above */
dcomplex *t0, *x0, *y0, *t1, *x1, *y1;
{
   double xreal0, ximag0, yreal0, yimag0, treal0, timag0;
   double xreal1, ximag1, yreal1, yimag1, treal1, timag1;
   xreal0 = x0->re;
ximag0 = x0->im;
yreal0 = y0->re;
yimag0 = y0->im;
xreal1 = x1->re;
ximag1 = x1->im;
yreal1 = y1->re;
yimag1 = y1->im;
treal0 = xreal0 * yreal0 - ximag0 * yimag0;
timag0 = xreal0 * yimag0 + ximag0 * yreal0;
treal1 = xreal1 * yreal1 - ximag1 * yimag1;
timag1 = xreal1 * yimag1 + ximag1 * yreal1;
t0->re = treal0;
t0->im = timag0;
t1->re = treal1;
t1->im = timag1;
return;
}
The resulting assembly code is as follows (again with the approximate clock cycle when the
instruction begins execution):
Just looking at the comments, it seems like there’s a lot more waiting going on than before.
However, recall that it took 10 clocks to perform a single multiply before and here we get 2 mul-
tiplications done in 15 clocks. So, that’s an improvement of 25%. In addition to this perfor-
mance, note that we’ll be doing half as many function calls.
Other, more complicated, operations such as log, sine, cosine, exponential, etc., also bene-
fit from performing multiple operations with a single call. This usually happens with the basic
-O optimization flag with Fortran. To be safe, it is recommended to always compile and link
with the flag(s) that enable optimized library calls (e.g., +Olibcalls). With C and C++, one
should always use this flag and disable aliasing so that the compiler knows it can perform multi-
ple operations at a time on arrays (e.g., +Onoparmsoverlap on HP-UX). Highly tuned, some-
times vector, versions of the math library functions sin, cos, tan, atan2, log, pow, asin,
acos, atan, exp, and log10 are usually available with most compilers.
7.5.3 Sorting
One of the more common activities in many applications is sorting. So, it is no wonder that
the routine qsort() is available on most systems today. This routine is a general purpose
implementation of the quicksort algorithm with occasional improvements, depending on the sys-
tem.
The subject of sorting algorithms has taken up volumes of literature already and there are
several excellent references on the subject. This book will not be one of those. Basically, this
section is intended to point out a couple of performance problems with using the qsort()
library routine and how to get around them.
Let’s suppose that you have an array of 64-bit floating point numbers that you want to sort
in ascending order. Then you might use the following to accomplish this:
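A sketch of such a call (the array and routine names are illustrative):

#include <stdlib.h>

/* compare routine for ascending order; returns 0 for identical elements */
int compare( const void *p, const void *q )
{
   double a = *(const double *)p;
   double b = *(const double *)q;
   if( a < b ) return( -1 );
   if( a > b ) return( 1 );
   return( 0 );
}

/* sort the n-element array x in ascending order */
void sort_doubles( double *x, size_t n )
{
   qsort( (void *)x, n, sizeof(double), compare );
}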
Note that the compare routine returns a zero for elements which are identical. This is pref-
erable to a “simpler” compare routine such as:
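For instance, one of this form (illustrative):

/* "simpler" compare: never returns 0, even when the elements are equal */
int compare_bad( const void *p, const void *q )
{
   return( *(const double *)p <= *(const double *)q ? -1 : 1 );
}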
This version may cause some elements to be swapped unnecessarily because two equivalent ele-
ments may result in a return value of -1, implying they are not equal and should be swapped!
It took 442 milliseconds on a single processor HP N-4000 machine to sort a randomly dis-
tributed array with 100,000 elements using the method outlined above. Moreover, sorting 1 mil-
lion elements in this manner required about 4.52 seconds. An interesting side note is that this is
almost linear performance, not bad for sorting which is widely recognized as being O(n log n)!
A serious issue with using the standard qsort() routine is that it performs a subroutine
call for every comparison that is made.
Consider a two-dimensional Fortran array with 2 rows and 4 columns (e.g., REAL*8 A(2,4)).
Its elements are laid out in memory as shown below; note that each column is stored contiguously.
This method of array storage is referred to as column-major order, or simply column-major.
a(1,1)
a(2,1)
a(1,2)
a(2,2)
a(1,3)
a(2,3)
a(1,4)
a(2,4)
C and C++ are just the opposite of Fortran in how multi-dimensional arrays are stored.
Consider the following definition of an array with 2 rows and 4 columns in C (or C++):
double a[2][4];
In these languages, the values are stored contiguously in memory, as shown in Figure 7-2.
Observe that the storage is such that each row is stored contiguously in memory, just the oppo-
site of Fortran. This method of array storage is described as row-major.
a[0][0]
a[0][1]
a[0][2]
a[0][3]
a[1][0]
a[1][1]
a[1][2]
a[1][3]
The reason that array storage is a performance issue follows from the goal of always trying
to access memory with unit stride. In order to accomplish this, nested loops will often be
reversed, depending on the language you are working with. For example, suppose we want to
copy one two-dimensional array to another where the dimensions of the arrays are both 100 ×
100. In Fortran, this is most efficiently done as follows:
DO J= 1, M
DO I = 1, N
X(I,J) = Y(I,J)
END DO
END DO
This causes the data in both X and Y to be accessed with unit stride in memory. Using the same
loop structure in C, with the two-dimensional arrays x and y, results in the following:
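With x and y declared as double x[100][100], y[100][100] (a sketch matching the 100 × 100
arrays discussed above), the same loop order becomes:

for( j = 0; j < m; j++ )
   for( i = 0; i < n; i++ )
      x[i][j] = y[i][j];    /* inner loop strides by 100 doubles */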
But this will access memory with a stride of 100 (the leading dimension of x and y), destroying
locality. The better way to write the loop, in C, is to switch the loops:
The result is that memory is accessed with unit stride, dramatically improving locality.
There are a couple of other differences between C and Fortran that could be performance
problems.
General routines for operating on multi-dimensional arrays are easier to write in Fortran.
That is, it’s very difficult in C or C++ to duplicate the following Fortran subroutine:
SUBROUTINE FOO( X, Y, M, N)
INTEGER M, N, X(N,*), Y(N,*)
DO J = 1, M
DO I = 1, N
X(I,J) = Y(I,J)
END DO
END DO
You can write a similar routine in C if you know that the trailing dimension is actually
100:
void copy2d( x, y, m, n )     /* illustrative reconstruction; the name and   */
double x[][100], y[][100];    /* declarations here are assumptions           */
int m, n;
{
    int i, j;
    for( j = 0; j < m; j++ )
        for( i = 0; i < n; i++ )
            x[j][i] = y[j][i];
    return;
}
But, alas, this is not always the case. So, what one has to do is write the general purpose
routine as follows:
void copy2d( x, y, m, n )     /* illustrative reconstruction; the name and   */
double *x, *y;                /* declarations here are assumptions           */
int m, n;
{
    int i, j;
    for( j = 0; j < m; j++ )
        for( i = 0; i < n; i++ )
            x[j*n+i] = y[j*n+i];   /* index computed with an integer multiply */
    return;
}
There is usually no performance impact from this, but sometimes the compiler optimization gets
tangled up with the integer multiplication used in calculating the indexes.
compiled with cc -64. As is usually the case, be sure to include the appropriate include files
for non-typical data types such as size_t. The appropriate include file in this case is
stdlib.h.
Type        Size in bytes
short       2    2    2    2    2
int         4    4    4    4    4
long        4    4    8    4    8
size_t      4    4    8    4    8
void *      4    4    8    4    8
Now, why is this important you ask? Any time you can use less storage for your applica-
tion, you should do so. For example, suppose you are frequently using a table of integers that has
250,000 entries. If you can determine that the maximum absolute value of these integers is less
than 32,768 (which is 2^15), then defining that table as a short rather than an int will reduce
the amount of storage required from around 1 MB to 500 KB and hence make it a lot more likely
to reside entirely in cache.
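For instance, assuming the values fit in 16 bits (and with the name table chosen purely for
illustration):

    short table[250000];    /* roughly 500 KB; as int this would be roughly 1 MB */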
In Chapter 2, we discussed the IEEE representation for floating point numbers. The small-
est normalized value in this format is represented by a number which has an exponent value of
one and a mantissa of zero. Smaller numbers can be represented by using denormalized values.
Denormalized floating-point numbers are those with an exponent value of zero and a nonzero
mantissa. For example, with 32-bit floating-point numbers, the smallest normalized magnitude
value is 1.17549435 × 10^-38, which is 1.0 × 2^-126. Numbers with a smaller magnitude can be
represented with denormalized floating-point numbers. For example, 5.87747175 × 10^-39 (that is,
2^-127) is represented by 0.5 × 2^-126, which has a mantissa of 0x400000 (i.e., a 1 in its leading
bit) and an exponent value of zero in its IEEE 32-bit floating-point representation.
An underflow condition may occur when a floating-point operation attempts to produce a
result that is smaller in magnitude than the smallest normalized value. On many systems, the
occurrence of a denormalized operand or result, either at an intermediate stage of a computation
or at the end, can generate a processor interrupt. This is done to allow the user to trap float-
ing-point underflow conditions. Generating exceptions for such operations can reduce the speed
of an application significantly. Hewlett-Packard systems with PA-RISC processors are an exam-
ple of those systems that can exhibit poor performance when they encounter underflow condi-
tions.
To illustrate the performance impact of operations involving denormalized numbers, sup-
pose we perform a vector scaling operation on an array x and store the result in another vector y.
void vscale( x, y, a, n )
float *x, *y, a;
int n;
{
    int i;
    for( i = 0; i < n; i++ )
        y[i] = x[i] * a;
    return;
}
Suppose, for simplicity, that every value of x is the smallest normalized 32-bit float-
ing-point number, 1.17549435 × 10-38. Moreover, assume that the scalar a has a value of 0.1 and
that n is 10,000,000. The above routine, with these assumptions, executed on an HP N-4000
machine in 88.6 seconds. This amounts to less than one-eighth MFLOP/s.
There are several solutions to this problem, including:
• setting any result that would be denormalized to zero,
• modifying the application so that denormalized values are never produced, and
• performing the computation in a higher (double) precision.
The first solution basically indicates that every element of the array y should be set to
zero. This could be done by inserting a test to check if the value is less than the smallest normal-
ized value and, if so, then set the value to zero. If the programmer isn’t clear on where he might
encounter denormalized values, then this isn’t a practical solution since every operation would
need to be tested. The second approach can be applied more often, but it may require a detailed
understanding of the application. The third solution can sometimes be achieved without modify-
ing the application. For example, one could compile the application with
cc -Dfloat=double, which would effectively change all 32-bit floating-point data to 64-bit
floating-point data. HP Fortran has similar functionality through use of the Fortran 90
option +autodbl. Converting the single-precision storage to double-precision allowed the pro-
gram to execute in 1.69 seconds—over 50 times faster!
The downside of converting data storage in this way is that it requires more memory to
execute the application, sometimes causing the working set to no longer fit in cache. In order to
avoid this problem, many vendors provide the capability to simply assign zero to denormalized
results without raising an exception. In most situations this has no impact on the application’s
accuracy. After all, numbers less than 1.17549435 × 10-38 in magnitude are very close to zero.
To enable this flush-to-zero capability on HP systems, link with the +FPD option (or -Wl,+FPD,
if you are using the compiler to perform the link operation). After doing this, the original prob-
lem executed in 1.20 seconds, an improvement of 30% over the double-precision solution.
7.7 Summary
Depending on what your application does, the choice of language can make an impact on
your application’s performance. However, this impact can be reduced to a minimum by using the
right syntax and compiler flags. In some cases, standard library routines can be much faster, but
only if the options are used correctly. On the flip side, some can be slower than what you will
achieve with macros or using your own subroutines!
Generally speaking, care should be taken, when programming in C and C++, to use the
correct include files whenever using standard library routines of any kind (math, string, or other
operations). In particular, the two include files that you should get in the habit of using are
stdlib.h and math.h.
Parallel Processing — An Algorithmic Approach
Some people are still unaware that reality contains unparalleled beauties.
Berenice Abbott
At some point you may find that there just isn’t any more performance that can be
squeezed out of a single processor. The next logical progression in improving performance is to
use multiple processors to do the task. This can be very simple or very difficult, depending on
the work and the computer you are using. The task that you’re doing may be as simple as adding
two arrays together or as complex as processing car rental requests from all over the country.
The former problem is easily broken into pieces of work that can be performed on multiple pro-
cessors. The latter is a very complex problem, one that challenges all of today’s database soft-
ware vendors.
8.1 Introduction
This chapter is not a survey of parallel algorithms. Quite the contrary, it is intended to pro-
vide the reader with multiple parallel mechanisms and examples of how they can (and cannot, in
some cases) be used to enable a task to be processed in parallel. Many of the basic concepts dis-
cussed in this chapter were discussed in Chapter 4, including processes, threads, etc.
To accomplish the goal of this chapter, a relatively simple problem will be examined and
multiple parallel programming techniques will be applied to it. In this way, many of the concepts
and techniques will be demonstrated through examples. We will also illustrate several pitfalls of
parallel processing.
Consider a file which contains hundreds, perhaps thousands, of records which are delim-
ited with a specific string of characters. The problem is that every individual record, except for
its delimiter, is to be sorted. So, for example, suppose the input file is as follows:
DELIM:hjfjhrajnfnDELIM:qwdqsaxzfsdgfdophpjargjkjgDELIM:adqwxbncmb
Note that with the string DELIM: as the delimiter, there are three records in the file. We want to
produce a file which has the characters within each record sorted. So, the resulting output file
would be:
DELIM:affhhjjjnnrDELIM:aadddffggghjjjkoppqqrsswxzDELIM:abbcdmnqw
Unfortunately, the file doesn’t contain any information about the maximum record size or the
total number of records. Moreover, records can have arbitrary length.
Throughout the following case study, we shall use a file with the same form shown above.
In all cases, we will use the same input file and the same computer. The input file is just over 107
MB in size and the machine we use is a Hewlett-Packard N-4000 computer. The actual sort will
be an implementation of the quicksort algorithm, and record delimiters will be found using a
variation of the Rabin-Karp search algorithm loosely based on the implementation by
Sedgewick [1]. Unless otherwise noted, all I/O will be performed using UNIX system calls. The
details of the sorting, searching, and I/O will not be examined unless they pertain to the parallel
programming approach being discussed.
Let’s look at how the baseline algorithm performs. There is some initial I/O required to
read the data set into memory, then the records are sorted and the results written to an output file.
Using /bin/time to produce the overall time on our N-4000 computer gives the following:
real 1:01.1
user 59.3
sys 1.7
So, it takes about a minute to do the task. Let’s see how we can improve this with parallel
processing.
address space. The execv() system call, and all its variations, loads a program from an ordi-
nary executable file into the current process, replacing the current program. Consider the follow-
ing code segment:
...
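A minimal sketch of such a segment follows; the program name ./hello and the argument list
argv are taken from the discussion below, while the variable names and error handling details
are assumptions.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int i, status;
    pid_t pid;
    ...
    for( i = 0; i < ncpus-1; i++ )
    {
        pid = fork();
        switch( pid )
        {
            case 0:                      /* child: replace this program        */
                execv( "./hello", argv );
                perror( "execv" );       /* reached only if execv() fails      */
                exit( -1 );
                break;
            case -1:                     /* fork() failed                      */
                perror( "fork" );
                exit( -1 );
                break;
            default:                     /* parent: pid is the new child's PID */
                break;
        }
    }
    /* The parent now waits on (joins with) all of the children. */
    for( i = 0; i < ncpus-1; i++ )
        pid = wait( &status );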
Note that the fork() system call creates two processes executing the same program with
one difference. The value of pid in the parent process will be the process identification number,
commonly referred to as its PID, of the newly created child process. The value of pid in this
new child process will be zero. Hence, the path taken in the switch() statement following the
fork will be different for the two processes. Note that the child process executes an execv()
system call which, as it appears above, will execute the program ./hello and pass it the argu-
ment list in argv. The for loop creates ncpus-1 child processes in this way. It’s worth noting
that the perror(), exit(), and break statements will not execute unless the execv() call
fails.
Once the parent process has executed the first for loop, it then executes a second for loop
which waits on the children to exit. The return value from wait is actually the child’s process
identification number. When the parent executes the wait() system call, it will not return until
a child process has completed, i.e., exited the machine’s execution queue.
If ncpus has a value of three, then Figure 8-1 illustrates the sequence of events. The first
for loop accomplishes what is roughly referred to as a spawn in parallel processing parlance.
That is, the parent spawns multiple threads of execution. The second loop achieves synchroniza-
tion of the multiple threads commonly referred to as a join.
Figure 8-1  The parent calls fork() twice; each child starts with pid equal to zero and eventually calls exit(), while the parent calls wait() once for each child.
Let’s get back to our problem of sorting records. It’s clear that we’re performing the same
operation (sorting) on every single record. So, we’ll create multiple processes using fork() as
outlined above and have each process read records from the file. In order to distribute the work
among N processes, we’ll have each process sort records in a round-robin fashion. For example,
the first process will sort records 1, N+1, 2N+1, etc.; the second process will sort records 2, N+2,
2N+2, etc.; and so on. Doing this with the fork/exec model outlined above gives the times in
Table 8-1.
The results are a bit discouraging. We achieved a speedup of only 1.7x on four processors.
Notice that the system cpu time has almost tripled and the total user cpu time has more than dou-
bled. Since each process has to search through the entire file, it may be causing a lot of file con-
tention. That is, since each process is accessing the same file, there is probably a severe
bottleneck in getting the data read from the file.
The performance may benefit from breaking the file into multiple input files and process-
ing them simultaneously yet separately. To do this, we’ll have the parent process read the input
file and distribute the records evenly into N separate files, where N is the number of processors to
be used. Note that this requires the parent to search through the entire input file to identify the
total number of records before it can begin equally distributing them. Since this search and
distribution pass is only O(n), while the quicksort itself is O(n log n) on average, we expect that it
won’t add much processing time. Once this is done, multiple children will be created with fork(),
each of which will read in its own file and sort the records in it.
The resulting times, shown in Table 8-2, are also disappointing. We certainly reduced the
overall system and user cpu time when compared to the “single input file” approach previously
described. However, the job ran much slower; the elapsed time is twice as long as before. Look-
ing at the breakdown of execution time in more detail, we find that it took an increasing amount
of elapsed time to create the input files while the actual sorting didn’t improve all that much.
Looking at the separate input file sizes we find that, for 4 processes, the input file sizes are
dramatically different. The four file sizes are roughly 42 KB, 368 KB, 51 MB, and 56 MB.
Hence two processes were doing over 99% of the work! Note that the number of records was
distributed evenly among the processes, but the sizes of these records varies dramatically. So,
two of the processes sorted some huge records while the other two sorted very small records.
(The accompanying figure shows PROCESS 1 working through records 1 and 4 while PROCESS 2 works through records 2, 3, 5, and 6 over elapsed time.)
We can exploit this to create a lock which prevents two or more processes from accessing the position
file at the same time. This is accomplished by using the following code segment (with
pseudo-code):
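A sketch of that segment follows; the lock file name sort.lock and the descriptor name are
assumptions, and the relevant headers are <fcntl.h> and <unistd.h>. The key property is that
open() with O_CREAT | O_EXCL succeeds for only one process at a time.

    while( (lockfd = open( "sort.lock", O_CREAT | O_EXCL | O_RDWR, 0600 )) < 0 )
        ;                                /* spin until the lock file is ours   */

    /* critical section (pseudo-code):                                         */
    /*   read the current offset from the position file                        */
    /*   locate the next record starting at that offset                        */
    /*   write the offset of the following record back to the position file    */

    close( lockfd );
    unlink( "sort.lock" );               /* release the lock                   */

    /* The record itself is sorted here, outside of the lock. */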
(A second figure shows PROCESS 1 handling records 1 and 3 while PROCESS 2 handles records 2 and 4.)
The locking mechanism is used only to get the record and update the position file with the
position of the next record. The processing, i.e., sorting of the record is done outside of the lock.
Otherwise we’d be defeating the whole purpose of the locking mechanism and processing would
be done one record at a time (inside the lock).
The code between the acquisition of the lock and the release of the lock can be executed
only by a single process. Such a segment of code is often referred to as a critical section or crit-
ical region. Note the while statement which iterates, or spins, until the open() call returns suc-
cessfully. This mechanism is called a spin lock.
After re-architecting the algorithm to use file locking as outlined above, the records were
once again processed. The resulting times are given in Table 8-4. These are not very encourag-
ing, as they are slower than any approach so far.
The overall parallel scaling is not too bad as we’re getting just over 2x speedup with four
processors. Moreover, the user cpu time is not increasing dramatically from 1 to 4 processors.
However, the system time is growing exponentially. This is not a surprise really, because we’re
likely to be doing a tremendous number of system calls in our locking mechanism (open,
read, write, lseek, close, unlink). So, our scaling is improving, but we’re spending
too much time in the operating system. It’s probably time to take a dramatically different
approach.
BSD UNIX implemented a simpler approach using just a couple of system calls:
• mmap() Establishes a mapping between the process’ address space and a file
using the file descriptor returned by open().
• munmap() Releases shared-memory mapping established by mmap().
The BSD mmap() maps the contents of a given file into virtual memory at a given address.
This enables processes to share memory and have it initialized to the contents of a given file
without having to open it, read it, or allocate memory to read the contents into.
The increase in user cpu time for two CPUs is dramatic and is likely being caused by false
cache line sharing. But let’s not dwell on this. We now know that the static scheduling approach,
even with shared-memory, is not efficient, so let’s revisit the dynamic scheduling approach.
• semget() Create a semaphore and return the identifier associated with it.
• semctl() Interface for performing a variety of semaphore control operations.
• semop() Used to atomically perform an array of semaphore operations.
One of the downfalls of System V semaphores is that they are anything but user friendly.
Moreover, this interface was not available in the BSD UNIX. As a result, many early parallel
programmers created their own semaphores or locks.
At the lowest level, mutually exclusive objects usually rely on synchronization primitives
that are implemented in hardware. An atomic exchange interchanges a value in a register for a
value in memory. Most systems today have special instructions that implement atomic
exchanges. PA-RISC systems feature the “load and clear word” instruction, ldcw. Its purpose is
to simply read a value from memory and set the value to be zero. More precisely, it reads the
contents at a certain address and simultaneously zeros the contents of that address.
To better illustrate how this can be used to implement a spin lock, consider the following
subroutine in PA-RISC assembly code:
__ldcws32
        .proc
        .callinfo caller
        .entry
        ldcws   0(0,%arg0),%ret0
        nop
        bv      %r0(%rp)
        .exit
        nop
        .procend
        .end
This function returns the contents of the integer at address ip and zeros that same location.
Thus, we could implement a spin lock mechanism with the following routines:
int acquire_lock(ip)
int * volatile ip;                  /* note: this qualifies the pointer, not *ip */
{
    while (1)
    {
        while ( (*ip) == 0 )        /* spin until the lock appears free          */
            ;
        if ( __ldcws32(ip) != 0 )   /* atomically read and zero the lock         */
            break;                  /* a nonzero value means we acquired it      */
    }
    return 0;
}

int release_lock(ip)
int *ip;
{
    *ip = 1;                        /* make the lock available again             */
    return 0;
}
The release_lock() routine simply sets the value of a memory location, the lock, to one. Thus, the code sequence for a criti-
cal section would be achieved as follows:
acquire_lock( mylock );
/* critical section begins. */
...
/* end critical section. */
release_lock( mylock );
acquire_lock
        .PROC
        .CALLINFO CALLER,FRAME=16,ENTRY_GR=%r3,SAVE_RP,ARGS_SAVED,ORDERING_AWARE
        .ENTRY
        STW       %r2,-20(%r30)
        STW,MA    %r3,64(%r30)
        COPY      %r26,%r3
        LDW       0(%r3),%r31         ; Load contents of ip
$D0
        CMPIB,<>,N  0,%r31,loop2      ; If *ip != 0 goto loop2
loop1
        B         loop1               ; else infinite loop
        NOP
loop2
        .CALL     ARGW0=GR,RTNVAL=GR  ; in=26;out=28;
        B,L       __ldcws32,%r2
        COPY      %r3,%r26
        CMPIB,=,N   0,%r28,$D0        ; If( __ldcws32(ip) == 0 ) goto $D0
        LDW       0(%r3),%r31
        LDW       -84(%r30),%r2
        COPY      %r0,%r28
        BVE       (%r2)
        .EXIT
        LDW,MB    -64(%r30),%r3
        .PROCEND                      ; in=26;out=28;
What’s happened is that the optimizer sees no reason to reload *ip in the spin loop:
while( (*ip) == 0 );
From a compiler’s perspective, this is fair. As a result, the value is loaded once and if it is zero,
then we have an infinite loop.
The volatile qualifier is a nice addition to the C programming language to alleviate just
this sort of problem. It communicates to the compiler, and hence the optimizer, that this variable
may change for no obvious reason, so it needs to be reloaded every time it is accessed. If ip is
declared with the volatile qualifier, then the infinite loop above is transformed into the fol-
lowing:
loop1
        LDW,O     0(%r3),%r31         ; Load *ip
        CMPIB,=,N   0,%r31,loop1      ; If( *ip == 0 ) goto loop1
        NOP
Clearly, the volatile qualifier is very handy for parallel programming. The alternative is
to basically compile any routine that accesses shared-memory with no optimization!
Back to the record sorting problem, we’re going to allocate some shared-memory to be
used as a lock. It will be initialized to one so that the first process can acquire the lock. Note that
this shared-memory is different from that to be used for the actual input file, so there will need to
be separate calls to mmap(). The following call is used to allocate a memory segment to be used
to hold the input data file:
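Such a call might look like the following (declared in <sys/mman.h>); the names buffer and
filesize are assumptions.

    buffer = (char *) mmap( NULL, filesize, PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0 );
    if( buffer == (char *) MAP_FAILED )
        { perror( "mmap" ); exit( -1 ); }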
Note that we’re using the MAP_ANONYMOUS feature of mmap() here so there’s no file descriptor
needed (hence the -1 for that argument).
Now we need some additional shared-memory to contain things like the mutex and the
offset to the next record to be sorted (analogous to the position file’s contents used in the previ-
ous example). This is accomplished through another call to mmap(); this time we demonstrate a
call using a file descriptor:
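A sketch of such a call follows; the file name position.dat, the descriptor posfd, and the layout
of the two integers are assumptions.

    posfd = open( "position.dat", O_RDWR | O_CREAT, 0600 );
    ftruncate( posfd, 2 * sizeof(int) );           /* make the file large enough */
    shared = (int *) mmap( NULL, 2 * sizeof(int), PROT_READ | PROT_WRITE,
                           MAP_SHARED, posfd, 0 );
    if( shared == (int *) MAP_FAILED )
        { perror( "mmap" ); exit( -1 ); }
    shared[0] = 1;                                 /* the lock, initially free   */
    shared[1] = 0;                                 /* offset of the next record  */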
We then execute the program and get the performance outlined in Table 8-6. This is indeed an
improvement! The scaling from 1 to 2 processors is very good, but we’re not seeing much bene-
fit using four processors instead of two. There’s also a large increase in user cpu time at four pro-
cessors. This is likely to be caused by false cache line sharing. Even so, the elapsed time is the
best time yet and is over two times faster than the original sequential algorithm.
#include <pthread.h>
...
for( i = 1; i < ncpus; i++ )
{
retval = pthread_create( tid+i, (pthread_attr_t *) NULL,
(void *(*)())psort_buffer,
(void *) (pattern) );
if( retval > 0 ) perror("pthread_create");
}
psort_buffer( pattern );
So, the parent thread creates additional threads, each of which execute the subroutine
psort_buffer(). Once the additional threads are created, the parent thread also executes
psort_buffer(). These routines can be identical because when threads other than the parent
encounter the return statement in psort_buffer() it causes an implicit pthread_exit()
call to be made. The latter simply terminates the thread. Use it explicitly with caution as it will
terminate the main thread as well!
In defining pthreads, POSIX provided a very simple mutex interface. This interface will be
used to illustrate various parallel programming practices through the remainder of this chapter.
The basic routines provided for pthread mutexes are:
• pthread_mutex_init()  Initialize a mutex.
• pthread_mutex_lock()  Lock a mutex, blocking until it is available.
• pthread_mutex_trylock()  Lock a mutex if it is available; otherwise return immediately.
• pthread_mutex_unlock()  Unlock a mutex.
• pthread_mutex_destroy()  Destroy a mutex.
To implement the critical section needed to accurately communicate the location of the
next record to each thread, we use the following to replace the acquire_lock() and
release_lock() subroutines discussed above:
pthread_mutex_lock( &mutex );
{
/* Critical section begins. */
...
/* Critical section ends. */
}
pthread_mutex_unlock( &mutex );
Implementing the dynamic scheduling with pthreads analogously to how it was done using
processes and shared-memory allocated with mmap() resulted in the times shown in Table 8-7.
Note that the elapsed time improved only slightly, but the user cpu time improved dramatically.
As mentioned above, this is due to the additional load placed on the virtual memory system by pro-
cesses sharing memory allocated with mmap.
This result was also achieved without having to use a special interface to allocate memory
(such as mmap()), nor did it require writing the acquire_lock(), release_lock(), and
ldcw32() routines.
#include <pthread.h>
typedef struct filter_struct
{
double *buffer, *outbuffer, *filter;
int startrow, stoprow, nrows, ncols, filter_size,
fd, outfd, nrecords, my_tid;
} filter_t;
...
filter_t *farray;
...
buffer = fptr->buffer;
outbuffer = fptr->outbuffer;
startrow = fptr->startrow;
stoprow = fptr->stoprow;
nrows = fptr->nrows;
ncols = fptr->ncols;
filter = fptr->filter;
frows = fptr->filter_size;
fcols = fptr->filter_size;
return;
}
One further restriction on this hypothetical problem is that we need to filter the records in
the order that they are stored, so our goal is to filter each record using multiple threads. The first
approach is that, for a given record, we’ll create multiple threads, each of which will filter a par-
ticular segment of the record. In order to do this, we define farray such that it is a bona-fide
array of filter_t data. The startrow and stoprow components of each element of farray
are defined such that the output buffer is divided into an equal number of rows for each thread.
Once this is done, the main loop can be rewritten to use pthreads as follows:
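A sketch of that rewritten loop appears below; read_record(), write_record(), nrecords, and
the buffer names stand in for details not shown here, while farray, tid, and filter_record()
are as described above.

    for( recno = 0; recno < nrecords; recno++ )
    {
        read_record( fd, inbuffer );               /* read the next record        */
        for( i = 1; i < ncpus; i++ )
        {
            retval = pthread_create( tid+i, (pthread_attr_t *) NULL,
                                     (void *(*)())filter_record,
                                     (void *) (farray+i) );
            if( retval != 0 ) perror( "pthread_create" );
        }
        filter_record( farray );                   /* parent filters its rows     */
        for( i = 1; i < ncpus; i++ )               /* join before the next record */
            pthread_join( tid[i], NULL );
        write_record( outfd, outbuffer );          /* write the filtered record   */
    }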
As it turns out, there are a large number of records: 4000 in all. Moreover, the buffers
aren’t very large, 63 × 63, and the filter is even smaller, 11 × 11. This makes for a lot of thread
spawning and joining with not a lot of work to actually perform. The algorithm above was exe-
cuted three times on 1, 2, 4, 8 and 12 processors of an HP V-Class computer. The elapsed, user
cpu, and system cpu times for each of the runs are shown in Table 8-8. Given the small amount
of work to be done filtering each record, it is not a surprise to see that the scaling starts to flatten
out after four threads. Beginning with four threads, things begin to get interesting. That is, there
is some variation in the elapsed times. This happens because threads can be spawned on any
given processor. So they may initially compete for a processor until the scheduler steps in and
does some basic load balancing.
Moreover, the system cpu time begins to grow to the extent that it is a substantial amount
of the overall cpu time, up to 11% using 12 threads. This is not a surprise because, using 12 pro-
cessors, there will be 22 calls to pthread_create() and pthread_join() for each record,
which totals a whopping 88,000 such calls (44,000 threads spawned and joined). We’re fortunate that the
performance is this good! The HP-UX operating system, like most others, maintains a pool of
threads that are waiting to be assigned to any process. Thus, the thread is not actually created
with pthread_create(); it is assigned. Even so, there should be a more efficient way to exe-
cute this workload with less thread overhead.
8.3.3 Barriers
Things would be much nicer if the threads were already created and there was a virtual
gate where they waited while each record was being read into memory. Such a mechanism is
referred to as a barrier. More formally, a barrier is a synchronization mechanism that causes a
certain number of threads to wait at a specified point in an application. Once that number of
threads has arrived at the barrier, they all resume execution. Barriers are typically used to guar-
antee that all threads have completed a certain task before proceeding to the next task.
A barrier can be implemented with two functions: one initializes the barrier information
and another basically designates at what point the threads are to block until the specified number
arrive. The pthreads standard does not implement barriers, but it does provide the necessary
tools to implement them, oddly enough. A typical barrier can be implemented as follows:
#include <stdlib.h>
#include <pthread.h>
barrier_list[barrier_count] = malloc( sizeof(barrier_t) );
if ( pthread_mutex_init(
        &barrier_list[barrier_count]->mutex, NULL ) != 0 )
    perror( "barrierinit_, Unable to allocate mutex" );
if ( pthread_cond_init(
        &barrier_list[barrier_count]->condition, NULL ) != 0 )
    perror( "barrierinit_, Unable to allocate condition" );
barrier_count++;
return;
}
return;
}
With these routines, we’ve introduced another pthread interface, condition variables. Con-
dition variables are very useful synchronization mechanisms. They basically cause a thread to
wait at a given point until an event occurs. So, when a thread encounters a condition variable and
the event has not occurred, then the thread will wait. Subsequently, another thread will cause the
event to occur and cause the condition variable to change state and thus wake up one or more
threads that are waiting on the condition variable and enable them to resume processing.
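To make this concrete, here is a sketch of the companion wait routine, barrier(), built on the
barrierinit() fragment above; the limit, waiting, and cycle fields are assumed additions to
the barrier_t structure.

    void barrier( int id )
    {
        barrier_t *b = barrier_list[id];
        int my_cycle;

        pthread_mutex_lock( &b->mutex );
        my_cycle = b->cycle;
        if( ++b->waiting == b->limit )
        {
            b->waiting = 0;                   /* last thread to arrive: reset,   */
            b->cycle++;                       /* open the barrier, and release   */
            pthread_cond_broadcast( &b->condition );   /* the waiting threads    */
        }
        else
        {
            while( my_cycle == b->cycle )     /* wait until the barrier opens    */
                pthread_cond_wait( &b->condition, &b->mutex );
        }
        pthread_mutex_unlock( &b->mutex );
        return;
    }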
Having defined the barrier mechanism, let’s implement our algorithm with it and see if the
performance improves. To do so, we must first initialize the barrier and start all the threads by
replacing the main loop above with the following sequence:
/* Initialize barrier 0. */
barrierinit( ncpus, 0 );
/* Spawn additional threads. */
for( i = 1; i < ncpus; i++ )
{
retval = pthread_create( tid+i, (pthread_attr_t *) NULL,
(void *(*)())filter_record_par, (void *) (farray+i) );
if( retval != 0 ) perror("pthread_create");
}
filter_record_par( farray );
Note first that the threads are not created for each record. Secondly, the function called by
each thread is different. The function filter_record_par() is basically the main loop with a
few important changes.
nrecords = fptr->nrecords;
fptr->buffer = outbufarray[recno];
fptr->outbuffer = inbufarray[recno];
fptr->filter = filter2;
filter_record( fptr );
}
return;
}
Note that there is not a second barrier after the second filter is applied. This is possible
because the input and output buffers are unique to each record. So, any given thread can start the
first filter of record i+1 as soon as it has finished with the second filter of record i. The barrier
between filters is necessary because a given thread may be ready to start the second filter before
some other thread has finished with the first. Without a barrier, the output of the first filter may
not be complete before some thread tries to use it for input to the second filter.
#include <sys/mpctl.h>
...
extern int cpuid[16];
...
/* Get number of first processor in the system. */
cpuid[0] = mpctl( MPC_GETFIRSTSPU, NULL, NULL );
farray[0].my_tid = 0;
for( i = 1; i < ncpus; i++ )
{
/* Get number of the next processor in the system. */
cpuid[i] = mpctl( MPC_GETNEXTSPU, cpuid[i-1], NULL );
if( cpuid[i] < 0 )
{
perror("mpctl(MPC_GETNEXT)");
exit(-1);
}
/* Initialize the thread index in farray. */
farray[i].my_tid = i;
}
Then, in the filter_record_par() routine, before the loop over each record, we use
mpctl to move the threads to unique processors by using the information in the global array
cpuid. This is accomplished with the following code:
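A sketch of that code follows; MPC_SETLWP_FORCE is the forced binding described in the text,
while MPC_SELFLWPID (designating the calling thread) is an assumption based on the mpctl(2)
interface.

    /* Bind the calling thread to the processor chosen for it. */
    if( mpctl( MPC_SETLWP_FORCE, cpuid[ fptr->my_tid ], MPC_SELFLWPID ) < 0 )
    {
        perror( "mpctl(MPC_SETLWP_FORCE)" );
        exit( -1 );
    }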
Similar results could have been obtained through use of the advisory version of the call, which is
achieved by using MPC_SETLWP in place of MPC_SETLWP_FORCE.
After implementing our program with barriers and the affinity mechanisms described
above, we then execute on the same machine in the same fashion as before. The results of these
runs are shown in Table 8-10. So, indeed, the times do not vary as much as before. Better yet,
they are uniformly as good as or better than any of the previous attempts at executing this task in
parallel. Finally, note that the system cpu times are roughly 40% less than they were without the
additional affinity effort.
farray->buffer = inbuffer;
farray->outbuffer = outbuffer;
while( bytes_read > 0 )
{
filter_record( farray );
This loop was executed on an HP N-4000 computer using a file system that is capable of
I/O performance of at least 80 MB/sec. Processing 50,000 records, each of which was a two-
dimensional array of double precision values, 63 × 63 in size, translates to reading and writing
around 1.5 GB of data (each way). With filter_record() defined as before, this resulted in
36.7 seconds reading, 41.0 seconds filtering, and 16.2 seconds writing data. Thus, the filtering
accounted for less than half of the total time and I/O for the rest! So, even if we had an infinite
number of processors that we could use to filter the data, reducing that portion of the time to
zero, then the best scaling we could achieve would be just under 2x. That’s not very good scaling
by any measure. This is an important part of understanding parallelism—scalability will only be
as good as the part of your application that will execute in parallel.
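In this example the bound works out to (36.7 + 41.0 + 16.2) / (36.7 + 16.2) ≈ 1.8, which is where the “just under 2x” figure comes from.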
A good question at this point is, “Is there any way to perform I/O in parallel?” The answer
is, “Of course!”
This API is actually defined as part of real-time extensions to POSIX, and as a result is
contained in the real-time library on some systems. For example, one must link with -lrt on
HP-UX systems.
This API can be used in our example to achieve better performance. To do this, we’ll actu-
ally need two input buffers and have to modify our logic slightly to toggle between these two
buffers for every iteration of the loop. That is, in any given iteration we’ll want to read data into
one buffer while we filter the other one and write the result to the output file. To do this, we’ll
need the aio_error() function to identify when the asynchronous read has completed and
then check the status with aio_return(). The following code achieves the desired result:
#include <aio.h>
...
struct aiocb aioblk;
...
i = 0;
bytes_read = read( fd, inbuffer[i], nbytes );
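The remainder of the loop might look like the sketch below; filter_record(), write(), outfd,
and the buffer and byte-count names are assumptions carried over from the earlier example,
EINPROGRESS comes from <errno.h>, and memset() requires <string.h>.

    memset( &aioblk, 0, sizeof(aioblk) );     /* clear the control structure     */
    aioblk.aio_offset = 0;
    while( bytes_read > 0 )
    {
        /* Start an asynchronous read of the next record into the other buffer. */
        aioblk.aio_fildes  = fd;
        aioblk.aio_buf     = inbuffer[1-i];
        aioblk.aio_nbytes  = nbytes;
        aioblk.aio_offset += bytes_read;
        if( aio_read( &aioblk ) != 0 ) { perror( "aio_read" ); break; }

        /* Meanwhile, filter the record already in memory and write the result. */
        farray->buffer    = inbuffer[i];
        farray->outbuffer = outbuffer;
        filter_record( farray );
        write( outfd, outbuffer, nbytes );

        /* Poll until the asynchronous read completes, then collect its status. */
        while( aio_error( &aioblk ) == EINPROGRESS )
            ;
        bytes_read = aio_return( &aioblk );

        i = 1 - i;                            /* toggle input buffers            */
    }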
Note that the API provides for either a polling or notification model in identifying when an
asynchronous I/O operation completes. In the above example, the aio_error() function is
used to poll the status of the aio_read(). Also, it should be noted that the aio_read() func-
tion basically provides the functionality of both lseek() and read(), since the offset in the file
is one of the parameters of the asynchronous I/O control structure, aioblk.
Implementing our test program, outlined above, with this approach reduced the effective
read time from 36.7 seconds to 14.5 seconds! But ideally, we’d like to see the read time reduced
to 0, that is, have it entirely overlapped by the filtering and writing tasks. As it turns out, the calls
to aio_read() and, to a lesser degree, aio_error() and aio_return(), take more time
than one might think. This is because a virtual thread must be obtained from the operating sys-
tem before the aio_read() can return. A successful return from aio_read() or
aio_write() indicates that there are sufficient system resources to execute the asynchronous
I/O operation. In some implementations, a thread is actually created and, as we have seen previ-
ously, this is not something that can be done in zero time. In any event, we did improve the per-
formance, reducing the effective read time to less than half of what it was originally.
The pwrite() system call is related to the write() system call in just the same way.
Using these thread safe I/O interfaces, users can achieve their own asynchronous I/O. With
regard to our previous filtering example, there are at least two approaches. We can use either a
model similar to that of aio_read() as discussed above, or each thread can read its own piece
of the input array.
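For the second approach, a sketch of a single thread’s read, using the filter_t fields described
earlier (the offset and byte-count variables are assumed, and the extra boundary rows discussed
below are ignored for brevity), is:

    offset     = (off_t) fptr->startrow * fptr->ncols * sizeof(double);
    nbytes     = (size_t) (fptr->stoprow - fptr->startrow + 1)
                          * fptr->ncols * sizeof(double);
    bytes_read = pread( fptr->fd, fptr->buffer, nbytes, offset );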
Figure 8-4  Unpredictable I/O with multiple threads using traditional system calls.
Let’s consider the latter approach first. But before we do, an important feature of the filter-
ing process needs to be examined in some detail. Consider the diagram in Figure 8-5. If the i-th
row of the input buffer is the first row to be processed by thread 1 then, with a 5 × 5 filter, it
requires data from the last two rows read in by thread 0 and the first three rows read in by
thread 1. The resulting element in row i, column j of the output buffer, however, resides in the
first row of thread 1’s portion of the output buffer.
This is analogous to that illustrated in the two-filter example described in Section 8.3. That
is, create threads outside the main loop as before, with each thread assigned a set of rows. Each
thread then reads its piece of the input buffer into memory, applies the filter, and finally writes its
rows of the output buffer to disk. Note that we’ll need two barriers to get this to work correctly.
The first barrier will occur after reading the data into memory. This is necessary since, when fil-
tering the first and last few of its rows, each thread uses data from rows that are read into the
input buffer by other threads (see Figure 8-5). So, the entire input buffer has to be in place before
any given thread can begin filtering. Since the output buffer is not shared, each thread can write
Figure 8-5  Row i, column j of the output buffer depends on inbuffer data supplied by both thread 0 and thread 1.
its own piece of the output buffer to disk once it has finished filtering. However, the thread must
wait before reading its piece of the next input buffer, because doing so could overwrite informa-
tion that is being used by other threads still filtering the previous record. As a result, we’ll have
to have a barrier after the output buffer is written. This sequence is illustrated in Figure 8-6. One
drawback to this approach is that we now have multiple, simultaneous system calls in the read
and write portions of the loop. That is, suppose we have 4 threads running. Owing to the barrier
after the write, they’ll all be performing pread() calls at virtually the same time, all accessing
the same input file. Writing to the output file may be just as bad because all 4 threads may finish
filtering at roughly the same time, and hence there could well be 4 simultaneous calls to
pwrite(), all of which are accessing the same output file. Implementing this scheme is left as
an exercise for the reader, but don’t be surprised if the performance doesn’t improve much. We’ll
revisit this approach in the ccNUMA section of this chapter.
Given the potential performance problems when each thread performs I/O for its portion
of the input and output buffers, let us revisit the asynchronous I/O approach in the previous sec-
tion. Rather than use aio_read() and aio_write(), we can create additional threads to per-
form the I/O. To accomplish this, we create two additional threads, one to read data into an input
buffer and another to be writing data from the thread-private output buffer. In particular, we need
two separate input buffers just as we did with the aio_read() example above. Each of these
input buffers will have a pair of condition variables associated with them (see the earlier section
on barriers). One of the condition variables will indicate whether the buffer is ready to be filtered
or not. The other will be used to identify whether data can be placed into the input buffer, i.e.,
whether the “filtering” threads are finished with it. The former condition variable will be set by
Figure 8-6  Performing I/O in parallel with pread and pwrite system calls.
the thread doing the reading and the latter will be set by the threads doing the filtering. This
approach has the advantage of far fewer system calls being performed as well as the attraction of
having all the I/O operations being done while other threads are performing computational tasks.
The details of this approach are left as an exercise for the reader.
#include <errno.h>
#include <stdlib.h>
#include <stdio.h>
#include <pthread.h>
#include <spp_prog_model.h>
pthread_mutex_lock( &mutex );
i = pthread_self();
tb[0] = 'a' + i - 1;
fprintf(stderr,"Enter<%d>\n",pthread_self());
fprintf(stderr,"b(x%x)=%c, ",b,b[0]);
fprintf(stderr,"tb(x%x)=%c, ",tb,tb[0]);
fprintf(stderr,"i(x%x)=%d, ",&i,i);
fprintf(stderr,"x(x%x)=%d\n",&x,x);
x = pthread_self();
b[0] = tb[0];
pthread_mutex_unlock( &mutex );
barrier(0);
pthread_mutex_lock( &mutex );
fprintf(stderr,"Exit <%d>\n",pthread_self());
fprintf(stderr,"b(x%x)=%c, ",b,b[0]);
fprintf(stderr,"tb(x%x)=%c, ",tb,tb[0]);
fprintf(stderr,"i(x%x)=%d, ",&i,i);
fprintf(stderr,"x(x%x)=%d\n",&x,x);
pthread_mutex_unlock( &mutex );
return;
}
...
nbytes = 1024;
/* Initialize mutex. */
if( pthread_mutex_init( &mutex, NULL ) != 0 )
{ perror("pthread_mutex_init"); exit(-1); }
/* Initialize barrier. */
barrierinit( ncpus, 0 );
x = -1;
/* Spawn threads. */
for( i = 1; i < ncpus; i++ )
{
buffer[i] = buffer[0]+1;
retval = pthread_create( tid+i, (pthread_attr_t *) NULL,
(void *(*)())thread_sub,
(void *) (buffer) );
if( retval > 0 ) perror("pthread_create");
}
thread_sub( buffer );
exit(0);
}
The qualifier thread_private is unique to HP’s programming model and is not part of
the pthread API definition. This qualifier enables the programmer to allocate thread-specific
storage even before the threads themselves are created. So, each thread will have its own private
copy of x (as we’ll demonstrate shortly). Consider the data structures i and tb (the pointer)
defined locally by thread_sub(). These are allocated on the stack, but the question is, what
stack? When a thread is created, it has its own stack, referred to as a thread stack. So, memory
allocated on the stack in a subroutine called by a thread is actually placed on that thread’s stack.
This feature is a veritable Pandora’s box. Consider for a moment how stacks and heaps work.
They are placed at opposite ends of the address space and “grow” toward one another. Threads
share a process’ address space and, as a result, the multiple thread stacks are preallocated and
have a finite limit to their size. On HP systems, the default size for thread stack is 64 KB.
If we execute the above code with ncpus set to four on an HP N-4000 computer, then the
following output results. The address and value for b[0], tb[0], i, and x are printed for each
thread. Note that the thread id, obtained from pthread_self(), is printed in < >.
Enter<2>
b(x40022060)=A, tb(x40042010)=b, i(x68fb50e0)=2, x(x40022640)=0
Enter<1>
b(x40022060)=b, tb(x400229b0)=a, i(x68ff0a60)=1, x(x400041d0)=-1
Enter<3>
b(x40022060)=a, tb(x40062010)=c, i(x68fa40e0)=3, x(x400227c0)=0
Enter<4>
b(x40022060)=c, tb(x40082010)=d, i(x68f930e0)=4, x(x40022940)=0
Exit <4>
b(x40022060)=d, tb(x40082010)=d, i(x68f930e0)=4, x(x40022940)=4
Exit <3>
b(x40022060)=d, tb(x40062010)=c, i(x68fa40e0)=3, x(x400227c0)=3
Exit <2>
b(x40022060)=d, tb(x40042010)=b, i(x68fb50e0)=2, x(x40022640)=2
Exit <1>
b(x40022060)=d, tb(x400229b0)=a, i(x68ff0a60)=1, x(x400041d0)=1
Good-bye world!, x= 1
There is a lot of interesting behavior demonstrated by this program. First, note that the
buffer allocated in the main program and passed into thread_sub() as b is indeed shared by
all the threads. The virtual address shown for b is the same for all four threads and, as the output
beginning with “Exit” illustrates, the information stored there is indeed shared. We can see this
because the value of b[0] is set by the previous thread as it passes through the first critical sec-
tion in thread_sub(), and that value is read by the next thread passing through the critical sec-
tion. For example, when thread 2 enters the critical section, it finds the character A in b[0].
Before leaving the critical section, it stores the character b into b[0]. Subsequently, when thread
1 enters the critical section, it finds the character b in b[0]. The last thread through the critical
section (thread 4) stores the character d into b[0]. Subsequently, all the other threads read d
from b[0], because b is shared memory.
The other memory allocation done in the main program is that for x. It’s interesting to note
that the process’ main thread (thread id of 1) initialized x to -1, and it is the only thread that
shows a value of -1 upon entering thread_sub(). All the other threads (children) have x set to
zero.
The memory allocated in thread_sub() (i, tb, and the data tb points to), all have
unique addresses and, hence, unique values. Both i and tb are part of the thread’s stack. But the
memory allocated through malloc() within the threads, while being thread-specific data, is all
placed on the parent process’ heap. This is easily discerned by examining the address of b and
comparing it to the addresses for tb, in each thread. As a result, if any given thread addresses
memory outside its range for tb, then data corruption of other threads’ memory will occur.
Finally, we should point out that the thread id numbers returned from pthread_self()
happened to be consecutive and started with one. Generally speaking, you cannot assume that
the thread id numbers of threads that are consecutively created will be consecutive and start with
one.
The size of a thread’s stack is the source of a large number of thread-parallel application
problems. On HP-UX, if the subroutines called by a thread allocate more than 64 KB of local
variables, then data corruption can (and usually will) occur. Many systems add an additional
guard page to the end of each thread’s stack. Accessing memory in this guard page will cause an
error (typically a “bus error”). But modifying memory beyond this page usually results in the
modification of data in another thread’s stack! Moreover, unless data in the guard page is
accessed, the user will likely not see an obvious error, just data corruption.
Unfortunately, a thread’s stack size can’t be changed external to the application (e.g., with
an environment variable). The pthreads API does provide an interface to change the thread’s
attributes, such as stack size. The following code excerpt demonstrates how these interfaces can
be used to increase a thread’s stack size to 128 KB:
pthread_attr_t pattr;
...
/* Initialize the thread attribute object. */
retval = pthread_attr_init( &pattr );
if( retval != 0 ) { perror("pthread_attr_init"); exit(-1); }
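The remaining steps might look like the following sketch; the 128 KB figure comes from the text,
while the thread id array and start routine are carried over from the earlier examples.

    /* Set the stack size to 128 KB ... */
    retval = pthread_attr_setstacksize( &pattr, (size_t) 128*1024 );
    if( retval != 0 ) { perror( "pthread_attr_setstacksize" ); exit(-1); }

    /* ... and create the thread using the modified attribute object. */
    retval = pthread_create( tid+i, &pattr, (void *(*)())thread_sub,
                             (void *) (buffer) );
    if( retval != 0 ) { perror( "pthread_create" ); exit(-1); }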
• round-robin placement,
• first-fault placement,
• fixed placement, and
• page migration
For this discussion, let us define a node as a source of physical memory with a number of
processors associated with it. The association is that these processors can access this source of
physical memory as fast as any other processor. For example, two SMPs that share memory via
some interconnect would represent a system with two nodes (the SMPs). A node defined in this
way is sometimes referred to as a Memory Locality Domain (MLD).
Round-robin page placement describes the practice of allocating memory pages in a
round-robin fashion on each of the nodes in a system. When memory is allocated on the node
associated with the first thread that accesses the page, this is referred to as first-fault, or first
touch, placement. When memory is allocated from a fixed node, e.g., the initial thread of a pro-
cess, this is called fixed placement. Finally, some systems support page migration. This is the
ability to move pages from one node to another based on the number of memory accesses, i.e., if
a thread executing on a particular node is accessing a page far more than others, then the system
will move the page onto that node. The first three placement methods are static. The program
will use that placement method throughout the application’s execution. Page migration is a very
dynamic method of memory allocation, making it very different from the others.
Not all of today’s ccNUMA systems support these placement mechanisms. For example,
HP-UX currently provides only first-fault placement. The default on most systems is first-fault.
SGI’s IRIX supports all four of the placement policies described above. These placement poli-
cies can be realized through the use of environment variables. Since first-fault is fairly common,
threads should initialize and allocate memory that they will most often be accessing.
Let’s consider the filtering example discussed in the section on parallel I/O. Given what
we’ve discussed in this section, we can re-implement this algorithm so that one of the barriers is
removed and it executes efficiently on ccNUMA systems with a first-fault page placement pol-
icy. The input and output buffers can be allocated (but not initialized) by the parent thread just as
before. Then the algorithm would proceed as follows:
So, by having each thread read in all the data necessary to produce its portion of the out-
put, the first barrier shown in Figure 8-6 has been eliminated. Note that the first I/O performed
occurs in Step three, after all threads have initialized their portions of the input and output buff-
ers. There will be some redundancy here since each thread will read in all the data it needs to
perform the filtering operation and, hence, the few rows above and below the filter’s portion of
the input array (see Figure 8-5) will be read in by two threads.
The barrier in Step three ensures that all initialization (and hence first-touch placement) has
occurred as planned before the I/O operations touch the input buffer in a different pattern. For example, suppose we have a
total of 2 threads, a 5 × 5 filter, and each record has 200 rows. Then, without the barrier in Step
three, the second thread may read in rows 99 through 200 (note the two additional rows needed
for its filtering step) before the first thread has finished initializing rows 99 and 100. In this case,
those two rows would be first faulted by the second thread and hence placed on the node that
thread is executing on.
Given the situation we just discussed, we will have some data sharing between threads.
This has been the case all along; we’ve just not looked into the problem in detail. One improve-
ment on the algorithm presented above would be to have each thread allocate its own subset of
rows sufficient to perform its processing. That is, the parent process will not even allocate an
input buffer. Instead, each thread will allocate (if the record is very big, use malloc() to avoid
exceeding thread stack size limit) an input buffer just large enough to contain the rows it needs
for the filtering process. Thus, for the example in the previous paragraph, both threads would
allocate their own input buffers, each of which would be large enough to contain 102 rows.
More than just loops can be made to run in parallel. If you have different subroutines that
you would like to run, then the parallel sections directive can be used. That is, suppose that you
want two threads to execute the subroutine foo() and one thread to execute bar(). Then the
following sequence of code will achieve just that:
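A sketch using OpenMP’s sections construct in C follows (the Fortran directive form is
analogous); foo() and bar() are the hypothetical routines mentioned above.

    #pragma omp parallel sections
    {
        #pragma omp section
        foo();                   /* executed by one thread     */

        #pragma omp section
        foo();                   /* executed by another thread */

        #pragma omp section
        bar();                   /* executed by a third thread */
    }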
Other scheduling strategies provided are static and runtime. Static scheduling will cause
iterations of the loop to be divided into chunks of a size specified by an optional parameter,
chunk_size. These chunks are then assigned to the threads in a round-robin fashion. When no
chunk_size is specified, the total number of iterations is divided into chunks that are approxi-
mately equal in size, with one chunk assigned to each thread. Runtime scheduling indicates that
the parallel loop or parallel section is to be executed with a scheduling policy that is determined
at runtime, based on the user’s environment.
and joining threads is nontrivial. In this case, automatic parallelization is a disaster. Loops that
took only a few microseconds are now taking milliseconds because of the parallel overhead.
Therefore, the 10% of the time spent outside the main loop could easily increase dramatically
(as in an order of magnitude), negating any benefit the main loop may get from parallelism.
This situation is actually the rule rather than the exception. Astfalk [4] demonstrated that
most applications have short loop lengths and usually do not benefit from automatic paralleliza-
tion. This problem can be alleviated somewhat by runtime selection of which loops to execute in
parallel. Unfortunately, this requires a test before every loop to determine if the loop length is
long enough to justify parallel execution. This additional overhead may prove to be prohibitive
as well.
Most programmers know where most of the time is spent in their application. Hence, it is
easy for them to identify just where the parallelism would give the most benefit. If this could be
the only place that parallelism is implemented, then the application is far more likely to show
benefit from parallel processing.
Some compilers enable the programmer to do just this sort of thing. That is, the program-
mer inserts a compiler directive indicating what loop or loops are to be executed in parallel and
then uses a special compiler switch to instruct the compiler to instrument only those areas for
parallelism and no others. To demonstrate these concepts, consider the following subroutines
that perform matrix multiplication.
In the file matmul_dir.f we have the following.
      SUBROUTINE MATMUL_DIR( A, B, C, M, K, N )
      REAL*8 A(M,*), B(K,*), C(M,*)
      DO I1 = 1, M
        DO I2 = 1, N
          C(I1,I2) = 0.0
        END DO
      END DO
!$OMP PARALLEL DO
      DO I2 = 1, N
        DO I3 = 1, K
          DO I1 = 1, M
            C(I1,I2) = C(I1,I2) + A(I1,I3) * B(I3,I2)
          END DO
        END DO
      END DO
      RETURN
      END
In another separate file, matmul_auto.f, we have the following subroutine which also
performs a simple matrix multiplication. Note that there are no directives in this file.
      SUBROUTINE MATMUL_AUTO( A, B, C, M, K, N )
      REAL*8 A(M,*), B(K,*), C(M,*)
      DO I1 = 1, M
        DO I2 = 1, N
          C(I1,I2) = 0.0
        END DO
      END DO
      DO I2 = 1, N
        DO I3 = 1, K
          DO I1 = 1, M
            C(I1,I2) = C(I1,I2) + A(I1,I3) * B(I3,I2)
          END DO
        END DO
      END DO
      RETURN
      END
These two files are compiled with Hewlett-Packard’s Fortran 90 compiler with two
slightly different sets of options as follows:
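The exact command lines are not essential; a plausible pair, with the optimization level being an
assumption, is:

    f90 +O3 +Oparallel +Onoautopar -c matmul_dir.f
    f90 +O3 +Oparallel -c matmul_auto.f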
Note that the file with directives is compiled with the option +Onoautopar. This option
instructs the compiler to parallelize only loops which have parallel directives associated with
them. In the second case, the compiler automatically parallelizes every loop that it deems fit. As
a result, both the initialization loop sequence and the matrix multiplication sequence are made to
execute in parallel. We have experimented with various systems and found that the directive, as
placed in matmul_dir.f, produces the best results. Using three square matrices of dimension
500 × 500, the two routines exhibited startlingly different performance, as shown in Table 8-11.
Comparing the fully automatic parallelism performance of matmul_auto() to that of the direc-
tive-based parallelism, one can see the value of using directives and having the compiler paral-
lelize only those code segments which have directives associated with them.
One side note on the performance summarized in Table 8-11. The total size of the data in
these problems is approaching six MB. Moving from 1 to 2 processors reduces the data accessed
by each processor to roughly three MB. As a result, the data begins to fit efficiently into the N
Class’s one MB, four-way set associative cache. The performance improves dramatically, well
over three times faster on two processors. Such behavior is almost always due to the overall
problem’s being cut up into small enough pieces so that each thread’s data fits into a single pro-
cessor’s cache. Scaling better than the number of processors is generally referred to as superlin-
ear scaling.
main()
{
initialize();
while ( ! done)
{
compute();
data_exchange();
}
gather_results();
print_report();
cleanup();
}
In the exchange phase, a process typically sends data from the same set of output buffers
at every iteration. Likewise, it receives data into the same set of input buffers.
There are several ways to program this exchange. In general, the list of neighbors for a
given process may not form a simple regular graph. In these cases, the most effective approach is
to use MPI’s persistent requests, along with the MPI_Startall() and MPI_Waitall() rou-
tines. For a two-dimensional simulation example, this could be done as follows:
MPI_Request req[8];
MPI_Status stat[8];
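A sketch of the exchange follows; the neighbor ranks (north, south, east, west), buffer names,
counts, and communicator are placeholders for the simulation’s actual data.

    /* Create the persistent requests once, outside the main loop: sends first. */
    MPI_Send_init( send_n, count, MPI_DOUBLE, north, 0, comm, &req[0] );
    MPI_Send_init( send_s, count, MPI_DOUBLE, south, 1, comm, &req[1] );
    MPI_Send_init( send_e, count, MPI_DOUBLE, east,  2, comm, &req[2] );
    MPI_Send_init( send_w, count, MPI_DOUBLE, west,  3, comm, &req[3] );
    MPI_Recv_init( recv_s, count, MPI_DOUBLE, south, 0, comm, &req[4] );
    MPI_Recv_init( recv_n, count, MPI_DOUBLE, north, 1, comm, &req[5] );
    MPI_Recv_init( recv_w, count, MPI_DOUBLE, west,  2, comm, &req[6] );
    MPI_Recv_init( recv_e, count, MPI_DOUBLE, east,  3, comm, &req[7] );

    while( ! done )
    {
        compute();
        MPI_Startall( 8, req );            /* trigger all eight transfers      */
        MPI_Waitall( 8, req, stat );       /* wait for all of them to complete */
    }

    for( i = 0; i < 8; i++ )
        MPI_Request_free( &req[i] );       /* destroy the requests afterwards  */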
Creating and destroying the eight requests outside of the main computation loop avoids
this overhead in each iteration. Using MPI_Startall() to initiate the exchange allows MPI to
trigger all eight transfers more efficiently than if they were done via eight separate
MPI_Isend() and MPI_Irecv() calls. Likewise, using MPI_Waitall() to block until all
eight transfers complete allows MPI to optimize its actions and reduce the number of redundant
checks it would have to make if this were programmed using individual MPI_Wait() calls, or,
worse, individual MPI_Test() calls. Creating the four send requests before the receive requests
lets MPI accelerate the sending of the data. This helps reduce the number of unnecessary status
checks that MPI makes on behalf of the receive requests.
This method of programming the data exchange is quite flexible and can be very easily
adapted to less regular communication patterns.
8.8 Summary
In this chapter, we have presented the basic premise of parallel programming. SPMD pro-
gramming concepts were illustrated in some detail along with some of the pitfalls involved in
getting good scaling without shared-memory constructs.
Two important differences in parallel process scheduling were illustrated: static work dis-
tribution and dynamic scheduling. Quite often, static scheduling enjoys the advantage of cache
re-use from previous iterations, while dynamic scheduling enjoys efficient load balancing of the
threads (or processes, as the case may be).
Locks and barriers are important mechanisms in parallel processing, but they should be
used sparingly. They can often cause unforeseen problems, as was demonstrated in the loss of
affinity when using barriers for the filtering example in Section 8.3. Unfortunately, there is not
yet a portable means of increasing (or decreasing) a thread’s affinity for a processor. But many
systems provide tools to accomplish this, as was demonstrated with HP-UX’s mpctl() inter-
face.
Compiler directives can certainly make parallel programming easier. They can also
remove the details from the application programmer, making it difficult to enjoy the robustness
of the pthreads specification. The use of directives is almost orthogonal (but not quite) to the
approach taken with explicit use of pthreads and/or processes with MPI.
Parallel I/O is, and will continue to be, an important part of parallel programming. As pro-
cessor performance continues to outpace the performance of data storage access (memory, disk,
etc.), we will see more and more emphasis placed on transferring data asynchronously. The
recent inclusion of I/O interfaces into the MPI standard is a good example of this.
Parallel programming can be very exciting. Achieving linear (sometimes superlinear) scaling on an application when executing on multiple processors is a very rewarding experience. The constructs discussed in this chapter have hopefully whetted your appetite for parallel programming. There are many excellent resources for the parallel programmer; several are given as references at the end of this chapter.
References:
The following publications are excellent resources for parallel programming and the other topics discussed in this section.
7. Gropp, W; Lusk, E.; Skjellum, A. Using MPI: Portable Parallel Programming with the
Message-Passing Interface, MIT Press, 1994. ISBN 0-262-57104-8.
8. Pacheco, P. Parallel Programming with MPI, Morgan Kaufmann, 1996. ISBN
0-262-69184-1.
9. Hewlett-Packard Company. HP MPI User’s Guide, 1999. Part Number B6011-96010.
10. Hewlett-Packard Company. Exemplar Programming Guide, 1997. Part Number
B6056-96002.
11. Dowd, K. High Performance Computing, O’Reilly, 1993. ISBN 1-56592-312-X.
12. Norton, S. J.; DiPasquale, M. D. Thread Time: The MultiThreaded Programming Guide,
Prentice-Hall Professional Technical Reference, 1996. ISBN 0-13-190067-6.
13. Curry, D. A. Using C on the UNIX System, O’Reilly and Associates, 1991. ISBN
0-937175-23-4.
14. Silicon Graphics, Incorporated. Origin2000 and Onyx2 Performance Tuning and Optimi-
zation Guide, 1998.
15. Hewlett-Packard Company. Parallel Programming Guide for HP-UX Systems, 2000. Part
Number B3909-90003.
PART 3
Applications — Using the Tools
CHAPTER 9
High Performance Libraries
9.1 Introduction
Consider again the scenario where you are a carpenter. Suppose you go to a job and find
that you need a hammer, saw, sawhorses, and nails. Then you’ll go back to your smelter, fire it
up, add a sufficient amount of iron ore, and create a few kilograms of pure iron. Once you’ve
done that, you’ll take this iron to your forge and start beating out a hammer head, saw blade and
nails. While these pieces are cooling off, you’ll cut down an oak tree (or possibly hickory) and
take a good size log or two to your saw mill. There you will cut a couple of small pieces for the
hammer and saw handles and several long and narrow pieces, say 2 × 4’s, for the sawhorses.
After producing a suitable hammer handle and saw handle in your wood shop, the hammer and saw can be assembled, and subsequently the sawhorses can be built. Now, at last, you are ready
to start the carpentry job.
Is this typical behavior for a carpenter? Of course not! No carpenter in his right mind will
recreate his tools for every job. Similarly, no software developer in his right mind will recreate
subprograms for every job. This is, of course, why there are so many software libraries around,
including standard system libraries such as libc, as well as task-specific third-party packages like Syncsort and CoSORT, which provide high performance routines for sorting data.
This chapter will discuss some of the commonly used high performance libraries that are
available today. The focus will be on mathematical libraries, but others will be discussed as well.
9.2.1 BLAS
The BLAS (Basic Linear Algebra Subprograms) are routines for performing basic vector
and matrix operations. They are generally referred to as being divided up into three groups or
levels. Level 1 BLAS are for vector operations, Level 2 BLAS are for matrix-vector operations,
and Level 3 BLAS do matrix-matrix operations. The BLAS were begun in 1979 by Lawson
et al [1], and continued to grow through the 1980s with the final additions, Level 3, being made
in 1990 by Dongarra et al [2].
While the BLAS have been used for many years with much success, they never became a bona fide standard. The BLAS are, however, a strong and well-accepted ad hoc standard, though one with some limitations in functionality. For example, they do not provide a routine for some common operations used in many calculations, such as
DO I=1, N
Z(I) = ALPHA * X(I) + Y(I)
END DO
Most new software development is done in languages other than Fortran. For example, far
more development is being done in C and C++ than in Fortran. Given the differences in subrou-
tine calling convention (Fortran passes arguments by address while C and C++ pass them by
value), it is not straightforward to call BLAS routines from C or C++.
The combination of these issues led a group of folks, referred to as the BLAS Technical
Forum, to develop a standard for the BLAS. The standard was nearly complete in 1999 and pro-
vides Fortran 77, Fortran 95, and C interfaces to all subprograms. It is largely a superset of the
original (legacy) BLAS. Given that the standard is just emerging at the time of this writing, it is
unclear how well-accepted it will be. The C interfaces are an improvement, but the argument
lists are sometimes cumbersome and contain some Fortran-centric features that may or may not
be useful with languages other than Fortran.
9.2.2 LINPACK
The linear algebra package LINPACK (LINear algebra PACKage) is a collection of For-
tran subroutines for use in solving and analyzing linear equations and linear least-squares prob-
lems. It was designed for computers in use during the 1970s and early 1980s. LINPACK
provides subroutines which solve linear systems whose matrices are dense, banded, symmetric,
symmetric positive definite, or triangular. It is based on the Level 1 routines in the legacy BLAS.
9.2.3 EISPACK
EISPACK (EIgenvalue Software PACKage) by Smith [3], Garbow [4] and others is a col-
lection of subroutines that compute eigenvalues and eigenvectors for several types of matrices.
It, too, was developed in the late 1970s for computers of that era.
9.2.4 LAPACK
Arguably the most successful public domain library for linear algebra is LAPACK (Linear
Algebra PACKage) by Anderson et al [5]. The original project was to enable the EISPACK and
LINPACK libraries to execute efficiently on shared-memory vector and parallel processors.
LAPACK improved the basic algorithms by redesigning them so that memory access patterns
were more efficient. Moreover, the subroutines in LAPACK are written so that, wherever possi-
ble, calls to the Level 2 and Level 3 routines of the legacy BLAS subprograms are made. As will
be discussed in Chapters 10 and 11, the Level 2 and Level 3 BLAS routines are much more effi-
cient for today’s computer architectures. This has also proven to be a huge benefit to users, as
almost all computer vendors today provide highly tuned legacy BLAS libraries. So, LAPACK
can be built with these faster libraries and provide dramatically better performance than building
the entire package from scratch. Users should use LAPACK rather than LINPACK or EISPACK
as it is superior to both in functionality as well as performance.
9.2.5 ScaLAPACK
The ScaLAPACK (Scalable Linear Algebra PACKage) features a subset of the LAPACK
library which has been architected to execute in parallel. It is designed primarily for distributed-
memory parallel computers and uses its own communication package BLACS (Basic Linear
Algebra Communication Subprograms). BLACS can be built with MPI or PVM. ScaLAPACK,
LAPACK, and BLAS can all be obtained from the Netlib Repository at the University of Ten-
nessee—Knoxville and Oak Ridge National Laboratory. The URL for this repository is http://netlib.org/.
9.2.6 PLAPACK
Another parallel linear algebra package is PLAPACK (Parallel Linear Algebra PACKage)
developed by van de Geijn [6] and Alpatov et al [7]. PLAPACK is an infrastructure for coding
linear algebra algorithms at a high level of abstraction. By taking this approach to paralleliza-
tion, more sophisticated algorithms can be implemented. As has been demonstrated on
Cholesky, LU and QR factorization solvers, PLAPACK allows high levels of performance to be
realized. See http://www.cs.utexas.edu/users/plapack/ for more information on
PLAPACK.
9.3.1 FFTPACK
One of the most popular signal processing libraries is FFTPACK. This is a package of For-
tran subprograms for the Fast Fourier Transform (FFT) of periodic and other symmetric
sequences. This package, developed by Swarztrauber[8], is available through the netlib reposi-
tory (URL provided above) and provides complex, real, sine, cosine, and quarter-wave trans-
forms. Version 4 of this package was made available in 1985.
9.3.2 VSIPL
9.4.1 PHiPAC
One of the first attempts at providing a library of linear algebra routines which is automat-
ically tuned for a particular architecture was provided by Bilmes et al [9]. This package provides
Portable High-Performance, ANSI C (PHiPAC) linear algebra subprograms. Building subprograms, such as matrix multiplication, with PHiPAC first requires a "training" session: the user runs a sequence of programs on the machine for which the subprogram is being built, and these programs identify an optimal set of parameters for that subprogram and machine. This
training can take a while. For example, on a dedicated Hewlett-Packard N-4000 machine, it took
28 hours to build the general matrix multiplication routine dgemm(). This is a nontrivial amount
of dedicated time on a 1.7 Gflop/s processor. This training needs to be done only once, but it
should be done on a dedicated system; otherwise, the parameters it identifies may not be opti-
mal. Version 1.0 of PHiPAC produces a good version of dgemm() on the Hewlett-Packard N-
4000. It provides performance that is far better than the public domain Fortran reference BLAS
dgemm() routine when compiled with HP’s Fortran 90 compiler. It does seem to have trouble
with problems that do not fit into cache, as is illustrated in Figures 9-1 and 9-2.
9.4.2 ATLAS
Another automatically tuned package is the ATLAS (Automatically Tuned Linear Algebra
Software) library. Version 1.0 of ATLAS was made available by Whaley et al [10] through the
netlib repository. ATLAS puts most, if not all, of its system specific information into the design
of a single subroutine. Other, higher level, routines are then able to reuse the system specific rou-
tine efficiently. The end result is that the complexity of the tuning is dramatically reduced
because only a single routine needs to be generated for a particular platform.
As was the case with PHiPAC, ATLAS produced a dgemm() that was much better than
that obtained with the public domain BLAS reference dgemm() routine compiled with HP’s
Fortran 90 compiler on an HP N-4000 machine. The results of dgemm() performance on various sizes of matrices with PHiPAC, ATLAS, and HP's MLIB, version 7.0, are presented in Figure 9-1.
Figure 9-2 gives dgemm() performance when the matrices being multiplied are transposed. Note
that ATLAS 1.0 automatically generates all the legacy BLAS. PHiPAC 1.0, on the other hand,
has scripts to generate the dgemm() subroutine, but not very many other routines within BLAS.
9.4.3 FFTW
The self-tuning approach to mathematical libraries used in PHiPAC and ATLAS is not limited to linear algebra. The Fastest Fourier Transform in the West (FFTW) is a portable C pack-
age for computing multidimensional, complex, discrete Fourier transforms (DFT). The authors
of FFTW, Frigo and Johnson [11], take a slightly different approach to performance tuning than
PHiPAC and ATLAS do. That is, FFTW doesn’t require a "training" session to define a set of
parameters for the particular architecture on which it is to be built and executed. Instead, FFTW
performs the optimization at runtime through an explicit call to a setup program, referred to as
the "planner." The planner uses a dynamic programming algorithm to identify a near optimal set
of parameters that are subsequently used by the "executor." So, calculating an FFT for a given
problem size requires two subroutine calls, one to the planner and another to the executor. Note,
however, that a single plan can be reused for other problems, provided they are the same size.
The performance of FFTW is quite good. At the time of its announcement, September
1997, it was faster than any other public domain FFT package available and faster than some
computer vendor libraries. A performance comparison, using FFTW’s timers, of FFTW 2.1.3
with FFTPACK and HP’s MLIB 7.0 is given in Figure 9-3.
9.6 Summary
In summary, there are a number of software library packages, available both free of charge and commercially, that provide high performance for commonly used algorithms. Lower level routines, such as those in the BLAS, can be found that perform at or near the best possible rate on a given platform. When other packages are built with these subprograms, the user can typically expect good performance. All of the packages outlined above will usually perform as well as, or better than, those you develop yourself, and they allow you to focus on application-specific development rather than common subroutine development.
References:
The following publications are excellent resources for the libraries discussed in this sec-
tion:
1. Lawson, C.L.; Hanson, R.J.; Kincaid, D.; Krogh, F.T. Basic Linear Algebra Subprograms for FORTRAN Usage, ACM Trans. Math. Soft., Vol. 5, pp. 308-323, 1979.
2. Dongarra, J.J.; Du Croz, J.; Duff, I.S.; Hammarling, S. Algorithm 679: A Set of Level 3
Basic Linear Algebra Subprograms, ACM Trans. Math. Soft., Vol. 16, pp. 18-28, 1990.
3. Smith, B.T.; Boyle, J.M.; Dongarra, J.J.; Garbow, B.S.; Ikebe, Y.; Klema, B.C.; Moler,
C.B. Matrix Eigensystem Routines—EISPACK Guide, Volume 6 of Lecture Notes in Com-
puter Science. Springer-Verlag, 1976. ISBN 0387075461.
4. Garbow, B.S.; Boyle, J.M.; Dongarra, J.J.; Moler, C.B. Matrix Eigensystem Routines—
EISPACK Guide Extension, Volume 51 of Lecture Notes in Computer Science. Springer-
Verlag, 1977.
5. Anderson, E.; Bai, Z.; Bischof, C.; Blackford, S.; Demmel, J.; Dongarra, J.; Du Croz, J.;
Greenbaum, A.; Hammarling, S.; McKenney, A.; Sorensen, D. LAPACK Users’ Guide.
SIAM, 1999. ISBN 0-89871-447-8.
6. van de Geijn, R.A. Using PLAPACK. MIT Press, 1997. ISBN 0-262-72026-4.
7. Alpatov, P; Baker, G.; Edwards, C.; Gunnels, J.; Morrow, G.; Overfelt, J.; van de Geijn,
R.A.; Wu, Y.J. PLAPACK: Parallel Linear Algebra Package, Proceedings of the SIAM
Parallel Processing Conference, 1997.
8. Swarztrauber, P.N. Vectorizing the FFTs, Parallel Computations (G. Rodrigue, ed.). Aca-
demic Press, 1982.
9. Bilmes, J.; Asanovic, K.; Demmel, J.; Lam, D.; Chin, C.W. PHiPAC: A Portable, High-
Performance, ANSI C Coding Methodology and Its Application to Matrix Multiply,
LAPACK Working Note 111, University of Tennessee, 1996.
10. Whaley, R.C.; Dongarra, J.J. Automatically Tuned Linear Algebra Software (ATLAS), Technical Report, University of Tennessee. http://netlib.org/atlas
11. Frigo, M.; Johnson, S.G. FFTW: An Adaptive Software Architecture for the FFT, ICASSP Proceedings, Vol. 3, p. 1381, 1998.
12. Allan, R. J.; Hu, Y. F.; Lockey, P. A Survey of Parallel Numerical Analysis Software, Tech-
nical Report 99-01, CLRC Daresbury Laboratory, 1999.
13. Hewlett-Packard. HP MLIB User’s Guide (VECLIB and LAPACK), Hewlett-Packard Part
Number B6061-96010, 1999.
CHAPTER 10
Mathematical Kernels: The Building Blocks of High Performance
• Reuse of data
• Reuse of code
Vendors of hardware and software provide libraries containing commonly used routines.
Writing code to use routines contained in these libraries also improves performance, since com-
puter hardware and software vendors put a lot of effort into ensuring that these standard routines
perform well. You can think of these routines as the nails and glue that hold together more com-
plicated structures. In the linear algebra arena, there is a rich set of building blocks that can be
used to construct high performance mathematical algorithms. These are the Basic Linear Alge-
bra Subprograms (BLAS). This chapter also discusses how some of these routines may be struc-
tured for parallel execution.
10.2 BLAS
The BLAS have a long history compared to most software. Over 25 years ago, the first
BLAS routines were defined. They were written in Fortran and contained routines for four For-
tran data types: REAL*4, REAL*8, COMPLEX*8, and COMPLEX*16. Most of the examples
below use REAL*8 routines since these are the most widely used BLAS routines. The first
BLAS routines became known as the Level 1 BLAS. Level 1 just refers to the routines being
composed of a single loop. Later, Level 2 (two loops) and Level 3 (three loops) BLAS routines
were defined. Recently, the BLAS routines have been generalized for other languages and data
types by the BLAS Technical Forum to create the BLAS Standard as discussed in Chapter 9.
The earlier BLAS definitions are now referred to as the Legacy BLAS. Table 10-1 shows the
history of the BLAS.
When the Level 1 BLAS routines were originally defined, they had good performance on
vector computers, but poor performance on other types of computers since they had very little
reuse of data. At the time, all high performance computers were vector computers, so it didn’t
matter if performance on slower CISC processors was poor. No one was too concerned because
it was assumed that future high performance computers would be vector computers. Then clock
periods of CISC processors started dropping toward the domain of vector computers, RISC pro-
cessors appeared, and vector computers remained very expensive, so people started getting
interested in making non-vector processors run well.
Having the Level 1 BLAS building blocks proved quite useful from a software design
standpoint. The LINear algebra PACKage (LINPACK) was built on top of the Level 1 BLAS.
This library contains routines to solve dense systems of equations. If a hardware vendor could
ensure that a few key Level 1 BLAS routines performed well on her computers, then all the LIN-
PACK software would run well. However, the Level 1 BLAS routines ran at rates approaching
the processor’s theoretical peak only on very expensive vector computers.
So why did the Level 1 BLAS run well on vector computers and not anywhere else? Vec-
tor computers have memory systems that can deliver one or more elements per clock cycle.
Other processors depend on cache to achieve good performance, but this works well only if the
data in cache can be reused. Level 1 BLAS routines have very little reuse of data. They usually
operate on two vectors at a time. To reuse data in cache requires routines based on matrices of
data. This was the rationale for the creation of the Level 2 and Level 3 BLAS.
The Level 2 BLAS contains things like matrix-vector multiplication, while the Level 3
BLAS contain matrix-matrix multiplication. If the Level 3 BLAS routines are correctly opti-
mized, processors with caches can achieve very good performance on these kernels. When algo-
rithms are built around the Level 3 BLAS, performance is good across all computer platforms.
The recent BLAS Standard generalizes the original Fortran routines to support Fortran 77,
Fortran 95, and C. The Fortran 95 routine names are the most generic since the language allows
users to call generic routines which call routines for the appropriate data type.
The following example compares the original BLAS and the BLAS Standard. One of the
simplest Level 1 BLAS routines is the vector copy routine DCOPY. It appears as
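follows; the sketch below is patterned after the reference implementation, with the argument checks and loop unrolling omitted:
SUBROUTINE DCOPY( N, X, INCX, Y, INCY )
INTEGER N, INCX, INCY, I, IX, IY
REAL*8 X(*), Y(*)
IX = 1
IY = 1
IF( INCX .LT. 0 ) IX = (-N+1)*INCX + 1
IF( INCY .LT. 0 ) IY = (-N+1)*INCY + 1
DO I = 1,N
Y(IY) = X(IX)
IX = IX + INCX
IY = IY + INCY
ENDDO
RETURN
END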
Note that when INCX or INCY is negative, the accesses of X or Y start at the end of the array and move backwards through memory. The functionality of DCOPY is also contained in the BLAS Standard, but the name has been changed to support other languages.
For example in Fortran 77, the routine is named F_DCOPY and is called as
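follows, with an argument list that presumably mirrors the legacy routine:
CALL F_DCOPY( N, X, INCX, Y, INCY )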
while in C the routine is named c_dcopy and has the calling sequence
void c_dcopy (int n, const ARRAY x, int incx, ARRAY y, int incy);
In Fortran 95 the name of the routine is COPY . The offset and stride are not needed in the
Fortran 95 routine. To perform the same functionality as INCX = -1, one can pass the expression
X(1:1+(N-1)*INCX) to COPY. The calling sequence for COPY is
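presumably just
CALL COPY( X, Y )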
multiplication, but others don’t. Division operations are very slow. If someone asks you to per-
formance a floating-point division, or even worse, an integer division, just say no! Obviously
there are occasions when you must perform these operations, but anything you can do to avoid
them is worthwhile. Basic operations with complex arithmetic map to one or more of the corre-
sponding floating-point operations and will also be discussed.
Scalar operations in order of increasing difficulty and inefficiency are
The next sections discuss how to reduce some of these to more efficient operations.
unsigned int i, j;
/* multiplication by 2040 */
j = (i << 11) - (i << 3);
unsigned int i, j;
/* division by 8 */
j = i >> 3;
the result back to integer. 32-bit integers may be converted to 64-bit IEEE floating-point num-
bers and division achieved as follows:
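One way this might look in C (for 32-bit operands the truncated floating-point quotient equals the integer quotient, since both operands are exactly representable as doubles):
int i, j, k;
/* divide 32-bit integers using the floating-point divider */
k = (int)( (double)i / (double)j );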
There are also IEEE formats for double-extended (64-bit mantissa) and quad-precision
(113-bit mantissa) data. If floating-point division is more efficient in either of these modes than
64-bit integer division, then 64-bit integers may be converted to these floating-point precisions,
the division performed and the result converted back to integer.
Y=2*X
can be replaced by
Y=X+X
It’s a good idea to check compiler-generated code to ensure the compiler is generating efficient
code.
Note that if fma instructions are available, two fma and two multiplication instructions may be
used.
In the past, floating-point multiplication was sometimes more expensive to perform than
floating-point addition. Therefore, the following algorithm was used to replace the four multipli-
cations and two additions by three multiplications and five additions. This technique is not prof-
itable for scalars on today’s computers since floating-point multiplication and addition usually
take the same amount of time to perform. Efficient fma instructions also make the original code
perform faster than the reduced multiplication code. This idea is important later in this chapter,
so it’s worth remembering. Let X = (xr , xi), Y = (yr , yi) be two complex numbers, and s1, s2, s3
be real numbers. The complex multiplication may be performed as follows:
s1 = xr(yr − yi)
s2 = yr(xr + xi)
s3 = yi(xr − xi)
(xr , xi) (yr , yi) = (s1 + s3 , s2 − s1)
(xr , xi) / s = (xr / s , xi / s)
If you’re using complex numbers extensively, it’s worth checking the compiler-generated
assembly code to verify that the compiler is doing a good job of code generation.
(xr , xi) / (yr , yi) = [(xr , xi) (yr , −yi)] / [(yr , yi) (yr , −yi)]
= (xr yr + xi yi , xi yr − xr yi) / (yr yr + yi yi)
This performs two real divisions, but, as discussed in Chapter 7, it isn't numerically safe. If
the components of Y are very large, calculating the denominator may lead to an overflow. So
compiler writers don’t implement this technique except as an unsafe optimization. If you know
the data is not near the extreme value, this is a valid and efficient approach, though. What is a
safe way to divide two complex numbers? One way is to ensure that none of the intermediate
values overflows. This can be accomplished by finding the absolute value of each Y component
and normalizing by this as the following code shows:
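One common formulation (a sketch of Smith's algorithm, with illustrative variable names) scales the calculation by the Y component with the larger absolute value:
REAL*8 XR, XI, YR, YI, ZR, ZI, T, D
IF( ABS(YR) .GE. ABS(YI) ) THEN
T = YI / YR
D = YR + YI * T
ZR = (XR + XI * T) / D
ZI = (XI - XR * T) / D
ELSE
T = YR / YI
D = YI + YR * T
ZR = (XR * T + XI) / D
ZI = (XI * T - XR) / D
ENDIF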
This is not quite as efficient as the earlier code since it requires three floating-point divisions.
This is how Fortran compiler writers implement complex division. Thus, if you are a Fortran
programmer and know your data is not very large, you may want to implement the fast (but
unsafe) code. If you are a C programmer, you can implement the one that is most appropriate for
your data.
data. This optimization is useful only if the cost of two REAL*8 divisions is less than the cost of
three REAL*4 divisions.
COMPLEX*8 X, Y, Z
REAL*8 XR, XI, YR, YI, ZR, ZI, D
XR = REAL(X)
XI = AIMAG(X)
YR = REAL(Y)
YI = AIMAG(Y)
D = YR*YR + YI*YI
ZR = (XR*YR + XI*YI)/D
ZI = (XI*YR - XR*YI)/D
Z = CMPLX(ZR,ZI)
The bzero() and memset() functions may be optimized to use the eight-byte integer registers
found on modern computers. The hardware in many RISC computers assumes that eight-byte
load and store operations occur on eight-byte boundaries. If this is not the case, executables may
abort on a non-aligned memory operation. Thus, a prologue is executed until the data is
eight-byte aligned. An epilogue may also be needed to cleanup the last few steps.
/* Sketch of an optimized memset; the function name and argument order below
   are illustrative, patterned after memset(x, value, n). */
void fast_memset( char *x, int value, unsigned long n )
{
   unsigned long i, n1, n2;
   unsigned int uvalue;
   long long int value8, *x8ptr;
   if( n < 16 )
   {
      for( i = 0; i < n ; i++ )
         x[i] = (char)value;
   }
   else
   {
      /* prologue: move to 8 byte boundary */
      n1= 8 - (((unsigned long)x) & 0x07);
      n1= 0x07 & n1;
      for( i = 0; i < n1; i++ )
         x[i] = (char)value;
      /* x+n1 is aligned on 8 byte boundary; replicate the byte value
         across all eight bytes of a 64-bit integer */
      uvalue = (unsigned int) value & 0xff;
      value8 = (long long int) uvalue;
      value8 = value8 << 8 | value8;
      value8 = value8 << 16 | value8;
      value8 = value8 << 32 | value8;
      x8ptr = (long long int *)(x + n1);
      /* n2 = remaining # of bytes / 8 */
      n2 = (n - n1) >> 3;
      for( i = 0 ; i < n2; i++ )
         x8ptr[i] = value8;
      /* epilogue: cleanup, start at n1 + n2*8 from start of x */
      for( i = n1 + (n2 << 3); i < n; i++ )
         x[i] = (char)value;
   }
   return;
}
The memory copy routines (e.g., bcopy() and memcpy()) may also be optimized to use eight-byte registers, as in the memset() routine above.
Another copy routine is the Level 1 BLAS routine DCOPY. The unit stride version performs
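in essence the following loop:
DO I = 1,N
Y(I) = X(I)
ENDDO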
There’s very little that can be done to optimize this routine other than writing it in the most
efficient assembly code possible.
In the BLAS Standard, this has been generalized to the routine DAXPBY (ALPHA times X
plus BETA times Y). The Fortran 77 version of this, F_DAXPBY, is
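in its unit stride form, essentially
DO I = 1,N
Y(I) = BETA * Y(I) + ALPHA * X(I)
ENDDO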
There is not much that can be done to improve compiler generated code for DAXPY or
F_DAXPBY. In the DAXPY routine, the scalar A can be checked to see if it is zero and the routine
exited. F_DAXPBY can also implement special case code when ALPHA or BETA equals zero or
one. The scalars should be hoisted outside the loops and the loops unrolled to reduce the effect
of latencies. For large N, the code should also include prefetch instructions. Most of these opti-
mizations should be implemented by the compiler. Note that applications written to use DAXPY
may run less efficiently if the routine is replaced by a call to F_DAXPBY. This is because
DAXPY requires only one multiplication and addition (or one fma instruction), while F_DAXPBY
requires two multiplications and one addition (one fma and one multiplication).
10.4.3.2 ZAXPY
A COMPLEX*16 vector accumulation Level 1 BLAS routine is ZAXPY. The unit stride
version appears as follows:
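in essence (with the scalar A and the vectors X and Y declared COMPLEX*16):
COMPLEX*16 A, X(*), Y(*)
DO I = 1,N
Y(I) = Y(I) + A * X(I)
ENDDO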
This requires one complex multiplication and one complex addition. These map to four
real multiplications and four real additions, as demonstrated below.
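Writing (AR, AI), (XR, XI), and (YR, YI) for the real and imaginary parts of A, X(I), and Y(I), the update is
YR = YR + ( AR * XR - AI * XI )
YI = YI + ( AR * XI + AI * XR )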
On computers with fma instructions, this requires two multiplications, two fma's, and two additions. However, if the order of operations is changed to
YR = ( YR + AR * XR ) - AI * XI
YI = ( YI + AR * XI ) + AI * XR
then each component can be computed with two fused multiply-add (or multiply-subtract) instructions, for a total of four.
DDOT = 0.0D0
DO I = 1,N
DDOT = DDOT + X(I) * Y(I)
ENDDO
In the BLAS Standard for Fortran 77, the function DDOT is replaced by the subroutine
F_DDOT , which performs the following:
DDOT = 0.0D0
DO I = 1,N
DDOT = DDOT + X(I) * Y(I)
ENDDO
R = BETA * R + ALPHA * DDOT
The scalar DDOT should be hoisted and sunk from the DO loop. One potential problem
with DDOT is that the same floating-point register is usually used for the reduction. This limits
the speed of DDOT to that of the processor’s multiply-add floating-point latency. Multiple sum
reductions may be used, but this changes the order of operations and can produce slightly differ-
ent answers than the original code. The number of sum reductions used should be equal to the
floating-point latency multiplied by the number of multiply-add pairs that can be executed per
cycle, divided by the number of load instructions that can be executed per clock cycle. So, if an
fma instruction with a four cycle latency and two loads and two fused multiply-adds can be exe-
cuted per clock cycle, then the four-way unrolling below is sufficient to achieve good perfor-
mance.
DDOT1 = 0.0D0
DDOT2 = 0.0D0
DDOT3 = 0.0D0
DDOT4 = 0.0D0
NEND = 4*(N/4)
DO I = 1,NEND,4
DDOT1 = DDOT1 + X(I) * Y(I)
DDOT2 = DDOT2 + X(I+1) * Y(I+1)
DDOT3 = DDOT3 + X(I+2) * Y(I+2)
DDOT4 = DDOT4 + X(I+3) * Y(I+3)
ENDDO
DDOT = DDOT1 + DDOT2 + DDOT3 + DDOT4
DO I = NEND+1,N
DDOT = DDOT + X(I) * Y(I)
ENDDO
R = BETA * R + ALPHA * DDOT
The following table compares the Level 1 BLAS routines we’ve discussed.
Routine     Operation            Memory operations   Floating-point operations   F:M ratio
                                 per iteration       per iteration
DCOPY       yi = xi                     2                      0                   0.00
DAXPY       yi = yi + a xi              3                      2                   0.67
F_DAXPBY    yi = β yi + α xi            3                      3                   1.00
ZAXPY       yi = yi + a xi              6                      8                   1.33
In general, DDOT and ZAXPY are more efficient than DAXPY since they have a higher ratio of floating-point operations to memory operations than DAXPY does.
10.4.5 Parallelism
The routines in this section are embarrassingly parallel. That is, they may be run on multi-
ple processors with little or no communication necessary between the processors. Before run-
ning code in parallel, you need to determine if it is profitable to do so. If you know that the entire
data fits in the data cache, then it’s not worth running in parallel since the act of dividing the data
across processors will take more time than just performing the calculations on a single processor.
So, let’s assume that you want to perform a DAXPY and N is several million in size. If the com-
puter has M processors, then each processor can be given a range of size N/M to process.
Shared-memory parallelism is easily achieved using OpenMP directives.
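For the DAXPY operation, a minimal sketch (assuming a static split of the iterations) might look like:
C$OMP PARALLEL SHARED (A,X,Y,N) PRIVATE(I)
C$OMP DO SCHEDULE(STATIC)
DO I = 1,N
Y(I) = Y(I) + A * X(I)
ENDDO
C$OMP END DO
C$OMP END PARALLEL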
Using the compile line -WGkeep creates an output file so you can verify that the code was
split up as described above. You’ll see something similar to the following, where MPPID is the
processor id and MPPNPR is the number of processors:
Vector zero, copy, and other scalar vector operations can all be parallelized in a similar
fashion.
Dot products are more complicated to run in parallel since each point contributes to the
reduction. Parallelism can be obtained by splitting the work up in chunks of size N/M, but each
processor must maintain its own partial sum. At the conclusion of processing, each of the partial
sums is sent to a single processor which sums them to obtain the final solution. This can be
achieved using the attribute REDUCTION of the DO OpenMP directive, or the CRITICAL direc-
tive. The original dot product code can be modified for parallelism as follows:
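A sketch using the REDUCTION attribute (the unrolling discussed above is omitted for clarity):
DDOT = 0.0D0
C$OMP PARALLEL SHARED (X,Y,N,DDOT) PRIVATE(I)
C$OMP DO REDUCTION(+:DDOT)
DO I = 1,N
DDOT = DDOT + X(I) * Y(I)
ENDDO
C$OMP END DO
C$OMP END PARALLEL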
This technique should be applied to a dot product that has first been unrolled to hide the
instruction latency as discussed above.
10.5.1.1 F_DGE_COPY
The Fortran 77 routine F_DGE_COPY copies a matrix A (or its transpose) to the matrix B.
When the data is not transposed, the routine performs
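in essence (ignoring the scalar arguments and non-unit leading dimensions):
DO J = 1,N
DO I = 1,M
B(I,J) = A(I,J)
ENDDO
ENDDO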
This has unit stride accesses and performs well; however, the transpose case
DO J = 1, N
DO I = 1, M
B(I,J) = A(J,I)
ENDDO
ENDDO
is not so straightforward. Figure 10-1 shows the array access patterns for a matrix transpose.
In this form, A has non-unit stride access and B has unit stride access. If the order of the
loops is switched, then A has unit stride accesses and B has non-unit stride accesses. Either way,
one of the accesses will not be optimal. This type of code was discussed in Chapter 5 in the sec-
tion on blocking.
By unrolling the outer loop, multiple columns and rows may be accessed and data reuse
will increase. Suppose the cache line size is 64 bytes and data is eight bytes long. Thus, each
cache line contains eight elements, so an unrolling factor of eight would allow an entire cache
line to be accessed at a time.
Suppose the length of the inner loop is very long and that the amount of data in the A and
B arrays is many times the size of the data cache. Each time A is referenced, the data must be
reloaded from cache. By blocking the code as shown in the following source and in Figure 10-2,
there are many uses of the array A in cache before a column of B displaces it.
BLOCK = 1000
DO IOUTER = 1,N,BLOCK
DO J = 1,N
DO I = IOUTER,MIN(N,IOUTER+BLOCK-1)
B(I,J) = A(J,I)
ENDDO
ENDDO
ENDDO
The value of BLOCK is a function of the cache size and should be carefully chosen. If
BLOCK is too small, the prefetches on both arrays are ineffective, since the prefetched data
will probably have been displaced from cache by the time the actual loads and stores occur. If
BLOCK is too large, the values of A don’t get reused. For caches that are on the order of a mega-
byte, a blocking factor of around 1000 is usually sufficient to get the benefits of prefetching and
cache line reuse.
DO JOUT = 1,N,BLOCK
DO IOUT = 1,M,BLOCK
DO J = JOUT,MIN(JOUT+BLOCK-1,N)
DO I = IOUT,MIN(IOUT+BLOCK-1,M)
B(J,I) = A(I,J)
ENDDO
ENDDO
ENDDO
ENDDO
10.5.1.2 F_DGE_TRANS
If A exceeds the cache size, transposing A with the result overwriting the original A, may
have 1.5 times better performance than transposing A to a different matrix B. Recall the defini-
tion of memory_transfer from Chapter 5:
where n is the number of data points. An out-of-place transpose of A to B results in three sets of
memory_transfers using this definition: one memory_transfer to load A and two to store to B.
The two memory_transfers for B occur because the processor must also load data from B before
it can store to B. If A is transposed over itself, each point of A is loaded once and stored once,
which requires two sets of memory_transfers. Table 10-3 compares the two approaches.
There are significant problems in achieving these rates, however. To keep from using a
work array in F_DGE_TRANS, values must be swapped across the main diagonal. There are a
couple of ways to do this. Figure 10-4 shows one technique. It is similar to the F_DGE_COPY
code when separate input and output arrays are used. One of the array accesses will be unit
stride, but the other will have stride equal to the leading dimension of one of the matrices. Code
to perform this appears as
DO J = 1, N-1
DO I = J+1, N
TEMP = A(J,I)
A(J,I) = A(I,J)
A(I,J) = TEMP
ENDDO
ENDDO
As in F_DGE_COPY, the outer loop may be unrolled to increase cache line reuse. How-
ever, the code can’t be blocked as easily as F_DGE_COPY can since the data is being overwrit-
ten. Also, what if the leading dimension of the matrix A is a large power of two?
Unfortunately, this situation does occur, most notably for Fast Fourier Transforms whose
lengths are large powers of two in size. There are a couple of problems with the non-unit stride
memory accesses. Recall that the number of memory banks is a power of two in size. Caches are
usually a power of two in size. As the data is accessed across the matrix, multiple points of a row
(using Fortran column major order) may map to the same location in the cache and cause cache
thrashing. Also, all points of a row may map to the same memory bank. This causes another per-
formance problem since there is not enough time to refresh the memory banks between data
accesses. It doesn’t get any worse than this on a processor with cache!
One way to lessen this effect is to transpose blocks of data. This is similar to Figure 10-3
with the added complication that blocks that contain a main diagonal element have to be handled
specially to ensure correct results.
Another way to decrease bank conflicts is to swap diagonals, as shown in the following
code and Figure 10-5.
DO J = 1, N-1
DO I = J+1, N
TEMP = A(I-J,I)
A(I-J,I) = A(I,I-J)
A(I,I-J) = TEMP
ENDDO
ENDDO
By traversing two off-diagonals that are the same distance from the main diagonal, the
cache thrashing and memory bank conflicts are avoided when the leading dimension is a power
of two in size. This results from the memory stride being a power of two plus one. As before, the
outer loop should be unrolled to increase cache line reuse.
The obvious problem with this approach (and previous ones) is that relatively few entries
from a virtual memory page are used before another page is required. This can cause a large
number of TLB misses. By increasing the virtual memory page size to ensure that the entire
data, or a larger part of it, is mapped into virtual memory, the number of TLB misses can be
decreased.
10.5.1.3 Parallelism
There are a couple of ways to parallelize a matrix copy. One way is to split the outer loop
into chunks of size N divided by the number of processors. This has the advantage of having
each processor work on large contiguous chunks of code. This is called a static allocation of
work. Another way is to have each outer loop iteration be given to a separate processor until all
outer loop iterations are completed. This is the type of scheduling used to service customers
waiting in a queue. It may lead to a better load balance since any imbalance of work is compen-
sated for by the other processors. This is a dynamic allocation of work. The disadvantage of this
approach is that there are more data transfers, with each one moving a smaller amount of data
than the static allocation approach. Also, the operating system must determine who gets the next
work with each iteration of the outer loop. These approaches are shown in the code extracts
below, using OpenMP directives.
C STATIC ALLOCATION
C$OMP PARALLEL SHARED (A,B,N,M) PRIVATE(I,J)
C$OMP DO SCHEDULE(STATIC)
DO J = 1,N
DO I = 1,M
B(I,J) = A(I,J)
ENDDO
ENDDO
C$OMP END DO
C$OMP END PARALLEL
C DYNAMIC ALLOCATION
C$OMP PARALLEL SHARED (A,B,N,M) PRIVATE(I,J)
C$OMP DO SCHEDULE(DYNAMIC)
DO J = 1,N
DO I = 1,M
B(I,J) = A(I,J)
ENDDO
ENDDO
C$OMP END DO
C$OMP END PARALLEL
Which approach is best? As always, it depends on the data. For very large data sets, a
dynamic allocation may be the better choice, while smaller data may be best served by static
allocation.
Parallelization of a transpose is more difficult than parallelization of a regular copy. This
is because, as discussed earlier, for large data one of the arrays is guaranteed to have poor cache
line reuse in the original transpose code. So a dynamic allocation of work such as shown above
for a matrix copy, would have multiple processors trying to access the same cache line, resulting
in cache thrashing and very poor performance.
The most popular way to perform large transposes uses large blocks of data. This is simi-
lar to the blocking shown in Figure 10-3, but the blocks are chosen to be large enough so that
each processor is kept busy. Then each block may use a finer level of blocking to ensure good
cache line reuse.
The most common matrix-vector routines are the Level 2 BLAS routines DGER and
DGEMV. The examples will be slightly simplified versions of the BLAS routines. They will
assume that vector accesses are unit stride and the scalar multipliers are set to 1.0, since this is
the common case.
10.5.2.1 DGER
DGER stands for Double precision GEneral Rank-one update. The code appears as
DO J = 1,N
DO I = 1,M
A(I,J) = A(I,J) + X(I) * Y(J)
ENDDO
ENDDO
The word “update” refers to the matrix A being modified or updated. Rank-one means that
A is updated by a single vector at a time. For example, a rank-two update means that A gets
updated by two vectors at a time.
The first thing to determine when optimizing multiple loops is which loop should be the
innermost loop. It should be the one that maximizes unit stride accesses. Then an attempt should
be made to minimize the number of memory operations.
For DGER, the I loop should be the innermost loop since this results in the references on A
and X being unit stride. The value of Y may also be hoisted outside the innermost loop. The
outer loop can then be unrolled to reduce the number of memory operations. Using four-way
unrolling and ignoring the required cleanup code for non-multiples of four produces
DO J = 1,N,4
Y0 = Y(J)
Y1 = Y(J+1)
Y2 = Y(J+2)
Y3 = Y(J+3)
DO I = 1,M
A(I,J) = A(I,J) + X(I) * Y0
A(I,J+1) = A(I,J+1) + X(I) * Y1
A(I,J+2) = A(I,J+2) + X(I) * Y2
A(I,J+3) = A(I,J+3) + X(I) * Y3
ENDDO
ENDDO
Therefore, four columns of A are updated with one column of X. This increases the F:M
ratio from 2:3 to 8:9. Table 10-4 shows the number of operations for each approach.
Out-of-Cache There’s an additional problem for very large matrices. Suppose the size
of X and each column of A exceeds the size of the cache. Each time a column of A is accessed,
each cache line misses the cache. There’s nothing you can do about that. However, each time X
is accessed, it also misses the cache. As we demonstrated in Chapter 5, data can be blocked so
that the references to X are usually in cache. The code above may be rewritten as
DO IOUT = 1, M, MBLOCK
DO J = 1,N,4
Y0 = Y(J)
Y1 = Y(J+1)
Y2 = Y(J+2)
Y3 = Y(J+3)
DO I = IOUT, MIN(IOUT+MBLOCK-1,M)
A(I,J) = A(I,J) + X(I) * Y0
A(I,J+1) = A(I,J+1) + X(I) * Y1
A(I,J+2) = A(I,J+2) + X(I) * Y2
A(I,J+3) = A(I,J+3) + X(I) * Y3
ENDDO
ENDDO
ENDDO
where the value of MBLOCK is chosen to keep X cache resident most of the time, but long
enough that prefetching of A and X is effective. This is shown in Figure 10-6.
10.5.2.2 DGEMV
DGEMV stands for Double GEneral Matrix Vector multiplication. DGEMV allows the
matrix to be accessed in normal form or transposed. These cases will be considered separately.
DGEMV — A Not Transposed The code for the normal matrix-vector multiplication
appears as follows:
DO J = 1,N
DO I = 1,M
Y(I) = Y(I) + A(I,J) * X(J)
ENDDO
ENDDO
It is crucial to have the accesses on A be unit stride, so the I loop should be the innermost
loop. As in the case of DGER, the outer loop can be unrolled to increase data reuse as follows:
DO J = 1,N,4
DO I = 1,M
Y(I) = Y(I) + A(I,J) * X(J) + A(I,J+1) * X(J+1)
$ + A(I,J+2) * X(J+2) + A(I,J+3) * X(J+3)
ENDDO
ENDDO
Thus, four columns of A update one column of Y. Note, however, there is a single term
Y(I) that all intermediate results sum to. Using a single reduction means that this code may be
limited by the floating-point instruction latency, since four terms sum to the Y(I) term. To
remove this constraint, the inner loop may be unrolled. Four-way inner loop unrolling produces
DO J = 1,N,4
DO I = 1,M,4
Y(I) = Y(I) + A(I,J) * X(J) + A(I,J+1) * X(J+1)
$ + A(I,J+2) * X(J+2) + A(I,J+3) * X(J+3)
Y(I+1) = Y(I+1) + A(I+1,J) * X(J) + A(I+1,J+1) * X(J+1)
$ + A(I+1,J+2) * X(J+2) + A(I+1,J+3) * X(J+3)
Y(I+2) = Y(I+2) + A(I+2,J) * X(J) + A(I+2,J+1) * X(J+1)
$ + A(I+2,J+2) * X(J+2) + A(I+2,J+3) * X(J+3)
Y(I+3) = Y(I+3) + A(I+3,J) * X(J) + A(I+3,J+1) * X(J+1)
$ + A(I+3,J+2) * X(J+2) + A(I+3,J+3) * X(J+3)
ENDDO
ENDDO
The compiler may still have difficulty interleaving the separate reductions. Some compil-
ers attempt to perform the Y(I) calculations before starting the Y(I+1) calculations. This renders
the inner loop unrolling useless. Therefore, the calculation of the individual Y values can be
explicitly interleaved so that only the first multiply-addition component of each Y vector is cal-
culated. Following that, the second multiply-addition component can be calculated. This process
continues until the calculations of the inner loop are complete. This restructuring is demon-
strated below.
DO J = 1,N,4
DO I = 1,M,4
Y0 = Y(I) + A(I,J) * X(J)
Y1 = Y(I+1) + A(I+1,J) * X(J)
Y2 = Y(I+2) + A(I+2,J) * X(J)
Y3 = Y(I+3) + A(I+3,J) * X(J)
Y0 = Y0 + A(I,J+1) * X(J+1)
...
Y(I) = Y0 + A(I,J+3) * X(J+3)
Y(I+1) = Y1 + A(I+1,J+3) * X(J+3)
Y(I+2) = Y2 + A(I+2,J+3) * X(J+3)
Y(I+3) = Y3 + A(I+3,J+3) * X(J+3)
ENDDO
ENDDO
DGEMV — A Transposed The code for the transposed matrix-vector multiplication appears as follows:
DO I = 1,M
DO J = 1,N
Y(J) = Y(J) + A(I,J) * X(I)
ENDDO
ENDDO
The I loop should be the innermost loop to ensure that the accesses on A are unit stride.
Outer loop unrolling also helps performance. Unrolling the outer loops by four produces the fol-
lowing code:
DO J = 1,N,4
DO I = 1,M
Y(J) = Y(J) + A(I,J) * X(I)
Y(J+1) = Y(J+1) + A(I,J+1) * X(I)
Y(J+2) = Y(J+2) + A(I,J+2) * X(I)
Y(J+3) = Y(J+3) + A(I,J+3) * X(I)
ENDDO
ENDDO
As in the case of DGER, calling DGEMV with A transposed and large requires the source
code to be blocked to reduce cache misses for the vector accessed in the inner loop. Table 10-4
compares the theoretical performance of the original DGER and the two DGEMV codes with the
modified source.
Routine                     Unroll factor   Memory operations   Floating-point operations   F:M ratio
                                            per iteration       per iteration
DGER                             4                 9                       8                  0.89
DGEMV (A not transposed)         4                 6                       8                  1.33
DGEMV (A transposed)             4                 5                       8                  1.60
10.5.2.3 Parallelism
DGEMV with A transposed, and DGER, can be parallelized with static or dynamic allocation of outer loop iterations, as discussed in the matrix copy section. When A is not transposed, the code needs a reduction attribute on the elements of Y to ensure correctness of answers.
The Level 3 BLAS matrix-matrix multiplication routine DGEMM, in its basic form (C = C + A B), appears as follows:
DO I = 1,M
DO J = 1,N
DO L = 1,K
C(I,J) = C(I,J) + A(I,L) * B(L,J)
ENDDO
ENDDO
ENDDO
The I loop should be the innermost loop since that causes the access patterns for A and C
to be unit stride. Unroll and jam techniques can then be used on the outer two loops. This is
called the hoist B matrix-matrix multiplication version of the code since the references to B may
be hoisted outside of the inner loop. For example, if the outer two loops are unrolled by two and
jammed together, the code appears as follows. Note that two columns of A update two columns
of C.
DO J = 1,N,2
DO L = 1,K,2
B11 = B(L,J)
B21 = B(L+1,J)
B12 = B(L,J+1)
B22 = B(L+1,J+1)
DO I = 1,M
C(I,J) = C(I,J) + A(I,L) * B11 + A(I,L+1) * B21
C(I,J+1) = C(I,J+1) + A(I,L) * B12 + A(I,L+1) * B22
ENDDO
ENDDO
ENDDO
This increases the floating-point to memory operation ratio. For most processors, we want
the floating-point to memory ratio to be greater than 2.0. As the amount of unrolling and jam-
ming increases, the number of floating-point registers required also increases. For example, a
4 × 4 unrolling requires 16 floating-point registers for B and another four registers for C. So the
number of floating-point registers available sets an upper limit on the amount of unrolling. The
F:M ratio should be at least two for most processors, so this sets a lower limit for the amount of
unroll and jam. Another constraint is determined by the instruction latency. The larger the
latency, the more registers are required to hide it. This may require a certain amount of inner
loop unrolling to increase the number of streams of instruction, which also increases the mini-
mum number of registers required. Table 10-5 shows the effects of unroll and jam for various
unrolling factors.
Table 10-5 DGEMM Data Reuse for a Hoist B Approach.
J unroll factor     L unroll factor     Memory operations   Floating-point operations   F:M ratio
(columns of C, B)   (columns of A)      per iteration       per iteration
        1                  1                   3                       2                  0.67
        2                  2                   6                       8                  1.33
        3                  3                   9                      18                  2.00
        3                  4                  10                      24                  2.40
        4                  4                  12                      32                  2.67
       nb                 kb               2 nb + kb               2 nb kb           2 nb kb / (2 nb + kb)
By studying the general formula at the bottom of the table, it is apparent that nb must be
greater than or equal to two and kb must be greater than or equal to three to obtain F:M ratios
greater than or equal to 2.0. (This is one reason why DGEMV performance cannot equal DGEMM
performance since DGEMV has only one column to update.) Figure 10-7 shows the hoist B
approach blocking.
Example: PA-8500 processor. This has 28 user accessible floating-point registers, an
instruction latency of three cycles, and the ability to do two fma instructions per cycle. Thus,
there must be at least six independent streams of instruction. As the table above shows, a 4 × 4
outer loop unrolling would result in a F:M value greater than 2.0. However, in order to have at
least six streams of instruction, the inner loop must be unrolled by at least two. Therefore, 16
[Figure 10-7: Blocking for the hoist B approach — nb columns of C and B are updated using kb columns of the M × K matrix A.]
registers must be used for B, and eight registers must be used for C. This leaves only four regis-
ters for all of A, which is not nearly enough, due to the instruction latency.
A 3 × 4 outer loop unrolling also exceeds an F:M ratio of 2.0 and requires at least two-way
inner loop unrolling, but uses only 12 registers for B and six for C. This leaves 10 registers for A
and temporary values and is much more feasible for this processor.
Middle and Outer Loops Since the I loop should be the innermost loop, the next
question is whether the J or L loop should be the outermost loop. This choice influences the
number of cache misses when some of the data is out of cache. If J is chosen to be the outermost
loop, then L is the middle loop and all of A is accessed for some small number of columns of C.
If L is the outer loop and J is the middle loop, then all of C is accessed for some small number of
columns of A.
Suppose a 3 × 4 unrolling has been chosen. There are three columns of C and four columns
of A that are being accessed in the inner loop. Each column of C must be loaded and stored,
while each column of A needs only to be loaded. Suppose further that some cache misses occur
and that the chance that A will not be in cache is the same as the chance that C will not be in
cache. Since C must be loaded and stored, it will incur twice as many misses as A. Therefore, it
is better to have J as the outermost loop (hold C fixed) and sweep through all of A for each value
of J.
When B is transposed (C = C + A BT), the code appears as follows:
DO J = 1,N
DO L = 1,K
DO I = 1,M
C(I,J) = C(I,J) + A(I,L) * B(J,L)
ENDDO
ENDDO
ENDDO
Since the B values can be hoisted outside the inner loop, the optimizations that were made
for the previous case can be applied here.
When A is transposed (C = C + AT B), the code appears as follows:
DO J = 1,N
DO L = 1,K
DO I = 1,M
C(I,J) = C(I,J) + A(L,I) * B(L,J)
ENDDO
ENDDO
ENDDO
Having the I loop as the inner loop is a poor choice since the accesses on A will not be unit
stride. However, if the L loop is the inner loop, the accesses on A and B are unit stride, and the C
references may be hoisted and sunk from the inner loop.
DO J = 1,N
DO I = 1,M
DO L = 1,K
C(I,J) = C(I,J) + A(L,I) * B(L,J)
ENDDO
ENDDO
ENDDO
This is called the hoist/sink C matrix-matrix multiplication approach and is superior to the
hoist B approach since there are no stores in the inner loop. Performing a 2 × 2 unroll and jam
produces the following code. Note that two columns of A and two columns of B update four sca-
lar values.
DO J = 1,N,2
DO I = 1,M,2
C11 = 0.0
C21 = 0.0
C12 = 0.0
C22 = 0.0
DO L = 1,K
C11 = C11 + A(L,I) * B(L,J)
C21 = C21 + A(L,I+1) * B(L,J)
C12 = C12 + A(L,I) * B(L,J+1)
C22 = C22 + A(L,I+1) * B(L,J+1)
ENDDO
C(I,J) = C(I,J) + C11
C(I+1,J) = C(I+1,J) + C21
C(I,J+1) = C(I,J+1) + C12
C(I+1,J+1) = C(I+1,J+1) + C22
ENDDO
ENDDO
Table 10-6 shows the effects of unroll and jam for various unrolling factors.
J unroll factor     I unroll factor     Memory operations   Floating-point operations   F:M ratio
(columns of C, B)   (columns of A)      per iteration       per iteration
        1                  1                   2                       2                  1.00
        2                  2                   4                       8                  2.00
        3                  3                   6                      18                  3.00
The hoist/sink C approach is superior to the hoist B approach in terms of the floating-point
operation to memory operation ratio. It takes less unrolling for unroll and jam to achieve peak
rates with the hoist/sink C approach due to the absence of store operations in the inner loop.
Figure 10-8 shows the hoist/sink C approach blocking.
[Figure 10-8: Blocking for the hoist/sink C approach — an mb × nb block of C is updated using mb columns of A (accessed as AT) and nb columns of B.]
Middle and Outer Loops The order of the two outer loops should also be considered.
Is it better to access all of A with each iteration of the outer loop or all of B? Since each of these arrays is only loaded (i.e., read), it is not apparent which is better at first. However, the accesses on C should also be considered. Note that the references to C in the middle loop are unit stride when the I loop is the middle loop. This is especially important when the inner loop is short, so the I loop should be the middle loop.
When both A and B are transposed (C = C + AT BT), the code appears as follows:
DO J = 1,N
DO L = 1,K
DO I = 1,M
C(I,J) = C(I,J) + A(L,I) * B(J,L)
ENDDO
ENDDO
ENDDO
By now you should know what to do first. Make all array references unit stride. But in this
case, it can’t be done. Choose J, L, or I as the inner loop and at least one of the arrays will have
non-unit stride accesses. What to do?
We’ve seen that the hoist/sink C solution is superior to the hoist B in terms of data reuse,
and since neither approach can have all accesses unit stride for this case, choosing a hoist/sink C
approach is a reasonable solution for the case when both A and B are transposed. Thus, L should be chosen to be the inner loop; the accesses on A are then unit stride and the accesses on B are non-unit stride. Choosing I as the middle loop makes the C accesses unit stride in the middle
loop. Applying a 2 × 2 unroll and jam produces the following:
DO J = 1,N,2
DO I = 1,M,2
C11 = 0.0
C21 = 0.0
C12 = 0.0
C22 = 0.0
DO L = 1,K
C11 = C11 + A(L,I) * B(J,L)
C21 = C21 + A(L,I+1) * B(J,L)
C12 = C12 + A(L,I) * B(J+1,L)
C22 = C22 + A(L,I+1) * B(J+1,L)
ENDDO
C(I,J) = C(I,J) + C11
C(I+1,J) = C(I+1,J) + C21
C(I,J+1) = C(I,J+1) + C12
C(I+1,J+1) = C(I+1,J+1) + C22
ENDDO
ENDDO
Finally, consider making the J loop the innermost loop so that the A values can be hoisted from the inner loop (the hoist A approach):
DO L = 1,K
DO I = 1,M
DO J = 1,N
C(I,J) = C(I,J) + A(I,L) * B(L,J)
ENDDO
ENDDO
ENDDO
Note that the access patterns for C are always non-unit stride. This makes the approach
unattractive for DGEMM. If both matrices C and B were transposed, it might be worth considering. However, if CT = CT + A BT, then C = C + B AT, and the hoist/sink C approach would be better. Therefore, the hoist A approach doesn't help DGEMM performance.
In this situation, the problem can be decomposed into multiple matrix-matrix multiplications, each of which fits into the data cache. For each block of C, this consists of copying the block of C to a contiguous work array, copying the corresponding blocks of A and B to work arrays, multiplying the copied blocks with DGEMM, and copying the result back into C.
The data motion is shown in the following code and illustrated in Figure 10-10.
...
REAL*8 TEMP(BLOCK,BLOCK,3)
DO I = 1,M,BLOCK
DO J = 1,N,BLOCK
CALL F_DGE_COPY(BLAS_NO_TRANS, BLOCK, BLOCK, C(I,J),
$ LDC, TEMP(1,1,3), BLOCK)
DO L = 1,K,BLOCK
CALL F_DGE_COPY(BLAS_NO_TRANS, BLOCK, BLOCK, A(I,L),
$ LDA, TEMP(1,1,1), BLOCK)
CALL F_DGE_COPY(BLAS_NO_TRANS, BLOCK, BLOCK, B(L,J),
$ LDB, TEMP(1,1,2), BLOCK)
IF (L .EQ. 1) THEN
CALL F_DGEMM(BLAS_NO_TRANS, BLAS_NO_TRANS, BLOCK,
$ BLOCK, BLOCK, ALPHA, TEMP(1,1,1), BLOCK,
$ TEMP(1,1,2), BLOCK, BETA, TEMP(1,1,3), BLOCK)
ELSE
CALL F_DGEMM(BLAS_NO_TRANS, BLAS_NO_TRANS, BLOCK,
$ BLOCK, BLOCK, ALPHA, TEMP(1,1,1), BLOCK,
$ TEMP(1,1,2), BLOCK, 1.0D0, TEMP(1,1,3), BLOCK)
ENDIF
ENDDO
CALL F_DGE_COPY(BLAS_NO_TRANS, BLOCK, BLOCK, TEMP(1,1,3),
$ BLOCK, C(I,J), LDC)
ENDDO
ENDDO
...
This is really a hoist/sink C approach performed at the block level. Note that in the block
inner loop, only blocks of A and B need to be copied to the TEMP array. There are also block
versions of hoist B and hoist A matrix-matrix multiplication. These are inferior to the hoist/sink
C approach, since each requires C to be copied to and from the TEMP array in the inner block
loop.
Figure 10-10 Data motion for the block and copy approach: blocks of A, B, and C are copied to temporary storage and multiplied there.
10.5.4.3 Parallelism
Matrix-matrix multiplication routines are very well suited to parallelism, which can occur
at multiple levels. Using a hoist B or hoist/sink C approach, there is a natural parallelism that
occurs at the outer loop level. For example, the hoist B approach shown in Figure 10-7 has n/nb
increments of size nb. Each of these is independent and may be executed in parallel. All proces-
sors need to read all of A, but they need only the part of C that they update and its corresponding
part of B. Likewise, the hoist/sink C approach shown in Figure 10-8 can be modified for paral-
lelism by executing each of the n/nb increments in parallel.
The block and copy code shown in Figure 10-10 can also use outer loop parallelism that is
very similar to the hoist/sink C code above. Each processor needs the part of C that it updates
and the corresponding part of B. However, each processor does eventually need to read and use
all of A. For massively parallel computers that have a very small amount of memory associated
with each processor, there are techniques that limit the amount of data movement of A by having
each processor receive a piece of A, operate with it, and then pass it to the next processor until
all processors have accessed all of A (yet another example of pipelining). Interested readers may
consult Fox [2] for more details.
cache-based computers. Table 10-7 shows the ratio of floating-point to memory operations on
some common BLAS routines.
Table 10-7 The BLAS and Data Reuse.

    BLAS Routine    Type    Optimizations    F:M ratio (upper limit)    Floating-point operations per byte
Clearly, DGEMM is the preferred routine because it has the highest F:M ratio. Many pro-
cessors require an F:M ratio of 2.0 for optimal performance. The DGEMV F:M value looks
pretty good since its ratio approaches two. So at first glance, the performance of DGEMV might
be expected to be close to DGEMM performance and for data that is already in-cache, the perfor-
mance can be similar. It’s the out-of-cache performance where DGEMM is far superior to
DGEMV and the other routines. When DGEMV is performed, data is loaded from memory (far,
far away) to cache. Then it is moved from cache to the functional units. For DGEMV, data from
the matrix is used once and that’s it. For a matrix of size n × n, each point generates one fma,
which, for eight byte data, results in 0.25 floating-point operations per byte. DGEMM has the
opportunity to reuse data in cache much more than DGEMV.
Suppose we want to perform a square matrix multiplication of the largest size that fits in
cache. In the first case, suppose the processor has a tiny one KB data cache and A, B, and C each
use one-third of the cache. This implies that the size of the matrix, n, is sqrt((1/3) × (1 KB / 8 bytes)) ≈ 6
points. This is ridiculously small, but if we assume that the two matrices, A and B, are loaded
into cache once (a blocked hoist/sink C approach), for the 6 × 6 multiplication performed, the
floating-point per byte ratio is

    (2 × 6^3 flops) / (2 × 6^2 × 8 bytes) = 0.75 floating-point operations per byte

This has three times as much data reuse as DGEMV. Now what happens if a 1 MB cache is
used? A data set of size 209 × 209 may be processed in-cache, leading to a ratio of
209 / 8 ≈ 26.125 floating-point operations per byte. This is a huge improvement over the
DGEMV reuse!
The final sections of this chapter analyze algorithms that reduce the number of opera-
tions required for matrix-matrix multiplication. The first technique is due to Winograd [6] and
reduces the number of multiplications. The second algorithm is for complex multiplication only
and it reduces the number of multiplications and additions. The third algorithm was developed
by Strassen [5] and actually reduces the order of the operations for a potentially huge reduction
in the number of calculations!
It’s easy to count the number of operations to perform matrix-matrix multiplication. The
three dominant loops may appear as follows:
DO J = 1,N
DO L = 1,K
DO I = 1,M
C(I,J) = C(I,J) + A(I,L) * B(L,J)
ENDDO
ENDDO
ENDDO
The number of multiplications is kmn, with an equal number of additions. This gives a
total of 2kmn operations. For k = m = n, this is an O(n^3) algorithm.
Winograd derived a way to reduce the number of floating-point multiplications in a
matrix-matrix multiplication by half at the expense of a few more floating-point additions. This
is a good technique to use on processors that don’t have an efficient fma instruction. Using the
code above, suppose k is even. The key to this algorithm is forming the product P(I,J):
DO J = 1,N
DO I = 1,M
P(I,J) = 0.0
ENDDO
ENDDO
DO J = 1,N
DO L = 1,K/2
DO I = 1,M
P(I,J) = P(I,J) + (A(I,2*L) + B(2*L-1,J)) * (A(I,2*L-1) + B(2*L,J))
ENDDO
ENDDO
ENDDO
Expanding each term of the product gives four contributions:

A(I,2*L-1)*B(2*L-1,J)
A(I,2*L)*B(2*L,J)
A(I,2*L-1)*A(I,2*L)
B(2*L-1,J)*B(2*L,J)

The first two terms are necessary for the final matrix-matrix multiplication result, while
the last two have to be dispensed with. Note, though, that the last two values are functions of only two of
the loop indices. Thus, while the calculation of P is of O(kmn), the separate calculation of the A
product is O(km) and the separate calculation of the B product is O(kn). These two terms can be
calculated separately and subtracted from the P term. When k is odd, the final term A(I,K)*B(K,J) must be added as well.
The following calculates the complete matrix-matrix multiplication:
DO I = 1,M
X(I) = 0.0
ENDDO
DO L = 1,K/2
DO I = 1,M
X(I) = X(I) + A(I,2*L-1) * A(I,2*L)
ENDDO
ENDDO
DO J = 1,N
Y(J) = 0.0
DO L = 1,K/2
Y(J) = Y(J) + B(2*L-1,J) * B(2*L,J)
ENDDO
ENDDO
IF (IAND(K,1) .EQ. 0) THEN ! K IS EVEN
DO J = 1,N
DO I = 1,M
C(I,J) = C(I,J) + P(I,J) - X(I) - Y(J)
ENDDO
ENDDO
ELSE
DO J = 1,N
DO I = 1,M
C(I,J) = C(I,J) + P(I,J) - X(I) - Y(J) + A(I,K)*B(K,J)
ENDDO
ENDDO
ENDIF
Therefore, the worst case is when K is odd. The maximum number of operations is
S1 = Br − Bi
S2 = Ar + Ai
S3 = Ar − Ai
R1 = Ar S1
R2 = Br S2
R3 = Bi S3
(Cr , Ci) = (Cr , Ci) + (Ar , Ai) (Br , Bi)
          = (Cr + R1 + R3 , Ci + R2 − R1)
This contains three real multiplications and seven real additions. For square matrices, the
additions use only O(n^2) operations and can be ignored for very large n. By removing a real
matrix-matrix multiplication, the number of operations drops to 3/4 of the usual approach.
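As an illustration of the bookkeeping (a sketch, not code from the text), the 3M method for matrices stored as separate real and imaginary parts can be written with three calls to DGEMM; the routine and array names are illustrative, all matrices are assumed to share the leading dimension LD, and Fortran 90 automatic arrays are used for workspace:

      SUBROUTINE CGEMM3M(M, N, K, AR, AI, BR, BI, CR, CI, LD)
C     SKETCH OF THE 3M METHOD: (CR,CI) = (CR,CI) + (AR,AI)*(BR,BI)
C     USING THREE REAL MATRIX MULTIPLICATIONS INSTEAD OF FOUR.
      IMPLICIT NONE
      INTEGER M, N, K, LD, I, J
      REAL*8 AR(LD,*), AI(LD,*), BR(LD,*), BI(LD,*)
      REAL*8 CR(LD,*), CI(LD,*)
      REAL*8 S1(K,N), S2(M,K), S3(M,K), R1(M,N), R2(M,N), R3(M,N)
C     S1 = BR - BI,  S2 = AR + AI,  S3 = AR - AI
      DO J = 1, N
         DO I = 1, K
            S1(I,J) = BR(I,J) - BI(I,J)
         ENDDO
      ENDDO
      DO J = 1, K
         DO I = 1, M
            S2(I,J) = AR(I,J) + AI(I,J)
            S3(I,J) = AR(I,J) - AI(I,J)
         ENDDO
      ENDDO
C     R1 = AR*S1,  R2 = S2*BR,  R3 = S3*BI  (THREE REAL DGEMM CALLS)
      CALL DGEMM('N','N',M,N,K,1.0D0,AR,LD,S1,K,0.0D0,R1,M)
      CALL DGEMM('N','N',M,N,K,1.0D0,S2,M,BR,LD,0.0D0,R2,M)
      CALL DGEMM('N','N',M,N,K,1.0D0,S3,M,BI,LD,0.0D0,R3,M)
C     CR = CR + R1 + R3,  CI = CI + R2 - R1
      DO J = 1, N
         DO I = 1, M
            CR(I,J) = CR(I,J) + R1(I,J) + R3(I,J)
            CI(I,J) = CI(I,J) + R2(I,J) - R1(I,J)
         ENDDO
      ENDDO
      END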
A, B, C load                               3 × 16 = 48
S1, S2, S3, R1, R2, R3 store and load      6 × (16+8) = 144
C store                                    16
Total                                      352
    (2n^3 flop) / (y Mflop/s) = (352n^2 bytes) / (z MB/s)

or

    n = 176 × y / z
On an HP N-Class server, a PA-8500 processor with a 440 MHz frequency can perform
large matrix multiplications at about 1300 Mflop/s and move data between processor and mem-
ory at 1300 MB/s. Thus, the break-even point should be about n = 176 × 1300 / 1300 = 176.
So, on an N-Class, the 3M algorithm is useful once the data exceeds 176 × 176. Recall that
the current trend in computer architectures is that processor performance increases faster than
interconnect performance, so the break-even point will not be shrinking any time soon.
What about parallelism? As processors are added, the memory bandwidth rates rarely
scale as well as the number of processors, so the break-even point usually increases as the num-
ber of processors increase.
    | C11  C12 |   | C11  C12 |   | A11  A12 | | B11  B12 |
    | C21  C22 | = | C21  C22 | + | A21  A22 | | B21  B22 |
This requires eight matrix multiplications and eight matrix additions of size n/2 × n/2. The
matrix-matrix additions take only O(n^2) operations, so we'll ignore them. If each matrix multi-
plication takes 2(n/2)^3 floating-point operations, then eight of them take 2n^3 floating-point oper-
ations, which is the same number of operations as the original algorithm.
Strassen derived a different sequence of operations which eliminated one of the eight
matrix-matrix multiplications at the expense of eleven more matrix-matrix additions. After the
publication of the original algorithm, Winograd was able to remove three of these additions and
his variant of Strassen’s algorithm appears below.
S1 = A21 + A22     M1 = S2 S6      T1 = M1 + M2
S2 = S1 − A11      M2 = A11 B11    T2 = T1 + M4
S3 = A11 − A21     M3 = A12 B21
S4 = A12 − S2      M4 = S3 S7
S5 = B12 − B11     M5 = S1 S5
S6 = B22 − S5      M6 = S4 B22
S7 = B22 − B12     M7 = A22 S8
S8 = S6 − B21
C11 = C11 + M2 + M3
C12 = C12 + T1 + M5 + M6
C21 = C21 + T2 − M7
C22 = C22 + T2 + M5
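As an illustration (not from the text), one level of this variant can be coded directly from the listing above. The following Fortran 90 sketch assumes N is even and uses the MATMUL intrinsic for the seven half-size products; a tuned version would call DGEMM or recurse instead:

      SUBROUTINE STRASSEN_WINOGRAD_STEP(N, A, B, C)
C     SKETCH: C = C + A*B USING ONE LEVEL OF THE WINOGRAD VARIANT
C     OF STRASSEN'S ALGORITHM.  N IS ASSUMED TO BE EVEN.
      IMPLICIT NONE
      INTEGER N, H
      REAL*8 A(N,N), B(N,N), C(N,N)
      REAL*8 S1(N/2,N/2), S2(N/2,N/2), S3(N/2,N/2), S4(N/2,N/2)
      REAL*8 S5(N/2,N/2), S6(N/2,N/2), S7(N/2,N/2), S8(N/2,N/2)
      REAL*8 M1(N/2,N/2), M2(N/2,N/2), M3(N/2,N/2), M4(N/2,N/2)
      REAL*8 M5(N/2,N/2), M6(N/2,N/2), M7(N/2,N/2)
      REAL*8 T1(N/2,N/2), T2(N/2,N/2)
      H = N/2
C     THE EIGHT SUMS OF SUBMATRICES
      S1 = A(H+1:N,1:H) + A(H+1:N,H+1:N)
      S2 = S1 - A(1:H,1:H)
      S3 = A(1:H,1:H) - A(H+1:N,1:H)
      S4 = A(1:H,H+1:N) - S2
      S5 = B(1:H,H+1:N) - B(1:H,1:H)
      S6 = B(H+1:N,H+1:N) - S5
      S7 = B(H+1:N,H+1:N) - B(1:H,H+1:N)
      S8 = S6 - B(H+1:N,1:H)
C     THE SEVEN HALF-SIZE PRODUCTS
      M1 = MATMUL(S2, S6)
      M2 = MATMUL(A(1:H,1:H), B(1:H,1:H))
      M3 = MATMUL(A(1:H,H+1:N), B(H+1:N,1:H))
      M4 = MATMUL(S3, S7)
      M5 = MATMUL(S1, S5)
      M6 = MATMUL(S4, B(H+1:N,H+1:N))
      M7 = MATMUL(A(H+1:N,H+1:N), S8)
      T1 = M1 + M2
      T2 = T1 + M4
C     UPDATE THE FOUR QUADRANTS OF C
      C(1:H,1:H)     = C(1:H,1:H)     + M2 + M3
      C(1:H,H+1:N)   = C(1:H,H+1:N)   + T1 + M5 + M6
      C(H+1:N,1:H)   = C(H+1:N,1:H)   + T2 - M7
      C(H+1:N,H+1:N) = C(H+1:N,H+1:N) + T2 + M5
      END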
There are seven matrix-matrix multiplications of size n/2 × n/2 that generate (7/8) × 2n^3
floating-point operations. There are also 19 matrix-matrix additions of size n/2 × n/2 (or 15 if
the algorithm does not update preexisting C values). The true power of the algorithm becomes
apparent when it is applied repeatedly. The maximum number of times this optimization can be
applied is log2(n) times so the order of operations from the matrix-matrix multiplication is
    (7/8)^(log2(n)) × n^3
and a large enough matrix-matrix multiplication should require many fewer operations.
Suppose we want to multiply two matrices of size 1024 × 1024. Since 1024 = 210, this
could use ten levels of Strassen’s algorithm. First, the matrices are divided into four submatrices
of size 512 × 512. These can be multiplied using the seven multiplications of Strassen’s algo-
rithm. These submatrices are, in turn, subdivided into four submatrices and this process contin-
ues until the lowest level, which consists of 262,144 matrices of size 2 × 2, is reached. How
much could Strassen’s algorithm speed up the operations? If each step reduced the operation
count by 7/8, then the number of operations should be (7/8)^10 ≈ 0.26 or approximately
one-fourth that of the original multiplication. But is actual performance really this good?
In practice, there exists some point where it is no longer profitable to further divide the
matrices in half and perform another step of Strassen’s algorithm. This is due to the large num-
ber of matrix additions that have been introduced. Strassen’s algorithm is usually implemented
by using it recursively until the performance of the current level matrix-matrix multiplication
using a standard matrix-matrix multiplication outperforms a matrix multiplication using another
Strassen step. These multiplications then use the standard matrix-matrix multiplication
approach. The next section will find the break-even point when it is profitable to apply Stras-
sen’s algorithm.
Note that the algorithm is defined for square matrices that are a power of two in size.
Thus, if your matrices are not square and a power of two in size, it is necessary to handle the
non-power of two regions separately. This also reduces the efficiency of the algorithm. There
are fast matrix multiplication algorithms like Strassen’s for n ≥ 3 and some of these algorithms
reduce the order of the algorithm further. The lowest order is n^2.376 and was obtained by Cop-
persmith and Winograd [1]. The goal for all of these algorithms is to see how close the order can
be to two. However, to date, all of the algorithms that have a lower order than Strassen's
algorithm have so much overhead that they are not useful for real-world applications.
So we know Strassen’s approach should help performance for large enough matrices, but
what is large enough? For a matrix-matrix multiplication of square matrices of size n × n, we’ve
traded a multiplication of size n/2 × n/2 for 15 matrix additions of size n/2 × n/2. When the time
to perform these additions is less than the time to perform the multiplication, then Strassen’s
algorithm can be used profitably.
For each element of the matrix addition, there are two loads and one store. Suppose the
store uses a different array than the ones loaded and the data is eight bytes long. 32 bytes must
be moved for each point (eight bytes for each load and 16 for the store). There are (n/2)2 points
for each matrix addition and 15 matrix additions to perform.
    (2(n/2)^3 flop) / (y Mflop/s) = (15 × 32 (n/2)^2 bytes) / (z MB/s)

or

    n = 480 × y / z
On an HP N-Class server, a PA-8500 processor with a frequency of 440 MHz can perform
large matrix multiplications at about 1300 Mflop/s and move data between processor and mem-
ory at 1300 MB/s. Thus, the break-even point should be about n = 480 × 1300 / 1300 = 480.
So Strassen’s algorithm is not useful until the data size exceeds 480 × 480. As mentioned
in the discussion on using the 3M approach for complex matrix-matrix multiplications, the sin-
gle processor break-even point will not be shrinking in the near future and the break-even point
also increases with the number of processors.
Clearly, Strassen's approach should be used only on very large matrices. Recall our calcu-
lation that it takes five iterations of Strassen's method to reduce the number of calculations to
half that of the standard approach. For the N-Class, we could not hope to achieve this until the
data size was at least 480 × 2^5 = 15,360. This requires three matrices whose total size is
3 × 8 × (15,360)^2 = 5.6 GB, which is a pretty large data set! So don't expect Strassen's tech-
nique to reduce the computational time significantly except for very large data!
    n = 240 × y / z

so the single processor break-even point on the N-Class above is a matrix of size 240 × 240.
Thus, step two gives a 3/4 reduction in the number of operations beyond that achieved by
Strassen’s recursive algorithm.
10.10 Summary
This chapter applied the code optimization techniques of Chapter 5 to some of the simple
kernels used by many mathematical software packages and scientific programs. Some hardware
and software vendors provide high performance routines that perform many of these operations.
If they don't, or if you need to optimize something similar to, but different from, the standard rou-
tines, you can apply the techniques to improve your application's performance.
References:
Many scientific applications spend significant amounts of time solving linear systems of
equations. These may take an n × n matrix A, a vector b (also known as the right hand side vector)
of size n, and solve for a vector x of unknowns. This equation is written as Ax = b.
Various techniques to solve systems of equations have been developed and this is one of
the most researched areas of mathematics. Systems of equations are said to be dense if all, or
nearly all, of the coefficients of A are nonzero. Systems that arise naturally in some disciplines
are sparse, that is, most coefficients of A are zero. The structure of a sparse matrix refers to the
location of the nonzero elements and determines which algorithms are used to optimally solve a
system.
A solution of a system of equations is obtained using either direct or iterative methods.
Direct methods use algorithms that return an exact solution for x after a fixed number of steps.
Iterative algorithms start with a guess for the solution vector x. They perform operations which
refine this to a more accurate solution for x. This process continues (iterates) until a solution
with the desired accuracy is obtained. Algorithms may also be performed in-core or out-of-core.
In-core methods assume that all the data (i.e., A, b and x) fits in the main memory of the com-
puter. This is usually the case since computer memories today may be many gigabytes. All the
algorithms in this book assume that the problem fits in-core. Out-of-core methods pull the data
from disk to memory as the system is being solved. They can perform extremely large problems
since the problem size is limited only by the amount of disk space. Algorithms have to be
restructured for out-of-core accesses and some inefficiencies naturally result.
Computer vendors provide math libraries such as Hewlett-Packard’s Mathematical soft-
ware LIBrary (MLIB) to provide much of the functionality discussed in this chapter. Nearly all
such libraries support the LAPACK math library. Some also provide sparse direct and iterative
solver functionality. There are also software vendors such as the Numerical Algorithms Group
(NAG) and Precision Numerics (developers of the IMSL math and statistics libraries) that pro-
vide a broad selection of routines. In general, hardware vendors provide a smaller set of func-
tionality than software vendors, but the routines provided by hardware vendors tend to be more
highly tuned for performance.
This chapter introduces some common algorithms and terminology and tries to hit some of
the high spots of numerical linear algebra. Interested readers should consult Demmel [2], Don-
garra [3], or Golub [5] for more in-depth analysis.
3y + 2z = 5
9y + 12z = 3
can be written as
    | 3   2 | | y |   | 5 |
    | 9  12 | | z | = | 3 |
Multiplying the second equation by 1/3 and subtracting it from the first equation results in
    | 3   2 | | y |   | 5 |
    | 0  −2 | | z | = | 4 |
This is an upper triangular system (denoted U) since all of the entries below the main diag-
onal are 0. This is easy to solve since the second equation can be divided by −2 to obtain the
value of z. The value of z can then be substituted into the first equation, which can then be
solved for y.
Another way to solve the original equations is to multiply the first equation by 6 and sub-
tract from the second equation. This results in
    | −9   0 | | y |   | −27 |
    |  9  12 | | z | = |   3 |
This is an example of a lower triangular (L) system since all values above the main diago-
nal are 0. A lower triangular system is just as easy to solve as an upper triangular system. Start
with the equation with one unknown, solve it, use this value in the equation with two unknowns,
and then solve for the remaining unknown. So if a system of equations can be converted into a
form that can be written in terms of triangular matrices, then the solution of the original system
can be easily found.
Most matrices can be written as a product of a lower triangular matrix and an upper trian-
gular matrix. When the lower triangular matrix has all diagonal elements equal to one, the fac-
torization is called the LU factorization or LU decomposition of the matrix. The LU
factorization is unique for a given matrix and can be stored compactly. Since the lower matrix
is defined to have all diagonal elements equal to one, only the non-zero off-diagonal values
in the lower matrix and the non-zero elements of the upper matrix need to be stored. Since this
requires no more storage than the original matrix, most numerical packages overwrite the origi-
nal matrix with the LU factorization. The original matrix above can be decomposed as follows:
    | 3   2 |   | 1  0 | | 3  2 |
    | 9  12 | = | 3  1 | | 0  6 |
So, solving a system of equations can be performed with the following three steps:

1. factor the matrix A into the product LU
2. solve Ly = b for y (forward substitution)
3. solve Ux = y for x (backward substitution)
The combination of steps two and three is called forward-backward substitution (FBS).
The phrase “solve a system” can be confusing since it can mean to completely solve the system
(including the calculation of L and U) or to just perform the forward-backward substitution after
LU factorization has been performed.
11.2 LU Factorization
Code to generate a LU factorization stored in compact form (the diagonal elements of L
are not stored) follows:
C COMPUTE LU FACTORIZATION
DO I = 1, N-1
C
C SCALE LOWER MATRIX COLUMN
DO J = I+1,N
A(J,I) = A(J,I) / A(I,I)
ENDDO
C
C RANK-ONE UPDATE OF MATRIX
DO J = I+1, N
DO K = I+1,N
A(K,J) = A(K,J) - A(I,J) * A(K,I)
ENDDO
ENDDO
ENDDO
For example, the system

3y + 2z = 5
9y + 6z = 15
does not have a unique solution. The system as written has an unlimited number of solutions
since the second equation is a multiple of the first equation. If the last element of the
right-hand-side were 11 instead of 15, then the system would not have a solution at all since
multiplication of the first equation by three and subtraction of the second equation results in
0 = 4! We’re really interested in matrices where a unique solution exists. The matrix A in these
systems is called nonsingular. If there is not a unique solution, then A is said to be singular. A
system is singular if and only if there is a zero on the diagonal of the upper matrix in the LU fac-
torization. Thus, if a zero appears on the diagonal, an error condition should be generated and
execution stopped.
11.2.2 Pivoting
Numerical analysts are concerned with the numerical accuracy of the solution. If comput-
ers performed operations with infinite precision, this wouldn’t be an issue. The problem with the
LU factorization is that each term of the lower matrix is created by dividing by a diagonal value.
If a diagonal term is much smaller than the entries under it, the L values can become unac-
ceptably large and cause incorrect results when systems are solved. If the diagonal value starts
with a value that is larger than all the entries in a column underneath it, then the division doesn’t
present a problem, since the results will all be less than one. So one way to improve the accuracy
of a solution is to reorder the equations to put large values on the diagonal. This can be achieved
using permutation matrices which interchange the rows and/or columns of a matrix.
An identity matrix is a matrix that contains ones on its main diagonal and zeros every-
where else. As you’d expect, multiplying a matrix by the identity matrix leaves the first matrix
unchanged. A permutation matrix is the identity matrix with rows reordered. So multiplying a
matrix by a non-identity permutation matrix results in a matrix that has the same elements as the
original matrix, but whose rows are reordered. So, for example, a permutation matrix, P, could
be applied to a system Ax = b to obtain PAx = Pb. This system would, of course, have the same
solution x as the original system, although the order of the operations might be different. Instead
The most frequently used approach for the permutation is obtained by a process known as
partial pivoting. At each iteration of the outer loop in the factorization, the current column in the
lower matrix is searched for the element that has maximum magnitude. Once found, the ele-
ments to the right of this element are swapped with the corresponding elements in the current
row. This is shown in Figure 11-2.
Figure 11-2 Partial pivoting: the element A(imax,i), with |A(imax,i)| ≥ |A(j,i)| for j ≥ i, is found and its row is swapped with row i.
C COMPUTE LU FACTORIZATION
DO I = 1, N-1
C
C FIND PIVOT ELEMENT
T=0
IPVT(I) = I
DO J = I,N
IF (ABS(A(J,I)) .GT. T) THEN
T = ABS(A(J,I))
IPVT(I) = J
ENDIF
ENDDO
C SWAP REMAINDER OF ROW
DO J = I,N
T = A(I,J)
A(I,J) = A(IPVT(I),J)
A(IPVT(I),J) = T
ENDDO
...
This is the algorithm used by the LINPACK routine DGEFA to factor a matrix. (The rou-
tine name stands for Double precision GEneral FActorization.) This is undoubtedly the most
commonly used routine from the LINPACK package. It is also the most time-consuming part of
the LINPACK benchmarks.
The number of floating-point operations performed by the factorization is approximately

    Σ (i = 1 to n−1) 2i^2  ≅  (2/3) n^3
Since this is an O(n^3) algorithm composed of rank-one updates, it should remind us of the
potential of matrix-matrix multiplication. However, in LINPACK, the inner loop of the rank-one
update is replaced by a call to the Level 1 BLAS DAXPY routine.
In Chapter 10, we examined DAXPY and showed that it’s not a very efficient routine for
computers that rely on cache to obtain good performance. After the LINPACK software was
released, it was apparent that algorithms need to be based on higher BLAS routines for good
performance on more types of computer platforms. We’ve already mentioned that the code uses
a rank-one update and so could be replaced by the Level 2 BLAS routine DGER. This would
help performance, but we’d really like to use a Level 3 BLAS routine to obtain the best perfor-
mance. Figure 11-3 shows two consecutive iterations of the outer loop. It consists of
the update ck,j = ck,j − ak,i bi,j using column ak,i and row bi,j, followed by the update
ck,j = ck,j − ak,i+1 bi+1,j using column ak,i+1 and row bi+1,j.
Note that all of the area updated in the second update is also updated in the first update. So
it should be possible to combine two iterations to perform a rank-two update or equivalently a
matrix-matrix multiply C = C − A × B where A is two columns wide. As usual, unroll and jam
techniques allow this to be achieved. Suppose we unroll the outer loop by two and jam the loops
together. Ignoring the cleanup step, the code appears as follows:
NEND = 2*((N-1)/2)
C COMPUTE LU FACTORIZATION
DO I = 1, NEND, 2
C
C UPDATE I, I+1 LOWER MATRIX COLUMNS
C SCALE LOWER MATRIX COLUMN
DO J = I+1,N
A(J,I) = A(J,I) / A(I,I)
ENDDO
C RANK-ONE UPDATE
DO J = I+1,N
A(J,I+1) = A(J,I+1) - A(I,I+1) * A(J,I)
ENDDO
C SCALE LOWER MATRIX COLUMN
DO J = I+2,N
A(J,I+1) = A(J,I+1) / A(I+1,I+1)
ENDDO
C
C UPDATE I, I+1 ROWS
DO J = I+2, N
A(I+1,J) = A(I+1,J) - A(I,J) * A(I+1,I)
ENDDO
C
C RANK-TWO UPDATE
DO J = I+2, N
DO K = I+2,N
A(K,J) = A(K,J) - A(I,J) * A(K,I) - A(I+1,J) * A(K,I+1)
ENDDO
ENDDO
ENDDO
This is shown in Figure 11-4. The code performs the following steps:
1. LU factorization on two columns starting at A(I,I) (columns ak,i and ak,i+1 in Figure 11-4)
2. update two rows starting at A(I,I+2) (this is really FBS on the rows bi,j and bi+1,j )
3. rank-two update starting at A(I+2,I+2) (matrix ck,j)
This process can continue for some arbitrary number, nb, of iterations to perform a
rank-nb update (or matrix-matrix multiply). Thus, we would have LU factorization on nb col-
umns, update nb rows, and perform a rank-nb update. Observe that the rank-nb update will be
the most efficient of the components and will execute at a high percentage of DGEMM peak per-
formance if the value of nb is large enough. However, if the value of nb is chosen too large, the
amount of time in the less efficient components will exceed the time in the rank-nb update. So
the choice of nb presents itself as a balancing act. nb should be large enough so that the rank-nb
update is efficient, but small enough that the percent of time in the other component is small. A
point of reference is that the default blocking value in many related LAPACK routines is 32 or
64.
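To make the rank-nb update concrete, it maps onto a single DGEMM call. The following fragment is a sketch (not the LAPACK source); it assumes I is the first column of the current block of width NB, and that A is stored column-major with leading dimension LDA:

C     RANK-NB UPDATE OF THE TRAILING SUBMATRIX:
C     A(I+NB:N,I+NB:N) = A(I+NB:N,I+NB:N)
C                        - A(I+NB:N,I:I+NB-1) * A(I:I+NB-1,I+NB:N)
      CALL DGEMM('N', 'N', N-I-NB+1, N-I-NB+1, NB, -1.0D0,
     $           A(I+NB,I), LDA, A(I,I+NB), LDA, 1.0D0,
     $           A(I+NB,I+NB), LDA)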
Figure 11-4 Blocked LU factorization: factorization of the column block (ak,i, ak,i+1), update of the rows bi,j and bi+1,j, and a rank-two update of the trailing submatrix ck,j.
But what about pivoting? We still need to perform pivoting, but now it must be performed
at the block level. Referring to Figure 11-4, inserting pivoting in the algorithm requires
This is the algorithm used by the routine DGETRF (Double precision GEneral TRiangular
Factorization) in the LAPACK software package. Using Level 3 BLAS routines in LAPACK
results in good performance across many types of computers. Obviously, not all of the work can
use Level 3 BLAS, but if nb is chosen correctly, most of it does.
If only DGEMM is optimized (and nothing else), the LAPACK routine DGETRF achieves
quite respectable performance. (Most vendors optimize DGETRF further, but we’ll ignore that
for now.) Table 11-1 compares performance on an HP N-4000 server using the original versions
of DGEFA and DGETRF using vendor optimized BLAS routines. Two problem sizes are shown:
200 equations and unknowns (in-cache) and 1000 equations and unknowns (out-of-cache). The
default block size of 64 was used.
Table 11-1 LINPACK DGEFA (using DAXPY) versus LAPACK DGETRF (using DGEMM) performance in Mflop/s.
For in-cache performance, DGEMM can run three times faster than DAXPY. Not all of the
time in DGETRF is spent in DGEMM, but enough is to make DGETRF run twice as fast as
DGEFA. The large, out-of-cache problems are where LAPACK really excels. Due to cache line
reuse, DGETRF performs over six times faster than DGEFA!
DO I = 1, N
SI = 0.0D0
DO J = 1, I-1
SJ = 0.0D0
DO K = 1, J-1
SJ = SJ + A(K,J) * A(K,I)
ENDDO
T = (A(J,I) - SJ) / A(J,J)
A(J,I) = T
SI = SI + T*T
ENDDO
A(I,I) = DSQRT(A(I,I) - SI)
ENDDO
Note that the size of the middle loop is a function of the outer loop. This reduces the num-
ber of floating-point calculations to about (1/3) n^3. Thus, the Cholesky factorization requires
only half the amount of storage and half the amount of calculations as LU factorization. Also,
the inner loop of the DPOFA routine uses a call to DDOT. This is more efficient than the call to
DAXPY used by DGEFA. The LAPACK routine DPOTRF (Double precision POsitive definite
TRiangular Factorization) implements a blocked version of DPOFA that uses a call to DGEMM
for the bulk of the computations.
Table 11-2 compares the performance of the LINPACK and LAPACK Cholesky factor-
ization routines. As before, only the BLAS routines were optimized and results were obtained
on an HP N-4000 server using the original versions of the LINPACK and LAPACK routines.
The default block size of 64 was used.
Table 11-2 LINPACK DPOFA (using DDOT) versus LAPACK DPOTRF (using DGEMM) performance in Mflop/s.
LIBRARY:   LINPACK   LAPACK   ScaLAPACK, PLAPACK
consists of multiple RHS vectors. Thus, X and B may also be very large matrices. This hasn’t
been an issue so far since the matrix factorization is independent of the number of solution vec-
tors.
C SOLVE L*Y = PB
DO K = 1, N-1
T = B(IPVT(K))
B(IPVT(K)) = B(K)
B(K) = T
DO I = K+1,N
B(I) = B(I) - T * A(I,K)
ENDDO
ENDDO
C SOLVE U*X = Y
DO K = N, 1, -1
B(K) = B(K)/A(K,K)
T = B(K)
DO I = 1,K-1
B(I) = B(I) - T * A(I,K)
ENDDO
ENDDO
The FBS calculations have O(n^2) data movements and calculations. Since the factorization
is an O(n^3) calculation, solving for a single solution vector is not as important as the factoriza-
tion, but we’d still like this to run as fast as possible. The code shown above is the algorithm
used by the LINPACK routine DGESL (Double precision GEneral SoLve). The inner loops are
replaced by calls to the Level 1 BLAS routine DAXPY. It would be better to use a Level 2 BLAS
routine for each of the steps. Since the inner loop is shrinking, this is not completely straight-
forward. Ignoring pivoting and using unroll and jam on the forward substitution step results in the
following:
NEND = 2*((N-1)/2)
C SOLVE L*Y = B
DO K = 1, NEND, 2
T1 = B(K)
B(K+1) = B(K+1) - T1 * A(K+1,K)
T2 = B(K+1)
C 2 COLUMN MATRIX VECTOR MULTIPLY
DO I = K+2,N
B(I) = B(I) - T1 * A(I,K) - T2 * A(I,K+1)
ENDDO
ENDDO
This is a matrix-vector multiply where the matrix has two columns. This process can con-
tinue up to some blocking factor, nb, and a call to DGEMV inserted. Replacing a call to DAXPY
by one to DGEMV also improves performance, as shown in the previous chapter. Blocking the
backward substitution step is very similar. When the resulting blocked code includes partial piv-
oting, it is the same as the algorithm used by the LAPACK routine DGETRS (Double precision
GEneral TRiangular Solve) using one RHS vector.
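A sketch of such a blocked forward substitution appears below (not the LAPACK code); pivoting is ignored, N is assumed to be a multiple of the block size NB, and the unit lower triangular factor is assumed to be stored below the diagonal of A:

      SUBROUTINE BLOCK_FORWARD(N, NB, A, LDA, B)
C     SKETCH: SOLVE L*Y = B IN PLACE, WHERE L IS UNIT LOWER TRIANGULAR
C     AND STORED BELOW THE DIAGONAL OF A.  A SHORT SCALAR LOOP HANDLES
C     EACH DIAGONAL BLOCK; THE UPDATE OF THE REMAINING RIGHT-HAND SIDE
C     IS A MATRIX-VECTOR MULTIPLY PERFORMED BY DGEMV.
      IMPLICIT NONE
      INTEGER N, NB, LDA, I, K, KB, KE
      REAL*8 A(LDA,*), B(*), T
      DO KB = 1, N, NB
         KE = KB + NB - 1
C        FORWARD SUBSTITUTION WITHIN THE DIAGONAL BLOCK
         DO K = KB, KE
            T = B(K)
            DO I = K+1, KE
               B(I) = B(I) - T * A(I,K)
            ENDDO
         ENDDO
C        B(KE+1:N) = B(KE+1:N) - A(KE+1:N,KB:KE) * B(KB:KE)
         IF (KE .LT. N) THEN
            CALL DGEMV('N', N-KE, NB, -1.0D0, A(KE+1,KB), LDA,
     $                 B(KB), 1, 1.0D0, B(KE+1), 1)
         ENDIF
      ENDDO
      END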
vectors allows DGEMM to be used in the LAPACK solution, and the rates are even better than
the LU factorization rates!
Table: LINPACK DGESL (DAXPY, 1 RHS), LAPACK DGETRS (DGEMV, 1 RHS), and LAPACK DGETRS (DGEMM, 100 RHS) performance in Mflop/s.
(Users beware: Mixing LINPACK and LAPACK factor and solve routines will usually cause
wrong answers since they use different permutation algorithms.)
and one to hold its column number. This is called the row and column index sparse matrix repre-
sentation. So the following matrix
11 0 13 14
0 22 23 0
31 32 33 0
41 0 0 44
can be represented by
IROW = 1 3 4 2 3 1 2 3 1 4
JCOL = 1 1 1 2 2 3 3 3 4 4
A= 11 31 41 22 32 13 23 33 14 44
If the number of nonzero elements of an n × n matrix is represented by nz, then this storage
scheme requires 3 nz elements. If the nonzero elements and indices all use eight-byte data, then
the nonzero elements require 8nz bytes and the indices require 16nz bytes. Thus, the amount of
storage required for the indices is twice that used for the nonzero elements. Clearly, the storage
of the JCOL values is not very efficient. Note the repetition of the column numbers. A smaller
array for JCOL could be used that just stores the location of the beginning index for each col-
umn’s data in the coefficient vector A. This creates the column pointer, row index sparse matrix
representation and is demonstrated below.
COLPTR = 1 4 6 9 11
IROW = 1 3 4 2 3 1 2 3 1 4
A= 11 31 41 22 32 13 23 33 14 44
This requires 2 nz+n+1 storage elements, which is a large improvement. Note that this
scheme has an equally efficient dual, the row pointer, column index sparse matrix representation, which
stores all column indices and a pointer to the start of each row's data.
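For the same matrix, this dual representation (using the illustrative array names ROWPTR and JCOL) is

ROWPTR = 1 4 6 9 11
JCOL   = 1 3 4 2 3 1 2 3 1 4
A      = 11 13 14 22 23 31 32 33 41 44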
ments in the factored matrix. In fact, you don’t even need to know any of the actual values in the
matrix to perform this analysis. You need to know only where the nonzero elements of A are
located. This adds a third step to the solution of the system. So before the factorization and FBS
are calculated, a symbolic factorization is obtained which determines how the matrix will be fac-
tored.
The choice of how to symbolically factor the matrix is crucial, since it determines how
large the factored matrix will be. Symbolic factorization determines the order in which the equa-
tions are eliminated, i.e., the equivalent of the permutation we discussed for dense systems.
There are many reordering schemes in use. Two popular ones are Multiple Minimum Degree
(MMD) [4] and METIS [7]. Different schemes are appropriate for different types of problems.
Some software packages even allow different reordering schemes to be performed and the one
with the least amount of fill-in is used during the factorization phase.
Once the symbolic factorization has been performed, the numerical factorization can
begin. Early algorithms performed updates based on a single column of data. These correspond
to using DAXPY in the factorization routine DGEFA. Better performance is obtained by operat-
ing on multiple columns of data. This is the concept behind designing algorithms to use supern-
odes—sets of columns with identical non-zero structure. By designing algorithms to generate
supernodes and using them in the matrix update, the routine DGEMM can be called to increase
performance well beyond what DAXPY can achieve. Therefore, even in the case of sparse sys-
tems of equations, the low-level calculations employ dense Level 3 BLAS routines.
So the first question that comes to mind is, “How many iterations are required?” When the
CG algorithm was derived, it was proven that the maximum number of iterations required to
obtain an exact solution for a system of n equations is n. This assumes infinite precision arith-
metic, though. (Recall that we wouldn’t need pivoting in the direct methods if we had infinite
precision arithmetic.) In practice, floating-point arithmetic can perturb the algorithm so that
incorrect answers can result. Let's suppose that the problem is well-behaved. What are the
implications of taking n iterations to obtain a solution?
The most computationally significant part of the CG algorithm is the matrix-vector multi-
ply. There are three parts which can be replaced by calls to DAXPY and two which can be
replaced by calls to DDOT, but these all have only O(n) operations and are insignificant com-
pared to the matrix-vector multiply. So if it takes n matrix-vector multiplications, then about 2n^3
operations are required. This is worse than the n^3 / 3 operations used by Cholesky factorization
discussed above. What makes it much worse is that large matrix-matrix multiplication can run
over six times faster than large matrix-vector multiplication, since blocking allows extensive
cache reuse for matrix-matrix multiplication. Thus, the CG algorithm might take
(6 × 2n^3) / (n^3 / 3) = 36 times longer than a direct method if n iterations are performed, so
what good is it? Clearly, it all comes down to the number of iterations required. If the number is
less than, say, n / 36, then it might be worth trying. In practice, the number of iterations required
is usually much less than n. However, the original CG may be slow to converge to a solution for
a particular problem. Lots of research has been done to accelerate the rate of convergence to a
good solution. The structures that accelerate iterative algorithms to reduce the number of itera-
tions are called preconditioners.
In the equations that describe the CG algorithm above, the copy at the beginning of the
outer loop may be replaced by a preconditioner. Preconditioners work by solving a system that
is similar to A, but whose solution is easier to obtain. So with each iteration they move closer to
the real solution than the unconditioned CG method. Of course, this helps only if the precondi-
tioner is computationally inexpensive and improves convergence. There are many different pre-
conditioners to choose from. Dongarra [3, p. 109] expresses it best: “The choice of a good
preconditioner is mainly a matter of trial and error, despite our knowledge of what a good pre-
conditioner should do.”
DO J=1,LASTROW-FIRSTROW+1
SUM = 0.D0
DO K=ROWSTR(J),ROWSTR(J+1)-1
SUM = SUM + A(K)*P(COLIDX(K))
ENDDO
W(J) = SUM
ENDDO
This code uses the row pointer, column index sparse matrix representation discussed ear-
lier. Note that the matrix is represented by the vector A, which is multiplied by elements from
the vector P. However, a gather using COLIDX must be performed for each access of P. Note that
COLIDX has the same number of elements as A. If each of these uses 64-bit data, the memory
bandwidth requirement is double that of just the nonzero elements contained in A. Observe that
ROWSTR and the output array W are both of size N.
For good performance on memory intensive problems, data must be prefetched into mem-
ory. Although the matrix A and the vector COLIDX can be easily prefetched, the vector P is a
problem. Techniques for prefetching gather/scatter operations were discussed in Chapter 5. As
noted there, to prefetch a gather usually requires a compiler directive or pragma to be effective.
A long loop length also helps. However, the size of the inner loop in sparse iterative solvers is
the number of nonzero elements in a column (or row). This tends to be small, perhaps only a
dozen or so. Thus, getting good performance from gather/scatters is made even more difficult.
At worst, performance is determined by the speed that data can be loaded from memory, which,
as we discussed, is poor.
Figure 11-6 Parallel partitioning for the CG algorithm.
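This row partitioning maps directly onto an OpenMP directive on the outer loop of the sparse matrix-vector multiply shown earlier; the following sketch adds only the directive and a private clause to that loop:

C$OMP PARALLEL DO PRIVATE(J,K,SUM)
      DO J = 1, LASTROW-FIRSTROW+1
         SUM = 0.D0
         DO K = ROWSTR(J), ROWSTR(J+1)-1
            SUM = SUM + A(K)*P(COLIDX(K))
         ENDDO
         W(J) = SUM
      ENDDO
C$OMP END PARALLEL DO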
Since performance is determined by the ability of the computer to get data from memory,
parallelism is limited by the memory bandwidth of a system. For example, suppose a computer
can deliver four GB/s of memory bandwidth and each processor can request and receive one
GB/s. Then the parallel performance is limited to four times the original performance no matter
how many processors are actually on the system. This can render most of the processors on a
system useless. For this reason, and because of the small amount of data communication required, clusters
of workstations using a low bandwidth interconnect and MPI for parallelism can be very effec-
tive for iterative algorithms. In this case, the vector P may be replicated in the memory of each
workstation. While this increases the memory requirements, the total amount of memory traffic
is reduced.
11.8 Summary
You’ve just had a very high level view of some techniques of numerical linear algebra.
Hopefully, you’ve gotten a flavor of how these routines can be optimized using high perfor-
mance building blocks and the techniques we discussed in Chapters 5 and 10. The linear algebra
routines for the BLAS, LINPACK, LAPACK and ScaLAPACK can all be found at
http://www.netlib.org/ . The routines comprising PLAPACK are located at
http://www.cs.utexas.edu/users/plapack/. So if your computer vendor doesn’t provide these, you can
“roll your own” to obtain good performance.
References:
The following texts are excellent resources for linear algebra algorithms:
1. Bailey, D.; Barszcz, E.; Barton, J.; Browning, D.; Carter, R.; Dagum, L.; Fatoohi, R.;
Fineberg, S.; Frederickson, P.; Lasinski, T.; Schreiber, R.; Simon, H.; Venkatakrishnan,
V.; Weeratunga, S. The NAS Parallel Benchmarks. RNR Technical Report RNR-94-007,
1994.
2. Demmel, J. W. Applied Numerical Linear Algebra. Philadelphia: SIAM, 1997, ISBN
0-89871-389-7.
3. Dongarra, J. J.; Duff, I. S.; Sorensen, D. C.; van der Vorst, H. A. Numerical Linear Alge-
bra for High-Performance Computers. Philadelphia: SIAM, 1998, ISBN 0-89871-428-1.
4. George, A.; Liu, J. W. H. The Evolution of the Minimum Degree Ordering Algorithm.
SIAM Review, Vol. 31, 1-19, 1989.
5. Golub, G. H.; Van Loan, C. F. Matrix Computations. Baltimore: Johns Hopkins, 1993,
ISBN 0-89871-414-1.
6. Hestenes, M. R; Stiefel, E. Methods of Conjugate Gradients for Solving Linear Systems. J.
Res. Nat. Bur. Standards, Vol. 49, 409-435, 1952.
7. Karypis, G.; Kumar, V. A Fast and High Quality Multilevel Scheme for Partitioning
Irregular Graphs. SIAM J. on Scientific Computing, Vol. 20, 359-392, 1998.
C H A P T E R 1 2
High Performance
Algorithms and Approaches
for Signal Processing
12.1 Introduction
Many signal and image processing applications generate large amounts of data which
must be analyzed to find the interesting information. This is often done using convolutions and
Discrete Fourier Transforms (DFTs) to filter the data. Some examples of their use are:
• To locate oil, petroleum companies set off explosions near potential deposits and then
measure the amount of time it takes for the shock waves to be picked up by sensors. The
sensors’ signals are processed to determine where to drill.
• Astronomers working with data obtained from telescopes try to locate objects such as
black holes by processing the image data to extract data in certain ranges.
• Fighter pilots need to be able to locate hostile missiles that are approaching. Data from
on-board sensors are processed to highlight threats.
• Images obtained from sensors need to be processed to match the resolution of different
output media, depending on whether the images are to be printed on paper or displayed on
computers. Appropriate filters are applied, depending on the situation.
As we shall see, there is a close relationship between convolutions and DFTs. It’s some-
what like the equivalence of matter and energy or the wave/particle duality. Convolutions can be
defined in terms of DFTs, and DFTs can be defined in terms of convolutions.
Because of the rich set of algorithms related to convolutions and FFTs, there are opportu-
nities to apply many of the techniques defined in earlier sections to improve performance. These
include:
Once again, we’re going to try to give an overview of some of the most important optimi-
zations in a very broad field. For more detailed information, see Charles Van Loan’s excellent
text on Fourier Transform algorithms [9].
Convolutions and correlations can be the highest performing algorithms on any architec-
ture since the floating-point operation to memory operation ratio can be made very large. This
ratio is limited only by the number of hardware registers.
Convolutions and correlations are defined as follows: Let {x0 ,...xm+n-1}, {w0 ,...wn-1},
{y0 ,...,ym-1} be vectors. y is the convolution of x and w if
    yk = yk + Σ (i = 0 to n−1) x(i+k) w(n−1−i),    k = 0, 1, ..., m − 1
Since the only difference between a convolution and correlation is the direction of access
for the w array, the correlation formulation will be used in the analysis below. We’ll sometimes
refer to x as the input vector, w as the weight vector, and y as the output vector. Note that the cor-
relation can be formulated as a matrix-vector multiplication:
    | x0      x1      x2      ...  x(n−1)   | | w0     |   | y0     |
    | x1      x2      x3      ...  xn       | | w1     |   | y1     |
    | x2      x3      x4      ...  x(n+1)   | | w2     | = | y2     |
    | ...                                   | | ...    |   | ...    |
    | x(m−1)  xm      x(m+1)  ...  x(m+n−1) | | w(n−1) |   | y(m−1) |
12.2.1 Code
The code to implement a correlation can be written as
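A straightforward version is sketched below (an illustration, assuming one-based Fortran arrays X of length at least M+N-1, W of length N, and Y of length M):

      DO I = 1, M
         DO J = 1, N
C           Y(I) ACCUMULATES THE DOT PRODUCT OF X(I:I+N-1) WITH W
            Y(I) = Y(I) + X(I+J-1) * W(J)
         ENDDO
      ENDDO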
There are two ways to think of the correlation. As written above, it is dot product based
since the inner loop can be replaced by a call to a dot product routine. However, by swapping the
loops, it becomes DAXPY based. The best choice between these two depends on the relative
sizes of m and n. If n is extremely short, the DAXPY version might have better performance than
the dot product version. In general, the dot product formulation is better. However, both of
these ignore the true potential of the kernel.
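As an illustration (a sketch, not the book's listing), a 2 × 2 unroll and jam of this loop nest, with the Y references hoisted and sunk, might look like the following; it performs eight floating-point operations on five loads in the inner loop:

      DO I = 1, M, 2
         Y1 = Y(I)
         Y2 = Y(I+1)
         DO J = 1, N, 2
C           FIVE LOADS AND EIGHT FLOATING-POINT OPERATIONS
            X1 = X(I+J-1)
            X2 = X(I+J)
            X3 = X(I+J+1)
            W1 = W(J)
            W2 = W(J+1)
            Y1 = Y1 + X1*W1 + X2*W2
            Y2 = Y2 + X2*W1 + X3*W2
         ENDDO
         Y(I)   = Y1
         Y(I+1) = Y2
      ENDDO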
The original formulation contained two loads and two floating-point operations (the refer-
ences to y having been hoisted and sunk). The unrolled code above has five loads and eight
floating-point operations in the main loops. By increasing the amount of unrolling on both
loops, the F:M ratio can be increased further. The best factors for m and n depend on the typical
sizes of m and n and the number of registers available. Too much unrolling will cause registers
to be spilled. Table 12-1 compares the operations using the original code and various unrolling
factors on both loops.
Table 12-1 Correlation Performance.

    I unroll   J unroll   Floating-point operations   Memory operations
    factor     factor     per iteration               per iteration       F:M ratio
       1          1                  2                        2             1.00
       2          2                  8                        5             1.60
       3          3                 18                        8             2.25
       4          4                 32                       13             2.46
    y(k1,k2) = y(k1,k2) + Σ (i1 = 0 to n1−1) Σ (i2 = 0 to n2−1) x(i1+k1, i2+k2) w(i1,i2),

    k1 = 0, 1, ..., m1 − 1,   k2 = 0, 1, ..., m2 − 1
12.3 DFTs/FFTs
The forward one-dimensional DFT is defined by

    yk = Σ (j = 0 to n−1) xj ωn^(jk),    k = 0, 1, ..., n − 1

where ωn is an nth root of one.
The inverse one-dimensional DFT is created by negating the sign on the exponent and
scaling, and is defined by
    yk = (1/n) Σ (j = 0 to n−1) xj ωn^(−jk),    k = 0, 1, ..., n − 1
Some definitions have the scaling by 1/n on the forward transform, some on the inverse
transform, and some scale both the forward and inverse transforms by 1/sqrt(n). Since the scal-
ing takes only n operations, it will be ignored in the analysis below.
DFTs were implemented on early computers, but their use was restricted by the large
number of computations required. To calculate a DFT of length n directly requires a matrix-vec-
tor multiplication.
    | ωn^0  ωn^0      ωn^0         ...  ωn^0             | | x0     |   | y0     |
    | ωn^0  ωn^1      ωn^2         ...  ωn^(n−1)         | | x1     |   | y1     |
    | ωn^0  ωn^2      ωn^4         ...  ωn^(2(n−1))      | | x2     | = | y2     |
    | ...                                                | | ...    |   | ...    |
    | ωn^0  ωn^(n−1)  ωn^(2(n−1))  ...  ωn^((n−1)(n−1))  | | x(n−1) |   | y(n−1) |
If the input data consists of complex numbers and the calculation of the powers of ωn is
ignored, 8n^2 real floating-point operations are required. Thus, a one-dimensional direct Fourier
Transform (FT) is an O(n^2) algorithm.
Fourier transforms are important since they allow data sequences to move from the time
domain to the frequency domain and vice versa. For example, suppose you record the tempera-
ture at noon every day for ten years. This is an example of a sequence that is a linear function of
time (days). The DFT of the input returns data that is a function of frequency (1/day). A filter
may be applied to this output sequence to remove noise. Subsequently, an inverse DFT may be
applied to return the data to the time domain.
In 1965 Cooley and Tukey [6] dramatically altered the nature of signal processing by
deriving the FFT algorithm that greatly reduced the number of operations required. For an input
size n that is a power of two, their algorithm takes only 5n log2(n) floating-point operations. (In general, FFTs will
be defined as techniques to implement one-dimensional DFTs with O(n log n) computations.)
The speedup by using a FFT instead of a direct FT is immense. Table 12-2 compares the
number of floating-point operations required for direct FTs and FFTs for powers of 32.
The larger the size of n, the larger the reduction. FFT algorithms have spawned thousands
of books and articles, a few of which are listed as references at the end of this chapter. Although
FFTs caused a revolution in scientific computing, some FFT algorithm steps have unfortunate
characteristics for high performance computers that include
has only two nonzero entries per row. Furthermore, each data point is naturally paired with
another point so that the operations done on these two points consist of a complex multiplica-
tion, a complex addition, and a complex subtraction. To calculate the number of floating-point
operations, note that there are log2(n) independent matrices or steps and each step accesses all n
points. For every two points in each step, there are three complex operations, or 10 float-
ing-point operations. Therefore, the operation count for the FFT is (10/2) n log2(n) = 5n log2(n).
For example, the DFT matrix for n = 4 is

    | ω4^0  ω4^0  ω4^0  ω4^0 |
    | ω4^0  ω4^1  ω4^2  ω4^3 |
    | ω4^0  ω4^2  ω4^4  ω4^6 |
    | ω4^0  ω4^3  ω4^6  ω4^9 |
However, by definition, ω4 is the fourth root of one, so ω4^4 = 1 = ω4^0. Some of the elements in
the matrix can be reduced by factoring out powers of ω4^4, resulting in
    | ω4^0  ω4^0  ω4^0  ω4^0 |
    | ω4^0  ω4^1  ω4^2  ω4^3 |
    | ω4^0  ω4^2  ω4^0  ω4^2 |
    | ω4^0  ω4^3  ω4^2  ω4^1 |
There are eight factorizations of this matrix or one of its permutations that are used by FFT algo-
rithms. Three factorizations of the matrix for n = 4 that result in the DFT are shown below.
    | ω4^0    0    ω4^0    0   |   | ω4^0  ω4^0    0      0   |   | x0 |   | y0 |
    |   0   ω4^0    0    ω4^1  |   | ω4^0  ω4^2    0      0   |   | x2 |   | y1 |
    | ω4^0    0    ω4^2    0   | × |   0     0    ω4^0  ω4^0  | × | x1 | = | y2 |
    |   0   ω4^0    0    ω4^3  |   |   0     0    ω4^0  ω4^2  |   | x3 |   | y3 |
    | ω4^0  ω4^0    0      0   |   | ω4^0    0    ω4^0    0   |   | x0 |   | y0 |
    | ω4^0  ω4^2    0      0   |   |   0   ω4^0    0    ω4^0  |   | x1 |   | y2 |
    |   0     0    ω4^0  ω4^1  | × | ω4^0    0    ω4^2    0   | × | x2 | = | y1 |
    |   0     0    ω4^0  ω4^3  |   |   0   ω4^0    0    ω4^2  |   | x3 |   | y3 |
    | ω4^0  ω4^0    0      0   |   | ω4^0    0    ω4^0    0   |   | x0 |   | y0 |
    |   0     0    ω4^0  ω4^1  |   |   0   ω4^0    0    ω4^0  |   | x1 |   | y1 |
    | ω4^0  ω4^2    0      0   | × | ω4^0    0    ω4^2    0   | × | x2 | = | y2 |
    |   0     0    ω4^0  ω4^3  |   |   0   ω4^0    0    ω4^2  |   | x3 |   | y3 |
In the first set of matrices, the input array was reordered by swapping xj with its
bit-reversed position. This is found by taking the index (zero-based) of the position, writing its
binary representation, reversing bits, and using this as the index to exchange with. For example,
if n = 32, the data at location 13 (binary 01101) must be swapped with the data at location
22 (binary 10110). The second set of matrices requires this permutation to be performed on the out-
put array. The third set does not require a permutation of the input or output data to obtain the
DFT.
The next few sections contrast the eight radix-2 sparse factorizations.
The radix-2 factorizations can be compared in terms of the following:

• input array
• work array
• array of powers of ωn^k (hereafter referred to as the trigonometric or trig array)
• permutation requirements
• stride on the arrays
• loop length
Ideally, an FFT implementation would have:

• no work array
• trig array loads hoisted outside the inner loop
• no permutation step
• all inner loop data accesses unit stride
• a long inner loop length
SUBROUTINE DIRECT(N,X,WORK,IS)
C DIRECT FOURIER TRANSFORM
C
C X IS THE ARRAY OF DATA TO BE TRANSFORMED
C N IS THE NUMBER OF DATA POINTS
C M IS THE POWER OF 2 SUCH THAT N = 2**M
C IS = 1 FOR FORWARD TRANSFORM
C IS = -1 FOR INVERSE TRANSFORM
IMPLICIT NONE
INTEGER N, I, J, IS
REAL TWOPI, ANGLE, AMULT
COMPLEX X(0:*), WORK(0:*), U
PARAMETER (TWOPI = -2.0 * 3.14159265358979323844)
DO I = 0,N-1
WORK(I) = CMPLX(0.0,0.0)
DO J = 0,N-1
ANGLE = (IS*TWOPI*I*J)/REAL(N)
U = CEXP(CMPLX(0.,ANGLE))
WORK(I) = WORK(I) + U * X(J)
ENDDO
ENDDO
IF (IS .EQ. 1) THEN
AMULT = 1.0
ELSE
AMULT = 1.0 / N
ENDIF
DO I = 0,N-1
X(I) = AMULT*WORK(I)
ENDDO
END
As shown in the sample factorizations for n = 4 above, some FFTs require the data to be
changed to bit-reversed order at the beginning or the end of processing. Following is a routine
that performs this operation:
SUBROUTINE BIT_REVERSE_ORDER(N,X)
IMPLICIT NONE
INTEGER N, J, I, M
COMPLEX X(*), TEMP
J=1
DO I = 1,N
IF (J .GT. I) THEN
TEMP = X(I)
X(I) = X(J)
X(J) = TEMP
ENDIF
M=N/2
100 IF ((M .GE. 2) .AND. (J .GT. M)) THEN
J=J-M
M=M/2
GOTO 100
ENDIF
J=J+M
ENDDO
END
    yk = Σ (j = 0 to n/2−1) x(2j) ωn^(2jk) + Σ (j = 0 to n/2−1) x(2j+1) ωn^((2j+1)k)
       = Σ (j = 0 to m−1) x(2j) ωm^(jk) + ωn^k Σ (j = 0 to m−1) x(2j+1) ωm^(jk),   where m = n/2
Note that the summations use only the even or odd values of x and that ωm^(jk) appears in both
terms. This splitting process continues for log2(n) steps to create the
Cooley-Tukey FFT. For n = 4, this is the first FFT factorization shown in Section 12.3.2. There
are three other splittings of the x vector that result in an FFT. In similar fashion, the y vector may
be split to produce the four DIF factorizations. These are dual transforms to the DIT transforms
and are also obtained by performing all the DIT computations in reverse order.
Since DIT factorizations split the x vector, they may also reorder (permute) x. As a result,
when a permutation of x is required for a DIT algorithm, it is the first step of these algorithms.
Therefore, when a permutation is required for a DIF algorithm, it must be the last step.
SUBROUTINE DIT(N,M,X,WORK,IS)
C DECIMATION-IN-TIME APPROACHES
C
C X IS THE ARRAY OF DATA TO BE TRANSFORMED
C N IS THE NUMBER OF DATA POINTS
C M IS THE POWER OF 2 SUCH THAT N = 2**M
C IS = 1 FOR FORWARD TRANSFORM
C IS = -1 FOR INVERSE TRANSFORM
IMPLICIT NONE
INTEGER M, N, IS, I, L, LS, NS, N2
REAL AMULT
COMPLEX X(*), WORK(*)
N2 = N / 2
DO L = 0, M-1
LS = 2**L
NS = N2 / LS
CALL FFT_DIT_APPROACHES (LS,NS,X,WORK,IS)
DO I = 1,N
X(I) = WORK(I)
ENDDO
ENDDO
IF (IS .EQ. 1) THEN
AMULT = 1.0
ELSE
AMULT = 1.0 / N
ENDIF
DO I = 1,N
X(I) = AMULT*X(I)
ENDDO
END
12.3.3.4 Cooley-Tukey
The Cooley-Tukey factorization started the revolution, so it’s useful to analyze in detail.
As discussed earlier, it requires a permutation of the input data before any calculations are
begun. The computational component of the code performs
Note that the X and Y arrays have the same size and are accessed in the same order. There-
fore, the Y array (i.e., the work array) is superfluous and can be replaced by the X array. The
inner loop starts at n/2 and shrinks by a factor of two with each of the log2 n steps. The X
accesses are not unit stride, though. To obtain this requires exchanging the loop indices, which
brings the trig array accesses into the inner loop. Now the size of the inner loop starts at one and
grows by a multiple of two with each of the log2 n steps. Regardless of whether the I loop or the
J loop is innermost, there will be steps with a very short inner loop length. The unit stride code
without a work array appears as follows:
COMPLEX X(LS,2,NS)
...
DO J = 1,NS
DO I = 1,LS
U = CEXP(CMPLX(0.,((I-1)*ANGLE)))
C = U * X(I,2,J)
X(I,2,J) = X(I,1,J) - C
X(I,1,J) = X(I,1,J) + C
ENDDO
ENDDO
12.3.3.5 Pease
The starting point for this algorithm is the code
Pease noticed that this FFT has the J and I indices adjacent in X and Y. Therefore, the two loops
can be combined into a single loop as
The advantage of this approach is that the loop length is always n/2. However, it requires a
work array, the accesses of X have a stride of two elements, and if there is a temporary trig array
to hold all the precomputed trig values, it will be of size (n/2) log2 n.
The Stockham code also has unit stride array accesses. The drawbacks are that there must
be a work array and the inner loop length decreases with each step. The transposed Stockham
algorithm is obtained by starting with
and exchanging the I and J loops. Note that the scalar U becomes a vector which must be loaded
in the inner loop.
The transposed Stockham code does not require a permutation and has unit data accesses,
but must load the trig coefficients in the inner loop. The inner loop size starts small and grows
with each step.
These characteristics are summarized below for the four DIT algorithms.

                               Cooley-Tukey   Pease   Stockham   Transposed Stockham
Permutation required?              yes         yes       no              no
Work array required?               no          yes       yes             yes
Constant inner loop length?        no          yes       no              no
Unit stride everywhere?            yes         no        yes             yes
Which of the above features are most important on modern computers? Software libraries
that include FFT routines contain extensive optimizations, but we’ll make some high-level
observations. The DIT algorithms above were implemented in routines very similar to the exam-
ples shown. The only difference was that the calculation of the trig array was performed in a
separate initialization step which was not included in the measurements. This is consistent with
how vendors implement production FFT routines. So, for example, the Stockham code appeared
as
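a sequence of radix-2 steps along the following lines (a sketch with assumed routine name and array shapes, not the measured listing itself; the twiddle array U comes from the separate initialization step):
      SUBROUTINE STOCK2(LS, NS, X, Y, U)
C     ONE RADIX-2 STOCKHAM DIT STEP (A SKETCH).  X AND Y PING-PONG
C     BETWEEN STEPS; LS DOUBLES AND NS HALVES WITH EACH STEP, AND
C     U HOLDS TWIDDLE FACTORS PRECOMPUTED IN THE INITIALIZATION STEP.
      IMPLICIT NONE
      INTEGER LS, NS, I, J
      COMPLEX X(NS,2,LS), Y(NS,LS,2), U(LS), C
      DO I = 1,LS
        DO J = 1,NS
          C        = U(I) * X(J,2,I)
          Y(J,I,1) = X(J,1,I) + C
          Y(J,I,2) = X(J,1,I) - C
        ENDDO
      ENDDO
      END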
This code was executed on a PA-8500 processor in an HP N-Class server for two data
sizes: 64 and 1024. These used COMPLEX*8 data and both sets of data fit in the one MB data
cache on this processor. Table 12-4 shows their performance.
What do we learn from the results? Since all of the data fits in the data cache, the existence
and size of the work array should not matter, nor should having the trig array values loaded in
the inner loop or unit stride accesses. After the first call, all of this data is in-cache waiting to be
accessed and no cache conflicts should occur since the data size is small. Indeed, none of these
issues appears to make a difference. However, a permutation step and a short inner loop length
are major inhibitors of good performance. Thus, the Cooley-Tukey algorithm, which requires
both a permutation and some short inner loop lengths, performs worse than the other routines.
The effect of the permutation is more noticeable in the larger size problem. In the Pease algo-
rithm, the advantage of a long loop length is offset by the disadvantage of performing a permu-
tation. The fact that the Stockham algorithm does not require a permutation step makes it very
popular. The disadvantage of requiring a work array of size n in the Pease and Stockham algo-
rithms is usually not a big drawback. On systems with robust memory systems such as vector
computers, memory bandwidth is large enough to support this extra traffic. On cache based sys-
tems, different techniques are used for problems that are in-cache and out-of-cache. For prob-
lems that are in-cache, the additional work array is an issue only for problems that are close to
exceeding the size of the cache.
The Stockham algorithm does have a noticeable advantage over the Pease algorithm regarding the
data requirements for the trig array. A radix-2 Stockham algorithm needs only a trig array of size
n/2, whereas the Pease algorithm needs to use (n/2) log2 n elements. For a problem of size 1024,
the Stockham routine requires 512 trig elements, while the Pease routine uses 5120 elements.
Thus for large problems, the large size of the trig array limits the usefulness of the Pease algo-
rithm. To increase the inner loop length in the Stockham algorithm, the Stockham and trans-
posed Stockham algorithms are sometimes coupled as follows [3]:
• Perform Stockham steps until the inner loop becomes very short
• Transpose the data
• Perform transposed Stockham steps
Thus, the loop length can be kept long at the expense of doing a transpose of the data. The
transpose can also be moved inside the first transposed Stockham step [10].
All the advantages/disadvantages discussed so far for the individual DIT algorithms are
shared by their DIF dual algorithms. Note that the DIF algorithms have the complex multiplica-
tion occurring as the last complex floating-point operation instead of the first one. DIT algo-
rithms are used more often than DIF algorithms, and a later optimization uses the fact that DIT
algorithms have the complex multiplication occurring before the other complex operations.
Later discussions often use the DIT Stockham code.
Recall that the Stockham algorithms treat the X and Y arrays as three-dimensional arrays. If two
consecutive radix-2 steps are combined in a single routine, they appear as follows:
By unrolling the first J loop by two and unrolling the second I loop by two and jamming
the resulting loops together, the four loops can be condensed back to two loops. The stores from
the first step map exactly to the loads in the second step, so these memory references can be
eliminated. The inner loop of this new kernel does four times the amount of work of the inner
loop of the radix-2 kernel. Thus, one would expect to need four complex multiplications and
eight complex additions. However, combining the steps allows one of the complex multiplica-
tions to be eliminated. Thus, the six associated real floating-point operations are eliminated and
so there are only 4.25n log2 n flops required. The forward radix-4 FFT routine is shown below.
      SUBROUTINE RADIX4(LS, NS, ANGLE, X, Y)
C     FORWARD RADIX-4 STOCKHAM DIT STEP
C     (THE SUBROUTINE WRAPPER AND DECLARATIONS ADDED HERE ARE ASSUMED)
      IMPLICIT NONE
      INTEGER LS, NS, I, J
      REAL ANGLE
      COMPLEX X(4*NS,LS), Y(NS,4*LS)
      COMPLEX U1, U2, U3, C0, C1, C2, C3, D0, D1, D2, D3
      DO I = 1,LS
        U1 = CEXP(CMPLX(0.,((I-1)*ANGLE)))
        U2 = U1 * U1
        U3 = U1 * U2
        DO J = 1,NS
          C0 = X(J,I)
          C1 = U1 * X(J+NS,I)
          C2 = U2 * X(J+2*NS,I)
          C3 = U3 * X(J+3*NS,I)
          D0 = C0 + C2
          D1 = C0 - C2
          D2 = C1 + C3
          D3 = CMPLX(0.,-1.)*(C1-C3)
          Y(J,I)      = D0 + D2
          Y(J,I+LS)   = D1 + D3
          Y(J,I+2*LS) = D0 - D2
          Y(J,I+3*LS) = D1 - D3
        ENDDO
      ENDDO
      END
Table 12-5 compares the number of operations for the two approaches. For the radix-4
kernel, the F:M ratio is 34:16. This process of using higher radices can continue, but as the radix
is increased, the number of floating-point registers required grows rapidly. A radix-8 algorithm
needs more floating-point registers than are available on most RISC processors. Only IA-64 pro-
cessors and some vector processors have enough registers to benefit from using a radix-8 kernel,
so the radix used should be carefully chosen.
                      Radix-2            Radix-4            Radix-8
                   Complex   Real     Complex   Real     Complex   Real
Loads                 2        4         4        8         8       16
Stores                2        4         4        8         8       16
Multiplications       1        4         3       12         7       28
Additions             2        6         8       22        16       46
Some processors use 64-bit floating-point registers that allow the left and right 32-bit
components of the register to be accessed independently. For these processors, COMPLEX*8
FFTs can be implemented using 64-bit loads and stores and 32-bit floating-point calculations.
This increases the F:M ratio by a factor of two.
The radix-4 Stockham DIT algorithm has many good characteristics and will be used in
the following sections.
      DO I = 1,LS
        U1R = COS((I-1)*ANGLE)
        U1I = SIN((I-1)*ANGLE)
        U2R = U1R*U1R - U1I*U1I
        U2I = 2*U1R*U1I
        U3R = U1R*U2R - U1I*U2I
        U3I = U1R*U2I + U1I*U2R
        DO J = 1,NS
          C0R = REAL(X(J,I))
          C0I = AIMAG(X(J,I))
          X1R = REAL(X(J+NS,I))
          X1I = AIMAG(X(J+NS,I))
          X2R = REAL(X(J+2*NS,I))
          X2I = AIMAG(X(J+2*NS,I))
          X3R = REAL(X(J+3*NS,I))
          X3I = AIMAG(X(J+3*NS,I))
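C         THE BUTTERFLY ITSELF, WRITTEN IN REAL ARITHMETIC, CAN BE
C         SKETCHED AS FOLLOWS (A RECONSTRUCTION CONSISTENT WITH THE
C         COMPLEX RADIX-4 CODE ABOVE AND WITH THE COUNT OF 12 REAL
C         MULTIPLICATIONS AND 22 REAL ADDITIONS QUOTED BELOW; THE
C         NAMES C1R THROUGH D3I ARE ASSUMED):
          C1R = U1R*X1R - U1I*X1I
          C1I = U1R*X1I + U1I*X1R
          C2R = U2R*X2R - U2I*X2I
          C2I = U2R*X2I + U2I*X2R
          C3R = U3R*X3R - U3I*X3I
          C3I = U3R*X3I + U3I*X3R
          D0R = C0R + C2R
          D0I = C0I + C2I
          D1R = C0R - C2R
          D1I = C0I - C2I
          D2R = C1R + C3R
          D2I = C1I + C3I
          D3R = C1I - C3I
          D3I = C3R - C1R
          Y(J,I)      = CMPLX(D0R+D2R, D0I+D2I)
          Y(J,I+LS)   = CMPLX(D1R+D3R, D1I+D3I)
          Y(J,I+2*LS) = CMPLX(D0R-D2R, D0I-D2I)
          Y(J,I+3*LS) = CMPLX(D1R-D3R, D1I-D3I)
        ENDDO
      ENDDO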
As shown above, this requires 12 real multiplications and 22 real additions. On computers
with fma instructions, only half of the multiplications can map naturally to fma instructions. So
there are six multiplications, six fma instructions, and 16 additions for a total of 28 instructions.
Note that there were only 16 memory operations, so this kernel has a very nice 28:16 F:M ratio.
Thus, the floating-point instructions dominate the memory instructions. Goedecker [7] found a
way to decrease the total number of floating-point instructions by increasing the number of fma
instructions.
FFT algorithms perform complex multiplications of an input array (xr , xi) by a trig array
consisting of complex (cosine(a), sine(a)) pairs. This multiplication produces
xr cosine(a) − xi sine(a)
xi cosine(a) + xr sine(a)
which is composed of a multiplication, fnma, multiplication, and fma. Note that the two multipli-
cations cannot be combined with additions to produce fma instructions. Goedecker realized that
by producing (cosine(a), sine(a)/cosine(a)) pairs, i.e., (cosine(a), tangent(a)), more fma instruc-
tions could be produced. The multiplication becomes
(xr − xi tangent(a)) cosine(a)
(xi + xr tangent(a)) cosine(a)
This still consists of a multiplication, fnma, multiplication, and fma, but now the two mul-
tiplication instructions are at the conclusion of the complex multiplication and can be fused with
later additions and subtractions. The (cosine(a), tangent(a)) pairs can be precomputed to replace
the (cosine(a), sine(a)) pairs or they can be recomputed with each step as the following code
shows:
      DO I = 1,LS
        U1R = COS((I-1)*ANGLE)
        U1I = SIN((I-1)*ANGLE)
        U2R = U1R*U1R - U1I*U1I
        U2I = 2*U1R*U1I
        U3R = U1R*U2R - U1I*U2I
        U3I = U1R*U2I + U1I*U2R
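C       A SKETCH OF HOW THE (COSINE, TANGENT) PAIRS MIGHT BE FORMED
C       FROM THE VALUES ABOVE AND USED IN THE TWIDDLE MULTIPLIES (THE
C       NAMES T1, T2, T3 ARE ASSUMED; THIS IS NOT THE FULL ORIGINAL
C       LISTING):
        T1 = U1I / U1R
        T2 = U2I / U2R
        T3 = U3I / U3R
C       EACH TWIDDLE MULTIPLY THEN TAKES THE FORM
C           C1R = U1R*(X1R - T1*X1I)
C           C1I = U1R*(X1I + T1*X1R)
C       THAT IS, AN FNMA AND AN FMA FOLLOWED BY MULTIPLICATIONS BY U1R
C       THAT CAN BE FUSED WITH THE ADDITIONS AND SUBTRACTIONS OF THE
C       RADIX-4 BUTTERFLY THAT FOLLOWS.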
There's only one nagging question about the above algorithm. What happens if one of the cosine
values is zero? The corresponding tangent would be infinite, and this would be problematic for later
computations. Strangely enough, using fma instructions in the code gets around this. There are
three trig values that could cause problems: u1r, u2r, u3r. However, u1r never takes on the value
zero since the values of the angle vary from 0 to π*(ls-1)/(2*ls). Fortunately, the cosine of these
values is always positive. u2r is a problem, though. It is calculated by squaring (u1r,u1i). When
u1r = √2/2 and u1i = -u1r, then u2r should equal zero. However, the calculation of u2r consists
of u1r*u1r - u1i*u1i (a multiplication and an fnma). Also, √2/2 cannot be exactly represented in
IEEE floating-point format. As discussed earlier, the fnma is performed with slightly higher pre-
cision than the multiplication. Therefore, the values of u1r*u1r and u1i*u1i are, literally, a bit dif-
ferent. Thus, the value of u2r is not exactly zero, and its tangent, while very large, is finite. However,
if these preliminary operations do not use an fnma instruction, u2r is exactly zero, its tangent is
infinite, undefined values result, and the algorithm will fail! Although the
number of floating-point operations performed actually increases to 44, the number of instruc-
tions decreases to 22 floating-point instructions, all of which are fma instructions.
It’s a good time to analyze the theoretical performance of some of the algorithms dis-
cussed. Suppose a processor has a floating-point functional unit that can perform one multiplica-
tion, one addition, or one fma per cycle. The theoretical peak for the processor is two flops per
cycle. (The Hewlett-Packard PA2.0 processors can perform double these rates, so scale accord-
ingly for them.) Using a Stockham radix-2 algorithm, the inner loop of size n/2 has 10 float-
ing-point operations. If these map to two multiplications, two fma instructions, and four
additions, the processor can achieve, at most, 10 operations in eight clock cycles, or 1.25 float-
ing-point operations per clock cycle. Thus, the algorithm runs at 1.25 / 2 = 62.5% of peak.
Another way to view this is that it requires (8/10) × 5n log2 n = 4n log2 n floating-point instruc-
tions.
A radix-4 Stockham algorithm requires 34 floating-point operations for an inner loop of
size n/4 that performs two radix-2 steps. Goedecker’s algorithm allows these to be done in 22
clocks. All power of two FFTs are assumed to use 5n log2 n operations, so the radix-4 kernel has
a theoretical peak of
40 operations per kernel / 22 clock cycles per kernel = 1.82 floating-point operations per clock cycle,
which is 1.82 / 2 = 91% of peak. In terms of instructions, it takes (22/40) × 5n log2 n = 2.75n log2 n
floating-point instructions to implement, which is quite an improvement.
Can we do better? Increasing the radix to higher powers of two such as radix-8 does
require fewer operations than the radix-4 code. However, these higher radices are not as amena-
ble to Goedecker’s treatment and it does not improve their performance. There are other tech-
niques [9] that have about 4n log2 n floating-point operations for power of two FFTs, but they
suffer from other inefficiencies such as non-unit data access patterns. The radix-4 algorithm
described above is a very efficient solution for high performance FFTs, especially when the data
fits in a processor’s data cache.
We’re always interested in knowing the theoretical minimum number of operations for an
algorithm. Comparing the radix-4 and radix-2 algorithms, it becomes apparent that the number
of complex additions remains the same, but there are fewer complex multiplications in the
radix-4 algorithm. Similarly, a radix-8 algorithm has fewer complex multiplications than a
radix-4 algorithm, but they have the same number of complex additions. So, suppose an FFT
algorithm was derived that contained no complex multiplications, but the same number of com-
plex additions as the radix-4 algorithm. This algorithm would require 2n log2 n floating-point
operations which would use 2n log2 n floating-point instructions. Goedecker’s algorithm
requires 2.75n log2 n instructions for the radix-4 algorithm, so there are, at most, 0.75n log2 n
instructions to be removed. The idea of eliminating all complex multiplications will be dis-
cussed further in Section 12.3.12 on polynomial transforms.
FFT algorithms can be derived for any prime number. Powers of three and five are very
common and these are nearly as efficient as power of two FFTs. Radix-3 and radix-5 algorithms
can also use Goedecker’s algorithm to improve performance. Composite numbers that can be
represented as n = 2^k × 3^l × 5^m can be calculated by performing k radix-2 steps, followed by l
radix-3 steps and m radix-5 steps. Table 12-6 shows the theoretical efficiency of these algo-
rithms.
                          Radix-2          Radix-3           Radix-4          Radix-5
                       Complex  Real    Complex  Real     Complex  Real    Complex  Real
Multiplications            1      4        4      12          3     12        12      32
Additions                  2      6        6      16          8     22        16      40
Flt.-pt. operations     5.00n log2 n    9.33n log3 n =     4.25n log2 n    14.40n log5 n =
                                        5.89n log2 n                        6.20n log2 n
Flt.-pt. instructions        6             16                 22              40
  using fma, fnma
Flt.-pt. instructions   3.00n log2 n    5.33n log3 n =     2.75n log2 n    8.00n log5 n =
                                        3.36n log2 n                        3.44n log2 n
y_k = \sum_{j=0}^{n-1} x_j \omega_n^{jk}, \qquad k = 0, 1, \ldots, n-1
Cosine is called an even function because cos(θ) = cos(−θ) for arbitrary angles θ. So, for
example, cos (2πjk / n) = cos (−2πjk / n). The midpoint of the DFT sequence occurs at k = n/2.
The value at this point is cos(2πj(n/2) / n) = cos(πj) = (−1)^j and the values of cosine are also symmetric about
this point. What this means is that the values from the cosine component
yr_k = \sum_{j=0}^{n-1} x_j \cos\!\left(\frac{2\pi jk}{n}\right), \qquad k = 0, 1, \ldots, n-1
will be symmetric about the midpoint n/2. Thus, yr(n/2 + j) = yr(n/2 − j) for any j < n/2.
Sine is called an odd function because sin(θ) = −sin(−θ). The values from the sine compo-
nent on one side of the midpoint n/2 are also the negative of the values on the other side. Thus, if
yi_k = \sum_{j=0}^{n-1} x_j \sin\!\left(\frac{2\pi jk}{n}\right), \qquad k = 0, 1, \ldots, n-1
yi(n/2 + j) = −yi(n/2 − j). The sine component is also special since the value at yi(0) = yi(n/2) =
0. Since the sine component is multiplied by the imaginary number i, the FFT yk of real data xj
has the following properties:
• y is conjugate symmetric (i.e., the values y(n/2 + j) are the complex conjugate of y(n/2 − j)
for j < n/2)
• y(0) and y(n/2) are real values.
The first property is especially useful since only the first n/2 + 1 points need to be calcu-
lated. The remaining points can be obtained by taking the complex conjugate of their corre-
sponding points. Just as the FFT of a real sequence is conjugate symmetric, the FFT of a
conjugate symmetric sequence is a real sequence. Since half of the input values (the imaginary
components) are zero and half of the output values are redundant, this suggests that the FFT of
real data may be able to exploit this unused space to reduce the number of operations. In fact, it can.
For the next examples, let x, y and z be complex vectors of length n with real components
xr, yr and zr and imaginary components xi, yi and zi. Let conjg(x) indicate the complex conju-
gate of x.
• Pack xrj and yrj into a complex array zj by mapping xrj to the real locations of zj and yrj to
the imaginary locations of zj , so zj = complex(xr j , yrj).
• Perform an in-place, length n FFT on zj.
• Unscramble zj as follows:
x0 = complex(zr0 , 0.0)
y0 = complex(zi0 , 0.0)
xj = (1/2) (zj + conjg(z n-j)), 1 ≤ j ≤ n/2
yj = (−i/2) (zj − conjg(zn-j)), 1 ≤ j ≤ n/2
• Copy to create the conjugate symmetric values:
xj = conjg(xn-j), (n/2 + 1) ≤ j ≤ n−1
yj = conjg(yn-j), (n/2 + 1) ≤ j ≤ n−1
This is preferable to performing one FFT at a time. Using a one-dimensional FFT takes
5n log2 n operations. Using the packed form takes 5n log2 n + 4n operations for two FFTs, so
there’s nearly a factor of two savings in operation count.
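As an illustration, the packing can be sketched in a few lines; the in-place complex FFT routine CFFT(N, Z, IS) called below is a hypothetical name, not a routine defined in this chapter.
      SUBROUTINE REAL2(N, XR, YR, X, Y, Z)
C     FFTS OF TWO REAL SEQUENCES XR AND YR OF LENGTH N COMPUTED WITH
C     ONE COMPLEX FFT.  CFFT IS AN ASSUMED IN-PLACE COMPLEX FFT.
      IMPLICIT NONE
      INTEGER N, J
      REAL XR(0:N-1), YR(0:N-1)
      COMPLEX X(0:N-1), Y(0:N-1), Z(0:N-1)
C     PACK XR INTO THE REAL PARTS AND YR INTO THE IMAGINARY PARTS
      DO J = 0,N-1
        Z(J) = CMPLX(XR(J), YR(J))
      ENDDO
C     ONE LENGTH-N COMPLEX FFT
      CALL CFFT(N, Z, 1)
C     UNSCRAMBLE
      X(0) = CMPLX(REAL(Z(0)), 0.0)
      Y(0) = CMPLX(AIMAG(Z(0)), 0.0)
      DO J = 1,N/2
        X(J) = 0.5*(Z(J) + CONJG(Z(N-J)))
        Y(J) = CMPLX(0.0,-0.5)*(Z(J) - CONJG(Z(N-J)))
      ENDDO
C     FILL IN THE CONJUGATE SYMMETRIC HALVES
      DO J = N/2+1,N-1
        X(J) = CONJG(X(N-J))
        Y(J) = CONJG(Y(N-J))
      ENDDO
      END
A single call to CFFT transforms both real sequences, which is where the nearly factor-of-two savings comes from.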
• Pack the real vector xrj of length n into a complex array zj of length n/2, with the even
values (base zero) mapping to the real locations, zrj , and the odd values mapping to the
imaginary locations, zij , so zj = complex(xr 2j , xr2j+1).
• Perform a length n/2, in-place, complex FFT treating the input as two real sequences as
shown above to produce two real vectors zrj and zij.
The number of operations for this approach is about (5/2)n log2 n + 4n. So once again, packing
the real data reduces the operation count to about one half of the original.
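The bulleted steps stop short of the final unscrambling. One standard way to complete the computation (not necessarily the exact formulation used here) separates the transforms of the even and odd elements and recombines them with twiddle factors; a sketch, again assuming the hypothetical CFFT routine and a forward convention of exp(-2πijk/n):
      SUBROUTINE REALFFT(N, XR, Y, Z)
C     FFT OF A REAL SEQUENCE XR OF LENGTH N (N EVEN) VIA ONE COMPLEX
C     FFT OF LENGTH N/2.  CFFT IS AN ASSUMED IN-PLACE COMPLEX FFT.
      IMPLICIT NONE
      INTEGER N, M, K
      REAL XR(0:N-1), PI
      COMPLEX Y(0:N-1), Z(0:N/2-1), ZE, ZO, W
      PI = 4.0*ATAN(1.0)
      M  = N/2
C     PACK EVEN ELEMENTS INTO REAL PARTS, ODD ELEMENTS INTO IMAGINARY
      DO K = 0,M-1
        Z(K) = CMPLX(XR(2*K), XR(2*K+1))
      ENDDO
      CALL CFFT(M, Z, 1)
C     SEPARATE THE TWO INTERLEAVED HALF-LENGTH TRANSFORMS AND APPLY
C     THE TWIDDLE FACTOR EXP(-2*PI*I*K/N) TO THE ODD-ELEMENT TRANSFORM
      DO K = 0,M-1
        ZE   = 0.5*(Z(K) + CONJG(Z(MOD(M-K,M))))
        ZO   = CMPLX(0.0,-0.5)*(Z(K) - CONJG(Z(MOD(M-K,M))))
        W    = CEXP(CMPLX(0.0,-2.0*PI*REAL(K)/REAL(N)))
        Y(K) = ZE + W*ZO
      ENDDO
C     Y(N/2) USES THE PERIODICITY OF ZE AND ZO AND A TWIDDLE OF -1
      ZE   = 0.5*(Z(0) + CONJG(Z(0)))
      ZO   = CMPLX(0.0,-0.5)*(Z(0) - CONJG(Z(0)))
      Y(M) = ZE - ZO
C     THE REMAINING VALUES FOLLOW FROM CONJUGATE SYMMETRY
      DO K = M+1,N-1
        Y(K) = CONJG(Y(N-K))
      ENDDO
      END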
12.3.8 Performance
We have discussed a lot of techniques for one-dimensional FFTs, but how good is the
actual performance? The above methods work well on problems that fit in data cache or execute
on vector computers. This is because, in both of these cases, there is sufficient bandwidth to sup-
port the memory requirements of the algorithms. Figure 12-1 shows the results of a radix-4 FFT
on an HP N-Class computer. What happens at 128 K? This is awful! Why does this occur, and what
can be done about it? At 128 K points the COMPLEX*8 data and its work array no longer fit in the
1 MB data cache, so cache misses dominate the run time; the techniques for large, out-of-cache
FFTs discussed below address the problem.
Thus, if you need to perform 32 FFTs of length 32, the inner loop length can remain 32
throughout the calculations instead of shrinking with each step. This is most attractive when the
problem fits into the data cache. Figure 12-2 shows how a grid may be blocked to ensure that the
data being processed is less than the cache size.
To help quantify the amount of memory traffic, define a memory_transfer as in Chapter 5 to
be the movement of a cache line between memory and cache. A load of out-of-cache data therefore
generates one memory_transfer, while a store to an out-of-cache line generates two: one to bring
the line into the cache and one to write it back to memory.
[Figure 12-2: An n1 × n2 grid divided into blocks, each smaller than the cache size, with an FFT performed on each block.]
Blocking is a very good approach since the data may miss cache to begin with, but it
remains in-cache until the data is completely processed. The load and store required for the FFT
calculations generate only two memory_transfers.
If the size of each FFT is larger than the data cache, then the techniques for large FFTs dis-
cussed later should be used.
12.3.9.2 Corresponding Data Points of Successive Data Sets Are Stored Contiguously (DFT of Each Row)
If the data fits in the data cache, using the number of FFTs as the inner loop length is very
attractive since data accesses are unit stride. Suppose multiple FFTs can fit in the data cache. An
efficient approach is to copy blocks of data into a temporary working area, perform simulta-
neous FFTs, and copy the data back as shown in Figure 12-3.
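The heart of this approach is a kernel in which the loop over the individual transforms is innermost and unit stride. A minimal sketch of one radix-2 Stockham step applied to NFFT row-stored sequences at once, with assumed routine name and array shapes:
      SUBROUTINE SIM2(NFFT, LS, NS, X, Y, U)
C     ONE RADIX-2 STOCKHAM STEP OVER NFFT INDEPENDENT SEQUENCES.  THE
C     INNERMOST LOOP RUNS OVER THE NFFT TRANSFORMS, IS UNIT STRIDE,
C     AND ITS LENGTH NEVER SHRINKS AS THE STEPS PROCEED.
      IMPLICIT NONE
      INTEGER NFFT, LS, NS, I, J, K
      COMPLEX X(NFFT,NS,2,LS), Y(NFFT,NS,LS,2), U(LS), C
      DO I = 1,LS
        DO J = 1,NS
          DO K = 1,NFFT
            C          = U(I) * X(K,J,2,I)
            Y(K,J,I,1) = X(K,J,1,I) + C
            Y(K,J,I,2) = X(K,J,1,I) - C
          ENDDO
        ENDDO
      ENDDO
      END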
[Figure 12-3: Blocks of the data are copied to a contiguous work area, simultaneous FFTs are performed, and the results are copied back.]
If multiple points for a single FFT map to the same location in the data cache, a popular
approach is to transpose the entire data set, perform one-dimensional FFTs of each column, and
transpose the data back. This is shown in Figure 12-4.
[Figure 12-4: The n1 × n2 data set is transposed, one-dimensional FFTs are performed on the columns, and the data is transposed back.]
Having to copy data for row simultaneous FFTs can be expensive. How many data
memory_transfers are required? In this approach, the data must be
1. Loaded from the original grid and stored to the working area
2. Loaded and stored (the FFT performed)
3. Loaded from the working area and stored to the original grid
If the data is not in-cache to begin with, step one generates three memory_transfers (one
for the load and two for the store). Step two generates two memory_transfers to support the FFT
calculations. Step three generates three memory_transfers as in step one. Thus, eight
memory_transfers are generated, as opposed to the corresponding column approach which gen-
erates only two memory_transfers.
One way to improve this is to choose a temporary working area that is substantially
smaller than the cache size. Then there is the increased probability that some of the data will be
retained in-cache between steps. For example, if the part of the original grid and the total work-
ing area don’t map to the same location in the cache, then, when the FFT is performed, the data
does not miss cache, which removes the two memory_transfers in step two. Likewise, the store
in step one and the load in step three can be eliminated, so the total number of memory_transfers may
be as low as two. Due to differences in cache architectures, some experimentation may be neces-
sary to choose the optimal blocking factor.
All of this effort to block for cache isn’t necessary on vector computers. Since row simul-
taneous FFTs can use the number of FFTs as the inner loop length and vector computers have
robust memory systems that can support the high memory bandwidth required, these FFT algo-
rithms run well on vector computers.
This is sometimes called the twiddle factor method or a row/column four-step approach.
There are two other ways to interpret the above four components. If a computer is more efficient
at performing DFTs of rows than columns, the row-oriented four-step approach may be used.
This approach is often used on vector processors since, as discussed earlier, they can perform
row simultaneous FFTs very efficiently.
If a computer is much more efficient at performing DFTs on columns of data, the six-step
algorithm below may be a good solution.
The six-step approach has all the FFTs operating on columns of data at the added expense
of two more transpose steps than the twiddle-factor approach. Computers that depend on caches
are great at operating on columns of data since an individual column usually fits in-cache. How-
ever, rows of data are more likely to exceed the cache size due to the large strides between indi-
vidual points of the FFT and may have poor performance due to cache thrashing.
When n is square, a square decomposition requires only the square root of n data points for
each individual FFT. For example, a size 4 M point FFT that uses eight-byte data for each point
requires 32 MB of data, which is larger than most data caches. Treating this as a square of size 2
K × 2 K means each small FFT requires only 32 KB of data, which easily fits into data cache on
most computers. If the problem size is so large that the small FFTs still don’t fit in-cache, the
four- or six-step algorithms can be applied recursively until the individual FFTs can fit in-cache.
Therefore, the number of cache misses is greatly reduced.
The six-step approach is also well-suited to parallelism. The most complicated part is
parallelizing the transposes, which was discussed in Chapter 10. The individual FFTs can be
performed in parallel, as can each column of the twiddle multiplication.
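As a sketch of the FFT portion of this parallelism (the routine and array names are assumptions, and CFFT is the hypothetical in-place complex FFT used earlier), an OpenMP parallel do over the independent column transforms is enough:
      SUBROUTINE PARFFT(M, K, X)
C     PERFORM THE K INDEPENDENT LENGTH-M COLUMN FFTS IN PARALLEL.
      IMPLICIT NONE
      INTEGER M, K, J
      COMPLEX X(M,K)
C$OMP PARALLEL DO PRIVATE(J)
      DO J = 1,K
        CALL CFFT(M, X(1,J), 1)
      ENDDO
C$OMP END PARALLEL DO
      END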
Figure 12-5 compares the twiddle, six-step, and a soon-to-be-discussed seven-step
method. All of the small simultaneous FFTs in these approaches use the Stockham autosort
algorithm. A blocked transpose is used for the transpose steps.
The twiddle approach is clearly better than the six-step approach. As usual, the goal for
large problems is to minimize the number of cache misses. How many memory_transfers are
generated and what can be done to improve them?
Although the individual FFTs may fit in-cache, many cache misses may occur in the other
steps. The six-step approach may be modified to a seven-step approach which has better cache
characteristics and requires less memory [11].
Let X be the input array of length n = k × m, and let Y be a work array of size n. Let U be
an array of size n that contains the twiddle factors. Using the earlier definition of
memory_transfer, a transpose or copy of X to Y causes three memory_transfers: one for the load
of X and two for the store of Y. This definition implies that the entire six-step approach causes 17
memory_transfers.
The twiddle method has fewer memory_transfers than the six-step method. Steps four
through six of the six-step method are the same as steps two through four of the twiddle method.
It’s hard to analyze step one of the twiddle method. For large powers of two, we’re guaranteed
that multiple points in the DFT map to the same cache line; therefore, the data for an FFT should
be copied to a contiguous work array, the FFT performed and the data copied back as shown in
Figure 12-3. This may take as many as eight memory_transfers (identical to the six-step
approach). If the work space can be chosen to be small enough, then all memory_transfers
except those associated with the work space may vanish. This leaves only the three
memory_transfers associated with X. Therefore, the twiddle approach takes somewhere between
12 and 17 memory_transfers.
Returning to step one of the twiddle approach: if the work space is ignored, the total number of
memory_transfers for the row FFTs is three, since it's guaranteed that the load of X will have
been knocked out-of-cache before the FFT of X can be stored. Thus, the twiddle method might
generate only 12 memory_transfers.
The goal is to decrease the number of memory_transfers. If the transpose can be done
in-place, the Y array is eliminated, and the number of memory_transfers decreases. If n = k²,
then this is feasible. Table 12-7 shows the number of memory_transfers for both approaches.
                       Six-step with work array Y                 In-place six-step
Step                 Data movement   Memory_transfers      Data movement   Memory_transfers
Total                                       17                                    13
This is substantially better than the first approach since it generates 13 memory_transfers
instead of 17. It also reduces the memory space requirements by a third. The general case of the
product of powers of two, three and five is explored by Wadleigh [11], but we’ll just analyze
powers of two. Note that n = 2^(2k+p), where p is zero or one and k is an integer.
Case 1, p = 0 (n = 2^(2k))   Treat the data as a square array of size s × s where s = 2^k. To
transpose the array X requires only swapping diagonals across the main diagonal, as shown in
Chapter 10.
This transpose may be blocked for good cache line reuse and does not require a work
array. The six-step approach can be performed as shown in Table 12-7 with Xi,j being overwrit-
ten in each step.
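A minimal, unblocked sketch of such an in-place transpose is just an element swap across the main diagonal; the blocked, diagonal-oriented version of Chapter 10 produces the same result with better cache line reuse.
      SUBROUTINE TRANSP(S, X)
C     IN-PLACE TRANSPOSE OF AN S X S COMPLEX ARRAY BY SWAPPING EACH
C     ELEMENT BELOW THE MAIN DIAGONAL WITH ITS MIRROR ABOVE IT.
      IMPLICIT NONE
      INTEGER S, I, J
      COMPLEX X(S,S), T
      DO J = 1,S
        DO I = J+1,S
          T      = X(I,J)
          X(I,J) = X(J,I)
          X(J,I) = T
        ENDDO
      ENDDO
      END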
Case 2, p = 1 (n = 2 × 2^(2k))   The data may be decomposed as n = s × t × t where s = 2
and t = 2^k. Thus, the problem is composed of two squares of size t × t. In the previous section,
we were able to transpose the data in-place by exchanging a diagonal below the main diagonal
with its corresponding diagonal above the main diagonal. For non-square matrices, there is no
main diagonal and this approach is not possible. For example, if the matrix is of size 2t × t, it’s
not clear with which element (t+2, 1) should be swapped. It certainly isn't with element (1, t+2), since this
location isn’t even defined. Note that the first and third steps don’t need to be complete trans-
poses, though. In the first step, treat each of the squares separately and swap points across their
main diagonals. Perform the second step simultaneous FFTs on this array. For the third step,
repeat the swapping operations of the first step. At the conclusion of the third step, the data is
back in proper order for the remaining steps. The fourth and fifth steps may proceed as before.
The sixth step is a problem since it must be a complete transpose. To accomplish this, break the
transform into two parts. Let r1 = t, and r2 = t. The goal is to transpose X(s × r1 , r2) to
X(r2 , s × r1 ). Treat the data as three-dimensional arrays. The data X(s × r1 , r2) = X(r1 , s , r2)
must be transposed to X(r2 , s, r1) = X( r2 , s × r1 ). The transpose may be achieved by:
The first transpose operates on a column of data at a time and the second transpose is the
same operation as the partial transposes of steps one and three. The first transpose is problematic
since it is difficult to perform in-place. However, using a work array Y of size s × r1 allows this
transpose to be implemented by
This work area is small and fits into the cache. Since it will stay in the cache for each col-
umn of X, the work array Y can be ignored in terms of cache memory_transfers. The two trans-
poses above complete the transform. The number of steps has increased from six to seven, but
the number of memory_transfers is the same in the square six-step approach and the seven-step
approach.
Refinements The fourth through sixth steps should be considered together. Since a
work array Y of size s × t is used in step six, these steps can be simplified by operating on a col-
umn of data at a time. The output from the twiddle multiplication is stored to the work area of
size s × t. The FFT of size s × t is performed on this data. The first part of the complete transpose
is performed and the result is stored in X. In step four, a column of the arrays U and X is loaded
and the results are stored to the small column array Y. This results in two memory_transfers: one
for X and one for U. Since Y remains in-cache as the columns are processed from X and U, it
does not contribute any additional memory_transfers. Step five does not introduce any
memory_transfers since Y is in-cache. Step six assumes that Y is in-cache and the column of X
being processed is still in-cache from step four. Therefore, the one memory_transfer in step six
is caused by storing X to memory. Thus, an FFT of length n can be performed with only 11
memory_transfers, as shown in Table 12-8.
5. t FFTs of length s × t                    Yi → Yi                     0
Total                                                                   11
The advantage of the seven-step approach is shown in Figure 12-5. Note that once the
problem size reaches 8 M points, more cache misses start occurring. A larger data cache would
move this degradation so that it occurs at a larger point. All of the approaches will have a similar
degradation in performance at some point. These techniques can be applied recursively to mini-
mize this.
y_{k_1,k_2} = \sum_{j_1=0}^{n_1-1} \sum_{j_2=0}^{n_2-1} x_{j_1,j_2}\, \omega_{n_1}^{j_1 k_1}\, \omega_{n_2}^{j_2 k_2}, \qquad k_1 = 0, 1, \ldots, n_1-1,\; k_2 = 0, 1, \ldots, n_2-1
Computing the two-dimensional DFT as one-dimensional DFTs of each row and each column requires
n1 × ops1d(n2) + n2 × ops1d(n1)
operations, where ops1d(k) is the operation count of a one-dimensional DFT of length k.
Thus, when n1 and n2 are powers of two and both equal to n, there are 10n² log2 n operations.
The polynomial transform requires a complex addition and subtraction for each point, but
no multiplication. It produces 2n² log2 n floating-point operations. The two complex multiplica-
tions produce 2 × 6n² real operations and the FFTs require 5n² log2 n floating-point operations.
Therefore, the number of operations is 7n² log2 n + 12n² floating-point operations. At what
point is it advantageous to use this approach? Mapping to one-dimensional FFTs takes 10n² log2 n
operations, so for problems where n > 16, the polynomial approach should require fewer float-
ing-point operations and hence perform better. Or should it?
We’ve gone to great pains to optimize one-dimensional FFTs. What happens if we look at
the number of instructions required instead of the number of floating-point operations? Using
fma instructions in the complex multiplications generates 2 × 4n² instructions. The one-dimen-
sional FFTs generate 2.75n² log2 n instructions using a radix-4 kernel with Goedecker's optimi-
zation, while the polynomial transform generates 2n² log2 n instructions. Now we must check
when the polynomial transform approach has the same number of instructions as the one-dimen-
sional DFT approach, i.e., when
2n² log2 n + 8n² + 2.75n² log2 n = 2 × (2.75n² log2 n)
The break-even point now occurs when n = 1625. This requires over 21 MB using eight-byte
data and will exceed the size of most data caches. Also, the permutation required in the last step
wreaks havoc with good cache management schemes, so this approach is not a good one for
cache based computers.
y_{k_1,k_2,k_3} = \sum_{j_1=0}^{n_1-1} \sum_{j_2=0}^{n_2-1} \sum_{j_3=0}^{n_3-1} x_{j_1,j_2,j_3}\, \omega_{n_1}^{j_1 k_1}\, \omega_{n_2}^{j_2 k_2}\, \omega_{n_3}^{j_3 k_3}, \qquad k_i = 0, 1, \ldots, n_i-1 \;\; (i = 1, 2, 3)
Thus, when n1, n2, and n3 are powers of two and all equal to n, there are 15n³ log2 n operations.
[Figure: One-dimensional DFTs are performed along each of the n1, n2, and n3 dimensions of the three-dimensional data set.]
The previously defined techniques for simultaneous FFTs can be applied. The real ques-
tion is when to copy data to a work array whose size is smaller than cache. Cases to consider
include:
1. n2 × n3 FFTs of size n1
2. transpose n1 × n2 × n3 to n2 × n3 × n1
3. n1 × n3 FFTs of size n2
4. transpose n2 × n3 × n1 to n3 × n1 × n2
5. n1 × n2 FFTs of size n3
6. transpose n3 × n1 × n2 to n1 × n2 × n3
y = last m elements of (inverse DFT { forward DFT (x) × forward DFT (w) } )
where the DFTs are of length m+n−1 and only the first m elements of the inverse DFT are
retained for y.
Thus, three DFTs and a pointwise multiplication perform the convolution. Similarly, the
correlation of x and w can be calculated by accessing the w array backwards as follows:
Due to this relationship, large convolutions and correlations are frequently performed
using Fourier transforms. However, this technique is not without cost.
1,024,000 floating-point operations while the FFT approach takes only 552,921 floating-point
operations, so the FFT approach looks very attractive.
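To make the three-DFT approach concrete, here is a sketch under the same assumptions as before: the hypothetical CFFT routine (with IS = -1 performing the inverse transform, including its 1/L scaling) and a padded length L chosen by the caller with L >= m+n-1.
      SUBROUTINE FFTCONV(M, N, L, X, W, XP, WP)
C     CONVOLVE X (LENGTH M) WITH W (LENGTH N) USING THREE DFTS OF
C     LENGTH L >= M+N-1 AND A POINTWISE MULTIPLY.  XP AND WP ARE
C     WORK ARRAYS OF LENGTH L; XP RETURNS THE LINEAR CONVOLUTION.
      IMPLICIT NONE
      INTEGER M, N, L, J
      COMPLEX X(M), W(N), XP(L), WP(L)
C     ZERO-PAD BOTH SEQUENCES TO LENGTH L
      DO J = 1,L
        XP(J) = (0.0,0.0)
        WP(J) = (0.0,0.0)
      ENDDO
      DO J = 1,M
        XP(J) = X(J)
      ENDDO
      DO J = 1,N
        WP(J) = W(J)
      ENDDO
C     FORWARD DFTS, POINTWISE MULTIPLY, INVERSE DFT
      CALL CFFT(L, XP, 1)
      CALL CFFT(L, WP, 1)
      DO J = 1,L
        XP(J) = XP(J) * WP(J)
      ENDDO
      CALL CFFT(L, XP, -1)
      END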
Convolutions are also used to filter two-dimensional objects such as images. Assume the
weight array, w, is an m × m grid and the input array is of size n × n. The direct approach takes
2m²n² operations. If m is a power of two plus one, the worst case padding for the FFT approach
takes 3 × 10 (2m+2n−3)² log2 (2m+2n−3) + 8 (2m+2n−3)² operations. A common size for n is
1024. When m < 27, the direct approach is probably faster for this problem size, but larger val-
ues of m may benefit by using the DFT approach.
Thus, the DFT is defined in terms of a complex convolution and two complex vector mul-
tiplications. Performing the convolution and three vector multiplications directly would take
(8)(2n)(n) + (3)(8n) = 16n² + 24n floating-point operations. However, we know we can use
FFTs to calculate the convolution by bumping the problem size up to the next power of two and
performing radix-2 FFTs. So we can calculate an arbitrary length FFT by mapping it to a convo-
lution, which is, in turn, mapped to three FFTs. The worst case for the FFTs occurs when the
length 2n convolution is two more than a power of two. This takes
(3)(5)(4n−4) × log2 (4n−4) + 8 (2n+4n−4) floating-point operations. However, the break-even
point for this case is only 26, so using three FFTs is the best approach for even fairly small prob-
lems.
12.5 Summary
FFTs and convolutions give rise to fascinating algorithms which are interconnected at a
deep mathematical level. Using the techniques developed in previous chapters allowed us to
optimize some of these algorithms until they execute close to the peak processor performance.
We also explored algorithmic optimizations that reduce the order of the calculations so the run
time required is a tiny fraction of the original time.
References:
9. Van Loan, C. Computational Frameworks for the Fast Fourier Transform. Philadelphia:
SIAM, 1992, ISBN 0-89871-285-8.
10. Wadleigh, K. R.; Gostin, G.B.; Liu, J. High-Performance FFT Algorithms for the Convex
C4/XA Supercomputer. J. Supercomputing, Vol. 9, 163-178, 1995.
11. Wadleigh, K. R. High Performance FFT Algorithms for Cache-Coherent Multiproces-
sors. Int. J. of High Performance Computing Applications, Vol. 2, 163-171, 1999.