CRAY T3E C and C++ Optimization Guide
Version 3.2, January 1999
004–2178–002
Copyright © 1997, 1999 Cray Research, Inc. All Rights Reserved. This manual or parts thereof may not be reproduced in any
form unless permitted by contract or by written permission of Cray Research, Inc.
Use, duplication, or disclosure by the Government is subject to restrictions as set forth in the Rights in Data clause at FAR
52.227-14 and/or in similar or successor clauses in the FAR, or in the DOD or NASA FAR Supplement. Unpublished rights
reserved under the Copyright Laws of the United States. Contractor/manufacturer is Silicon Graphics, Inc., 2011 N. Shoreline
Blvd., Mountain View, CA 94043-1389.
Autotasking, CF77, CRAY, Cray Ada, CraySoft, CRAY Y-MP, CRAY-1, CRInform, CRI/TurboKiva, HSX, LibSci, MPP Apprentice,
SSD, SUPERCLUSTER, UNICOS, and X-MP EA are federally registered trademarks and Because no workstation is an island, CCI,
CCMT, CF90, CFT, CFT2, CFT77, ConCurrent Maintenance Tools, COS, Cray Animation Theater, CRAY APP, CRAY C90,
CRAY C90D, Cray C++ Compiling System, CrayDoc, CRAY EL, CRAY J90, CRAY J90se, CrayLink, Cray NQS,
Cray/REELlibrarian, CRAY S-MP, CRAY SSD-T90, CRAY SV1, CRAY T90, CRAY T3D, CRAY T3E, CrayTutor, CRAY X-MP,
CRAY XMS, CRAY-2, CSIM, CVT, Delivering the power . . ., DGauss, Docview, EMDS, GigaRing, HEXAR, IOS,
ND Series Network Disk Array, Network Queuing Environment, Network Queuing Tools, OLNET, RQS, SEGLDR, SMARTE,
SUPERLINK, System Maintenance and Remote Testing Environment, Trusted UNICOS, UNICOS MAX, and UNICOS/mk are
trademarks of Cray Research, Inc., a wholly owned subsidiary of Silicon Graphics, Inc.
AMPEX and DST are trademarks of Ampex Corporation. DEC is a trademark of Digital Equipment Corporation. DLT is a
trademark of Quantum Corporation. EXABYTE is a trademark of EXABYTE Corporation. IBM and Magstar are trademarks of
International Business Machines Corporation. STK, TimberLine, RedWood, and WolfCreek are trademarks of Storage Technology
Corporation. Silicon Graphics and the Silicon Graphics logo are registered trademarks of Silicon Graphics, Inc. UNIX is a
registered trademark in the United States and other countries, licensed exclusively through X/Open Company Limited. X/Open
is a registered trademark, and the X device is a trademark, of X/Open Company Ltd.
The UNICOS operating system is derived from UNIX® System V. The UNICOS operating system is also based in part on the
Fourth Berkeley Software Distribution (BSD) under license from The Regents of the University of California.
Record of Revision
Version Description
Contents
Page
Preface xi
Related Publications . . . . . . . . . . . . . . . . . . . . . . . xi
Obtaining Publications . . . . . . . . . . . . . . . . . . . . . . . xi
Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . xii
Reader Comments . . . . . . . . . . . . . . . . . . . . . . . . xiii
SHMEM [3] 45
Using shmem_get and shmem_put for data transfer . . . . . . . . . . . . . 46
Example 8: Example of a shmem_put transfer . . . . . . . . . . . . . . 46
Optimizing Existing MPI and PVM Programs by Using SHMEM . . . . . . . . . . 51
Example 9: MPI version of the ring program . . . . . . . . . . . . . . 51
Optimizing by Using shmem_get . . . . . . . . . . . . . . . . . . 55
Example 10: shmem_get version of the ring program . . . . . . . . . . . 55
Optimizing by Using shmem_put . . . . . . . . . . . . . . . . . . 58
Example 11: shmem_put version of the ring program . . . . . . . . . . . 58
Passing 32-bit Data . . . . . . . . . . . . . . . . . . . . . . . . 59
Example 12: 32-bit version of ring program . . . . . . . . . . . . . . . 60
Index 153
Figures
Figure 1. Data transfer comparison . . . . . . . . . . . . . . . . . . 3
Figure 2. Position of E registers . . . . . . . . . . . . . . . . . . . 5
Figure 3. Flow of data on a CRAY T3E node . . . . . . . . . . . . . . . 7
Figure 4. Data flow on the EV5 microprocessor . . . . . . . . . . . . . . 8
Figure 5. First value reaches the microprocessor . . . . . . . . . . . . . . 10
Figure 6. Ninth value reaches the microprocessor . . . . . . . . . . . . . . 12
Figure 7. Output stream . . . . . . . . . . . . . . . . . . . . . . 14
Figure 8. An external GigaRing network . . . . . . . . . . . . . . . . . 16
Figure 9. Fan-out method used by broadcasting routines . . . . . . . . . . . . 34
Figure 10. A PvmMax reduction . . . . . . . . . . . . . . . . . . . 39
Figure 11. The gather/scatter process . . . . . . . . . . . . . . . . . . 41
Figure 12. shmem_put64 data transfer . . . . . . . . . . . . . . . . . 48
Figure 13. Identification of neighbors in the ring program. . . . . . . . . . . . 54
Figure 14. shmem_iget and shmem_iput transfers . . . . . . . . . . . . . 66
Figure 15. Reordering elements during a scatter operation . . . . . . . . . . . 70
Figure 16. shmem_broadcast operation . . . . . . . . . . . . . . . . 73
Figure 17. Results of a shmem_fcollect . . . . . . . . . . . . . . . . 76
Figure 18. Results of a shmem_double_min_to_all . . . . . . . . . . . . 81
Figure 19. Overlapped iterations . . . . . . . . . . . . . . . . . . . 90
Figure 20. Pipelining a loop with multiplications . . . . . . . . . . . . . . 91
Figure 21. Before and after array a has been optimized . . . . . . . . . . . . 94
Figure 22. Multiple PEs using a single file . . . . . . . . . . . . . . . . 114
Figure 23. Multiple PEs and multiple files . . . . . . . . . . . . . . . . 117
Figure 24. I/O to and from a single PE . . . . . . . . . . . . . . . . . 118
Figure 25. Data paths between disk and an array . . . . . . . . . . . . . . 120
Tables
Table 1. Latencies and bandwidths for data cache access . . . . . . . . . . . . 20
Table 2. Latencies and bandwidths for access that does not hit cache . . . . . . . . 20
Table 3. Functional unit . . . . . . . . . . . . . . . . . . . . . . 92
Preface
This publication documents optimization options for the Cray C and C++
compilers running on CRAY T3E systems.
Related Publications
The following documents contain additional information that may be helpful:
• Cray C/C++ Reference Manual, publication SR–2179
Obtaining Publications
The User Publications Catalog describes the availability and content of all Cray
Research hardware and software documents that are available to customers.
Customers who subscribe to the Cray Inform (CRInform) program can access
this information on the CRInform system.
To order a document, call +1 651 683 5907. Silicon Graphics employees may
send electronic mail to [email protected] (UNIX system users).
Customers who subscribe to the CRInform program can order software release
packages electronically by using the Order Cray Software option.
Customers outside of the United States and Canada should contact their local
service organization for ordering and documentation information.
Conventions
The following conventions are used throughout this document:
Convention Meaning
command Denotes a command, library routine or function,
system call, part of an application program,
program output, or anything else that might
appear on your screen.
manpage(x) Man page section identifiers appear in
parentheses after man page names. The following
list describes the identifiers:
1 User commands
1B User commands ported from BSD
2 System calls
3 Library routines, macros, and
opdefs
4 Devices (special files)
4P Protocols
5 File formats
7 Miscellaneous topics
7D DWB-related information
8 Administrator commands
Some internal routines (for example, the
_assign_asgcmd_info() routine) do not have
man pages associated with them.
variable Italic typeface denotes variable entries and words
or concepts being defined.
user input This bold, fixed-space font denotes literal items
that the user enters in interactive sessions.
Output is shown in nonbold, fixed-space font.
[] Brackets enclose optional portions of a command
or directive line.
The following machine naming conventions are used in this document:
Term Definition
Cray PVP systems All configurations of Cray parallel vector
processing (PVP) systems.
Cray MPP systems All configurations of the CRAY T3E series.
Reader Comments
If you have comments about the technical accuracy, content, or organization of
this document, please tell us. Be sure to include the title and part number of
the document with your comments.
You can contact us in any of the following ways:
• Send electronic mail to the following address:
[email protected]
Background Information [1]
1.1.2 SHMEM
SHMEM is a set of functions that pass data in a variety of ways, provide
synchronization, and perform reductions (glossary, page 148). SHMEM functions
are implemented on Cray MPP systems, multiprocessing Silicon Graphics
systems, and Cray PVP systems but not on any other company’s computers.
What SHMEM lacks in portability, it makes up for in performance. SHMEM is
the fastest of the Cray MPP programming styles.
[Figure 1. Data transfer comparison. A PVM transfer takes five steps on the two PEs: pvm_initsend, pvm_pkint, and pvm_send on the sending PE, and pvm_recv and pvm_upkint on the receiving PE. A SHMEM transfer takes one step, shmem_put, bracketed by shmem_barrier_all synchronization calls.]
In this example of typical data transfers, PVM requires five steps on the two
PEs involved in the transfer: initialize a send buffer, pack the data, send the
data, receive the data, and unpack the data. SHMEM requires only one step:
send the data. However, one or more synchronization routines are almost
always necessary when using SHMEM. You usually must ensure that the
receiving PE does not try to use the data before it arrives.
SHMEM does a direct memory-to-memory copy, which is the fastest way to
move data on a CRAY T3E system. Adding SHMEM functions to your code, or
replacing the statements of another programming style with SHMEM functions,
will almost always enhance the performance of your program. Replacing only
the major data transfers with shmem_put or shmem_get can often give you a
major speedup with minimal effort. For more information on the functionality
available in SHMEM, see the intro_shmem(3) man page.
1.2 Hardware
The CRAY T3E hardware performs at a rate of two to three times that of
CRAY T3D systems. The following sections contain an overview of the memory
system, a brief description of the microprocessor, a look at the network and the
system’s peripherals, and statistics detailing where the increased performance
comes from.
1.2.1 Memory
A memory operation from a PE takes one of two forms:
• A read from, or write to, the PE’s own memory (called local memory). Each
PE has between 64 Mbytes (8 64-bit Mwords) and 2 Gbytes (256 64-bit
Mwords) of memory local to the processor.
• A read from, or write to, remote memory (the memory local to some other
PE).
Note: A word in this document is assumed to be 64 bits in length, unless
otherwise stated.
Operations between the memory of two PEs make use of E registers. E registers
are special hardware components that let one PE read from and write to the
memory of another PE. E registers are largely reserved for internal use, but
users can customize access to them (see Section 6.1, page 131).
E registers are positioned between the PE and the network, as illustrated in the
following figure. They are memory-mapped registers, which means they reside
in noncached memory and have an address associated with them.
[Figure 2. Position of E registers. The E registers sit between the PE and the network; outgoing PUT data and returning GET data pass through them.]
Operations within a PE, between local memory and the microprocessor, are
always faster than operations to or from remote memory. Data read from local
memory is accessed through two levels of cache: a 96-Kbyte secondary cache
(glossary, page 149) and a high-speed, 8-Kbyte data cache (glossary, page 143).
Data written to local memory passes through a 6-entry write buffer (glossary,
page 152) and secondary cache.
Cache coherence (glossary, page 142), which was a user concern on the
CRAY T3D system, is performed automatically on the CRAY T3E system.
Cache is high-speed memory that helps move data quickly between local
memory and the EV5 microprocessor registers. It is still an important part of
MPP programming. The cache_align directive aligns each specified variable
on a cache line boundary. This is useful for frequently referenced variables and
for passing arrays in SHMEM (see Section 3.3, page 59). The cache_align
directive can be used with all of the programming styles described in this
guide. For example (a sketch; the directive form shown is the one used later
in this guide, and the array names are illustrative):
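#pragma _CRI cache_align source, dest
double source[8], dest[8];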
Data cache is a direct-mapped cache (glossary, page 143), meaning each local
memory location is mapped to one data cache location. When an array, for
example, is larger than data cache, a location in data cache can have more than
one of the array addresses mapped to it. Each location is a single, 4-word
(32-byte) line (glossary, page 146).
Secondary cache is three-way set associative, and lines are 8 (64-bit) words long,
for a total of 64 bytes. In a three-way, set-associative cache (glossary, page 149),
each memory location is associated with three lines in secondary cache. The
line into which the data is placed is chosen at random; any of the three
lines can be selected.
For an example of how data cache and secondary cache work, see Procedure 1,
page 9, which describes data movement between local memory and the
microprocessor. For an illustration of the components of a PE, see Figure 3. The
abbreviations on the figure have the following meanings. Many of these terms
are also used in Chapter 4, page 85.
EV5 The RISC microprocessor
E0, E1 Integer functional units
FA, FM Floating-point functional units
WB Write buffer
MAF Missed address file
ICACHE Instruction cache (not relevant to this discussion)
DCACHE Data cache
SCACHE Secondary cache
SB Stream buffer
[Figure 3. Flow of data on a CRAY T3E node. The EV5 microprocessor (E0, E1, FA, FM, WB, MAF, ICACHE, DCACHE) connects through secondary cache (SCACHE) and six stream buffers (SB) to the local memory banks (DRAM).]
The size of a local memory page (marked as DRAM in the preceding figure)
depends on the amount of memory in your machine. A memory size of 128
Mbytes, for example, has a page size of 16 Kbytes. The following figure shows
more detail from the microprocessor part of the data flow.
[Figure 4. Data flow on the EV5 microprocessor. Read data and instructions flow in from the support circuitry through secondary cache and the missed address file to the control circuitry; write data flows back out through secondary cache to the support circuitry.]
Each PE has four functional units: two for floating-point operations and two for
integer operations. It can handle six concurrent input and output data streams
(glossary, page 150).
For the following loop, a PE will create streams between memory and the
functional units for all of the input operands (b[i], c[i], and so on) and one
stream between the functional units, through the write buffer, through
secondary cache, and back to memory for the output operand (a[i]):
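The loop from the original listing is not preserved on this page; a
representative loop (a sketch with assumed names) would be:

for (i = 0; i < n; i++)
    a[i] = b[i] + c[i] * d[i];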
[Figure 5. First value reaches the microprocessor. b[0] is in an EV5 register, b[0-3] is in data cache, b[0-7] is in secondary cache, and b[0-N] is in local memory.]
6. When other registers need b[1] through b[3], they find them in data
cache.
7. When a register needs b[4], data cache does not have it.
8. Data cache requests b[4] through b[7] from secondary cache, which has
them and passes them on.
9. Data cache passes b[4] through b[7] on to the appropriate registers as it
gets requests for them. When the microprocessor finishes with them, it
requests b[8] from data cache.
10. Data cache requests a new line of data elements from secondary cache,
which does not have them. This is the second secondary cache miss, and it
is the signal to the system to begin streaming data.
11. Secondary cache requests another 8-word line from local memory and puts
it into another of its three-line buckets. It may end up in any of the three
lines, since the selection process is random.
12. A 4-word line is passed from secondary cache to data cache, and a single
value is moved to a register. When the value of b[8] gets to the register,
the situation is as illustrated in the following figure.
[Figure 6. Ninth value reaches the microprocessor. b[8] is in an EV5 register; b[0-3], b[4-7], and b[8-11] are in data cache; b[0-7] and b[8-15] are in secondary cache; b[16-23] is in a stream buffer; b[0-N] is in local memory.]
13. Because streaming has begun, data is now prefetched. A stream buffer
anticipates the microprocessor’s continuing need for consecutive data, and
begins retrieving b[16] through b[23] from memory before it is
requested. As long as the microprocessor continues to request consecutive
elements of b, the data will be ready with a minimum of delay.
14. The process of streaming data between local memory and the registers in
the microprocessor continues until the loop is complete.
These steps describe the input stream only. The values of the a array pass
through the write buffer and secondary cache, as illustrated in the following
figure, on their way back to local memory. Values of a are written to local
memory only when a line in secondary cache is dislodged by a write to the
same line, or when values of a are requested by another PE.
[Figure 7. Output stream. Values of a pass from the EV5 through the write buffer (a[8]) into secondary cache (a[0-7]) and back to local memory, while the b input stream continues through the stream buffer, secondary cache, and data cache.]
[Figure 8. An external GigaRing network.]
The GigaRing channel supports the peripherals and networks described in the
following sections.
[Table 2. Latencies and bandwidths for access that does not hit cache]
Caution: Do not use both the Cray MPP Apprentice tool and _rtc on the
same code at the same time. MPP Apprentice introduces a significant amount
of overhead that will be included in the _rtc numbers but not in the
numbers that MPP Apprentice itself reports. Distinguishing between the time
used by your code and the overhead is difficult.
If your CRAY T3E system has PEs running at different clock rates (for instance,
some at 300 megahertz and others at 450 megahertz), you will have to know
what each PE’s clock rate is in order to time the program correctly. For
information on how your mixed-speed PEs are configured, see your system
administrator. The grmview(1) command shows you at what speed each
physical PE runs, but, when you execute your program, physical PE numbers
are mapped to logical PE numbers, which are different.
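As a minimal sketch of interval timing with _rtc (the work routine, the
clock-rate argument, and the conversion to seconds are illustrative
assumptions):

#include <stdio.h>
#include <intrinsics.h>

extern void work(void);            /* hypothetical routine to be timed */

void time_work(double clock_rate_hz)
{
    long start, elapsed;

    start = _rtc();                /* read the PE's real-time clock */
    work();
    elapsed = _rtc() - start;      /* elapsed clock periods */

    printf("elapsed: %ld CPs (%g seconds)\n",
           elapsed, elapsed / clock_rate_hz);
}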
Parallel Virtual Machine (PVM) [2]
• Distributing data from one PE to multiple PEs and gathering data from
multiple PEs to a single PE (see Section 2.12, page 40).
You cannot, however, change the value within the program by using the
pvm_setopt(3) function. You must reset the value outside of your program, as
follows. Specify the new value for PVM_DATA_MAX in bytes.
% setenv PVM_DATA_MAX 8192
% ./a.out
This example changes the value of the maximum message size to 8,192 bytes (or
1,024 64-bit words) for the entire program. The second line executes the
program.
Increasing the size of PVM_DATA_MAX is not always the best solution. If you
have one or two large transfers in your program, but a number of smaller
transfers, you may not want to increase the size of all messages. Adjusting the
size of PVM_DATA_MAX may not help your overall performance. It takes away
from the memory available to the application, and a large message is not
always transferred quickly, especially when it is broadcast to multiple PEs.
Breaking large messages into smaller ones may be faster in some cases;
whether it helps your program depends upon the application, and you may
have to time the program to find out. For information on
timing your code, see Section 1.3, page 21.
PVM does not handle large amounts of data in the same way as small amounts.
For large transfers (greater than the value of PVM_DATA_MAX), the message
contains the first chunk of data and the address of the data block on the
sending PE. After the receiving PE unpacks that first chunk, it uses remote
loads to get the remainder of the data.
Often, remote stores used for short messages can occur at the same time as
computation on the receiving PE. But with large messages, remote loads require
the receiving PE to wait until the loads complete. If the same data is being sent
to several PEs, those PEs may all try to do remote loads at the same time,
creating a slowdown as they share the limited memory bandwidth.
• It requires you to wait until the transfer is complete before accessing the
data, which can slow the program down at times.
• You must either provide your own synchronization or send a short message
from the receiving PE to let the sending PE know the transfer is complete.
• It is optimized for contiguous (stride-1) data. You lose any performance
benefit if your data is not contiguous.
To move on to the next optimization topic, go to Section 2.4, page 28. For a
brief description of the above program, continue with this section.
Line 3 references the PVM header file. See your system administrator for its
actual location on your system. You can specify the location of the header file
with the -I option on the cc(1) or CC(1) command line.
3. #include <pvm3.h>
Line 10 declares the sending and receiving arrays as type float, meaning they
contain 32-bit data on CRAY T3E systems.
10. float d_send[1000], d_recv[1000];
pvm_bufinfo(3). In both the send and the receive, the pvm_psend and
pvm_precv functions offer much simpler and faster code.
The speedups from using pvm_psend and pvm_precv are most noticeable for
small messages, meaning less than the value of the PVM_DATA_MAX
environment variable (see Section 2.1, page 24). For large messages (greater
than PVM_DATA_MAX), the performance benefits over pvm_send and pvm_recv
are not significant.
The following example shows a program that passes data by using the
pvm_psend and pvm_precv functions. The source PE (src in the program)
passes data to the destination PE (dest), which in turn passes it back to the
src PE.
me = _my_pe();
parray = &array[0];
/* Initialize data */
if (me == src)
for (i=0; i < len; i++)
array[i] = i * 1.0;
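The send and receive calls of this example fall on pages that are not
preserved; a minimal sketch of the transfer, assuming the standard
pvm_psend and pvm_precv signatures and hypothetical task identifiers
src_tid and dest_tid, might be:

if (me == src) {
    /* pack-free send of len doubles to the destination task */
    istat = pvm_psend(dest_tid, mtag, (char *) parray, len, PVM_DOUBLE);
}
else if (me == dest) {
    /* receive directly into the array; no unpack step is needed */
    istat = pvm_precv(src_tid, mtag, (char *) parray, len, PVM_DOUBLE,
                      &atid, &atag, &alen);
}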
In the simplest case, you can remove the initialization and packing steps from
the loop, as follows:
bufid = pvm_initsend(PvmDataRaw);
retpk = pvm_pklong (parray, n, 1);
for (i=0; i < numpes; i++) {
retsend = pvm_send (i, mtag);
}
This is more efficient because you pack the data only once. This means you
need only one extra data block (for PvmDataRaw or PvmDataDefault
packing) and one memory copy. Other PVM functions, such as pvm_bcast(3)
and pvm_mcast(3) (see Section 2.9, page 33), provide alternatives to what
remains of the for loop.
Some programs may include this inefficient code construct but hide it, as in the
following example:
for (i=0; i < numpes; i++) {
retmy = myown_send(i, parray, n, 1);
}
In this case, you are gaining portability but sacrificing performance. You may
want to consider using PVM directly or writing a routine that runs more
efficiently.
page 147), meaning it does not wait until the message arrives but rather returns
immediately if there is no message. By checking with pvm_nrecv periodically,
your program can monitor the arrival of a message and execute other
statements while it waits.
The following example outlines one way in which you can make use of
pvm_nrecv:
arrived = pvm_nrecv(-1, 4);
if (arrived == 0) {
/* Do something else */
}
else {
/* Process data in message */
}
nonblocking receive, so that if the message has not arrived, the code can do
other work. See the example in the preceding subsection.
[Figure 9. Fan-out method used by broadcasting routines, showing the senders and receivers among PEs 0 through 7.]
If you use a group name representing a subset of the PEs, there is no special
optimization. PVM simply goes through the list of PEs in the group and sends
to each PE.
The pvm_mcast function does not offer special optimizations. PVM goes
through the specified array of PE numbers and sends to each PE.
The following two examples use pvm_bcast and pvm_mcast, respectively, to
transfer an array of 10 64-bit elements to all other PEs attached to the job:
Example 3: pvm_bcast
#include <pvm3.h>
#include <stdio.h>
main()
{
int len = 10, mtag = 99;
int mytid, me, npes, bufid, i;
int retpk, retbc, retrecv, retupk;
double arr[10];
if (me == 0) {
/* Initialize array */
for (i=0; i<len; i++)
arr[i] = (i+1) /2.0;
Example 4: pvm_mcast
#include <pvm3.h>
#include <stdio.h>
main()
{
int len = 10, mtag = 99;
int mytid, me, npes, bufid, i;
int retpk, retmc, retrecv, retupk;
int pe_arr[10];
double arr[10];
if (me == 0) {
/* Initialize array */
for (i=0; i<len; i++)
arr[i] = (i+1) /2.0;
In the preceding example, regardless of which PE gets its data there first, PE 0
will wait until the data from PE 1 arrives and is received before it can receive
data from any other PE. The following example receives whatever data arrives
first:
for (i=0; i<npes; i++) {
istat = pvm_recv(-1, msgtag);
istat = pvm_upklong(&x[(msgtag-1) * length], length, 1);
}
Unless the data in this example can be put into the x array in random order,
you must check the message tag to find out which PE sent a given message.
The example assumes the sending PE has sent its PE number in the message
tag. Remember, the data is likely to arrive in a different order for different
executions of the program.
A loop such as the following offers a second way to order the arriving data in
the receiving array:
This example assumes that each PE has sent its identifier in source, which is
the first part of the message.
The following example assumes the sending PEs did not include their identifiers
in the message. Instead, a call to the pvm_bufinfo function retrieves the value
of the task identifier (the fourth argument), converts it to a PE number using
pvm_get_PE(3), and places it in the variable nextpe. nextpe is then used in
the pvm_upklong function to provide the element number in the array x.
for(i=0; i<npes; i++) {
istat = pvm_recv(-1, msgtag);
istat = pvm_bufinfo(bufid, &bytes, &msgtag, &source);
nextpe = pvm_get_PE(source);
istat = pvm_upklong(&x[(nextpe-1) * length], length, 1);
}
[Figure 10. A PvmMax reduction. The values at each position of four 10-element arrays are compared, and the maximum found at each position is returned in the result array.]
Reduction functions are faster than other PVM methods of finding the same
answers. The following example adds the values at each array position for each
instance of the array and returns the sum to that array position on PE 0:
main()
{
int mytid, me, npes, istat, len=10;
int arr[10], results[10];
int i, mtag;
mytid = pvm_mytid();
mtag = 99;
me = pvm_get_PE(mytid);
npes = pvm_gsize(NULL);
/* Everyone sums */
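    /* The reduction call itself falls on a page that is not preserved.
       A sketch using the standard pvm_reduce interface; PvmSum, the
       whole-job group (NULL), and root PE 0 are assumptions. On the
       root, the element-by-element sums overwrite arr.               */
    istat = pvm_reduce(PvmSum, arr, len, PVM_INT, mtag, NULL, 0);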
[Figure 11. The gather/scatter process. A gather collects each PE's elements into a single 12-element array on PE 0; a scatter distributes pieces of that array back to PEs 0 through 3.]
The following example collects each PE’s small_c array into the big_c array
on the root PE:
mytid = pvm_mytid();
mype = pvm_get_PE(mytid);
npes = pvm_gsize(0);
/* Initialize arrays */
for(i=0; i<ixdim; i++)
for(j=0; j<iydim; j++)
a[i][j] = mype;
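    /* The gather call itself is lost to a page break. A sketch using
       the standard pvm_gather interface; the per-PE element count
       (ixdim), mtag, and the whole-job group (NULL) are assumptions. */
    istat = pvm_gather(big_c, small_c, ixdim, PVM_DOUBLE, mtag, NULL, root_pe);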
if(mype == root_pe)
for(i=0; i<ixdim*4; i++)
printf("%f", big_c[i]);
The following example scatters the big_x array into the smaller small_x
arrays.
main() {
mytid = pvm_mytid();
mype = pvm_get_PE(mytid);
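    /* The scatter call is lost to a page break. A sketch using the
       standard pvm_scatter interface; count, mtag, and the whole-job
       group (NULL) are assumptions.                                  */
    istat = pvm_scatter(small_x, big_x, count, PVM_DOUBLE, mtag, NULL, root_pe);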
SHMEM [3]
You can either use shared memory (SHMEM) routines alone or mix them into a
program that primarily uses PVM (glossary, page 148) or MPI (glossary, page
146), thereby offering opportunities for optimizations beyond what the
message-passing protocols can provide. Be aware, however, that SHMEM is not
a standard protocol and will not be available on machines developed by
companies other than Silicon Graphics and Cray Research. SHMEM is
supported on Cray PVP systems, Cray MPP systems, and on Silicon Graphics
systems.
For background information on SHMEM, see Section 1.1.2, page 2. For an
introduction to the SHMEM routines, see the intro_shmem(3) man page.
This chapter describes the following optimization techniques:
• Improving data transfer rates in any CRAY T3E program by using SHMEM
get and put routines (see Section 3.1, page 46). This section provides an
introduction to data transfer, which is the most important capability that
SHMEM offers.
• Improving the performance of a PVM or MPI program by adding SHMEM
data manipulation routines (see Section 3.2, page 51).
• Avoiding performance pitfalls when passing 32-bit data rather than 64-bit
data (see Section 3.3, page 59).
• Copying strided (glossary, page 150) data while maintaining maximum
performance. The strided data routines enable you, for example, to divide
the elements of an array among a set of processing elements (PEs) or pull
elements from arrays on multiple PEs into a single array on one PE (see
Section 3.4, page 62).
• Gathering and scattering data and reordering it in the process (see Section
3.5, page 67).
• Broadcasting (glossary, page 142) data from one PE to all PEs (see Section 3.6,
page 71).
• Merging arrays from each PE into a single array on all PEs (see Section 3.7,
page 73).
• Executing an atomic memory operation (glossary, page 141) to read and
update a remote value in a single process (see Section 3.8, page 76).
[Figure 12. shmem_put64 data transfer. The eight elements of source on PE 1 (values 1 through 8) are copied into dest on PE 0.]
Defining the number of PEs in a program and the number in an active set
(glossary, page 141) as powers of 2 (that is, 2, 4, 8, 16, 32, and so on) helped
performance on CRAY T3D systems. Also, declaring arrays as powers of 2 was
necessary if you were using Cray Research Adaptive Fortran (CRAFT) on
CRAY T3D systems. Both have changed as follows on CRAY T3E systems:
• Declaring arrays such as source and dest as multiples of 8 helps SHMEM
speed things up somewhat, since 8 is the vector length of a key component
of the PE remote data transfer hardware. Declaring the number of elements
as a power of 2 does not affect performance unless that number is also a
multiple of 8.
• Defining the number of PEs, whether you are referring to all PEs in a
program or to the number involved in an active set, as a power of 2 does
not usually enhance performance in a significant way on the CRAY T3E
system. Some SHMEM routines, notably shmem_broadcast, still benefit
somewhat from defining the number of PEs as a power of 2.
For information on optimizing an existing MPI program using shmem_get and
shmem_put, see Example 8, page 46. For a complete description of the
MPP-specific statements in the preceding example, continue on with this section.
In the shmem_put example, line 2 imports the SHMEM include file, which
defines parameters needed by many of the routines. The location of the file
may be different on your system. Check with your system administrator if you
do not know the correct path. If you are using the module command and have
loaded craytools, you do not need this line in your program. Subsequent
examples will use the #include method.
2. #include <mpp/shmem.h>
Note: Most of the examples in this publication use the intrinsic functions
_my_pe and _num_pes because they are at least as fast as their
message-passing system equivalents, such as the shmem_my_pe and
shmem_n_pes functions. Both are available on Cray PVP systems as well as
Cray MPP systems.
In lines 11 and 12, the _my_pe and _num_pes functions are called on each PE.
Their values are assigned to me and npes, respectively. Any subsequent
reference to the variable me returns the number of the calling PE.
10. /* Get PE information */
11. me = _my_pe();
12. npes = _num_pes();
The value of me is tested in line 15. Because the test returns a value of true on
PE 1, PE 1 initializes the source array in line 17.
15. if(me == 1) {
16. for(i=0; i<8; i++)
17. source[i] = i+1;
In line 18, PE 1 executes the shmem_put64 function call that sends the data. It
sends eight array elements from its source array to the dest array on PE 0.
18. shmem_put64(dest, source, 8*sizeof(dest[0])/8, 0);
Line 25 selects PE 0, which passively receives the data. PVM requires
receiving tasks (glossary, page 150) to call routines that receive and unpack the
data, but shmem_put64 places the data directly into PE 0’s local memory. PE 0
is not involved in the transfer operation. After being released from the barrier
in line 22, PE 0 prints dest, and the program exits. The shmem_udcflush
function in line 26 is a no-op on CRAY T3E systems, but keeping it in your code
makes the program more portable.
24. /* Print from the receiving PE */
25. if(me == 0) {
26. shmem_udcflush();
22.
23. /* Initialize data */
24. send = me;
25. total = me;
26.
27. for(i=2; i<=npes; i++) {
28.
29. /* Send data to next PE */
30. ierr=MPI_Send(&send,1,MPI_INT,next,99,MPI_COMM_WORLD);
31.
32. /* Receive data from previous PE */
33. ierr=MPI_Recv(&recv,1,MPI_INT,prev,99,MPI_COMM_WORLD,
34. &istat);
35. /* Do work */
36. total = total + recv;
37. send = recv;
38. } /* End of for loop */
39.
40. printf("PE= %d Result= %d Expect= %d", me, total,
41. (int)((npes-1)*npes*.5));
42. }
This program simply passes messages around a ring of PEs. All of the PEs
execute all of the statements. Each passes its PE number around and adds the
number it receives to the variable total. When each PE has seen the PE
number of every other PE, each prints out its own PE number and the total it
has calculated.
The output from the program, reflecting the random order in which the PEs
finish, is as follows:
PE = 1 Result = 28 Expect = 28
PE = 5 Result = 28 Expect = 28
PE = 3 Result = 28 Expect = 28
PE = 7 Result = 28 Expect = 28
PE = 2 Result = 28 Expect = 28
PE = 4 Result = 28 Expect = 28
PE = 6 Result = 28 Expect = 28
PE = 0 Result = 28 Expect = 28
For the shmem_get version of the same program, see Section 3.2.1, page 55;
otherwise, see the remainder of this section for a more detailed description of
the MPI version.
Line 2 references the MPI header file. If you use the module(1) package to
define your environment and have loaded mpt as one of the modules, this line
is not necessary. Line 3 references the C intrinsics file.
2. #include <mpi.h>
3. #include <intrinsics.h>
The next few statements define a PE’s neighbors. The variables next and prev,
defined in lines 16 and 19, specify which of its neighbors most of the PEs will
be passing to and which they will be receiving from. Lines 17, 18, 20, and 21
define neighbors for the PEs on the ends, namely PE 0 and PE 7 in an 8-PE
configuration. Lines 17 and 18 cause PE 7 to pass to PE 0, and lines 20 and 21
cause PE 0 to receive from PE 7.
15. /* Define the ring */
16. next = me + 1;
17. if(next >= npes)
18. next = next - npes;
19. prev = me -1;
20. if(prev < 0)
21. prev = prev + npes;
The values for next and prev in each PE are as illustrated in the following
figure.
[Figure 13. Identification of neighbors in the ring program. For each of PE 0 through PE 7, next is the following PE (wrapping from PE 7 to PE 0) and prev is the preceding PE (wrapping from PE 0 to PE 7).]
Line 24 initializes the variable that each PE will pass on to next, and line
25 initializes the variable that will hold the running total in each PE. Both
variables initially contain the number of the respective PE.
23. /* Initialize data */
24. send = me;
25. total = me;
Next comes the for loop, within which the MPI statements pass and receive
the data. It executes npes-1 times, because the value of me is
already in the running total. In line 30, the MPI_Send function sends the data
(send) to the next PE. All of the MPI functions are described in more detail on
their man pages.
29. /* Send data to next PE */
30. ierr=MPI_Send(&send,1,MPI_INT,next,99,MPI_COMM_WORLD);
Line 33 receives the data from the PE that is defined in prev and puts it into
the variable recv.
32. /* Receive data from previous PE */
33. ierr=MPI_Recv(&recv,1,MPI_INT,prev,99,MPI_COMM_WORLD,
34. &istat);
At the end of the for loop, each PE updates its running total and moves the
number it received into the send variable, preparing to pass it on to next in
the next iteration of the loop.
35. /* Do work */
36. total = total + recv;
37. send = recv;
Lines 1 and 2 reference the stdio and SHMEM #include files. Line 3
defines the size of the remote and local arrays as 10.
1. #include <stdio.h>
2. #include <mpp/shmem.h>
3. #define N 10
Lines 14 and 15 use the same intrinsic functions and get the same information
(the number of the calling PE and the number of PEs involved in the job) as the
MPI calls in lines 11 and 12 of the MPI version of the program (see Example 9,
page 51).
14. me = _my_pe();
15. npes = _num_pes();
The SHMEM version of the program defines the ring, initializes the data,
updates the running total, and writes the output exactly as in the MPI version
of the program. Only the method of passing data differs.
Lines 34 and 40, which set barriers, are necessary when using the shmem_get
function. Synchronization is implicit in the MPI version because of the MPI
mode of operation: each send is matched by a receive. You must provide your
own synchronization when using SHMEM. The shmem_barrier_all function
takes advantage of the fast hardware barrier mechanism, making these calls
relatively inexpensive. When you convert a program from MPI to SHMEM,
these barriers replace MPI's implicit synchronization with a faster
synchronization method.
33. /* Synchronize - Make sure data is ready on other PEs */
34. shmem_barrier_all();
35.
36. /* Get data from previous PE */
37. shmem_double_get(d_recv, d_send, N, prev);
38.
39. /* Synchronize - Make sure data is ready on other PEs */
40. shmem_barrier_all();
Other performance improvements that you will see when converting from MPI
to SHMEM data passing are as follows. They apply whether you are using
shmem_put or shmem_get.
• SHMEM does not require separate calls to routines to initialize, to send the
data, and to receive the data.
• SHMEM does not require the remote PE to be involved while doing
transfers. That means the remote PE is free to do other work, although it
does not do so in this example.
If you have written programs for the CRAY T3D system or for Cray PVP
systems, you are probably accustomed to flushing data cache (glossary, page 145)
on the PE receiving the data in order to preserve cache coherence (glossary, page
142). That is no longer necessary on the CRAY T3E system. For portability
purposes, however, you can leave cache flushing routine calls in your program.
They are essentially ignored on CRAY T3E systems, so they do not affect
performance, but they are required by CRAY T90 systems and may be required
by future systems.
The first half of the program is the same as the shmem_get version (see
Example 10, page 55), except that the remote variable is now d_recv, which is
the target of the data transfer.
In line 39, each PE passes the data to the next PE.
38. /* Send data to next PE */
39. shmem_double_put(d_recv, d_send, N, next);
When you use the 32-bit routines, try to align either both or neither of the
destination array and the source array on a 64-bit boundary. Performance slips
significantly when the two are not so aligned, as in the following call:
shmem_put32(&dest[0], &source[5], nlong, pe);
Instead, cache-align the two arrays and begin the transfer either on two
even-numbered or two odd-numbered array elements. The C and C++
cache_align directive aligns the arrays on cache-line boundaries.
#pragma _CRI cache_align dest, source
shmem_put32(dest, source, nlong, pe);
The following 32-bit version of the ring program uses the shmem_put32
function:
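(The original listing is not preserved; the following is a minimal sketch
modeled on the shmem_put version of the program, with float data and an
assumed message length of N elements.)

#include <stdio.h>
#include <mpp/shmem.h>
#define N 10

/* Remotely accessible arrays must be symmetric (file scope) */
float d_send[N], d_recv[N];

main()
{
    int i, me, npes, next, prev;
    float total;

    me = _my_pe();
    npes = _num_pes();

    /* Define the ring */
    next = me + 1;
    if (next >= npes)
        next = next - npes;
    prev = me - 1;
    if (prev < 0)
        prev = prev + npes;

    /* Initialize data */
    d_send[0] = (float) me;
    total = (float) me;

    for (i = 2; i <= npes; i++) {

        /* Synchronize - make sure every PE's d_send is ready */
        shmem_barrier_all();

        /* Send 32-bit data to the next PE */
        shmem_put32(d_recv, d_send, N, next);

        /* Synchronize - make sure the data has arrived */
        shmem_barrier_all();

        /* Do work */
        total = total + d_recv[0];
        d_send[0] = d_recv[0];
    }

    printf("PE= %d Result= %f Expect= %f\n", me, total,
           (npes - 1) * npes * .5);
}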
19.
20. /* Get PE info */
21. me = shmem_my_pe();
22.
23. if (me == SENDER) {
24.
25. /* Initialize data */
26. for (i=0; i < 2*N; i++) {
27. d_send[i] = i + me + 1.0;
28. }
29. for (i=0; i < 3*N; i++) {
30. d_recv[i] = 0.0;
31. }
32.
33. /* Synchronize - Make sure data is ready */
34. shmem_barrier_all();
35. /* Note: Sender does nothing but synchronize,
36. Receiver does all the work */
37. }
38. else if (me == RECEIVER) {
39.
40. /* Synchronize - Make sure data is ready on other PE */
41. shmem_barrier_all();
42.
43. /* Get data */
44. shmem_double_iget(d_recv,d_send,3,2,N,SENDER);
45.
46. /* Print results */
47. printf("Receiver=%d d_recv=%f %f %f %f %f %f %f\n",
48. me, d_recv[0],d_recv[1],d_recv[2],d_recv[3],
49. d_recv[4],d_recv[5],d_recv[6]);
50. }
51. }
8. #define N 100
9. #define SENDER 0
10. #define RECEIVER 1
11.
12. double d_recv[3*N];
13.
14. main()
15. {
16. int me, i;
17. double d_send[2*N];
18.
19. /* Get PE info */
20. me = shmem_my_pe();
21.
22. if (me == SENDER) {
23.
24. /* Initialize data */
25.
26. for (i=0; i < 2*N; i++)
27. d_send[i] = i + me + 1.0;
28.
29. for (i=0; i < 3*N; i++)
30. d_recv[i] = 0.0;
31.
32. /* Send data */
33. shmem_double_iput(d_recv, d_send, 3, 2, N, RECEIVER);
34.
35. /* Synchronize - Make sure data has arrived */
36. shmem_barrier_all();
37. }
38. else if (me == RECEIVER) {
39.
40. /* Synchronize - Make sure data has arrived */
41. shmem_barrier_all();
42.
43. /* Make sure cache is up to date */
44. shmem_udcflush();
45.
46. /* Print results */
/* Receive data */
bufid = pvm_recv(sender, mtag);
istat = pvm_upkdouble(d_recv, N, 3);
For a description of how to efficiently collect data from all PEs and distribute it
to all PEs, see Section 3.5, page 67. Or continue on with the remainder of this
section for a brief description of the two SHMEM strided data programs.
The data is transferred from within an if statement, beginning on line 38 in the
shmem_iget version and line 33 in the shmem_iput version.
The structures of the two if statements are identical in that the sender (PE 0)
executes the if clause and the receiver (PE 1) executes the else clause, but the
placement of the data transfer functions differs. The shmem_double_iget
function, executing on PE 1, retrieves the data from PE 0. The
shmem_double_iput function, executing on PE 0, copies the data to PE 1.
shmem_iget version:
44. shmem_double_iget(d_recv,d_send,3,2,N,SENDER);
shmem_iput version:
33. shmem_double_iput(d_recv, d_send, 3, 2, N, RECEIVER);
As the third and fourth arguments of the function calls specify, both functions
take every second array element from the source array (d_send) and place
them in every third element of the target array (d_recv). The arguments to the
two functions are the same. See the following figure for an illustration of the
transfers:
[Figure 14. shmem_iget and shmem_iput transfers. Every second element of d_send on PE 0 (values 1, 3, 5, 7) is placed into every third element of d_recv on PE 1.]
The output, which is the same for both programs, identifies the PE that received
the data and the values of the first seven elements of the d_recv array:
Receiver=1 d_recv=1. 0. 0. 3. 0. 0. 5.
The following lines reorder the data in a PVM version of the same program:
for(i=0; i<N; i++)
iupk = pvm_upkdouble(&d_recv[1+index[i]], 1, 1);
In the following lines of the SHMEM program, the index array (defined in line
13) is referenced in the call to the shmem_ixput function, which reorders the
array itself.
13. long index[N] = { 99, 19, 28, 91, 82, 37, 73, 46, 64, 55 };
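The call itself is lost to a page break; assuming an interface of the form
shmem_ixput64(target, source, index, length, PE), with d_send, d_recv, and
RECEIVER as in the earlier strided examples, the reordering transfer would
look something like the following sketch:

shmem_ixput64(d_recv, d_send, index, N, RECEIVER);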
[Figure 15. Reordering elements during a scatter operation. Each element of the source array on PE 0 is placed at the position of the target array given by the corresponding entry of the index array (99, 19, 28, 91, 82, 37, 73, 46, 64, 55).]
main()
{
int i, j;
The output from this program is as follows. Notice that the broadcast did not
include the dest array on PE 0.
The original array values are: 1 4 9 16 25 36 49 64
PE 0 has 0 0 0 0 0 0 0 0
PE 1 has 1 4 9 16 25 36 49 64
PE 2 has 1 4 9 16 25 36 49 64
PE 3 has 1 4 9 16 25 36 49 64
The following figure illustrates the contents of the arrays after the
shmem_broadcast program has executed.
[Figure 16. shmem_broadcast operation. The source array on PE 0 (1, 4, 9, ..., 64) is copied into dest on PEs 1, 2, and 3; dest on PE 0 remains all zeros.]
main()
{
int i, j, n;
/* Assume 4 PEs */
npes = _num_pes();
me = _my_pe();
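    /* The body of the program is lost to a page break. A sketch of the
       collection call, using the standard shmem_fcollect64 interface;
       the 4-element myvals array, the 16-element allvals array, and an
       initialized pSync array are assumed to be declared elsewhere.   */
    shmem_fcollect64(allvals, myvals, 4, 0, 0, npes, pSync);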
} /* End of program */
PE 3 has 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
PE 0 has 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
PE 1 has 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
PE 2 has 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
The following figure illustrates the contents of both the myvals and the
allvals arrays on each PE after the example is run. A call to pvm_gather
followed by a call to pvm_bcast would produce the same result as one call to
shmem_fcollect, but the call to shmem_fcollect is faster.
[Figure 17. Results of a shmem_fcollect. The 4-element myvals arrays from PEs 0 through 3 (1-4, 5-8, 9-12, and 13-16) are concatenated into the 16-element allvals array on every PE.]
11. if (mype != 0) {
12. lock(pvint);
13. printf("PE %d owns the lock \n",mype);
14. unlock(pvint);
15. }
16.
17. shmem_barrier_all();
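The lock and unlock routines called above are not SHMEM library functions,
and their definitions are not preserved here. A minimal sketch of one way
to build them on the shmem_swap atomic operation follows; the spin-wait
strategy and the use of PE 0 to hold the lock word are assumptions:

long lockword = 0;     /* symmetric variable; the copy on PE 0 is the lock */

void lock(long *lw)
{
    /* Atomically swap 1 into the lock word on PE 0 and spin
       until the value swapped out is 0 (the lock was free). */
    while (shmem_swap(lw, 1L, 0) != 0L)
        ;
}

void unlock(long *lw)
{
    shmem_swap(lw, 0L, 0);     /* release the lock by restoring 0 */
}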
The following example finds the smallest value at each position in four arrays,
sends those values to arrays on all PEs, and has PE 0 print the values from its
array.
main() {
int me, i;
long pSync[SHMEM_REDUCE_SYNC_SIZE];
double foo[NR], foomin[NR], pWrk[SHMEM_REDUCE_MIN_WRKDATA_SIZE];
me = _my_pe();
srand48((long) me);   /* seed the generator differently on each PE */
shmem_barrier_all();
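    /* The reduction call is lost to a page break. A sketch using the
       standard shmem_double_min_to_all interface, reducing across all
       PEs; pSync is assumed to have been initialized.                */
    shmem_double_min_to_all(foomin, foo, NR, 0, 0, _num_pes(), pWrk, pSync);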
The following figure illustrates the contents of the two arrays on each PE at the
end of the program.
[Figure 18. Results of a shmem_double_min_to_all. On every PE, each position of foomin holds the smallest of the values found at that position in the foo arrays of PEs 0 through 3.]
/* Get PE information */
me = _my_pe();
npes = _num_pes();
/* Initialize data */
for(i=0; i<N; i++)
send[i] = me;
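/* The reduction call is lost to a page break. A sketch using the
   standard shmem_int_sum_to_all interface; the result array and the
   pWrk and pSync work arrays are assumed to be declared and
   initialized elsewhere.                                            */
shmem_int_sum_to_all(result, send, N, 0, 0, npes, pWrk, pSync);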
The output from running the program on 16 PEs is as follows. As with most
programs involving output from multiple PEs, the order in which the PEs finish
is random.
PE = 5 Result = 120 Expect = 120.
PE = 12 Result = 120 Expect = 120.
PE = 13 Result = 120 Expect = 120.
PE = 2 Result = 120 Expect = 120.
PE = 11 Result = 120 Expect = 120.
PE = 10 Result = 120 Expect = 120.
PE = 14 Result = 120 Expect = 120.
PE = 8 Result = 120 Expect = 120.
PE = 7 Result = 120 Expect = 120.
PE = 15 Result = 120 Expect = 120.
PE = 6 Result = 120 Expect = 120.
PE = 9 Result = 120 Expect = 120.
Compared with the PVM, shmem_get, and shmem_put versions of the ring
program, the shmem_int_sum_to_all version delivers the best performance.
Single-PE Optimization [4]
Some of the most significant improvements you can make to your program are
not linked to parallelism. They fall into the category of single-PE optimizations.
This chapter describes what you can do to get each processing element (PE)
running at as close to peak performance as possible.
This chapter makes frequent reference to the hardware, especially the path
between local memory and the functional units. For background information on
CRAY T3E hardware, see Section 1.2.1, page 4.
To identify the parts of your program that take the most time and to get
feedback on performance improvements, use a performance analyzer such as
pat(1) or the MPP Apprentice tool. For more information, see Section 1.3, page
21.
This chapter addresses the following optimization topics:
• Unrolling loops (see Section 4.1, page 85).
• Using pipelining for loop optimization (see Section 4.2, page 86).
• Making the best use of cache (see Section 4.3, page 92).
• Optimizing stream buffers, which are key to many of the single-PE
optimizations (see Section 4.4, page 95).
• Optimizing division operations (see Section 4.5, page 104).
• Vectorizing some math operations within a loop (see Section 4.6, page 107).
• Bypassing cache (see Section 4.7, page 110).
• Increasing merging in the missed address file (MAF) of the EV5 processor.
For an illustration of where the MAF fits in, see Figure 3, page 7, and Figure
4, page 8.
Although the compiler does unroll loops for you, unrolling is not done by
default on CRAY T3E systems. You can enable the unrolling of all loops
generated by the compiler by including the -h unroll option on the cc(1) or
CC(1) command line. You can direct the compiler to unroll select loops by
setting the command line option to -h nounroll and placing the unroll
directive immediately in front of a loop, as follows:
#pragma _CRI unroll
for(i=0; i<n; i++) {
You can direct the compiler not to unroll a specific loop by placing the
nounroll directive in front of it.
The unroll directive can be applied to any loop of a loop nest, not just the
innermost loop. For a loop that is not the innermost loop, the compiler
performs a technique called unroll and jam. A loop must meet special criteria,
however, to ensure that correct behavior is maintained. In particular, the loop
must have no data dependencies (glossary, page 143) across its iterations. Also,
the compiler performs unroll and jam only on nests in which each loop (except
the innermost) contains only one loop. If these criteria are not met, the compiler
does not take the risk of performing the optimization.
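For illustration (a sketch with invented names), the directive can be
applied to the outer loop of a simple nest that meets these criteria:

#pragma _CRI unroll
for (j = 0; j < m; j++)          /* this loop is unrolled ...             */
    for (i = 0; i < n; i++)      /* ... and the copies are jammed in here */
        a[j][i] = b[j][i] + c[j][i];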
but more options are planned for future releases. Along with the command-line
option, you can provide the compiler with additional information on selected
loops with a directive.
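The example the next paragraph refers to is not preserved; a sketch of the
sort of loop involved (the safe_distance clause is an assumption) is:

#pragma _CRI concurrent safe_distance=3
for (i = p; i < n; i++)
    x[i] = x[i-p] + y[i];     /* iterations may be overlapped when p > 3 */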
The concurrent directive in this example informs the software pipeliner that
the relationship p>3 is true. This allows the pipeliner, for example, to safely
Combining parallel and vector loops with recurrent loops before pipelining
is worthwhile provided the resulting loop body does not grow too large and
does not become memory bound.
Consider the following loop before pipelining (the first lines of the
listing are not preserved; this reconstruction follows the transformed
version shown below):
i=0;
loop {
t=x[i];
y[i] = t;
i = i+1;
if (i==n) exit;
}
Without software pipelining, the processor issues an average of less than one
instruction per CP. The execution of successive loop iterations is sequential, as
opposed to overlapped. A given iteration does not begin until the previous
iteration completes. This loop takes 5 CPs per iteration (assuming hits in data
cache, which has a latency of 2 CPs).
However, by overlapping the execution of successive iterations and by creating
a new loop body, pipelining produces an average throughput of one iteration
every 2 CPs. The initiation interval (the time required to start each iteration) of
2 CPs, along with the fact that every iteration now takes 6 CPs to complete,
proves that an overlap of 3 (6/2) has been achieved.
After the transformation, the new loop has parts of multiple iterations executing
at the same time (see the following figure), has multiple exits, uses twice as
much register space, and reorders the update to the loop induction variable (i
= i+1;) relative to its use in the store to y. But the throughput has increased
by a factor of 2.5, and the two integer functional units of the EV5 processor are
kept busy within the loop.
i=0;
t1=x[i];
i=i+1;
t2=x[i];
loop {
y[i-1]=t1; i=i+1;
if(i==n) exit; t1=x[i];
y[i-1]=t2; i=i+1;
if(i==n) exit; t2=x[i];
}
Figure 20, page 91, illustrates the overlap of iterations for the following loop:
for(i=0; i<n; i++) {
... = a[i] * b[i];
... = c[i] * d[i];
}
[Figure 20. Pipelining a loop with multiplications. Iterations i, i+1, and i+2 overlap, each beginning one initiation interval after the previous one, so the multiply pipeline is kept 100% busy alternating between the a[]*b[] and c[]*d[] products of successive iterations.]
One way to detect potential cache reuse in a loop nest is by looking for array
references that do not contain all of the loop nest’s loop control variables.
Dimensions that have references without a loop control variable, or that have
loop control variables that differ from reference to reference, are generally
candidates for reuse.
In the preceding example, the first and second dimensions have reuse potential
because they each have references in which the subscript is not a loop control
variable. These dimensions should be made the fastest running, rightmost
dimensions. The dimension declarations can be changed as follows:
double a[n][3][3], b[n][3][3], c[n][3][3];
[Figure 21. Before and after array a has been optimized. The figure lists the memory layout of all 54 elements of array a before and after the dimensions are reordered to make the reused dimensions the fastest running.]
Of course, loop interchange can be done only if it does not violate data
dependencies. The compiler usually performs interchange under default
optimization settings.
First, the compiler generates the following loops. Notice the expansion of the
scalar temporary t into the compiler temporary array ta in the following
example.
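The generated loops themselves fall on a page that is not reproduced; the
following is a minimal sketch of scalar expansion, assuming a source loop that
computes through the scalar temporary t:

/* Source loop: the scalar t makes the two statements dependent. */
for(i=0; i<n; i++) {
    t = a[i] + b[i];
    c[i] = t * d[i];
}

/* After scalar expansion (sketch): t becomes the compiler temporary
 * array ta, so the statements can be split into separate loops. */
for(i=0; i<n; i++)
    ta[i] = a[i] + b[i];
for(i=0; i<n; i++)
    c[i] = ta[i] * d[i];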
Finally, in the following example, the compiler stripmines the loops by 256 to
increase the potential for cache hits and reduce the size of arrays created for
scalar expansion. Stripmining itself does not provide a performance benefit, but
in combination with other optimizations (especially loop splitting, unrolling,
and vectorization), it can speed up your program.
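The stripmined code is likewise not reproduced; a sketch of the transformation
applied to the split loops above, assuming a strip size of 256, is:

/* Stripmining sketch: process the iterations in strips of 256, so
 * that ta needs only 256 elements and each strip's data has a
 * better chance of remaining in cache. */
for(is=0; is<n; is+=256) {
    int ie = (is+256 < n) ? is+256 : n;
    for(i=is; i<ie; i++)
        ta[i-is] = a[i] + b[i];
    for(i=is; i<ie; i++)
        c[i] = ta[i-is] * d[i];
}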
The compiler does not split up if statements that are nested within other if
statements. Nested if statements remain intact at the end of the splitting
process.
The compiler also splits individual statements that would allocate more than six
stream buffers, as in the following example.
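The printed example is not reproduced here; as a sketch, a statement that reads
seven arrays and writes an eighth needs more than the six available stream
buffers, so the compiler might break it at an add operation into two loops
through a temporary array (t here is illustrative):

/* Sketch: eight arrays are referenced, more than six streams. */
for(i=0; i<n; i++)
    x[i] = a[i] + b[i] + c[i] + d[i] + e[i] + f[i] + g[i];

/* Assumed split at an add, so each loop needs at most six streams. */
for(i=0; i<n; i++)
    t[i] = a[i] + b[i] + c[i] + d[i];
for(i=0; i<n; i++)
    x[i] = t[i] + e[i] + f[i] + g[i];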
Statements such as those in the preceding example are split only on add,
subtract, and multiply operations.
The -h split command-line option directs the compiler to try to split all
loops in a file. The -h nosplit option, which is the default, splits only loops
preceded by the split directive.
Note: There is potential for increasing the execution time of certain loops by
splitting them. Loop splitting also increases compile time, especially when
loop unrolling is also enabled. The split compiler directive and the -h
nosplit command-line specification let you select only those loops that will
benefit from splitting. Timing a loop both ways can help you determine
whether splitting it will enhance performance.
For more information on loop splitting options and directives, see the Cray
C/C++ Reference Manual, publication SR–2179.
The loop in the following example allocates only two streams on CRAY T3E
systems:
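The example itself falls on the page break; a loop of the following assumed
shape references only the two arrays x and y and therefore allocates only two
streams:

/* Sketch: only two distinct arrays are referenced, so only two
 * stream buffers are allocated. */
for(i=0; i<n; i++)
    x[i] = x[i] + s * y[i];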
If the statements using the same streams are grouped together, the loop only
needs to be split into two loops, as shown in the following example:
The situation in the preceding example might be found in codes that contain
loops that have been unrolled manually. These loops should either have their
statements grouped, or the loops should be rerolled. The following loop,
unrolled manually, will be split into four different loops:
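The unrolled loop is not reproduced; the following sketch is consistent with the
description that follows, with the first and third statements referencing one
pair of arrays and the second and fourth referencing another:

/* Sketch of a manually unrolled loop. Ungrouped, each statement
 * lands in its own loop, so splitting produces four loops. */
for(i=0; i<n; i+=2) {
    a[i]   = b[i]   * s;
    c[i]   = d[i]   * s;
    a[i+1] = b[i+1] * s;
    c[i+1] = d[i+1] * s;
}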
As in Example 30, page 102, and Example 31, page 102, the first and third lines
of the loop access the same streams, as do the second and fourth lines. By
grouping the first and third lines and second and fourth lines, the loop in the
following example, which was unrolled manually, will be split into two loops.
(Rerolling this loop would probably be best, but the intent of the example is to
demonstrate grouping.)
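A sketch of the grouped form:

/* Grouped version: statements that use the same streams are now
 * adjacent, so the loop is split into only two loops. */
for(i=0; i<n; i+=2) {
    a[i]   = b[i]   * s;
    a[i+1] = b[i+1] * s;
    c[i]   = d[i]   * s;
    c[i+1] = d[i+1] * s;
}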
You can change the stream buffer level at run time by using the
set_d_stream(3) library function. The stream buffer level is changed to 0 in
the following example:
#include <mpp/rastream.h>
set_d_stream(0);
The get_d_stream(3) function returns the current stream buffer level so that
you can restore it later.
For certain versions of the CRAY T3E system, stream buffers are disabled by
default for the following classes of programs:
• A program that calls subroutines from the SHMEM library.
• A program that uses the cache_bypass directive (see Section 4.7, page 110).
The document CRAY T3E Programming with Coherent Memory Streams outlines
conditions under which you can safely enable streams for programs in these
categories. The document is available online at the following URL:
http://www.sgi.com/t3e/guidelines.html
Once you ensure that your program is safe, you can enable streams by using
the set_d_stream(3) library function or by setting the SCACHE_D_STREAMS
environment variable to 1.
Because the divide is loop invariant (glossary, page 146), it can be changed to a
multiply by the reciprocal, as shown in the following example.
xinv = 1.0/x;
for(i=0; i<256; i++)
a[i] = (b[i] + 2.0 * c[i] + d[i]) * xinv;
By default, the C and C++ compilers change a divide into a reciprocal multiply
for you. You do not have to change the code at all. Unless you disable it by
specifying the -h nofastfpdivide option on the cc or CC command line, the
compiler will use the faster reciprocal multiply at every opportunity.
Other operations can proceed while a divide operation is in progress. If moving
a divide operation outside of a loop is not possible, you can sometimes
preschedule it from within your source code. The following inner loop has a
divide operation that causes a wait of about 60 CPs, because the result is
used immediately.
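The numbered listing from the printed page is not reproduced here; a minimal
sketch of the problem, with the quotient consumed by the very next operation, is:

/* Sketch: the result of the divide is needed immediately, so the
 * processor waits roughly 60 CPs on every iteration. */
for(i=0; i<n; i++) {
    q = a[i] / b[i];
    c[i] = q * d[i] + e[i];
}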
Using a technique similar to bottom loading (glossary, page 142), the division
required for the next iteration of the loop is computed in advance. The result of
the divide is not needed until the next pass of the loop, so the floating-point
operations following the divide can overlap with the 60 CPs of divide latency,
assuming a 64-bit divide. This kind of division is unconventional, but it
increases the performance of the code. The compiler does not make the
following changes automatically.
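A sketch of the prescheduled form follows; the original listing is not
reproduced:

/* Sketch: compute the quotient needed by iteration i+1 while the
 * rest of iteration i's floating-point work proceeds. Note that
 * a[i+1] and b[i+1] read one element past the end on the final
 * iteration, which is the bounds hazard discussed below. */
q = a[0] / b[0];
for(i=0; i<n; i++) {
    qnext = a[i+1] / b[i+1];   /* divide for the next iteration */
    c[i] = q * d[i] + e[i];    /* overlaps the divide latency   */
    q = qnext;
}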
The early divide for the last iteration in the preceding example is potentially
unsafe because it may go out of bounds. You may have to provide a special
case for the last iteration.
4.6 Vectorization
The CRAY T3E compiler offers a method to vectorize selected math operations
inside loops. This is not the same kind of vectorization that is available on
Cray PVP systems. On a CRAY T3E system, the compiler restructures loops
containing scalar operations and generates calls to specially coded vector
versions of the underlying math routines. The vector versions are between two
and four times faster than the scalar versions. The compiler uses the following
process:
1. Stripmine the loop. (For more information on stripmining, see Example 24,
page 97.)
2. Split vectorizable operations into separate loops, if necessary. (For more
information on loop splitting, see Example 24, page 97.)
3. Replace loops containing vectorizable operations with calls to vectorized
intrinsics.
Vectorizing reduces execution time in the following ways:
• By reducing aggregate call overhead, including the subroutine linkage and
the latency to bring scalar values into registers needed by the intrinsic
routine.
A call to libmfastv may generate only a NaN for a particular operand rather
than an exception, causing exceptions further down the line. A NaN is a value
that is not a number but rather a symbolic entity encoded in floating-point
format.
Vectorization is only performed on loops that the compiler judges to be
vectorizable. This determination is based on perceived data dependencies and
the regularity of the loop control. These loops will likely be a significant subset
of those seen as vectorizable by the Cray PVP compiler. Vectorization of
conditionally executed operators is deferred, as is vectorization of loops that
contain potential early exits.
Vectorization will be performed on the following intrinsics and operators. The
first set supports both 32-bit and 64-bit floating-point data:
sqrt(3C)
1/sqrt (replaced by a call to sqrtinv(3C))
log(3C)
exp(3C)
sin(3C)
cos(3C)
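As a hedged illustration, a loop of the following shape is a candidate: the
compiler can stripmine it and replace the scalar calls with one call to the
vector version of sqrt per strip:

/* Sketch (assumes <math.h>): regular loop control and no data
 * dependencies, so the scalar sqrt calls can be replaced by calls
 * to the vectorized routine. */
for(i=0; i<n; i++)
    y[i] = sqrt(x[i]);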
The cache_bypass directive can also be used to initialize large arrays if the
contents are not immediately needed in cache, avoiding unnecessary reads into
cache and improving the memory bandwidth.
The directive precedes a for, while, do ... while, or if ... goto loop and
affects all of the named arrays whose base data types are 64 bits. (Support for
32-bit and complex data types is not yet implemented.) In the following loop,
arrays a and b will be accessed through E registers rather than through cache:
#pragma _CRI cache_bypass a, b
for(i=0; i<n; i++)
x[i] = a[i] * b[i];
Note the following limitations of the cache_bypass directive:
• It does not guarantee that the specified variables are not in cache before the
loop.
• It does not guarantee that the specified variables are not in cache after the
loop.
• It does not invalidate cache.
• It does not affect program results in any way.
Even if you include a cache_bypass directive before a loop, the compiler
ignores it if it determines it cannot generate code efficiently. The loop must
meet the following requirements before the compiler uses E registers:
• The loop must be an inner loop, if it is nested in another loop.
• The loop must be vectorizable. Use the ivdep directive in cases where
ambiguous data dependencies are preventing the loop from vectorizing.
(For more information on the ivdep directive, see Section 4.6.1, page 110.
For more information on vectorization, see Section 4.6, page 107.)
You will probably have to enable loop unrolling to realize the full benefit from
this feature. For information on the unrolling command-line option and
directive, see Section 4.1, page 85.
The benefit is greater the more random the index stream; however, benefit has
been seen from index streams with secondary cache hit rates as high as 50%.
Bypassing cache does generate more code for candidate loops, potentially
increasing the compile time slightly. It also increases the latency of memory
references.
Input/Output [5]
Optimizing I/O on the CRAY T3E system is not, for the most part, very
different from optimizing I/O on Cray PVP systems. If you are already
acquainted with Cray PVP I/O, much of this chapter should be familiar to you.
As on other Cray Research systems, there are a few optimizations that apply
regardless of the kind of I/O your program is performing. For instance, using
binary (or unformatted) data rather than ASCII data is good practice that
should be followed whenever possible. To avoid redundancy, it is not repeated
as an optimization in every section of this chapter.
The following optimization topics are covered:
• Choosing a strategy for doing I/O in a parallel programming environment
(see Section 5.1, page 113).
• Using unformatted I/O whenever possible (see Section 5.2, page 119).
• Coping with formatted I/O when necessary (see Section 5.3, page 122).
• Making use of the performance-enhancing FFIO layers in your program (see
Section 5.4, page 123).
• Optimizing random access I/O (see Section 5.5, page 128).
• Striping a file over disk partitions (see Section 5.6, page 128).
• Have each PE involved in I/O, perhaps all of them, open a separate file.
This method often requires you to divide data up into multiple files before
reading and to merge files after writing. You can do the dividing and
merging outside the scope of the program. Each file can be read from and
written to different disks (see Section 5.1.2, page 116).
• Have one PE do all of the I/O. The PE performing the read shares the data
with the rest of the PEs and collects it again before writing the output. This
method can be very fast when you make use of disk striping (see Section
5.1.3, page 118).
The following sections describe the performance benefits and detriments of
these three methods.
The figure (a11300) shows PE 0 through PE 3 all reading from and writing to a
single data file, with a synchronization mechanism coordinating the writes.
Having many PEs reading from the same file can cause a slowdown due to I/O
contention. Reading or writing to a single file is most effective under the
following circumstances:
#include <fcntl.h>
#include <stdlib.h>
#include <ffio.h>
#include <intrinsics.h>

main()
{
    int me, i, next = 0, workers, size = 100000;
    int num_elements, fd, fret;
    int array[100000];

    me = _my_pe();
    workers = _num_pes();
    num_elements = size/workers;
    fd = ffopen("datafile", O_RDWR);
    for(i=0; i<num_elements; i++) {
        next = findnext(next, me, num_elements);
        fret = ffseek(fd, next, 0);
        if (fret < 0) exit(1);
        /* Perform calculations (the remainder of the example
           is not reproduced here) */
    }
}
When executing this program, redirect the input and output data files as follows:
mpprun -n 10 a.out <infile >outfile
#include <fcntl.h>
#include <ffio.h>
#include <intrinsics.h>

main()
{
    int fd;
    char inbuf[8*4096], outbuf[8*4096];

    if(_my_pe() == 0)
        fd=ffopen("file0",O_RDWR|O_RAW);
    if(_my_pe() == 1)
        fd=ffopen("file1",O_RDWR|O_RAW);
    if(_my_pe() == 2)
        fd=ffopen("file2",O_RDWR|O_RAW);
    if(_my_pe() == 3)
        fd=ffopen("file3",O_RDWR|O_RAW);
    /* Work on data (the remainder of the example is not
       reproduced here) */
}
In this example, each PE reads from and writes to its own file by using the
FFIO routines ffread(3C) and ffwrite(3C). For more information on FFIO, see
Section 5.4, page 123. The external file for each PE is identified by its own copy
of fd. The size 8*4096 bytes is assumed to equal eight disk sectors.
The figure (a11301) shows each of the four PEs reading from and writing to its
own separate file.
The figure (a11302) shows PE 0 performing all reads and writes on the data file
and using array put and get routines to distribute data to, and collect it from,
PE 1 through PE 3.
You can share the data using one of the following methods:
• You can use one of the Message Passing Toolkit (MPT) products to pass the
data on. PVM, MPI, and SHMEM all have broadcast routines that pass data
to other PEs. If every PE needs all the data, the shmem_broadcast routine
is the fastest of the three, but it is not portable to other vendors’ systems.
For examples of shmem_broadcast, see Section 3.6, page 71. For
information on pvm_bcast(3), see Section 2.9, page 33.
• If each PE only needs part of the data, use get and put functions. Again, the
SHMEM put and get routines are the fastest. For information on using
shmem_put64 and shmem_get64, see Section 3.1, page 46. If portability is
a concern, PVM and MPI have put and get functions, but they are slower
than SHMEM.
• Convert the scanf(3) and printf(3) function calls in C and C++, and the
C++ cin and cout streams (with their >> and << operators), to unformatted
functions. Use the ffread(3) and ffwrite(3) functions, along with the
assign command before executing your program, as follows:
% cc -X4 myprog.c
% assign -s bin f:outfile
% ./a.out
By default, the system always accesses the next record automatically during a
read or a write. That means sequential I/O will be fast. When you are also
manipulating unformatted data, you have the potential for very fast data
transfers. The following section describes how to get the most out of a good
combination.
If the whole file does not fit in memory, you can read in parts of it at a time,
possibly getting work done on the current data while waiting for the next
chunk to be read. The following figure illustrates the data flow for an array
named A in PE 4.
The figure (a11333) shows the data flow for array A in PE 4: on a write, data
moves from A through the library-layer buffers and the system-layer buffer to
disk; on a read, it moves back through the same layers.
To move the data between disk and the system buffers, use the following
optimizations:
• Choose system call I/O. System call I/O is specified on an assign
command either explicitly, by selecting the system or syscall FFIO layer,
or implicitly; if it is not specified, it is added automatically. (For more
information on FFIO, see Section 5.4, page 123.) The following example
selects syscall:
% assign -F syscall
• Make I/O requests that begin and end on disk sector boundaries. Most disk
sectors are the same size as a block, 512 words (or 4,096 bytes). Check with
your system administrator to verify the size of a disk sector on your
CRAY T3E system.
The following command preallocates an area that is 80 512-word blocks in
size:
% assign -n 80
• If you are using the read(2) and write(2) system calls directly, switch to
ffread(3C) and ffwrite(3C). Doing so will reduce the number of calls to
the system, and performance should be at least as good as using the system
calls.
Optimizations such as double buffering (or even triple and quadruple
buffering) and disk striping are performed by the operating system. User
striping can still gain you performance improvements, but it is more labor
intensive than other optimizations described in this chapter. (For an example of
user striping, see Section 5.6, page 128.)
By adjusting the arguments to the cachea and bufa layers of FFIO, you can
have double buffering done automatically. The following assign command
creates two buffers, each 50 blocks in size:
% assign -F bufa:50:2
To optimize the process of moving data between the system buffers and an
array in your program, use the following techniques:
• Access the FFIO libraries using the ffread(3) and ffwrite(3) functions.
• Take advantage of asynchronous I/O if you can accomplish other work
while the I/O is taking place. If sequential, unformatted I/O requests take
most of your program’s time, you can probably improve performance by
combining computation with the inherent asynchronous capability of I/O.
First, select an asynchronous FFIO layer by running assign commands
such as the following before executing your program:
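The specific commands fall on the page break; a hedged example, reusing the
bufa layer parameters shown earlier for an illustrative file named mydata,
would be:

% assign -F bufa:50:2 f:mydata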
The asynchronous layers provide read-ahead (glossary, page 148) and
write-behind (glossary, page 152) capabilities that can improve performance
significantly.
• Use the setvbuf(3) function to specify the size of the library buffer. The
following call, in which stream is the file's FILE pointer, sets the buffer size
to 48 512-word blocks, which is the default:

setvbuf(stream, NULL, _IOFBF, 48*512);
• If the file is small enough to fit entirely into the memory of a PE, or if a
certain part of the file is heavily accessed, use the memory-resident layer in
FFIO. The memory-resident layer involves less overhead than, for example,
the cachea layer. For more information, see Section 5.4.1, page 123.
The following C++ example writes one array element per I/O request:

float a[isize];
for(i=0; i<isize;) {
    myfile << a[i] << ' ';
    ++i;
    if(i%5 == 0) myfile << '\n';
}
The following example reads five elements in a single I/O request. It makes
80% fewer I/O calls and helps the program to execute faster:
float a[isize];
for(i=0; i<isize; i+=5)
    scanf("%f %f %f %f %f", &a[i], &a[i+1], &a[i+2], &a[i+3], &a[i+4]);
5.4 FFIO
The bufa and cachea layers of flexible file I/O (FFIO) do asynchronous
buffering and caching internally. Those two, along with global, which
distributes a data file across multiple PEs, and mr, which stores part or all of a
data file in the memory of a single PE, are the high performance FFIO layers.
The following sections describe how to improve the performance of your
program by using them.
The application PEs (APP under the Type column) are the ones to look at. The
UsrMem column shows 119 Mbytes available to a user program, meaning each
PE probably has 128 Mbytes of local memory. If the combined size of your data
and your executable file does not approach 119 Mbytes, you may be able to
move the entire data file into the memory of a single PE. To enable the
memory-resident layer of FFIO, enter an assign command such as the
following before executing your program. This example allocates 10 512-word
blocks (about .04 Mbytes) of memory for the data coming from the file myfile.

% assign -F mr:10 f:myfile
The data is automatically read into the memory-resident area when the file
myfile is opened and written back out when myfile is closed. If the data
area proves to be too small, the data file is split automatically between local
memory and disk.
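The assign command selecting the global layer is on a page that is not
reproduced; based on the parameters discussed, it would take roughly this
hedged form, with the file name illustrative:

% assign -F global:50:1 f:mydata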
This example allocates 50 blocks for each page, with each block capable of
containing 512 64-bit words of data. For a 10-PE program, it allocates 1 page for
each PE, meaning there are 50 blocks on each of the 10 PEs, for a total of 500
blocks. The distribution for an array of 500 blocks would place the first 50 on
the first PE to access a file page, the second 50 on the next PE to access a file
page, and so on. The following figure represents the layout of the words of data
on each PE. It does not reflect the random order in which the PEs access the file.
The figure shows the 500 blocks laid out 50 to each of PE 0 through PE 9.
If you use distributed I/O during an operation in which all of the PEs are
involved, each PE holds data for every other PE. Although this might seem like
a confusing arrangement, it is a good use of memory for most applications.
The advantages of using distributed I/O are as follows:
• You get more buffer space without severely impacting the memory of any
single PE.
• The data becomes essentially a globally accessible file. You do not have to
know on which PE any particular data element is stored.
The following are disadvantages:
• You are using memory that might be needed by a PE.
• PEs might need a large amount of data residing in the memory of other PEs,
creating many remote transfers.
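The assign command under discussion is likewise not reproduced; based on the
parameters described below, a hedged reconstruction, with an illustrative file
name, is:

% assign -F cachea:100:40:2 f:mydata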
This example sets up a cache of 40 pages, each page of which is 100 blocks of
512 64-bit words (51,200 words). If the I/O libraries detect sequential access,
they perform either asynchronous read-ahead or asynchronous write-behind.
The third parameter to cachea, which is 2 in the example, tells the libraries
how many pages you want read ahead. Setting the third parameter is
important to the performance of a program that uses sequential I/O, because
the default is no read-ahead.
Using cache is similar to using library buffers (see the following section); both
have an asynchronous capability and both are stored in the memory of PEs.
There are differences between the two, however.
Cache contains an indexing system for the data in an active cache. You can
choose any indexed data in cache and quickly move it into a register.
A buffer does not have an indexing scheme; it knows only the file position at
the top of the buffer. A buffer is designed for sequential access. If you
reposition within a buffer, the current buffer is flushed and a new set of data is
read from disk.
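The command itself is on the preceding page; a hedged reconstruction using the
assign library-buffer-size option, with an illustrative file name, would be:

% assign -b 40 f:mydata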
This allocates one buffer of 40 512-word blocks, or about 164 Kbytes. The
buffers are allocated during program execution. Each PE opening a file receives
a buffer.
5.6 Striping
Striping a file over disk partitions adds a level of parallel processing to the
slowest part of I/O: physically reading data from and writing data to disk.
You can specify automatic striping by entering an assign(1) statement such as
the following before executing your program:
% assign -p 0-3 -n 8400 -q 21 -s u f:mydata
This command stripes over four partitions (0, 1, 2, and 3), putting 21 sectors on
each partition.
Hardware Access [6]
You can find a convenient source of E register opcodes and bit field definitions
in the mpp/mpphw_t3e.h header file, available in the CrayLibs package. It has
both physical and virtual addresses for the E register commands. The physical
addresses correspond to the values listed in the Commands booklet, but you
must use the virtual addresses in user and library code. The operating system
maps the virtual addresses to the physical addresses when an E register
command is issued.
The header file has examples of inline functions that use E registers. Inline
functions are also available in the CrayLibs package for CRAY T3E systems.
6.1.1 Basics
The stq, ldq, stl, and ldl load/store hardware instructions trigger E register
operations when accompanied by special addresses. The opcodes and operands
for the E register commands are packaged in the following locations:
• The address of the load/store instruction.
• The data argument in the load/store instruction.
• The more operands block (MOBE) of E registers. A bit field in the data
argument points to up to 4 E registers.
• The source or destination (SADE) E register, which contains data to be
transferred. This E register is specified in a bit field within the address for
the load/store instruction.
The Ecmd pointer in line 6 is a pseudo-pointer that contains the E register put
command code and a pointer to the E register (_MPC_E_REG_SADE) that will
contain the data to be transferred to the remote processor. The volatile
keyword in the declaration of Ecmd ensures that the pseudo-store to *Ecmd that
follows will be translated by the compiler into an actual store instruction.
6. volatile long * const Ecmd =
7. (volatile long *)(_PUT(_MPC_E_REG_SADE));
The #pragma directive in line 10 tells the C compiler that the user is accessing
the first source-and-destination E register (SADE). The compiler will not use
this SADE behind the scenes for its own optimizations; it will use a different
SADE, if needed.
10. #pragma _CRI sade 1
The #pragma _CRI inline directive in line 21 inlines the function just
defined; the function body is expanded wherever the function is called in the C
source file. If you do not want to inline the function, omit this directive.
21. #pragma _CRI inline _I_shmem_int_p
This directive indicates that n BESUs should be allocated for use in the
compilation unit. The sum of the BESU counts specified with directives in a
program is recorded at link time and placed in the a.out(5) header. The
operating system allocates the specified number to the application team at
program startup. As a special case, the operating system does not allocate a
BESU if the BESU count in the a.out header is 1 and the program uses one
PE.
Note: The besu directive is not required when accessing BESUs through
the standard barrier and event library routines mentioned in this section.
It is only required when programming BESUs directly through techniques
described in Barrier and Eureka Synchronization (CRAY T3E Systems).
• Use the besu_alloc() function to obtain the correct memory-mapped
address. The synopsis for besu_alloc is as follows:
#include <mpp/besu.h>
besu_t *besu_alloc(void);
#include <mpp/besu.h>

void mybarrier(void)
{
    static int firstcall = 1;
    static besu_t *besuptr;

    if (firstcall) {
        firstcall = 0;
        besuptr = besu_alloc();
    }
    /* Arm the barrier, indicating this PE's arrival */
    *besuptr = _MPC_OP_BAR;
    /* Wait for all other PEs to arrive at the barrier */
    while (*besuptr != _MPC_S_BAR) ;
}
When a process in an application team calls the abort function from C or C++,
or when a program receives an exception signal (operand range error and
floating-point errors are two examples), the entire application team is
terminated. The following library extensions to C and C++ terminate the whole
application team:
• globalexit(). This C-callable function causes all processes in the
application team to go through normal exit processing. This function takes a
single int argument to indicate exit status:
void globalexit(int status);
• The C_APTEAM category can be passed to the killm(2) system call to cause
the signal to be sent to all processes in an application team.
Glossary [7]
active set
blocking receive

cache coherence
All processors see the same value for any memory location,
regardless of which cache the actual data is in, or which
processor most recently changed the data. On the CRAY T3E
system, only local memory references can be cached (all remote
memory references use external E registers). Hardware on each
CRAY T3E processor maintains cache coherence with the local
memory, including when data is modified by a remote
processor.
cache hit
clock period
disk mirroring
flushing cache
latency of memory
message-passing system
reduction
thrashing
After the references to a[0] and b[0], the page for a and the
page for b are in memory. When c[0] is referenced, the
operating system removes a’s page, because it has room for
only two pages, and a’s page was least recently used. Next
a[1] is referenced, but a’s page is now out of memory, so the
operating system removes b’s page. Likewise, the reference to
b[1] ends up removing c’s page. Because more pages are
referenced than there are room for in memory, because they are
referenced cyclically, and because they are allocated in a
least-recently-used basis, reuse never occurs, and the paging
mechanism gives no benefit.
thread
ulp
Index
B
background
    Parallel Virtual Machine (PVM), 2
    programming styles, 2
    SHMEM, 2
    topics, 1
background information, 1

C
cache
    coherence, 5
    data and secondary, 5
    how it works, 9
    load and store timings, 19
    miss, 9, 11

F
fan-out
    definition, 144
fan-out distribution, PVM, 34
FDDI network, 18
FFIO
    description, 123
Fiber Channel disks, 16
Flexible File I/O
    description, 123
flow of data, 7
flushing cache
    definition, 145
formatted I/O
    optimizations, 122
    reducing, 122
functional unit
    and pipelining, 91
functional units, 8

G
gather data, SHMEM, 67
gather operation
    definition, 145
gathering data, PVM, 40
get_d_stream routine, 104
GigaRing
    definition, 145
GigaRing network, 15
global I/O, 125
glossary
    description, 1
grmview
    example, 123

H
hardware
    illustration, 7, 8
hardware overview, 4
HIPPI disks, 17
HIPPI network, 19

I
I/O, 113
I/O from a single PE, 118
I/O requests
    using large, 122
I/O strategies, 113
IEEE division, 105
if statement, splitting loop with, 98
individual routine
    definition, 145
initializing data, PVM, 30
inner loop trip count, maximizing, 100
instance number, 145
intrinsic routines
    vectorized, 109
intrinsics.h include file, 49
invariant references, maximizing, 93
IPI-2 disks, 17
ivdep directive
    and pipelining, 88

L
large transfers
    example, 123