Parallel Computers 2
ARCHITECTURE, PROGRAMMING AND ALGORITHMS
R W Hockney
Emeritus Professor, University of Reading
C R Jesshope
Reader in Computer Architecture
Department of Electronics and Computer Science,
University of Southampton
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter
invented, including photocopying, microfilming, and recording, or in any information storage or retrieval
system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222
Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides
licenses and registration for a variety of users. For organizations that have been granted a photocopy license
by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
PREFACE
1 INTRODUCTION 1
1.1 History of parallelism and supercomputing 2
1.2 Classification of designs 53
1.3 Characterisation of performance 81
APPENDIX 581
REFERENCES 587
INDEX 611
Preface to the First Edition
The 1980s are likely to be the decade of the parallel computer, and it is the
purpose of this book to provide an introduction to the topic. Although many
computers have displayed examples of parallel or concurrent operation since
the 1950s, it was not until 1974-5 that the first computers appeared that
were designed specifically to use parallelism in order to operate efficiently
on vectors or arrays of numbers. These computers were based on either
executing in parallel the various subfunctions of an arithmetic operation in
the same manner as a factory assembly line (pipelined computers such as the
CDC STAR and the TIASC), or replicating complete arithmetic units
(processor arrays such as the ILLIAC IV). There were many problems in
these early designs but by 1980 several major manufacturers were offering
parallel computers on a commercial basis, as opposed to the previous research
projects. The main examples are the CRAY-1 (actually first installed in 1976)
and the CDC CYBER 205 pipelined computers, and the ICL DAP and
Burroughs BSP processor arrays. Unfortunately, since we wrote this material
Burroughs have experienced problems in the production of the BSP and,
although a prototype machine was built, Burroughs have withdrawn from
this project. However, it still remains a very interesting design, and its demise perhaps gives more insight into this field. Pipelined designs are
also becoming popular as processors to attach to minicomputers for signal
processing or the analysis of seismic data, and also as attachments to micro-based systems. Examples are the FPS AP-120B, FPS-164, Data General
AP/130 and IBM 3838.
Parallelism has been introduced in the above designs because improvements
in circuit speeds alone cannot produce the required performance. This is
evident also in the design studies produced for the proposed National
Aeronautics Simulation Facility at NASA Ames. This is to be based on a
computer capable of 10⁹ floating-point operations per second. The CDC
in one line, thus circumventing the need for verbose descriptions and aiding
the classification of designs by allowing generic descriptions by a formula.
Most material in this book has been collected for a lecture course given
in the Computer Science Department at Reading University entitled
‘Advanced Computer Architecture’. This course has evolved into a 40-lecture
unit over the last five years, and it was the lack of a suitable text that provided
the motivation for writing this book. The lecture course is given as an option
to third year undergraduates but would also be suitable for a specialised
course at MSc level, or as taught material to PhD students preparing for a
thesis in the general area of parallel computation.
Many people have helped, by discussion and criticism, with the preparation
of the manuscript. Amongst these, we would like to mention our colleagues
at Reading University: particularly Jim Craigie, John Graham, Roger Loader,
John Ogden, John Roberts and Shirley Williams; and Henry Kemhadjian of
Southampton University. Dr Ewan Page, Vice-Chancellor of Reading
University, and the series editor Professor Mike Rogers of Bristol University
have also suggested improvements to the manuscript. We have also received
very generous assistance with information and photographs from representatives
of the computer manufacturers, amongst whom we wish to thank Pete
Flanders, David Hunt and Stewart Reddaway of ICL Research and Advanced
Development Centre, Stevenage, and John Smallbone of ICL, Euston;
Professor Dennis Parkinson of ICL and Queen Mary College, London
University; Stuart Drayton, Mick Dungworth and Jeff Taylor of CRAY
research, Bracknell; David Barkai and Nigel Payne at Control Data
Corporation (UK), and Patricia Conway, Neil Lincoln and Chuck Purcell
of CDC Minneapolis; J H Austin of Burroughs, Paoli, and G Tillot of
Burroughs, London; C T Mickelson of Goodyear Aerospace, Akron; and
John Harte, David Head and Steve Markham of Floating Point Systems,
Bracknell. Many errors were also corrected by Ms Jill Dickinson (now
Mrs Contla) who has typed our manuscript with her customary consummate
skill.
We have dedicated this book to computer designers, because without their
inspiration and dedication we would not have had such an interesting variety
of designs to study, classify and use. The design of any computer is inevitably
a team effort but we would like to express our particular appreciation to
Seymour Cray, Neil Lincoln, George O'Leary and Stewart Reddaway, who
were the principal designers for the computers that we have selected for
detailed study.
R W Hockney, C R Jesshope
1981
Preface to the Second Edition
In the seven years that have passed since the publication of Parallel Computers
sufficient changes have occurred to warrant the preparation of a second
edition. Apart from the evolution of architectures described in the first edition
(e.g. CYBER 205 to ETA10, and CRAY-1 to CRAY X-MP and CRAY-2),
many novel multi-instruction stream (MIMD) computers have appeared
experimentally, and some are now commercially available (e.g. Intel iPSC,
Sequent Balance, Alliant FX/8). In addition, microprocessor chips (e.g. the
INMOS transputer) are now available that are specifically designed to be
connected into large networks, and many new systems are planned to be based
upon them.
Whilst keeping the overall framework of the first edition, we have included
these developments by expanding the Introduction (including necessary
extensions to the algebraic architecture notation, and the classification of
designs), and selecting some architectures for more detailed description in
the following chapters. In the chapter on vector pipelined computers
(Chapter 2), we have included a description of the highly successful Japanese
vector computers (Fujitsu VP, Hitachi S810, and NEC SX2), as well as the
new generation of multiple vector computers from the USA (the CRAY-2
and ETA10). In the chapter on multiprocessors and processor arrays, we
have included the Denelcor HEP, the first commercially available MIMD
computer, and the Connection Machine.
Although the HEP is no longer available, we feel that its architecture,
based on multiple instruction streams time-sharing a single instruction
pipeline, is sufficiently novel and interesting to warrant inclusion; and the
Connection Machine represents the opposite approach of connecting a very
large number (approximately 65000) of small processors in a network.
The INMOS transputer is the first commercial chip which has been
conceived of as a building block for parallel computers; its architecture and
R W Hockney
C R Jesshope
(Reading and Southampton Universities, August 1987)
1 Introduction
† The gate delay time is the time taken for a signal to travel from the input of one
logic gate to the input of the next logic gate (see e.g. Turn 1974 p 147). The figures
are only intended to show the order of magnitude of the delay time.
logical unit for processing the data. The latter two units are referred to as the central processing unit or CPU. The important feature in the present
context is that each operation of the computer (e.g. memory fetch or store,
arithmetic or logical operation, input or output operation) had to be
performed sequentially, i.e. one at a time. Parallelism refers to the ability to
overlap or perform simultaneously many of these tasks.
The principal ways of introducing parallelism into the architecture of
computers are described fully in §1.2. They may be summarised as:
(a) Pipelining— the application of assembly-line techniques to improve
the performance of an arithmetic or control unit;
(b) Functional— providing several independent units for performing
different functions, such as logic, addition or multiplication, and allowing
these to operate simultaneously on different data;
(c) Array—providing an array of identical processing elements (PEs) under common control, all performing the same operation simultaneously but on different data stored in their private memories—i.e. lockstep operation;
(d) Multiprocessing or MIMD—the provision of several processors, each obeying its own instruction stream.
FIGURE 1.2 Evolutionary tree showing the architectural connections and influences during the development of parallel computers from the early 1950s to the mid-1980s. Numbers in parentheses are a guide only to the performance in Mflop/s of the computer when used sensibly. Full lines indicate design and manufacture, the star the delivery of the first operational system. Broken lines indicate strong architectural relationships.
CRAY-1, TIASC). Other machines (MU5, ATLAS, IBM 370, UNIVAC 1100,
DEC 10, CRAY-1) are described in the special issue of the Communications
of the Association for Computing Machinery devoted to computer architecture
(ACM 1978) and in the book by Ibbett (1982) entitled The Architecture of
High Performance Computers, which also discusses the CDC 6600 and 7600,
the IBM 360 models 91 and 195, the TIASC and CDC Cyber 205. Useful
summaries of the architecture of most commercially available computers are
given in a series of reports on computer technology published by Auerbach
(1976a). A recent review of parallelism and array processing including
programming examples is given by Zakharov (1984).
It does not appear that this ability to perform parallel operation was included
in the final design of Babbage’s calculating engine or implemented; however
it is clear that the idea of using parallelism to improve the performance of a
machine had occurred to Babbage over 100 years before technology had
advanced to the state that made its implementation possible. Charles Babbage
undoubtedly pioneered many of the fundamental ideas of computing in his
work on the mechanical difference and analytic engines which was motivated
by the need to produce reliable astronomical tables (Babbage 1822, 1864,
Babbage 1910, Randell 1975).
The first general-purpose electronic digital computer, the ENIAC, was a
highly parallel and highly decentralised machine. Although conceived and
designed principally by J W Mauchly and J P Eckert Jr, ENIAC was described
mainly by others (Goldstine and Goldstine 1946, Hartree 1946, 1950, Burks and Burks 1981). Since it had 25 independent computing units (20 accumulators,
1 multiplier, 1 divider/square rooter, and 3 table look-up units), each
following their own sequence of operations and cooperating towards the
solution of a single problem, ENIAC may also be considered as the first
could say that the algorithm was literally wired into the computer. It is
interesting that such ideas are beginning to sound very ‘modern’ again in
the 1980s in the context of MIMD computing, reconfigurable VLSI arrays, and special-purpose computers executing very rapidly a limited set of built-in algorithms. However, the time was not ripe for this type of parallel
architecture in the 1940s, as can be seen in the following quotation from
Burks (1981) who was a member of the ENIAC design team.
The ENIAC’s parallelism was relatively short-lived. The machine was completed
in 1946, at which time the first stored program computers were already being
designed. It was later realized that the ENIAC could be reorganized in the
centralized fashion of these new computers, and that when this was done it
would be much easier to put problems on the machine. This change was
accomplished in 1948.... Thereafter the plugboard of the ENIAC was never
modified, and the machine was programmed by setting switches at a central
location. Thus the first general-purpose electronic computer, built with a parallel
decentralized architecture, operated for most of its life as a serial centralized
computer! [Author’s italics.]
lines were provided as fast-access registers on DEUCE. One single- and one
double-word delay line were associated with fixed-point adders and could
also act as accumulators.
Bit-parallel arithmetic became a practical part of computer design with
the availability of static random-access memories from which all the bits of
a word could be read conveniently in parallel. The first experimental machine
to use parallel arithmetic was finished at the Institute for Advanced Study (IAS) in 1952, and this was followed in 1953 by the first commercial computer
with parallel arithmetic, the IBM 701. Both these machines used electrostatic
cathode ray tube storage devised by Williams and Kilburn (1949), and were
followed in a few years by the first machines to use magnetic-core memory.
The most successful of these was undoubtedly the IBM 704 of which about
150 were sold. This machine had not only parallel arithmetic but also the
first hardware floating-point arithmetic unit, thus providing a significant
speed-up over previous machines that provided floating-point arithmetic, if
at all, by software. The first IBM 704 was commissioned in 1955 and the
last machine switched off in 1975. The remarkable history of this first-
generation valve machine which performed useful work for 20 years is
described by McLaughlin (1975).
In the IBM 704, along with other machines of its time, all data read by
the input equipment or written to the output equipment had to pass through
a register in the arithmetic unit, thus preventing useful arithmetic from being
performed at the same time as input or output. Initially the equipment was
an on-line card reader (150 to 250 cards per minute), card punch (100 cards
per minute) and line printer (150 lines per minute). Soon these were replaced
by the first magnetic tape drives as the primary on-line input and output
equipment (at 15000 characters per second, at least 100 times faster than the
card reader or line printer). Off-line card-to-tape and tape-to-printer facilities
were provided on a separate input/output (I/O) computer, the IBM 1401.
However these tape speeds were still approximately 1000 times slower than
the processor could manipulate the data, and input/output could be a major
bottleneck in the overall performance of the IBM 704 and of the computer
installation as a whole.
The I/O problem was at least partially solved by allowing the arithmetic
and logic unit of the computer to operate in parallel with the reading and
printing of data. A separate computer, called an I/O channel, was therefore
added whose sole job was to transfer data to or from the slow peripheral
equipment, such as card readers, magnetic tapes or line printers, and the
main memory of the computer. Once initiated by the main control unit, the
transfer of large blocks of data could proceed under the control of the
I/O channel whilst useful work was continued in the arithmetic unit. The
I/O channel had its own instruction set, especially suited for I/O operations,
and its own instruction processing unit and registers. Six such channels were
added to the IBM 704 in 1958 and the machine was renamed the IBM 709.
This is therefore an early case of multiprocessing. The machine still used
electronic valves for its switching logic and had a short life because, by this
time, the solid state transistor had become a reliable component. The IBM
709 was re-engineered in transistor technology and marketed in 1959, as the
IBM 7090. This machine, together with the upgraded versions (IBM 7094
and 7094 II), was extremely successful and some 400 were built and sold.
During the early development of any new device it is usual to find a wide
range of innovative thought and design, followed by a period dominated by
a heavy investment in one particular type of design. After this period, further
innovation tends to be very difficult due to the extent of this investment. The
pattern can be seen in the development of the motor car, with a wide range
of engine principles used around 1900 including petrol, steam and rotary
engines, and the subsequent huge investment in the petrol driven internal
combustion engine that one can now scarcely imagine changing. Similarly
many novel architectural principles for computer design were discussed in
the 1950s although, up to 1980, only systems based on a single stream of
instructions and data had met with any commercial success. It appears likely,
however, that very large-scale integration (VLSI) technology and the advent
of cheap microprocessors may enable the realisation of some of these
architectures in the 1980s.
In 1952 Leondes and Rubinoff (1952) described a multi-operation computer
based around a rotating drum memory. The machine DINA was a digital
analyser for the solution of the Laplace, diffusion and wave equations. A
somewhat similar concept was put forward later by Zuse (1958) in the design
for a ‘field calculating machine’. The principle of spatially connected arrays
of processors was established by von Neumann (1952) who showed that a
two-dimensional array of computing elements with 29 states could perform
all operations, because it could simulate the behaviour of a Turing machine.
This theoretical development was followed by proposals for a practical design
by Unger (1958) that can be considered as the progenitor of the SOLOMON,
ILLIAC IV and ICL DAP computers that were to appear in the 1970s.
Similarly the paper by Holland (1959) describing an assembly of processors
each obeying their own instruction stream, can be considered the first
large-scale multiprocessor design and the progenitor of later linked microprocessor designs such as those proposed about 20 years later by Pease (1977), and by Bustos et al (1979) for the solution of diffusion problems. The indirect binary n-cube, proposed by Pease, was a paper design for 2ⁿ microprocessors connected topologically as either a 1-, 2-, up to n-dimensional cube suitable for a wide range of common numerical algorithms, such as the fast Fourier transform. It was envisaged that up to 16 384 microprocessors could be used (n = 14). The Hypercube announced by IMS Associates
(Millard 1975) was based on Pease’s ideas: it comprised a 4-dimensional
cube with two microprocessors at each node, one for internode communication
and the other for data manipulation. This particular hypercube came to
nothing at this time; however, the idea reappeared in the early 1980s as the
Cosmic Cube and the Intel Personal Supercomputer (see §1.1.8).
The maximum transfer rate of data to and from memory was thereby increased
by a factor equal to the number of memory banks. This was the first use of
parallelism in memory, and enabled a relatively slow magnetic-core memory
to be matched more satisfactorily to the faster processor. Almost all
subsequent large computers have used banked (sometimes called interleaved)
memory of this kind. The first STRETCH was delivered to Los Alamos
in 1961, but did not achieve its design goals. Its manufacture also proved
financially unsatisfactory to the company, and after seven systems were built
(one being installed at AWRE Aldermaston UK) the computer was withdrawn
from the product range. Potential customers were then sold the slower but
very popular IBM 7090 series.
After the experience of STRETCH, it seemed that IBM had lost interest
in high-speed computing. The IBM 360 series of computers was announced
in 1964, but it contained no machine with a performance comparable to that
of the CDC 6600 which was first installed the same year. The dramatic success
of the CDC 6600 in replacing IBM 7090s and converting most large scientific
centres to a rival company, made IBM respond. It was not until 1967, however,
that the IBM 360/91 (Anderson et al 1967) arrived with a performance of
about twice that of the CDC 6600. This machine had the look-ahead facility
of STRETCH, and like the CDC 6600 had separate execution units for
floating-point and integer address calculation each of which was pipelined
and could operate in parallel. The principle of pipelining was also introduced
to speed up the processing of instructions, the successive operations of
instruction fetch, decode, address calculation and operand fetching being
overlapped on successive instructions. In this way several instructions were
simultaneously in different phases of their execution as they flowed through
the pipeline. However the CDC 7600 appeared in 1969 and outperformed
the 360/91 by about a factor two. IBM ’s reply in 1971 was the 360/195 which
had a comparable performance to the CDC 7600. The IBM 360/195 (Murphy
and Wade 1970) combined the architecture of the 360/91 with the cache
memory that was introduced in the 360/85. The idea of introducing a
high-speed buffer memory (or cache) between the slow main memory and
the arithmetic registers goes back to the Ferranti ATLAS computer
(Fotheringham 1961). The cache, 32 768 words of 162 ns semiconductor
memory in the 360/85, held the most recently used data in blocks of 64 bytes.
If the data required by an instruction were not in the cache, the block
containing it was obtained from the slower main memory (4 Mbytes of
756 ns core storage, divided into 16 independent banks) and replaced the
least frequently used block in the cache. It is found in many large-scale
calculations that memory references tend to concentrate around limited
regions of the address space. In this case most references will be to data in
the fast cache memory and the performance of the 4 Mbyte slow memory
will be effectively that of the faster cache memory.
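As a minimal illustration of the locality being exploited (the array, its size and the loop below are purely hypothetical, not taken from any 360/85 workload), the following FORTRAN fragment references successive memory addresses, so nearly all accesses after the first one in each 64-byte block are satisfied from the cache:

      PROGRAM LOCAL
      INTEGER N, I
      PARAMETER (N = 16384)
      REAL X(N), S
      DO 5 I = 1, N
         X(I) = 1.0
    5 CONTINUE
      S = 0.0
C     Successive iterations reference successive words of X, so each
C     64-byte block brought into the cache serves sixteen 4-byte
C     elements before the next block must be fetched from main memory.
      DO 10 I = 1, N
         S = S + X(I)
   10 CONTINUE
      PRINT *, S
      END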
Gene Amdahl, who was chief architect of the IBM 360 series (Amdahl
et al 1964), formed a separate company in 1970 (the Amdahl Corporation)
to manufacture a range of computers that were compatible with the IBM
360 instruction code, and could therefore use IBM software. These machines
were an important step in the evolution of computer technology and
parallelism. The first machine in the range, the AMDAHL 470V/6, was the
first to use large-scale integration (LSI) technology for the logic circuits of the CPU (bipolar emitter coupled logic, ECL, with 100 circuits per chip), and
for this reason is sometimes called a fourth-generation technology computer.
Six such machines were delivered in 1975 of which the first two were to
NASA and the University of Michigan. The use of LSI reduced the machine
to about one-third of the size of the comparable IBM 360/168, which used
a much smaller level of integration. Although the arithmetic units in this
machine were not pipelined, a high throughput of instructions was obtained
by pipelining the processing of instructions. The execution of instructions
was divided into 12 suboperations that used 10 separate circuits. When
flowing smoothly, a new instruction could be taken every two clock periods
(64 ns) and therefore up to six instructions were simultaneously in different
phases of execution, and could be said to be in parallel execution. A high-speed
buffer (or cache) bipolar memory of 16 Kbytes (65 ns access) improved the
effective access time to the main memory of up to 8 Mbytes of MOS store
(650 ns access).
and the main memory (compared with one on the CRAY-1 and three on the
CYBER 205 and CRAY X-MP). The maximum main memory size is
256 Mbytes (eight times that of the CYBER 205 or CRAY-1). The
FACOM VPs provide masking instructions similar to those of the CYBER
205 and they have a hardware indirect vector addressing instruction
(i.e. a random scatter/gather instruction), present on the CYBER 205 but
missing on the CRAY-1. The clock period of the vector unit is 7.5 ns which
gives a maximum processing rate of 533 Mflop/s for the VP-200. This rate
and the memory sizes given above are halved for the VP-100. The first delivery
was a VP-100 to the University of Tokyo at the end of 1983. Two other
Japanese pipelined vector computers were announced in 1983, the HITACHI
S-810 model 10/model 20 (630 Mflop/s maximum) and the NEC SX-1/SX-2
(1300 Mflop/s maximum). All these Japanese computers are described in
more detail in Chapter 2, §2.4.
IBM’s first venture into vector processing was announced in 1985 (nine
years after the first CRAY-1 was delivered) as a part of the System/370 range
of computers. This comprises a multiprocessor IBM 3090 scalar processor
with a maximum performance of about 5 Mflop/s, to which can be optionally
attached vector facilities which have about three to four times the scalar
performance. This is a much lower ratio of vector to scalar speed than is
provided by competitive machines, but is thought by IBM to be the best
cost-effective choice. Each processor of the 3090 can support only one vector
facility, and the current (1987) maximum number of processors is six, although
this number will presumably increase. The IBM 3090-VF therefore provides
both vector (SIMD) processing in the vector facility, and multi-tasking (MIMD)
programming using the multiple processors. Each 3090 processor has a
64 Kbyte cache buffer memory that is used for instruction and data by both
the scalar and vector units. Behind the cache, the four-processor model 400
has a central memory of 128 Mbyte and one extended memory of 256 Mbyte.
Each vector facility has 16 vector registers, each holding 128 32-bit numbers,
which may be combined in pairs for storing 128 64-bit numbers. The cycle
time is 18.5 ns corresponding to a peak performance of 54 Mflop/s per
pipeline. The vector facility has independent multiply and add pipelines giving
a peak theoretical performance of 108 Mflop/s per vector facility. However,
these rates can only be approached in highly optimised codes which keep
most of the required data in registers or cache memory most of the time,
thereby making minimal access to central or extended memory which is a
severe bottleneck in the system. For example, the rate observed for a single
dyadic vector operation in FORTRAN (see §1.3.3) with all data in cache
memory is 13 Mflop/s. If the data are in central memory, this is reduced to
7.3 Mflop/s for a stride between successive elements of the vectors of unity
(compare 70 Mflop/s on the CRAY X-MP, see §2.2.6) and to 1.7 Mflop/s
for a stride of eight.
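The effect of stride can be pictured with a simple dyadic vector operation. The FORTRAN fragment below is only an illustrative sketch (the array names and lengths are invented, and no particular machine is implied): the first loop references its operands with unit stride, while the second references only every eighth element, so that most of each cache line or memory access is wasted:

      PROGRAM STRIDE
      REAL A(1024), B(1024), C(1024)
      INTEGER I
      DO 5 I = 1, 1024
         A(I) = I
         B(I) = 2.0*I
    5 CONTINUE
C     Unit stride: successive elements are adjacent in memory, so
C     cache lines (or consecutive memory banks) are used in full.
      DO 10 I = 1, 1024
         C(I) = A(I)*B(I)
   10 CONTINUE
C     Stride of eight: only every eighth element is referenced, so
C     most of the data moved from central memory is never used.
      DO 20 I = 1, 1024, 8
         C(I) = A(I)*B(I)
   20 CONTINUE
      END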
overlapping of the read, write and arithmetic, together with the provision of
all memory bank to PE connections, is expected to eliminate most bottlenecks; and the BSP is designed thereby to sustain a large fraction of its maximum processing rate of 50 Mflop/s on the majority of problems. The BSP is chosen
for detailed study in §3.4.3, but was withdrawn from the market in 1980 before
any had been sold.
ARPA and NASA jointly established an Institute of Advanced Computation
(IAC) to support the ILLIAC IV, and in 1977 this institute published a
design proposal for a machine called PHOENIX (Feierbach and Stevenson
1976b) to replace the ILLIAC IV in the mid-1980s. IAC foresaw the need
for a 10 Gflop/s machine in order to solve three-dimensional problems in
aerodynamic flow with sufficient resolution. The PHOENIX design can be
described as 16 ILLIAC IVs each executing their own instruction stream
under the control of a central control unit. If each PE can produce a result in 100 ns (a reasonable assumption with 1980 technology) the total of 1024 PEs could produce the required 10¹⁰ operations per second.
NASA also commissioned two other design studies, from Control Data
Corporation and Burroughs, for machines to replace the ILLIAC IV and
form a Numerical Aerodynamic Simulation Facility (NASF) for the mid-
1980s (Stevens 1979). The CDC design was based on an uprated four-pipe
CYBER 205, operating in lockstep fashion, plus a fifth pipe as an on-line
spare that can be electronically switched in if an error is detected. Each
pipeline can produce one 64-bit result or two 32-bit results every 8 ns.
However each result may be formed from up to three operations, leading to
a maximum computing rate of 3 Gflop/s. There was also a fast scalar processor
clocked at 16 ns. In contrast, the Burroughs design may be regarded as an
upgrade to the BSP architecture, being based on 512 pes connected to 521
memory banks (the nearest larger prime number). Unlike the ILLIAC IV,
each PE has its own instruction processor. The same instructions are assigned
to each processor, but the arrangement does permit them to be executed in
different sequences depending on the result of data-dependent conditions
that may differ from processor to processor. With a planned floating-point
addition time of 240 ns, multiplication time of 360 ns and 512 processors, a
maximum processing rate of about 1-2 Gflop/s could be envisaged. ECL technology was planned with a 40 ns clock period.
The initial SOLOMON computer design (Slotnick et al 1962) was a 32 x 32
array of one-bit processors each with 4096 bits of memory, conducting its
arithmetic on 1024 numbers in parallel in a bit-serial fashion. This describes
quite closely the pilot model of the ICL Distributed Array Processor
(DAP), that was started in 1972 and commissioned in 1976 (Flanders et al
1977, Reddaway 1979, 1984). The first production model of the machine was
from memory in the form of words (e.g. 32-bit floating-point numbers) and
by 1960 most scientific computers would process the bits of the word in
parallel. Such a procedure may be described as word-serial and bit-parallel
processing. Shooman recognised that many problems involving information
retrieval required searches on only a few bits of each word and that
conventional word-serial processing was inefficient. He proposed, therefore,
that the memory should also be referenced in the orthogonal direction, i.e.
across the words by bit slice. If the bits in memory are thought of as a
two-dimensional array, with the bits of the kth word forming the kth horizontal row, then the ith bit slice is the bit sequence formed from the ith bit of each number—that is to say the ith vertical column of the array. In the orthogonal computer, one PE is provided for each word of memory and
all the bits of a bit slice can be processed in parallel. This is called bit-serial
and word-parallel processing. The orthogonal computer provided a ‘horizontal
unit’ for performing word-serial/bit-parallel operation and a separate ‘vertical
unit’ for bit-serial/word-parallel operation.
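To make the two access patterns concrete, the sketch below (the word values, array sizes and names are purely illustrative, and it uses the BTEST bit-test intrinsic of later Fortran standards) extracts the ith bit slice of a set of words in software, one word at a time; on an orthogonal computer the vertical unit would deliver the whole slice in a single parallel memory reference:

      PROGRAM SLICES
      INTEGER WORDS(8), SLICE(8), I, K
      DATA WORDS / 5, 12, 7, 0, 255, 64, 3, 9 /
C     Form the Ith bit slice: bit I (counting from 0) of every word,
C     i.e. the Ith vertical column of the two-dimensional bit array.
      I = 2
      DO 10 K = 1, 8
         IF (BTEST(WORDS(K), I)) THEN
            SLICE(K) = 1
         ELSE
            SLICE(K) = 0
         END IF
   10 CONTINUE
      PRINT *, SLICE
      END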
The idea of performing tests in parallel on all words has led to the idea
of the associative or content-addressable memory, in which an item is
referenced by the fact that part of its contents match a given bit pattern (or
mask), rather than by the address of its location in store. In a purely associative
memory, there is no facility to address a data item by its position in store.
However many systems provide both forms of addressing. The multitude of
different processors based on associative memories have been reviewed by
Thurber and Wald (1975) and Yau and Fung (1977), and the reader is also
referred to the books by Foster (1976) and Thurber (1976) for a more complete
treatment. We will describe here only two of these, the OMEN and STARAN,
that have been marketed commercially.
The OMEN (Orthogonal Mini EmbedmeNt) series of computers were a
commercial implementation of the orthogonal computer concept, manufactured
by Sanders Associates for signal processing applications. They are described
by Higbie (1972). The OMEN-60 series used a PDP-11 for the conventional
horizontal arithmetic unit and an array of 64 PEs for the associative vertical
arithmetic unit which operated on byte slices, rather than bit slices. Depending
on the model, either bit-serial arithmetic with eight bits of storage was provided with each PE, or alternatively hardware floating-point with eight 16-bit registers and five mask registers. Logic was provided between the PEs
to reverse the order of the bytes within a slice, or perform a perfect shuffle
or barrel shift.
Another computer derived from the orthogonal computer concept was the
Goodyear STARAN (Batcher 1979) which was conceived in 1962, completed
in 1972 and by 1976 about four had been sold. The STARAN typically
comprised four array modules, each of 256 one-bit PEs and between 64 Kbits
are typical, depending on the problem and the care with which it is
programmed. There are no vector instructions, and such operations must be
coded as a tight scalar loop. In 1985 an implementation of the FPS 164 in ECL technology was announced. This machine, called the FPS 264, is four to
five times faster than the FPS 164, but otherwise the same.
A novel enhancement to the FPS 164 was announced in 1984, the FPS
164/MAX which stands for Matrix Algebra Accelerator. This machine has
a standard FPS 164 as a master, and may add up to 15 MAX boards, each
of which is equivalent to two additional FPS 164 CPUs. In total there is therefore the equivalent of 31 FPS 164 CPUs or 341 Mflop/s. The FPS 164
and 164/MAX are considered in detail in §2.5.7 and §2.5.8, respectively.
(ii) UK experiments
The many small experimental university MIMD systems developed in the late 1970s and early 1980s are too numerous to enumerate in full. However, the following systems are typical. Shared-memory MIMD systems have been built
at the University of Loughborough under Professor Evans, and used
extensively for the development of parallel MIMD algorithms (Barlow, Evans
and Shanehchi 1982). The first system used two Interdata 70 minicomputers
sharing 32 Kbytes of their address space. Later, in 1982, four TI 990/10
microcomputers sharing memory were commissioned as an asynchronous
parallel processor (Barlow et al 1981). A larger MIMD system, called CYBA-M, has
been built at the University of Manchester Institute of Science and Technology
(UMIST) under Professor Aspinall, and consists of 16 Intel 8080 microprocessors sharing a multiport memory through part of their address space
(Aspinall 1977, 1984, Dagless 1977). Other interesting UK projects are the
Manchester Data-Flow computer, the Imperial College ALICE reduction
machine and the University of Southampton RPA and Supernode projects.
The last two are discussed in Chapter 3.
(v) Erlangen EGPA
Perhaps the most original and imaginative MIMD architecture was developed under Professor Händler at the University of Erlangen, West Germany, and is called the Erlangen General Purpose Array or EGPA (Händler, Hofmann
and Schneider 1975, Händler 1984). The connections between the computers
in EGPA are topologically similar to a pyramid, with computers at the corners
and connections along the edges. The control C-computer, at the top of the
pyramid, controls four B-computers at the corners of its base. The four
B-computers also have direct connections along the edges of the base. This
five-computer system has been working since 1981, and uses AEG 80-60
minicomputers each with 512 Kbytes of memory. The idea is expandable to
further levels by making each B-computer itself the top of a further pyramid
with four A-computers at its base. There are then 16 A-computers in total,
which are connected amongst themselves at the lowest level as a 4 x 4 array
with nearest-neighbour connections, as in the ICL DAP. The advantage of
the EGPA pyramidal topology is that short-range communication can take place most effectively along the nearest-neighbour connections, whereas long-range communication can take place most effectively by sending the data higher up the pyramid. For example, if the bottom computers form an n x n array, then the furthest computers can communicate in 2 log₂ n steps via the top of the whole pyramid, compared with 2n steps if the nearest-neighbour connections alone were provided. The five-computer system has
been used for contour and picture processing, and has proved to be about
three times faster than a single computer of the same type (Herzog 1984). If
the hierarchical development is taken one stage further by making each
A-computer the top of a further pyramid, there will be 1, 4, 16 and 64 computers at the successive levels.
(vii) MIDAS
The Modular Interactive Data Analysis System (MIDAS) at the University
machine has been working since October 1983 and a number of computational
physics problems have been reformulated and successfully computed on it.
An extended 2¹⁰-node hypercube, called the Homogeneous Machine, is
planned using at each node the faster Intel 80286 plus 80287 plus 80186 chips
and 256 Kbytes of local memory (using 64 Kbit chips), expandable to
1 Mbyte with 256 Kbit chips. The maximum performance of the 2¹⁰-node hypercube is estimated to be about 100 Mflop/s, that is to say, about the
same as the large supercomputers (CRAY X-MP and CYBER 205).
16 Mflop/s per node. However, this rate assumes that the vectors are contiguously stored in the DRAM and that both the multiplier and adder are
being used simultaneously. The INMOS transputer has the job of rearranging
data into contiguous form, which it may do in parallel with the operation
of the vector pipelines. However, the ratio of the time to perform an arithmetic
operation to the time to rearrange data in the dra m to the time to obtain
data from a neighbouring node is 1:26:256. Again we find that communication
between nodes is at least two orders of magnitude slower than the arithmetic,
and the caveats applied above to the performance of the Intel iPSC clearly
also apply to the T-series.
number fixed by the hardware that may not suit the problem or algorithm
being implemented. Each DMM may contain up to 1M 64-bit words, giving
a maximum capacity of 1 Gbyte for a complete system. The switch is a
multilevel packet-switching network with a 50 ns propagation time between
nodes. A typical time to get data to a PEM from memory via the switch is 2 µs. Because of its interesting architecture we have chosen the Denelcor HEP for detailed study in §3.4.
The development of the HEP was sponsored by the US Army, and
culminated in the delivery of a 4 PEM x 4 DMM system to the Ballistics Research
Laboratories, Aberdeen, Maryland in 1982. This was especially fitting because
BRL had also sponsored and taken delivery of ENIAC in 1946, which can
be considered as the first MIMD computer. Single PEM/DMM systems have been
delivered to the University of Georgia Research Foundation (1982),
Messerschmidt Research Munich (1983), and Los Alamos Research
Laboratories (1983). The original HEP as described above, or HEP-1, was built
with conservative ECL 10K technology, because it was not desired to pioneer
a revolutionary architecture at the same time as a new technology. However,
the HEP-2, which was announced in 1983 for delivery in 1986, was to have
used upgraded ECL VLSI technology with a switching time of 300 ps and
2500 gates per chip. The clock period would have been 20 ns and there would
have been a combined multiply-add pipeline, with a peak rate of 100 Mflop/s
per PEM. It was to be based on the HEP-1 MIMD architecture and designed to
have a performance ranging from 250 Mips to 12000 Mips, corresponding
roughly to 50 Mflop/s and 2.4 Gflop/s. Regrettably, these plans did not come
to fruition because the company went out of business in 1985 due to financial
problems. However, the design of the HEP is so novel and interesting that we
describe it in more detail in §3.4.4.
(xvii) FPS-5000
The Floating Point Systems 5000 series of computers is, like ELXSI, a
bus-connected shared-memory MIMD system (Cannon 1983). An 8 or
12 Mflop/s control processor and up to three 18 Mflop/s XP32 arithmetic
coprocessors are connected via a 6 Mword/s bus to a system-common
memory of up to 1 Mword. The control processor is either an FPS AP-120B
or FPS-100 ‘array processor’, but the XP32 is a new design, using the WEITEK
32-bit floating-point multiplier and adder VLSI chips. The latter are eight-stage
pipelines operating on a 167 ns clock period, giving a peak performance of
6 Mflop/s per chip. One multiplier chip feeds its result into two adder chips
in a manner suited to the computation of the fast Fourier transform, giving
a peak performance of 18 Mflop/s per coprocessor, and 62 Mflop/s for the
maximum system. Independent programs can be executed in the control
processor and each of the coprocessors. The FPS-5000 is described in more
detail in §2.5.10.
(Feng 1981) which is similar to that required to bring data together in the
log₂ n stages of a fast Fourier transform on n data (the so-called butterfly
operation, see also §3.2.2 and §5.5.1). Hence the name butterfly switch.
The first model of the BBN Butterfly (1985) contained 128 CEs out of a design maximum of 256, each based on the Motorola 68000 with 1 Mbyte
of local memory. Later enhancements include hardware floating-point using
the Motorola 68020 with 68881 coprocessor, and a local memory of 4 Mbyte.
One CE with its memory is contained on a board. The original 128-CE Butterfly computer has achieved 26 Mflop/s on the multiplication of two
400 x 400 matrices, and 3 Mflop/s on the solution of 1200 linear equations
by Gaussian elimination.
(i) Convex C-1
The Convex C-1 uses 8000 gates per chip and packages the machine with
128 Mbyte of memory in a single 5-foot high 19 inch rack. A second rack
contains a tape drive and disc storage. With a total power dissipation of
3.2 kW, fan cooling is adequate, and an office environment, as used for the
typical minicomputer, is satisfactory. In contrast, the CRAY X-MP supercomputer uses 16-gate per chip integration, generates about 200 kW, and
requires a separate freon refrigeration plant for cooling; hence one reason
for the difference in costs. The relative computation speeds should be roughly
inversely proportional to the clock periods which are 12.5 ns for the CRAY-1
and 100 ns for the C-1, giving a ratio of eight, much as is observed.
The performance on a real problem can be judged and set in context by
the LINPACK benchmark (Dongarra 1986) in which a set of 300 linear
equations is solved in FORTRAN using matrix-vector techniques (see
table 1.1). The relative performances are 66 Mflop/s for the CRAY-1S,
8.7 Mflop/s for the Convex C-1 and 0.1 Mflop/s for the DEC VAX 11/780 with floating-point accelerator (a much used and typical 1984 minicomputer). Hand optimisation of critical coding improves the performance of the C-1
to 14 Mflop/s.

[Table 1.1 Comparison of the performance in Mflop/s of a typical 1984 minicomputer (the DEC VAX 11/780), minisupercomputers and supercomputers. The quantities compared are: theoretical peak performance; the (r∞, n₁/₂) FORTRAN parameters of §1.3.3; the Livermore inner product, tridiagonal and particle pusher kernels; the LINPACK benchmark (solution of n linear equations, Dongarra et al 1979) with n = 100, in FORTRAN and with assembler inner loop; and the best assembler matrix-vector code with n = 300. The number of CEs or CPUs used is given in parentheses; the VAX figures are all FORTRAN. See also Hockney (1985a).]

On the other hand a 1984 supercomputer, the CRAY X-MP/4, using all the four CPUs, can achieve 480 Mflop/s on this problem. The
significance of the development of minisupercomputers is thus that engineering
firms currently doing technical calculations on a VAX 11/780 or similar
minicomputer can enhance their calculational capability probably by about
two orders of magnitude by installing a minisupercomputer, without
significantly increasing their costs. This means that complex engineering
simulations, previously confined to costly specialist supercomputer centres,
can now for the first time be performed in-house. Furthermore, the quality
of the arithmetic is improved from 32-bit for a VAX to 64-bit on a
minisupercomputer. However, although the 1984 Convex C-1 can justifiably
claim to have between 1/8 and 1/4 of the performance of a 1976
supercomputer (the CRAY-1) the benchmark results also show that it has
only 1/50 of the performance of a 1984 supercomputer (the CRAY X-MP/4)
on the LINPACK benchmark.
The architecture of the Convex C-1 is broadly similar to the CRAY-1 in
that it is based on a set of functional units working from vector registers.
However, the detailed architecture is quite different; the machine does not
use the CRAY instruction set. Unlike the CRAY-1, the C-1 instruction set
allows addressing to an individual byte, and the 32-bit address (compare
24-bit on the CRAY-1) can directly address a virtual memory space of
4 Gbyte, or 500 Mword (64-bit). There are three functional units (for
load/store/vector edit, add/logical and multiply/divide), and eight vector
registers holding 128 64-bit elements each. Each functional unit comprises
two identical pipes, one for odd and the other for even elements. Each pipe
performs a 64-bit operation every 200 ns or a 32-bit operation every clock
period of 100 ns. This leads to an effective processing rate of one 64-bit result
every 100 ns, or one 32-bit result every 50 ns. Since only two of the pipes
perform floating-point arithmetic, this corresponds to a peak performance
of 20 Mflop/s in 64-bit mode or 40 Mflop/s in 32-bit mode. The vector
registers are supplied with data from either a 64 Kbyte, 50 ns cache memory,
or directly from a 16 Mword, 16-bank dynamic RAM main memory. The
bandwidth for transfers of 64-bit data words from main memory to the cache
is 10 Mword/s, compared to 80 Mword/s on the CRAY-1 and 315 Mword/s
on the CRAY X-MP.
(ii) SCS-40
The second minisupercomputer to be announced was the Scientific Computer
Systems SCS-40 which appeared in 1986. Like the Convex C-1, the manufacturers claim that the SCS-40 delivers 25% of the performance of the CRAY X-MP/1 at 15% of the cost. However, unlike the C-1, this machine
uses the CRAY instruction set, and CRAY programs should run without
alteration. The physical architecture comprises 16 main memory banks
of overhead. Each CE has a concurrency control unit (CCU) which is its interface to the concurrency bus. The CCUs distribute the work among the CEs at run-time, and synchronise the calculation by hardware. In this way, for example, data dependencies between different instantiations of a DO loop are maintained correctly by hardware without any intervention from the programmer, even though different loop indices are assigned to different CEs.
The Alliant FX/8 is to be used as the eight-PE cluster in the Illinois Cedar
project that has already been described (§1.1.8). It is also sold separately, for
example, as an Apollo DOMAIN workstation. First deliveries were made in
1985. A one-CE model is marketed as the Alliant FX/1. This has a 32 Kbyte
cache and one or two 8 Mbyte memory modules.
We have seen during our discussion of the history of parallelism that a wide
variety of different parallel architectures have been proposed, and a fair
number have been realised, at least in experimental form. Attempts to bring
some order into this picture by classifying the designs have not however met
with any general success, and there is (c1987) no useful and accepted
classification scheme or accompanying notation. We will however present
the taxonomy of Flynn (1972) in §1.2.2 and that of Shore (1973) in §1.2.3
since both of these have been discussed quite widely and some of the associated
terminology has become part of the language of computer science. The
problem with these classifications is that several well established architectures,
particularly the highly successful pipelined computer, do not fit into them at
all clearly, and others such as the ICL DAP may fit equally well into several
different groups. An alternative approach is to focus attention on the principal
ways in which parallelism appears in the architecture of actual computers,
namely: pipelining, processor replication and functional parallelism. These
divisions themselves, springing as they do from the engineering reality, form
the basis of a taxonomy that is easier to apply than the more theoretical
concepts of Flynn and Shore.
The first stage in the development of any successful classification is,
however, the ability to describe reasonably accurately and concisely the
essential features of a particular architecture, and this requires a suitable
notation. We give therefore in §1.2.4 such a notation that enables an
architecture to be described within one line of text. It plays much the same
role in discussing computer architecture as the chemical formulae of large
molecules do in chemistry. This notation is then used in §1.2.5 to give a
structural classification of the serial and parallel computers that are discussed
in this book.
inter-relations of the programs being executed, must assume that the different
phases of one job have to be executed in sequence and that any I/O statement
in the program must be completed before the next statement is executed. In
some circumstances a programmer can maintain control over his I/O
operations and arrange to read his data in blocks from backing store. In this
case he may apply buffering in order to overlap the I/O in his program with
the execution of his program. In three-stage buffering, for example, two
channels are used and three blocks of data are present in the store.
Simultaneously a new block is read into buffer 1 via the input channel, new
values are calculated by the processor from the block of data in buffer 2, and
the last block of values calculated are being written from buffer 3 via the
output channel. The calculation proceeds by cyclically changing the roles of
the buffers.
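A minimal sketch of this cyclic exchange of buffer roles is given below. The subroutine names and the processing step are invented for illustration, and the two transfer calls are ordinary synchronous stand-ins: on a machine with I/O channels (or with asynchronous I/O statements) they would be initiated and would then proceed concurrently with the computation that follows them.

      PROGRAM TRIPLE
      INTEGER NBUF
      PARAMETER (NBUF = 1024)
      REAL BUF(NBUF,3)
      INTEGER IN, IC, IO, ITMP, J, K
C     Buffer roles: column IN is being filled from the input channel,
C     column IC is being processed, and column IO is being emptied
C     to the output channel.
      IN = 1
      IC = 2
      IO = 3
C     Prime the scheme: clear the buffers and read the first block.
      DO 10 K = 1, NBUF
         BUF(K,1) = 0.0
         BUF(K,2) = 0.0
         BUF(K,3) = 0.0
   10 CONTINUE
      CALL GETBLK(BUF(1,IC), NBUF)
      DO 30 J = 1, 10
C        These two transfers stand for block I/O started on the two
C        channels, proceeding concurrently with the computation.
         CALL GETBLK(BUF(1,IN), NBUF)
         CALL PUTBLK(BUF(1,IO), NBUF)
C        Process the block read in on the previous pass.
         DO 20 K = 1, NBUF
            BUF(K,IC) = 2.0*BUF(K,IC)
   20    CONTINUE
C        Cyclically exchange the roles of the three buffers.
         ITMP = IO
         IO = IC
         IC = IN
         IN = ITMP
   30 CONTINUE
      END

      SUBROUTINE GETBLK(A, N)
      INTEGER N, K
      REAL A(N)
C     Stand-in for a block read via the input channel.
      DO 10 K = 1, N
         A(K) = K
   10 CONTINUE
      END

      SUBROUTINE PUTBLK(A, N)
      INTEGER N
      REAL A(N)
C     Stand-in for a block write via the output channel.
      END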
We can see from the above that the main requirement of computer
architecture in allowing parallelism at the job level is to provide a correctly
balanced set of replicated resources, which comes under the general classification
of functional parallelism applied overall to the computer installation. In this
respect it is important for the level of activity to be monitored well in all
parts of the installation, so that bottlenecks can be identified, and resources
added (or removed) as circumstances demand.
We next consider the types of parallelism that arise during the execution
of a program that constitutes one of the job phases considered above. Within
such a program there may be sections of code that are quite independent of
each other and could be executed in parallel on different processors in a
multiprocessor environment (for example a set of linked microprocessors).
Some sections of independent code can be recognised from a logical analysis
of the source code, but others will be data-dependent and therefore not known
until the program is executed. In another case, different executions of a loop
may be independent of each other, even though different routes are taken
through the conditional statements contained in the loop. In this case, each
microprocessor can be given the full code, and as many passes through the
loop can be performed in parallel as there are microprocessors. This situation
arises in Monte-Carlo scattering calculations for non-interacting particles,
and has important applications in nuclear engineering. The programming
problems associated with such linked microprocessors are an active area of
current research, and many such systems have become operational in the
1980s (see §§1.1.8 and 1.2.6).
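As a minimal sketch of such a loop (the quantities and the test are invented purely for illustration), every pass of the DO loop below references only its own element of each array, so the passes are mutually independent and could be handed to different microprocessors, even though the IF test may send different passes down different branches:

      PROGRAM PASSES
      INTEGER NPART, I
      PARAMETER (NPART = 1000)
      REAL E(NPART), PATH(NPART)
      DO 5 I = 1, NPART
         E(I) = 1.0 + 0.001*I
    5 CONTINUE
C     Each pass uses only element I of E and PATH, so the passes may
C     be executed in parallel, one (or several) per microprocessor.
      DO 10 I = 1, NPART
         IF (E(I) .GT. 1.5) THEN
            PATH(I) = 2.0/E(I)
         ELSE
            PATH(I) = 1.0/E(I)
         END IF
   10 CONTINUE
      END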
All manufacturers of computers designed for efficient operations on vectors
of numbers (Burroughs, Texas Instruments and CRAY) have produced
FORTRAN compilers that recognise when a DO loop can be replaced by
one or several vector instructions. This is the recognition by software that a
† 1 byte is a sequence of 8 binary digits (bits).
From our point of view, the problem with the above classification scheme is
that it is too broad, since it lumps all parallel computers except the multiprocessor into the SIMD class and draws no distinction between the
pipelined computer and the processor array which have entirely different
computer architectures. This is because it is, in effect, a classification by broad
function (whether or not explicit vector instructions are provided) rather
than a classification of the design (i.e. architecture). It is rather like classifying
all churches in one group because they are places of worship. This is certainly
a valid broad grouping, and distinguishes churches from houses, but is not
very useful to the architect who wishes to distinguish between the different
styles of church architecture by the shape and decoration of their arches and
windows. In this book we are, like the architect, interested in studying the
details of the organisation, and therefore need a finer classification.
is regarded as a two-dimensional array of bits with one word stored per row,
machine II reads a vertical slice of bits, whereas machine I reads a horizontal
slice. Examples are the ICL DAP and STARAN.
Machine III. This is a combination of machines I and II. It comprises a
two-dimensional memory from which may be read either words or bit slices,
a horizontal PU to process words and a vertical PU to process bit slices; in
short it is the orthogonal computer of Shooman (1970). Both the ICL DAP
and STARAN may be programmed to provide the facilities of machine III,
but since they do not have separate PUs for words and bit slices they are not
in this class. The Sanders Associates OMEN-60 series of computers is an
implementation of machine III exactly as defined (Higbie 1972).
Machine IV. This machine is obtained by replicating the PU and DM of machine I (defined as a processing element, PE) and issuing instructions to this ensemble of PEs from a single control unit. There is no communication between the PEs except through the control unit. A well known example is the PEPE machine. The absence of connections between the PEs limits the applicability of the machine, but makes the addition of further PEs relatively straightforward.
Machine V. This is machine IV with the added facility that the PEs are arranged in a line and nearest-neighbour connections are provided. This means that any PE can address words in its own memory and that of its immediate neighbours. An example is the ILLIAC IV, which also provides short-cut communication between every eight PEs.
Machine VI. Machines I to V all maintain the concept of separate data
memory and processing units, with some databus or switching element
between them, although some implementations of one-bit machine II
processors (e.g. ICL DAP) include the PU and DM on the same IC board.
Machine VI, called a logic-in-memory array (LIMA), is the alternative
approach of distributing the processor logic throughout the memory.
Examples range from simple associative memories to complex associative
processors.
Forgetting for the moment the awkward case of the pipelined vector
computer, we can see that Shore’s machines II to V are useful subdivisions
of Flynn’s SIMD class, and that machine I corresponds to the SISD class. Again
the pipelined vector computer, which clearly needs a category of its own, is
not satisfactorily covered by the classification, since we find it in the same
grouping as unpipelined scalar computers with no internal parallelism above
the requirement to perform arithmetic in a bit-parallel fashion. We also find
† Note that the three naturally occurring isotopes of carbon all have the same electronic
structure and therefore the same chemistry.
A The units
(1) Symbols— the following is an alphabetical list of the symbols which
designate the different types of units comprising a computer or processor:
B An integer, fixed-point or boolean execution unit.
C A computer which is any combination of units including
at least one I unit.
Ch An I/O channel which may send data to or from an I/O
device interface and memory, independently of the other
units.
D An I/O device, e.g. card reader, disc, VDU. The nature of
the device is given as a comment in parentheses.
E An execution unit which manipulates data. That is to say
it performs the arithmetic, logical and bit-manipulation
functions on the data streams. It is subdivided into F and
B units and often called an ALU or arithmetic and logical
unit.
F A floating-point execution unit.
H A data highway or switching unit. Transfers data without
change, other than possibly reordering the data items (e.g. the
FLIP network of STARAN).
I An instruction unit which decodes instructions and sends
(or issues) commands to execution units where the instructions
are carried out. Often called an IPU or instruction processing
unit.
IO An I/O device interface which collects data from a device
and loads it to a local register or vice versa.
M A one-dimensional memory unit where data and instructions
(6) Multiple units— the number of units of the same kind that can operate
simultaneously is indicated by an initial integer. Note particularly that a unit,
however multipurpose and complex, is not counted more than once unless
it can perform more than one operation at a time, e.g.
E a single multifunction execution unit for multiplication,
addition, logical etc, performing only one operation at a
time, as in the IBM 7090,
10E 10 independent function units for multiplication, addition,
logical etc, that may operate simultaneously, as in the CDC
6600.
(7) Replication—a bar over a symbol, or over a structure delimited by braces
{ }, is used to indicate that all the units in a group are identical, e.g.
A simplex connection can transfer data only in the direction shown. A full
duplex connection may transfer data in both directions at the same time.
A half duplex connection may transfer data in either direction, but not both
at the same time. Using the concurrent and sequential separators we have:
<->   an abbreviation for {<-, ->},
<-/->  an abbreviation for {<-/->}.
In the above the dash may be replicated (or printed as a line of arbitrary
length) in order to improve the appearance of a structural description, e.g.
The right-hand side of such a definition should contain only data paths.
no-connection symbol (|). A unit may be connected outside the braces as
follows:
-----------------U2-----------------
— U3 —
— {— U1—, |U2—, — U3— }—    as above but U2 is not connected
                              to the left, thus:
— U1 —
------------ U2-----------------
U3 —
— {— U1—/ — U2—/ — U3— }—   three parallel alternative paths,
                              working one at a time. Since
                              this is a distinction in time it
                              cannot be drawn differently
                              from the third example above.
In order to illustrate the use of descriptions of paths instead of simple units,
and the nesting of such parallel connections, consider the complex of
connections
Other more complex connection patterns may be specified as comments
between parentheses.
The following symbols are used: bit is denoted by (b) and byte (B), with the
usual SI unit prefixes and the convention K = 1024, M = K², G = K³, T = K⁴.
D Control of units
A set of units under the control of an instruction stream defines a computer.
70 INTRODUCTION
This definition of a processor fits with the common usage in the term processor
array for an array of E–M units under common control of an external
I unit, as in the ICL DAP.
The above distinction is not absolutely clear cut, because many units which
we would regard as execution units are in fact controlled by microprograms,
the instructions of which are processed by the E unit. From the point of view
of overall architecture— which is our main interest here— the important
point is whether a unit is programmable by the user. If it is, then we would
regard it as having an I unit. If it is not user programmable then there is no
I unit, even though internally a fixed micro-instruction stream may be
involved. Similarly, from the overall architectural view we would only
describe programmable registers as part of the description of a computer,
even though there are many other internal registers in the computer (for
example between the stages of a pipelined execution unit).
There is no reason why the notation should not be used to describe the
internal structure of a microprogrammed general-purpose arithmetic pipeline,
or a microprocessor in detail. In this case all the internal registers,
programmable or not, would be described and the unit processing the
microprogram would be usefully classified as an I unit. It is clear, therefore,
that the meaning of an I unit depends on the use to which the notation is
being put, and should be made clear in the supporting text.
(19) The extent of control exercised by an I or a C unit is shown by the
control brackets [ ]. The units controlled are listed inside the brackets,
separated by commas if they may operate simultaneously, or by slashes if
the units operate only one at a time, e.g.
For example:
I[10F, 10C]r                    The CDC 6600 with 10 different independent
                                floating-point functional units and 10 identical
                                I/O computers. Instructions are issued when
                                units are ready.
C[64P]                          The ILLIAC IV with 64 identical processors
                                controlled in lockstep mode.
I[4C] (description of control)  Four computers controlled by an I unit
                                in the manner described in the comment
                                within parentheses.
E Examples
In order to illustrate the use of the above notation we give below one-equation
or one-line descriptions of a variety of computers. The detail of the description
or, alternatively
is relatively rare and that the subdivisions are fine enough to differentiate
those computers that one feels should be treated separately.
At the highest level we follow the functional classification of Flynn and
divide computers into those with single instruction streams (SI) and those
with multiple instruction streams (MIMD). A taxonomy for MIMD computers
FIGURE 1.7 A classification of the processor arrays discussed in this book. (STARAN is classified for the situation when there is no
data permutation in the FLIP network.)
the second alternative naturally divide into those with a separate and
identifiable switch (switched MIMD) and those in which computing elements
are connected in a recognisable and often extensive network (MIMD networks).
In the former all connections between the computers are made via the switch,
which is usually quite complex and a major part of the design. In the latter,
individual computing elements (CEs) may only communicate directly with their
neighbours in the network, and long-range communication across the
network requires the routing of information via, possibly, a large number of
intermediate CEs. The CE must therefore provide a small computer (e.g. a
microprocessor), a portion of the total system memory, and a number of
links for connection to neighbouring elements in the network. Early network
systems provided these facilities on a board. However, the INMOS Transputer
(see §3.5.5) now provides a CE on a chip, and is an ideal building block for
MIMD networks.
In network systems the CEs are the nodes of the network and may also be
called nodal computers or processors, or processing elements. Within our
notation and classification, however, they should be called computing
elements in order to indicate that they are complete computers with an
instruction processing unit. The term PE is reserved for the combination of
arithmetic unit and memory without an instruction processing unit, as is
found in SIMD computers such as the ICL DAP (see §3.4.2).
Switched systems are further subdivided in figure 1.9 into those in which
all the memory is distributed amongst the computers as local memory and
the computers communicate via the switch (distributed-memory MIMD); and
those in which the memory is a shared resource that is accessed by all
computers through the switch (shared-memory MIMD). A further subdivision
is then possible according to the nature of the switch, and examples are given
in figure 1.9 of crossbar, multistage and bus connections in both shared- and
distributed-memory systems. Many larger systems have both shared common
computer and the Southampton ESPRIT Supernode computer (see §3.5.5( v))
are designed to satisfy this requirement.
To derive this simple generic description of all serial and parallel computers
we must first examine in more detail the principal ways of increasing the
speed of an arithmetic unit.
element pair enters the arithmetic unit. This sequential calculation of the
elements of the result vector is illustrated in the centre of figure 1.11, where
a time axis running from top to bottom is understood. If l is the number of
suboperations (in this case, l = 4) and τ is the time required to complete each
(usually the clock period) then the time to perform an operation on a vector
of length n is
    t = n l τ                                                          (1.1a)
and the maximum rate of producing results is
    r = (l τ)⁻¹                                                        (1.1b)
In serial operation we notice that the circuitry responsible for each of the
l suboperations is only active for 1/l of the total time. This in itself represents
an inefficiency, which is more obvious if we draw the instructive analogy
with a car assembly line. The suboperations that are required to manufacture
the sum of two numbers are analogous to the suboperations that are required
to manufacture a car: for example, (1) bolt the body to the chassis, (2) attach
the engine, (3) attach the wheels and (4) attach the doors. Serial operation
corresponds to only one group of men working at a time, and only one car
being present on the assembly line. Clearly, in our example, three-quarters
of the assembly-line workers are always idle.
The car assembly line obtains its efficiency by allowing a new car to begin
assembly in suboperation (1) as soon as the first car has gone on to
suboperation (2). In this way a new car is started every τ time units and,
when the line is full, a car is completed every τ time units. We often speak
of the suboperations as forming a pipeline, and in our example there are four
cars at various stages of assembly within the pipeline at any time, and none
of the assembly-line workers is ever idle. This is the principle that is used to
speed up the production of results in a pipelined arithmetic unit, and is
illustrated to the left of figure 1.11. The timing diagram shows that the
speed-up is obtained by overlapping (i.e. performing at the same time, or in
parallel) different suboperations on different pairs of arguments. The time to
perform the operation on a vector of length n is therefore
    t = [s + l + (n − 1)] τ                                            (1.2a)

where sτ is a fixed set-up time that is required to set up the pipeline for the
vectors in question, i.e. to compute the first and last addresses for each vector
and other overheads. It also includes the fixed time for numbers to be
transferred between memory and the arithmetic pipeline. The number of
sub-operations (stages or segments) in the pipeline is l and therefore differs for
different arithmetic operations. When full, and therefore operating smoothly,
Comparing this with equation (1.1b) one sees that pipelining of an operation
increases the speed at most by a factor of l, the number of suboperations
that are overlapped.
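The factor-of-l speed-up is easy to verify numerically. The following sketch (in Python, with illustrative values of l, s and τ that belong to no particular machine) evaluates the serial time (1.1a) and the pipelined time (1.2a) for a few vector lengths.

    # Serial and pipelined vector timing; l, s and tau are illustrative only.
    l = 4        # number of suboperations (pipeline stages)
    s = 6        # set-up time, in clock periods
    tau = 1.0    # clock period (arbitrary units)

    def t_serial(n):
        return n * l * tau              # equation (1.1a)

    def t_pipe(n):
        return (s + l + n - 1) * tau    # equation (1.2a)

    for n in (1, 10, 100, 1000):
        print(n, t_serial(n), t_pipe(n), t_serial(n) / t_pipe(n))

For n = 1000 the ratio is already close to 4, that is to l, while for n = 1 the pipeline is actually slower because of the set-up and fill time.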
It is obvious from the above that any operations that can be subdivided
into roughly equal suboperations can be pipelined. A very common example
is the pipelining of instruction processing, in which the overlapped sub-
operations might be: (1) instruction decode; (2) calculate operand addresses;
(3) initiate operand fetch; (4) send command to functional unit; and (5) fetch
next instruction. Computers with pipelined instruction processing units are
the IBM 360/91 (one of the earliest), ICL 2980, AMDAHL 470V/6. Other
computers, such as the BSP, pipeline on a more macroscopic scale and overlap
the operations of (1) memory fetch, (2) unpipelined arithmetic operation and
(3) storage of results. Machines that do have an arithmetic pipeline do not
necessarily have vector instructions in their repertoire (e.g. CDC 7600, IBM
360/195). The most notable machines with arithmetic pipelines and vector
instructions are the CDC STAR 100 (the first), its derivative the CYBER
205, the TIASC and the CRAY-1.
An alternative way of increasing the speed of arithmetic is to replicate the
execution units and form an array of processing elements (PEs) under the
common control of a single instruction stream. The PEs all perform the same
arithmetic operation at the same time, but on different data in their own
memories. If there are N such processors (N < n), the first N argument pairs
(xᵢ, yᵢ) can be sent to the PEs and the first N results found simultaneously in
a time of one parallel operation on all elements of the array, say t∥ (4τ in
our example). The next N elements can then be loaded and also computed
in parallel in a further t∥ time units. This is repeated until all the elements
are computed. The timing diagram is shown on the right of figure 1.11 and
we conclude that the time to compute a vector of length n on such a processor
array is
    t = ⌈n/N⌉ t∥                                                       (1.3a)
and
(1.3b)
where ⌈x⌉, the ceiling function of x, is the smallest integer that is either
equal to or greater than x. The function gives the number of repetitions of
the array operation that are required if there are more elements in the vector
than there are processors in the array. The maximum rate of producing results
The two parameters r∞ and n1/2 completely describe the hardware performance
of the idealised generic computer and give a first-order description of any
real computer. These characteristic parameters are called:
(1.4b)
where π0 = r∞/n1/2 is called the specific performance.
When deriving average values for r∞ and n1/2 from a timing expression
for a sequence of vector operations, it is important to remember that t is
defined as the time per vector operation of length n. Thus if a vector arithmetic
operation takes a time
(1.4c)
then
changes in technology. We shall see that it varies from n1/2 = 0 for serial
computers with no parallel operation to n1/2 = ∞ for an infinite array of
processors. It therefore provides a quantitative one-parameter measure of
the amount of parallelism in a computer architecture. Because n1/2 does not
appear as a simple multiplying factor in equation (1.4a) (unlike r∞, which cancels
when algorithms are compared on the same computer), the relative performance
of different algorithms on a computer is determined by the value of n1/2. The vector
length (or average length), n, may be said to measure the parallelism in the
problem, and the ratio v = n1/2/n measures how parallel a computer appears
to a particular problem. If v = 0 or small then an algorithm designed for a
sequential or serial environment will be the best; however if v is large an
algorithm designed for a highly parallel environment will prove the most
suitable. Chapter 5 is therefore mainly a discussion of the influence of n1/2
or v on the performance of an algorithm.
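A compact way to experiment with these ideas is to code the generic description directly. The sketch below assumes the linear form t = (n + n1/2)/r∞, so that r = n/t = r∞/(1 + n1/2/n); the numerical values of r∞ and n1/2 are invented for illustration.

    # Generic two-parameter performance model (a sketch; values are illustrative).
    def t_generic(n, r_inf, n_half):
        return (n + n_half) / r_inf         # time for a vector operation of length n

    def r_generic(n, r_inf, n_half):
        return r_inf / (1.0 + n_half / n)   # average rate; r = r_inf/2 when n = n_half

    r_inf, n_half = 100.0, 50.0             # e.g. Mflop/s and elements
    for n in (10, 50, 100, 1000):
        v = n_half / n                      # how parallel the computer appears to the problem
        print(n, v, r_generic(n, r_inf, n_half))

At n = n1/2 the rate is exactly half of r∞, and for n much larger than n1/2 it approaches r∞.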
It is evident from the timing equation (1.2a) for a pipelined computer that
any overhead, such as the set-up time sτ, contributes to the value of n1/2,
even though it may not represent any parallel features in the architecture.
The number of pipeline stages, l, in the same expression does, however,
measure hardware parallelism because it is the number of suboperations
that are being performed in parallel. It is not therefore strictly true that n1/2
always measures hardware parallelism, but we may describe it as measuring
the apparent parallelism of the hardware. From the user’s point of view the
behaviour of the computer is determined by the timing expression (1.4a) and
the value of n1/2, however it arises. A pipelined computer with a large value
of n1/2 appears and behaves as though it has a high level of real hardware
parallelism, even though it may be due to a long set-up time. And it simply
does not matter to the user how much of the apparent parallelism is real.
For this reason we will not draw a distinction between real and apparent
parallelism in the rest of this book; we simply refer to n1/2 as measuring the
parallelism of the computer.
Expressing the timing equation (1.4a) in terms of a start-up time,
t0, and a time per result, τ, we have
(1.4e)
which, comparing with equation (1.4a), corresponds to
(1.4f)
where
which could have been done in the time of the vector start-up, t0. It therefore
measures the importance, in terms of lost floating-point operations, of the
start-up time to the user. Secondly, when the vector length equals n1/2, the
first and second terms of equation (1.4e) are equal, and half the time is being
lost in vector start-ups (first term), and only half the time is being used to
perform useful arithmetic (second term).
Previous analyses of vector timings by Calahan (1977), Calahan and Ames
(1979), Heller (1978) and Kogge (1981) all use linear timing relations like
(1.4e), rather than our expression (1.4a). All the above authors recognise the
importance of the ratio (t0/τ), but they do not single it out as a primary
parameter. In our view, however, it is not the absolute value of the start-up
time that is of primary importance in the comparison of computers and
algorithms, but its ratio to the time per result. It is for this reason that we
signify this ratio with a descriptive symbol, n1/2, and make it central to the
analysis.
It will be clear from the above that an n1/2 timing analysis can be applied
to any process that obeys a linear timing relation such as (1.4a) or (1.4e).
Although we do not pursue it in this book, an obvious case is that of input
and output operations (I/O). In many large problems I/O dominates the
time of calculation. Obtaining data from a disc is characterised by a long
start-up time for the movement of arms and track searching, before the first
data element is transferred, followed by the rest of the elements in quick
succession. That is to say, the time to access n elements obeys equation (1.4e)
which is best interpreted using the equivalent equation (1.4a) and the two
parameters r∞ and n1/2. The maximum rate, r∞, is usually measured in Mbytes
per second, and n1/2 would be the length of the block transferred in bytes
required to produce an average transfer rate of r∞/2. Values of n1/2 obtained
for I/O systems are usually very large, indicating that one should transfer
large blocks of data (n > n1/2) as few times as possible.
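As a numerical illustration of this point (the figures below are invented, not measurements of any actual disc), a start-up time and an asymptotic transfer rate convert directly into an n1/2 for block transfers.

    # I/O transfers viewed through (r_inf, n_half); illustrative figures only.
    r_inf = 10.0e6     # asymptotic transfer rate, bytes per second
    t0    = 20.0e-3    # start-up: arm movement and track search, seconds

    # With t = t0 + n/r_inf, half the asymptotic rate is reached at n = r_inf*t0.
    n_half = r_inf * t0
    print(n_half)      # 200000.0 bytes; blocks much longer than this are needed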
(1.5)
whence, by comparison with equation (1.4a), one obtains for the pipeline
computer
    r∞ = τ⁻¹                                                           (1.6b)
and
    n1/2 = s + l − 1                                                   (1.6c)
Since the length of the pipeline, l, depends on the operation being carried
out, we do not expect n1/2 to be absolutely constant for a particular computer.
It will depend to some extent on the operations being performed, and how
the computer is used. In particular we note that any unnecessary overheads
in loop control that are introduced by a compiler, will appear as a software
addition to the value of s, and therefore as an increased value of n1/2.
Notwithstanding these reservations, we regard n1/2 as a useful characterisation
of vector performance.
For serial computers we compare the timing equation (1.1a) with the
generic form (1.4a) and obtain
    r∞ = (lτ)⁻¹,        n1/2 = 0                                       (1.6d)
The characterisation of a processor array by the parameter n1/2 is less
obvious, because the timing formula (1.3a) is discontinuous, as shown in
figure 1.13. It is best to distinguish two cases, depending on whether the
vector length, n, is less than or greater than the number of processors in the
array, N. If n ≤ N, the array is filled or partially filled only once. Thus the
time for a parallel operation is independent of n and equal to t∥. In this
circumstance, from the point of view of the problem, the array acts as though
it has an infinite number of processors. Appropriately, the correct limit is
obtained in the generic formula (1.4b) if we take
(1.7a)
On the other hand if n > N, the processor array will have to be filled several times
and the best characterisation will be obtained if we take as the generic
approximation a line that represents the average behaviour of the array. This
is the broken line in figure 1.13, that passes through the centres of the steps
which represent the actual performance. We obtain from the intercept and
slope of this line
(1.7b)
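The broken line of figure 1.13 can be reproduced numerically. The sketch below evaluates the stepped timing t = ⌈n/N⌉ t∥ for illustrative values of N and t∥ (t_par in the code), passes a straight line through the centres of two of the steps, and reads off effective values of r∞ and n1/2 from its slope and intercept. The values that emerge, N/t∥ and N/2, come from this construction and are offered only as an illustration of the averaging idea behind equation (1.7b).

    import math

    # Stepped processor-array timing (equation (1.3a)); N and t_par illustrative.
    N, t_par = 64, 1.0

    def t_array(n):
        return math.ceil(n / N) * t_par

    for n in (32, 64, 65, 128):
        print(n, t_array(n))               # the staircase of figure 1.13

    # The centre of the k-th step is at n = (k - 0.5)*N, where t = k*t_par.
    n1, t1 = 0.5 * N, 1 * t_par
    n2, t2 = 9.5 * N, 10 * t_par
    slope = (t2 - t1) / (n2 - n1)          # = t_par/N
    intercept = t1 - slope * n1            # = t_par/2

    r_inf  = 1.0 / slope                   # effective asymptotic rate
    n_half = intercept * r_inf             # effective half-performance length
    print(r_inf, n_half)                   # 64.0 32.0, i.e. N/t_par and N/2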
A more complicated situation arises in the case of an array of pipelined
processors— for example the CDC NASF design. In this case the actual
execution time is given by
(1.8a)
where N is the number of identical arithmetic pipelines, and sτ and l are the
set-up time and number of segments of each pipe. This result can be
understood by thinking of the system as one pipeline that performs its
suboperations on superwords, each of which comprises N numbers. Then
⌈n/N⌉ is the number of such superwords that must be processed, and
TABLE 1.2 The specific performance, π0, for a range of parallel computer
architectures. Maximum values are quoted.
can regard the n1/2 axis as ranging from the most general-purpose on the
left to the most specialised on the right.
(1.9d)
Thus the time of execution is constant independent of the vector length, and
the performance is proportional to vector length. This is the behaviour of
an infinite array of processing elements, since there are, in this case, always
enough processors to assign one to each of the vector elements. The
computation can, therefore, always be completed in the time for one parallel
operation of the array, independent of the length of the vector. Thus, in the
short-vector limit all computers act like infinitely parallel arrays, even though
their n1/2 might be quite small.
We find therefore that r∞ characterises the performance of a computer on
long vectors, whilst π0 characterises its performance on short vectors, hence
the subscript zero.
Many parallel computers have a scalar unit with n1/2 = 0 and a maximum
processing rate r∞s, as well as a vector processing array or pipeline with
n1/2 > 0 and a maximum processing rate of r∞v. The vector breakeven length,
nb, is the vector length above which the vector processor takes less time to
perform the operation on a vector than the scalar processor. Using the generic
formula (1.4a) we obtain:
    nb = n1/2/(R∞ − 1)                                                 (1.10)
where R∞ (= r∞v/r∞s) is the ratio of maximum vector to maximum scalar
processing rates, both measured in elements per second. The relationship
(1.10) is plotted in figure 1.18. Generally speaking, it is desirable to have a
small value of nb, otherwise there will be few problems for which the vector
processor will be useful. Equation (1.10) shows that this can be achieved by
a small value of n1/2 or a large ratio between the vector and scalar processing
rates. The vector breakeven length and other parameters are given for a
selection of computers in table 1.3.
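A small worked example, assuming the generic times t_vector = (n + n1/2)/r∞v and t_scalar = n/r∞s, so that equating the two gives nb = n1/2/(R∞ − 1). The parameter values below are illustrative only.

    # Vector breakeven length under the generic timing model (a sketch).
    def breakeven(n_half, R_inf):
        # equate (n + n_half)/r_inf_v with n/r_inf_s and solve for n
        return n_half / (R_inf - 1.0)

    for n_half, R_inf in ((100.0, 10.0), (100.0, 4.0), (20.0, 10.0)):
        print(n_half, R_inf, breakeven(n_half, R_inf))

Either reducing n1/2 or increasing R∞ brings the breakeven length down, as stated above.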
Since the ratio of vector to scalar speed is usually substantial (of
order 10), the overall performance of an actual program depends on the
fraction v of the arithmetic operations between pairs of numbers (called
elemental operations) that are performed by vector instructions compared to
Computer†            n1/2        r∞ (Mflop/s)        R∞        nb
those that are performed by scalar instructions. We shall refer to this ratio
as the fraction of arithmetic vectorised. The average time per elemental
operation is then
    t = v tv + (1 − v) ts                                              (1.11a)
where tv and ts are the average times required for an elemental operation
when performed with, respectively, a vector or scalar instruction. The rate
of execution is r = t⁻¹ and is a maximum rv for complete vectorisation
(v = 1, t = tv); hence the fraction of the maximum realisable gain that is
achieved with a fraction v of the arithmetic vectorised is
    g = r/rv = tv/t = [v + (1 − v)R]⁻¹                                 (1.11b)
where R = rv/rs = ts/tv = R∞η is the actual vector to scalar speed ratio for
the problem concerned, and is a function of vector length through the
efficiency η. Figure 1.19 shows g as a function of v for a variety of values of
R from 2 to 1000. It is clear that for large R a very high proportion of the
arithmetic must be vectorised if a worthwhile gain in performance is to be
realised. Obtaining such levels of vectorisation may not be as difficult as it
appears, because the introduction of one vector instruction of length n
vectorises n elemental operations, where n may be very large. As a measure
of the amount of vectorisation required, we define v1/2 as the fraction of
arithmetic that must be vectorised in order to obtain one-half of the maximum
realisable gain. From equation (1.11b) we obtain
(1.12a)
which in the limit of large R becomes
(1.12b)
(1.12c)
That is to say, if the vector unit is much faster than the scalar unit, the speed
of a vector computer is determined only by the speed of its scalar unit, r∞s,
and the scalar fraction of the arithmetic (1 − v). Put in another way, if a
tortoise and a hare are in a relay race, the average speed of the pair is almost
entirely determined by the speed of the tortoise and how far it has to travel.
The speed of the hare is not important because the time taken by the hare
is, in any case, negligible. It would not improve matters significantly, for
example, if the hare were replaced by a cheetah. This effect is shown by the
slow approach of v1/2 to its asymptotic value in figure 1.20, as the speed of
the vector unit increases. In a sentence, vector computers with slow scalar
units are doomed, as can be seen from the history of the CDC STAR 100
(see §1.1.3). The above effect and equations (1.11) and (1.12) are collectively
known as Amdahl's law (Amdahl 1967), and the steep rise in the curve of
figure 1.19 as Amdahl's wall. In the design of their vector computer, IBM
have concluded that it is not cost effective to build a vector unit that is more
than four or five times faster than the scalar unit, although we have seen in
table 1.2 that many of the most successful vector computers have values of
R∞ significantly higher than this.
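The steepness of the curves in figure 1.19 can be checked directly. The sketch below takes the reading g = tv/t = [v + (1 − v)R]⁻¹ implied by equation (1.11a); the closed form for v1/2 is obtained by solving g = 1/2, and is offered here as a consequence of that reading rather than as a quotation of equation (1.12).

    # Amdahl's law: fraction of the maximum realisable gain versus vectorisation.
    def g(v, R):
        # g = t_v / (v*t_v + (1 - v)*t_s) with R = t_s/t_v
        return 1.0 / (v + (1.0 - v) * R)

    for R in (2, 10, 100, 1000):
        v_half = (R - 2.0) / (R - 1.0)   # vectorisation needed for g = 1/2
        print(R, v_half, g(0.9, R))

For R = 1000, even 90% vectorisation yields only about 1% of the maximum realisable gain, which is the wall referred to above.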
The Amdahl law arises whenever the total time of some activity is the
sum of the time for a fast process and the time for a slow process. The same
law therefore applies when a job is subdivided for execution on a MIMD
There are, however, many cases where the natural subdivision of a problem
onto a MIMD computer leads to the execution of an identical sequence of
instructions by all the microprocessors. Consider, for example, the solution
of p independent sets of tridiagonal equations, one set given to each of p
processors, or the calculation of p independent fast Fourier transforms. Both
these problems occur at different stages of the solution of partial differential
equations by transform methods (see §5.6.2), and contain no data-dependent
branches. They are therefore ideally suited to SIMD solution and do not require
the multiple instruction streams of a MIMD computer.
The three problems (or overheads) associated with MIMD computing are:
(1) scheduling of work amongst the available processors (or instruction
streams) in such a way as to reduce, preferably to zero, the time that processors
are idle, waiting for others to finish;
(2) synchronisation of the processors so that the arithmetic operations take
place in the correct sequence;
(3) communication of data between the processors so that the arithmetic
is performed on the correct data.
The problem of communication of data from memory to the arithmetic
units is present on all computers and causes the difference between peak
performance rates of arithmetic pipelines usually quoted by manufacturers,
and the average performance rates found for realistic problems. Scheduling
and synchronisation are, however, new problems introduced by MIMD
computation. Three parameters are used to quantify the problems, Ep
for scheduling, s1/2 for synchronisation and f1/2 for communication
(Hockney 1987b, c, d).
(i) Scheduling parameter: Ep
Scheduling is the most commonly studied problem in MIMD computation;
indeed until recently it was the only problem of the three that had received
much attention. Most of the literature on parallel algorithms is concerned
with the problem of scheduling the work of different algorithms onto a
p-processor system, on the assumption that the time taken for synchronisation
and data communication can be ignored. For the moment we shall also make
this assumption, because we wish to consider the latter two effects separately.
Following the work of David Kuck’s group at the University of Illinois
(Kuck 1978 p 33) we introduce the efficiency of scheduling, Ep, of work
amongst p processors as follows. If T1 is the time to perform all the work
on one processor, and Tp is the time to perform the work when it is shared
amongst the p processors of the same type, then

    Ep = T1/(p Tp)                                                     (1.13)
CHARACTERISATION OF PERFORMANCE 103
Perfect scheduling occurs when it is possible to give (1/p)th of the work to
each processor, when Tp = T1/p and Ep = 1. If the work cannot be exactly
balanced between the p processors, then some processors finish before others
and become idle, making Tp > T1/p and Ep < 1. For this reason scheduling
is sometimes referred to as load balancing.
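A minimal illustration of Ep: the sketch below distributes a bag of unequal task times over p processors with a simple greedy rule (the rule is an assumption made for the example, not a method prescribed in the text) and reports Ep = T1/(pTp).

    import heapq

    def schedule(task_times, p):
        """Greedy longest-task-first assignment; returns T_p, the last finishing time."""
        loads = [0.0] * p
        heapq.heapify(loads)
        for t in sorted(task_times, reverse=True):
            lightest = heapq.heappop(loads)
            heapq.heappush(loads, lightest + t)
        return max(loads)

    tasks = [5, 4, 4, 3, 2, 2, 1, 1]        # illustrative work quanta (time units)
    T1 = sum(tasks)                          # time on a single processor
    for p in (2, 4, 8):
        Tp = schedule(tasks, p)
        print(p, Tp, T1 / (p * Tp))          # p, T_p and the E_p of equation (1.13)

With two processors this particular work balances exactly (Ep = 1); with eight it cannot, and Ep falls to 0.55.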
If r∞ is the maximum performance of the p-processor system, then each
processor has a maximum performance of r∞/p. Also, if we consider the
processors to be serial computers (n1/2 = 0), as would be the case if they were
microprocessors, and let the work be quantified as s floating-point arithmetic
operations, then the time to perform all the work on one processor is

    T1 = s/(r∞/p) = sp/r∞                                              (1.14)
(1.15)
(1.17)
This expression is the same as equation (1.9a) and figure 1.16 for the average
vector performance as a function of vector length, with r∞ replaced by r∞Ep,
and n1/2 replaced by s1/2Ep.
It is important to know the value of s1/2 for a MIMD computer because it
is a yardstick which can be used to judge the grain size of the program
parallelism which can be effectively used. In other words, it tells the
programmer how much arithmetic there must be in a work segment before
it is worthwhile splitting the work between several processors. If we regard
(1.20a)
and that vector operations in the arithmetic pipeline obey the timing relation
(1.20b)
(1.21a)
then, in the case that memory transfers and arithmetic cannot be overlapped,
we find that
(1.21b)
(1.21c)
where
Thus the ratio x determines the extent to which the average performance
of the combined arithmetic pipeline and memory approaches asymptotically
that of the arithmetic pipeline alone. The peak performance is defined as
the performance in the limit f → ∞, and is equal in this model to r∞. Equation
(1.21b) shows that the manner in which this asymptote is reached is identical
to the way the average performance of a single pipeline, r, approaches its
asymptote, r∞, as a function of vector length n, namely like the pipeline
function pipe(x) (see equation (1.9a) and figure 1.16). Hence, in analogy to
n1/2, we introduce another parameter f1/2 (the half-performance intensity)
which is the value of f required to achieve half the peak performance. The
parameter f1/2 is a property of the computer hardware alone and provides
a yardstick to which the computational intensity, f (which is a function of the
algorithm and application alone), should be compared in order to estimate
the average performance, r. If f = f1/2, then the average performance is half
the maximum possible (as with n1/2); but if we require 90% of the maximum
performance (0.9r∞) then we need f = 9f1/2. Equation (1.21c) shows how the
n1/2 of the combined memory and arithmetic pipeline varies from that of the
memory for small f to that of the arithmetic pipeline for large f.
Figures 1.21(a) and (b) show measurements of f1/2 for an FPS 164 linked by a
channel to an IBM 4381 host (Hockney 1987c). Similar measurements have
also been reported for the FPS 5000 (Curington and Hockney 1986).
The value of f1/2 is determined by plotting f/r versus f, which should be
approximately linear. The inverse slope of the best straight line is r∞ and the
negative intercept with the f axis is f1/2, because rearranging equation (1.21b)
we have
    f/r = (f + f1/2)/r∞                                                (1.21d)
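The straight-line fit is a few lines of code. In the sketch below the (f, r) pairs are synthetic, generated from assumed values r∞ = 11 Mflop/s and f1/2 = 2.5 flop per word rather than taken from a real measurement; the least-squares fit of f/r against f then recovers those values as the inverse slope and the negative f-axis intercept.

    # Least-squares determination of r_inf and f_half from (f, r) pairs (a sketch).
    r_inf_true, f_half_true = 11.0, 2.5
    fs = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
    rs = [r_inf_true / (1.0 + f_half_true / f) for f in fs]   # synthetic measurements

    ys = [f / r for f, r in zip(fs, rs)]     # f/r = (f + f_half)/r_inf is linear in f
    n = len(fs)
    sx, sy = sum(fs), sum(ys)
    sxx = sum(f * f for f in fs)
    sxy = sum(f * y for f, y in zip(fs, ys))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n

    r_inf_fit = 1.0 / slope                  # inverse slope
    f_half_fit = intercept * r_inf_fit       # minus the f-axis intercept
    print(r_inf_fit, f_half_fit)             # recovers 11.0 and 2.5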
If, on the other hand, memory transfers may take place at the same time as
the arithmetic pipeline performs arithmetic, it is possible to overlap memory
transfer with arithmetic. In this case, a different functional form arises which
we define as the knee function. This is the truncated ramp function
(1.22a)
where
(1.22c)
Thus, the effect of allowing memory transfer overlap is to halve the value of
f1/2, and therefore reduce the amount of arithmetic per memory reference
that is required to achieve a given fraction of the peak performance.
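The halving can be seen by putting the two models side by side. In the sketch below tm is the time to transfer one word and ta the time for one arithmetic operation (both invented for illustration); without overlap the two times add, with overlap the larger of the two dominates. This is a reading of the two models, not a transcription of equations (1.21) and (1.22).

    # Overlapped versus non-overlapped memory transfer (illustrative tm and ta).
    tm, ta = 4.0, 1.0                 # time per memory word, time per flop
    r_inf = 1.0 / ta

    def r_no_overlap(f):              # t = m*tm + s*ta, with f = s/m
        return 1.0 / (ta + tm / f)

    def r_overlap(f):                 # t = max(m*tm, s*ta)
        return min(1.0 / ta, f / tm)

    for f in (1, 2, 4, 8, 16):
        print(f, r_no_overlap(f) / r_inf, r_overlap(f) / r_inf)

Half the peak rate is reached at f = 4 without overlap but already at f = 2 with overlap, that is, f1/2 is halved.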
In the above analysis, if the memory and arithmetic pipeline are in the
same CPU, we have an analysis of a memory-bound single instruction stream/
single data stream (SISD) or SIMD computer. If, on the other hand, the memory
and arithmetic pipeline are in different CPUs, we have an analysis of the
communication overhead in a MIMD computer. In both cases the key hardware
parameter is f1/2 which in this model of computation is proportional to the
ratio of arithmetic performance to the rate of data transfer. Other analyses
of performance degradation due to data communication problems are given
by Lint and Agerwala (1981) and Lee, Abu-Sufah and Kuck (1984).
(1.24a)
t = t(start-up and synchronisation) + t(communication) + t(calculation)
(1.24b)
where the three terms are identified respectively with start-up and
synchronisation, communication and calculation as shown. The terms given
in the equation are explained as follows. The term s is the total number of
floating-point operations (flop) performed in all processors; m is the number
of I/O data words in the work segment (see below); f equals s/m and gives
the floating-point operations per I/O data word; t0(p) is the time for the
null job (s = m = 0); tt(p) is the time per I/O word on average; and ta(p) is
the time per floating-point operation, on average.
The problem variable m quantifies the communication that a work segment
has with the rest of the program. If, as is usual, the body of a work segment
is written as a subroutine, it is the number of words contained in all the
input and output variables and arrays of the subroutine, for all instantiations
of the subroutine. In the case of the above benchmark, if the vector length
is n, then m = 3n (two input vectors and one output vector), and s = 3nf
because the arithmetic is repeated 3f times.
A little rearrangement shows that
(1.25a)
and
(1.25b)
or
(1.25c)
where
(1.25d)
(1.26)
(1.28)
where only the time of the second term is reduced by parallel execution. We
have, of course, assumed that synchronisation and communication time are
negligible. If we define the performance R(p) of the program to be the inverse
of the executing time, then
(1.29a)
which can also be written
(1.29b)
where
(1.29c)
Thus the asymptotic rate R∞ is the inverse of the time for the part of the
program that could not be parallelised. This occurs in the model when the
number of processors p → ∞, which reduces the time for the parallelisable
part of the program to zero. We notice that the functional form of the
approach to the asymptotic rate is again that of the pipeline function
pipe(x), and p1/2, as usual, is the number of processors required to achieve
half of the asymptotic performance.
The usual parameter that is used to compare the performance of algorithms
is the speed-up, Sp (Kuck 1978), which is defined as
    Sp = T1/Tp                                                         (1.30a)
where T1 and Tp are, respectively, the times for the algorithm to run on one
or p processors. From which we conclude
Both the pipeline function equations (1.29b) and (1.30a) are expressions
of Amdahl’s law (Amdahl 1967), that the performance or speed-up cannot
exceed that obtained if the vectorised or parallelised parts take zero time.
The maximum rate is then determined by the time for the unvectorised or
unparallelised part of the program.
Many parallelised programs are found to follow the functional dependence
of equation (1.29b), at least for small values of p. In practice, however, as p
becomes larger, synchronisation and other overheads usually increase rapidly
with p, with the result that there is often a maximum in the program
performance, and a subsequent reduction in performance as the number of
processors further increases. This observed behaviour has been fitted to the
function
(1.31a)
(1.31b)
R∞ = 0.5
(1.32)
The CRAY X-MP and CRAY-2 are both manufactured by Cray Research
Inc.† in Chippewa Falls, Wisconsin, USA. They are derivatives of the CRAY-1
computer which was designed by Seymour Cray and first installed at the Los
Alamos Scientific Laboratory in 1976 (Auerbach 1976b, Hockney 1977,
Russell 1978, Dungworth 1979, Hockney and Jesshope 1981). The develop-
ment of the CRAY X-MP, which is a multiprocessor version of the CRAY-1,
is due to Steve Chen and his team (Chen 1984) in Chippewa Falls. A two-cpu
version was announced in 1982, and a four-cpu version in 1984. Data on the
CRAY X-MP is taken primarily from the CRAY X-MP Computer Systems
reference manuals (Cray 1982, 1984a, 1984b). The next development in this
series is the CRAY Y-MP, which was announced in 1987. This is anticipated
to have a clock period of 4-5 ns, to comprise up to 16 CPUs and to have
32 Mword of common memory, backed up with a secondary memory of
1 Gword.
The CRAY-2, on the other hand, has been developed as a separate project
by Seymour Cray at his Chippewa Falls laboratory. It uses more advanced
circuit technology and a new concept in cooling. This permits the use of the
FIGURE 2.1 A typical CRAY X-MP installation, showing the computer in the
centre, the I/O subsystem on the right and the solid state device (SSD) on the left.
(Photograph courtesy of Cray Research Inc.)
FIGURE 2.3 A single 6 inch x 8 inch circuit board of the CRAY X-MP
mounted with packaged integrated circuit chips. Each board has the
capacity for a maximum of 144 chips. (Photograph courtesy of Cray
Research Inc.)
space and weighs 5.25 tons. The IOS and SSD each occupy 15 square feet of
floor space, and weigh 1.5 tons.
Figure 2.3 shows a double-layer module. Two circuit boards are attached
to each of two copper plates and rigidly fixed to form a three-dimensional
structure of four boards. As well as the connections within the circuit boards,
cross connections are made in the third dimension between the four printed
circuit boards. The four-board module then slides in grooves as shown in
figure 2.2.
The CRAY X-MP obtains its high performance, in part, from the compact
arrangement of its circuit boards that leads to short signal paths and
propagation delays. This can be seen in figure 2.4, which shows the layout
FIGURE 2.4 (a) The layout of memory and logic within the central column of the
CRAY X-MP/48 showing the location of the four CPUs and memory. (b) Allocation
of circuit board positions to units in one quadrant (i.e. CPU 1).
of the modules in the X-MP/48. There are twelve columns in total, forming
the 270° arc of the CRAY X-MP housing. The central four columns house
the four CPUs and the left and right four columns the 8 Mword of main
memory.
2.2.2 Architecture
The overall architecture of the CRAY X-MP can be described as one, two
or four CRAY-1-like CPUs sharing a common memory of up to 8 Mwords.
The original two-CPU model introduced in 1982 occupied a three-quarter
cylinder of 12 columns, as shown in figure 2.1. However, in 1984 the use of
higher density chips allowed the two-CPU and one-CPU models to be housed
in six columns, and a four-CPU model was introduced using the full 12 columns.
The one-CPU model is available with 1, 2 or 4 Mwords of 76 ns static MOS
memory, arranged in 16 banks (1 and 2 Mwords) or 32 banks (4 Mwords);
and the two-CPU model with 2 or 4 Mwords of 38 ns ECL bipolar memory,
arranged in 16 and 32 banks respectively. The four-CPU model has 8 Mwords
of ECL memory arranged in 64 banks. The latter computer is called the CRAY
X-MP/48 where the first digit gives the number of CPUs and the second digit
the number of megawords of shared memory.
The overall architectural block diagram in figure 2.5 is drawn for the
CRAY X-MP/22 and X-MP/24. Each CPU has 13 or 14 independent
functional units working to and from registers (there are two vector logical
units in the model X-MP/48). As in the CRAY-1, there are eight 24-bit
address registers (A0, ..., A7), eight 64-bit scalar registers (S0, ..., S7) and
eight vector registers (V0, ..., V7), each holding up to 64 64-bit elements.
The address registers have their own address integer add and integer multiply
pipelines, and the scalar and vector registers each have their own integer add,
shift, logical and population count pipelines. Floating-point arithmetic
operations are performed in three pipelines, for multiply, add and reciprocal
approximation (RA), which take their arguments from either the vector or
scalar registers. Note that this means scalar and vector floating-point
operations cannot be performed simultaneously, as they can in computers
with separate scalar and vector units (e.g. CYBER 205). A 7-bit vector length
(VL) register specifies the number of elements (up to a maximum of 64) that
are involved in a vector operation, and the 64-bit vector mask (VM) register
has one bit for each element of a vector register and may be used to mask
off certain elements from action by a vector instruction.
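The joint effect of the VL and VM registers can be mimicked in a few lines. The sketch below is only an illustration of the semantics just described (operate element by element on the first VL elements, suppressing those whose mask bit is zero); it is not a transcription of any particular CRAY instruction, and the choice to leave masked-off result elements unchanged is part of the sketch.

    # Emulated masked, length-controlled vector add: V2 <- V0 + V1 under VL and VM.
    def vector_add(v0, v1, v2, vl, vm):
        for i in range(vl):            # only the first VL elements take part
            if vm[i]:                  # bit i of the vector mask register
                v2[i] = v0[i] + v1[i]  # masked-off elements are left unchanged here
        return v2

    v0 = [1.0, 2.0, 3.0, 4.0, 5.0]
    v1 = [10.0, 20.0, 30.0, 40.0, 50.0]
    v2 = [0.0] * 5
    print(vector_add(v0, v1, v2, vl=4, vm=[1, 0, 1, 1, 0]))
    # [11.0, 0.0, 33.0, 44.0, 0.0]: element 1 is masked off, element 4 is beyond VL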
In order to perform arithmetic operations, data must first be transferred
from common memory to the registers. This may be done directly, or in the
case of the S and A registers via block transfers to the T and B buffer registers,
each of which holds 64 data items. The maximum data transfer to the S and
A registers is one word every two clock periods, and to the T and B registers
combined is 3 words per clock period. Transfers to and from the vector
FIGURE 2.5 Architectural block diagram of a CRAY X-MP/2, showing the two
CPUs, memory and principal data paths. The various letters are Cray’s abbreviations
for various registers in the machine (i.e. A, B, S, T and V). These are followed by
a decimal number showing the number of registers. The ‘64’ represents the length of
register in bits. SB, ST, VM and VL are other registers, see text. The ending ‘F’ designates
floating-point operations.
registers must be made directly with common memory and three 64-bit data
paths are provided for this purpose. Two of the paths transfer input arguments
from common memory to the vector registers and the third transfers the
result from the vector register to common memory. Because the data paths
are separate, two input arguments and one result can be transferred per clock
Memory contention on the X-MP can then take several forms. Bank
conflicts, with which we are familiar on other systems, occur when a bank
is accessed whilst it is still processing a previous memory reference. A
simultaneous conflict arises when a bank is referenced simultaneously on the
independent lines from different CPUs. This is resolved by alternating the
priority of the CPUs every four clock periods. Finally there is a line conflict
within a CPU, which occurs when two or more of the A, B, C and I/O data
paths make a memory request to the same memory section (i.e. want the
same line) in the same machine cycle. Line conflicts are resolved by giving
priority to a vector reference with an odd stride over that with an even stride.
If both references have the same parity the first reference to have been issued
has priority. Memory references are resolved, and waits are initiated if conflicts
arise, on an element-by-element basis during vector references. This means
that the time interval between elements of a vector reference is not predictable
or necessarily regular. Since these elements feed into the arithmetic pipelines,
stages of a pipeline may become temporarily empty. These holes or ‘bubbles’
in the pipeline cause a degradation in average arithmetic performance,
depending on the detailed pattern of memory referencing. Since contention
for memory may depend on activity in CPUs that are not under the control
of the programmer the degradation is likely to be unpredictable. However,
in the very worst possible case of four CPUs each making three memory
accesses per clock period to the same bank, each CPU would only satisfy an
access every 16 clock periods, compared to a maximum rate of three accesses
every clock period, giving a theoretical maximum degradation of memory
bandwidth by a factor of 48. Cheung and Smith (1984) analyse more realistic
patterns of memory access and conclude that performance is typically
degraded by 2.5 to 7% on average due to memory contention, and in
particularly bad cases by 20 to 33%.
The solid state device, or SSD, is a solid state MOS storage device ranging
from 64 to 1024 Mbytes (i.e. up to 128 Mword arranged in 128 banks) which
can be used in place of disc storage for the main data of very large user
problems, and by the operating system for temporary storage. It is therefore
often referred to as a solid state disc, and acts like a disc with an access time
of less than 50 μs. It may be linked to an X-MP/48 via one or two
1250 Mbyte/s channels, and to an X-MP/2 by one such channel. The
X-MP/1 may be connected by a 100 Mbyte/s channel.
Front-end computers, magnetic tape drives and disc storage units are
always connected to the CRAY X-MP by an I/O subsystem (IOS). The first
was introduced in 1979 to improve the I/O performance of the original
CRAY-1 computer with both disc and magnetic tape storage. The enhanced
version introduced in 1981 is an integral part of all CRAY X-MP installations,
and comprises two to four 16-bit I/O processors (IOPs) and an 8 Mword buffer
memory. The first IOP handles communication with up to three front-end
processors, and the second IOP handles communication between up to 16 disc
units and the X-MP common memory via the buffer memory. The third and
fourth IOPs are optional, and each may be used to attach a further 16 disc
units. The DD-49 disc introduced in 1984 has a capacity of 1200 Mbyte and
a 10 Mbyte/s transfer rate. The IOS communicates with the CRAY X-MP
via two 100 Mbyte/s channels. There is also an optional direct path between
the IOS and SSD with a transfer rate of 40 Mbyte/s.
All the CPUs of a CRAY X-MP share a single I/O section which
communicates with the IOS and SSD. A 1250 Mbyte/s channel is provided for
SSD data transfer, and two 100 Mbyte/s channels for IOS transfers. Four
6 Mbyte/s channels transfer I/O requests from the CPUs and other computers
to the IOPs.
Each CPU has four instruction buffers, each holding 128 16-bit instruction
parcels. On the X-MP/2 all eight memory ports are used for instruction
fetch, which takes place at 8 words (32 parcels) per clock period. For this
reason all other memory references on both processors are suspended during
an instruction fetch from either processor. An exchange package of 16 64-bit
words defines the status of a user job and may be exchanged in 380 ns.
The intercommunication section of the CRAY X-MP contains three
clusters of common registers (five clusters on the X-MP/48) that may be
accessed by all CPUs for communication and synchronisation purposes. There
is also a common clock that allows program timing to be made to the nearest
clock period. The clock cycles of all CPUs are synchronised. The common
registers are eight 24-bit SB registers, eight 64-bit ST registers and 32 1-bit
synchronisation or semaphore (SM) registers. Instructions are provided to
transfer their contents to the A and S registers.
The 13 functional units take data from, and return data to, the A, S, V
and vector mask registers only. The functional units, which may all operate
concurrently, fall into four groups. The clock time, τ, of the first models of
the CRAY X-MP was 9.5 ns. This was reduced to 8.5 ns on later models of
the machine. We will use the value of 9.5 ns in the rest of this chapter.
                                                   Unit time
Functional unit                              (ns)      (clock periods) = l

Scalar units (64-bit)
(7) addition                                  57            6
(8) multiplication                            66.5          7
(9) reciprocal approximation (RA)            133           14

Vector units (64-bit)
(10) integer addition                         28.5          3
(11) shift                                    38            4
(12) logical                                  19            2
(13) population count                         57            6

Main memory operations (64-bit)
All the functional units are pipelined and may accept a new set of arguments
every clock period. In the above list the unit time is the length of the pipeline
in nanoseconds or clock periods. In the latter units, it is therefore the
variable l of §1.3.1. In the case of scalar instructions the above is the time
from instruction issue to the time at which the scalar result is available (i.e.
ready) in the result register for use by another instruction. In the case of a
vector instruction an additional clock period is required to transfer each
operand from a vector register to the top of a functional unit, and a further
clock period is required to transfer each result element from the bottom of
the pipeline to the result vector register. The timing formula for a vector
instruction processing n elements is therefore
(2.1a)
Comparing equation (2.1a) with the general formulae (1.6), we see that the
start-up time and half-performance length for register-to-register vector
operations are respectively given by:
(2.1b)
The address units are for address, index and other small-range calculations
with integers less than about 16 million in two’s-complement arithmetic.
Data are taken from and returned to the A registers. The scalar units operate
exclusively on 64-bit data in the S registers and, with the exception of
population count, return results to the S registers. Integer addition is in two’s
complement arithmetic. Shifts may be performed on either the 64-bit contents
of an S register (unit time 2 clock periods), or on the 128 bits of two
concatenated S registers (unit time 3 clock periods). The logical unit performs
bit-by-bit manipulation of 64-bit quantities and is an integral part of the
modules containing the S registers. For this reason the data do not need to
leave the S register modules and the function can be performed in one clock
period. The population-count unit counts the number of bits having the value
of one in the operand (unit time 4 clock periods) or the number of zeroes
preceding the first one in the operand (unit time 3 clock periods). The resulting
7-bit count is returned to an A register.
The vector units perform operations on operands from a pair of V registers
or from a V register and an S register. The result is returned to a V register
or the vector mask register. Successive operand pairs are transmitted to the
functional unit every clock period. The corresponding result emerges from
the unit / clock periods later, where / is the functional unit time. After this,
results emerge at the rate of one per clock period. The number of elements
that are processed in such a vector operation is equal to the number stored
in the 7-bit vector length register (VL). The elements used are the elements
0, 1, 2, ..., up to the number specified. Some vector operations are also
controlled by the 64-bit vector mask register (VM). Bit n of the mask
corresponds to element n of a vector register. The VM register is used in
conjunction with instructions to pick out elements of a vector, or to merge
two vectors into one (see §2.2.4). The vector integer addition unit performs
64-bit two’s-complement integer element-by-element addition and sub-
traction. The vector shift unit shifts the 64-bit contents of vector elements or
the 128-bit contents of pairs of adjacent vector elements. Shift counts are
stored in the instruction or an A register. The vector logical unit performs
bit-by-bit logical operations on the elements of a vector and also creates
64-bit masks in the v m register.
The three floating-point functional units perform both scalar and vector
floating-point operations. The arguments and results may therefore be either
S or V registers. The 64-bit signed-magnitude floating-point number has a
48-bit mantissa giving a precision of about 14 decimal digits, and a 16-bit
biased binary exponent giving a normalised decimal number range of
approximately 10"2500 to 10 +25O°. Separate units are provided for floating-
point addition, multiplication and division. Division is performed in the
(2.3c)
therefore
(2.3d)
where n is the iteration number. In the above code, instruction (2.2a) forms
an approximation to the reciprocal of S2 using the reciprocal approximation
unit. This approximation is stored in S3 and corresponds to x (n) in equations
(2.3c,d). The reciprocal iteration instruction (2.2b) computes with one
instruction the contents of the bracket in equation (2.3d) and is executed by
the floating-point multiplication unit. The multiplication instruction (2.2c)
multiplies the initial approximation to (S2)⁻¹ by the numerator S1, and
the multiplication instruction (2.2d) applies the correction specified by
equation (2.3d). After one iteration the result is accurate to 47 bits, that is
to say essentially the full precision of the 48-bit mantissa. The reason for
multiplying by the numerator before applying the correction is that the
instructions (2.2b) and (2.2c), although they use the same unit, may be started
on successive clock periods; whereas instruction (2.2d), if it came first, would
have to wait for the completion of instruction (2.2b) because it requires the value
of S4. With the ordering in equation (2.2) the division is complete in 29 clock
periods corresponding to 2.8 Mflop/s. If the instructions (2.2) were replaced
by vector instructions and placed in the order (2a, 2c, 2b, 2d), instructions
(2.2a) and (2.2c) would chain together (see three paragraphs ahead). The
element-by-element division of two vectors can be accomplished at the
asymptotic average rate of 27 Mflop/s.
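The arithmetic behind the sequence (2.2) is one Newton improvement of a reciprocal approximation. The sketch below reproduces it in ordinary floating point; the initial guess stands in for the output of the RA unit (which on the real machine is far more accurate, so that a single correction gives essentially 47-bit precision), and only the register names S3 and S4 are taken from the text, the remaining names being illustrative.

    # One Newton correction of a reciprocal approximation, mirroring sequence (2.2).
    S1, S2 = 7.0, 3.0          # compute S1/S2
    S3 = 0.3                   # rough approximation to 1/S2 (stands in for the RA unit)
    S4 = 2.0 - S2 * S3         # the bracket of equation (2.3d)            (2.2b)
    result = S1 * S3           # numerator times the initial approximation (2.2c)
    result = result * S4       # apply the correction                      (2.2d)
    print(result, S1 / S2)     # about 2.31 against the true quotient 2.333...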
Instructions on the CRAY X-MP are either one-parcel (16-bit) instructions
or two-parcel (32-bit) instructions. Prior to execution, instructions reside in
four instruction buffers, each of which can contain 128 one-parcel instructions
or equivalent combinations of different length instructions. The buffers are
filled cyclically from main memory. Whenever a required instruction is not
present in the buffers, the next buffer in the cycle is completely filled by taking
one 64-bit word (4 parcels) from each memory bank in parallel. There is a
22-bit program counter (P) containing the address of the next instruction to
be executed, a 16-bit next instruction parcel (NIP) to hold the next instruction,
a 16-bit current instruction parcel (CIP) to hold the instruction waiting to
issue, and a 16-bit lower instruction parcel to hold the second parcel of a
two-parcel current instruction.
An instruction can be issued (i.e. sent to the functional unit for execution)
if the unit is not busy with a previous operation and if the required input
and result registers are not reserved by any other instructions currently in
execution. Different instructions place different reservations on the registers
that they use, and the manual (Cray 1976) should be consulted for details.
Broadly speaking a vector instruction reserves the output vector register for
the duration of the operation and the input vector register until the last
element has entered the top of the pipeline. However if a vector instruction
uses a scalar register this is not reserved because a copy of the scalar is kept
in the functional unit. The scalar register may therefore be altered in the
clock period after the vector instruction is issued. Similarly the value of the
vector length register is kept by the functional unit and the VL register can
be changed immediately after instruction issue. Hence instructions with
different vector lengths can be executing concurrently. In the case of scalar
instructions only the result register is reserved by the instruction, in order
to prevent it being read by other instructions before its value has been
updated.
A special feature of the CRAY X-MP architecture is the ability to chain
together a series of vector operations so that they operate together as one
continuous pipeline (Johnson 1978). The difference between unchained and
chained operations is illustrated in figure 2.6. The upper diagram shows the
timing for three vector instructions if chaining does not take place.
The time for unchained operation of a sequence of m vector instructions is
t = sum_{i=1}^{m} (s_i + l_i + n - 1) tau,                          (2.4a)
where s_i and l_i are the set-up time and pipelength of the ith operation (in clock periods) and tau is the clock period.
FIGURE 2.6 Timing diagram (a) for three unchained vector operations
a, b and c and (b) for the same operations if they can be chained together.
In this example s + l = 10 and n = 50. The time sequence for the processing
of the kth element is given by a horizontal line drawn k units down from
the time axis. An element passes through a pipeline for an operation
when this line crosses the trapezoid for that operation.
or equivalently
t = [ sum_{i=1}^{m} (s_i + l_i) + m(n - 1) ] tau,                   (2.4b)
with an average half-performance length of
n1/2 = (1/m) sum_{i=1}^{m} (s_i + l_i) - 1 = s + l - 1.             (2.4d)
If the m operations are chained together the corresponding time is
t = [ sum_{i=1}^{m} (s_i + l_i) + n - 1 ] tau,                      (2.5a)
and the half-performance length becomes
n1/2 = sum_{i=1}^{m} (s_i + l_i) - 1 = m(s + l) - 1.                (2.5c)
The second equalities in equations (2.4d) and (2.5c) are the results if all the
pipes have the same start-up time and pipelength, as will frequently be
approximately true. We see from the above that the average behaviour of a
sequence of unchained vector operations is the same as the behaviour of a
single operation. However, if m vector operations are chained together, both
the asymptotic performance and the half-performance length are increased
m-fold. This effect was noted previously in Chapter 1 as arising from
the replication of processors. In the case of chaining it arises because
the m pipelines are working concurrently which similarly increases both
performance, r∞, and the amount of parallelism, n1/2. In the case of unchained
operation the pipelines are working sequentially and there is consequently
no increase in the average rate of computation or the parallelism.
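A short worked example makes the difference concrete. Assuming the unchained time m(s + l + n - 1)tau and the chained time [m(s + l) + n - 1]tau used above, and taking the values of figure 2.6 (s + l = 10, n = 50) with m = 3, the fragment below evaluates both times in clock periods; it is purely illustrative.

      PROGRAM CHAIN
C     Worked example for figure 2.6: m = 3 vector operations, each
C     with set-up plus pipelength s + l = 10 and vector length n = 50.
C     Times are in clock periods.
      INTEGER M, SL, N, TUNCH, TCHAIN
      M = 3
      SL = 10
      N = 50
C     unchained: each operation waits for the previous one to finish
      TUNCH = M*(SL + N - 1)
C     chained: the three operations form one continuous pipeline
      TCHAIN = M*SL + N - 1
      WRITE (6,*) 'UNCHAINED ', TUNCH, '  CHAINED ', TCHAIN
      END

With these numbers the unchained sequence takes 177 clock periods and the chained sequence 79, illustrating the roughly m-fold gain for vectors much longer than the start-up.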
The architecture of the CRAY X-MP can be conveniently summarised
using the ASN notation of §1.2.4. A short description, ignoring I/O
and the details of the register connections, but still presenting the essential
computational features of the machine, would read:
(2.6)
2.2.3 Technology
The main 8 Mword memory of the CRAY X-MP/48 is composed of either
bipolar or static MOS 64 Kbit VLSI chips with an access time of 38 ns or four
clock periods. In contrast to the CRAY-1, which used 4/5 NAND gate chips,
the logic circuits of the CRAY X-MP and the A and S registers are made
from 16-gate array integrated circuits with a 300-400 ps propagation delay.
As in the CRAY-1, the backwiring between modules is made by twisted pairs
of wires.
2.2.4 Instruction set
|| parcel 1 || parcel 2 ||
Read/Write:
Bjk,Ai   ,A0      Read (Ai) words to B register jk from (A0)
,A0      Tjk,Ai   Store (Ai) words from T register jk to (A0)
Ai       exp,Ah   Read from ((Ah) + jkm) to Ai  (A0 = 0, jkm = exp)
exp,Ah   Si       Store (Si) to ((Ah) + jkm)  (A0 = 0, jkm = exp)
Vi       ,A0,Ak   Read (VL) words to Vi from (A0) incremented by (Ak)
,A0,Ak   Vj       Store (VL) words from Vj to (A0) incremented by (Ak)
Inter-register transfer:
Ai       Sj       Transmit (Sj) to Ai
Bjk      Ai       Transmit (Ai) to Bjk
Si       Tjk      Transmit (Tjk) to Si
SMjk     1        Set jkth semaphore register
Ai       SBj      Read jth shared B register
Control:
J        Bjk      Branch to (Bjk)
Set Mask— the 64 bits of the vector mask (VM) register correspond
one-for-one to the 64 elements of a vector register. If the element satisfies a
condition the corresponding bit of VM is set to one, otherwise it is zeroed.
The conditions are zero, non-zero, positive or zero, negative. Thus:
VM V5, Z set VM bit to 1 where V5 elements are zero,
VM V7, P set VM bit to 1 where V7 elements are positive or zero.
Vector Merge— the contents of two vector registers Vj and Vk are merged
into a single result vector Vi according to the mask in the VM register. If
the ith bit of VM is 1 the ith element of Vj becomes the ith element of the
result register, otherwise the ith element of Vk becomes the ith element of
the result register. Vj may alternatively be a scalar register. The value in the
vector length register determines the number of elements that are merged.
Thus:
Vi Vj!Vk & VM    merge Vj and Vk into Vi according to the
                 pattern in VM,
V7 S2!V6 & VM    merge S2 and V6 into V7 according to the
                 pattern in VM.
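In scalar terms the mask-and-merge pair behaves like the FORTRAN sketch below; the arrays and the LOGICAL mask simply model the registers and are not CRAY syntax. Following the second example above, the mask is set where the elements of V5 are zero, and the merge then takes the scalar S2 where the mask bit is set and the element of V6 elsewhere.

      PROGRAM VMERGE
C     Scalar sketch of the vector mask and merge instructions.
      INTEGER I, N
      PARAMETER (N = 8)
      REAL V5(N), V6(N), V7(N), S2
      LOGICAL VM(N)
      DATA V5 /0.0, 1.0, 0.0, 2.0, 3.0, 0.0, 4.0, 0.0/
      DATA V6 /10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0/
      S2 = -1.0
C     VM V5,Z : set a mask bit where the elements of V5 are zero
      DO 10 I = 1, N
   10 VM(I) = V5(I) .EQ. 0.0
C     V7 S2!V6 & VM : merge S2 and V6 into V7 under the mask
      DO 20 I = 1, N
         IF (VM(I)) THEN
            V7(I) = S2
         ELSE
            V7(I) = V6(I)
         END IF
   20 CONTINUE
      WRITE (6,*) (V7(I), I = 1, N)
      END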
The purpose of the mask and merge instructions is to permit conditional
evaluation with vector instructions. Consider, for example, the evaluation of
(2.7)
Another class of operations that requires special support on a vector computer
is the scatter and gather operations, which may be defined by the FORTRAN loops:
(1) Scatter
      DO 10 I = 1,N
   10 X(INDEX(I)) = Y(I)                                           (2.8a)
(2) Gather
      DO 20 I = 1,N
   20 Y(I) = X(INDEX(I))                                           (2.8b)
In either case the integer array INDEX contains a set of indices which may
specify addresses that are scattered arbitrarily over the main memory of the
computer. A scatter operation distributes an ordered set of elements Y(I)
throughout memory, according to the pattern of addresses in the array
INDEX. Conversely the gather operation collects the scattered elements of
X and sorts them into the ordered array Y. Such operations occur in sorting
problems; in reordering as, for example, the unscrambling of the bit-reversed
order of the fast Fourier transform (see §5.5.2); and in the charge assignment
(scatter operation) and field interpolation (gather operation) steps of a
simulation code using a particle-mesh model (see Hockney and Eastwood
1981).
CRAY X-MP models prior to the four-CPU announcement in 1984, and
the earlier CRAY-1 and CRAY-1S computers, had no special hardware or
instructions for implementation of scatter and gather operations. Such
loops as (2.8) had to be executed by scalar instructions, and had a
rather disappointing performance of about 2.5 Mop/s (Hockney and
Jesshope 1981). The CRAY X-MP/4, however, provides vector scatter/gather
instructions, and also related compress index instructions that permit these
operations to execute at much higher vector speeds:
,A0,Vi   Vj     vector scatter Vj using indices in Vi
Vi       ,A0,Vj vector gather into Vi using indices in Vj
Vi,VM    Vj,Z   compress index of zero Vj, i.e. obtain indices of
                zero elements of Vj as a compressed vector
                in Vi, and set VM = 1 in corresponding bit
                positions.
The gather instruction is executed on either of the two read ports, and the
scatter instruction in the write port. The compress index instruction is
executed in the vector logical unit.
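In scalar terms the compress index operation corresponds to a loop of the following kind; the routine and argument names (CMPRSX, IDX, NIDX) are introduced here purely for illustration and are not part of the CRAY instruction set or libraries.

      SUBROUTINE CMPRSX(VJ, N, IDX, NIDX)
C     Scalar sketch of the compress index instruction: collect into
C     IDX the indices of the zero elements of VJ as a compressed
C     (contiguous) list; NIDX returns how many were found.
      INTEGER N, NIDX, IDX(N), I
      REAL VJ(N)
      NIDX = 0
      DO 10 I = 1, N
         IF (VJ(I) .EQ. 0.0) THEN
            NIDX = NIDX + 1
            IDX(NIDX) = I
         END IF
   10 CONTINUE
      RETURN
      END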
Using these instructions the scatter loop (2.8a) can be implemented by
V0     ,A0,1    vector load Y(I) into V0
V1     ,A0,1    vector load INDEX(I) into V1                       (2.8c)
,A0,V1 V0       vector scatter to X(INDEX(I))
and the gather loop by
V0     ,A0,1    vector load INDEX(I) into V0
V1     ,A0,V0   gather X(INDEX(I)) into V1                         (2.8d)
,A0,1  V1       vector store Y to common memory
(2.8e)
2.2.5 Software
The principal new feature of the CRAY X-MP software, compared to that
provided for the CRAY-1, is support for multi-tasking: the division of a single
job between the machine's multiple CPUs, synchronised by the TASKS, LOCKS
or EVENTS facilities whose overheads are measured in §2.2.6.
2.2.6 Performance
The performance of the original CRAY-1 computer has been fully discussed
in the first edition of Parallel Computers (Hockney and Jesshope 1981). Here
we present measurements made on the two-CPU CRAY X-MP/22 of the three
parameters r∞, n1/2 and s1/2 (see §§1.3.3 and 1.3.6, and Hockney (1985a)).
First we give measurements of r∞ and n1/2 on a single CPU, and afterwards
consider the overhead, s1/2, of synchronising the two CPUs by different
methods. Figure 2.7 shows the result of executing the equivalent of the code
(1.5) for dyadic and two types of triadic operations. For each vector length
N in the program (1.5), the measurement was repeated 100 times and the
minimum value is plotted in figure 2.7. The result is obtained in the standard
way by fitting the best straight line and recording its inverse slope as r∞ and
its negative intercept on the n-axis as n1/2. The results are shown in table 2.1.
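The fitting step is an ordinary least-squares calculation. A sketch is given below, assuming the measured minimum times T(I) (in microseconds) for vector lengths NV(I) have already been collected; the routine and variable names are illustrative only. The model is T = A*N + B, so that r∞ = 1/A (Mflop/s for one flop per element) and n1/2 = B/A, the negative intercept on the N axis.

      SUBROUTINE FITRN(NV, T, M, RINF, XNHLF)
C     Least-squares fit of times T(I) (microseconds) against vector
C     lengths NV(I), I = 1..M, to the line T = A*N + B, returning
C     RINF = 1/A and XNHLF = B/A.
      INTEGER M, NV(M), I
      REAL T(M), RINF, XNHLF
      REAL SN, ST, SNN, SNT, A, B, XN
      SN = 0.0
      ST = 0.0
      SNN = 0.0
      SNT = 0.0
      DO 10 I = 1, M
         XN = REAL(NV(I))
         SN = SN + XN
         ST = ST + T(I)
         SNN = SNN + XN*XN
         SNT = SNT + XN*T(I)
   10 CONTINUE
      A = (REAL(M)*SNT - SN*ST)/(REAL(M)*SNN - SN*SN)
      B = (ST - A*SN)/REAL(M)
      RINF = 1.0/A
      XNHLF = B/A
      RETURN
      END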
The first three cases in table 2.1 are measurements of vector instructions.
The dyadic case uses only a single vector pipeline with all vectors stored in
main memory, and is to be compared with values of r∞ = 22 Mflop/s and
n1/2 = 18 previously obtained on the CRAY-1 (Hockney and Jesshope 1981).
We find a three-fold increase in r∞ due, primarily, to the provision of three
memory ports on the CRAY X-MP compared with one on the CRAY-1. The
TABLE 2.1

Operation:                               r∞          n1/2      t0
statement 10 of code (1.5)               (Mflop/s)   (flop)    (μs)

Dyadic                                   70          53        0.75
  A(I) = B(I)*C(I)
(CRAY-1 values)                          (22)        (18)      (0.82)
All vector triad                         107         45        0.42
  A(I) = D(I)*B(I) + C(I)
CYBER 205 triad                          148         60        0.40
  A(I) = s*B(I) + C(I)
Scalar code                              5           4         0.80
  A(I) = B(I)*C(I)
      NMAX = 400
      IDT(1) = 2
      T1 = 9.5E-9*RTC(DUM)
      T2 = 9.5E-9*RTC(DUM)
      T0 = T2-T1
      DO 20 N = 2, NMAX, 2
      T1 = 9.5E-9*RTC(DUM)
      NHALF = N/2
      NH1 = NHALF + 1
      T2 = 9.5E-9*RTC(DUM)
      T = T2-T1-T0
      WRITE (6,100) N, T
   20 CONTINUE

      DO 10 I = N1, N2
   10 A(I) = B(I)*C(I)
      RETURN
      END
FIGURE 2.8 Program for measuring r∞ and s1/2 when a job is split between the
two CPUs of the CRAY X-MP/22 using the TASKS method of synchronisation.
statement ensures that both CPUs have finished their share of the work before
the timer is called again to record the end of the measurement (T2). The
parameters to DOALL are used to ensure that the two CPUs do different
                   r∞          s1/2       t0        t0^-1
Method             (Mflop/s)   (flop)     (μs)      (k/s)
Note: results are deliberately rounded to two significant figures only. Greater precision
would suggest spurious accuracy.
which leads to the values of r∞ and s1/2 given in table 2.2. The next least
expensive method of synchronisation proves to be the LOCKS method. In
this case we observe (table 2.2) s1/2 = 4000, about 2/3 of the value found for
the TASKS method. On the other hand, we find the EVENTS method half
as expensive as the LOCKS method with s1/2 = 2000. In order to determine
the least overhead possible, a simplified form of the LOCKS method has
been programmed in CAL by John Larson of Cray Research Inc. The
overhead is thereby reduced by a factor of ten to s1/2 = 220. An examination
of the CAL code shows that there is no wasted time, and it is unlikely that
synchronisation can be achieved on the CRAY X-MP with less overhead.
However, it must be said that in the CAL code, one CPU waits for the other
to finish by continually testing one of the synchronisation registers. This
prevents the waiting CPU from doing any other work during this time, and
hence this code would hardly be acceptable as a general method of
synchronisation.
FIGURE 2.9 Overall view of the CRAY-2 computer, with the coolant
reservoir in the background.
(Cray 1985). The mainframe in the foreground, which weighs nearly three
tons, contains a foreground processor, four background processors, a large
common memory of 256 Mword, and all power supplies and backwiring.
This is all contained within a cylindrical cabinet 4 feet high and 4½ feet in
diameter. The compression in size is remarkable—in effect a four-CPU CRAY
X-MP plus its I/O system and solid state device have been shrunk to a third
or a quarter of their present size, and placed in a single container. The power
generation of 195 kW is little changed from that of the CRAY X-MP, but
an entirely new cooling technology— liquid immersion cooling— is used to
remove it. In this method all the circuit boards and power supplies are totally
immersed in a bath of clear inert fluorocarbon liquid which is slowly circulated
(about one inch per second) and passed through a chilled water heat
exchanger to extract the heat. Cooling is particularly effective because the
coolant, which has a high dielectric constant and good insulating properties,
is in direct contact with the printed circuit boards and integrated circuit
packages. Figure 2.10 shows a closer view of the coolant circulating past the
boards. The cooling system is closed and valveless; the collection of columns
at the rear of figure 2.9 is a reservoir for the 200 gallons of coolant. If a
circuit board needs to be replaced all the coolant must be pumped to the
reservoir before the circuit boards can be reached. This can be accomplished
in a few minutes.
The clock period on the CRAY-2 is 4.1 ns, and this necessitates short
connecting wires between boards. This requirement has led to the develop-
ment of three-dimensional pluggable modules, each comprising eight printed
circuit boards rigidly held together as a unit with cross connections between
the boards. Each module (see figure 2.11) forms an 8 x 8 x 12 array of
integrated circuit packages giving approximately 250 chip positions. The
module measures approximately 1 x 4 x 8 inches, weighs two pounds, and
consumes 300 to 500 W of power. The 320 modules are mounted in 14
columns forming a 300° arc. There are 152 CPU modules and 128 memory
modules, with approximately 240000 chips, nearly 75000 of which are
memory. The logic chips use 16-gate arrays, as in the CRAY X-MP. As in
the other CRAY computers, backwiring uses twisted pair wires between
2 inches and 25 inches in length. In all there are about 36 000 such connections
with a total length of about six miles.
The overall architecture of the CRAY-2 is shown in figure 2.12, and can
be described as four background processors accessing a large shared common
memory, and under the control of a foreground processor. The common
memory of 256 M 64-bit words is directly addressable by all the processors
(32-bit addressing is used). It is arranged in four quadrants of 32 banks,
giving a total of 128 banks with 2 Mword per bank. Dynamic MOS memory
technology is used (256K bits per chip), which means that the access time
(about 250 ns) is very long compared to the main ECL memory of the
CRAY X-MP (38 ns). In this respect, the common memory is more correctly
compared to the solid state device memory of an X-MP. Each memory bank
has a functionally independent data path to each of four bidirectional memory
ports, each of which connects to one background processor and one
foreground communications channel. The total bandwidth to memory is
therefore 1 Gword/s.
Access to common memory is phased, which means that each processor
can have access to a particular quadrant only every four clock periods when
its phase for accessing that quadrant comes around. There are four phases,
one assigned to each of the background processors. The 32 memory banks
in a quadrant share a data path to each common memory port; however,
because of the phased access scheme, only one bank accesses the path in any
one clock period. The elements of a
contiguously stored vector are spread first across the quadrants, then across
the banks, and finally across the words of a bank.
If references to successive elements of the vector occur every clock period,
then a given quadrant is referenced every four clock periods, and a given
bank every 128 clock periods (512 ns). The common memory access time of
about 250 ns therefore avoids memory bank conflicts in this ideal case of
reference to a contiguously stored vector. Bank conflicts will occur if the
stride in memory address between successive elements is four or a larger
power of two. In the worst case, when all the elements are stored in the same
bank, the memory access rate is only 1/64 of the maximum for contiguous
vectors. Consequently the performance of the CRAY-2 depends critically
on the pattern of memory accesses to the common memory, and may vary
widely for different jobs, and for differently programmed implementations of
the same job.
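The sensitivity to access pattern can be seen from ordinary FORTRAN. In the sketch below (array names and dimensions are chosen only for illustration) the two loops do the same arithmetic, but the first runs down a column of the matrix with stride 1 while the second runs along a row with stride LDA; on a memory interleaved over quadrants and banks as described above, a stride that is a multiple of four, and especially a larger power of two, returns repeatedly to the same quadrant or bank.

      PROGRAM STRIDE
C     The same sum computed with stride-1 and stride-LDA access.
      INTEGER LDA, N, I, J
      PARAMETER (LDA = 64, N = 64)
      REAL A(LDA,N), SCOL, SROW
      DO 6 J = 1, N
         DO 5 I = 1, LDA
            A(I,J) = 1.0
    5    CONTINUE
    6 CONTINUE
C     stride-1 access: successive elements of column 1
      SCOL = 0.0
      DO 10 I = 1, LDA
   10 SCOL = SCOL + A(I,1)
C     stride-LDA access: successive elements of row 1
      SROW = 0.0
      DO 30 J = 1, N
   30 SROW = SROW + A(1,J)
      WRITE (6,*) SCOL, SROW
      END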
The foreground processor supervises the background processors, common
memory and peripheral controllers via four 4-Gbit/s communication
channels. Each channel is a 16-bit data ring connecting one background
processor, up to nine disc controllers, a front-end interface, a common
memory port, and the foreground processor. The ring transfers a 16-bit data
packet between stations every clock period. The foreground processor itself
is a 32-bit integer computer with a 4 Kword (32-bit) local data memory and
32 Kbytes of instruction memory.
The architecture of a background processor is shown in figure 2.13, and
can be described as a CRAY-1 architecture with the B and T intermediate
registers replaced by 16 Kword of local memory. As in the CRAY-1, there
is only a single bidirectional data path to the common memory, and there
are eight 64-word vector registers and eight 64-bit scalar registers. The eight
address registers have become 32-bit registers but there is no longer a
data path directly from the common memory to the address registers.
The functional units are also somewhat differently arranged, and reduced
in number to nine. There is now no separate floating-point reciprocal
approximation unit. This function now shares the floating-point multiply
unit, which also provides the new function of hardware square root. The
vector shift, population count, and integer arithmetic share a single vector
integer unit; and the scalar integer add and population count are both
performed in the scalar integer unit. Except for being 32-bit units, the address
functional units are the same as on the CRAY-1. The 16 Kword local memory
has an access time of four clock periods, and is intended for the temporary
storage of scalars and vector segments during computation. The speed of
this memory relative to the functional units is the same as the main memory
of the CRAY X-MP, which of course is very much larger (up to 8 Mword).
Each background processor has its own 64-bit real-time clock which is
advanced every clock period, and synchronised at start-up time with the
clocks in the other background processors. There is a 32-bit program address
counter, and common memory field protection is provided by 32-bit
base and limit registers. Eight 1-bit semaphore flags and a 32-bit status
register are provided to control access to shared memory areas, and to
synchronise the background processors. Eight instruction buffers hold 64
16-bit instruction parcels, and instruction issue takes place every other clock
period. There are 128 basic instruction codes which include scatter/gather
facilities, like the CRAY X-MP/4; however, the ability to chain together a
succession of vector instructions, which is an important feature of the CRAY-1
and X-MP, has been lost and is not available on the CRAY-2.
TABLE 2.3 Results of the (r∞, n1/2) benchmark for a variety of simple kernels on
one CPU of the CRAY-2, using FORTRAN code and the CIVIC compiler (1985).
The improved values in parentheses are for CFT77 (1.3), as of November 1987.

Operation:
statement 10 of program (1.5)        Stride        r∞ (Mflop/s)        n1/2
memory references and maximising the use of the CPU local memory for
intermediate results.
The CRAY-3 is an implementation of the CRAY-2 in gallium arsenide
(GaAs) technology which should allow a clock period of 1 ns. The chips are
manufactured by Gigabit Logic Inc., California (Alexander 1985) and Harris
Microwave (McCrone 1985). The CRAY-3 is also likely to have more
processors, perhaps 8 to 16 and a common memory of one gigaword. It is
scheduled for 1988/9.
FIGURE 2.14 View of the CDC CYBER 205, showing the vector stream and
string units on the left and the floating-point pipes on the right. (Photograph
courtesy of Control Data Corporation.)
from the top left of figure 2.15 and is of a two-pipe machine. The main
memory comprises four or eight memory sections. In the series 400 machine,
these are housed in two or four wedge-shaped cabinets, each containing one
million 64-bit words in two memory sections. The scalar section which
contains the instruction processing unit, forms the central core of the machine.
To one end is attached the memory through the memory interface unit, and
to the other end is attached the vector processor. The latter comprises one,
two or four vector floating-point arithmetic pipelines, a vector stream and
string section, and an I/O and vector set-up and recovery section. Overall
a 4 Mword computer occupies a floor area of about 23 ft x 19 ft. Cooling
for the basic central computer with 1 Mword of memory consists of two
30-ton water-cooled condensing units which are housed separately, and power
is supplied by one 250 kVA motor-generator set. The heat dissipation is
about 118 kW. An additional 80 kVA motor-generator set and 90 tons of
cooling are needed for the 4 Mword system, and a standby 250 kVA generator
is provided. The CYBER 205 is designed to be attached to a front-end system,
typically a CYBER 180, CDC 6000, IBM or VAX computer.
The series 600 computer differs from the above description only in the
memory cabinets, which fit to the end and side of sections J and K forming
a rectangular plan. There are still four or eight memory sections, but these
may now contain 0.5, 1, 1.5, or 2 Mwords each, giving configurations with
1, 2, 4, 8, 12 or 16 Mwords of memory.
2.3.2 Architecture
The principal units and data paths in the CYBER 205 are shown in figure 2.16.
Memory comprises eight sections (A to H), each divided into eight memory
stacks (called memory modules on the series 600). Each memory stack (or
module) is divided into eight memory banks and has an independent 32-bit
data path to the memory interface unit. Each bank contains 16K 39-bit
half-words on the series 400 (32 data bits plus 7 SECDED bits). On the series
600, however, a bank may contain 16K, 32K, 48K or 64K half-words,
depending on the number of ranks of chips mounted on the memory board
(see §2.3.3).
The memory is organised into pairs of sections (A/H, B/G, C/F, D/E)
each of which therefore has 16 stacks and a 512-bit data path to the interface
unit. This data width is known as a superword or sword. It is equivalent to
eight 64-bit words or 16 32-bit half-words, and is the unit of access to memory
for vectors. Successive addresses in a sword are stored in different memory
stacks, so that a sword may be accessed in parallel by taking a half-word
from each stack of the 16 stacks in a double section of memory. The access
and cycle time of memory is 80 ns. However, provided successive references
FIGURE 2.16 Architectural block diagram showing the principal units and data paths of the CDC CYBER 205 series 400.
The series 600 increases the memory up to 16 Mwords.
are to different banks, a fresh sword can be referenced every clock period of
20 ns from each double section. This is a bandwidth of 400 Mword/s per
double memory section. However, not all this bandwidth is used by the
memory interface unit (see below).
Although memory may be addressed by the bit, byte (8 bits), half-word
(32 bits) or word (64 bits), access to memory is by the sword (512 bits) for
vectors, by the word or half-word otherwise. The memory interface unit
organises memory requests at each 20 ns interval into swords, words or
half-words, then delivers or assembles this data via 128-bit wide paths to the
scalar and vector sections. Communication with the rest of the computer is
in terms of three read paths and two write paths, and the memory interface
unit has a one-sword buffer associated with each path (R1, R2, R3 for read
and W1, W2 for write). The memory interface unit connects to the scalar,
vector and I/O sections via 10 128-bit data paths, each of which has a
maximum transfer rate of 128 bits every clock period, giving a maximum
total transfer rate of 1000 Mword/s.
On a two-pipe machine the memory interface unit operates as described
above, and has a maximum read throughput in vector mode of one sword
or 8 words per clock period. As each pipe requires two new input arguments
per clock period, the capacity of the data paths and interface unit just matches
the needs of the arithmetic pipelines. A four-pipe machine, however, requires
twice the above amount of data per clock period, and four-pipe machines
have the R1, R2, and W1 buffers increased to 1024 bits, and the corresponding
data paths to the vector unit increased to 256 bits. The interface unit then
makes simultaneous reference to a double-sword (1024 bits) of data, by
simultaneously accessing a half-word from 32 memory stacks spread over
four memory sections. Clearly this is still not using the full bandwidth of the
memory itself, which is capable of supplying a half-word simultaneously from
each of its 64 memory stacks (eight stacks in each of eight memory sections).
The scalar section reads from buffers R1 and R3, and writes to buffer W1.
All data paths to the vector section also pass through the scalar section where
SECDED checking and priority determination for memory requests take place.
The inclusion of SECDED checking in the architecture substantially extends
the mean time between failures. The scalar section contains the instruction
issue pipeline, which has a maximum issue rate of one instruction every 20 ns
clock period ( t ). The three address instructions are drawn from a stack which
may hold up to 128 32-bit instructions or 64 64-bit instructions, or mixtures
with equivalent total length. Both vector and scalar instructions are decoded
in the instruction issue pipeline, which dispatches the decoded vector
instructions to the vector unit for execution. The issue of decoded scalar
instructions to the scalar arithmetic functional units is controlled on a
Unit                                    Time (ns)    (clock periods)
Load/Store                              300          15
Addition/Subtraction                    100          5
Multiplication                          100          5
Logical                                 60           3
Single Cycle                            20           1
Division, Square root, Conversion
  for 64 bits                           1080         54
  for 32 bits                           600          30
The above unit times are the total times required to compute either a 32-bit
or a 64-bit result. All units except the last are pipelined and may take a new
set of arguments every clock period; note, however, that the register file can
supply at most one new pair of arguments per clock period, and this is the
factor most likely to limit the processing rate of the scalar section as a whole.
The division, square root and conversion unit is not pipelined in this way, and
new arguments are only accepted every 54 clock
periods. The result from any of the above units can be passed directly to the
input of any unit, in a process called shortstopping. This process, when
applicable, eliminates the time needed to write results to a register and retrieve
them for use in the next arithmetic operation. The unit times above assume
that shortstopping takes place and do not include the time to write results
to registers or memory. The register file may supply at most two operands
for the current instruction and store one result from the previous instruction
concurrently during every clock period. This is found to be sufficient to
support a scalar performance of 45 Mflop/s out of a peak potential of one
instruction every 20 ns or 50 Mflop/s.
Access to main memory is controlled by the load/store unit which acts as
a pipeline and may accept one read (load) from memory every clock period
or one write (store) to memory every two clock periods. A buffer is provided
in the unit for up to six read and three write requests. A randomly accessed
word can be read from memory and loaded into the register file in 300 ns
provided memory is not busy. If it is, up to a further 80 ns is added to the
time.
Operations on vectors of numbers or strings of characters are performed
in the vector section which comprises either one, two or four floating-point
pipelines and a string unit that are fed with streams of data by a stream unit.
Unlike the CRAY X-MP there are no vector registers, and all vector
operations are main-memory to main-memory operations, necessitating data
to travel a round trip of about 50 ft, compared to less than 6 ft on the
CRAY X-MP. This difference partly explains the much longer vector start-up
time on the CYBER 205. A vector may comprise up to 65 535 consecutively
addressed elements. If the required data is not consecutively stored the
required elements can be selected by a control vector of bits, one bit for each
word of the vector. The operation is then only performed for elements for
which the corresponding control bit is one. However all elements of the
consecutively stored vector must be read from memory even though only a
small fraction may be operated upon. Alternatively if the control vector is
sparse in ones, the specified elements of a long vector may be selected by a
compress operation and re-stored consecutively. Subsequent operations may
then be performed with better efficiency on the new compressed vector. In
addition efficient scatter/gather instructions are implemented by microcode
in the stream unit. These reference memory either randomly according to an
index list, or periodically (i.e. at equal intervals).
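The behaviour of a vector operation under a control vector can be written out as a scalar loop, as in the sketch below; the names are illustrative and this is not CYBER 205 syntax. The point made in the text is visible in the code: the arithmetic and the store happen only where the control bit is set, yet the hardware still reads every element of the consecutively stored operands.

      SUBROUTINE MASKMU(A, B, C, BITS, N)
C     Sketch of a masked vector multiply: the result is computed and
C     stored only for elements whose control bit is set; other
C     elements of C are left unchanged.
      INTEGER N, I
      REAL A(N), B(N), C(N)
      LOGICAL BITS(N)
      DO 10 I = 1, N
         IF (BITS(I)) C(I) = A(I)*B(I)
   10 CONTINUE
      RETURN
      END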
Data is received from main memory in three input streams: A and B for
the two streams of floating-point numbers, and (M, X, Y) for control vectors
and character strings. There are two output streams: C for floating-point
numbers and R for character strings. Each of these streams is 128 bits wide
and is distributed by the stream unit into 128-bit data streams for use by the
floating-point pipelines and 16-bit streams for use by the string unit. Each
of the identical floating-point pipelines (PI to P4 in figure 2.16) comprises
five separate pipelined functional units for addition, multiplication, shifting
operations and delay, connected via a data interchange (see figure 2.17).
Division and square root are performed in the multiplication unit. Each unit
is attached to the data interchange by three 128-bit data paths (two input
paths A and B, and the output path C). These paths can therefore support
a rate of 100 million 64-bit results per second (Mr/s) per unit. The units
themselves however are only capable of generating results at half this rate
(50 Mr/s for 64-bit operation and 100 Mr/s for 32-bit operation). For simple
vector instructions using a single unit the data interchange connects the two
input streams A and B and the output stream C to the appropriate functional
unit, leading to an asymptotic operation rate of 50 Mflop/s (64-bit operation)
and 100 Mflop/s (32-bit operation) per pipeline. If the two successive vector
instructions use different units, contain one operand that is a scalar and are
preceded by the select-link instruction, then ‘linkage’ takes place. The output
stream from the first unit used is fed by the data interchange to the input of
the second unit. In this way the two units operate concurrently and the two
instructions act as a single vector instruction with no intermediate reference
to main memory. Examples of such linked triadic operations† are:
vector + scalar*vector,    (vector + scalar)*vector,
† A triadic operation is one involving three input arguments, e.g. A + B x C; a dyadic
operation is one involving two input arguments, e.g. A + B.
which occur frequently in matrix problems (for example the inner product
of two vectors by the middle-product method, see §5.3.2). This facility plays
the same role as chaining does on the CRAY X-MP, but is more restrictive.
On the CYBER 205 two operations at most may be linked together and one
of the operands must be a scalar. For such linked triads the asymptotic
performance of a floating-point pipeline is doubled to 100 and 200 Mflop/s
for 64-bit and 32-bit arithmetic respectively. The maximum asymptotic
performance on the CYBER 205 is therefore 800 Mflop/s for linked triads
in 32-bit arithmetic on a 4-pipe machine.
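In FORTRAN a linked triad is simply the familiar loop below; nothing special is written by the programmer, and a vectorising compiler for the machine can translate it into a multiply and an add vector instruction joined by the select-link mechanism described above (whether linkage is actually generated depends on the compiler, so this should be read as a sketch of the favourable case).

      SUBROUTINE TRIAD(A, B, C, S, N)
C     The 'CYBER 205 triad' A(I) = S*B(I) + C(I): one scalar operand
C     and two vector operands, the form that can be linked.
      INTEGER N, I
      REAL A(N), B(N), C(N), S
      DO 10 I = 1, N
   10 A(I) = S*B(I) + C(I)
      RETURN
      END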
Figure 2.18 shows in more detail the overall organisation of the addition
and multiplication pipelines. The addition operation (figure 2.18(a)) is seen to
be divided into seven principal suboperations. A backward connection (or
shortstop) is provided around the ADD segment to allow the unnormalised
addition result of one element of a vector operation to be added to the next
element of the vector. This facility is used in the interval instruction which
forms the vector C_{i+1} = C_i + B; C_0 = A. Another shortstop takes the
normalised result C of the addition pipe back to become the B operand input.
The result arrives back at B eight clock periods after the operands contributing
to it entered the pipe.
FIGURE 2.18 Block diagram of the principal sections of (a) the floating-
point addition and (b) multiplication pipelined units on the CDC CYBER
205. (Diagram courtesy of Control Data Corporation.)
2.3.3 Technology
The main interest in the technology of the CYBER 205 centres on the use
of a novel LSI circuit, packaging and cooling technique for the logic of the
computer, based on LSI bipolar ECL gate-array logic. As an example we show
in figure 2.19 an overall view of a 15-layer arithmetic circuit board which
can hold an array of 10 x 15 LSI chips. The main feature is the freon-carrying
coolant pipe that passes horizontally across the board
10 times. The LSI chips are mounted on ceramic holders approximately
0.5 in x 0.7 in x 0.1 in and clamped directly onto the cooling pipes which
maintain a chip temperature of 55 ± 1°C. In figure 2.20 a technician is shown
replacing an LSI chip, and numerous empty clamps awaiting chips are shown
attached to the coolant pipes. In figure 2.21 the various connectors and
clamps that hold the ceramic-mounted chip to the pipe are shown, and
figure 2.22 shows how these are assembled. The copper pad on the ceramic
mount makes direct thermal contact with the cooling pipe. The LSI chip has
52 external connections that are brought out to the side and bottom of the
ceramic mount (called an LSI array, figure 2.22 left). Two connectors, one for
each side of the ceramic mount and each containing 26 pins, are plugged
into the circuit board (figure 2.21 centre and figure 2.22 right) and a plastic
retainer mounted over them. The l s i array is dropped into the retainer and
clamped in place by a metal spring clip. To give an idea of the compression
in size achieved by LSI, we note that the entire scalar section L (see figures 2.14
and 2.15) is housed on 16 LSI boards in a cabinet about seven feet long. The
LSI chips use bipolar transistors with emitter-coupled logic (ECL) circuitry.
FIGURE 2.21 The various holders, connectors and clamps that are
used to attach the 0.5 in x 0.7 in ceramic-mounted LSI chip to the coolant
pipes. In the centre two chips are shown mounted on a section of circuit
board. (Photograph courtesy of Control Data Corporation.)
FIGURE 2.22 Left: top and bottom views of the LSI chip mounted on its ceramic
holder. The mounted chip is called an LSI array. Right: exploded view of how the LSI
array is attached to the coolant pipe and circuit board. (Diagrams courtesy of Control
Data Corporation.)
The main memory of the CYBER 205 series 400 uses 4K bipolar memory
chips with a cycle and access time of 80 ns. Auxiliary logic uses emitter-
coupled circuitry with ECL 100K chips. Each one million 64-bit words of
memory is housed in two memory sections, each of which holds eight memory
stacks, as is shown in figure 2.23. Figure 2.24 shows a close-up of one stack
which stores 128K 32-bit half-words in eight independent memory banks. A
stack contains two input boards, one output board and 16 memory boards.
Cooling is provided by freon cooling plates that lie between the boards. A
memory board, which is shown in figure 2.25, provides 20 or 19 bits of a
word in parallel from groups of four 4K memory chips. A pair of memory
boards forms a memory bank that accesses in parallel 39 bits (32 data bits
and 7 SECDED bits), one bit from each of 39 memory chips. The memory
address specifies which of the 16K bits from a group of four 4K-bit chips is
accessed.
Although the organisation of the CYBER 205 series 600 memory is identical
to the series 400 from the user’s point of view, the technology and packaging
are quite different. Static MOS 16-Kbit chips are used, and the greater level of
integration enables the memory to be offered as either 1, 2, 4, 8, 12 or 16 Mwords.
FIGURE 2.23 The racks associated with the storage of one million 64-bit words
in the CYBER 205. There are two sections of memory on the extreme left and
right, and two smaller cabinets of the memory interface in the centre. In the
memory section on the right eight memory stacks holding a total of half a million
words can be seen. The CYBER 205 may contain 1, 2 or 4 such million-word
assemblies.
FIGURE 2.24 A memory stack from the CYBER 205. The stack has
two input boards, one output board and 16 memory boards. A pair of
memory boards comprise one memory bank of 16K 32-bit half-words.
There are eight banks in a stack which gives storage for 128K half-words.
(Photographs courtesy of Control Data Corporation.)
FIGURE 2.25 A memory board from the CYBER 205 that provides 20 bits in
parallel. The 4K-bit memory chips are mounted centrally in two 4 x 10 arrays. Four
chips are associated with each bit position of the word, giving 16K memory addresses.
Two such boards form a memory bank, and provide the 39 bits of a half-word
(32 data bits and 7 SECDED check bits). (Photograph courtesy of Control Data
Corporation.)
The memory access time of the series 600 is unchanged at 80 ns, as are all
other features of the CYBER 205.
The operation of the CYBER 205 scalar and vector sections is controlled
by microcode that is stored in memories housed on auxiliary logic boards
that may hold up to 90 ECL 100K circuit chips. One such board is shown
lying horizontally at the top of the LSI board in figure 2.19. These memories,
the stack of 128 32-bit instructions, and the register file of 256 64-bit words
are all assembled on auxiliary boards from the ECL 100K chips. These memory
elements have a read/write cycle time of 10 ns which includes the 1.0 ns
ECL 100K gate delay time.
virtual memory. Thus one may address up to 2.8 x 10^14 bits, 3.5 x 10^13 bytes,
8.8 x 10^12 32-bit half-words or 4.4 x 10^12 64-bit words of virtual memory.
The top half of virtual storage is reserved for the operating system and vector
temporaries, leaving 2.2 x 10^12 64-bit words of virtual address space for user
programs and data. On the other hand, the physical main memory has a
maximum of 4.2 x 10^6 64-bit words. The operating system transfers programs
and data into main memory as either short pages (512, 2K or 8K 64-bit
words long) or long pages of 64K 64-bit words. The translation between the
virtual memory address and the physical memory address is performed in
the scalar section using 16 associative registers. The registers hold associative
words which contain the virtual and corresponding physical addresses of the
16 most recently used pages. They may all be compared in one clock period
for a match between a virtual address in the instruction being processed and
the virtual page addresses in the associative word. If there is no match, the
comparison continues into the space table which is the extension of the list
of associative words into main memory. If there is a match the virtual address
is translated into the physical address, and the program execution continues.
If the page is not found, the program state is automatically retained in memory
and the monitor program is entered in order to transfer control to another
job.
Floating-point arithmetic may be performed on either 32-bit half-words
or 64-bit full-words. Numbers are expressed as C x 2^E where the coefficient
C and the exponent E are two's complement integers. In the 32-bit format
E has 8 bits and C has 24 bits, allowing a number range from ±10^-27 to
±10^+40. In 64-bit format E has 16 bits and C has 48 bits, giving a number
range from ±10^-8616 to ±10^+8644. The binary points of both the exponent
and the coefficient are on the extreme right of their bit fields, since they are
both integers, and the sign bit is on the extreme left of the field. A number
is normalised when the sign bit of the coefficient is different from the bit
immediately to its right. Double-precision results are stored as two numbers
in the same format as the single-precision result, and referred to as the upper
and lower results. They may be operated upon separately. Another feature
is the provision of significance arithmetic. In this mode the result of a
floating-point operation is shifted in such a way that the number of significant
digits in the result is equal to the number of significant digits of the least
significant operand.
Instructions in the CYBER 205 are three-address and may be either 32 or
64 bits long with 12 possible formats. There are 219 different instructions
which may be divided into the following categories (the number of instructions
in each category is given in parentheses):
Register (60) Vector macro (15)
In this short review we cannot attempt to describe all the instructions, but
we do attempt to give the flavour of the instruction set by giving examples
of the more interesting instructions.
Register instructions manipulate data in the 256 64-bit register file, either
as 32-bit half-words or as 64-bit full-words, depending on the instruction. R,
S and T stand for 8-bit register numbers. The bits in a word are numbered
from left to right starting at zero. Examples of register instructions are:
ADDX R, S, T Add address part (i.e. bits 16 to 63) of register
R to register S and store in register T.
EXPH R, T    Take the half-word exponent from register R and
             place in least significant bit positions of register T.
Index instructions load and manipulate 16-, 24- or 48-bit portions of registers:
IS R, I16    Increase the rightmost 48 bits of register R by the
             16-bit operand I16 in bits 16 to 31 of the 32-bit
             instruction.
Branch instructions can be used to compare or examine single bits, 48-bit
indices, 32- or 64-bit floating-point operands. The results of the comparison
determine whether the program continues with the next sequential instruction
or branches to a different instruction sequence:
(48 bits). The control vector is a bit vector containing one bit position for
each element of the vector operands. It is used to control (or mask) the
storage of the result of the vector operation. One may, for example, require
that storage only takes place for elements for which the corresponding control
bit is a one (or alternatively a zero). In the following A, B, C, X, Y, Z stand
for register numbers in the range 00 to FF in hexadecimal:
ADDNS [A, X], [B, Y], [C, Z]    Add normalised the sparse vector
                                specified by registers [A, X] to
                                the sparse vector specified by
                                registers [B, Y], storing the result
                                as a sparse vector specified by
                                registers [C, Z].
† A_n, B_n, C_n mean the nth element of the floating-point vector specified by the words
in registers A, B, C respectively. Z_n means the nth bit of the order or control vector
specified by the word in register Z.
The operations of periodic or random scatter and gather (see §2.2.4) are
performed by single instructions on the CYBER 205. They apply to items
or groups moved to or from either main memory or the register file. Examples
are:
VTOVX [A, X], B, C    Vector to indexed vector transmission:
                      B -> C indexed by A
VXTOV [A, X], B, C    Indexed vector to vector transmission:
                      B indexed by A -> C
Index lists that are used above may be generated in any convenient way;
however a special search instruction is provided that can be useful for this
purpose, for example:
SRCHEQ A, B, C, Z    Search for equality and form index list. A_n
                     is compared with all elements of B until
                     equality is found. The number of unsuccessful
                     comparisons before the 'hit' is entered into
                     C_n. This is repeated for all elements of A. The
                     counts C_n are in fact the indices of the
                     elements satisfying the condition of equality.
                     The comparison may be limited to certain
                     elements by the control vector Z.
Another form of the search instruction provides a single index, for example:
SELLT [A, X], [B, Y], C, Z    Select on less-than condition: the
                              corresponding elements of A and B
                              are compared in turn starting with
                              the first element. The index of the
                              first pair to satisfy the condition
                              (here A_n < B_n) is placed in the
                              register C. Element pairs are skipped
                              or included according to the bits in
                              the control vector Z.
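A scalar FORTRAN rendering of SELLT may make its meaning clearer; the routine name and the convention of returning zero when no pair satisfies the condition are assumptions made only for this sketch.

      SUBROUTINE SELLT1(A, B, Z, N, C)
C     Scalar sketch of the select-on-less-than instruction: return in
C     C the index of the first element pair with A(I) .LT. B(I),
C     skipping pairs whose control bit Z(I) is not set.  C = 0 if no
C     pair satisfies the condition (a convention assumed here).
      INTEGER N, C, I
      REAL A(N), B(N)
      LOGICAL Z(N)
      C = 0
      DO 10 I = 1, N
         IF (Z(I) .AND. A(I) .LT. B(I)) THEN
            C = I
            RETURN
         END IF
   10 CONTINUE
      RETURN
      END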
2.3.5 Software
The software used on the CYBER 205 is a development of that written for
the CDC STAR 100, and has been in operational use on the STAR 100,
CYBER 203 or CYBER 205 since 1974. The principal items of software are:
(1) CYBER 205-OS— a batch and interactive operating system;
(2) CYBER 200 FORTRAN—a vectorising compiler for the main high-
level language;
(3) CYBER 200 META— the assembler language that gives access to all
the hardware features of the machine;
(4) CYBER utilities—including a loader and file editing and maintenance
facilities.
The CYBER 200 operating system is designed to handle batch and
interactive access, either locally or from remote sites, via a front-end computer
such as a CYBER 180 series, IBM or VAX. The mass storage for user
files is on CDC 819 disc units (capacity 4800 Mbit, average data rate
given an array A, A(10; 100) means the vector which starts at location A(10)
and which is 100 elements long. Descriptors can be used implicitly within
expressions, or by dynamic declaration. Examples are given below.
In both cases the array of elements B(2), ..., B(1000) is multiplied by two
and stored in locations A(1), ..., A(999). In case (1) the descriptors are used
in place of the array names in the arithmetic statement, and in case (2) the
array elements to be used are specified in the arithmetic statement.
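By way of illustration, case (2) might be written with the semicolon notation just defined; the statement below is a sketch only, and the ordinary scalar DO loop underneath shows its element-by-element meaning. Case (1) would replace the subscripted names by previously declared descriptors, whose exact declaration syntax is not reproduced here.

C     Case (2): the elements to be used are written directly in the
C     arithmetic statement with the semicolon (descriptor) notation
      A(1;999) = 2.0*B(2;999)
C
C     which has the same meaning as the scalar loop
      DO 10 I = 1, 999
   10 A(I) = 2.0*B(I+1)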
Access to all instructions of the CYBER 205 may be obtained through
special subroutine calls of the form CALL Q8ADDX(R,S, T) which, for
example, generates the single machine instruction ADDX R, S, T. The
mnemonics for other machine instructions can similarly be prefixed with the
reserved letters Q8 and used as a subroutine call in order to generate a single
instruction in the place in the FORTRAN code where the subroutine call is
made. Alternatively user-supplied assembler code can be incorporated in a
FORTRAN program by linking a subprogram generated by the CYBER 200
assembler itself as an external reference to the FORTRAN program during
the loading of the program parts.
Some frequently occurring dyadic and triadic operations on vectors,
including recursive operations, have been efficiently programmed and are
available via special subroutine calls to the STACKLIB routines. Examples
from the 25 general forms are:
(1) Add Recursive
(2) Multiply Add
In the above the letters following Q8 identify the type of arithmetic operations
involved, and the numerical code indicates whether the operands are scalar
or vector and which operands are recursive.
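For example, the 'add recursive' form corresponds to the first-order recurrence sketched below, which cannot be expressed as a single ordinary vector instruction because each element depends on the previous result; the routine name and calling sequence are illustrative and are not the actual STACKLIB interface.

      SUBROUTINE ADDRV(V1, V2, N)
C     Scalar definition of an 'add recursive' operation:
C        V1(I) = V1(I-1) + V2(I),   I = 2, ..., N.
C     Recurrences of this kind are what the STACKLIB routines
C     implement efficiently.
      INTEGER N, I
      REAL V1(N), V2(N)
      DO 10 I = 2, N
   10 V1(I) = V1(I-1) + V2(I)
      RETURN
      END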
The META assembler program for the CYBER 205 generates relocatable
binary code from mnemonic machine instructions, procedures, functions and
miscellaneous directives. Access is thereby provided to all the hardware
facilities of the machine. Directives allow the programmer to control the
process of assembly. Some features of the assembler are: conditional assembly
capability for selective assembly; generation of re-entrant code that can be
used simultaneously by several users without duplication of the code; the ability
to redefine all or any of the instruction mnemonics; and the ability to define a
symbol for a set (or list) of data, whose attributes (type, number of
elements etc) may be assigned and referenced by the programmer. The assembly
process takes place in two passes. In the first pass, all statements are
interpreted, values are assigned to symbols, and locations are assigned to
each statement. In the second pass, external and forward references are
satisfied, data generation is accomplished, and the binary output and assembly
listings are produced. Assembler programs are modular in form and may
consist of several subprograms that are linked together by the LOADER
program.
The LOADER program is one of the operating system utilities. It takes
relocatable binary code produced by the FORTRAN compiler or the META
assembler, links these with any requested library routines, and produces an
executable program file. The user has control over the characteristics of the
program file and may, for example, specify that certain routines be loaded
as a group in either a small or large virtual page. Source files of the
CYBER 205 software system, including the compiler and assembler, and user
programs are all stored as card images in program files that may be created,
edited and maintained on a card-by-card basis by the utility program
UPDATE. Object binary files may be edited with the object library editor.
The above file maintenance activities, job preparation and input/output
are performed on the front-end computer, thereby leaving the CYBER 205
for its main task of large-scale calculation. The hardware link between the
front-end computer and the CYBER 205 is controlled by link software which
permits multiple front-end computers to operate concurrently.
2.3.6 Performance
We consider first the performance of the CYBER 205 in the best case, when
successive elements of all vectors are stored contiguously in memory. The
performance of non-contiguously stored vectors is considered on page 184.
Table 2.4 gives the expected performance of a selection of such contiguous
vector operations in 64-bit floating-point arithmetic on a two-pipe
CYBER 205. We notice immediately that the half-performance length n1/2
is, with the exception of scatter and gather, close to 100; that is to say at
least twice as long as on the CRAY X-MP. Since the value of n1/2 determines
the best algorithm to use (see Chapter 5), it may be the case that different
algorithms should be used on the two machines, even though they are both
in the general category of pipelined vector computers. For most instructions,
the asymptotic operation rate is 100 Mop/s or Mflop/s, compared with
70 Mflop/s for such dyadic operations on the CRAY X-MP. This rate is
reduced to 40-50 Mop/s for the scatter, gather, max/min and product of
element instructions.
TABLE 2.4 Expected vector performance of a two-pipe CYBER 205 for a selection
of instructions (64-bit working), interpreted in terms of r∞ and n1/2. The actual
performance in a multiprogramming environment may differ somewhat from these
values. N = number of elements in the output vector, I = number of elements in the
input vector or vectors. All vectors are contiguously stored in memory.
                        Time                 r∞
Instruction             (clock periods)      (Mop/s)      n1/2
If the time for operating on a vector of length n (for one pipe in 64-bit working, say) is written in the usual form
t = (n + n1/2)/r∞,                                                  (2.9)
then this is also the time for operating on a vector of length n' = cn, where
c = 2 or 4 in the above cases. Substituting in equation (2.9), we have
t = (n'/c + n1/2)/r∞                                                (2.10a)
  = (n' + c n1/2)/(c r∞).                                           (2.10b)
In the new situation (indicated by a prime) we have, by definition,
t = (n' + n'1/2)/r'∞,                                               (2.10c)
whence
r'∞ = c r∞   and   n'1/2 = c n1/2.                                  (2.11)
However two operations must be credited for every result returned to memory
and r∞ is thereby doubled. Put another way: let the timing equation for
either a multiplication or addition vector operation be
t = (s + l + n - 1)τ,                                               (2.12a)
where s is the time to read and write to memory and l is the arithmetic pipe
length. In the CYBER 205 s » l > 1, hence
r∞ = τ^-1   and   n1/2 = s + l - 1 ≈ s.                             (2.12b)
If two such operations are linked together, then the timing equation per
vector operation becomes:
t = (s + 2l + n - 1)τ/2,                                            (2.13a)
and hence
r∞ = 2τ^-1   and   n1/2 = s + 2l - 1 ≈ s.                           (2.13b)
Thus, as previously stated, r∞ is doubled and n1/2 is approximately unchanged.
This is shown by curves B and C in figure 2.26. In fact n1/2 is increased by
about 50% due to the time for the select-link instruction, which has been
ignored in the above analysis, but which must be executed just prior to the
vector instructions that are to be linked together. This increase in n1/2 can
be seen in curve C of figure 2.26.
The results of the above measurements of r∞ and n1/2 are summarised in
table 2.5 and compared with the previous results for a single CPU of the
CRAY X-MP. The specific performance π0 measures the short vector
performance (see §1.3.5), hence one can see immediately that the short vector
performance of the X-MP/1 is always greater than that of the CYBER 205,
even the four-pipe machine. On the other hand—except for the one-pipe 205
in 32-bit mode—the long vector performance of the CYBER 205, which is
measured by r∞, is always greater than that of the X-MP/1. It follows that
there must be a vector length, say n̂, above which the CYBER 205 is faster
and below which the CRAY X-MP is faster.
The value of n̂ can be obtained by equating the performance of the two
machines. If we use a superscript (2) for the CYBER 205 and the superscript
(1) for the CRAY X-MP, one obtains
r∞^(1) n̂ / (n̂ + n1/2^(1)) = r∞^(2) n̂ / (n̂ + n1/2^(2)),               (2.14a)
whence
n̂ = n1/2^(2) (1 - γ) / (α - 1),                                     (2.14b)
where α = r∞^(2)/r∞^(1) and γ = π0^(2)/π0^(1) are the ratio of asymptotic performance
and the ratio of specific performance respectively. Equation (2.14b) was used
to calculate the values of n̂ in table 2.5.
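The crossover length can also be computed directly from the measured parameters, without introducing α and γ. The function below is a sketch; it assumes the generic timing r(n) = r∞ n/(n + n1/2) of §1.3, and its name and argument order are arbitrary.

      REAL FUNCTION XNHAT(R1, XN1, R2, XN2)
C     Vector length at which two computers with parameters
C     (r-infinity, n-half) = (R1, XN1) and (R2, XN2) deliver the same
C     average rate r(n) = r-infinity*n/(n + n-half).  The machine with
C     the larger r-infinity is faster for n greater than XNHAT.
      REAL R1, XN1, R2, XN2
      XNHAT = (R1*XN2 - R2*XN1)/(R2 - R1)
      RETURN
      END

For illustration, taking the dyadic X-MP values of table 2.1 (70 Mflop/s, n1/2 = 53) together with the nominal two-pipe CYBER 205 figures of 100 Mflop/s and n1/2 = 100 quoted above gives a crossover length of roughly 57 elements.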
TABLE 2.5 The asymptotic performance r∞, half-performance length n1/2 and the
specific performance π0 = r∞/n1/2 for contiguous memory-to-memory operations on
the CYBER 205 and a one-CPU CRAY X-MP. The term n̂ is the vector length above
which the CYBER 205 has a higher average performance than the CRAY X-MP/1.
The CRAY X-MP/1 has the higher performance for vector lengths less than n̂.
The performance parameters given in tables 2.4 and 2.5 apply only if
successive elements of the vectors involved are stored in successive memory
addresses. Such vectors are said to be contiguous, and the memory of all
computers is usually organised to access such vectors without memory-bank
or memory-data-path conflicts. The stride of a vector is the interval in memory
address between successive elements of the vector. A contiguous vector is
therefore a vector with a stride of one, and any other vector is a non-contiguous
vector. In general, vectors may have other constant strides. For example, if
the elements of an (n x n) matrix are stored contiguously column by column
(normal FORTRAN columnar storage), the rows of the matrix form vectors
with a constant stride of n. Such vectors are sometimes described as being
periodic. Other vectors may have elements whose location is specified by a
list of addresses which may have arbitrary values. Such vectors are referred
to as random vectors, and are accessed by using the scatter/gather (or indirect
addressing) instructions of a computer.
Since computers are normally optimised for rapid access to elements of
a contiguous vector, their performance is usually degraded (sometimes
dramatically) if non-contiguous vectors are involved. This is particularly true
in the case of the CYBER 205 and, as an example, we consider the timing
of a dyadic operation X = Y *Z between random vectors. Since the only
vector instructions available on the CYBER 205 are between contiguous
vectors, this non-contiguous vector operation must be performed in several
stages: first the two input vectors Y and Z must be ‘gathered’ into two
temporary contiguous vectors; then a contiguous vector operation can
be performed, producing a temporary contiguous result; and finally the
contiguous result is ‘scattered’ to the random locations of the vector X. We
can calculate the timing for this non-contiguous operation by using the timing
formulae in table 2.4:
(2.15)
Thus we find that the use of non-contiguous vectors has degraded the
performance by almost a factor 10 from the asymptotic contiguous
performance of 100 Mflop/s. Because the time for the non-contiguous
operation is dominated by the time for the scatter/gather operations which
are not speeded up by increasing the number of vector pipelines, the above
performance of approximately 10 Mflop/s maximum is virtually unchanged
if the number of vector pipelines is increased.
It would, of course, be absurd to program the CYBER 205 entirely with
non-contiguous vector operations of the kind discussed in the last paragraph.
First, all problems should be structured so that the number of non-contiguous
operations is reduced to a minimum, possibly even zero; and secondly if
non-contiguous operations are unavoidable, it is desirable to group them so
that many contiguous operations (rather than one in the above example)
are performed on the temporary contiguous vectors. In this way the overhead
of the scatter/gather operations is amortised over many vector operations.
Even so the contiguous and non-contiguous performance for a dyadic
operation are the best and worst possible cases, and actual performance on
a particular problem will lie between the two. The fact that the range of
performance between the worst and best case on the CYBER 205 is so large
is indicative that considerable program restructuring may be necessary to
get the best performance out of this computer.
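The recommended grouping can be sketched in illustrative FORTRAN (a hypothetical routine, not CYBER 205 code): the operands are gathered once, several contiguous vector operations are performed on the temporaries, and the result is scattered once, so that the scatter/gather cost is shared by all the contiguous operations.

      SUBROUTINE RNDOP(X, Y, Z, IX, IY, IZ, N)
C     Gather once, perform several contiguous vector operations on the
C     temporaries, scatter once.  Assumes N .LE. 1024.
      INTEGER N, I, IX(N), IY(N), IZ(N)
      REAL X(*), Y(*), Z(*)
      REAL TY(1024), TZ(1024), TX(1024)
C     Gather the two random input vectors into contiguous temporaries.
      DO 10 I = 1, N
         TY(I) = Y(IY(I))
         TZ(I) = Z(IZ(I))
   10 CONTINUE
C     Contiguous vector operations (these run at full vector speed).
      DO 20 I = 1, N
         TX(I) = TY(I)*TZ(I)
         TX(I) = TX(I) + TY(I)
         TX(I) = TX(I)*0.5
   20 CONTINUE
C     Scatter the contiguous result to the random locations of X.
      DO 30 I = 1, N
         X(IX(I)) = TX(I)
   30 CONTINUE
      END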
FIGURE 2.27 (a) Overall architectural block diagram of the ETA10 computer.
FIGURE 2.27 cont. (b) General view of the ETA10 installation at Florida State
University, Tallahassee, which was installed in January 1987. Each of the two low
cabinets at the front holds four CPUs, each with 4 Mword of local memory. The
large shared memory of up to 256 Mword and the I/O units are contained in the
taller cabinet behind.
and shared memory, however, will be air cooled. The ETA10 will therefore
be the first commercially produced cryogenic computer. Figure 2.27(c)
(bottom) shows the CPU boards being lowered into the cryogenic tank, and
the thick insulating layer that separates the liquid-nitrogen-cooled logic
boards from the air-cooled local memory.
The 4 Mword local memory in each CPU is made from 64K-bit CMOS static
RAM chips, and the shared memory uses 256K-bit dynamic RAM chips. The
use of the above density of VLSI enables the maximum eight-CPU 256 Mword
ETA10 system to be contained in a single cabinet occupying only 7 ft x 10 ft
of floor space (figure 2.21(b)).
The instruction set of the ETA10 is upward compatible with that of the
CYBER 205: that is to say all CYBER 205 machine instructions are
included. However, some instructions have been added to manipulate the
communication buffer and permit MIMD programming, i.e. the synchronisation
of the multiple CPUs when working together to solve a single problem
(multi-tasking on the CRAY X-MP).
A large software effort has been mounted to support the ETA10. The
VOS (virtual operating system) will be provided with user interfaces to
maintain compatibility with the CYBER 205 virtual operating system, and
UNIX will be provided for compatibility with a large range of workstations
and minicomputer front-ends. The VOS design emphasises direct interactive
communication between the user and the ETA10, with support for both
high-speed and local area networks, thus eliminating the requirement for a
large general-purpose front-end computer. Although FORTRAN 77 is
available, the main programming language is expected to be FORTRAN 8X,
which anticipates the ANSI 8X standard (see Chapter 4) and provides
structures for expressing program parallelism.
In addition to vectorising compilers for the above languages, the KAP
preprocessor developed by Kuck and Associates, and based on Kuck’s
parafrase system, automatically identifies parallelism in programs and
restructures them to enhance the level of subsequent vectorisation (Kuck
1981). The operating system will also contain a multi-tasking library for
parallel processing. The multi-tasking tools allow access to a shareable data
set from each processor.
and vector units like the CYBER 205, and vector registers like the CRAY-1;
indeed a single architectural block diagram could be used for all three
machines. The computers are the Fujitsu FACOM VP-100 and VP-200 with
an advertised peak performance of 266 and 533 Mflop/s respectively; the
Hitachi HITAC S810 models 10 and 20 with a peak performance of 315 and
630 Mflop/s; and the NEC SX1 and SX2 with a peak performance of 570
and 1300 Mflop/s. Although similar in overall architecture, the three
computers differ significantly in detail, particularly in the cooling technology.
We will now describe the computers in more detail.
FIGURE 2.28 Overall view of the Fujitsu VP-200. The three cabinets on the
right are for memory, scalar unit, and channel processors; the two on the left are the
vector unit and a second cabinet of memory.
figure 2.29 are about 0.75 inch x 0.75 inch x 0.75 inch, and are mounted on
multichip carriers (MCCs) holding up to 121 assemblies in an 11 x 11 array,
called a stack. This is shown in figure 2.30(b). The main memory of the
VP-100/200 is made from 64K-bit static MOS chips with an access time of
55 ns. These chips do not require fins to dissipate the heat and are mounted
as flat packs on 24 x 38 cm2 6-layer printed circuit memory boards (not
illustrated, but like any other such board). Each board contains a 4 x 32 array
of 64K-bit data chips or 1 Mbyte of memory.
The overall architecture of the FACOM vector processors is given in
figure 2.31. The architecture of the vector unit is CRAY-like, in the sense
that multiple functional units (for floating-point add, multiply and divide)
work from a vector register memory (64 Kbyte). However, there is a separate
scalar unit, as in the CYBER 205, with 64 Kbyte of buffer storage (5.5 ns
32 bits (16 bits on the VP-100), which are used to store mask vectors which
control conditional vector operations and vector editing operations.
A unique feature of the computer is that the vector register storage may
be dynamically reconfigured under program control either as 256 vector
registers of 32 64-bit elements, 128 registers of 64 elements,... etc, or as eight
registers of 1024 elements. The length of the vector registers is specified by
a special register, and can be altered by a program instruction.
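A quick arithmetic check shows that every configuration uses the whole 64 Kbyte register file, since 64 Kbyte holds 8192 64-bit elements and

    256 x 32 = 128 x 64 = ... = 8 x 1024 = 8192 elements.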
The clock period of the scalar unit is 15 ns, which is called the major cycle.
The vector unit, however, works on a clock period of 7.5 ns, the minor cycle.
On the VP-200 the floating-point add and multiply pipelines can deliver two
64-bit results per clock period, leading to a peak performance of 267 Mflop/s
for register-to-register dyadic operations using one pipeline, and 533 Mflop/s
for register-to-register triadic operations which use both the add and multiply
pipelines simultaneously. These rates are halved on the VP-100. The divide
pipeline is slower and has a peak performance of 38 Mflop/s. On the VP-200
each load/store pipeline can deliver four 64-bit words every 15 ns, or
equivalently a bandwidth of 267 Mword/s (133 Mword/s on the VP-100).
These rates are 2/3 of the bandwidth which is required to support a dyadic
operation with arguments and results stored in main memory. Thus, unlike
the CRAY X-MP and CYBER 205, the Fujitsu VP has insufficient memory
bandwidth to support such memory-to-memory operations. This puts a
heavier burden on the compiler to make effective use of the vector registers
for intermediate results in order to limit transfers to and from main memory.
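The factor of 2/3 follows from a simple count, taking the VP-200 figures and assuming three memory references per dyadic result (two operand loads and one store):

    required bandwidth:  3 x 267 Mflop/s = 800 Mword/s
    available bandwidth: 2 pipelines x 267 Mword/s = 533 Mword/s = 2/3 of 800 Mword/s.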
The instruction set of the Fujitsu VP is identical to that of the IBM 370, with the
addition of vector instructions; indeed IBM-370-generated load modules will
run on the VP without change. Vector instructions include conditional
evaluation of a vector arithmetic operation controlled by a mask with one
bit per element of the vector (as on the CYBER 205, §2.3.4); compress and
expand vectors according to a condition; and vector indirect addressing,
that is to say a random scatter/gather instruction as described for the
CRAY X-MP in §2.2.4. This instruction can gather four elements every 15 ns.
It is anticipated that most users will write their programs in FORTRAN,
and an extensive interactive software system is being developed for the
interactive optimisation and vectorisation of such programs (Kamiya, Isobe,
Takashima and Takiuchi 1983, Matsuura, Miura and Makino 1985). For
example, the vectorisation of IF statements presents a particular problem,
FIGURE 2.30 Logic technology of the Fujitsu vector processor. (a) A
multichip carrier (MCC) with space for 121 LSI chips. (b) A stack of 13 horizontally
mounted MCCs.
and the FORTRAN 77/VP vectorising compiler selects the best of three
possible methods. These are: (a) conditional evaluation using a bit-mask;
(b) selection of the participating elements into a compressed vector before
performing the arithmetic; and (c) the use of the vector indirect addressing
to select the participating elements. The compiler compares the three methods,
based on the relative frequency of load/store operations in the DO loop,
and the fraction of the vector elements which are participating (the true
ratio). If the true ratio is medium to high, a masked arithmetic operation is
best; otherwise the compress method is best when the frequency of load/store
operations is low, and indirect addressing is best when the frequency is high.
Interaction takes the form of suggestions to the programmer on how to
restructure his program to improve the level of vectorisation.
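The three alternatives can be pictured with an illustrative scalar equivalent (hypothetical code; the compiler of course generates vector instructions rather than these loops):

      SUBROUTINE CONDV(A, B, P, N)
C     Three ways of vectorising:  IF (P(I).GT.0.0) A(I) = B(I)*P(I)
C     Illustrative only; assumes N .LE. 1024.
      INTEGER N, I, K, M
      REAL A(N), B(N), P(N)
      REAL T(1024), BP(1024), PP(1024), AP(1024)
      INTEGER IND(1024)
C     (a) Masked evaluation: compute everywhere, store only where true.
      DO 20 I = 1, N
         T(I) = B(I)*P(I)
         IF (P(I) .GT. 0.0) A(I) = T(I)
   20 CONTINUE
C     (b) Compress: collect the participating elements, operate on the
C         compressed vectors, then expand the result back.
      M = 0
      DO 30 I = 1, N
         IF (P(I) .GT. 0.0) THEN
            M = M + 1
            IND(M) = I
            BP(M) = B(I)
            PP(M) = P(I)
         END IF
   30 CONTINUE
      DO 40 K = 1, M
   40    AP(K) = BP(K)*PP(K)
      DO 45 K = 1, M
   45    A(IND(K)) = AP(K)
C     (c) Indirect addressing: gather/scatter through the index list.
      DO 50 K = 1, M
   50    A(IND(K)) = B(IND(K))*P(IND(K))
      END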
1984. Other machines have been installed for internal company use. The
Hitachi S-810 model 10 and model 20 vector computers are similar in overall
architecture to the Fujitsu machines, as can be seen by comparing figures 2.31
and 2.32. The principal difference is that the S-810 has more pipelines
(Nagashima, Inagami, Odaka and Kawabe 1984). There are three load and
one load/store pipelines on the S-810 compared to only two load/store
pipelines on the VP. This means that memory-to-memory dyadic operations
can be supported on the S-810 at the full rate. The main memory size is
256 Mbytes (40 ns access time) and there is 64 Kbytes of vector register
storage. Like the Fujitsu VP, this register storage can be reconfigured
dynamically to hold vectors of different lengths.
The model 20 has 12 floating-point arithmetic pipelines (four add, two
multiply/divide followed by add, and two multiply followed by add). The
clock period for both models is 14 ns, which corresponds to a theoretical
peak performance of 71.4 Mflop/s per pipeline for register-to-register
operation, or 857 Mflop/s for the 12 pipelines. However, if one takes into
account the time to load the vector registers from main memory this is
reduced to a realistic maximum performance of 630 Mflop/s if all the
pipelines are used. The design is optimised to evaluate expressions such as
A = (B + C)*D which require three vector loads and one vector store and
thus use the four load and store pipelines. The model 10 has six floating-point
FIGURE 2.32 Overall architectural block diagram of the Hitachi S-810 model 10.
(VMR denotes the vector mask register and L denotes logical operations.)
pipelines and half the vector register and main memory size. Its peak
performance is quoted as 315 Mflop/s.
As with any of the computers discussed, the observed performance on
actual problems will be less than the above peak rates, because of problems
of memory access. As a simple test, the (r∞, n1/2) benchmark described in
§1.3.3 has been executed for a number of vectorised DO loops (statement 10
of program segment (1.5)). The results are given in table 2.6 for the S-810
models 10 and 20 in both 32- and 64-bit precision. For the model 10 we
observe a maximum performance of approximately 240 Mflop/s for the
four-ops case, which is an expression that makes maximum use of the
hardware. In this case, there are three input vectors and one output vector
which occupy all the four pipelines to memory. In addition, the expression
uses the two add and the two multiply pipelines. Since the pipelines have a
clock period of 14 ns this corresponds to a maximum rate of 71 Mflop/s per
pipeline, giving a maximum expected performance of 284 Mflop/s. The
measured value of 240 Mflop/s is less than this, due to the time required to
load the vector registers.
If there are more than three input vectors and one output vector, the memory
bandwidth is insufficient to feed the arithmetic pipelines with data at the rate
TABLE 2.6 Results for the (r∞, n1/2) benchmark on the Hitachi S-810/10
with figures for the model 20 in parentheses. Upper case variables are vectors,
and lower case are scalars. (Data courtesy of M Yasumura, Hitachi Central
Research Laboratory, Tokyo.)
Operation: statement 10 of program (1.5)    Stride    Precision (bits)    r∞ (Mflop/s)    n1/2
FIGURE 2.33 General view of the Nippon Electric Company SX2 computer.
FIGURE 2.34 Water-cooled technology of NEC SX1/SX2. (a) A 10 cm2 multichip
package, containing 36 LSI chips, each of which has 1000 logic gates. (b) The liquid
cooling module.
in table 2.7. The top three rows give the average performance in Mflop/s for
three of the so-called Livermore loops (McMahon 1972, Arnold 1982, Riganati
and Schneck 1984). Fourteen such DO loops were selected by the Lawrence
Livermore Laboratory as being typical of their computer-intensive work. We
have selected three which generally exhibit both the best and the worst
performance of a computer. Loop 3, the inner product, is at the centre of
most linear algebra routines. Most vector computers make special provision
for this loop, and the highest vector performance is usually observed. The
performance in loops 6 and 14 is usually more characteristic of the scalar
performance because the DO loops involve recurrences and the opportunity
Problem    CRAY X-MP(d)    CYBER 205(f)    IBM 3090VF    Fujitsu VP200(2)/VP400(4)    Hitachi S-810/20    NEC SX2    CRAY-2(d)    ETA10
Notes
a Solution of linear equation using DGEFA and DGESL for matrices of order n (Dongarra et al 1979).
b Basic linear algebra subroutines (BLAS) optimised in assembler (Dongarra 1985).
c Matrix-vector method of Dongarra and Eisenstat (1984). Matrix order 300. Best reported assembler (Dongarra 1985).
d Number of CPUs used is shown in parentheses.
e Fuss and Hollenberg 1984.
f Number of pipelines is shown in parentheses.
g Van der Steen 1986.
h All FORTRAN.
i After optimisation (Nagashima et al 1984).
j Dongarra 1986.
k Dongarra and Sorensen 1987.
The FPS AP-120B and its derivatives—the AP-190L, the FPS-100, 164,
164/MAX, 264 and the FPS-5000—are all members of a single family of
computers based on a common architecture, namely that of the FPS AP-120B.
These computers have been renamed as follows: the original MSI version of
the FPS-164 (now discontinued) is called the M140, the later VLSI version
(first called the FPS-364) is called the M30, the FPS-164/MAX is called the
M145, and the FPS-264 is now the M60. They are all manufactured
by Floating Point Systems Inc. in Beaverton near Portland, Oregon, USA.
The company was founded in 1970 by C N Winningstad to manufacture
low-cost yet high-performance floating-point units to boost the performance
of minicomputers, particularly for signal processing applications. Starting in
1971, the company produced floating-point units for inclusion in other
manufacturers’ machines (e.g. Data General). The first machine marketed
under the company’s name, the AP-120B, was co-designed by George
O’Leary and Alan Charlesworth, and had a peak performance of 12 Mflop/s.
Deliveries began in 1976, and by 1985 approximately 4400 machines had
been delivered. The FPS-100 is a cheaper version of the AP-120B, made for
inclusion as a part of other computer systems. The AP-120B was designed
for attachment to minicomputers, and a version with more memory, called
the AP-190L, was introduced for attachment to larger mainframe computers
such as the IBM 370 series.
In 1980 the concept of the AP-120B was broadened from rather specialised
signal processing applications to general scientific computing by increasing
the word length from 38 to 64 bits, and the addressing capability from 16
to 24 bits. The memory capacity was also greatly increased, first to 1 Mword
then to 7.25 Mword. The new machine which evolved was the FPS-164 which
was first delivered in 1981. By 1985 about 180 FPS-164s had been sold.
Although capable of solving much larger problems than the AP-120B,
the FPS-164 was no faster at arithmetic—indeed its peak performance
of 11 Mflop/s was 1 Mflop/s less than that of the AP-120B. The first
improvement in arithmetic speed came in 1984 with an enhancement to the
architecture called the matrix accelerator (MAX) board. Each such MAX
board can be regarded computationally as the equivalent of two additional
FPS-164 CPUs, so that a machine with the maximum of 15 MAX boards has
a theoretical peak performance of 31 FPS-164 CPUs or 341 Mflop/s. The
AP-120B and the FPS-164 are both implemented in low-power (and therefore
low-speed) transistor-transistor logic (TTL), and the next improvement
FIGURE 2.37 Rear view of the FPS AP-120B showing the vertically
mounted 10 in x 15 in circuit boards and fan cooling. (Photograph
courtesy of D Head and Floating Point Systems, S A Ltd.)
2.5.2 Architecture
The overall architecture of the AP-120B is shown in figure 2.39. It is based
on multiple special-purpose memories feeding two floating-point pipelined
arithmetic units via multiple data paths. The machine is driven synchronously
from a single clock with a period of 167 ns. This means that the state of the
machine after a sequence of operations is always known and reproducible.
FIGURE 2.38 Two circuit boards from the FPS AP-120B. Left:
the control buffer logic board from the control unit which decodes
instructions; right: a board from the program memory which stores
instructions. (Photograph courtesy of D Head and Floating Point
Systems, S A Ltd.)
The operation of the machine can therefore be exactly simulated, clock period
by clock period, and the machine does not suffer from the delicate timing
uncertainties that had plagued some earlier computers which had separate
clocks driving several independent units. Multiple data paths are provided
between the memories and the pipelines, in order to minimise the delays and
contentions that can occur if a single data path is shared between many units.
Starting at the top of figure 2.39 the memories are: a program memory of
up to 4K 64-bit words for storing the program (cycle time 50 ns); a scratch-pad
(S-pad) memory of 16 16-bit registers for storing addresses and indices; a
table memory (167 ns cycle time, either read only or read/write memory) of
up to 64K 38-bit words for storing frequently used constants, such as the
sine and cosine tables for use in calculating a Fourier transform; two sets
(data pad X and data pad Y) of 32 38-bit registers for storing temporary
floating-point results; and a main data memory for 38-bit words (plus three
parity bits), directly addressable to 64 Kwords but, with an additional 4-bit
page address, expandable up to 1 Mword. Separate 38-bit data paths are
provided to each of the two inputs to the floating-point adder and multiplier.
These four independent paths may be fed from the main data memory, the
data pads or from table memory. Three further 38-bit data paths feed results
from the two pipelines back to their own inputs, or to the data pads or main
data memory. These multiple paths allow an operand to be read from each
data pad and a result written to each data pad during one machine cycle.
FIGURE 2.39 Overall architecture of the FPS AP-120B, showing the multiple
memories, arithmetic pipelines and data paths. (Diagram courtesy of Floating Point
Systems Inc.)
The address within both data pads is given by the contents of the data pad
address register (DPA). Relative addressing (-4 to +3) with respect to this
address is available separately for each data pad within the instruction in
the data pad index fields XR, YR, XW, YW (see §2.5.4).
The main data memory is available in 8K modules (or 32K modules
depending on the chip type) which are each organised as a pair of independent
memory banks, one bank for the odd addresses and the other bank for the
even addresses. The standard memory has an access/cycle time of 500 ns and
the optional fast memory has a cycle time of 333 ns. Successive references to
the same memory bank (e.g. all even addresses less than 8K) must be separated
by at least three clock periods with standard memory or two clock periods
with fast memory. Two successive references to different memory banks (e.g.
two neighbouring addresses which are from odd and even banks, or two
even addresses separated by 8K and therefore in different modules) may
however be made on successive clock periods. Alternating references to the
odd and even memory banks, as would occur when accessing sequential
elements of a long vector, can occur at one reference per clock period for
fast memory, giving an access to the same bank every 333 ns (matching the
capability of the memory chip), and an effective minimum cycle time between
requests to memory as a whole of 167 ns. For standard memory this rate
must be halved, giving an effective memory cycle time for such optimal
sequential access of 333 ns. If repeatedly accessing the same bank, a cycle
time of 500 ns (three clock periods) applies. The memory is therefore described
by the manufacturer as having an interleaved ‘cycle’ time of 167 ns for fast
memory or 333 ns for standard memory, even though the memory chips have
a physical cycle time of 333 and 500 ns respectively. However, it should be
remembered, when making comparisons with other machines, that we have
previously quoted the cycle time of the memory chips as a measure of the
quality of the memory (e.g. 38 ns main memory of the CRAY X-MP, although
this is organised into banks so as to give an interleaved ‘cycle’ time of 9.5 ns).
If a reference is made to a part of the memory that is busy, the
machine stops execution until the memory becomes quiet.
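A worked illustration of these rules (assuming the strided accesses all fall within one module):

    stride 1, fast memory:      one word every 167 ns
    stride 2, fast memory:      one word every 333 ns (same bank, two clock periods)
    stride 1, standard memory:  one word every 333 ns
    stride 2, standard memory:  one word every 500 ns (same bank, three clock periods).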
Instructions on the AP-120B are 64 bits wide, and each instruction controls
the operation of all units in the machine. Thus there is, in this sense, only
one instruction in the instruction set (see §2.5.4) with fields which control
each of the 10 functions, although some fields overlap and thus exclude
certain combinations of functions. This arrangement of control is referred to
as ‘horizontal microcode’. Instructions are processed at the maximum rate
of one per clock period, i.e. 6 million instructions per second, but since each
instruction controls many operations this is equivalent to a higher rate on a
conventional machine whose instructions only control one unit.
2.5.3 Technology
The AP-120B is designed for reliability and therefore uses only well proven
components and technologies, under conditions well clear of any operating
limits. As a result the mean time between failure (MTBF) of the hardware is typically
several months to a year. The logic of the computer is made from low-power
Schottky bipolar TTL (transistor-transistor logic) chips with a level of
integration varying from a few gates per chip to a few hundred gates per
chip. Typical gate delays in this logic technology are 3-5 ns. Various registers
in the computer are also made in this logic technology. These are the S-pad
and data-pad registers, and the subroutine return stack. The 50 ns program
source memory and the 167 ns table memory both use 1K Schottky bipolar
memory chips, whereas the slower and larger main data memory uses either
4K or 16K MOS memory chips.
It is interesting to compare the CRAY X-MP with the AP-120B from the
point of view of technology, speed and power consumption, as they represent
opposite extremes. The CRAY X-MP uses high-speed and high-power bipolar
ECL technology with sub-nanosecond gate delays and a clock period of 9.5 ns,
leading to the need for a large freon cooling system to dissipate a total of
about 115 kW. The AP-120B on the other hand uses mostly low-power
technology and consequently has a much longer clock period of 167 ns.
However this permits the use of air cooling and limits the total power
consumption to about 1.3 kW.
FIGURE 2.42 The data fields in the 64-bit instruction of the FPS
AP-120B. This single instruction controls the operation of all units in
the computer at every clock period. (Diagram courtesy of Floating Point
Systems Inc.)
2.5.5 Software
Software for the AP-120B, except for device drivers, is written in FORTRAN,
so that it may be compiled to run on a variety of host computers. It may be
subdivided into the following categories:
(1) operating system;
(2) program development software;
(3) application libraries.
The operating system consists of an executive APEX and a set of diagnostic
routines APTEST. The executive controls transfers of data between the host
and the AP-120B, transfers AP programs from the host to the AP program
source (PS) memory and initiates the execution of programs in the AP. The
operation of APEX is illustrated in figure 2.43. Most user programs will be
FORTRAN programs that call either upon AP-120B maths library programs
that subroutines 1 and 2 have already been called, the following sequence of
events takes place as the FORTRAN program executes in the host computer:
(1) FORTRAN program calls on AP using routine 3 (VADD);
(2) routine 3 calls APEX;
(3) PS memory table searched: routine 3 not in PS memory;
(4) APEX transfers AP-120B instructions from host to PS memory;
(5) PS memory table updated;
calling program on the host, which may then proceed with other calculations
that do not use the AP. If a call to another AP routine is met before the first
has finished, APEX will wait for the first call to be completed.
The program development software comprises:
2.5.6 Performance
The AP-120B does not include a real-time clock and it is therefore impossible
to time the execution of programs accurately. Attempts to time programs by
using the clock on the host computer are usually imprecise and variable
because of the effect of the host operating system. This is particularly the
case if a time-sharing system is in use. In estimating the performance of the
AP-120B we are therefore forced to rely on the timing formula given in the
AP-120B maths library documents. The document (FPS 1976b) which we
use gives timing formulae that may be related directly to formula (1.4a)
defining r∞ and n1/2. The minor differences from later documents (FPS
1979a, b) are unimportant. Because of the synchronous nature of these
machines theoretical timings should be reliable; however the absence of a
clock makes the optimisation of large programs very difficult. The detailed
timing of large programs soon becomes tedious and error prone. An
alternative is to simulate the execution of the AP program on the host computer
using the program APSIM (see §2.5.5). This program produces the theoretical
program timing but again may be impractical for the timing of large programs
because it runs about 1000 times slower than the program would execute on
the AP-120B itself.
Using the maths library documents (FPS 1976b) we give the timing
formulae for a selection of simple vector operations, and derive from them
estimates for n1/2 and r∞. We quote the timing formulae for the standard
memory (500 ns chip cycle time) and give the improved values of r∞ for the
fast memory (333 ns chip cycle time) in parentheses. Where there is a small
timing variation because of the choice of odd or even memory locations for
the vectors, we have taken the minimum timing. None of these minor timing
alternatives substantially change the character of the machine, and they can
largely be ignored.
(1) Vector move
hence
We note that this operation is memory bound and the transfer rate doubles
for the fast memory. However n l/2 is unaffected by the memory type.
(2) Vector addition
therefore
therefore
(4) Vector division
therefore
and we see that, because the calculation is dominated by arithmetic, the faster
memory does not increase the performance.
(5) Vector exponential
therefore
(6) Dot product
therefore
considerations are for the radix-2 transform and show that this algorithm
does not have a high enough computational intensity to keep the arithmetic
pipes busy. By combining two levels of the FFT together, we obtain the radix-4
algorithm and increase the computational intensity to 2.5 flop/ref, which is
a figure satisfying the conditions given above for the fast memory. We find
that the maths library subroutine CFFT does use the radix-4 algorithm and
gives a performance of 8 Mflop/s.
The values of half-performance length found above are in the range
n1/2 = 1-3, showing that the AP-120B, although it has many parallel features,
actually behaves very similarly to a serial computer. In this respect the
computer is similar to the CRAY-1 (n1/2 ~ 10) and quite different from the
CYBER 205 (n1/2 ~ 100) or ICL DAP (n1/2 ~ 1000). The selection of the
best algorithm is normally determined by the value of n l/2 (see Chapter 5)
and we would expect algorithms optimised to perform well on a serial
computer also to perform well on the AP-120B. However, as has been
emphasised above, the performance of a program may be more dependent
on the management of memory references than on the questions of vector
length that are addressed by the value of n l/2.
2.5.7 FPS-164 (renamed M140 and M30) and 264 (renamed M60)
As can be seen from figure 2.44 the FPS-164 is substantially larger than the
AP-120B, being about 5.5 ft high and occupying about 2.5 ft x 7 ft of floor
space, principally because of the need to accommodate a much larger memory.
The same cabinet is also used for the FPS-164/MAX and FPS-264. The
principal improvements introduced in the FPS-164 (compared with the
AP-120B) are:
(a) 64-bit floating-point arithmetic compared with 38-bit;
(b) 32-bit integer arithmetic compared with 16-bit;
(c) 24-bit addressing to 16 Mword compared to 16-bit addressing to
64 Kword only;
(d) 64-bit X- and Y-pad data registers compared with 32-bit;
(e) 64 32-bit S-pad address registers compared with 16 16-bit;
(f) 1024 64-bit instruction cache replacing program memory;
(g) 256 32-bit subroutine return address register stack;
(h) main memory expandable from 0.25 to 7.25 Mword with memory
protection;
(i) table memory of 32 Kwords RAM;
(j) a clock for timing programs— sadly lacking on the AP-120B.
The increase in arithmetic precision and addressing range generally lift the
Operation: statement 10 of program (1.5)    r∞ (Mflop/s)    n1/2
A = B + C                                   0.88            5
CALL VADD                                   1.06            16
A = B*C                                     1.07            5
CALL VMUL                                   1.04            17
                                            0.30            7
A = b*(C - D)
  OPTC = 1                                  0.8             —
  OPTC = 3                                  3.4             —
A = B + C*(D - E)
  OPTC = 1                                  1.0             —
  OPTC = 3                                  3.2             —
Problem                                     FPS-164     FPS-264     FPS-164/MAX(f)
Theoretical peak performance                11          38          33(1), 99(4), 341(15)
(r∞, n1/2) FORTRAN §1.3.3                   (1.07, 5)   —           —
Livermore 3 inner product                   3.0(e)      —           —
Livermore 6 tridiagonal                     1.1(e)      —           —
Livermore 14 particle pusher                1.5(e)      —           —
LINPACK(a) FORTRAN(b), n = 100              1.4         4.7         —
Assembled inner loop, n = 100               2.9         10          6(1)*, 20(15)*
Matrix-vector(d) best assembler, n = 300    8.7         33          15(1)(h), 26(4)(h)
Notes
a Solution of linear equation using DGEFA and DGESL for
matrices of order 100 (Dongarra et al 1979).
b All FORTRAN code (Dongarra 1985).
c BLAS routines optimised in assembler (Dongarra 1985).
d FORTRAN matrix-vector method of Dongarra and Eisenstat
(1984). Matrix order 300. Best reported assembler (Dongarra
1985).
e Gustafson 1985.
f Number of MAX boards used in parentheses.
g FPS 1985b.
h Dongarra 1986.
to all the multipliers. The clock period of the MAX board is the same as the
FPS-164 main CPU, namely 182 ns, hence each board has a peak performance
of 22 Mflop/s. The logic of the FPS-164/MAX is implemented in CMOS VLSI,
FIGURE 2.46 Architecture of a matrix accelerator (MAX) board.
and the arithmetic pipelines of the MAX board (shown in figure 2.47) use
the pipelined WEITEK arithmetic chips.
MAX boards occupy memory board positions of the FPS-164, and it is
possible to upgrade an existing FPS-164 to an FPS-164/MAX. The MAX
boards look to the host computer to be the top 1 Mword of the 16 Mword
of its address space, leaving a maximum addressable normal memory of
15 Mword. The FPS-164/MAX uses the same memory board as the FPS-264,
with 0.5 Mword per board of static NMOS chips. A full FPS-164/MAX has
29 memory board slots of which 14 are used to hold the 7 Mword of physical
main data memory, and 15 slots are used for the 15 MAX boards. The
availability of 256K-bit static NMOS chips will allow 1 Mword per memory
board and a physical memory size of 15 Mword, to match the full addressing
capability.
The idea of the MAX board is to speed up the arithmetic in a nest of two
or three DO loops, such as one finds in many matrix operations, in particular
in the code for matrix multiply. In this example the 31 164-CPUs of a full
system would be used to simultaneously calculate the 31 inner products that
are required to produce 31 elements in a column of the product matrix. The
FPS architecture is already optimised for the efficient calculation of inner
products, and the only problem is to ensure that the required data is available
to the pipelines. One of the vector registers of each of the 31 CPUs receives
one of the 31 rows of the first matrix, whilst the elements of the column of
the second matrix are broadcast, one-by-one, to all CPUs as the 31 inner
products are accumulated by the loop

      DO 1 J = 1, N
      DO 1 K = 1, N
    1 C(I,J) = C(I,J) + A(I,K)*B(K,J)                    (2.18)

taking A(I,K) from the local vector A_I and accumulating the inner product
in C(I,J), which is taken from the local vector C_I. In these operations all
the 31 CPUs work on the inner product, and the DO-J loop moves from column to column. All
re-use in time refers to the fact that C(I,J) and A(I,K) are continually re-used
from local memory in the DO-K and DO-J loops.
The purpose of the re-use facility is to limit the need for transfers between
the main data memory and the MAX boards, and to perform the maximum
amount of arithmetic between such transfers so as to dilute the penalty of
loading the registers. We have expressed this before by the computational
intensity, f, which is the number of floating-point operations per memory
reference. In the above matrix multiply example we have two floating-point
operations per execution of statement 1, and the references are the read of
A and B, and the store of C (there is hardware provision for the initial
clearing of C). Thus
(2.19)
This is to be compared with the hardware parameter, f1/2, which is half the
ratio of asymptotic arithmetic performance to memory bandwidth in the
relevant case of memory transfer overlap which occurs on the FPS-164 (see
§1.3.6 and equation (1.20)). The memory bandwidth to the MAX boards,
rm, is one word per clock period, and the maximum arithmetic rate, r∞, is
62 arithmetic operations per clock period, hence

    f1/2 = r∞/(2 rm) = 62/2 = 31 flop/ref                          (2.20)

whence
(2.21b)
Thus we find that when the conditions for re-use in space and time are
satisfied, a performance within 3% of the maximum peak performance is
possible.
The MAX boards can only execute a limited number of instructions of the
type that are given in table 2.10. The operation of the MAX boards and their
registers are memory-mapped onto the top Mword of the addressable
16 Mword, as shown in figure 2.49. That is to say that the boards are operated
simply by writing and reading to appropriate parts of the upper Mword of
TABLE 2.10 The instructions that may be executed by a MAX board, and
the peak performance in Mflop/s for 1 and 15 boards. The same performance
applies to both real and complex arithmetic. Full vector operations use
J(I) = I.
Name    FORTRAN program line    MAX boards: 1    15
FIGURE 2.49 The memory mapping of MAX boards onto the 16 Mword address space
of an FPS-164.
the FPS-164 address space. This is divided into 16 individual MAX memory
maps of 64 Kword each. The first 15 of these maps operate the 15 MAX
boards individually, and the last is the broadcast segment which operates all
the boards in unison. The first 32 Kword of the memory map addresses is
the vector register storage which allows for eight registers of 4K elements
each. The first MAX implementation, however, limits the vector length to
2K elements. Next are the eight scalar registers and the vector index registers.
The MAX board is operated by placing appropriate words in the ‘advance
pipe’ section.
The nature of the software that is available for driving the FPS-164/MAX
can be seen from the following FORTRAN code that implements the matrix
multiply discussed above
CALL SYSSAVAILMAX(NUMMAX)
MAXVEC = 8*NUMMAX + 4
NUMVEC = MAXVEC
                                                         (2.22)
IF (NUMVEC .LE. 0) GOTO 10
CALL PLOADD(A(I,1), N, 1, NUMVEC, ITMA, 1, IERR)
DO 20 J = 1, N
CALL PDOT(B(1,J), 1, N, C(I,J), 1, NUMVEC, ITMA, 1, 0, IERR)
20 CONTINUE
10 CONTINUE
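The computation being organised by segment (2.22) is, in plain FORTRAN, equivalent to the following sketch (an illustration only; the exact semantics of the PLOADD and PDOT library calls are assumed, not documented here):

      SUBROUTINE MXMSK(A, B, C, N, MAXVEC)
C     Illustrative plain-FORTRAN equivalent of the blocked matrix
C     multiply C = A*B: each block of up to MAXVEC rows of A is held
C     locally while the columns of B are broadcast.
      INTEGER N, MAXVEC, I0, NV, I, J, K
      REAL A(N,N), B(N,N), C(N,N)
      DO 10 I0 = 1, N, MAXVEC
         NV = MIN(MAXVEC, N - I0 + 1)
         DO 20 J = 1, N
            DO 30 I = I0, I0 + NV - 1
               C(I,J) = 0.0
               DO 40 K = 1, N
                  C(I,J) = C(I,J) + A(I,K)*B(K,J)
   40          CONTINUE
   30       CONTINUE
   20    CONTINUE
   10 CONTINUE
      END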
systems, based on linking together multiple FPS processors. The IBM loosely
coupled array of processors (lCAP) is the brainchild of Enrico Clementi and
is installed at the IBM Kingston laboratory. In 1985 a similar configuration
was installed at the IBM Scientific Center, Rome, to be the first computational
heart of the newly set up ‘European Center for Scientific and Engineering
Computing’ (ECSEC).
A simplified drawing of the IBM lCAP computer system is shown in
figure 2.50. Ten FPS-164 computers, each with 4 Mbytes of main data
memory are connected by 2-3 Mbyte/s channels to IBM host computers
(Berney 1984). Seven are connected to an IBM 4381 for computational work
and three to an IBM 4341 for program development, although all ten can
be switched to the IBM 4381, making a system with a theoretical peak
performance of 110 Mflop/s. The actual performance on quantum chemical
problems for the ten-FPS-164 configuration is reported to be about the same
as a CRAY-1S, or about 60 Mflop/s (Clementi et al 1984). Possible
enhancements to the initial configuration involve the addition of two MAX
boards to the ten FPS-164s, which gives a peak performance of 550 Mflop/s.
If the maximum number of 15 boards were added to each FPS-164 the peak
performance would be raised to 3.4 Gflop/s.
The above computer system is described as a loosely coupled array because
in the initial configuration there was no direct connection between the
computing elements (the FPS-164s), and because the connection to the host
is by slow channels. Consequently, only problems which exhibit a very large
grain of parallelism can be effectively computed. That is to say that a very
large amount of work must be performed on data within the FPS-164 before
the results are transferred over the slow channels to the host, or via the
host to other FPS-164s. A fast 22 Mbyte/s FPS bus (called the FPSBUS),
directly connecting the FPS-164s, has subsequently been developed by FPS
which substantially reduces the overhead of transferring data between the
FPS-164s.
A similar project to the above has been developed for some years at Cornell
University’s Theory Center under the direction of Professor Kenneth Wilson.
Initially, this comprised eight FPS-100 processors connected by a custom
24 Mbyte/s bus. The work is now part of the Cornell Advanced Scientific
Computing Center which is sponsored by NSF, IBM and FPS. It is
envisaged that up to 4000 MAX boards could be interconnected to give
a peak performance rate of about 40 Gflop/s.
In order to quantify the synchronisation and communication delays on
the lCAP, measurements have been made of the performance parameters
(r∞, s1/2, f1/2) and these are discussed in §1.3.6, part (iv). The benchmark has
been conducted using either the channels or the FPSBUS for communication,
and gives rise to the following total timing equations (Hockney 1987d)
(2.23a)
(2.23b)
where m is the number of I/O words and s the number of floating-point
operations in the work segment (see §1.3.6, part (iv)). The first two terms in
equations (2.23) are a fit to the synchronisation time, the third term is the
time spent on communication, and the last term is the time spent on
calculation. The fact that the communication term is inversely proportional
to the number of processors, p, in the case of the channels, shows that the
channels, although slow, are working in parallel. On the other hand, we see
that the communication time does not depend on p in the case of the
FPSBUS showing that the bus, although much faster, is working serially.
In order to compare the use of the channels with the use of the FPSBUS
we equate the two timing formulae (2.23a) and (2.23b) and obtain the equation
for the equal performance line (EPL):
(2.24)
This relationship is plotted on the (p,m) phase plane in figure 2.51. Given a
number of processors p and the number of I/O words m, a point is specified
on this plane. Its location in the plane determines whether bus or channel
communication should be used. There is an infinity in the relationship (2.24)
at p = 9.78 (broken line) showing that the channels will always be faster if
FIGURE 2.51 Phase diagram comparing the use of the FPSBUS with
the use of the channels. For any number of processors chosen, p, either
the FPSBUS or the channels is faster depending on the number of I/O
words m. For more than about 10 processors the channels are always
faster. (Horizontal axis: number of FPS-164s, p.)
FIGURE 2.52 Overall architecture of the FPS-5000 series of computers.
there are ten processors or more. This is because the channels have a smaller
start-up and synchronisation overhead than the bus, and a faster asymptotic
rate if there are ten or more working in parallel. That is to say, more than
ten 2-3 Mbyte/s channels working in parallel are faster than a 22 Mbyte/s bus
working serially. We also show in figure 2.51 the line corresponding to
1 Mword per processor, which is a typical main memory size for an FPS-164
installation. Values of m above this line are inaccessible in such an installation,
because problems requiring such a magnitude of I/O would not fit into the
memory of the installation. However, the memory size can be increased to
28 Mword/FPS-164 (using 1 Mword memory boards) corresponding to a
line off the top of the diagram.
clock or the slower FPS-100 with a 250 ns clock. Both have the same
c p s
of between 0.25 and 1 Mword. The ACs are FPS XP-32 computers with a
To
cp and scm
multiplier and adder were both compressed to one board each. The XP-32,
on the other hand, fits the whole processor onto a single board and,
furthermore, includes a second floating-point adder. This is achieved by using
fast Schottky VLSI chips with a 6 MHz clock. The multiplier is now reduced
to a single chip (from the three boards in the AP-120B), namely the 32-bit
WEITEK WTL-1032. Similarly the floating-point adders each use the
WEITEK WTL-1033 floating-point ALU chip. The rest of the logic uses the
arranged in two banks, and the TCM has 4K 32-bit words also arranged in
two banks. Overall control of the XP-32 is exercised by the executive unit
(EU), which can operate simultaneously with the arithmetic unit (AU), thereby
providing for the parallel execution of I/O and address calculation with
floating-point arithmetic. The EU performs all communication of programs;
the programs for the EU reside in EU PROM, which contains 2K 80-bit
microcode instructions.
acts as the main data memory of the control processor. In the case of the
AP-120B control processor, it operates as described in §2.5.2 for the fast
memory (333 ns access) with a 167 ns clock period. In the case of the FPS-100
control processor, the memory works on the slower 250 ns clock period. The
arithmetic coprocessors may also have direct memory access (DMA) to the
SCM by taking turns with the CP for the available memory cycles, according
read or write one word per clock period (but not both at the same time),
giving a total SCM memory bandwidth of either 6 Mword/s (24 Mbyte/s) or
4 Mword/s (16 Mbyte/s). However, the memory is organised such that any
individual XP-32 coprocessor may only use half this bandwidth, thereby
allowing two ACs on an FPS-5000 system before the memory bus restricts
r∞ = 2 Mop/s
XPISNC      Wait for transfer (or arithmetic) to finish.
Operation    Configuration    r∞ (Mflop/s)    n1/2 or s1/2
TABLE 2.12 Values of peak performance, r∞, and f1/2 for a single
FPS XP-32 arithmetic coprocessor when performing triadic ZVSASM
operations on data originating in system common memory.
There are only two techniques for introducing parallelism into computer
hardware, replication and pipelining; pipelining can be considered as
replication which has been made possible through sequence, as each
component of replication in a pipeline follows another in time. Pipelined
operations are performed by overlapping their simpler component operations
using parallel hardware. This is performed such that at any given time,
component parts of a sequence of operations are being processed in the
pipeline. In this way a single operation will share the pipeline with a number
of other operations as it progresses through the various stages.
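A worked illustration with generic figures (not those of any machine described here): if an operation is split into s stages, each taking one clock period τ, then a stream of n operations completes in

    t = (s + n - 1)τ

compared with n s τ without overlap, a speedup of n s/(s + n - 1), which approaches s for long streams but is only about s/2 when n = s.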
The fundamental difference between pipelining and spatial replication
is that the parallel component operations of a pipeline are quite likely to
perform different tasks, which when performed in sequence make up the
operation required. There is obviously a limit to the parallelism available by
splitting an operation into subtasks in this way, unless the operations are
extremely complex. Although complete programs are complex, and the use
of pipes of concurrent tasks as a style of programming can be very attractive,
such large pipelines are very application-specific. Thus for general use
pipelining can only provide a limited degree of parallelism, by exploiting
commonly used complex operations such as floating-point arithmetic.
Pipelining is, however, the most attractive form of parallelism available,
because pipelining does not create the same communications problems found
using spatial replication. A pipeline is designed to reflect the natural data
flow of the operation being performed, whereas spatial replication will utilise
either a fixed network or a programmable connection network. In the first
case the network may not necessarily reflect the data flow required in the
processor, the product of the two to some extent determining the power of
the overall system;
(b) the complexity of the switching networks, which will determine the
flexibility of the system and hence whether the power obtained by replication
can be utilised by a large class of problems;
(c) the distribution of the control to the system, i.e. whether the whole
array is controlled by a central control processor, or whether each processor
has its own controller; and
(d) the form of the control to the system, which may be derived from the
flow of control through a predefined instruction sequence or a control
structure more suited to declarative programming styles, such as data flow
or reduction.
It should be noted that in this model the processors may not have a simple
atomic structure, but may themselves contain replication or concurrency, as
in the case of the replicated pipeline structures referred to above.
(i) Data flow
In data flow, an instruction does not execute under the influence of a program
counter, but instead is able to execute if and only if all of its operands are
available. A dataflow program can therefore be considered as a directed
graph, along which data tokens flow, with the output from an operation
being connected with an arc of the graph to the operations that consume its
result. Instructions in a physical machine would generally be represented as
packets containing operations, operands (as data or references) and tag fields
giving meaning to this data. The latter is required because the state of the
machine can no longer provide a context from which the interpretation of
the data may be derived. During execution these tags would be matched with
result packets, containing tag and data fields, and when all operands to a
given instruction have been matched, the instruction can be queued for
immediate execution.
To illustrate this consider the program and execution for the expression
given below
(A + B)*(C + D)
The program, comprising a number of instruction packets, is ‘loaded’ into
the system by injecting those packets into a program memory. The program
execution would be initiated by injecting data packets, containing values for
A, B, C and D into the system. One implementation of the variables A to D
would be to assign unique tags to them which would be stored in the
instruction packets. The data would then need to be similarly tagged. A
matching unit could then match the tags of the data to the corresponding
instructions as the data circulated through the system. For example, the two
addition operations would attract the associated data and then become
available for execution, possibly in parallel. There is scope for pipelining and
replication in data flow computers.
Pipelining can be introduced in the flow of data packets (sequence of
operations) through the system. Replication can be achieved by sharing the
instruction packets between processors. If this latter form of parallelism is
exploited, then some equitable means of sharing the program packets and
associated tags between the processors is required.
Only when these two addition operations have been completed, generating
values for the bracketed sub-expressions, will the multiplication operation be
able to execute. It can be seen that the execution strategy is data driven and
commences from the innermost level of nesting of an expression and proceeds
outwards. Obviously, in a realistic program, the data flow graph or program
will be very much more complex than this simple example. However, this
example is sufficient to illustrate the notion of asynchronous parallelism being
totally controlled by the data-driven mechanism. Because all data dependencies
have been resolved within the graph description of the program, no explicit
parallel declarations are required to allow data flow programs to run on
multiple processors. Programs must be decomposed, however, in ways from
which parallelism may be extracted. A simple list recursion, for example,
would produce a sequential algorithm, whereas a recursive dividing of the
list, expressing the function to both halves, would generate an algorithm
containing parallelism. This recursive halving is the basis for many common
algorithms, for example quick sort, and is a classical expression for generating
implicit parallelism.
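The data-driven firing rule can be made concrete with a small illustrative simulation (hypothetical FORTRAN, not a description of any particular machine): each instruction packet fires only when both of its operand tokens have arrived, and its result token is forwarded to the consuming packet.

      PROGRAM DFLOW
C     Illustrative simulation of data-driven execution of (A + B)*(C + D):
C     two add packets feed one multiply packet.
      INTEGER NI, I, NDONE
      PARAMETER (NI = 3)
      CHARACTER OP(NI)
      REAL OPND(2,NI), RESULT(NI)
      LOGICAL AVAIL(2,NI), DONE(NI)
      INTEGER DEST(NI), SLOT(NI)
      DATA OP   /'+', '+', '*'/
      DATA DEST /3, 3, 0/
      DATA SLOT /1, 2, 0/
      DATA DONE /3*.FALSE./
      DATA AVAIL /6*.FALSE./
C     Inject the data tokens A, B, C, D.
      OPND(1,1) = 1.0
      OPND(2,1) = 2.0
      OPND(1,2) = 3.0
      OPND(2,2) = 4.0
      AVAIL(1,1) = .TRUE.
      AVAIL(2,1) = .TRUE.
      AVAIL(1,2) = .TRUE.
      AVAIL(2,2) = .TRUE.
      NDONE = 0
C     Keep scanning and firing until every packet has executed.
   10 CONTINUE
      DO 20 I = 1, NI
         IF (.NOT.DONE(I) .AND. AVAIL(1,I) .AND. AVAIL(2,I)) THEN
            IF (OP(I) .EQ. '+') RESULT(I) = OPND(1,I) + OPND(2,I)
            IF (OP(I) .EQ. '*') RESULT(I) = OPND(1,I)*OPND(2,I)
            DONE(I) = .TRUE.
            NDONE = NDONE + 1
C           Send the result token to the consuming packet, if any.
            IF (DEST(I) .NE. 0) THEN
               OPND(SLOT(I),DEST(I)) = RESULT(I)
               AVAIL(SLOT(I),DEST(I)) = .TRUE.
            END IF
         END IF
   20 CONTINUE
      IF (NDONE .LT. NI) GO TO 10
      PRINT *, '(A+B)*(C+D) = ', RESULT(3)
      END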
(ii) Reduction
Reduction as a means of computer control can also yield parallelism without
explicit control. Reduction is based on the mathematics of functions and
lambda calculus, and using this formalism programs can be considered as
expression strings or as parse trees. For example, a program could be
represented by the following expression, either as a string, or in structured
form as a parse tree for the expression, with the operator at the root and
two subtrees containing ‘+’ operators and the operands A and B and C and
D respectively.
((A + B)*(C + D))
Whereas in data flow execution is data driven, in a reduction strategy
execution is demand driven. Thus if this program was entered into the system,
or activated by a request for its result from a larger program, then a series
of rewrites would take place, reducing this expression to its component
operations. A rewrite is the procedure of taking an expression or tree and
reducing that expression, performing the operation if leaves are known, or
by activating its sub-expression or subtrees if they are not. The term
‘reduction’ is perhaps misleading, for early operations in this sequence will
generate more programs for execution, as new subtrees or sub-expressions
are activated.
Each subtree may of course be distributed on concurrent hardware for
evaluation in parallel. At some later stage the program has been reduced to
its component operations, which in a similar manner to data flow systems
can be represented by tagged packets. It can be seen that reduction approaches
the evaluation of an expression from the outermost nesting and works
inwards, generating work as it proceeds.
A research group at Imperial College has been building an architecture
for the implementation of functional languages by reduction (Darlington
and Reeve 1981). This machine is based on transputers, and a few prototype
machines were delivered by ICL, one to Imperial College in 1986. Another
practical implementation of a reduction architecture is being funded by the
Alvey program at University College, London (Peyton-Jones 1987a, b). For
further reading concerning recent practical work on data flow, reduction and
other advances in parallel processing see Chamber et al (1984) and Jesshope
(1987b).
Despite the fact that both of the above methods of computer control
generate parallelism without explicit command, they can also suffer from
inefficiencies when compared with the more exploited control flow strategy.
In both cases a substantial amount of computation can be required in
organisation. There may be tens or even hundreds of instructions executed
for every useful instruction (e.g. floating-point operation in a number-
crunching application). This is very inefficient when compared to the highly
optimised control flow computers which have been developed over the last
three to four decades of von Neumann computing. However, these architectures
attempt to increase the level of abstraction of the programming model towards
one in which the computer executes the specification of a problem. An analogy
to this situation would be to compare programming in assembler and a
high-level language, where the latter should not be compared with the former
in terms of efficiency, unless programming efficiency is also considered. In
this case, there have been shifts towards architectures for executing high-level
languages (Organick 1973) which have minimised any loss of efficiency paid
for the higher level of abstraction.
In the same way, as research continues in the field of declarative systems,
architectures will become more efficient as refinements are made in the
hardware implementations and in compiler technology to exploit these
improvements. Indeed, in recent presentations on data flow research at
Manchester (Gurd 1987), results were presented which show the Manchester
data flow machine comparing very favourably with conventional architectures.
Functional language implementations are likely to follow this development
path but are currently some five years behind the development stage of data flow.
One fundamental limitation in these architectures is that of communication
bandwidth between processors, but this is shared with all replicated systems.
This problem grows with the size of the replicated system and because data
flow and reduction machines effectively require the distribution and
communication of programs as well as data, the communications bandwidth
requirements are greater and are likely to be more of a limitation if these
strategies are adopted. For example, data packets of around 100 bits are
common in such architectures, even for 16-bit operations. In control flow
only one of a pair of operands need ever be communicated through a
communication network. In data flow and reduction, many of these large
packets may need to be communicated in order to obtain a single useful
operation. Latency or pipelining is an effective tool in combating communication
complexity, but for an effective system there must be a balance of load between
communication and processing and, as indicated above, this load balancing
is biased against communication as the degree of replication increases.
Generally the most efficient replicated systems either communicate data
or programs between processors, whichever requires the least bandwidth.
For example, if two processors need to enter into a sustained communication
with each other, it may well be appropriate for their code to gravitate
together, rather than for data to be passed between them.
In SIMD machines of course it is only necessary to pass data.
The simplest of all processors use the bit-serial approach and many
thousands of these may be combined very cheaply, by exploiting VLSI
technologies. In this way the processing power is spread very thinly over a
single bit slice of the data, which must be highly parallel. It can be seen that
given the parallelism, this approach provides very efficient use of the hardware
for all forms of data: boolean, character, integer and floating-point. It can
also be shown that, for a given number of logic elements, this approach
provides the maximum computational power at least for simple operations.
Consider as a simple problem the addition of a number of pairs of b-bit
numbers (N say, where N ≥ b), given b 1-bit full adders with delay time t_fa.
Figure 3.2 gives the truth table for a full adder and also shows an
implementation.
One solution to the problem is to link the full adders in a chain, which
then adds all bits within a word. This is sometimes called a parallel adder
FIGURE 3.2 Truth table and circuit diagram for a full adder, constructed
from exclusive-OR and NAND gates. The inputs are A, B and a carry-in
signal Cin and the outputs are S (the sum) and Cout (the carry-out). The truth
table also shows the signals P and G, functions of A and B only, indicating
that a carry is propagated at this position or generated at this position.
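The truth table can be captured directly in a few lines of C. The sketch below is our own illustration, not taken from the text; the function names are ours, and P is taken here as the exclusive-OR of A and B (some designs use the inclusive-OR instead).

#include <stdio.h>

/* One-bit full adder expressed as boolean functions of A, B and carry-in.
   P (propagate) and G (generate) depend on A and B only, as in figure 3.2. */
static int sum_bit(int a, int b, int cin)   { return a ^ b ^ cin; }
static int carry_out(int a, int b, int cin) { return (a & b) | (cin & (a ^ b)); }
static int propagate(int a, int b)          { return a ^ b; }  /* carry passes through */
static int generate(int a, int b)           { return a & b; }  /* carry created here   */

int main(void)
{
    printf("A B Cin | S Cout P G\n");
    for (int a = 0; a < 2; a++)
        for (int b = 0; b < 2; b++)
            for (int cin = 0; cin < 2; cin++)
                printf("%d %d  %d  | %d  %d   %d %d\n", a, b, cin,
                       sum_bit(a, b, cin), carry_out(a, b, cin),
                       propagate(a, b), generate(a, b));
    return 0;
}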
The bit-slice approach uses each 1-bit adder as an independent unit, which
adds pairs from different words in parallel. The carry out from one step must
be held in a register, for input at the next step. This is illustrated in figure
3.4 and the time to add each bit slice, t_b, is given simply by:
(3.3)
FIGURE 3.4 The bit-slice adder: memory holds one word per row (bits b-1 down to bit 0), and each full adder takes its inputs X and Y from a single bit slice of all the words, producing the sum S.
where the second term is the time taken to catch the carry signal. The total time to complete
the problem using the bit-slice approach is given by
(3.4)
(3.5)
and (3.6)
Given these, a carry-out state may also be defined using the carry in:
(3.7)
Figure 3.5 shows the layout of a carry look-ahead unit based on equations
(3.6) and (3.7). This unit may be incorporated into a tree-like structure, to
give the carry look-ahead adder. This is illustrated for an eight-bit word in
figure 3.6. It can be seen that in general, b full adders and b - 1 carry
look-ahead units are required. The delay for this circuit for b > 2 is given by
the time to obtain the most significant sum bit, which is given by:
(3.8)
Each carry look-ahead unit has as outputs a modified propagate and generate signal (which can
be used to cascade the circuit) and a carry-out signal C'.
(3.9)
Here the delays in the carry look-ahead unit (t_cl) have been equated to the
full adder delays t_fa.
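The recurrences behind the carry look-ahead unit can be sketched in C as follows. The bit-level and block-combining rules shown are the standard ones; the exact form of equations (3.5)-(3.9) is not reproduced here, so the variable names and program structure are our own.

#include <stdio.h>

/* Bit level:   P_i = A_i xor B_i,  G_i = A_i and B_i.
   Carry:       C_(i+1) = G_i or (P_i and C_i).
   Two adjacent blocks (lo, hi) combine into one wider block:
                P' = P_hi and P_lo,  G' = G_hi or (P_hi and G_lo),
   which is the rule that allows the tree of look-ahead units in figure 3.6. */

struct pg { int p, g; };

static struct pg combine(struct pg hi, struct pg lo)
{
    struct pg out = { hi.p & lo.p, hi.g | (hi.p & lo.g) };
    return out;
}

/* Reference adder using only the bit-level P, G and the carry recurrence. */
static unsigned add_pg(unsigned a, unsigned b, int bits)
{
    unsigned sum = 0, carry = 0;
    for (int i = 0; i < bits; i++) {
        unsigned p = ((a >> i) ^ (b >> i)) & 1u;
        unsigned g = ((a >> i) & (b >> i)) & 1u;
        sum  |= (p ^ carry) << i;
        carry = g | (p & carry);        /* C_(i+1) = G_i | (P_i & C_i) */
    }
    return sum;
}

int main(void)
{
    struct pg lo = {1, 0}, hi = {0, 1};       /* example block signals */
    struct pg w  = combine(hi, lo);
    printf("combined block: P=%d G=%d\n", w.p, w.g);
    printf("13 + 29 = %u\n", add_pg(13, 29, 8));
    return 0;
}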
Similar techniques can be applied to obtain fast multiplication (Waser
1978) but again there are penalties to pay in the increased number of logic
gates required. Thus we have shown techniques that are available to build
faster but more complex functional units, although the relative cost increases
and the efficiency can never match the bit-slice approach (unless t_M >> t_fa).
This does not imply that all hardware should employ the bit-slice approach,
as for single scalar operations the more complex hardware must be used.
Thus the distribution of processing power will depend on several factors, of
which the cost-performance ratio and expected parallelism of the workload
are most prominent.
3.3.1 Introduction
The theory and construction of switching networks are fundamental to the
success of large-scale parallelism, which has now become feasible through
the exploitation of VLSI technology. Replication on a large scale, as described
in §3.2.1, is not viable unless connections can be established, either between
processors or between processors and memory, in a programmable manner.
Such connections can be established using switching networks—a collection
of switches, with a given connection topology. Much of the early theory
concerning switching networks was motivated by the needs of the telephone
industry (Clos 1953, Benes 1965 ) but the convergence of this and the computer
industry, in computer networks, digital telephone exchanges and now in
parallel computers, has led to much more interest in this area. This section
explores switching network architecture, with particular emphasis on large-
scale parallel processing. Further reading in this same area can be found in
a recent book by Siegel (1985).
Switching networks provide a set of interconnections or mappings between
two sets of nodes, the inputs and the outputs. For N inputs and M outputs
there are N^M well defined mappings from inputs to outputs, where by well
defined we mean that each output is defined in terms of one and only one
input. Figure 3.7 illustrates this by giving all well defined mappings between
3 inputs and 2 outputs. A network performing all N^M such mappings we will
call a generalised connection network (GCN).
FIGURE 3.7 All possible mappings from three inputs to two outputs:
(a) one-to-many; (b) one-to-one.
FIGURE 3.8 Two representations of the crossbar switch from four inputs to
four outputs.
(3.10)
FIGURE 3.9 Exchange permutations.
(3.11)
The bar denotes the complement of a given bit. Thus the kth exchange
permutation can be defined by complementing the kth bit of the binary
representation of x.
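In C the kth exchange permutation is a single exclusive-OR with a mask. The fragment below is our own illustration of the definition just given; the function name is ours.

#include <stdio.h>

/* kth exchange permutation on an address x: complement bit k. */
static unsigned exchange(unsigned x, int k)
{
    return x ^ (1u << k);
}

int main(void)
{
    for (unsigned x = 0; x < 8; x++)                       /* n = 3 */
        printf("E(0): %u -> %u    E(2): %u -> %u\n",
               x, exchange(x, 0), x, exchange(x, 2));
    return 0;
}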
(3.13)
(3.14)
These are also illustrated in figure 3.10 for n = 3 and k = 2. Clearly
and
It can also be seen from figure 3.10 that the subshuffles (least significant
bits) treat the set as a number of subsets, performing the perfect shuffle on
each. The supershuffles, however, shuffle the whole set, but increase the width
of the data shuffled.
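The perfect shuffle and the subshuffle can be written in the same style. The code below is our own sketch of the definitions described above (the supershuffle is omitted for brevity, and the function names are ours).

#include <stdio.h>

/* Perfect shuffle on n-bit addresses: a circular left shift of the bits. */
static unsigned shuffle(unsigned x, int n)
{
    unsigned msb = (x >> (n - 1)) & 1u;
    return ((x << 1) | msb) & ((1u << n) - 1u);
}

/* Subshuffle: apply the perfect shuffle to the k least significant bits only,
   so the 2^n addresses are treated as 2^(n-k) independent subsets. */
static unsigned subshuffle(unsigned x, int k)
{
    unsigned low = (1u << k) - 1u;
    return (x & ~low) | shuffle(x & low, k);
}

int main(void)
{
    int n = 3, k = 2;
    for (unsigned x = 0; x < (1u << n); x++)
        printf("PS: %u -> %u    sub-PS(k=%d): %u -> %u\n",
               x, shuffle(x, n), k, x, subshuffle(x, k));
    return 0;
}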
(3.17)
The sub- and superbutterfly are also illustrated in figure 3.11, for n = 3 and
k = 2. Again
and
(3.19)
(3.20)
Figure 3.11 also illustrates the bit reversal for n = 3. However, the two
permutations are not always equivalent, as will be shown later when we
consider the algebra of permutations.
(3.22)
(3.23)
These are illustrated in figure 3.12 for n = 3 and k = 2.
Since P(l) is the identity permutation (equation 3.28c) and ft{k) is its own
inverse (equation 3.26c),
This network is simple and cheap to construct but is not very suitable for a
large number of processors. However, it may be generalised to more than
one dimension. For example, if P = Q^k and Q = 2^q, then the k-dimensional
nearest-neighbour network can be defined as follows:
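As an illustration of this definition for k = 2, the following C sketch (our own; the grid size and function names are illustrative) gives the four neighbours of each processor in a Q x Q grid with wraparound.

#include <stdio.h>

#define Q 4                         /* Q processors per dimension, P = Q*Q in all */

/* The four nearest-neighbour permutations on a Q x Q grid with wraparound:
   a shift of +1 or -1, modulo Q, in each of the two dimensions. */
static int shift_east(int p)  { return (p / Q) * Q + (p + 1) % Q; }
static int shift_west(int p)  { return (p / Q) * Q + (p + Q - 1) % Q; }
static int shift_south(int p) { return (p + Q) % (Q * Q); }
static int shift_north(int p) { return (p + Q * Q - Q) % (Q * Q); }

int main(void)
{
    for (int p = 0; p < Q * Q; p++)
        printf("pe %2d: N=%2d S=%2d E=%2d W=%2d\n", p,
               shift_north(p), shift_south(p), shift_east(p), shift_west(p));
    return 0;
}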
The permutations here can be considered as shifting north, south, east and
west over a two-dimensional grid, with wraparound at the edges. The
FIGURE 3.14 The nearest-neighbour network NN(1).
It can be seen that this is not a very satisfactory switch, as it leaves four
disconnected subsets of processors. Because of this the perfect shuffle exchange
FIGURE 3.16 Perfect shuffle network PS(1).
FIGURE 3.17 Perfect shuffle exchange PSE(1) and perfect shuffle
nearest-neighbour PSNN(1) networks; broken lines convert PSE(1) to
PSNN(1).
TABLE 3.1 Maximum distance D(k) and fan out F(k) for the single-stage networks.
the PSE and PSNN networks, however, the algorithm is best considered in
terms of the binary representation of the address or identifier of each
processor. The algorithm is illustrated schematically in figure 3.19, showing
data being propagated to all processors from processor number 0. Consider
first the PSE network: the shuffle and exchange permutations are described
in §3.3.2 by equations (3.11) and (3.12). These equations complement the
least significant bit of the address of a processor and perform a circular left
shift on the address of a processor respectively. Thus the problem can be
more easily specified as one of generating the addresses of all processors from
the source address by complementing the least significant bit of the address
and left-circular-shifting the bits of the address.
An intuitive lower bound can be derived for the number of steps that are
required to complete this operation. With the operations that are available,
the address that is ‘farthest’ from the source address is that which is its
complement. To complement an n-bit address requires n complement
operations (least significant bit only) and n — 1 shift operations, giving a total
of 2 n — 1 operations. Figure 3.19 illustrates that by suitable masking all
addresses can be generated while complementing the source address.
Implementing this algorithm on the PSE network, only one half of the
exchanged data is ever actually used, either the left shift or right shift,
depending on the parity of the source of the data. Therefore, exactly the same
algorithm can be used in the PSNN network. The odd and even masked
exchanges can be simulated in the same time using the left or right shifts of
the NN network. The algorithm is illustrated in a different form in figure 3.20
and the above results for the fan out process are summarised in table 3.1.
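The fan out can be simulated in a few lines of C. The sketch below is our own illustration of the algorithm described: it alternates exchange and shuffle steps starting from processor 0 and confirms that all 2^n processors are reached in 2n - 1 steps (the set representation is ours).

#include <stdio.h>

#define N 4                                   /* n = 4 address bits, 16 processors */

/* Perfect shuffle: circular left shift of the N-bit address. */
static unsigned shuffle(unsigned x)
{
    return ((x << 1) | (x >> (N - 1))) & ((1u << N) - 1u);
}

/* Exchange: complement the least significant address bit. */
static unsigned exchange(unsigned x)
{
    return x ^ 1u;
}

int main(void)
{
    const unsigned nproc = 1u << N;
    const unsigned all   = (1u << nproc) - 1u;   /* bit p set = processor p has the data */
    unsigned have = 1u;                          /* start with processor 0 only          */
    int steps = 0;

    while (have != all) {
        unsigned next = have;
        for (unsigned p = 0; p < nproc; p++)
            if (have & (1u << p))
                next |= 1u << ((steps % 2 == 0) ? exchange(p) : shuffle(p));
        have = next;
        steps++;
    }
    printf("all %u processors reached in %d steps (2n - 1 = %d)\n",
           nproc, steps, 2 * N - 1);
    return 0;
}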
FIGURE 3.20 A diagram showing the fan out operation using the shuffle-exchange
network. The binary and decimal addresses of the processors are shown at the top
of the table and the operations performed are shown on the right of the table. The
crosses represent the propagation of the information from address 0.
TABLE 3.2 Maximum distance D(k) in a 4096 processor array. The cost
function C is shown in parentheses.
that no network is any better than any other. Indeed it can easily be shown,
using the algebra of permutations in §3.3.3, that all four networks are
equivalent; the difference in cost reflects the increased bandwidth from
multiple equivalent permutations.
It is perhaps not too surprising that the most common network found in
processor arrays is the two-dimensional NN network as in the ILLIAC IV
(McIntyre 1970), ICL DAP (Flanders et al 1977), Goodyear MPP (Batcher
1980), GEC GRID (Robinson and Moore 1982), NTT AAP (Kondo et al
1983), Linköping University LIPP (Ericsson and Danielson 1983), NCR
processor needs access to every memory bank in the system. To provide such
access a full connection network is required.
We have already met the full crossbar network (figure 3.8), which is a full
connection network, and also encountered its major disadvantage— that the
number of gates required to implement it grows as the square of the number
of inputs. To put this in perspective, consider it as an alternative to
the single-stage networks considered in the previous section to connect 4096
processors. This would require at least 16 million transistors but probably
many times more than this for a switch of reasonable performance. The two-
dimensional NN switch on the other hand could be implemented in a mere
25 000 transistors.
Multistage networks can also provide a cheaper alternative to the complete
crossbar switch, when a full connection network is required. These networks
are based on a number of interconnected crossbar switches, the most common
being built up from the 2 x 2 crossbar switch. This switch is illustrated in
figure 3.21 and can generate two permutations and a further two broadcast
mappings.
If one bit is used to control this switch then only the permutations in
figure 3.21(a, b) will be selected. Using an array of N/2 such switches we
can define the kth-order exchange switch, a single-stage network, which
requires N/2 control bits as:
(3.29)
Alternatively if two control bits are used for each 2 x 2 crossbar, then all
four mappings in figure 3.21 can be generated. Thus we can extend equation
(3.29) to give the kth-order generalised exchange switch, which requires N
control bits:
(3.30)
Here l(k) and u(k) are the lower and upper broadcast mappings which are
defined by:
(3.31)
Obviously this reduction can be continued recursively giving:
(3.32)
This can be modularised using the relationship in equation (3.29) giving:
Finally this can be simplified, by noting that the pre- and post-permutation
of the inputs and outputs of a switch only redefine the order of these sets.
Thus we obtain an expression for the binary Benes network (Lenfant 1978):
(3.33)
This is illustrated in figure 3.22(b). By definition this is a full connection
network giving N! possible permutations; moreover, if we generalise this
using the GE(1) switch, then all N^N well defined mappings can be established.
It can be seen that the banyan switch is only the binary n-cube without the
final unshuffle and that the R network is derived from the first half of the
binary Benes switch, followed by a shuffle. Another interesting point which
has been proved by Parker (1980) is the relationship between Q, C and R,
which gives the following identity:
The omega and binary n-cube networks are illustrated in figure 3.23.
Although these switches are not full connection networks, they do provide
a very rich class of permutations, suitable for many applications on
multiprocessor systems. Apart from mesh and matrix manipulation, they
are highly suited to the FFT and related algorithms (Pease 1969) and to sorting
model to devolve the control of the switch to the processor, so that it may
be set from local state. The RPA is described in some detail in §3.5.4.
For the crossbar switch, with N inputs, N^2 control bits are required, one
four banks of parallel memory. It can be seen that the rows of A may be
accessed without conflict but that data in a column of A all reside within
the same memory bank. There is no way therefore that this data may be
accessed in parallel.
With a switching network that can connect all memory banks to a given
processor, it is very desirable to have a memory or data structure in which
rows, columns and other principal substructures from arrays may be accessed
without conflict. One method of achieving this is through skewed storage
schemes. Figure 3.25 shows one such scheme for our 4 x 4 matrix. This allows
(3.34)
where ⌊f⌋ gives the integer floor function of f and |f|_g is the value of f modulo g.
Table 3.3 gives these address mappings for our 4 x 4 example and figure 3.26
illustrates this. It is perhaps a bad example as one of the principal sets of
subarrays, the forward diagonals, causes memory conflict. For an N x N
matrix the forward diagonal has a skip distance of N + 1, which in our
example is 5, the same as the number of memory banks. However, all other
linear subarrays can be accessed without conflict; these include rows, columns
and backward diagonals. As an example consider the access of row 2 of this
matrix. The start address is 1 and the skip distance is 4, which define the
linear address of each element of this row. Thus:
and correspondingly
TABLE 3.3 Address mappings for the 4 x 4 example.

Element (i,j)   Linear address   Memory bank   Address in bank
1,1                  0                0               0
2,1                  1                1               0
3,1                  2                2               0
4,1                  3                3               0
1,2                  4                4               1
2,2                  5                0               1
3,2                  6                1               1
4,2                  7                2               1
1,3                  8                3               2
2,3                  9                4               2
3,3                 10                0               2
4,3                 11                1               2
1,4                 12                2               3
2,4                 13                3               3
3,4                 14                4               3
4,4                 15                0               3
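A short C sketch makes the scheme concrete. We assume the common skewing rule suggested by table 3.3, in which linear address a of the column-major ordering maps to memory bank |a|_M, at word ⌊a/N⌋ within that bank, with M = N + 1 = 5 banks; this is our reading rather than a quotation of equation (3.34). The program prints the banks touched by the row, column and diagonal accesses discussed above.

#include <stdio.h>

#define N 4                 /* matrix is N x N                  */
#define M (N + 1)           /* number of memory banks (here 5)  */

/* Column-major linear address of element (i,j), i,j = 1..N, as in table 3.3. */
static int addr(int i, int j) { return (i - 1) + N * (j - 1); }

/* Skewed storage (assumed scheme): element at linear address a lives in
   bank a mod M; the word within the bank would be floor(a/N). */
static int bank(int a)        { return a % M; }

static void show(const char *name, int start, int skip)
{
    printf("%-18s banks:", name);
    for (int k = 0; k < N; k++)
        printf(" %d", bank(start + k * skip));
    printf("\n");
}

int main(void)
{
    show("row 2",             addr(2, 1), N);       /* start 1,  skip 4           */
    show("column 3",          addr(1, 3), 1);       /* start 8,  skip 1           */
    show("backward diagonal", addr(4, 1), N - 1);   /* start 3,  skip 3           */
    show("forward diagonal",  addr(1, 1), N + 1);   /* start 0,  skip 5: conflict */
    return 0;
}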
that were current at that time, when logic was relatively very fast and
expensive, and storage was relatively slow and inexpensive. These constraints
resulted in the word-organised processor-memory structure (the so called
von Neumann architecture), which kept the expensive parts busy (i.e. the
processors built from vacuum tubes).
In modern computer design, both memory and logic are now cheap and
the critical component has become the wire or interconnect. Many hundreds
of thousands of transistors are routinely designed into custom VLSI circuits
and will probably be pushing into the millions in the near future. Certainly
circuits with in excess of one million transistors have already been fabricated
(see Chapter 6). Because memory and logic can now be made using the same
technology, the speed and costs are comparable; indeed it is very desirable
to integrate memory and logic onto the same chip. These new technological
factors require architectures with an equal balance between processing and
memory and, moreover, require the components separated in the von
Neumann architecture to be much more closely integrated.
The processor array, and to some extent the multiprocessor system, has
taken an alternative evolutionary path to the conventional von Neumann
processor in that many processors cooperate on a single problem, each
accessing data from its own memory. It should be noted that with a full
permutation switch between processors and memories, the notion of ownership
may take on a very temporary nature! The grid-connected SIMD computer
(ii) SOLOMON
The grid-connected computer was further consolidated in the SOLOMON
computer design (Slotnick et al 1962, Gregory and McReynolds 1963) which
was a 32 x 32 array of PEs, and although this was also never built, it was the
precursor to the ILLIAC (University of Illinois Advanced Computer) range
of designs, which culminated in the ILLIAC IV and Burroughs BSP designs
(see §§1.1.4 and 3.4.3). The SOLOMON design also had a major influence
on the ICL DAP (see §3.4.2) and was perhaps notable for introducing another
control concept into the SIMD processor array, that of mode control (or
activity as it would be called today). The mode of a PE was a single-bit flag,
which could be set with reference to local data and then used to determine
the action of later instructions. In particular it could be used to inhibit storage
of results in the array where not set, and thus provide a local conditional
operation.
This mode or activity control is equivalent to the following construct:
FOR ANY PROCESSOR
IF MODE
THEN ACTION 1
This parallel IF-THEN construct is sometimes written
WHERE MODE
THEN ACTION 1
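The effect of mode (activity) control can be mimicked in C as below: every PE evaluates the operation, but only those with the mode bit set store the result. The array size, the operation and all names are our own choices for illustration.

#include <stdio.h>

#define NPE 8    /* number of processing elements */

/* SIMD 'where' construct: every PE executes the operation, but the result
   is stored only where the local mode (activity) bit is set. */
static void where_add(const int mode[], int z[], const int x[], const int y[])
{
    for (int p = 0; p < NPE; p++) {          /* conceptually all PEs in parallel */
        int result = x[p] + y[p];
        if (mode[p])                         /* inhibit storage where mode = 0  */
            z[p] = result;
    }
}

int main(void)
{
    int x[NPE] = {1, 2, 3, 4, 5, 6, 7, 8};
    int y[NPE] = {10, 10, 10, 10, 10, 10, 10, 10};
    int z[NPE] = {0};
    int mode[NPE];

    for (int p = 0; p < NPE; p++)            /* e.g. set mode where x is even */
        mode[p] = (x[p] % 2 == 0);

    where_add(mode, z, x, y);
    for (int p = 0; p < NPE; p++)
        printf("z[%d] = %d\n", p, z[p]);
    return 0;
}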
on the ILLIAC IV project. Although only one BSP machine was ever built,
it is described in §3.4.3 as it represented the state of the art in array processor
design at that time. However, it had to compete commercially with established
vector processors such as the CRAY-1. To do this it should have been able
to outperform the CRAY-1 on maximum performance, but it gave only a
fraction of the CRAY's peak performance. The fact that, unlike the CRAY,
the BSP was designed to give a high fraction of its maximum performance
on a wide range of problems when programmed in FORTRAN, did not seem
to cut any ice with the end users or their representatives. There is a moral
here perhaps.
LSI DAPs. However, we will describe the first production version here. For
the interested reader the pilot machine was first described in Reddaway (1973)
and was evaluated on several applications in Flanders et al (1977), and the
LSI DAPs are considered later in this section.
The main-frame DAP was constructed in units of 16 processors and
associated memory on one 12 in x 7 in circuit board. This contained around
80 16-pin TTL integrated circuits, with typical levels of integration being 10-40
gates per chip. A single gate delay of 5 ns gives an overall clock cycle of
200 ns, including memory access. All memory for one processor is provided
by a single chip, initially a 4K static MOS device. Thus 256 boards comprise
the array section of the DAP, which together with control unit boards and
host store access control are housed in a single air-cooled cabinet, occupying
some 20-30 ft2 of floor space. This is illustrated in figure 3.27. The cost of
the complete DAP system was about £500000, in addition to the cost of the
2900 host computer. However, it should be noted that the extra memory
provided by the DAP would have cost a significant part of this figure in any
case.
Figure 3.28 illustrates a typical 2900 system, which consists of an order code
processor and a store access controller, both cross connected with a number
FIGURE 3.27 The DAP array memory and access unit (courtesy of
ICL).
292 MULTIPROCESSORS AND PROCESSOR ARRAYS
of memory units. One or more of these memory units may be a DAP, which
provides memory in the conventional way and may also be instructed by the
order code processor to execute its own DAP code. The memory may still
be used while the DAP is processing, by stealing unused memory cycles.
Protection of DAP code and data is obtained by giving various access
permissions. For example, read only and read execute segments may be
defined in the DAP to protect data and code.
The store access controller provides memory access to peripherals and
also provides a block transfer facility between memory units. Thus if the
DAP is considered as the number crunching core of the system, then
conventional store can be considered as fast backing store to the DAP, with
the facility for pre- and post-processing by the order code processor.
Figure 3.29 identifies the major components and data highways in the DAP.
Interface to the 2900 system is provided by the DAP access controller and
the column highway, which has one bit for each column of processors in the
DAP array. Thus one 2900 64-bit word corresponds to a row across the
DAP memory. Incrementing the 2900 address first increments down the
columns of the DAP array and then through the 4K DAP address space.
The column highway also provides a path between rows of the DAP array
and the MCU registers, which can be used for data and/or instruction
modification. Finally, the column highway provides the path for the MCU to
fetch DAP instructions from the DAP store. DAP instructions are stored
two per store row and one row is fetched from memory in one clock cycle.
FIGURE 3.30 A simplified diagram of the ICL DAP processing
element.
This family of instructions takes 1.5 cycles to execute and during this time
reads an operand and writes the sum bit back to the same location (most
instructions take only one cycle to execute, see below), thus saving half a
cycle over an accumulator addition followed by a store accumulator
instruction.
Parity processing elements (not illustrated in the figures) are incorporated
into the design. These check both memory and logical functions (Hunt 1978).
Also when acting as 2900 memory, a full Hamming error code is maintained,
which gives single error correction and double error detection on every 64
bits of data read.
Instructions in the DAP are executed in two phases, the fetch and execute
cycles. Each of these cycles is 200 ns in the main-frame DAP. However when
an instruction appears after and within the scope of a special hardware DO
loop instruction, then the first phase which fetches the instruction will only
be performed once for all N passes of the scope of the loop. The DO loop
instruction has two data fields, a length field which indicates the scope of
the loop and a count field which may be modified and gives the number of
times the loop is to be executed. The maximum length of the loop is 60
instructions and the maximum loop count is 254. Within a loop, instructions
may have their addresses incremented or decremented by 1 on each pass.
The DO loop is essential for building up software to operate on words of
data. The rate of instruction execution is asymptotically one every clock
period in the loop, compared with one every 1.5 clock periods when
instructions have to be fetched for each execution (two instructions are fetched
in one fetch cycle).
Most DAP instructions have fields as illustrated in figure 3.31(a). The
operation code and inversion field effectively specify the instruction. The
inversion bit creates pairs of instructions which are identical, with the
exception that one of the inputs to the instructions is inverted. For example
QA and QAN are the related pair which load the Q register from the contents
of the A register. QAN inverts the input. Many DAP instructions have such
complementary pairs. The two other 1-bit fields specify whether the DAP
instruction is to have its address incremented or decremented within a DO
loop.
The two 3-bit fields specify MCU registers. The first is set if a data register
is required and the second if modification of the instruction is required. The
two remaining fields give either a store address or an effective shift address
in shift instructions. These fields can be modified by the MCU register specified
in the modifier field.
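For illustration, the fields just listed can be gathered into a C bit-field structure. The widths of the operation code and address fields, their ordering, and the member names are assumptions on our part, since figure 3.31(a) is not reproduced here.

#include <stdio.h>

/* Illustrative layout of a DAP instruction as described in the text;
   the true field order and widths of figure 3.31(a) are not shown here. */
struct dap_instruction {
    unsigned op        : 8;   /* operation code (width assumed)                   */
    unsigned invert    : 1;   /* inversion bit: operate on the inverted input     */
    unsigned increment : 1;   /* increment address on each DO-loop pass           */
    unsigned decrement : 1;   /* decrement address on each DO-loop pass           */
    unsigned data_reg  : 3;   /* MCU register supplying data                      */
    unsigned modifier  : 3;   /* MCU register used to modify the address          */
    unsigned address   : 15;  /* store address or effective shift (width assumed) */
};

int main(void)
{
    struct dap_instruction qa  = { .op = 1, .invert = 0, .data_reg = 2 };
    struct dap_instruction qan = qa;
    qan.invert = 1;                   /* QAN: same as QA but the input is inverted */
    printf("QA inverts input: %u, QAN inverts input: %u\n", qa.invert, qan.invert);
    return 0;
}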
Register-register                  Register-memory
1-bit addition                     1-bit addition to store
Full or half add,                  Full or half add,
sum to Q,                          sum to store,
carry to C                         carry to C

Vector addition
Operation           Time (µs)    Mop/s
Z <- X                  17         241
Z <- X*S             40-130      32-102
Z <- X**2              125          33
Z <- X + Y             150          27
Z <- SQRT(X)           170          24
Z <- X*Y               250          16
Z <- LOG(X)            285          14
Z <- X/Y               330          12
Z <- MAX(X,Y)           33         124
Z <- MOD(Z)              1        4096
IZ <- IX + IY           22         186
S <- SUM(X)            280         175
S <- MAX(X)             48          85
have organised the central BSP memory so that many regular array subsets
can be accessed in parallel without conflict.
Other lessons that Burroughs have learnt from their ILLIAC IV experiences
concern the organisation of the control processor, another weak point in the
design of the ILLIAC. Whereas the ILLIAC IV had a small buffer memory
and only limited processing power, the BSP provides a 256 Kword memory
and a more complex scalar processor in addition to the control processor.
This memory is used for code and data.
The details of the BSP described here are taken from the pre-production
prototype (Burroughs 1977a-d, Austin 1979), which was built and tested by
1980. Figure 3.35 shows the overall configuration of a BSP system.
The interface between the system manager and the BSP provides two data
highways, one slow and one fast. The slow highway (500 Kbyte/s) interfaces
the I/O processor directly with the BSP control unit and is used for passing
messages and control between the two systems. The second fast data highway
(* Mwords/s) interfaces the I/O processor with the file memory controller
and is used for passing code and data files for processing on the BSP. The
file memory provides a buffered interface between front-end and back-end
systems for high bandwidth communication.
The file memory is one of the three major components of the BSP. The
FIGURE 3.36 BSP control processor and array section, showing major
components and data paths.
other two, the control processor and processor array are illustrated in
figure 3.36 and described below.
The control processor portion of the BSP contained four asynchronous
units, which between them provided array control, job scheduling, I/O and
file memory management, error management and finally communication of
commands between the system manager and BSP.
The control processor had 256 Kwords of 160 ns cycle MOS memory. This
memory held both scalar and vector instructions and scalar data and had a
data highway to the file memory.
The scalar processing unit was a conventional register oriented processor,
which used identical hardware to that found in the 16 arithmetic units (AUs)
of the array. However it differed from the AUs in that it had its own instruction
processor, which read and decoded instructions stored in the control
processor memory.
The scalar processor had 16 48-bit general-purpose registers and was
clocked with a cycle of 80 ns. It performed both numeric and non-numeric
FIGURE 3.37 Format of the BSP high-level machine language and vector
descriptors: (top) FORTRAN code; (bottom) BSP vector form.
for both fetching and storing, reservations had to be made which are
illustrated by the broken lines. These areas must not be overlapped. The
second template in figure 3.38 has a space between fetching A and B ; this is
to accommodate a reservation from an execution of a previous template.
Returning to our example we can now see how the template control unit
selected the appropriate template and incremented the addresses. This is
shown in figure 3.39. We assume that the first template starts up the pipe.
The control unit will therefore select the first of the two alternative templates
shown in figure 3.38, as this minimises the number of clocks required. There
is no reservation required when fetching the next set of 16 operands, so the
same template is used for a second time. Note that the arithmetic unit is the
critical resource here. All successive templates are of the second variety, which
allows for the reservation required by the store cycle. It can be seen that after
this point all memory and all arithmetic unit cycles are being used.
Figure 3.39 also illustrates a common design constraint when overlapping
input and output to memory. It can be seen that although all memory and
arithmetic unit cycles are being used, the routing networks are both
under-utilised. Although one cycle is used for each memory cycle, routing
and memory operations cannot both be overlapped using a single switch.
This is because routing is a post-memory operation on fetch and a
pre-memory operation on store. Thus both switching networks are required and
both will always be under-utilised, leaving the control unit only memory and
arithmetic unit cycles to optimise.
What happened between instructions was important to the BSP’s
performance on short vectors. Let us assume that the example code for matrix
multiplication was followed by the following loop:
The BSP array control unit would determine whether there were any
FIGURE 3.42 Measurement of n1/2 and r∞ for the BSP, for a single
vector instruction with the pipeline flushed before and after the operation;
n1/2 = 150, r∞ = 50 Mflop/s. (Timings courtesy of Burroughs Corporation.)
+ 320 33 50
— 320 33 50
* 320 33 50
1280 12.5 12.5
2080 7.7 7.7
FIGURE 3.43 Measurements of n1/2 and r∞ for the BSP in the steady state, with
successive vector operations being overlapped; n1/2 = 25, r∞ = 48 Mflop/s. (Timings
courtesy of Burroughs Corporation.)
system, as in normal use; data may not be read from a location unless this
bit signifies ‘full’ and data may not be written to a location unless the bit
signifies ‘empty’. Normally, after a read operation the bit is set to ‘empty’
and after a write operation the bit is set to ‘full’. With a little thought it can
be seen that this mechanism is sufficient to implement the protocol required
for an Occam channel (see §4.4.2). This is sufficient therefore to implement
all synchronisation and data protection primitives.
The structure of the individual PEM is particularly interesting because each
may process up to 50 user instruction streams, from up to seven user tasks.
The multiple instruction streams share a single eight-stage instruction
execution pipeline with a great deal of hardware support. Instruction streams
or processes are effectively switched on every clock cycle, so the distribution
of processor resources is exceedingly fair. The organisation of this pipeline
is shown in figure 3.46. Thus a single PEM is itself an example of a pipelined
MIMD computer (see figure 1.8). The PEM is controlled by a queue of process
tags (one for each instruction stream), which rotate around a control loop.
The tag contains the program status word for the instruction stream or process
that it represents. The process tag contains among other things the program
counter for that process, which is updated on each pass through the
INC PSW box. The instructions themselves are stored in a 1 Mword program
memory, and local data is held either in 2048 64-bit registers or in 4096
read-only memory locations, which are intended for frequently used constants.
Separate pipelined functional units are provided for floating multiply, add
and divide, integer (IFU) and create operations (CFU), and references to shared
memory (SFU). As the process tags rotate around the control loop, the
FIGURE 3.46 A diagram illustrating the operation of a single HEP PEM. The
diagram shows the control loop on the left and the execution loop on the right.
The asynchronous switch is decoupled from the synchronous operation of the
functional units. (INC denotes increment, IF denotes instruction fetch and DF
denotes data fetch, PSW denotes process status word.)
appropriate data rotates around the data loop. Data leave the registers, pass
through the appropriate pipeline, and the result is returned to the register
memory under the control of the passage of the process tag through the
instruction pipeline. As the process tag passes through the first stage of the
instruction pipeline, the instruction is brought from the program memory;
during the second stage the data referenced by the instruction are fetched
from the register or constant storage to the appropriate functional unit;
during stages three to seven the data pass through the unit and the function
is completed; and finally in the eighth stage, the result of the operation is
stored back in the registers.
The above description applies only to instructions that access data from
the registers or constant store. If, however, an instruction refers to data stored
in the shared memory (DMMs), then the process tag is removed from the
process queue and placed in the SFU queue while data is being retrieved from
shared memory. The tag is said to be waved off and its place in the process
queue can be used by another process or the process queue can be shortened.
This mechanism allows the process queue to contain only active processes.
In this way each process status word (PSW) in a process tag can always have
its program counter updated on each pass around the control queue. Thus
all operations are synchronous, with the exception of access to shared
memory. This is handled in the SFU queue, which re-inserts the waved-off
process into the process queue only when the data packet has arrived from
the appropriate DMM.
A new instruction stream (called a process) is initiated with the machine
instruction (or FORTRAN statement) CREATE. This uses the CFU to create
a new process tag and to insert it into the process queue. It is then said to
be an active process. It remains so, until either the process is completed (i.e.
a successful QUIT instruction or FORTRAN RETURN statement is
executed), or the tag leaves the process queue while waiting for data from
the shared memory (wave-off). Unlike the transputer, there is no mechanism
for a process to be passively delayed until a time-out.
It is instructive to examine the performance of the instruction pipeline as
the number of active processes increases. In the case of a single task (collection
of user processes) the process queue may be thought of as a circular queue
of up to 50 tags. The minimum length of the process queue is eight and if
less than eight processes are active, there will be some empty slots in the
queue. These slots can be filled by creating additional processes (until eight
are active) without changing the length of the queue, or the timing of the
other processes. Thus as more processes are created, the total instruction
processing rate increases linearly with the number of active processes, because
there are less empty slots in the process queue and hence more instructions
being processed in the same time. This continues until there are eight active
processes and the pipeline is full. In this condition the processing rate is at
a maximum. One instruction leaves the pipeline every 100 ns, giving a
maximum rate of 10 Mips per PEM, or 160 Mips for a full system of 16 PEMs.
If more than eight active processes are created, the only effect is to increase
the length of the process queue. In this way the instruction processing rate
for each process decreases, leaving the total instruction processing rate
constant at 10 Mips per PEM.
To summarise this mechanism, we would expect the processing rate to rise
linearly with number of active processes until the instruction pipeline is full
and then to remain constant. The instruction pipeline becomes full when
there are eight active processes. Since most active processes will make some
use of the shared memory and hence become waved off, then in practice more
than eight active processes will be required to maintain a full instruction
pipeline. In fact, in FORTRAN programs, about 12 to 14 active processes
are required to maintain a full instruction pipeline, which allows from four
to six waved-off processes.
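The behaviour just described can be summarised in a small C model: the instruction rate of a PEM rises linearly with the number of active processes until the eight-stage pipeline is full and then saturates at 10 Mips. The function below is an idealisation of our own that ignores wave-off.

#include <stdio.h>

#define PIPELINE_DEPTH 8
#define PEAK_MIPS      10.0          /* one instruction per 100 ns */

/* Total instruction rate of one PEM as a function of active processes:
   linear up to a full pipeline, constant thereafter. */
static double pem_mips(int active_processes)
{
    if (active_processes >= PIPELINE_DEPTH)
        return PEAK_MIPS;
    return PEAK_MIPS * active_processes / PIPELINE_DEPTH;
}

int main(void)
{
    for (int p = 1; p <= 14; p++)
        printf("%2d active processes: %5.2f Mips per PEM, %6.1f Mips for 16 PEMs\n",
               p, pem_mips(p), 16.0 * pem_mips(p));
    return 0;
}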
The beauty of this system is that even if a particular user is unable to
provide sufficient processes to fill the instruction pipeline, the pipeline will be
automatically filled with processes from other tasks or jobs. Thus the HEP
can effectively multitask at the single-instruction level, in a single processor
with no processor overhead. There is of course an overhead for process
creation and deletion.
In order to interpret the effective performance of the HEP on applications
dominated by floating-point operations, it is necessary to know how many
instructions are required per floating-point operation. This variable is called
i3 and modifies the asymptotic performance per PEM as indicated below:
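A simple way of expressing this relation, consistent with the figures quoted below (our sketch, in the notation used here), is

$$ r_\infty = \frac{10\ \text{Mips}}{i_3}\ \ \text{Mflop/s per PEM}. $$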
Since the HEP has no vector instructions, this loop must be programmed
with scalar instructions. If the loop were coded in assembler it could be coded
in six instructions, namely:
In this case i3 = 6 and r∞ = 1.7 Mflop/s. If, however, all variables were
stored in registers, instructions (a), (b) and (d) would be unnecessary, making
i3 = 3 and doubling the asymptotic performance to 3.3 Mflop/s. Further
Any further increase in P increases the overhead (i.e. the value of s1/2) without
improving the value of r∞. At the optimum point we have, for FORK/JOIN
synchronisation: r∞ = 1.7 Mflop/s and s1/2 = 828.
The above method of synchronisation requires the process to be created
and destroyed dynamically. A process is created when required by the
CREATE statement (FORK) and allowed to die on completion (JOIN). This
is obviously an inefficient method of obtaining synchronisation in the program
and therefore leads to high values of s 1/2. The alternative is to create static
processes once only, and to achieve synchronisation using other means. The
hardware provides the mechanism for synchronisation in its full/empty tags
on each location in shared memory. Therefore, using a shared variable,
synchronisation can be achieved using a semaphore. A counter initialised to
the number of concurrent processes used can be decremented once by each
process on completion of its share of the work. All completed processes
then wait for the variable to become zero, at which point synchronisation
has been achieved.
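A minimal C sketch of this counting-semaphore scheme is given below, using a C11 atomic counter and POSIX threads in place of the HEP's full/empty bits; the structure, not the primitives, is the point, and all names are our own.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NPROC 4

/* Shared counter initialised to the number of cooperating processes.
   Each process decrements it once when its share of the work is done,
   then waits until the counter reaches zero: a software barrier. */
static atomic_int remaining = NPROC;

static void barrier(void)
{
    atomic_fetch_sub(&remaining, 1);
    while (atomic_load(&remaining) != 0)
        ;                            /* busy-wait; the HEP would use full/empty bits */
}

static void *worker(void *arg)
{
    int id = *(int *)arg;
    printf("process %d: work done, waiting at barrier\n", id);
    barrier();
    printf("process %d: past barrier\n", id);
    return NULL;
}

int main(void)
{
    pthread_t t[NPROC];
    int id[NPROC];
    for (int i = 0; i < NPROC; i++) {
        id[i] = i;
        pthread_create(&t[i], NULL, worker, &id[i]);
    }
    for (int i = 0; i < NPROC; i++)
        pthread_join(t[i], NULL);
    return 0;
}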
This is a software implementation of barrier synchronisation, which has
substantially less overhead than dynamically creating and destroying
FIGURE 3.49 Diagram showing the structure of the ICL mini-DAP,
with the addition of a fast I/O plane and buffer.
FIGURE 3.50 A photograph of the prototype mini-DAP, showing the physical size
of the cabinet. (Photograph courtesy of AMT Ltd.)
FIGURE 3.51 A photograph of an array board from the mini-DAP, showing the
technology used. The large square packages are LSI gate array chips containing 16
processors. (Photograph courtesy of AMT Ltd.)
Word
length Operation Performance
Other applications where this class of architecture excels are where broadcast
and reduction operations over data are required. For example, in associative
processing, set and database manipulation etc.
Although the connection machine is claimed by Hillis in his book of the same name (Hillis 1985) to be a ‘new
type of computing engine’, to a large extent the implementation draws heavily
on the established bit-serial grid computer developments that have preceded
it. What distinguishes the connection machine from its predecessors is the
use of a complex switching network which provides programmable connections
between any two processors. These connections may be changed dynamically
during the execution of programs, as they are based on packet-routing
principles. Data is forwarded through the network, which has the topology
of a twelve-dimensional hypercube, to an address contained within the packet.
Because of the topology of the network, one bit of the binary address
corresponds to one of two nodes in each dimension of the hypercube (see §3.3).
The connection machine derives its power and name from this ability to
form arbitrary connections between processors; however, there are many
compromises in the design and it is not clear that the designers had sufficient
experience in the implementation technologies. A twelve-dimensional hypercube
requires a considerable amount of wiring and, as is expounded in Chapter 6,
this can contribute excessively to cost and performance degradation. However,
the principles behind the machine are laudable, in providing a virtual replicated
architecture, onto which a user description of the problem to be solved can
be transparently mapped.
The Connection Machine is not the only bit-serial development that has
considered the problems of information representation and array adaptability.
Southampton University’s RPA project (§3.5.4) is also an adaptive array
architecture which has a communications system capable of creating arbitrary
connections between processors and varying these dynamically. The connection
machine implements connections in a bit-serial packet network, arranged in
a binary cube topology and giving a slow but general connection capability.
The RPA, on the other hand, implements connections by circuit switching
over a connection network which conforms to the underlying implementation
technology, which is planar. However, there is provision for long-range
communication, which can be irregular. These long-range connections are
implemented at the expense of processing power, by using processing elements
as circuit switch elements. This latter trade-off is one which need not be fixed
at system design time. For example, in the connection machine, 50% of the
array chip is dedicated to packet routing. In the RPA, all of the chip is used
for PEs, but at compile or run time some of these may be used simply for
switching elements.
The reader is encouraged to compare and contrast these two approaches
to this communications problem. The direct connection provides an efficient
hardware implementation and gives a high bandwidth. The packet switch,
on the other hand, is more costly, but gives high network efficiencies over
§4.3; it is a processor array which uses many simple processors, but which
can adapt its physical structure in a limited way to accommodate a variety
of data structures. The RPA is designed to support structure processing over
the widest range of data structures, consistent with the adaptable grid network
used; however, this is not as restrictive as would be expected, as will be seen
later, for it includes structures such as binary trees. The RPA was developed
at Southampton University with funding from the UK's Alvey program
(Jesshope 1985, Jesshope 1986a, b, c, Jesshope et al 1986, Rushton and
Jesshope 1986, Jesshope and Stewart 1986, Jesshope 1987a, b). It is similar
to the DAP and connection machine in that it is an array of single-bit
synchronous processing elements with a common microcontroller. It is not
a true SIMD machine, however, as local modification of an instruction is
allowed by distributing some of the control word fields across the array. This
allows the array to adapt to different situations. Figure 3.57 illustrates this
concept for rectangular arrays.
It is well known that a synchronous structure such as a large SIMD machine is
limited in the size to which it can be extended, as clock and control information
must be distributed to synchronise the system. Any skew in distributing these
signals will reduce the clock frequency. There is therefore an optimal size of
RPA which will depend on the characteristics of the implementation and
FIGURE 3.58 The RPA computer system, showing the array, the
controller and host, with the interaction between the three components.
as a bit slice of a larger processing unit, which will support a wide range of
common microprocessor operations. The composite processing unit supports
binary operations, a wide range of shift operations and bit-parallel arithmetic
operations. These processing units can be configured, dynamically if required,
from any connected path of RPA PEs. The most useful configurations are
closed subgraphs of the RPA array, as these support cyclic and double-length
shift operations in a single microcycle and also greatly simplify serial
arithmetic when applied to words of data stored in processing units. For
example, performing 16-bit arithmetic serially using four-bit processing units,
the carry bit from one four-bit addition at the most significant bit will be
adjacent to the least significant bit, where it will be required on the following
cycle. If multi-bit processors are configured as closed loops of PEs, then the
two operand and two result buses. The ALU provides logical and bit-serial
operations, using local operands only and the neighbour-select provides a
means of transferring data from one PE to another. Bit-parallel arithmetic is
made possible by the use of both ALU and neighbour-select, which together
form a fast Manchester carry-chain adder. The key to the flexibility of the
array lies in the use of three preset control fields, which are provided in each
PE in the array (R in figure 3.59). These control fields determine switch settings
and edge effects in arithmetic and shift operations.
Two of these fields, each of two bits, specify the control of the nearest-
neighbour switch (left and right shifts); the reconfiguration register also
contains a further two-bit field, which codes the significance of the PE in the
larger processing unit. The codes used are for most significant bit, least
significant bit and any other bit. The remaining code is used for a very
powerful feature which allows the inputs from the switch to be connected
directly to the outputs. This allows the distributed control of communications
such as the row and column highways found in the ICL DAP and
GEC GRID. However, it does this in a much more flexible manner, as the
combination of direction and significance codes can be used to bus together
any connected string of PEs. These bus connections may either be used to
distribute data to all PEs so connected, or indeed be used to bypass PEs in
order to implement connection structures that would not otherwise be
possible, containing long-range connections.
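For illustration, the three preset fields can be pictured as a small C structure; the particular encodings, widths and names below are our own, since only the roles of the fields are described in the text.

#include <stdio.h>

/* Significance codes held in each RPA PE's configuration register
   (the actual encodings are not given in the text). */
enum significance {
    SIG_LSB,        /* least significant bit of the composite processor  */
    SIG_MSB,        /* most significant bit                              */
    SIG_MIDDLE,     /* any other bit                                     */
    SIG_BYPASS      /* switch inputs connected straight to outputs (bus) */
};

/* Three preset control fields per PE: two 2-bit fields selecting the
   nearest-neighbour switch setting for left and right shifts, and a
   2-bit significance field. */
struct rpa_config {
    unsigned left_switch  : 2;   /* neighbour selected for left shifts  */
    unsigned right_switch : 2;   /* neighbour selected for right shifts */
    unsigned significance : 2;   /* one of enum significance            */
};

int main(void)
{
    /* A four-bit processing unit configured as a closed loop of four PEs. */
    struct rpa_config unit[4] = {
        { .left_switch = 1, .right_switch = 3, .significance = SIG_LSB    },
        { .left_switch = 1, .right_switch = 3, .significance = SIG_MIDDLE },
        { .left_switch = 1, .right_switch = 3, .significance = SIG_MIDDLE },
        { .left_switch = 1, .right_switch = 3, .significance = SIG_MSB    },
    };
    for (int i = 0; i < 4; i++)
        printf("pe %d: significance code %u\n", i, unit[i].significance);
    return 0;
}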
The storage provided in each PE is stack-based. There is a bit stack and
an activity stack, for storing single bits of data in a PE. Each comprises eight
bits of data and both are identical, with the exception that the top bit of the
activity stack can be used to conditionally disable the operation of the PE.
Storage is also provided for words of data in the PE. These are arranged as
a stack of eight-bit words with parallel-to-serial and serial-to-parallel
conversion provided by a pair of shift registers. This structure allows bit-serial
(or word-serial) arithmetic to be performed, without the awkward bit reversal
encountered using stacks. Finally, a single-bit I/O port allows up to 64K
bits of external storage to be connected to each PE.
The internal RAM has dual control mechanisms which allow stack or
random access to the bytes of a word with local or global control of either.
For example, the stack can be conditionally pushed or popped, depending
on the value of the activity stack. The store structure also contains a full
n-place shift, which is locally controlled, and comparator circuits between the
two shift registers providing serial-parallel conversion. These facilities
provide much enhanced floating-point capability within the array. For a
given cycle time and array size, the floating-point performance of the RPA
can be a factor of ten above that of the ICL DAP architecture. The price
paid for this is a more complex processing element. At the time of writing,
a test chip containing a single PE has been fabricated. It uses a 3 µm n-well
CMOS process and occupies a little under 4 mm². We anticipate going into
production with a 16-PE chip, which will occupy about 0.8 cm². Figure 3.60
shows the RPA test chip.
The MIMD layer can exploit the irregular or independent parallelism also found
in many algorithms.
Although a single RPA/transputer system can be considered as a
conventional SIMD array, each PE may, under locally determined conditions,
store or generate some or all of its control word for use in subsequent
operations. This adaptive modification of SIMD control allows each processor
some autonomy of action. Most SIMD computers provide minimal control at the
processor level, being usually restricted to a data-dependent on/off switch. This
structure maintains the advantage of fine grain synchronisation while allowing
a limited adaptability to differing processing requirements in the array.
Figure 3.58 shows the control structure of a single RPA. Two major control
loops can be identified. The first is between the array and its microsequencer.
This is conventional; the controller supplies microcontrol words and addresses
to the array (a wide word data path) and receives condition signals from the
array and possibly from the other devices in the system. The second control
loop involves data from the array memory, which can be accessed by
processes executing on the host system. This loop allows a coarser grain
synchronisation that may involve events from other transputer/RPA systems.
The use of this loop involves the initiation by the host of some action in the
array, which will set data in the array memory; this data will subsequently
be retrieved by the host and used to determine subsequent actions.
One of the major design goals of the RPA system has been to map this
second control loop into an OCCAM programming model in such a way
that the programmer can trade real and virtual concurrency in an application.
This will allow for applications code to be ported between different
transputer/RPA configurations.
The implementation alternatives in mapping the RPA control structure
into OCCAM are determined by the granularity of processes running on the
array. Should array processes be complete programs, running concurrently
and perhaps communicating with the host? Or should they be considered
as indivisible extensions to the host instruction set? It is the communications
required between host and array which provide the discrimination between
these two alternatives. If we define a basic array process as being an indivisible
unit of computation, so far as communication between host and array is
concerned, such that the order may be modelled by the following OCCAM
fragment
then any decision concerning the partitioning of control between host and
array controller is in effect a decision about whether constructions of basic
array processes may be determined by the array controller, or whether they
must be determined by the transputer host.
If sequences of these basic array processes are to be determined by the
array controller, allowing complete programs (processes) to run on the array,
then for concurrent operation of the host and array, there must be
communication between host and array to provide the necessary synchronisation
events. This concurrent operation between host and array is illustrated in
OCCAM in the example below.
PAR
  SEQ                        -- HOST PROCESS
    ARRAY ! SOMETHING
    HOST.PROCESS.1
    ARRAY ? SOMETHING.ELSE
    HOST.PROCESS.2
  SEQ                        -- ARRAY PROCESS
    ARRAY ? SOMETHING
    ARRAY.PROCESS.1
    ARRAY ! SOMETHING.ELSE
    ARRAY.PROCESS.2
In this example the host initiates action in the array and proceeds. At some
later stage the two processes synchronise by a communication initiated from
the array and both then proceed, perhaps modifying their actions based on
that communication. In this control model synchronising communications
may be initiated by host or array.
If we wish to overlap processing and communication, so that efficiency is
not lost in achieving synchronisation, then this model must be modified to
allow both host and array to run their respective processes in parallel. This
is illustrated in the example below, where the synchronising communications
have been hidden within host and array processes.
PAR
  PAR                        -- HOST PROCESS
    HOST.PROCESS.1
    HOST.PROCESS.2
  PAR                        -- ARRAY PROCESS
    ARRAY.PROCESS.1
    ARRAY.PROCESS.2
To implement this model requires an array sequencer that can support
two process queues, for active and suspended processes, with support for
suspending and activating processes to achieve synchronisation.
Figure 3.61 gives a more detailed view of the transputer control system,
which contains a single multiport process queue, through which the transputer
and microcontroller communicate. Within this implementation, array
microcode routines provide processes which can be considered as indivisible
extensions to the host’s instruction set, and which map into the OCCAM
language. PAR and SEQ constructors, and array and host processes, may
now be freely mixed, with communication (hidden in the example below)
providing synchronising events and control flow where required:
-- HOST/ARRAY PROCESS
SEQ
  PAR
    HOST.PROCESS.1
    SEQ
      ARRAY.PROCESS.1
      ARRAY ? SOMETHING
  PAR
    HOST.PROCESS.2
    IF
      SOMETHING = GOOD
        ARRAY.PROCESS.2
      TRUE
        ARRAY.PROCESS.3
Using this system, a sequence of array processes which require no
synchronising communications may be buffered in the array controller with
control passing from one to the next without recourse to host interaction.
They are added to the process queue in the array controller and then executed
from this queue in sequence. Likewise, parallel array processes may also be
queued for execution on the array.
A considerable volume of low-level software has been implemented in the
design of the RPA computer system, using the RPA simulator-based
microcode development system (Jesshope and Stewart 1986). This provides
a simulator of a 32 x 32 array and controller, implemented on the binary
image of the microcode. It uses the graphics hardware of an ICL Perq
workstation to implement this. The system provides programming by menu
and is illustrated in figure 3.62. It can provide interactive data and
TABLE 3.11 Performance estimates for the RPA computer system over
1024 operations. (All figures in millions of operations per second.) All
floating-point operations are full IEEE specification, with denormalisation
and rounding. IBM format floating-point is very much faster.
Operation Performance
FIGURE 3.63 Illustration of the use of local control fields in the mapping of regular
data structures over the RPA; the key gives the significance of the control fields.
(a) Two eight-bit microprocessors, configured from closed cycles of RPA PEs (note
that MS and LS bits are adjacent, enabling efficient shift operations and carry handling).
(b) The same two microprocessors, with connections configured to create a local bus
structure, to support broadcast operations from the least significant bit.
Figure 3.63 illustrates the codes required in the local control fields to
implement some regular configurations, which together give some idea of the
power and flexibility of this adaptive array structure. The three configuration
fields are shown in each processing element in figures 3.63(a) to 3.63(c). These
are coded as shown in the key and correspond directly to the configuration
store state in each RPA PE.
Figure 3.63(a) shows the configuration for two 8-bit processors. Notice
that these are closed loops and thus support the full range of 'byte'-oriented
operations.
To perform multiplication, it is possible to configure the processors of
figure 3.63(a) to broadcast data from one bit to all other bits in a given
processor and to do this in each processor simultaneously. This configuration
(figure 3.63(b)) gives the facility of a parallel bus structure, in which multiple
broadcasts can be performed concurrently.
Figure 3.63(c) shows how the array can be used to map a binary tree of
single-bit processors onto the RPA. This is achieved with a 50% utilisation
of the PEs. Some 25% of the PEs are not used at all and another 25% are
used in bus configuration and only pass inputs to outputs, providing
long-range communication where required.
Less regular structures can be mapped onto the array by implementing a
packet-based communications structure over the adaptable nearest-neighbour
communications network, using a layer of microcode. Such an implementation
has been coded on the RPA and the results of one cycle of the algorithm are
illustrated in figure 3.64. We have implemented 32-bit packets, with two
absolute address fields for the x and y PEs. Using this scheme only one data
packet may be buffered in on-chip RAM, but simultaneous transmission and
reception of packets is possible in the absence of data collision. However,
with the global reduction facility implemented over the RPA array, it is
possible to detect a potential collision deterministically in very few microcycles
and buffer packets out to external RAM.
The total circuit-switched bandwidth of the RPA is two bits in and two
bits out of every PE in every microcycle, a total of 4 x 10^10 bit/s,
approximately equivalent to the CM-1 machine (see §3.5.3), although this
machine is 64 times larger. In packet-switched mode, it requires between 100
and 200 microcycles to forward 32 bits of data one PE nearer their destination,
giving a total of between 3 and 6 x 10^9 bit/s over the entire array. Thus,
the flexibility of addressed data communication degrades the communications
bandwidth by an order of magnitude; however, the static bandwidth is high
and this is very much more efficient compared to an implementation on a
non-adaptive array. This system allows irregular and dynamic data structures
to be implemented.
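Working backwards from these quoted totals (our arithmetic; the microcycle time is an inference, not a quoted figure): the 32 x 32 array has 1024 PEs, each passing 4 bits per microcycle, or about 4100 bits per microcycle in all, so the quoted 4 x 10^10 bit/s implies a microcycle of roughly 100 ns. On the same assumption, forwarding 32 bits per PE every 100 to 200 microcycles, with simultaneous transmission and reception, gives 2 x 1024 x 32 bits every 10 to 20 microseconds, which is the 3 to 6 x 10^9 bit/s quoted for packet-switched operation.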
Figure 3.66 illustrates the power of the T800 transputer chip, by considering
some transputer board sub-assemblies. Figure 3.66(a) shows a 2 Mbyte single
transputer board, a 1-2 Mflop/s component; figure 3.66(b) illustrates a 16
transputer board, a 16-32 Mflop/s component; and figure 3.66(c) illustrates
a 42 transputer board, a 42-84 Mflop/s component! All boards are double-
extended eurocard format. Of course one is trading processing power for
memory in this sequence, but it can be seen that there is a great incentive to
provide computing models which utilise a relatively small amount of RAM,
for example pipelines. With the rate at which gate density is increasing in
MOS technologies, it would not be surprising to see the demise of the RAM
chip, in favour of a processor-memory device, of which the transputer is the
harbinger. The achievement of 4 Mflop/s and 64 Kbytes on a single silicon
die in 1990 would not be a surprising feat.
FIGURE 3.67 The INMOS transputer. (a) A block diagram of the chip
architecture. (b) The transputer registers.
store non-local;
jump;
conditional jump; and
call.
These single instructions can use the four-bit constant contained in the
data field, giving a literal value between 0 and 15, or an address relative to
the workspace pointer for local references, i.e. an offset of 0 to 15. The
non-local references give offsets relative to the top of stack, or A, register.
For larger literals, two additional instructions provide the ability to build
data values from sequences of these byte instructions. All instructions
commence their execution by loading their four-bit data value into the
operand register, and the direct functions above terminate by clearing this
register. There are, however, additional instructions which load this register
and whose action is to shift this result left by four places, therefore increasing
its significance by a factor of 16. These instructions do not clear the operand
register after execution. Two instructions— prefix and negative prefix— allow
positive and two’s complement negative values to be constructed in the
operand register, up to the word length of the processor.
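As a small worked illustration of this mechanism (the pfix and ldc mnemonics are the standard transputer ones; the particular value is ours), loading the literal #345 onto the evaluation stack takes three instruction bytes:

    pfix  #3     -- operand register becomes #3, then shifts to #30
    pfix  #4     -- operand register becomes #34, then shifts to #340
    ldc   #5     -- operand register becomes #345; the value is pushed and the register cleared

Each byte contributes four bits of significance, so even a full 32-bit constant needs at most eight bytes (seven prefixes and one direct instruction).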
This data can be used as an operand to the above direct instructions.
Statistics of the analysis of compiler-generated code indicate that the above
direct instructions are the most commonly used operations and addressing
modes. Moreover, these results also show that the most commonly used
literal values are small integer constants. This simple instruction set therefore
allows the most commonly used instructions to be executed very rapidly.
However, the story does not end there, for even the most ardent RISC advocates
would require more than this handful of instructions; 64-128 is typical for
other RISC processors.
The operate instruction allows the operand register to be interpreted as
an opcode, giving, in a single byte instruction, an additional 16 instructions,
which operate on the stack. What is more the prefix instruction can be used
to extend the range of functions available. Currently, all instructions can be
encoded with a single operand register prefix. The most frequently used
indirect operations are encoded into the nibble contained in the first byte.
As an example, multiplying the top of stack by a literal with 16 bits of
significance would require the following code sequence:
prefix—most significant 4 bits
prefix—next most significant 4 bits
prefix—next 4 bits
load constant—least significant 4 bits
prefix—prefix for the indirect function code
operate—decodes operand register as multiply.
This sequence takes four cycles to load the operand and 40 to execute the
multiply, not such a large overhead for the simplicity. Indeed, if account were
taken of speed improvement in processor cycle time because of the simple
decoding, then the benefits of the RISC approach the transputer has taken are
well justified. The sequence is also compactly encoded, requiring only
six bytes of data (two bus cycles to load the instructions).
FIGURE 3.70 cont. (d) The initial state of two yet-to-communicate channels.
(e) The state during communication; note that both processes P and Q are descheduled.
Figure 3.71 shows the time required to transmit a message of a given size
on a transputer with 10 MHz links and 12.5 MHz processor clock, as found
in the first T414 transputer products. The maximum or asymptotic transfer
rate is approximately 0.5 Mbyte/s, which can be achieved in both directions
over the link. A half-performance message length of approaching one byte
is also observed. Various processor and link rates will modify these observed
parameters, with the start-up time being proportional to the processor clock
and the asymptotic bandwidth being, to a first order, proportional to the
link rate. Thus, on current best T414 silicon (20 MHz processor and 20 MHz
links), an asymptotic bandwidth of 1 Mbyte/s should be seen, with a half
bandwidth message length of 1 byte.
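In the (r-infinity, n-half) terms used elsewhere in this book, these observations fit a simple linear timing model (our reading of the figures, not a quoted formula): t(n) is approximately t0 + n/r-infinity, with r-infinity about 0.5 Mbyte/s and n-half = r-infinity x t0 of about one byte, giving a start-up time t0 of roughly 2 microseconds on the 12.5 MHz part. Since the start-up time scales with the processor clock while the asymptotic rate scales with the link rate, n-half remains close to one byte on the faster parts, as stated.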
The link protocol uses an 11-bit data packet, containing a start bit, a bit
to distinguish data and acknowledge packets, eight data bits and a stop bit.
The acknowledge packet is two bits long and contains a start and stop bit.
The protocol allows for a data byte to be acknowledged by the receiving
transputer, on receipt of the second (distinguishing) bit. This would allow
for continuous transmission of data packets from the transmitting transputer,
providing that the signal delay was small compared with the packet
transmission time.
This pre-acknowledgement of data packets is not implemented on current
T414 transputers, and an acknowledge packet will not be sent until the
complete data packet has been received. However, this protocol has been
implemented on the T800, giving a maximum theoretical transfer rate of
approaching 2 Mbyte/s, per link, per direction. This bandwidth could cause
degradation on bi-directional traffic, as data and acknowledge packets must
be interleaved.
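As a rough check on these figures (our arithmetic, assuming the faster parts' links run at 20 Mbit/s, which is consistent with the numbers quoted): with the acknowledgement overlapped, the link can carry one 11-bit data packet per byte continuously, that is about 20 x 10^6 / 11, or 1.8 x 10^6 byte/s, which is the 'approaching 2 Mbyte/s' quoted; without the overlap the transmitter must also wait for each 2-bit acknowledgement and its round trip, which is why the T414 achieves only about 1 Mbyte/s on the same links.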
(iv) Performance
The performance of the transputer is dependent on a number of factors, for
example, the clock speed of the part, which may vary from 12.5 MHz to
20 MHz. Also, if data is coming from off-chip RAM, the access times will
depend on the speed of the memory parts used and, for most operations, this
is a major factor in the speed of operation. At best, the external memory
interface will cycle at three transputer clock cycles (150 ns for the 20 MHz
processor). Internal memory, on the other hand, will cycle within a single
processor cycle. It is clear from this that on-chip data will provide significant
speed gains. The transputer reference manual gives a breakdown of cycles
required for the execution of various OCCAM constructs. We have, however,
performed a number of benchmarks on a single transputer board, containing
a T414 rev A transputer with a 12.5 MHz internal clock. The results of these
timings are given in table 3.12 below.
When using OCCAM the programmer is encouraged to express parallelism,
even if the code fragment is configured onto a single transputer. This style
of programming should be encouraged, but only providing that parallel
processes can be created without excessive overhead. INMOS claim that in
the transputer this overhead is no larger than a conventional function call
in a sequential language. In order to test this we have executed the following
code on a single (12.5 MHz) transputer.
TABLE 3.12 Benchmark timings on the T414 rev A transputer board. (Columns: operation; performance from internal memory; performance from external memory. The figures themselves are not reproduced in this extract.)
This code was executed for various values of m and n, where the product
m*n = 1024. This was executed in internal and external memory, and the
results are summarised in figure 3.72. Using off-chip memory, it can be clearly
seen that the overheads of process creation do not become significant until
128 parallel processes are used to perform 1024 integer multiplications; and
in the case of on-chip memory, when 256 parallel processes are used to
perform the 1024 operations. These correspond to eight and four operations
for each process respectively.
The overheads of creating parallel processes are therefore small. These
figures also show the advantages of programming transputers so that code
and data both use on-chip memory. If the transputer is used in this mode of
operation, it is likely that code for the algorithm will be distributed across
a number of transputers. This is called algorithmic parallelism, and the
technique is illustrated in §4.5.2. In this situation, data will probably be
sourced from communications links. We therefore executed code on connected
transputers to simulate the effect of various packet sizes on operation times.
The OCCAM code executed on the transputer was a sequence of integer
multiplications executed from internal memory, with both sets of operands
sourced from two externally configured OCCAM channels and the results
sent to an externally configured OCCAM channel. All I/O was buffered so
that communications and operations could all proceed in parallel. The results
are summarised in figure 3.73, where curves are plotted for a given packet
length, showing time required against total number of operations performed;
the curves are compared with the ideal, which gives time for internally sourced
data.
(At this point the original includes a table of T800-30 and T800-20 performance figures, for single and double precision operations, which is not reproduced in this extract.)
up and down the systems’ hierarchy, it will not go away. For example, we
can trade chip and wiring complexity in a fully switched system against the
large diameter of a static network.
In the latter case, when processing a partitioned data structure, one that
is shared between the processors in the system, then only in the special case
where each partition of the data structure is independent will no communications
be required between processors. More generally, data must be shared between
processors in the system and in many problems the communications
complexity can dominate the complexity of the algorithm. For example, in
sorting or computing matrix products each element in the resulting data
structure requires information from all other elements in the original
structures.
Such problems with global communications properties scale rather
unfavourably with the extent of parallelism in the system, unless the
connectivity between processors reflects that between partitions of the data
structure. It is shown in figure 3.73 above that on local communications the
number of operations (integer multiplications) required per word of data
received over the INMOS link is two. It is also clear that a significant
degradation of the communication bandwidth due to a large-diameter
network would require a relatively coarse granularity in the partitioning of
the algorithm. The transputer with its four links could be configured directly
into the following networks: a two-dimensional grid, a perfect shuffle
exchange network, a butterfly network, or other four-connected topologies
(see §3.3.4). In all cases, however, if the data partitioning does not match the
underlying network, then communication bandwidth will be degraded in
relation to the processing rate of the system, in proportion to the diameter
of the network. The diameter of the networks above varies from log2(n) to
n^(1/2). On applications that become communications-limited, performance will
not scale linearly with processors added to the system.
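For example, with n = 1024 transputers the diameter ranges from log2(1024) = 10 for the shuffle-exchange or butterfly style networks to about 1024^(1/2) = 32 for the two-dimensional grid, so a data partitioning that does not match the network could dilute the effective link bandwidth by an order of magnitude.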
The technique which avoids this unfavourable scaling is to allow arbitrary
permutations to be established between processors, using a crossbar switch
or its equivalent. Although the latter has many advantages and allows an
arbitrary scaling of parallelism, the costs of such a switch will tend to grow
with the square of the number of processors in the system. The costs of the
fixed networks, however, vary linearly with the number of processors added
to the system. However, because the transputer communicates over a
high-bandwidth serial circuit, the costs of fully connecting transputers via
their link circuits is not prohibitively expensive, or at least is not so for arrays
of up to several thousand transputers, which with the T800 gives multiple
gigaflop/s machines.
A transputer system which allows arbitrary networks of transputers to be
configured was first proposed by one of the authors and is now the subject
of a major ESPRIT project in advanced information processing. One of the
motivating factors behind this design was to use switched nodes of transputers
to implement program-derived networks, so that the transputers could be
configured in algorithmic networks. This effectively creates a more powerful
node which has a higher communications performance than a single
transputer. Algorithmic parallelism can be exploited, with the placed processes
and network configuration forming a static or quasidynamic data flow graph
of the algorithm (see §4.5.2 for details).
To understand the arguments behind this, consider the following. A
transputer has processing power P and communications bandwidth C, with
the ratio of these, C/P, determining how soon a system would become
communications-bandwidth-limited as more transputers were added to the
system. Now consider taking a small number of transputers, say n, and
arranging them as in figure 3.74. It has been shown (Nicole and Lloyd 1985)
that this configuration will implement any graph of the n labelled transputers.
where most transputers have only 256 Kbytes of fast static RAM (eight chips),
packed eight per board using hyper-extended, triple eurocards. At supernode
level, one of the transputers, designated the control transputer, will have the
switch chips mapped as I/O devices and will be able to configure the local
network. This device is also the master on an eight-bit control bus, which is
connected to all other transputers in the supernode. This bus provides the
ability to set and read signals such as reset, analyse and error on any
transputer, and provides a low-bandwidth communications medium between
the transputers for control and debug purposes. This bus also provides
IF ANY and IF ALL synchronisation between the transputers, so that global
events may be signalled to the controller. This is necessary for non-static use
of the switch, to provide a time reference when link activity has ceased so
that it may be reconfigured. Each first-level supernode will also have RAM
and disc servers, both implemented as transputers hooked into the switch.
The controller and bus are linked at the second level by the second-level
controller, which sets the outer switches and acts as bus master for a bus on
which each first-level controller is a slave. A prototype supernode was
completed in the third quarter of 1987, with first samples of the T800 transputer,
which was developed by INMOS as part of the same ESPRIT collaboration.
The partners in the collaboration are Apsis SA, Grenoble University,
INMOS Ltd, the UK government research establishment RSRE, Southampton
University, Telmat SA and Thorn EMI plc.
4 Parallel Languages
4.1 INTRODUCTION
(Burns 1985) and Modula 2 (Wirth 1981) embody this technique. Access to
the data is only provided through the shared procedures. The encapsulation
of data and access mechanisms in this way defines the object, and an objective
language is one which enforces this regime by building an impenetrable fortress
around these objects. This can be done at a high level, such as in
OBJECTIVE C and ADA (Cox 1986), or at a very low level in the system,
as found in SMALLTALK 80 (Goldberg and Robson 1983).
Having encapsulated a programmer’s efforts in the creation of such
constrained objects, a mechanism must be provided in order to evolve or
enhance that object, without a full-scale assault on its defences. This is the
second technique of objective languages— inheritance. Inheritance allows the
programmer to create classes of objects, where those classes of objects may
involve common access mechanisms or common data formats. For example,
given a class of objects ‘array’, we may wish to use this to implement a
subclass of objects, ‘string’, which are based on array objects but have access
mechanisms specific to strings. Conversely, given a set of access mechanisms
to an object, we may wish to extend the range of type of that object, but
exploit the access mechanisms which already exist. These are examples of
inheritance. The mechanism for implementing inheritance is to replace the
function or procedure call to an object by a mechanism involving message
passing between objects. The message provides a key by which the access
mechanism can be selected. This is equivalent to a late or delayed binding
of the access mechanism to a function or procedure call. Thus only at run
time, when a selection is made from a class, is the appropriate access
mechanism or data type selected.
The notion of encapsulation can readily be distributed for, as is indicated
above, implementations are often based on message passing. The notion of
inheritance is not so easily distributed, as it is based on a class tree defining
and extending objects. A distributed implementation of a class of objects
would, if naively mapped onto a processor tree, become heavily saturated at
the root. However, by exploiting the application's parallelism and replicating
the class structure (program) where required, an efficient mapping can result.
What is more, because of the underlying packet nature of communications,
a dynamic load balancing may be readily implemented. Object-oriented
languages can then be considered as prime candidates for the efficient
exploitation of parallel systems, where ‘efficient’ implies both programmer
and machine utilisation efficiency. The field is new, however, and little work
has been published to date in this area. The interested reader is referred to
the OOPSLA 1986 proceedings, published as a special issue of the SIGPLAN
notices (Meyrowitz 1986).
Although we have briefly reviewed some modern language trends, this
chapter (with the exception of CMLISP) follows the trends in the development
of imperative languages, and the way they have evolved to deal with various
aspects of hardware parallelism. Section 4.3 deals with parallelism which is
introduced through structure, as found in the SIMD-like architectures. These
languages are ideal for array processors, such as the ICL DAP, and vector
processors, such as the CRAY-1. Section 4.4 covers the use of process or task
parallelism that is found in MIMD systems. However, the most common
exploitation of parallelism is provided by the automatic vectorising compilers
used for FORTRAN on most vector computers. This form of parallelism is
extracted from loop structures within sequential code. It has the advantage
(the only advantage perhaps) that existing sequential code for production
applications can benefit from the speed-up obtained from vector instruction
sets implemented on pipelined floating-point units. This approach and its
potential disadvantages are discussed in detail in §4.2 below.
4.2.1 Introduction
The high-level language has been developed as a programming tool to express
algorithms in a concise and machine-independent form. One of the most
common languages, FORTRAN, has its roots in the 1950s and because of
this, it reflects the structure of machines from that era: computers which
perform sequences of operations on individual items of scalar data.
Programming in these languages therefore requires the decomposition of an
algorithm into a sequence of steps, each of which performs an operation on
a scalar object. The resulting ordering of the calculation is often arbitrary,
for example when adding two matrices together it is immaterial in which
order the corresponding elements are combined, yet an ordering must be
implied in a sequential programming language. This ordering not only adds
verbosity to the algorithm but may prevent the algorithm from executing
efficiently.
Consider for example the FORTRAN code to add two matrices:
      DO 10 J = 1,N
      DO 10 I = 1,N
      A(I,J) = A(I,J) + B(I,J)
   10 CONTINUE
Here the elements of A and B are accessed in column major order, which is,
by definition, the order in which they are stored. On many computers, if this
order had been reversed, the program would not have executed as efficiently.
In this example, because no ordering is required by the algorithm, it is unwise
to encode an ordering in the program. If no ordering is encoded the compiler
may choose the most efficient ordering for the target computer. Moreover
should the target computer contain parallelism, then some or all of the
operations may be performed concurrently, without analysis or ambiguity.
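For illustration (our sketch, not part of the original text), the same addition with the loop order reversed makes the inner loop access A and B with a stride equal to the extent of the first dimension, and may therefore run more slowly on machines whose memory is banked or paged:

      DO 20 I = 1,N
      DO 20 J = 1,N
      A(I,J) = A(I,J) + B(I,J)
   20 CONTINUE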
The language APL (Iverson 1962) was the first widely used language which
expressed parallelism consistently, although its aim was for concise expression
of problems rather than their parallel evaluation. Iverson took mathematical
concepts and notations and based a programming language on them.
However, the major difference between the mathematical description of an
algorithm and a program to execute it is in the description and manipulation
of the data structures. APL therefore contains powerful data manipulation
facilities.
Chapters 2 and 3 have described new developments in computer architecture.
These architectures embed parallelism of one form or another into the
execution of instructions in the machine. Computing has arrived therefore
(nearly two decades after Iverson) at the situation where there is a need to
express the parallelism in an algorithm for parallel execution.
be more likely to avoid a severe mismatch with his problem’s data structures.
However, as more parallel computers (and languages) become available, the
problems of transporting programs between them become very severe
(Williams 1979). We will return to this problem of portability later.
instructions for the target machine. To do this the compiler must match
segments of code to code templates which are known to be vectorisable. For
example one of the simplest templates would be:
        DO <LABEL> Y = <CONST> TO <CONST>
<LABEL> <ARRAY VAR>(Y) = <ARRAY VAR>(Y) <OP> <ARRAY VAR>(Y)
This would simply translate into one or more vector operations, depending
on target instruction set and difference between the loop-bound constants.
More complex templates may involve variable loop bounds, array subscript
expressions, conditional statements, subroutine calls and nested constructs.
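As a concrete illustration (ours, not from the original text), the following fragment matches this simple template and could be compiled into a single vector multiply of length 64 on a machine whose vector length allows it:

      DO 10 I = 1,64
   10 A(I) = A(I) * A(I)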
This process can be considered as an optimisation or transformation,
usually performed on the source code or some more compact tokenised form.
For example, the compiler for the Texas Instruments Advanced Scientific
Computer (ASC NX) performs the optimisation on a directed graph
representation of the source code. However, for simplicity, only the source
code transformations will be considered in the examples given in this chapter.
It is clear that the most likely place in which to find suitable sequences of
operations for vectorisation is within repetitive calculations, or DO loops in
FORTRAN. Thus the vectorising compiler will analyse DO loops, either the
innermost loop, or possibly more. The ASC NX compiler analyses nests of
three DO loops and if there are no dependencies can produce one machine
instruction to execute the triple loop. The Burroughs Scientific Processor
(BSP) vectorising compiler analyses nested DO loops. It will reorder the
loops if one of the innermost loops contains a dependency.
In general, the transformation performed on one or more DO loops is a
change in the implied sequence or order of execution. In the sequential
execution of the DO loop, the order implied is a statement-by-statement
ordering, for each given value of the loop index. The order required for
parallel execution is one in which each statement is executed for all given
index values, before the following statement is executed for any. This
transformation can only be performed when there is no feedback in and
between any of the statements within the loop. Detecting and analysing these
dependencies is the major task in vectorisation.
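The following pair of loops (our illustration, not taken from the original text) shows the distinction. In the first, each iteration reads a result produced by the previous one, so executing the statement for all index values at once would use the wrong values; in the second, only old values of A are read, so the statement-for-all-indices ordering gives the same result and the loop may be vectorised:

C     A feedback (true) dependence: iteration I needs the result of
C     iteration I-1, so the loop cannot be executed as one vector operation
      DO 10 I = 2,N
   10 A(I) = A(I-1) + B(I)
C     Only old values of A are read, so executing the statement for all
C     values of I at once gives the same result: vectorisable
      DO 20 I = 1,N-1
   20 A(I) = A(I+1) + B(I)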
where A.LT.0 supplies the vector mask which 'controls' the vector operation.
It is possible to vectorise both single-statement and the more complicated
multi-statement conditions using this masking technique.
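The conditional loop referred to above is not reproduced in this extract; a hypothetical fragment of the kind intended is:

      DO 10 I = 1,N
      IF (A(I).LT.0.0) A(I) = -A(I)
   10 CONTINUE

Under the mask A.LT.0 the negation is applied only to the selected elements, leaving the rest unchanged, so the whole loop becomes a single masked (controlled) vector operation.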
This may look contrived, but is typical of the indexing constructs which can
inhibit vectorisation. It is not recursive, although simply applying the ordering
transformations will produce the wrong results. The implied ordering would
put old values of A(7) into OLD(7); the transformed ordering puts new values
of A(7) (i.e. NEW(7 + 1)) into OLD(7). The problem then is a simple question
of timing and a good compiler would therefore reorder this calculation, or
provide temporary storage for the old values of A(7).
The equivalent loops below are both vectorisable:
The true recursive construct is similar to the above example and can be
expressed in one line, or hidden using multiple assignments. Two examples
are given below:
In the first case if either I or J were invariant within the loop being considered,
then this expression would be linear and vectorisable, otherwise it would not
be.
In this last example a recurrence in J has been isolated and the innermost
and outermost loops in M and L have been vectorised. Notice that although
the BSP hardware can handle first-order linear recurrence, the recurrence
vectorisation process will only be invoked if no other vectorisation can be
performed on a given set of DO loops. This is because the parallel evaluation
of a first-order linear recurrence is not 100% efficient (see §5.2).
Section 4.2 above gives one solution to the problem of portability, which is
to continue to use existing (sequential) languages and to make the system
responsible for generating the mapping of the problem or algorithm onto
the underlying hardware. However, techniques for automatically generating
parallelism are largely limited to optimisations of loop constructs for
execution on vector processors. The automatic exploitation of other forms
of parallelism, for example replicated systems, is not well understood and
results to date have shown poor utilisation of processor and communications
resources. One of the major problems in this respect is the lack of a formalism
for describing the structure of the problem within a consistent model of
computation which will facilitate a mapping of that structure onto the
underlying machine’s structure. This problem of data mapping is not
described well in a language which must use loop structures and indexing to
express it. Other more formal approaches treat whole array objects and
classes of mappings that can be applied to them (for example, see §3.3 and
Flanders (1982)).
Even targeted on a vector computer, the vectorising compiler is not the
ideal route to portability. The FORTRAN programmer will of course
optimise his sequential code, so that the compiler will recognise as much as
possible as being vectorisable. Because of the underlying differences in
the vector hardware, optimal loop lengths and access patterns into memory
will change from machine to machine. The attempt to achieve portability
through vectorisation therefore fails, as the programmer once again treats
FORTRAN as an assembly language.
An alternative solution places the onus on the programmer to explicitly
declare which parts of his code are to be executed in parallel. In this way a
parallel structure can be chosen to express the solution of the problem and
not the underlying machine structure. The portability problem now becomes
one of implementation, as the programmer's expression of parallelism must
now be mapped automatically and efficiently onto the target hardware.
Two techniques are available to explicitly express parallelism: by using a
description of the data structure to express parallelism, structure parallelism,
or by using a description of the program or process structure to express
parallelism, process parallelism. They can in some circumstances be very
similar, as a distributed program’s process structure may be designed to
exploit the structure of its data. To make a distinction in these circumstances,
assume that structure parallelism is defined at the granularity of a single
operation and that the operations are carried out as if simultaneously over
every element of a data structure. Process parallelism, on the other hand, is
defined with a large granularity, with an instruction stream and state
associated with each element, or (more likely) with each partition of the data
structure.
It is really the underlying computational model that distinguishes these
two forms of parallelism, and the manner in which load is balanced across
a system. In structure parallelism, one can consider virtual processors to be
associated one per data structure element, with activated data being mapped
onto the available processors in a manner which balances the load between
processors. Thus, load sharing is achieved through data-structure element
redistribution—data remapping. For example, if only one row of a distributed
matrix was selected for processing, a redistribution of the data elements in
that row may be required to maintain a load on all processors. This may
also be viewed as a redistribution of virtual processors to actual processors
to maintain an even load. In process parallelism, on the other hand, the
program partition or process is virtualised and load balancing occurs by the
distribution of processes across the processors. Process parallelism is explored
in more detail in §4.4: this section is concerned only with the exploitation
of structure parallelism.
Structure parallelism is in essence a formal method of expressing the result
of the vectorising transformations given in §4.2 above, as the operational
semantics of the vector processor are as described above for structure
parallelism. The vectorisation approach works for vector computers because
pipelined access to a single memory system can mask the data transformation
implicit in these semantics, and also because the efficiency of a pipeline is an
asymptotic function of vector length (see §1.3).
Structure parallelism has been used in many language extensions, usually
to express the SIMD parallelism found in processor arrays, e.g. DAP FORTRAN
(ICL 1979a). However, these languages have tended to express the parallelism
of the underlying hardware, thus leaving the programmer to define the data
transformations required to maintain a high processor load. They can
therefore be considered as low-level, machine-specific languages. The emerging
standard for FORTRAN, FORTRAN 8X (ANSI 1985), is proposing extensions
to allow any array to be manipulated as a parallel data structure. This is a
welcome proposal, although somewhat late and incomplete; it will, however,
allow the programmer to express the structure and parallelism inherent in
an algorithm, rather than in that of the hardware. We explore this emerging
language standard later in this section.
Using structure parallelism, the programmer is free to use virtually unbounded
parallelism in the expression of an algorithm. Completely general languages
of this type allow all data structures to be treated as objects on which
operations can be performed in parallel. This departure is best viewed as
allocating a virtual processor to each element of the data structure. Then, at
a given stage in an algorithm, some of these data-structure elements will be
activated, either explicitly by selection from the data structure (for example
a row from a matrix), or implicitly by a conditional construct in the language
(such as the WHERE statement in FORTRAN 8X). The compiler, or indeed
the system at run time, will then allocate activated virtual processors to real
processors in the system.
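A short sketch of such implicit activation, written in the WHERE form that survived into Fortran 90 (the 8X draft syntax may differ in detail, and the array sizes are ours):

      REAL A(1000), B(1000)
      WHERE (A .GT. 0.0)
         B = SQRT(A)
      ELSEWHERE
         B = 0.0
      END WHERE

Only the virtual processors holding positive elements of A are active for the square root; the remainder are active only for the assignment of zero.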
This idea of virtually unbounded parallelism could prove an embarrassment
in a process-based approach, where the overhead in creating many instruction
streams may be greater than the work involved. In both structure and process
parallelism, a transformation from parallel to sequential can be made at
compile time for a given target architecture, in order to optimise performance.
However, in the case of process parallelism, the transformation is not so
straightforward and it would be unlikely that a run-time system could be
developed to do this efficiently. In the case of data-structure parallelism, the
transformation is simple, merely one of ordering sets of similar operations.
reduce the rank of the object by selection. Thus, for example, given a
three-dimensional array A, it is said to have rank 3, and any reference to
the name A alone will imply a reference to all elements in parallel. Indexing
in any one of the dimensions of A may be considered as a selection operation,
which reduces the rank of A by one. Thus if all three dimensions are indexed,
a scalar or data object of rank 0 is selected from A. More details of selection
mechanisms are given below. It should be noted that this concept already
has some precedent in sequential languages; in FORTRAN for example, the
reference to an array name in the READ and WRITE statement implies a
reference to all elements of that array.
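For example (an illustration of ours, in the subscript-elision style used by the languages discussed below), given a rank-3 array A:

      A            the whole array: rank 3
      A(I,,)       a rank-2 plane of A, selected by indexing the first dimension
      A(I,J,)      a rank-1 vector of A
      A(I,J,K)     a single element: rank 0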
Although this approach is the most general, many language extensions
have made a compromise, in that the number of dimensions in which the
array can be considered as a parallel object are limited. Any further
dimensions must be indexed within the body of the code, thus providing sets
of parallel array objects of limited dimensions. This mixed approach has been
adopted almost exclusively on processor arrays, where the number of
dimensions which may be referenced in parallel corresponds to the size and
shape of the hardware. It has the advantage of distinguishing the parallel
and sequential access found in processor array memories, but the disadvantage
of being machine-dependent and hence non-portable.
Examples of this approach are found in CFD (Stevens 1975), GLYPNIR
(Lawrie et al 1975) and ACTUS (Perrott 1979), all languages proposed or
implemented for the ILLIAC IV. CFD and GLYPNIR manipulate one-
dimensional arrays of 64 elements, which map onto the ILLIAC array of
processors. ACTUS is more general, but the original implementation
developed for the ILLIAC IV allowed only one dimension of a PASCAL
array to be referenced in parallel. Another example of this mixed approach
is found in DAP FORTRAN (ICL 1979), where either the first or first two
dimensions of an array may be referenced in parallel. Again these first two
dimensions must correspond to the DAP size. Both DAP FORTRAN and
CFD are discussed further in §4.4.
Selecting reduced rank objects This mechanism is perhaps the most important,
be argued that the same should be true for indexing with arrays of integers
and that it should be a special case of the familiar indirect indexing found
in sequential languages. However, there are at least two ways in which this
indirect indexing may be used, and both cannot be represented by the same
syntax.
One interpretation gives a linear mapping or cartesian product of linear
mappings over more than one dimension. This is shown below, using familiar
sequential constructs, where IV and JV are the integer vectors defining the
mappings:
(a) A(I,JV(J))
(b) A(IV(I),JV(J))
Here (a) gives a linear mapping of the second dimension of A and (b) gives
a cartesian product of linear mappings in both dimensions of A. It can be
seen that if this use were applied in parallel, with subscript elision
(a) A(,JV)
(b) A(IV,JV)
390 PARALLEL LANGUAGES
then the resulting arrays are of the same rank as A, but have a range in the
dimensions selected given by the mapping vector. The mapping produced
may be one-to-many; however the values of the elements of the mapping
vector must all lie within the range of the dimension in which it is used. Thus
if A is an N x N array and IV and JV are vectors of range M, then all
elements of IV and JV must be less than or equal to N, and (a) would give
an N x M array and (b) an M x M array. Both are arbitrary mappings of
the elements of A.
The other interpretation of this construct gives a projection over one or more
dimensions of an array. This is a rank-reducing operation and is shown below
using similar sequential constructs. Here however JV is an integer vector
and JM an integer array, both of which define a slice of the array which is
projected over the remaining dimensions:
(c) A(I,JV(I))
(d) B(I,J,JM(I,J))
The important point to notice here is that the index vector JV and index
array JM have the same shape as the array slice they select, and are indexed
in the same way. When using subscript elision there is no way to differentiate
between the two uses of indirect indexing, as (c) and (d) yield the following
parallel constructs
(c) A(,JV)
(d) B(,,JM)
It can be seen that the syntax of the parallel construct given by (c) is identical
to that given by (a), but they have completely different semantics. Here if A
is an N x N array and B an N x N x N array, then JV must be an N-element
vector and JM an N x N array. Both reduce the rank of the arrays they are
indexing as if they were scalars; however, the slice obtained is not laminar
but projected over the elided dimensions. Again the values of the elements
of the index array must be less than or equal to the range of the dimension
from which they select. These two techniques are illustrated in figure 4.3, for
a 4 x 4 array.
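As a small worked example of our own, in the spirit of figure 4.3: let A be a 4 x 4 array and JV = (2,1,4,3). Under the mapping interpretation (a), A(,JV) is again a 4 x 4 array whose Jth column is column JV(J) of A, that is the columns of A in the order 2,1,4,3. Under the projection interpretation (c), A(,JV) is a four-element vector whose Ith element is A(I,JV(I)), that is (A(1,2), A(2,1), A(3,4), A(4,3)).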
It will be shown later that the projection variant of this structure, when used
in conjunction with another construct, is able to emulate the general mapping
technique. The obvious choice of semantics for this construct is therefore as
a projection. This convention is adopted in all later examples, unless otherwise
specified. This selection technique may be used in conjunction with other
indexing techniques in other dimensions in the array, including other index
arrays. However all indexed arrays must conform to the slice of the array
produced. Thus TABLE(,,I,J,X) is a valid selection from TABLE, even if
the selection would give an object of the same rank, but with only those
elements selected by the values ‘true’. Similar techniques can be used to
selectively update an array object. These are discussed in more detail later.
Shift indexing Strictly speaking this is not a selection mechanism; however
as it is often implemented as an indexing technique it is included in this
section for completeness. It is an alignment mechanism and can be used to
shift or rotate array objects along a given dimension. A good example of its
use is in mesh relaxation techniques (see §5.6.1), where each point on the
mesh is updated from some average of its neighbouring points. The simplest
case is where a point takes on the average values of its four nearest neighbours
in two dimensions. This could be expressed as below, where the syntax is
from DAP FORTRAN:
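The statement itself is missing from this extract; a sketch of the kind of DAP FORTRAN assignment meant, using +/- shift indexing in the two constrained dimensions (the exact syntax is quoted from memory and should be checked against the DAP FORTRAN manual), is:

      A = 0.25 * (A(+,) + A(-,) + A(,+) + A(,-))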
symbol, * for example, where A(*-N) would shift A in its first dimension
N places to the right.
Rank-reducing functions It has been shown that the rank of an array object
may be reduced by indexing or selection. Another way in which it may be
reduced is by the repeated use of a binary operator between elements in one
or more dimensions of the array. Although this could be described in the
language as a sequence of operations, this would not give the opportunity
of using parallelism for the reduction. For example the sum of N elements
can be performed in log2N steps in parallel (see §5.2.2). Table 4.1 gives a list
of the most common reduction operations. Each has two parameters, the
array and the dimension along which the reduction is to be performed.
Functions should also provide for a reduction over all of the dimensions of
the array. In APL similar functions are provided as composite operators
where @/ gives the reduction using the binary operator @. The syntax of
combining operators is considered in a paper by Iverson (1979).
Rank-increasing functions It is often necessary to increase the rank of an
object to obtain conformity. The most frequent use of this is in operations
between scalars and arrays. This is often termed broadcasting in parallel
TABLE 4.1 Common reduction functions; the second argument k gives the
dimension over which the reduction is performed.

SUM(A,k)    sum (+) over the index i_k of A(i_1,...,i_k,...,i_l)
PROD(A,k)   product (*) over the index i_k of A(i_1,...,i_k,...,i_l)
OR(B,k)     logical OR over the index i_k of B(i_1,...,i_k,...,i_l)
MAX(A,k)    maximum over the index i_k of A(i_1,...,i_k,...,i_l)
MIN(A,k)    minimum over the index i_k of A(i_1,...,i_k,...,i_l)
then this list defines the shape and mapping of A over an ordered set of
n_1 x n_2 x ... x n_r data locations (not necessarily contiguous). An element from this
set is located by providing an index list, which is used to generate
the mapping function f_D, giving the element's position in the set:
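The displayed definition of f_D is not reproduced in this extract; for the usual column-major (FORTRAN) ordering it would take the familiar form (our reconstruction):

      f_D(i_1,...,i_r) = 1 + (i_1 - 1) + (i_2 - 1)*n_1 + (i_3 - 1)*n_1*n_2 + ... + (i_r - 1)*n_1*n_2*...*n_(r-1)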
The MAP statement would redefine this mapping function over the same
ordered set of elements, by providing a MAP list m_1,...,m_q in the MAP
statement.
This redefines the shape of A, so that given a new index list i_1,...,i_q, an
element is selected from the set by the new mapping function f_M:
notation. Notice also that in this example, all NLM multiplications are
expressed in parallel without loss of generality in the algorithm.
which conform (i.e., both have shape (N,M,L)). The product of these arrays
is then reduced by summation over the middle dimension (i.e. for all M
subspaces of shape (N,L)), giving the required result, an N by L matrix, which
is assigned to MATMULT. Notice that in this form, declared with non-square
matrices, any error in specifying the appropriate dimension selectors
(1, 2 or 3) would produce a compile-time error, as either the multiplication
or assignment would not conform.
This expression of the algorithm implies no sequencing whatsoever (this
is now left to the compiler), so here all four variants of the algorithm
(§5.3.1-§5.3.4) can be extracted from this one code.
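The text of the MATMULT function itself is not reproduced in this extract. Purely as a sketch of the construction just described, written with the SPREAD and SUM intrinsics in the forms they later took in Fortran 90 (the function name and argument shapes are assumptions of ours):

      FUNCTION MATMULT(A, B) RESULT(C)
!       A has shape (N,M) and B has shape (M,L)
        REAL, INTENT(IN) :: A(:,:), B(:,:)
        REAL :: C(SIZE(A,1), SIZE(B,2))
!       Replicate A over a third dimension of extent L and B over a new
!       first dimension of extent N, so that both have shape (N,M,L);
!       multiply elementwise and sum over the middle (M) dimension.
        C = SUM(SPREAD(A, 3, SIZE(B,2)) * SPREAD(B, 1, SIZE(A,1)), DIM=2)
      END FUNCTION MATMULT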
end
end
repeat
This code, when used with a calling sequence assigning RECUR to F, gives
the transforms in place. It uses only the temporary storage defined by the
array-valued function and returns the transformed values in bit reverse order.
It is left as an exercise for the reader to recode the function using the flow
diagram in figure 5.12, to produce the results in natural order.
Again it should be noted that this description of the algorithm incorporates
all three schemes described in §5.5. Scheme A is obtained if the compiler
sequences the last dimension of the arrays and scheme B is obtained if the
compiler sequences the first dimension of the arrays. If, however, there is
      REAL M(,),V()
      INTEGER VI()
      LOGICAL ML(,),VL()

M(I,)     row I of M,
M(,J)     column J of M,
M(I,J)    element I,J of M,
V(I)      element I of V,
M(VL,)    row of M,
M(,VL)    column of M,
M(ML)     element of M,
V(VL)     element of V,
M(VI,)    vector containing M(VI(I),I) in element I,
M(,VI)    vector containing M(I,VI(I)) in element I.
In these examples, the logical vector VL and the logical matrix ML must have
one and only one element set to .TRUE., and the integer vector VI must have
all of its elements in the range 1 to N. Indexing is generalised in DAP FORTRAN
by allowing suitable expressions in the place of variables in the above
examples.
Routing can also be applied as an indexing operation using the + or -
symbol in either of the constrained dimensions; examples are given below.
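The examples themselves are missing from this extract; they would take roughly the following form (again, syntax quoted from memory and to be checked against the DAP FORTRAN manual):

      M(+,)     M shifted by one place along its first dimension
      M(-,)     the same shift in the opposite direction
      M(,+)     M shifted by one place along its second dimension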
(iv) Functions
Because expressions in DAP FORTRAN can be matrix- or vector-valued,
the function subprogram definition has been extended to include matrix and
vector valued functions, as well as the more usual scalar-valued function. The
type of the function is declared in the function statement. Thus
REAL MATRIX FUNCTION MATMULT
declares a function MATMULT which returns a real matrix valued result.
are based on information obtained from the ANSI X3J3 working document
(X3J3/S8, version 95) dated June 1985 and should not therefore be considered
as fixed. They may be changed, removed or added to prior to acceptance as
a proposed standard. Having given this disclaimer, we add that the array
features described here have been stable for some time.
The array extensions in the new language follow quite closely the proposals
outlined in §4.3.1. There has, however, been a new data type introduced as
a consequence of the addition of the array-processing features. This new type
is the BIT type, which has two values, '1' and '0', and together with bit
operations represents the two-value system of boolean mathematics. This
type has been added to support the use of boolean mask variables to enable
and disable array operations.
The rank of the array is given by the number of colons, so these arrays
have the same rank as the earlier explicit-shape array declarations. The size
and shape of such allocate-shape arrays are undefined until the array has
been allocated. No reference to it or any of its elements may be made until
the array has been allocated.
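A sketch of the declaration and allocation of such an array, written in the syntax that eventually appeared in Fortran 90 (the 8X draft form may differ, and N is assumed to hold the required extent):

      REAL, ALLOCATABLE :: A(:,:,:)
!     the rank (three) is fixed by the declaration; the shape is supplied
!     only when the array is allocated, at run time
      ALLOCATE ( A(N,N,2*N) )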
are supplied, the rank of the section selected is zero, which corresponds to
a scalar.
Alternatively, an array section may be selected by the use of index ranges
in appropriate subscript positions. The section is therefore defined (as is the
parent array) by the cartesian product of these index ranges. The subset is
selected from the index range by means of a section selector, which takes
one of two forms: a triplet defining base with extent and skip indices, giving
a monotonic sequence, or a rank-one integer array. Using the triplet
(separated by colons)
are both valid sections from the explicit-shape array defined above. The first
is a vector of ten elements, defined using subscripts in the last two dimensions,
and the second is a three-dimensional array formed using the odd indices
from the first dimension. Exactly the same sections could have been defined
using a rank-one array in the first subscript, i.e.
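The declarations and sections themselves are not reproduced in this extract. Purely for illustration, if the parent were declared as REAL A(10,5,5), sections of the two kinds described might be written:

      A(1:10,3,2)               a vector of ten elements (subscripts in the last two dimensions)
      A(1:9:2,1:5,1:5)          a rank-three array formed from the odd indices of the first dimension
      A((/1,3,5,7,9/),1:5,1:5)  the second section again, using a rank-one integer array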
In this example the array DIAG is an array of rank one, which is the ALIAS
array; it has been associated with the storage in the parent array ARRAY,
in such a way that the elements of DIAG map onto the diagonal of ARRAY.
In general, the mapping onto the parent array can be any linear combination
of the dummy subscripts of the alias array. There must be a range
specifier given for the dummy subscript which defines the range over
which the mapping is defined. In one statement, therefore, a linear subscript
transformation and dynamic range can be specified.
Another example, given below, redefines the first 100 elements of a rank-
one array VECTOR, to create a rank-two alias array ARRAY. In this example,
the two dummy subscripts define the extent of the two dimensions of ARRAY,
and the expected mapping of this aliasing is defined by the linear combination
of these used to subscript VECTOR.
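The IDENTIFY statements themselves are not reproduced here, and the draft syntax is not quoted; schematically, the two associations described are (the 10 x 10 shape of the second is purely our assumption):

      DIAG(I)      is identified with   ARRAY(I,I)              for I = 1, ..., N
      ARRAY(I,J)   is identified with   VECTOR(I + 10*(J-1))    for I = 1, ..., 10 and J = 1, ..., 10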
Any reference or assignment to the alias array will actually modify the
elements of the host array, as specified by the subscript mapping. In the
example given, therefore, an assignment to DIAG will modify the first N
locations of the leading diagonal of ARRAY.
An alias array, once IDENTIFYed can be treated just like any other array
object, including being subscripted or sectioned, being passed as an actual
argument to a subprogram, or even as a parent object to another IDENTIFY
statement. Non-deterministic use of a many-to-one ALIASing is not allowed,
as there are similar restrictions to those applied on vector selectors.
It can be seen, therefore, that this is a very powerful feature for selecting
subarrays. However, it is likely to prove difficult to implement on many
processor arrays, for example the ICL DAP, for it implies the run-time
manipulation of a global address space. It is far better suited to the vector
supercomputer, where access to memory is sequential and is specified by a
constant stride through memory, which simply provides the linear subscript
translation. On a processor array it could, for example, be implemented using
selective update over the parent structure; or, if this were too inefficient, it
would require assignment to an alias array, with subsequent mapping
and merging with the parent array. As this is a run-time construct, it is quite
likely that packet-switched communication will be required.
(vii) Array intrinsic functions and procedures
Because the semantics of the new language allows the manipulation of array
objects as first-class citizens, it allows their use as parameters in all the existing
intrinsic functions in FORTRAN. It does this by making all such functions
polymorphic, in much the same way as the operators are overloaded. Thus
SIN(A) would produce as a result an object which conforms to its argument.
Therefore, if A were a matrix of a given size or shape, the function would
return another matrix, of the same size and shape, but whose elements were
the values obtained by applying the SIN function to each of the argument’s
elements.
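For example (a sketch in the later Fortran 90 notation; the array size is ours):

      REAL A(64,64), B(64,64)
      B = SIN(A)     ! elementwise: B conforms to A in size and shape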
There are also many new intrinsic functions in the FORTRAN 8X proposal,
which have been added to support the array extensions to the language. For
example, there are enquiry functions which will provide the range, rank or
size of an array. There are also array manipulation functions, such as
MERGE, SPREAD, REPLICATE, RESHAPE, PACK and UNPACK. Most
of these functions have equivalent operators in the APL language (Iverson
1962).
This section is not intended to give a complete description of the new
FORTRAN standard, which in any case is not, at the time of writing, even
a draft proposal. It attempts instead to give a flavour of the constructs of
interest to users of array and vector parallel computers. The interested reader
is referred to the source document (Campbell 1987), or its successor, which
may be obtained from the Secretary of the ANSI X3J3 Committee. There is
also a paper by the two British delegates on the FORTRAN X3J3 Committee
(Reid and Wilson 1985) which gives examples of the use of the FORTRAN
8X features, and the book Fortran 8X Explained by Metcalf and Reid (1987).
4.3.4 CMLISP
In contrast to the languages described above, which are based on the array
structure and FORTRAN, this section describes a language based on the list
data structure and the common LISP language. An introduction to the LISP
language is given below (for continuity); however, the interested reader is
referred to one of the many introductory books on this language (Winston
and Horn 1981, Touretsky 1974).
In LISP, the central objects in the language are lists; even the functions
that operate on them and their definitions are lists. Indeed, functions defined
in LISP may take other functions as arguments or produce functions as
results. These are known as higher-order functions and their use provides a
very powerful and expressive programming environment. The reduction
operator ‘V in APL is a higher-order function; it takes another operator and
an array as its arguments and applies the operator between every element
of the array. For example, in APL. \ + A would sum the elements of A, and
\*A would form the product of elements of A.
In LISP the list is represented by a sequence of items, separated by space
and enclosed by parentheses. For example:
item— is an atom;
(iteml item2)—is a list containing two atoms;
((iteml item2))— is a list containing one item, which is itself a list
containing two atoms;
((iteml item2) item3)—is a list containing two items, one a list and
one an atom.
The basic operations defined in LISP provide for the construction and
dissolution of list objects, for example:
CAR—returns the first item (the head) of a list;
CDR— returns the tail of the list;
CONS— returns a list constructed from an atom and a list;
LIST—returns a list constructed from atoms.
CMLISP (Hillis 1985) was designed as a programming environment for the
connection machine and is based on common LISP, which has a long history
at MIT. It is designed to support the parallel operations of the connection
machine, which is a SIMD machine. However, because of the expressiveness
of the LISP language, it is possible to define MIMD-like operations in CMLISP.
However, CMLISP does reflect the control flow of the host computer and
microcontroller of the connection machine, while at the same time allowing
operations to be expressed over parallel data structures. Connection machine
LISP is to common LISP what FORTRAN 8X is to FORTRAN 77. The
language is fully documented in Hillis and Steele (1985).
The three artifacts that have been added to the common LISP language
to produce CMLISP are the xector, which is an expression of parallelism;
the alpha notation, which is a higher-order function expressing parallelism
of operation across a xector; and the beta reduction, which is a higher-order
function expressing reduction. The beta reduction can be equivalent to the
APL '/' operator described above.
(i) The xector
The parallel data structure in CMLISP is called the xector, which loosely
speaking corresponds to a set of processors or virtual processors and their
corresponding values. It is a parallel data object and can be operated on,
giving element-by-element results. In this sense it is reminiscent of the
FORTRAN 8X array. CMLISP supports parallel operations to create,
combine, modify and reduce xectors. Unlike DAP FORTRAN, the CMLISP
xector is not hardware-dependent; the xector size or scope is not constrained,
but can be of user-defined size. Indeed the xector’s size may vary dynamically.
Another difference between FORTRAN 8X and CMLISP arises from the
nature of the data structures involved. In FORTRAN the data structure is
the array, and a FORTRAN 8X array is, in essence, a parallel set of indices
and values. However, because the structure is implicitly rectangular, the
indices are not important. Manipulation of these indices is expressed in the
language as a set of data movement operators, which shift the array in a
given direction and for a given distance. In CMLISP the xector also comprises
an index/value pair, where both index and value are LISP objects. Moreover,
because the objects are lists, significant information will often be encoded in the
index set. In other words, the xector represents a function between LISP
objects. The domain and range are sets of LISP objects and the mapping
associates a single object in the range with each object in the domain. Each
object in the domain is an index and has a corresponding object in the range,
which is its value.
The implementation is such that it is assumed that each element of a xector
is stored in a separate processor where the index is the name of that processor,
an address stored in the memory of the host machine, and the value is the
value stored at that processor. Hillis (1985) introduces a notation for
representing xectors, as follows:
{John→Mary Paul→Joan Chris→Sue}.
Here the set of symbols John, Paul, Chris is mapped onto the set of symbols
Mary, Joan, Sue. This notation reflects the view of a xector as a function. A
special case of the xector is the set, where a set of symbols is mapped onto
itself, for example
{A→A B→B C→C},
which is equivalent to
{A B C}.
The last special case is the constant xector, which maps every possible
index value onto a constant value. For example, the xector which maps
all values onto the number 100 is represented by
{→100}.
This assignment sets the value of the symbol 'Wife-of' to the xector
defined above. The ' signifies that the item following is an atom. Having
assigned a xector, the symbol can be used in other functions; for example,
to reference a value
(XREF Wife-of 'John)
evaluates to Mary. Similarly XSET will change a value for a given index and
XMOD will add another index value pair, if they do not already exist.
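A minimal Python sketch may make the xector concrete: a dictionary stands in for the index-to-value mapping, and three small functions play the roles of XREF, XSET and XMOD described above (the Python rendering and function signatures are assumptions of the sketch, not CMLISP):

wife_of = {'John': 'Mary', 'Paul': 'Joan', 'Chris': 'Sue'}

def xref(xector, index):                 # look up the value held at an index
    return xector[index]

def xset(xector, index, value):          # change the value for an existing index
    if index not in xector:
        raise KeyError(index)
    xector[index] = value

def xmod(xector, index, value):          # add an index/value pair if it is absent
    xector.setdefault(index, value)

assert xref(wife_of, 'John') == 'Mary'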
Functions are also provided to convert between xectors and regular LISP
objects. For example,
(LIST-TO-XECTOR '(5 2 10))
evaluates to the xector special case
More interestingly,
which is a xector of + functions. When this object is applied to two xectors,
the effect of applying it is to perform an elementwise composition of the
values of the xectors. In FORTRAN 8X, this notation is implicit, as the
operators have been overloaded to represent both scalar and parallel
operations, depending on context. Thus
and
but if a or b evaluated to a xector, then the use of '•' would cancel the
application of α. Thus, if a were a xector,
The combined use of alpha and this use of the beta notation in CMLISP
provides a very powerful tool for constructing all manner of functions. A
number of function definitions which make use of this combination of
operators are given below. In these examples alpha provides broadcast and
beta provides reduction.
A final example defines the xector length, by summing a value of one, over
every index of the xector parameter. This is achieved by the device of defining
a function ‘Second— One’, which returns its second parameter. The use of
this function within an alpha expression, using the dotted xector parameter
x as the first parameter to the function, effectively produces a xector of ones,
of the same size as x:
(DEFUN Xector-Length (x) (β+ α(Second-One •x 1)))
(DEFUN Second-One (x y) y)
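The idea behind these two definitions can be imitated in Python: alpha becomes a map of a function over every value of a xector (here a dictionary), and beta with + becomes a reduction over the resulting values. This is only a sketch of the semantics; the names and representation are ours, not CMLISP:

from functools import reduce
import operator

def alpha(fn, xector, *extra):           # apply fn to every value of the xector
    return {idx: fn(val, *extra) for idx, val in xector.items()}

def beta_reduce(op, xector):             # reduce the values with the given operator
    return reduce(op, xector.values())

def second_one(x, y):                    # returns its second argument, as in the text
    return y

def xector_length(x):                    # a xector of ones, then a + reduction
    return beta_reduce(operator.add, alpha(second_one, x, 1))

assert xector_length({'John': 'Mary', 'Paul': 'Joan', 'Chris': 'Sue'}) == 3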
The use of beta as described in the examples above is only a special case
of a more generalised function. In its most general form, β takes two xectors
as arguments. It returns a third xector, which is created from the values of
the first xector and the indices of the second xector. This is illustrated in the
example below:
In fact this can be viewed as a routing operation, for it sends the values
of the first xector to the indices specified by the second xector. For example,
it is found that
The special case described above, with only a single xector argument,
assumes the second argument to be the constant xector, so that all values of
the first xector are reduced by the qualifying operator. The implementation
will define which index, or processor number, the reduction is performed to.
In a SIMD machine, this is naturally the control sequencer.
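Viewed operationally, the general two-xector beta is a combining send: values travel to the indices named by the second xector, and values that collide are combined with the qualifying operator. A small Python sketch of this idea (dictionaries stand in for xectors; the rendering is ours, not CMLISP):

import operator

def beta(op, values, destinations):
    # send each value to the index named by 'destinations'; combine collisions with op
    result = {}
    for idx, val in values.items():
        dest = destinations[idx]
        result[dest] = op(result[dest], val) if dest in result else val
    return result

# three counts routed to two bins, colliding values added together
print(beta(operator.add, {'a': 1, 'b': 2, 'c': 3}, {'a': 'x', 'b': 'x', 'c': 'y'}))
# {'x': 3, 'y': 3}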
4.4 PROCESS PARALLELISM
4.4.1 Introduction
As introduced in the previous section, there are two fundamental techniques
available to explicitly express parallelism; structure parallelism and process
parallelism. The previous section also described various implementations of
the first of these, structure parallelism. However, CMLISP, and even APL,
both provide mechanisms which treat operators as the basic values of the
parallel data structure. We could, of course, imagine whole programs as being
the components of the parallel structure, in which case we have a description
of process parallelism. To reiterate then, the distinction is one of granularity,
for structure parallelism was defined above as the granularity of a single
operation over every element of a data structure. Process parallelism, however,
requires distributed sequences of operations.
The underlying computational model and the manner in which load is
balanced across a system is also very different between these two approaches.
As has been shown in the previous section, in structure parallelism one can
consider the processors to be associated one per data structure element, with
activated data being mapped onto the available processors in order to balance
the load. Thus, load sharing is achieved through data structure element
redistribution. In process parallelism, however, the process is virtualised and
wait forever, or for a system time-out, as neither can write before reading.
Breaking the programmed symmetry breaks the deadlock.
It is the fundamental asynchrony, or global non-determinism, which makes
programming in this model difficult. Consider that we have n instruction
streams, and that it is possible to give a time reference for any instruction
in the overall system. Consider what the next state of the system is after a
given instruction has executed. In the absence of any synchronisation, there
are n possible choices of the next state of the system, following that n², etc.
It is this exponential growth of trace that makes debugging process-based
languages so very difficult. A classical situation is the manner in which a bug
will mysteriously vanish when code to trace the system’s behaviour is added
to a faulty program. This is, of course, because the timing of the system has
been altered.
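To make the growth concrete, consider just two independent streams of m instructions each: the number of possible interleavings is the binomial coefficient C(2m, m), which already explodes for small m. A tiny Python illustration (ours, not taken from the text):

from math import comb

for m in (2, 4, 8, 16):
    # number of distinct interleavings of two streams of m instructions
    print(m, comb(2 * m, m))      # 6, 70, 12870, 601080390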
This is one small section in a book covering all aspects of parallelism, and
we restrict ourselves here to a discussion of one process-based language, OCCAM.
The reason this language has been chosen is because it is a complementary
language to the transputer. Indeed, the OCCAM language is very intimately
related to the transputer hardware, and it is recommended that this section
and §3.5.4, describing the transputer, should be read together.
For more details about other process-based languages and a more general
discussion on the theory of this approach, the reader is referred to an excellent
introduction to this topic (Ben-Ari 1982), and the book Communicating
Sequential Processes by Hoare (1986).
In OCCAM, a process starts, performs some actions and then terminates. At the lowest level, the
primitives of the language are themselves considered as processes and are
called primitive processes. Composite processes can be constructed from
primitive or other constructed processes by a number of process constructors.
The scope of these constructors is indicated in the text of the program by a
fixed layout, with indentation of two spaces. For example,
SEQ
  A
  B
is read: perform in sequence first the process A and then the process B. A
similar constructor is used to express parallel execution of processes:
PAR
  A
  B
These constructors may be nested; for example:
PAR
  SEQ
    A
    B
  SEQ
    C
    D
Here two processes are executed in parallel; one is the sequence of process
A followed by B, the other is the sequence of process C followed by process D.
It is shown above how processes, including the language’s primitives,
can be combined to execute in sequence or in parallel, this latter being the
is formally equivalent to
SEQ
  A
  B
  STOP
as process C will never execute.
the space must be covered. For this reason, an IF statement will often finish
with a combination of TRUE and SKIP. For example:
IF
  a = 1
    A
  a = 2
    B
  TRUE
    SKIP
This code provides choices for a = 1 and a = 2, and covers the test space
with the TRUE condition. Because of the semantics of the statement, the
SKIP process will only be executed if a is not equal to one and is not equal
to two.
The IF statement provides deterministic choice within an OCCAM
program; the choice is determined by the value of variables within the scope
of the constructor, and the semantics of its execution. The same cannot be
said if choice were to be determined by the state of a channel, for the
state of channels cannot be determined at any given time because they
are asynchronous. Thus, non-deterministic choice is provided by another
constructor, the ALT. The ALT constructor provides a number of alternative
processes and, like the IF constructor, each process has a choice which
determines the process to execute. Unlike the IF constructor, the choices
must include an input on a channel; they may also contain a condition (as
in the IF constructor). The choices in an ALT constructor are called guards,
after Dijkstra (1975). For example:
running := TRUE
VAR x
WHILE running
  ALT
    chan1 ? x          -- input channels chan1, chan2 and output chan3 are assumed names
      chan3 ! x
    chan2 ? x
      chan3 ! x
    onoff ? ANY
      running := NOT running
This program will multiplex inputs received from channels 1 and 2 onto
channel 3, until it receives any input on the channel 'onoff'. The ALT process
will start, wait for one of the non-deterministic choices to be satisfied (in this
case by waiting for an input and evaluating the condition), execute the
guarded process, and then terminate.
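The behaviour of this guarded, non-deterministic choice can be approximated in Python: queues stand in for channels and a polling loop stands in for the blocking ALT. This is only a rough sketch under those assumptions (the channel names and the timing-based demonstration are ours; buffered queues are not OCCAM's synchronous channels):

import queue
import threading
import time

chan1, chan2, chan3, onoff = (queue.Queue() for _ in range(4))

def multiplexer():
    running = True
    while running:
        # poll each 'guard' in turn; a real ALT blocks until one input is ready
        for ch in (chan1, chan2):
            try:
                chan3.put(ch.get_nowait())       # forward the value, like chan3 ! x
            except queue.Empty:
                pass
        try:
            onoff.get_nowait()                   # any input on 'onoff'; data discarded
            running = not running
        except queue.Empty:
            pass
        time.sleep(0.001)                        # crude stand-in for blocking

chan1.put('a'); chan2.put('b')                   # queue some work before starting
t = threading.Thread(target=multiplexer)
t.start()
time.sleep(0.05)                                 # let the inputs be forwarded
onoff.put(None)                                  # signal termination
t.join()
print([chan3.get() for _ in range(chan3.qsize())])   # ['a', 'b']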
This program also illustrates a number of other aspects about OCCAM;
for example the input to ANY provides a signal, as the data associated with
the communication is discarded. Only the synchronisation is relevant. Notice
also how the program is terminated. The variable ‘running’ must be set as
a guarded process in this example, otherwise the ALT statement may not
terminate (an ALT process behaves like an IF process in this respect). For
example, it could be reset after the WHILE test, but before the ALT test, in
which case none of the guards would ever be satisfied on that pass of the
WHILE process. The program illustrates variable declaration, which can be
associated with any process or construct in the language, and its scope is
determined by the persistence of that process. Thus the variable x would be
allocated off the process stack prior to the execution of the WHILE
constructor, and reclaimed after it has terminated. Indeed it is possible,
although a little silly, to do the following:
(iv) Replicators
With sequential, as well as parallel, constructors, it is often desirable to use
replication. OCCAM allows this as an extension of the syntax of the respective
constructors. For example:
SEQ i = 0 FOR 5
  A
  B
will perform the sequence of processes A then B, five times. The processes
may contain arrays which use i as an index, or may use i as a label in other
ways, just as in other sequential loop constructs. The values taken by i are
[0 1 2 3 4]. The use of the parallel replicator is more interesting, for
FIGURE 4.9 The concept of the processor farm. (a) Model of the
processor farm. (b) A linear network of farm processes. (c) A tree network
of farm processes.
as Poisson’s equation. Of the many areas that we do not have the space to
consider, the most important are probably optimisation, root finding,
ordinary differential equations, full and sparse general linear equations, and
matrix inversion and eigenvalue determination. The reader is referred to
Miranker (1971), Poole and Voight (1974), Sameh (1977), Heller (1978) and
the references therein for a discussion of some of these topics. Other
perspectives on parallel computation are given by Kulisch and Miranker
(1983) and Kung (1980).
The analysis of a parallel algorithm must be performed within the
framework of a particular computational timing model. Here we use
the (r∞, n1/2) model of the timing behaviour of the hardware (§1.3.2) to
develop the n1/2 method of algorithm analysis for SIMD computation (§5.1.6),
and the s1/2 method of analysis for MIMD computation (§5.1.7). Many other
models exist, for example that of Bossavit (1984) for vector computation,
and Kuck (1978) and Calahan (1984) for MIMD computation.
the problem and on the skill with which the algorithm is implemented on
the computer by the programmer or compiler during the operation of coding.
In this chapter we examine the algorithms that are suitable for the solution
of a range of common problems on parallel computers. If the parallelism in
the algorithm matches the parallelism of the computer it is almost certain
that a high-performance code can be written by an experienced programmer.
However, it is outside the scope of this book to consider the details of
programming for any particular computer, and the reader is referred to the
programming manuals for his particular computer. Most parallel computers
provide a vectorising compiler from a high-level language, usually FORTRAN
with or without array processing extensions (see Chapter 4), and it should
be possible to realise most of the potential performance in such a language.
In most cases it will be necessary to code the key parts of the program in
assembler language— thereby controlling all the architectural features of the
machine—if the ultimate performance is to be obtained (for example supervector
performance on the CRAY X-MP, see Chapter 2).
5.1.2 Parallelism
At any stage within an algorithm, the parallelism of the algorithm is defined
as the number of arithmetic operations that are independent and can therefore
be performed in parallel, that is to say concurrently or simultaneously. On
a pipelined computer the data for the operations would be defined as vectors
and the operation would be performed nearly simultaneously as one vector
instruction. The parallelism is then the same as the vector length. On a
processor array the data for each operation are allocated to different
processing elements of the array and the operations on all elements are
performed at the same time in response to the interpretation of one instruction
in the master control unit. The parallelism is then the number of data elements
being operated upon in parallel in this way. The parallelism may remain
constant during the different stages of an algorithm (as in the case of matrix
multiplication, §5.3) or it may vary from stage to stage (as in the case of
SERICR, the serial form of cyclic reduction, §5.4.3).
The architecture of parallel computers is often such as to achieve the best
performance when operations take place on vectors with certain lengths (that
is to say, certain numbers of elements). We shall refer to this as the natural
hardware parallelism of the computer. The 64 x 64 ICL DAP, for example,
provides three types of storage and modes of arithmetic for vectors of,
respectively, length one (horizontal storage and scalar mode), length 64
(horizontal storage and vector mode) and length 4096 (vertical storage and
matrix mode). Although these modes are achieved through software, they
are chosen to match the hardware dimensions of the DAP array and thus
constitute three levels of natural parallelism, each with its own level of
performance (see §3.4.2). On pipelined computers without vector registers, such
as the CYBER 205, the average performance (equations (1.9) and figure 1.16)
increases monotonically as the vector length increases, and one can only say
the natural hardware parallelism is as long as possible (up to maximum
vector length allowed by the hardware of the machine, namely 64K — 1). On
pipelined computers with vector registers, such as the CRAY X-MP, the
performance is best for vector lengths that are multiples of the number of
elements held in a vector register. In the case of the CRAY X-MP with vector
registers holding 64 elements of a vector, the natural parallelism is 64 and
multiples thereof.
The objective of a good programmer/numerical analyst is to find a method
of solution that makes the best match between the parallelism of the algorithm
and the natural parallelism of the computer.
where we assume that the processing elements have the same hardware
performance on the para- and actual computers.
One may also use the ratio of the execution times on the paracomputer
for two algorithms as a relative measure of their performance. However, this
measure may not be a good indication of the relative performance on actual
computers, because it ignores the time required to transfer data between
distant processing elements of the array (the routing delays). Grosch (1979)
has used the concept of the paracomputer to compare the performance of
different algorithms for the solution of Poisson’s equations on processor
arrays with different interconnection patterns between the processing elements.
He compares the common nearest-neighbour connections, as available in the
ICL DAP (see §3.4.2), with and without long-range routing provided by the
perfect shuffle interconnections as proposed by Stone (1971).
Although devised for the comparison of processor arrays, the concept of
the paracomputer can be related to the study of pipelined processors. The
para-computer corresponds to a pipelined processor with zero start-up time,
no memory-bank conflicts and a half-performance length n1/2 of infinity.
Remembering that n1/2 = s + l − 1 (see equation (1.6b)), where s is the set-up
time and l the number of subfunctions that are overlapped, one can see that
the paracomputer is approached as l, the extent of parallel operation in the
pipeline, becomes large. At the other extreme one can define a perfect serial
processor as a pipeline processor with only one subfunction (i.e. the arithmetic
units are not segmented) and which also has no set-up time or memory
conflicts. The perfect serial processor therefore corresponds to a pipelined
processor with n1/2 = 0. Actual pipelined designs, with a finite number of
subfunctions and therefore a finite value of n1/2, then lie between the perfect
serial processor and the para-computer, depending on their half-performance
length. Characterised in this way, we obtain the spectrum of computers
described in §1.3.4 and shown in figure 1.15.
(i) Vectorisation
The first stage, however, in converting an existing program to run on a vector
computer is the reorganisation of the code so that as many as possible of
the DO loops are replaced by vector instructions during the process of
compilation using a vectorising compiler. This process of vectorisation may
be all that is done, and leaves a program comprised of two parts: a scalar
part to be executed by the scalar unit of the computer, and a vector part to
be executed by the vector unit. It is quite usual for the r∞ of the scalar unit
to be ten times slower than the r∞ of the vector unit, so that the execution
time of the algorithm may depend primarily on the size of the scalar part of
the code, and rather little on the efficiency with which the vector part of the
code is organised (see §1.3.5 and figure 1.19). However, there are many
algorithms associated particularly with the solution of partial differential
equations, in which all the floating-point arithmetic can be performed by
vector instructions, and we give some examples in this chapter. For these
vector algorithms the best organisation of the vectors (i.e. the choice of the
elements composing them and their length) is of critical importance.
T = r∞^(-1)(s + q n1/2)    (5.3)
or as
T = r∞^(-1) s [1 + n1/2/h]    (5.4c)
where h = s/q. The expression in square brackets in equation (5.4c) is the factor
by which a traditional serial complexity analysis, r∞^(-1) s, is in error when
applied in a vector environment. It also shows that it is not the absolute
value of n1/2 which is important, but its ratio to the average length of a vector
operation in the algorithm: i.e. n1/2/h.
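The timing estimate is easily mechanised. The following Python sketch is our rendering of the model described above (s elemental operations performed in q vector operations on a unit with asymptotic rate r_inf and half-performance length n_half); it is an illustration, not code from the text:

def vector_time(s, q, r_inf, n_half):
    # every vector operation pays an n_1/2 start-up penalty
    return (s + q * n_half) / r_inf

def vector_time_via_h(s, q, r_inf, n_half):
    # the same time written with the average vector length h = s/q
    h = s / q
    return (s / r_inf) * (1.0 + n_half / h)

# the two forms agree, e.g. 1000 operations done as 10 vector operations
assert abs(vector_time(1000, 10, 100.0, 20.0)
           - vector_time_via_h(1000, 10, 100.0, 20.0)) < 1e-9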
factor. Thus, within the approximations of this analysis, n1/2 is the only
property of the computer that affects the choice of algorithm.
When comparing algorithms, equal performance lines (EPL) play a key role.
If T(a) and T(b) are the execution times for algorithms (a) and (b) respectively,
then the performance of (a) equals or exceeds that of (b), P(a) ≥ P(b), if
T(b) ≥ T(a), from which one obtains
n1/2 [q(b) − q(a)] ≥ s(a) − s(b).    (5.5)
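In the same spirit, the equal performance line between two algorithms with operation counts (s_a, q_a) and (s_b, q_b) on the same unit reduces to a single critical value of n1/2. A minimal Python sketch of the comparison (our construction; the example counts for the cascade method are the standard ones, which are not reproduced in this extract):

import math

def epl_n_half(s_a, q_a, s_b, q_b):
    # n_1/2 at which (s_a + q_a*x) equals (s_b + q_b*x); the algorithm with
    # the smaller q is at least as fast for n_1/2 beyond this value
    return (s_a - s_b) / (q_b - q_a)

n = 1024
s_seq, q_seq = n - 1, n - 1                          # sequential partial sums
q_cas = int(math.log2(n))
s_cas = n * q_cas - (n - 1)                          # cascade partial sums
print(epl_n_half(s_cas, q_cas, s_seq, q_seq))        # about 8: cascade wins above this n_1/2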
In the case of the chemical phase diagram, the values of parameters describing
the conditions (e.g. temperature and pressure) determine a point in a
parameter plane that is divided into regions in which the different states of
matter have the lowest energy. One could say that nature then chooses this
state from all others as the best for the material. In the case of the algorithmic
phase diagram, parameters describing the computer and problem size
determine a point in a parameter plane that is divided into regions in which
each algorithm has the least execution time. This algorithm is then chosen
as the best.
It is helpful to adopt certain standards in the presentation of such
algorithmic phase diagrams, in order to make comparisons between them
easier. It is good practice to make the x axis equal to or proportional to n1/2,
and the y axis equal to or proportional to the problem size n. In this way
algorithms suitable for the more serial computers (small n1/2) appear to the
left of the diagram, and those suitable for the more parallel computers
(large n1/2) to the right. Similarly, algorithms suitable for small problems are
shown at the bottom of the diagram, and those suitable for large problems
at the top. A logarithmic scale is usually desirable for both axes, and in the
simplest case of the (n1/2, n) plane the horizontal axis specifies the computer
and the vertical axis the problem size. Some examples of algorithmic phase
diagrams are given in figures 5.3, 5.9, 5.10, 5.26, 5.27 and 5.28 (Hockney 1982,
1983). A simple example of their preparation and interpretation is given in
§5.2.3. This compares two algorithms. A more complicated example comparing
four algorithms is given in Hockney (1987a). Figure 5.28 compares two
algorithms, giving the best value of an optimisation parameter to use in any
part of the phase plane.
We have seen in equation (5.4c) that the ratio n1/2/h is more important
in the timing than n1/2 itself. Similarly, we find that algorithmic phase
diagrams are usually more compactly drawn if the x axis is equal to the ratio
of n1/2 to problem size: n1/2/n. This ratio, rather than n1/2 itself, determines
whether one is computing in a serial environment (small values) or a parallel
environment (large values). The ratio also has the advantage of being
independent of the units used to measure n.
(5.8a)
where (r∞,k, n1/2,k) are the parameters for the kth type of arithmetic operation,
and s_k and q_k are the operations' counts for the kth type of arithmetic
operation. The time for the algorithm is then given by
(5.8b)
where s = Σ_k s_k and q = Σ_k q_k are the total operations' counts, as before.
The calculation of the above average values is similar to calculating the
average performance of a computer, using different mixes of instructions (for
example, the Gibson mix and Whetstone mix). The weights used in the above
expressions (5.8), (s_k/s) and (q_k/q), are the fraction of the arithmetic which
is of type k and the fraction of the vector operations which are of type k
respectively. The important fact is that these ratios will be relatively
independent of the problem size n, and the analysis can proceed as before
using the averaged parameters in place of r∞ and n1/2. In most cases, however, it will be adequate to treat
r∞ and n1/2 as constant and interpret the algorithmic phase diagrams for the
range of parameter values that arises.
the same analysis, simply as separate instruction streams. The essence of the
work segment is that the computation is synchronised before and after each
segment. That is to say all the work in a segment must be completed before
the next segment can begin. The time t_i to execute a work segment is therefore
the sum of the time to execute the longest of the instruction streams plus the
time to synchronise the multiple instruction streams: that is to say, from
equation (1.16)
t_i = r∞^(-1)(s_i/E_i + s1/2)    (5.9)
where s_i is the number of floating-point operations between pairs of numbers
in the ith work segment, subsequently called the amount of work in the
segment, or the grain of the segment; E_i is the efficiency, E_p, of process
utilisation in the ith work segment (the subscript p is now dropped); r∞
is the asymptotic (or maximum possible) performance in Mflop/s as before;
and s1/2 is the synchronisation overhead measured in equivalent floating-
point operations.
In equation (5.9) the value of both r∞ and s1/2 will depend on the number
of instruction streams or processors used. For example, if there are p
processors the r∞ in equation (5.9) is p times the asymptotic performance of
one processor. The value of s 1 / 2 also depends on the type of synchronisation
and the efficiency of the software tools provided for synchronisation.
Measured values of (r∞, s1/2) for a variety of different cases are given for the
CRAY X-MP in §2.2.6 (Hockney 1985a), for the Denelcor HEP in
Chapter 3 (Hockney 1984a, 1985c, Hockney and Snelling 1984), and for the
FPS-5000 in Curington and Hockney (1986).
Having established the timing for a work segment, we are now in a position
to consider a MIMD program. In any MIMD program there will be a critical
path, the time of execution of which is the time of execution of the whole
program. In some cases the critical path is obvious (there may only be one
path), and in other cases it may be very difficult to determine or even be
data-dependent and therefore unknown until run time. However, to proceed
further with a timing analysis we must assume that the critical path is known.
The time T for a MIMD algorithm is calculated by summing equation (5.9)
for each work segment along the critical path, giving
T = r∞^(-1)(s/E + q s1/2)    (5.10)
where q is the number of work segments along the critical path of the
algorithm, s = Σ_i s_i is the total amount of work along the critical path and
E = s/(Σ_i s_i/E_i) is the average efficiency of process utilisation along the
critical path of the algorithm.
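A small Python rendering of this segment-by-segment accounting (the functional form follows the segment timing described above; the segment work s_i and efficiencies E_i are supplied by the user, and the figures below are illustrative only):

def mimd_time(segments, r_inf, s_half):
    # each (s_i, E_i) work segment on the critical path costs (s_i/E_i + s_half)/r_inf
    return sum((s_i / e_i + s_half) / r_inf for s_i, e_i in segments)

# e.g. three segments of 1000 operations at 90% efficiency, with s_1/2 = 200
print(mimd_time([(1000, 0.9), (1000, 0.9), (1000, 0.9)], r_inf=50.0, s_half=200.0))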
For algorithms that fit into the above computational model, we see that
5.2 RECURRENCES
x_j = Σ(k=1 to j) d_k,    j = 1, 2, ..., n,    (5.11b)
where x_j is the sum of the first j numbers in the sequence d_1, ..., d_n.
The partial sums may be evaluated simply from the recurrence
x_j = x_(j−1) + d_j,    x_0 = 0.    (5.11c)
(5.12a)
FIGURE 5.1 The routing for the sequential sum method for forming
all the partial sums of eight numbers, x_j = Σ(k=1 to j) d_k, j = 1, ..., 8. The
vertical axis is time, the horizontal axis is storage location or processing
element number. Routing of data across the store is indicated by an
arrow.
We illustrate this algorithm in figure 5.1 for the case n = 8. The sequence of evaluations
takes place from the bottom moving upwards; operations that can be
evaluated in parallel are shown on the same horizontal level. It is clear that
at each time level only one operation can be performed (parallelism = 1),
and we say that the sequential sum algorithm has
(5.12b)
and the operations’ counts are
(5.12c)
(5.13a)
FIGURE 5.2 (a) The routing diagram for the parallel cascade sum method of
forming partial sums. Zeros are brought in from the left as the vector of accumulators
is shifted to the right. If only the total sum x_8 is required, then only the operations
shown as full circles and bold lines need be performed. (b) Martin Oates' method of
computing all partial sums.
(5.14b)
and we note that the cascade total sum algorithm has the same number of
scalar operations as the sequential sum method, even though the reorganisation
of the calculation into a binary tree has introduced the possibility of parallel
operation. This total sum method is the algorithm normally referred to as
the cascade sum method.
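The cascade method for all partial sums is easy to state in code: at each step the vector of accumulators is added to a copy of itself shifted right by a doubling offset, zeros entering from the left. A Python sketch of this (ours, with a list standing in for the vector of accumulators):

def cascade_partial_sums(d):
    x = list(d)
    n = len(x)
    shift = 1
    while shift < n:
        # one parallel step: x[j] += x[j - shift], zeros shifted in from the left
        x = [x[j] + (x[j - shift] if j - shift >= 0 else 0) for j in range(n)]
        shift *= 2
    return x

print(cascade_partial_sums([1, 2, 3, 4, 5, 6, 7, 8]))   # [1, 3, 6, 10, 15, 21, 28, 36]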
It has been pointed out by Martin Oates (private communication) that
all partial sums can also be computed in a parallel fashion with approximately
half the number of redundant arithmetic operations that are present in the
original cascade partial sum method shown in figure 5.2(a). Martin Oates’
variation is shown in figure 5.2(b). The partial sums are built up hierarchically.
First, adjacent pairs are added, then pairs of these are combined to form all
the partial sums of groups of adjacent four numbers, then these are combined
in pairs to form sums of eight, and so on. At each level n/2 additions are
performed in parallel, and there are log2n levels as before, giving operations’
counts of
(5.14c)
It is a matter of judgement whether this saving in arithmetic makes up for the
extra complexity of the program. Oates' method turns out to be
a special case of a class of algorithms for the so-called 'parallel prefix
calculation' that have been described by Ladner and Fischer (1980). In the
following section we will analyse the performance of the original method,
and leave it as an exercise for the reader to calculate the effect of using the
Oates’ variation.
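A Python sketch of the hierarchical build-up just described (our reading of figure 5.2(b), with n assumed to be a power of two): at each level the running total of every left half-block is added to each element of the matching right half-block, giving n/2 additions per level and log2 n levels.

def oates_partial_sums(d):
    x = list(d)
    n = len(x)
    block = 2
    while block <= n:
        half = block // 2
        for start in range(0, n, block):
            left_total = x[start + half - 1]        # partial sum of the left half-block
            for j in range(start + half, start + block):
                x[j] += left_total                  # n/2 additions per level, done in parallel
        block *= 2
    return x

print(oates_partial_sums([1, 2, 3, 4, 5, 6, 7, 8]))   # [1, 3, 6, 10, 15, 21, 28, 36]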
or
(5.15b)
The algorithmic phase diagram corresponding to equation (5.15b) is given
in figure 5.3(a) for a ratio of vector-to-scalar asymptotic speeds of
R∞ = 2, 5, 10, 50, 100. The diagram is interpreted as follows. Lines of constant
n1/2 lie at 45° to the axes, and the line for n1/2 = 20 corresponding to the
CRAY-1 is shown. This computer has R∞ ≈ 10, and the intercept between
the line of constant n1/2 and the line of constant R∞ gives the break-even
problem size. Roughly speaking, the conclusion is that the summation of up
to about eight numbers should be done in the scalar unit (or by scalar
instructions), whereas larger problems should be done in the vector unit. It
is obvious, and borne out by the diagram, that as the ratio of vector-to-scalar
speed decreases, the scalar unit should be used for longer vector problems.
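The break-even point can be reproduced numerically. Assembling the timing model of this chapter with the operation counts quoted in §5.2 (n − 1 additions either way, log2 n vector operations for the cascade total sum), the short Python sketch below (our construction; the precise formulas behind figure 5.3(a) are not reproduced in this extract) finds the smallest n for which the vector unit wins:

import math

def t_scalar(n, r_scalar):            # sequential total sum: n - 1 scalar additions
    return (n - 1) / r_scalar

def t_vector(n, r_vector, n_half):    # cascade total sum: n - 1 operations, log2(n) vector steps
    return ((n - 1) + math.ceil(math.log2(n)) * n_half) / r_vector

r_scalar, R_inf, n_half = 1.0, 10.0, 20.0    # CRAY-1-like figures used in the text
n = 2
while t_vector(n, R_inf * r_scalar, n_half) > t_scalar(n, r_scalar):
    n += 1
print(n)    # 8: sums of fewer numbers than this are better done in the scalar unit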
In considering the calculation of all partial sums, we have three alternatives
to consider; the use of the sequential and cascade partial sum methods in
the vector unit, and the sequential method in the scalar unit. We need not
consider the use of the cascade partial sum method in the scalar unit because it
will always perform worse than the sequential method, since it has more
arithmetic. There are therefore three equal performance lines to compute, one for
each of the three possible pairs of algorithms.
FIGURE 5.3 cont. (b) Comparison of three algorithms for the calculation
of all partial sums. The sequential and cascade method executed in the
vector unit are compared with the sequential method executed in the
scalar unit. The ratio of vector to scalar speed is R∞ = 10.
(5.16b)
Given the above operations’ counts, the formula for the equal performance
line between the two methods can be immediately written down from
equation (5.5):
(5.16c)
The comparison of the sequential method in the scalar unit with the above
two methods in the vector unit can be immediately written down using the
operations’ counts (5.16a,b). The equal performance lines obtained by
substituting in equation (5.7) are: with the sequential method
(5.17a)
unit. For example, if n1/2 = 20, problem sizes less than n ≈ 10 and greater
than n ≈ 2000 should be solved in the scalar unit. In the former case the
vectors are not long enough to make up for the vector start-up overhead,
and in the latter case too much extra arithmetic is introduced to make the
cascade method worthwhile. In the intermediate region 10 ^ n < 2000 the
use of the vector unit with its higher asymptotic performance is worthwhile.
However, for n1/2 ≥ 120 (in the case of R∞ = 10), the vector start-up overhead
is too large ever to be compensated by the higher vector performance, and
the sequential sum method in the scalar unit is always the best. Figure 5.3(b)
can be drawn for other values of R∞, as was done in figure 5.3(a). The
boundary line between the use of the scalar and the vector unit will move
further to the right as R∞ increases and to the left as it decreases. The equal
performance line between the two algorithms executed in the vector unit,
however, will remain unchanged.
If the partial sum problem is being solved on the paracomputer, or a finite
processor array with more processing elements than there are numbers to
be summed, then we take n1/2 = ∞. In this case the time is proportional to
the number of vector operations, q. Consequently, the cascade method is
always the best, because it only has log2 n vector operations compared with
( n — 1) for the sequential method. If an algorithm is being chosen for a
processor array, it is also necessary to consider carefully the influence of the
necessary routing operations on the time of execution. The number of routing
operations is the same for the cascade and sequential methods (both n — 1
routings, see equations (5.12d) and (5.13d)). Therefore, the inclusion of the
time for routing will not affect the choice of algorithm. However, depending
on the relative time for a routing and an arithmetic operation, and the
algorithm under consideration, the time for routing may be an important
component of (or even dominate) the execution time. If routing is y times
faster than arithmetic, that is to say
then the ratio of time spent on routing to the time spent on arithmetic is,
for the cascade sum method,
(5.19a)
(5.19b)
The above shows that routing will always dominate arithmetic for vectors
longer than n3, where
(5.19c)
Since y = 10 is a typical value, this will happen for any but the most trivial
problems. This shows that there is no point in improving the arithmetic speed
of a processor array unless a corresponding reduction in routing time
can be made.
One way of eliminating the time spent on routing is to provide some
long-range connections between the processors. In the above introductory
analysis we have assumed for simplicity a one-dimensional array with
nearest-neighbour connections. Most large processor arrays are, however,
configured as two- or multi-dimensional arrays. The ICL DAP, for example,
is configured as a two-dimensional array of 64 x 64 processors. If these are
treated as a single vector of 4096 elements written row by row across the
array, then a single routing operation to the nearest-neighbour connections
between adjacent rows results in a shift of 64 places. The more general
case of data routing in k-dimensional arrays is discussed in §5.5.5 and in
Jesshope (1980a, b, c). Other interconnection patterns, such as the perfect
shuffle, offer other long-range connections (see §3.3.4 and §3.3.5). The reader
is invited to consider the effect of such connections on the comparisons
made above.
(5.20)
This requires
2n arithmetic operations with parallelism 1,    (5.21a)
and
n routings with parallelism 1.    (5.21b)
The routing diagram for the sequential algorithm is given in figure 5.4. For
simplicity, we shall not count separately the different types of arithmetic
operation although they may have different execution times. The ratio of a
multiplication time to an addition time rarely exceeds two and is often quite
close to unity. In particular, on a pipelined computer both the addition and
FIGURE 5.4 The routing diagram for the sequential evaluation of the
general first-order recurrence. In this case n = 4. Variables linked by a
brace are stored in the same PE. One PE is used to evaluate each term of
the recurrence.
multiplication pipes, when full, deliver one result every clock period. For
n > n1/2 the average time for an addition or multiplication operation is very
nearly the same.
The equivalent parallel algorithm to the cascade sum method is known as
cyclic reduction, and has a wide application in numerical analysis, particularly
when one is trying to introduce parallelism into a problem. For example we
will use it again to solve tridiagonal systems of algebraic equations in a
parallel fashion (see §5.4.3). The original recurrence (equation (5.11a)) relates
neighbouring terms in the sequence, namely x_j to x_(j−1). The basic idea of
cyclic reduction is to combine adjacent terms of the recurrence together in
such a way as to obtain a relation between every other term in the sequence,
that is to say to relate x_j to x_(j−2). It is found that this relation is also a linear
first-order recurrence—although the coefficients are different and the relation
is between alternate terms. Consequently the process can be repeated (in
a cyclic fashion) to obtain recurrences relating every fourth term, every eighth
term, and so on. When the recurrence relates terms n apart (i.e. after log2 n
levels of reduction), the value at each point in the sequence is related only
to values outside the range which are known or zero, hence the solution has
been found. When the method is used on serial computers, the number of
recurrence equations that are used is halved at each successive level, hence
the term ‘cyclic reduction’. On a parallel computer we are interested in
keeping the parallelism high and will not, in fact, reduce the number of
equations that are used at each level. Therefore the name of the algorithm
is somewhat misleading.
We will now work out the algebra of the cyclic reduction method. Let us
write the original recurrence relation for two successive terms as:
x_j = a_j x_(j−1) + d_j    (5.22a)
and
x_(j−1) = a_(j−1) x_(j−2) + d_(j−1).    (5.22b)
Substituting equation (5.22b) into equation (5.22a) we obtain
x_j = a_j a_(j−1) x_(j−2) + a_j d_(j−1) + d_j    (5.23a)
or
x_j = a_j^(1) x_(j−2) + d_j^(1),    (5.23b)
where equation (5.23b) is a linear first-order recurrence between alternate
terms of the sequence with a new set of coefficients. After l levels of reduction
x_j = a_j^(l) x_(j−2^l) + d_j^(l),    (5.24a)
where
a_j^(l) = a_j^(l−1) a_(j−2^(l−1))^(l−1),    (5.24b)
d_j^(l) = a_j^(l−1) d_(j−2^(l−1))^(l−1) + d_j^(l−1),    (5.24c)
and initially
a_j^(0) = a_j,    d_j^(0) = d_j.    (5.24d)
It might be thought at first sight that equation (5.24c) could not be evaluated
in parallel. After all, the equation for d_j^(l) has superficially the same appearance
as the original sequential recurrence with x replaced by d. The fundamental
difference between equation (5.24c) and the original recurrence (5.22a) is that
the values of d_(j−2^(l−1))^(l−1) and d_j^(l−1) on the right-hand side of
equation (5.24c) are all known values that were computed at the previous level (l−1). These
d^(l−1) are distinct variables from the d_j^(l) on the left-hand side. The latter
{d_j^(l), j = 1, ..., n} may therefore be evaluated by a single operation or vector
instruction. These relationships are clarified by the routing diagrams for the
evaluation of a_j^(l) (figure 5.5) and d_j^(l) (figure 5.6).
In these diagrams we show only the values of a_j^(l) and d_j^(l) that are actually
used, and the arithmetic operations that are necessary. We note that only
about half the values of a_j^(l) are required. In particular none is required at
l = 3. The amount of parallelism varies from about n at the start to
approximately n/2 at the final level. On pipelined computers which have a
variable vector length, it would increase performance to reduce the vector
length to the correct value at each level. On processor arrays with n ≤ N or
the paracomputer, in which the execution time is not affected by the vector
length, the parallelism can be kept equal to n at each level by loading
a_j = 0 for −n/2 ≤ j ≤ 0 and d_j = 0 for −n ≤ j ≤ 0, or otherwise contriving
that out-of-range values are picked up as zeros.
It is obviously quite complex to evaluate the performance of the cyclic
reduction algorithm, taking into account the reduction in parallelism at each
level. We may however get a lower bound by assuming that the parallelism
remains as n at each level, and that all the a_j^(l) are calculated. The
average parallelism is [(n − 1) + (n − 2) + ... + (n − 2^r) + ... + n/2]/log2 n
= n[1 − (1 − n^(−1))/log2 n], hence, asymptotically for large n, we have
3 log2 n arithmetic operations with parallelism n
FIGURE 5.5 The routing diagram for equation (5.24b), the parallel
calculation of the coefficients a_j^(l) in the cyclic reduction algorithm applied
to the general linear first-order recurrence for the case n = 8.
FIGURE 5.6 The routing diagram for equation (5.24c), the parallel
calculation of the coefficients d_j^(l) in the cyclic reduction algorithm applied
to the general linear first-order recurrence. The values of a are supplied
from the calculation shown in figure 5.5, which proceeds in parallel with
this figure.
and    (5.25)
2(n − 1) routing operations with parallelism n.
We leave it as an exercise for the reader to carry out a similar analysis
to that performed earlier for the sequential and cascade sum methods,
and to take into account fully the reduction in parallelism at each level. The
above general first-order recurrence becomes the cascade sum for the special
case a_j = 1 for all j, which makes all the multiplications in figure 5.6 and
the whole of figure 5.5 redundant.
The cyclic reduction algorithm can be implemented in a vector form of
FORTRAN by the code:
X = D
DO 1 L = 1, LOG2N
X = A*SHIFTR(X, 2**(L-1)) + X    (5.26a)
1 A = A*SHIFTR(A, 2**(L-1))    (5.26b)
where X, D and A have been declared as vectors. In the implementation
of the above code, we note that the main memory address of the vector X
and of the shifted vector SHIFTR(X, 2**(L-1)), which are both required for the
evaluation of statement (5.26a), are separated by a power of two. The same
is true for the vector A in statement (5.26b). Memory-bank conflicts (see
§5.1.5) are therefore likely to be a serious impediment to the rapid evaluation
of the cyclic reduction algorithm for serial and pipelined computers with the
number of memory banks equal to a power of two. The Burroughs BSP,
because of its choice of a prime number of banks (17), does not suffer from
this problem (see §3.3.8).
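A Python sketch of the same doubling process may be helpful (ours; it mirrors the vector code above, with Python lists standing in for the vectors and out-of-range elements taken as zero):

def cyclic_reduction_recurrence(a, d):
    # solves x[j] = a[j]*x[j-1] + d[j], with x[-1] taken as zero, in ceil(log2 n) steps
    n = len(d)
    x = list(d)                    # x(0) = d
    m = list(a)                    # a(0) = a
    shift = 1
    while shift < n:
        x = [x[j] + m[j] * (x[j - shift] if j - shift >= 0 else 0.0) for j in range(n)]
        m = [m[j] * (m[j - shift] if j - shift >= 0 else 0.0) for j in range(n)]
        shift *= 2
    return x

# quick check against the sequential recurrence
a, d = [0.0, 2.0, -1.0, 0.5, 3.0], [1.0, 1.0, 2.0, 1.0, 1.0]
seq, prev = [], 0.0
for aj, dj in zip(a, d):
    prev = aj * prev + dj
    seq.append(prev)
assert all(abs(u - v) < 1e-12 for u, v in zip(cyclic_reduction_recurrence(a, d), seq))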
5.3 MATRIX MULTIPLICATION
C_IJ = Σ(k=1 to n) A_Ik B_kJ    (5.27)
where the first subscript is the row number and the second subscript is the
column number.
(5.28a)
(5.28b)
where we assume that all elements C(I,J) of the matrix are set to zero before
entering the code. The assignment statement in the code (5.28b) forms the
inner product of the Ith row of A and the Jth column of B. It is a special case
of the sequential evaluation of the sum of a set of numbers that was discussed
in §5.2 (let d_k = A_Ik B_kJ in equation (5.11b), then x_n = C_IJ). We therefore have
the option of evaluating it sequentially as in the code (5.28) or by using the
cascade sum method. The considerations are the same as those given in §5.2.
Some computers provide an ‘inner or dot product’ instruction (e.g. the
CYBER 205, see table 2.4), and the use of this must be considered in any
comparisons.
There is, however, more parallelism inherent in the evaluation of the matrix
product than is present in the problem of evaluating a single sum. This is
because a matrix multiplication involves the evaluation of n² inner products
and these may be performed n at a time (the middle-product method) or n²
at a time (the outer-product method).
(5.29a)
(5.29b)
Every term in the loop over I can be evaluated in parallel, so that the loop
(5.29b) can be replaced by a vector expression. In an obvious notation, the
code can be written:
(5.30a)
(5.30b)
where C(,J) and A(,K) are vectors composed of the Jth and Kth columns
of C and A. The addition, +, is a parallel addition of n elements, and the
multiplication, *, is the multiplication of the scalar B(K,J) by the vector
A(,K). The parallelism of the middle-product code is therefore n, compared
with 1 for the original inner-product method. This has been obtained by the
simple process of interchanging the order of the DO loops. Note that we
could have moved the J loop to the middle, thus computing all the inner
products of a row in parallel. However, elements of the column of a matrix
are usually stored in adjacent memory locations (FORTRAN columnar
storage), consequently memory-bank conflicts are reduced if vector operations
take place on column vectors. Thus the code of (5.30) is usually to be preferred.
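To fix the three orderings in the mind, here is a plain Python sketch of the inner-, middle- and outer-product formulations (our loop arrangements, illustrating the descriptions in the text rather than reproducing codes (5.28)-(5.30), which are not in this extract):

def inner_product_mm(A, B):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):                 # each C[i][j] is a sequential inner product
                C[i][j] += A[i][k] * B[k][j]
    return C

def middle_product_mm(A, B):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for k in range(n):
            b = B[k][j]
            for i in range(n):                 # this whole loop is one vector operation of length n:
                C[i][j] += b * A[i][k]         # column J of C += scalar * column K of A
    return C

def outer_product_mm(A, B):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for i in range(n):                     # the two inner loops together form one
            for j in range(n):                 # rank-one update of all n*n elements of C
                C[i][j] += A[i][k] * B[k][j]
    return C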
The middle-product method, when programmed in assembler, is found to
be the best code on the CRAY-1 computer. Supervector performance of
138 Mflop/s is observed. This implies that, on average, almost two arithmetic
operations are being performed per clock period (one operation per clock is
equivalent to 80 Mflop/s). This is possible because the multiplication and
addition operations in statement (5.30b) can be chained to act as a single
pipelined composite operation delivering one element of the result vector
C( ,J) per clock period.
It is interesting to note that the middle-product method is found to have
superior performance to the inner-product method, even on computers such
as the CDC 7600 that do not have explicit vector instructions, and are not
usually classified as parallel computers. However the CDC 7600 does have
pipelined arithmetic units and their performance is improved if requests for
arithmetic are received in a regular fashion, as in the serial code (5.29b) for
vector operations. This idea has been exploited in a set of carefully optimised
assembler routines, called STACKLIB (see §2.3.5), that has been written at
the Lawrence Livermore Laboratory. These routines perform various dyadic and
triadic vector operations, such as the ‘vector + scalar x vector’ statement in
code (5.30b), and obtain performance improvements of about a factor of two
over code written for serial evaluation, such as (5.28b).
(5.31a)
(5.31b)
(5.32)
(5.33)
(5.34a)
(5.34b)
It is clear that there is little to choose between the two methods from this
point of view for computers such as the CRAY-1 with small values of n1/2.
Other considerations, such as the ability to work in the vector registers and
to use chaining, favour the middle product. However there are obvious
advantages to the outer product in pipelined computers in the last case in
which n² > n1/2 > n.
match the size of the matrices (also in Jesshope and Hockney 1979). For
example, how should one best compute the product of two 16x16 matrices
on a 64 x 64 ICL DAP? Obviously it would be very wasteful to adopt the
outer-product method and only use one sixteenth of the available processors.
Jesshope and Craigie note that there are n³ multiplications in the product
of two n x n matrices (n² inner products, each with n multiplications; see
equation (5.27)), and that all these products can be evaluated simultaneously
with a parallelism of n³. The summation of the n terms of all the n² inner
products can be achieved in log2 n steps, also with parallelism n³, using the
cascade sum method. The equivalent FORTRAN code is:
(5.35a)
(5.35b)
In the above, the loop (5.35a) performs all the n³ multiplications and the
loop (5.35b) evaluates the cascade sum. After the execution of the four nested
loops (5.35b) the C(I,J) element of the product is found in location C(I,J,1).
The above code can be expressed succinctly, using the parallel constructs
outlined in §4.3.1 (see pp 400-1) by the following single statement:
(5.35c)
where the subscripts on the PE(,) array designate the row and column
number of the PE and are to be evaluated as FORTRAN statements.
The result of this mapping is that the 16 values with the same first two
subscripts are stored in a compact 4 x 4 array of processing elements, and
routing during the summation is kept to a minimum. The long-range routing
that is required during the expansion of A and B can be effectively performed
on the ICL DAP using the broadcast facility (see §3.4.2). The SUM function
in statement (5.35c) can also be optimised using bit-level algorithms.
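The idea can be imitated serially; the sketch below (ours, with ordinary Python lists replacing the DAP planes, and n assumed to be a power of two) forms all n³ products and then reduces every inner product with a log2 n-step cascade sum:

def dap_style_mm(A, B):
    n = len(A)
    # all n^3 products 'at once': P[i][j][k] = A[i][k] * B[k][j]
    P = [[[A[i][k] * B[k][j] for k in range(n)] for j in range(n)] for i in range(n)]
    step = 1
    while step < n:
        for i in range(n):                    # on the array machine every (i, j, k)
            for j in range(n):                # update below would proceed in parallel
                for k in range(0, n, 2 * step):
                    P[i][j][k] += P[i][j][k + step]
        step *= 2
    return [[P[i][j][0] for j in range(n)] for i in range(n)]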
5.4 TRIDIAGONAL SYSTEMS
a_i x_(i−1) + b_i x_i + c_i x_(i+1) = k_i,    i = 1, 2, ..., n,    (5.37)
with a_1 = c_n = 0, or in matrix-vector notation:
A x = k.    (5.38)
The Gaussian elimination algorithm may be stated as follows:
(5.39a)
(5.39b)
and
(5.39c)
or
(5.39d)
or
(5.39e)
x_n = g_n,
x_i = g_i − w_i x_(i+1),    i = (n − 1), (n − 2), ..., 1.    (5.39f)
In the forward elimination stage two auxiliary vectors are computed, w and
e, which are functions only of the coefficients in the matrix A. These vectors
are the coefficients in the triangular decomposition of A into the product of
a lower triangular matrix L and an upper triangular matrix U :
A = LU (5.40)
where
x_n = g_n,
x_i = g_i − w_i x_(i+1),    i = n − 1, n − 2, ..., 1.    (5.42c)
The scalar operations are now reduced to 6n without precalculation and 4n
with precalculation of the single vector w.
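For reference, here is a Python sketch of the forward-elimination and back-substitution recurrences just described (our rendering: the auxiliary vector w and the modified right-hand side g follow the notation of the back-substitution formula; the exact loops (5.39)/(5.42) are not reproduced in this extract):

def tridiagonal_gauss(a, b, c, k):
    # solves a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = k[i], with a[0] = c[n-1] = 0
    n = len(b)
    w = [0.0] * n
    g = [0.0] * n
    w[0] = c[0] / b[0]
    g[0] = k[0] / b[0]
    for i in range(1, n):                      # forward elimination: a sequential recurrence
        denom = b[i] - a[i] * w[i - 1]
        w[i] = c[i] / denom
        g[i] = (k[i] - a[i] * g[i - 1]) / denom
    x = [0.0] * n
    x[n - 1] = g[n - 1]
    for i in range(n - 2, -1, -1):             # back substitution: x[i] = g[i] - w[i]*x[i+1]
        x[i] = g[i] - w[i] * x[i + 1]
    return x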
The three loops (5.39a, c, f) or (5.42) of the Gaussian elimination algorithm
are all sequential recurrences that must be evaluated one term at a time.
Hence the parallelism of the algorithm is 1. This, together with the fact that
vector elements are referenced with unit increment and that the number of
(5.43a)
If we now let
(5.43b)
and rearrange, we have
(5.44a)
with
(5.44b)
Equation (5.44a), not surprisingly, is the homogeneous form of the original
equations. It is a linear second-order (or three-term) recurrence. Hence the
problem of finding w_i is the same as the problem of solving the original
equations. In §5.4.3 we will show how to use cyclic reduction directly to solve
such recurrences. The recursive doubling algorithm, although it uses cyclic
reduction, proceeds somewhat differently, as we now show.
For the sake of generality we will solve the second-order recurrence:
(5.45a)
Equation (5.45a) can be expressed as follows:
(5.45b)
or as
(5.45c)
where
(5.45d)
with
general form of equation (5.11a) except that the multiplying factor is now a
matrix. The recurrence can be solved using the cyclic reduction procedure
described for the scalar recurrence in §5.2.4, with appropriate interpretation
in terms of vectors and matrices. Having found v_i (taking h_i = 0), the values
of y_i are known from its components, and w_i is calculated from equation
(5.43b). This completes the recursive doubling algorithm for the parallel
evaluation of the LU decomposition of a tridiagonal system by the Gaussian
elimination recurrences.
It might be thought that one could solve the original equations by this
method, because they are the same as equation (5.45a). However, this
approach is thwarted because one does not have a starting value for v_1. This
is because such a tridiagonal system arises from a second-order PDE with
boundary conditions at the two ends (i.e. one may consider y_0 and y_(n+1) to
be known). However, the recurrence can only be evaluated progressively if
starting conditions are given for two adjacent values, say y_0 and y_1, giving
a starting value for v_1. If such starting values are available, the method above
does provide a method of parallel solution for the general linear second-order
recurrence. This would occur if the equations arose from a second-order
initial-value problem, rather than the boundary-value problem presently
under discussion.
In order to estimate the number of operations we will make the approximation
that the parallelism remains at n throughout. Then we have, neglecting
constant terms:
(5.46a)
The time to execute the algorithm on a computer with a half-performance
length of n1/2 is therefore proportional to
(5.46b)
vectors. The equations solved were those arising from the finite-difference
approximation to Poisson’s equation and had the same coefficients at each
mesh point. These equations are discussed in detail in §5.6. For convenience
the number of mesh points and therefore the number of equations was taken
to be a power of two. Neither of these restrictions is necessary to the method
as has been shown by Swarztrauber (1974) and Sweet (1974, 1977). For
simplicity here, however, we will assume that n = n′ − 1, where n′ = 2^q and q
is an integer, and solve the general coefficient problem defined in equation
(5.37).
Writing three adjacent equations we have for i = 2, 4, ..., n′ − 2:
(5.47)
The special end equations are included correctly if we set x_0 = x_n′ = 0. If the
first of these equations is multiplied by α_i = −a_i/b_(i−1), and the last by
(5.48a)
(5.48b)
The equations (5.48) relate every second variable and, if written for
i = 2, 4 ,..., n' —2, are a tridiagonal set of equations of the same form as the
original equations (5.47) but with different coefficients (a^(1), b^(1), c^(1)). The
number of equations has been roughly halved. Clearly the process can be
repeated recursively until, after log2(n′) − 1 levels of reduction, only the
central equation for i = n′/2 remains. This equation is
(5.49a)
(5.49b)
The remaining unknowns can now be found from a filling-in procedure. Since
we know x_0, x_(n′/2) and x_n′, the unknowns midway between these can be found
from the equations at the preceding level using
for i = n'/4 and 3n'/4. The filling-in procedure is repeated until, finally, all
the odd unknowns are found using the original equations.
The cyclic reduction procedure therefore involves the recursive calculation
of new coefficients and right-hand sides, for levels l = 1, 2, ..., q − 1, from
(5.50)
where
and
with the initial values a_i^(0) = a_i, b_i^(0) = b_i, and c_i^(0) = c_i, followed by the recursive
filling-in of the solution, for l = q, q − 1, ..., 2, 1, from
(5.51)
where
and x_0 = x_n′ = 0 when they occur. The routing diagram for this algorithm is
shown in figure 5.7 for the case n′ = 8. For convenience we define the vector
p_i = (a_i, b_i, c_i, k_i) to indicate all the values calculated by equations (5.50).
The number of operations involved in the evaluation of equations (5.50) is
(5.52)
The time for this part of the calculation on a computer with a half-performance
FIGURE 5.7 Routing diagram for the serial cyclic reduction algorithm
(SERICR) for n′ = 8. Rectangular boxes indicate the evaluation of
equation (5.50) and the diamonds the evaluation of equation (5.51). The
variables calculated are written inside the boxes with the notation
p_i = (a_i, b_i, c_i, k_i).
(5.53a)
In these comparisons we will only keep terms of order n′ and log2 n′, hence
for n1/2 > 1 and log2 n′ > 1 approximately we have
(5.53b)
The evaluation of equation (5.51) requires
(5.54a)
(5.54b)
(5.54c)
The manner of performing cyclic reduction that has just been described has
the least number of scalar arithmetic operations and is therefore the best
choice for a serial computer. We will therefore call the algorithm the serial
variant of cyclic reduction with the acronym SERICR. The total time required
for its execution is therefore proportional to:
(5.55)
FIGURE 5.8 Routing diagram for the parallel cyclic reduction algorithm
(PARACR) for n′ = 8. The special vector p_0 = (0, 1, 0, 0).
if one takes
(5.56a)
or
p_0 = (0, 1, 0, 0),
The above special values, when inserted into the defining equation (5.48a),
lead to the equation
x_i = 0,    (5.56b)
which gives the correct boundary values. We may therefore either consider
the solution of the original finite set of equations, or alternatively an infinite
set extended with the coefficients (5.56a). This only amounts to adding
equations such as (5.56b) outside the range of the original problem. Either
point of view is equally valid, but the latter view is more appropriate for the
parallel variant of cyclic reduction because it defines the required values of
p_i and x_i outside the originally defined problem. After calculating p_i at the
final level (actually only the values of b_i and k_i are required), the value of x_i
is obtained from equation (5.51):
(5.57)
The terms in x on the right-hand side of equation (5.51) do not occur because,
at this level of reduction, they refer to values outside the range 1 ≤ i ≤ n and
by equation (5.56b) are zero.
The number of operations in the PARACR algorithm is clearly
with parallelism n (5.58a)
and the time of execution is proportional to
(5.58b)
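A Python sketch of PARACR under the extended-equation convention just described (ours; out-of-range equations are taken as the special p = (0, 1, 0, 0) with zero right-hand side, so every equation decouples after about log2 n levels; the book's exact recurrences (5.50) are not reproduced in this extract):

def paracr(a, b, c, k):
    # solves a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = k[i], i = 0..n-1, a[0] = c[n-1] = 0
    n = len(b)
    a, b, c, k = list(a), list(b), list(c), list(k)

    def at(v, i, outside):                     # extend each vector with the special values
        return v[i] if 0 <= i < n else outside

    h = 1
    while h < n:
        A = [0.0] * n; B = [0.0] * n; C = [0.0] * n; K = [0.0] * n
        for i in range(n):                     # every equation is reduced at every level
            alpha = -a[i] / at(b, i - h, 1.0)
            gamma = -c[i] / at(b, i + h, 1.0)
            A[i] = alpha * at(a, i - h, 0.0)
            B[i] = b[i] + alpha * at(c, i - h, 0.0) + gamma * at(a, i + h, 0.0)
            C[i] = gamma * at(c, i + h, 0.0)
            K[i] = k[i] + alpha * at(k, i - h, 0.0) + gamma * at(k, i + h, 0.0)
        a, b, c, k = A, B, C, K
        h *= 2
    return [k[i] / b[i] for i in range(n)]     # each equation is now simply b*x = k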
with |b/a| = δ, which is equally or less diagonally dominant than the original.
If this simpler set of equations can be solved to a certain approximation,
then the original equations will be solved more accurately. The cyclic
reduction recurrence (5.50) for this case is
and then
(5.60a)
(5.60b)
where the subscript i is dropped because the coefficients are the same for all
equations, and we use the fact that c^(l) = a^(l). Dividing equation (5.60b) by
equation (5.60a) we obtain the recurrence relation for the diagonal dominance:
δ^(0) = δ,
δ^(l) = [δ^(l−1)]² − 2.    (5.61a)
Hence if the initial diagonal dominance δ > 2, the diagonal dominance will
grow quadratically at least as fast as equation (5.61a), and
(5.61b)
The finite difference approximation of the Helmholtz equation,
(5.62a)
(5.64b)
The first term shows how a greater demand for accuracy (larger ε^(−1))
necessitates more levels of reduction and the second term shows how the
number of levels is reduced as the diagonal dominance increases. If δ is close
to two, the full recurrence (5.61a) must be evaluated. In any case, the
practical approach is to measure the diagonal dominance, δ^(l), at each level
of reduction from the values of a^(l), b^(l) and c^(l), and to stop the reduction if
equation (5.64a) is satisfied.
Of course there is no saving in the algorithm unless l < log2(n') − 1, the
maximum number of levels for complete reduction. This leads to the result
that truncated reduction can produce savings when
(5.65a)
where
(5.65b)
If we take the example ε = 2^{-20} ≈ 10^{-6} (32-bit single precision on the
IBM 360) and δ = 4 (as occurs in the harmonic equations of §5.6.2), we obtain
(5.65c)
in any subroutine for cyclic reduction. The savings in execution time can be
substantial if hundreds or thousands of equations are involved.
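The number of levels after which the reduction may be truncated is easily found by following the recurrence (5.61a). The free-form Fortran sketch below is our own illustration; since equation (5.64a) is not reproduced here, it uses the stopping test 1/δ < ε as a stand-in for it.

    integer function trunc_levels(delta0, eps)
      ! Levels of cyclic reduction needed before the off-diagonal terms can be
      ! dropped, found by following the growth of the diagonal dominance in
      ! equation (5.61a): delta <- delta**2 - 2.  The stopping test 1/delta < eps
      ! is an assumed stand-in for equation (5.64a).
      real, intent(in) :: delta0, eps
      real :: delta
      if (delta0 <= 2.0) then
         trunc_levels = -1          ! no guaranteed growth: perform full reduction
         return
      end if
      delta = delta0
      trunc_levels = 0
      do while (1.0/delta >= eps)
         delta = delta*delta - 2.0
         trunc_levels = trunc_levels + 1
      end do
    end function trunc_levels

With δ = 4 and ε ≈ 10^{-6} this sketch gives four levels, far fewer than the log2(n') − 1 levels of a complete reduction for large n'.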
Hence using
we obtain
(5.66a)
(2) For cyclic reduction with vector length kept at n and no filling-in stage
(termed PARACR),
484 PARALLEL ALGORITHMS
hence
(5.66b)
(3) For cyclic reduction with vector length halving at each level of reduction
and including a filling-in stage (termed SERICR),
s = 17n,   q = 17 log2 n
hence
(5.66c)
In the above we also ignore possible savings from truncated reduction which
should certainly be taken into account if the systems are strongly diagonally
dominant.
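These counts enter the comparison through equation (5.5). Assuming that equation expresses the usual n1/2 timing model, in which an algorithm performing q vector operations containing s arithmetic operations in all runs in a time proportional to s + q n1/2, the equal-performance lines of the phase diagrams that follow come from a one-line calculation; the sketch below is ours, not a routine from the text.

    real function nhalf_equal(s1, q1, s2, q2)
      ! Half-performance length n_1/2 at which two algorithms take the same time,
      ! assuming the timing model T = s + q*n_1/2 (in element times).  Requires
      ! q1 /= q2.  The algorithm with the smaller q is superior for n_1/2 above
      ! this value; the one with the smaller s is superior below it.
      real, intent(in) :: s1, q1, s2, q2
      nhalf_equal = (s2 - s1)/(q1 - q2)
    end function nhalf_equal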
The recursive doubling algorithm has a poorer performance than either
variant of cyclic reduction and will not be further considered. It would, of
course, be used if it were required to find the LU decomposition of the
equations. For the stated problem of actually solving the equations, cyclic
reduction is always better. The choice is therefore between the parallel and
serial variants of cyclic reduction (PARACR and SERICR respectively).
Given the operations’ counts s and q, the algorithmic phase diagram can
now be drawn by using equation (5.5). We find that the parallel variant has
the better or equal performance (P_PARACR ≥ P_SERICR) when
(5.67a)
The equality gives the boundary line of equal performance. This is plotted
in figure 5.9 together with the regions of the diagram in which each algorithm
is superior. For large n1/2 we have:
(5.67b)
This shows that for the paracomputer (n1/2 = ∞) the parallel variant,
PARACR, is the best algorithm for all orders n of the equations (hence the
name of the variant).
For finite values of n1/2 there is always some n (≈ 0.42n1/2) for values greater
than which the serial variant is superior. This result is analogous to that
found in §5.2.3 for the solution of recurrences. The value of n1/2 measures
the amount of hardware parallelism. If the vector length is much greater than
this for a particular problem, then for this problem the computer will act
like a serial computer— that is to say the parallelism of the computer is too
small to have an influence on the performance. In this circumstance the
criterion of performance on a serial computer will be relevant: namely that
TRIDIAGONAL SYSTEMS 485
FIGURE 5.9 The selection of the best algorithm for the solution of a
single tridiagonal system of n equations on a computer with a half-
performance length of n1/2: SERICR, cyclic reduction with reduction of
vector length; PARACR, cyclic reduction without reduction of vector
length. (From Hockney (1982), courtesy of North-Holland.)
the best algorithm is that with the least number of scalar arithmetic
operations, i.e. the serial variant of cyclic reduction (SERICR).
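The distinction between the two variants can be made concrete with a short sketch. The following free-form Fortran subroutine (our own illustration, not a listing from the text) performs the parallel variant PARACR on a single system, treating out-of-range neighbours as the special equations (5.56a); the elimination coefficients are the standard cyclic reduction ones, since equations (5.50) and (5.51) are not reproduced here.

    subroutine paracr(n, a, b, c, k, x)
      ! Parallel cyclic reduction (PARACR) for one tridiagonal system
      !   a(i)*x(i-1) + b(i)*x(i) + c(i)*x(i+1) = k(i),  i = 1, ..., n  (n a power of two).
      ! Neighbours outside 1..n are taken as the special equations p = (0,1,0,0) of
      ! equation (5.56a), i.e. x = 0 there.  The coefficient arrays are overwritten.
      integer, intent(in)    :: n
      real,    intent(inout) :: a(n), b(n), c(n), k(n)
      real,    intent(out)   :: x(n)
      real    :: am(n), bm(n), cm(n), km(n), ap(n), bp(n), cp(n), kp(n)
      real    :: alpha(n), gamma(n)
      integer :: level, h, i
      h = 1
      do level = 1, nint(log(real(n))/log(2.0))
         do i = 1, n                          ! gather the neighbour equations at distance h
            if (i-h >= 1) then
               am(i) = a(i-h); bm(i) = b(i-h); cm(i) = c(i-h); km(i) = k(i-h)
            else
               am(i) = 0.0; bm(i) = 1.0; cm(i) = 0.0; km(i) = 0.0
            end if
            if (i+h <= n) then
               ap(i) = a(i+h); bp(i) = b(i+h); cp(i) = c(i+h); kp(i) = k(i+h)
            else
               ap(i) = 0.0; bp(i) = 1.0; cp(i) = 0.0; kp(i) = 0.0
            end if
         end do
         alpha = -a/bm                        ! eliminate x(i-h) and x(i+h) from every equation:
         gamma = -c/bp                        ! all assignments below are vector operations of length n
         k = k + alpha*km + gamma*kp
         b = b + alpha*cm + gamma*ap
         a = alpha*am
         c = gamma*cp
         h = 2*h
      end do
      x = k/b                                 ! only the diagonal remains after log2(n) levels
    end subroutine paracr

The serial variant SERICR applies the same elimination only to the even-numbered equations at each level, so its vector length halves each time and a filling-in stage is needed to recover the odd-numbered unknowns.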
If, alternatively, we are presented with the problem of solving a set of m
tridiagonal systems each of n equations, we have the choice of either
applying SERICR or PARACR in parallel or sequentially to the m systems,
or using the best serial method (the Gaussian elimination recurrence described
in the last paragraph of §5.4.1) to all systems in parallel. For computers
with a large natural parallelism ≥ mn, for example large processor arrays
(the ICL DAP) or pipelined computers with a large n1/2 (the CYBER 205),
the best algorithm is likely to be the one with the most parallelism. For such
cases we compare Gaussian elimination applied in parallel to all systems
(MULTGE) with parallelism m, with SERICRpar and PARACRpar in which
the named cyclic reduction algorithm is applied in parallel to all systems.
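The MULTGE alternative is simply the standard Gaussian elimination for a tridiagonal system, written so that the inner, vectorised loop runs across the m systems with unit stride; the following free-form Fortran sketch is our own illustration.

    subroutine multge(m, n, a, b, c, k, x)
      ! Gaussian elimination applied in parallel to m tridiagonal systems of n
      ! equations (the MULTGE method).  System j is
      !   a(j,i)*x(j,i-1) + b(j,i)*x(j,i) + c(j,i)*x(j,i+1) = k(j,i).
      ! Every statement in the inner loops is a vector operation of length m.
      integer, intent(in)    :: m, n
      real,    intent(in)    :: a(m,n), c(m,n)
      real,    intent(inout) :: b(m,n), k(m,n)     ! overwritten during elimination
      real,    intent(out)   :: x(m,n)
      real    :: fac
      integer :: i, j
      do i = 2, n                     ! forward elimination: recursive in i,
         do j = 1, m                  ! but vectorisable over the m systems
            fac    = a(j,i)/b(j,i-1)
            b(j,i) = b(j,i) - fac*c(j,i-1)
            k(j,i) = k(j,i) - fac*k(j,i-1)
         end do
      end do
      do j = 1, m
         x(j,n) = k(j,n)/b(j,n)
      end do
      do i = n-1, 1, -1               ! back substitution, again vectorised over j
         do j = 1, m
            x(j,i) = (k(j,i) - c(j,i)*x(j,i+1))/b(j,i)
         end do
      end do
    end subroutine multge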
486 PARALLEL ALGORITHMS
The operations’ counts and average vector lengths for these alternatives are
as follows.
(5.68a)
(5.68b)
(5.68c)
Using equation (5.5) we obtain the following relationships, in which the
inequality determines which algorithm has the better performance, and the
equality gives the equation for the equal performance line on the (n1/2/m, n)
parameter plane.
P_SERICRpar ≥ P_MULTGE when
(5.69b)
The lines defined by equations (5.69) are shown in figure 5.10 and divide the
(n1/2/m, n) parameter plane into regions in which each of the three methods
has the best performance. This is rather like a chemical phase diagram, and
there is even a ‘triple point’ at n ≈ 7, n1/2/m ≈ 7.7 where the three algorithms
have the same performance. We find that for n1/2 = ∞ (the paracomputer),
PARACRpar is, as expected, the best algorithm for all n. However for finite
n1/2 and n1/2 > 10m there is always some n, greater than which SERICRpar
is favoured. If the number of systems m is greater than n1/2, we find that the
application of the best serial algorithm in parallel to the m systems is always
the best. In the region 1 < n1/2/m < 10 all three algorithms may be favoured
in the complex way displayed by the diagram.
TRIDIAGONAL SYSTEMS 487
FIGURE 5.10 The selection of the best algorithm for the solution of
m tridiagonal systems of n equations for computers with a large natural
parallelism ≥ mn. We compare the Gaussian elimination (MULTGE),
and the serial (SERICRpar) and parallel (PARACRpar) versions of cyclic
reduction when applied in parallel to all m systems. (From Hockney
(1982), courtesy of North-Holland.)
(5.70b)
488 PARALLEL ALGORITHMS
FIGURE 5.11 The selection of the best algorithm for the solution of
m tridiagonal systems of n equations for a computer with limited
parallelism of approximately m or n (here n1/2 = 100). We compare
Gaussian elimination applied in parallel to all m systems (MULTGE),
with the serial (SERICRseq) and parallel (PARACRseq) versions of cyclic
reduction applied sequentially to the m systems.
Figure 5.11 shows the comparison between these two algorithms and
MULTGE for the case n1/2 = 100. The vertical line separating the PARACRseq
and SERICRseq methods is obtained from the equality in equation (5.67a)
or figure 5.9, and is true for all m. Multiple Gaussian elimination will have
a performance better than or equal to the parallel variant of cyclic reduction
applied sequentially when
or when
(5.70c)
or when
(5.70d)
We find, broadly speaking, that the multiple application of the sequential
TRANSFORMS 489
Gaussian algorithm is the best when the number of systems exceeds one tenth
of the number of equations in each system. This will be the case for most
applications arising from the solution of partial differential equations (see
§5.6). There is a relatively small part of the diagram favouring the parallel
variant of cyclic reduction, and this becomes smaller as n1/2 decreases. Indeed
for n1/2 ≤ 10, PARACRseq is never the best algorithm.
5.5 TRANSFORMS
(5.71a)
where f_j are the set of n complex data to be transformed and f̄_k are the set
of n complex harmonic amplitudes resulting from the transformation. A direct
evaluation of the definition (5.71a) would require n complex multiplications
and n complex additions per harmonic, or a total of 8n² real arithmetic
operations for the evaluation of all n harmonics. The value of the FFT algorithm
is that it reduces this operation count to approximately 5n log2 n real
arithmetic operations. The ratio of the performance of the FFT to the direct
evaluation on a serial computer is therefore
(5.71b)
This ratio varies from about 18 (n = 128) to about 102 (n = 1024) and to
about 5 × 10^4 (n ≈ 10^6). It is not surprising therefore that the publication of
the FFT by Cooley and Tukey (1965) has led to a major revolution in numerical
methods. Although Cooley, Lewis and Welch (1967) cite earlier works
containing the idea for n log2 n methods (e.g. Runge and König 1924, Stumpff
1939, Danielson and Lanczos 1942, Thomas 1963) these were not generally
known. Before 1965 Fourier transformation was thought to be a costly n²
process that could only be conducted sparingly and was usually best avoided;
after 1965 it became a relatively cheap process, orders of magnitude faster
than had been previously thought.
The transform (5.71a) has the inverse (or Fourier synthesis):
(5.72a)
(5.72b)
(5.72c)
492 PARALLEL ALGORITHMS
(5.73a)
where
(5.73b)
(5.73c)
The association of the negative sign with Fourier analysis in equation
(5.73b) and the positive sign with Fourier synthesis is arbitrary. We have
followed the conventions of transmission theory and Bracewell (1965). Clearly
the choice is immaterial and can be accommodated by the appropriate choice
of ω_n. The algorithm is the same in both cases. Being the nth root of unity,
ω_n has the important property that
(5.74a)
and thus
where s and t are integers. The definitions (5.73b, c) also show that any
common factor may be removed from or inserted into both the subscript
and superscript of ω_n without altering its value. Thus, for example,
(5.74b)
Frequently the function to be transformed is real, say g_j, and the transforms
are expressed in terms of sines and cosines. Thus, assuming n is even,
(5.75a)
where
(5.75b)
and
(5.75c)
The coefficients a_k and b_k of the real transform are simply related to the
TRANSFORMS 493
(5.78b)
(5.79b)
(5.79c)
494 PARALLEL ALGORITHMS
The number of real scalar arithmetic operations is 2½n log2 n + 3½n, where
the second term arises from the post-manipulation of f̄_k in equations
(5.78)-(5.80).
The calculation of the Fourier series (5.75a) from the coefficients (5.75b, c)
can be performed by reversing the above procedure as follows:
(5.81a)
(5.81b)
(5.82a)
(5.82b)
(5.83a)
(5.83b)
and also
(5.83c)
(5.83d)
We then calculate the complex Fourier synthesis of the n j l complex values
J k and the resulting n/ 2 complex values contain the required synthesised
values in successive real and imaginary parts, thus
f_j = Σ_{k=0}^{n/2-1} exp(2πijk/(n/2)) f̄_k = g_{2j} + i g_{2j+1},   j = 0, 1, ..., n/2 − 1.
(5.83e)
TRANSFORMS 495
The above procedure involved a pre-manipulation of the n real data into the
appropriate n/2 complex values which are then transformed using the
complex FFT. As in analysis, the number of real scalar arithmetic operations
is 2½n log2 n + 3½n.
The calculation of the real Fourier transform is discussed by Cooley et al
(1967) and by Bergland (1968). A different approach, involving the folding
of data before multiplying by sines and cosines, is described by Hockney
(1970). The latter method is the amalgamation of hand computation methods
as given, for example, by Runge (1903, 1905) and Whittaker and Robinson
(1944) and gives in addition to the periodic transform described above, the
finite sine and cosine transformations. The latter together with the calculation
of the Laplace transform are discussed by Cooley et al (1970). Other important
publications on the FFT are Gentleman and Sande (1966), Singleton (1967,
1969), Uhrich (1969) and the book on the subject by Brigham (1974). In this
chapter we aim only to bring out the main features of the FFT algorithm and
discuss some of the considerations affecting its implementation on parallel
computers. For this purpose we will limit the discussion to the binary (or
radix-2) case for which n = 2^q (where q is integral) although the algorithm
can be applied efficiently to any n that is a product of small primes preferably
repeated many times. Such mixed radix algorithms are described by Singleton
(1969) and Temperton (1977, 1983a, 1983c). A form of the binary algorithm
particularly well suited to parallel processing has been given by Pease (1968),
and the performance of several of the above algorithms has been compared
on the CRAY-1 by Temperton ( 1979b), and on the CYBER 205 by Temperton
(1984) and Kascic (1984b). Vectorisation of the fast Fourier transform is
discussed by Korn and Lambiotte (1979), Wang (1980) and Swarztrauber
(1982, 1984). The implementation of fast radix-2 algorithms on processor
arrays is considered by Jesshope (1980a).
The above algorithms, which are all variations and extensions of the
algorithm published by Cooley and Tukey (1965), may be described as the
conventional fast Fourier transform. Their efficiency relies on the factors of n
being repeated many times. Curiously, there is another form of the fast Fourier
transform, called the prime factor algorithm (PFA), in which n is split into
non-repeating factors which are all mutually prime to each other. These
methods were first introduced by Good (1958, 1971) and Thomas (1963),
and subsequently developed by Kolba and Parks (1977), Winograd (1978),
and Johnson and Burrus (1983). The method has found most application in
signal processing (Burrus 1977, Burrus and Eschenbacher 1981), and is fully
described in the books by McClellan and Rader (1979), and Nussbaumer
The practicalities of implementing the PFA on current vector computers
have been discussed by Temperton (1983b, 1985, 1988).
496 PARALLEL ALGORITHMS
(5.84a)
where
(5.84b)
and i is an integer subscript with values
(5.84c)
The above defines a collection of n2^{-l} transforms. Each transform is
distinguished by its identification number i which is the index of the first
datum from which the transform is calculated. The remaining data for the
transform are separated from the first by the interval n2^{-l}. The length of
each transform is 2^l. Thus (l)f̄_k^i is the kth harmonic of the ith transform at
level l.† When l = 0 we have j = k = 0 and
(5.85a)
Hence the level-zero transforms are the initial data. At level l = q, where
q = log2 n, we have i = 0 and
(5.85b)
Thus at level log2n there is only one partial transform which is proportional
to the required complete transform over all the original n data.
† The notation (l)f̄_k^i will be used in the text only.
TRANSFORMS 497
(5.86a)
j=o
(5.86b)
Hence
(5.87a)
Replacing k by k + 2^l gives:
(5.87b)
498 PARALLEL ALGORITHMS
(5.88a)
(5.88b)
(5.88c)
If (l+1)f̄_k^i overwrites (l)f̄_k^i and (l+1)f̄_{k+2^l}^i overwrites (l)f̄_k^{i+n2^{-(l+1)}}, then the
transform can be performed in-place without the need for any auxiliary
storage. However the final harmonics are then obtained in reverse binary
order. That is to say, if k = k_{q-1}2^{q-1} + k_{q-2}2^{q-2} + ··· + k_1·2 + k_0, where k_p
is the pth digit in the binary representation of k, then this harmonic will be
found in location (q)f̄_{k'}, where k' = k_0 2^{q-1} + k_1 2^{q-2} + ··· + k_{q-2}·2 + k_{q-1}. If
the harmonic analysis is to be followed by a synthesis step which reverses
the above steps, then there is no need to sort the harmonics into natural
order. This is the case during the solution of a field problem by the convolution
method. If the harmonics are required as output, however, a sorting step
must be inserted after the basic algorithm. Such a sorting is often referred
to as bit-reversal, because the harmonics occur in bit-reversed order.
Alternatively, if overwriting is eliminated by placing the result of the
recurrence in a second array, sorting may take place at each level as is shown
in the data flow diagram in figure 5.12.
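The bit-reversal sort itself is short; the following free-form Fortran sketch (ours, not a listing from the text) reorders the n = 2^q harmonics into natural order by swapping each element with the one whose index has the reversed binary digits.

    subroutine bitrev(n, q, f)
      ! Reorder the n = 2**q harmonics f(0:n-1) from bit-reversed to natural order
      ! (or vice versa: the permutation is its own inverse).
      integer, intent(in)    :: n, q
      complex, intent(inout) :: f(0:n-1)
      integer :: kk, kp, p
      complex :: temp
      do kk = 0, n-1
         kp = 0
         do p = 0, q-1                        ! reverse the q binary digits of kk
            if (btest(kk, p)) kp = ibset(kp, q-1-p)
         end do
         if (kp > kk) then                    ! swap each pair exactly once
            temp = f(kk);  f(kk) = f(kp);  f(kp) = temp
         end if
      end do
    end subroutine bitrev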
The fast Fourier transform of equations (5.88) may be reversed by solving
for level l in terms of level l + 1 and the inverse transform obtained as
follows:
(5.89a)
(5.89b)
TRANSFORMS 499
(5.89c)
(5.89d)
(5.90a)
(5.90b)
(5.90c)
(5.90d)
5.5.3 Vectorisation
The techniques for the implementation of the FFT on vector pipelined
computers can be illustrated by considering the FORTRAN code for the
Cooley-Tukey algorithm of equations (5.88) that is given in figure 5.13. This
program comprises a control routine (top) which calls the subroutine RECUR
(bottom) to evaluate the recurrence (5.88). RECUR (F, G, W, L, N) performs
the recurrence once for L = l on the input data in the complex vector F and
puts the output in the complex vector G. The vector W contains powers of
the nth root of unity and N = n is the total number of complex variables. At
successive levels the input and output alternate between the arrays F and G,
and calls to RECUR are therefore made in pairs in the control routine. Since
F and G are different vectors the output never overwrites the input. This
permits the harmonics to be kept in natural order and allows most compilers
to vectorise the statements in the DO 11 loop. We assume that W has been
previously loaded with powers of the nth root of unity such that
(5.91a)
TRANSFORMS 501
FIGURE 5.13 The control program (top) and the subroutine RECUR
for scheme A (bottom) with vectorisation of the DO 11 loop.
(5.91b)
(5.92a)
(5.92b)
(5.92c)
The last term in equation (5.92c) is the normal serial operation count for
the FFT algorithm (take n1/2 = 0), and the first term shows the effect of
hardware parallelism through the value of n1/2.
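The listing of figure 5.13 is not reproduced in this extraction, so the following free-form Fortran sketch (ours) indicates the kind of recurrence that RECUR evaluates. It assumes that equations (5.88) are the usual natural-order radix-2 recurrence, that element k of transform i at level l is stored in F(i + k*(n/2**l)), and that W(j) holds the jth power of the nth root of unity as in (5.91a); the outer DO plays the role of the DO K1 loop and the inner DO that of the DO 11 loop.

    subroutine recur_a(f, g, w, l, n)
      ! One level of a natural-order radix-2 recurrence: the level-l partial
      ! transforms in F are combined in pairs to give the level-(l+1) partial
      ! transforms in G, harmonics staying in natural order (no bit reversal).
      integer, intent(in)  :: l, n
      complex, intent(in)  :: f(0:n-1), w(0:n-1)
      complex, intent(out) :: g(0:n-1)
      integer :: i, k1, m, half
      complex :: t
      m    = n/2**l                 ! spacing of the elements of one level-l transform
      half = m/2                    ! = n/2**(l+1), the number of output transforms
      do k1 = 0, 2**l - 1           ! the DO K1 loop over the harmonics k
         do i = 0, half - 1         ! the DO 11 loop over the transforms i (vectorised in scheme A)
            t = w(k1*half)*f(i + half + k1*m)
            g(i + k1*half)        = f(i + k1*m) + t
            g(i + (k1+2**l)*half) = f(i + k1*m) - t
         end do
      end do
    end subroutine recur_a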
It is obvious that the DO K1 and DO 11 loops of the above code can be
interchanged without altering the effect of the statements in the DO 20 loop.
If this is done, as is shown in figure 5.14, the DO K1 loop becomes the
innermost and is replaced by vector instructions. The vector length is 2^l, or
1, 2, 4, ..., n/2 in successive passes, and the storage interval is n2^{-(l+1)}, or
n/2, n/4, ..., 4, 2, 1. Memory-bank conflicts are a serious problem in the early
stages of this algorithm but can be avoided in the manner described above.
The variable W is now a vector in the inner loop and the multiplication is
a vector*vector instruction. The time to execute this alternative algorithm
(scheme B) is proportional to:
(5.93a)
(5.93b)
TRANSFORMS 503
Setting l' = log2(n) − 1 − l, and reversing the order of the first summation
one obtains:
(5.93c)
(5.93d)
(5.93e)
We therefore find that, if taken to completion, the two alternative schemes
execute in the same time. However they have different characteristics. Scheme
A starts with long vectors which reduce in length as the algorithm proceeds,
whereas scheme B starts with short vectors which increase in length as the
algorithms proceed. The performance of any parallel computer improves as
the vector length increases, hence a combined algorithm suggests itself.
Perform the first p levels of the FFT using scheme A and the last q − p levels
using scheme B. This combined algorithm executes in a time proportional to
(5.94a)
Hence
2^p = n/2^p,   2^{2p} = n,   (5.95b)
and
p = ½ log2 n.   (5.95c)
With this optimal selection of p, the minimum vector length is 2^p = √n (or
√(n/2) if n is not a power of four) and the execution time (assuming n is a
power of four) is given by
(5.96)
This type of combined algorithm was developed independently by Roberts
(1977) and Temperton (1979b). Temperton’s version is based on the
Gentleman-Sande recurrence (5.90) and forms the basis of the subroutine
CFFT2 written in CAL assembler by Petersen (1978) for the CRAY X-MP
computer scientific subroutine library.
Two further simplifications are usually incorporated in any efficient
algorithm. These concern the values of the multipliers (powers of ω_n) that
occur in both the Cooley-Tukey and the Gentleman-Sande recurrences.
When k = 0 the multiplier is unity and an unnecessary multiplication by this
value can be avoided by writing a separate loop for this case. In the calculation
of the first level of partial transforms, when l = 0 in the recurrences (5.88),
this occurs for every evaluation of equations (5.88). At other levels, it occurs
at the first use of the equations. In the calculation of the second level of
partial transforms, when l = 1 in equations (5.88), k = 0, 1 and the multipliers
are 1 and i. Since multiplication by i merely interchanges the real and
imaginary parts and changes the sign of the real part, no multiplications are
required in the calculation of the second-level transformations either. Again
a separate loop is justified.
Some arithmetic may be saved if the number of elements n being
transformed contains factors of four (Singleton 1969). It is then advantageous
to combine two applications of the recurrence (5.88) into a single recurrence
involving four input values and four output values. If we suppose
n is a power of four and therefore log2n is even, the transform can be performed
by evaluating the recurrence
(5.97a)
TRANSFORMS 505
(5.97b)
(5.97c)
(5.97d)
(5.97e)
The data flow diagram for the power-of-four transform (n = 4²) is shown in
figure 5.15. The evaluation of the recurrence (5.97) requires three complex
multiplications in order to calculate a, b and c, and eight complex additions
occur in equations (5.97b, c, d, e). That is 34 real operations for ½ log2 n values
of l and n/4 combinations of k and i. This is a total of 4.25n log2 n real
operations compared with 5n log2 n for the power-of-two transform, or a
saving of 15%. When l = 1, k = 0, all the multipliers are unity and as in the
case of the power-of-two transform, special code is justified to avoid
unnecessary multiplications. The power-of-two and power-of-four transforms
can easily be combined to give an efficient transform for any power of two,
that first removes all factors of four from n and then, if necessary, removes
a final factor of two.
FIGURE 5.15 The data flow diagram for the power-of-four transform
for n = 16. The diamonds evaluate the recurrence (5.97). The number in
the diamond is the power of the nth root of unity used in the evaluation
of the constant a.
that is performed in the DO 30 loop. The complex arrays U and E are copies
of the data array F with a sign change corresponding to the negative sign
in equation (5.88b). The complex array V contains copies of the multipliers
(powers of ω_n) in the correct place so that all n multiplications of equations (5.88) can
be performed in parallel in the first statement of the DO 30 loop. The second
statement of the DO 30 loop then performs all the n additions of equations
(5.88) in parallel.
As described in the previous section, there are no genuine multiplications
in the first two calls to RECUR, when l = 0 and 1. Special code for these
cases and the use of a power-of-four transform when possible would be
incorporated in an efficient program. The routing inherent in the DO 20
TRANSFORMS 507
FIGURE 5.16 Subroutine RECUR for PARAFT with vector length N
in the DO 30 loop.
loop must be considered when assessing the time of execution of the algorithm.
In processor arrays where arithmetic is slow compared with routing, such
as the ICL DAP, the DO 20 loop will not consume a major part of the
execution time. For example, data routing accounts for 10-20% of the overall
time when performing 1024 complex transforms on a 32 x 32 ICL DAP
(Flanders et al 1977). On arrays of more powerful processors the DO 20
routing loop will assume relatively more importance. Several processor arrays
have special routing connections and circuitry designed to perform efficiently
the routing necessary in the fast Fourier transform. Examples are the
Goodyear STARAN and the Burroughs BSP (see Chapter 3).
Examination of figure 5.16 shows that the PARAFT algorithm requires
one complex multiplication and one complex addition to be performed at
each level with parallelism n. This is equivalent to eight real operations with
parallelism n at log2n levels or, assuming routing is unimportant, an execution
time proportional to
(5.98a)
By comparing equation (5.98a) with the time for the A + B scheme in equation
(5.96) we find that PARAFT has a higher performance than A + B if
(5.98b)
The regions of the (n, n1/2) plane favoured by the two algorithms on this
508 PARALLEL ALGORITHMS
FIGURE 5.17 The regions of the (n, m/n1/2) plane in which either the
(A + B) scheme or the PARAFT algorithm has the higher performance
when applied in parallel to m fast Fourier transforms each of length n.
When m = 1 the graph applies to the selection of the best algorithm for
a single transform.
(5.99a)
(5.99b)
Compared with equation (5.4b) these have average vector lengths of
TRANSFORMS 509
(5.99c)
(5.99d)
Equations (5.99) show that the (A + B)par algorithm has the better performance
when
(5.99e)
The regions of the (n, m/n1/2) plane favouring the two methods are shown
in figure 5.17. We see that the (A + B)par scheme is favoured whenever either
the number of transformations or the length of the transforms becomes large.
If, alternatively, there is advantage to restricting the parallelism of the
algorithm to approximately n or m, then we can ask whether it is advantageous
to repeat the best algorithm for a single transform m times, or to take the
best serial algorithm and perform it in parallel on the m systems (the
MULTFT algorithm). The best serial algorithm is either scheme A or scheme
B with n1/2 = 0 and requires 5n log2 n real scalar arithmetic operations. The
MULTFT algorithm has average vector length n̄_MULTFT = m and will execute
in a time proportional to:
(5.100a)
When applied sequentially m times, on the m different systems to be
transformed, the other algorithms will execute in a time proportional to:
(5.100b)
(5.100c)
and with average vector lengths
(5.100d)
From these timings we conclude that the MULTFT algorithm will have
a superior performance to the (A + B)seq scheme if
(5.101a)
(5.101b)
510 PARALLEL ALGORITHMS
FIGURE 5.18 The selection of the best algorithm for calculating m fast
Fourier transforms each of length n, on a computer with natural
parallelism approximately m, or for which the A + B scheme is the best
method for calculating a single transform (see figure 5.17). The best serial
FFT is applied in parallel to the transforms, MULTFT, or the A + B
scheme is applied sequentially, (A + B)seq. In this case a single curve
applies for all values of n 1/2, and divides the regions of the (n, m) plane
in which either the MULTFT or (A + B)seq scheme has the better
performance. The open circles are interpolated from the measured results
of Temperton (1979b) on the CRAY-1.
The full curve in figure 5.18 is a plot of equation (5.101a) and shows the
region of the (n,m) plane in which either the MULTFT or the (A + B)seq
scheme has the superior performance. We note that equation (5.101a) is
independent of n l/2 and a single curve applies for all computers. We conclude,
roughly speaking, that if the number of systems to be transformed exceeds
one tenth of the length of the systems, then it is advantageous to use the best
serial algorithm on all systems in parallel, rather than the best parallel
algorithm on each system in sequence. Temperton (1979b) compared the
actual performance of two FORTRAN codes on the CRAY-1 for algorithms
comparable to MULTFT and the (A + B)seq scheme. His measurements are
shown as the open circles and the broken line. The trend of Temperton’s
measurements agrees with the predictions of the simple theory given above,
and would be in absolute agreement if the predicted speed of MULTFT
relative to the (A + B)seq scheme were increased by a factor between two and
three. Such relative behaviour might be expected because the MULTFT
TRANSFORMS 511
algorithm has much simpler indexing than the (A + B)seq scheme (the
increment of the innermost vectorised loop over the m systems is always unity)
and memory is therefore accessed in the most favourable fashion.
Figure 5.19 is a similar comparison between MULTFT and the sequential
application of the PARAFT algorithm. It is for use when we conclude from
figure 5.17 that PARAFT is the best algorithm for performing a single
transform. The following limiting values may be useful:
(5.102a)
(5.102b)
and for large n a limiting value of m is reached
(5.102c)
FIGURE 5.19 The selection of the best algorithm for calculating m fast
Fourier transforms each of length n, on a computer with a natural
parallelism of approximately n or m, and for which the PARAFT
algorithm is the best method of calculating a single transform (see
figure 5.17). PARAFT is applied sequentially to the m transforms,
PARAFTseq. The curve shows the boundary in the (n/n1/2, m/n1/2) plane
between regions in which either the MULTFT or PARAFTseq method
has the better performance.
512 PARALLEL ALGORITHMS
the execution time. Such delays only affect the performance of processor
arrays on which the PARAFT algorithm is most likely to be the best from
the point of view of arithmetic operations (see figure 5.17). We will therefore
consider the routing problem for this algorithm. The routing delay will
obviously depend on the connectivity pattern between the processors of the
array. We will therefore consider a general class of such processor connections,
which includes most of the processor arrays that have been manufactured.
The treatment here follows that of Jesshope (1980a) who gives a compre-
hensive analysis of the implementation of fast radix-2 transforms on processor
arrays. Further results on routing and transpositions in processor arrays are
given by Jesshope ( 1980b,c). In addition to the above, Nassimi and Sahni
(1980) consider the implementation of bit-reversal and the perfect shuffle.
Let us consider a processor array with P processors, arranged in a cartesian
k-dimensional array with Q processors in each coordinate direction, then
(5.103a)
and, in accordance with normal practice, we take Q to be a power of 2
(5.103b)
Each processor has access to data in its own memory and that of its immediate
neighbours in each coordinate direction. Note, in particular, that diagonal
connections are not present. For the case q ≥ 2 this connectivity pattern
requires 2k connections to each processor and kP data paths for the whole
array, if we assume that a data path can be used for transferring data in
either direction. For the case q= 1 the above counts are halved. At the edges
of the array the processors are connected in a periodic sense in each dimension.
If l represents the coordinate direction and i_l the coordinate in the lth
direction, periodicity means that all coordinates are interpreted mod Q,
hence
(5.104a)
and nearest-neighbour connectivity means that processors are connected if
their coordinates differ by ± 1 in one, and only one, coordinate direction.
We wish to use the above k-dimensional array as a store for a one-
dimensional array {f_i, i = 1, ..., n} of data to be transformed. To avoid
complication we shall assume that the number of processors equals the
number of data. Numbering the processors sequentially along the 1st, 2nd
to the kth dimension we have the correspondence
(5.104b)
(5.104c)
(5.105a)
(c) Disable odd groups of n/2^{l+1} processors and change the sign of
the work space.
(d) Add the work space to the data array.
End
In the above formulation the numbers of unit routing operations are
(5.105b)
as l takes on the values 0 to log2(n) − 1. If the P processors are connected in
a linear array (i.e. k = 1) then the total number of routing operations is
(5.106a)
(5.106b)
The first term in equation (5.106a) takes into account that, for / = 0, the
periodicity of the array allows the required routing to be achieved by a single
routing to the left or right by n/2 places. The multiplier two in the second
term takes into account that in step (b) above, the odd and even groups are
routed in separate operations.
If the array is multidimensional and, as we are assuming, the data fills the
processor array exactly, then the first relative movement of data by n / 2 vector
elements can be achieved by a routing of Q/2 along the kth dimension, the
second movement of n/4 by a routing of Q/4 in the kth dimension. The
successive routings by half the previous value can be continued in the kth
dimension until the routing is unity. To route by half this amount we must
now turn to motion in the (k − 1)th dimension by an amount Q/2, which
again can be continued until the routing is unity. To route by half this amount
we must now turn to motion in the (k − 2)th dimension and so on until all the
dimensions have been used up. This is illustrated for a 16 x 16 array of
processors in figure 5.20. Obviously the above process requires routings equal
to equation (5.106a), but with Q replacing n, for each of the k dimensions.
Accordingly the total number of routings of complex numbers for k-
dimensional connectivity is
(5.107a)
The minimum number of such complex routings occurs if Q = 2, that is to
say for a binary hypercube, in which case
(5.107b)
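Equations (5.106a) and (5.107a) are not reproduced here, but the totals they express can be tallied directly from the description above; the following free-form Fortran sketch (ours) does so for a k-dimensional array with Q processors per side.

    integer function nroute(q_side, k)
      ! Unit routing operations for the PARAFT algorithm on a k-dimensional
      ! periodic array with q_side processors per side, tallied as described in
      ! the text: within each dimension, one routing of q_side/2 for the first
      ! level (using the periodic connections), then two routings (odd and even
      ! groups separately) of q_side/2**(l+1) for each subsequent level l.
      integer, intent(in) :: q_side, k
      integer :: per_dim, shift
      per_dim = q_side/2                 ! the single routing of the first level
      shift   = q_side/4
      do while (shift >= 1)              ! the remaining levels within this dimension
         per_dim = per_dim + 2*shift
         shift   = shift/2
      end do
      nroute = k*per_dim                 ! the same count is repeated in each of the k dimensions
    end function nroute

For the binary hypercube (q_side = 2) this tally reduces to k = log2 n routings, the minimum case referred to in equation (5.107b).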
TRANSFORMS 515
The numbers of complex routing operations that are required by the other
computers are, using equation (5.107a):
(5.108a)
(5.108b)
(5.108c)
The number of real parallel arithmetic operations after the above parallel
complex routings is 8 log2 n [see equation (5.98)]. Therefore, if we assume
(5.109a)
where the leading factor of two takes into account that each complex routing
in equation (5.108c) is a movement of two real numbers. For the computers
under consideration, we have approximately:
(a) ‘ILLIAC IV’
γ = 10,   t_R/t_A = 2.5%.   (5.109d)
It appears therefore that routing delays, although significant, do not dominate
the calculation of complex Fourier transforms on practical designs of
processor arrays, and may justifiably be ignored in a first estimate of the
performance of fast transform algorithms. However, we see in §5.5.6 that the
reverse is true for the calculation of number theoretic transforms on certain
types of processor arrays.
In general, routing will dominate over arithmetic when t_R/t_A ≥ 1 or,
recalling that q = log2 Q, when
(5.110a)
or for large Q when
(5.110b)
Then to the nearest power of two we have:
(5.110c)
and hence the smaller the value of γ, the smaller the arrays must be kept to
prevent routing from dominating the execution time. Small values of γ will
also occur in bit-organised computers such as the ICL DAP, when they are
used for short-word-length integer arithmetic (say eight-bit words when γ ≈ 2)
as might arise in picture processing (see §5.5.6).
The above result is important because the advent of very large-scale
integration technology, with 10^4 or more logic elements per chip, makes
possible the manufacture of very large arrays. Such large arrays are attractive
from the cost point of view, and may apparently give very high performance.
However, the above result shows that, when one considers actual algorithms,
one must be aware of the limitations when the linear dimensions become large.
Although we have specifically discussed the routing problem for the fast Fourier
transform algorithm, very similar routings occur in many parallel algorithms—
e.g. the parallel algorithm for the evaluation of a first-order recurrence given
in §5.2.2. The result must therefore be considered to be fairly general.
518 PARALLEL ALGORITHMS
(5.111a)
(5.111b)
(5.111c)
FIGURE 5.22 The discrete orthogonal functions α^{kj} used in the number
theoretic Rader transform (α = 2) for t = 2. Functional values are in the
range 0, 1, ..., F_2 − 1 = 16, and are represented by five bits. The transform
length is n = 8 (n^{-1} = 15), and k = 0, 1, ..., n − 1 = 7. Functions for k ≥ 4
are the mirror image about the vertical line j = 4 of the function that is
drawn for 8 − k (i.e. α^{kj} = α^{(n-k)(n-j)}).
520 PARALLEL ALGORITHMS
(5.112a)
Summing the geometric series and using the fact that α^n = 1, we obtain
(5.112b)
(5.112c)
(5.113a)
(5.113b)
If the binary digits are numbered from 0 to b, starting at the least significant
digit, n^{-1} has ones in the zeroth bit and in bits b − 1 to b − 1 − t inclusive.
To prove equation (5.113b) we calculate
and, since 2^b = F_t − 1,
(5.113c)
TRANSFORMS 521
It is clear that arithmetic modulo F_t requires several tests and special cases,
and it is unlikely that the number theoretic transforms will have any speed
advantages over the Fourier transform on computers that provide hardware
floating-point arithmetic. On bit-serial and bit-addressable computers such
as the ICL DAP or the Goodyear STARAN, which do not provide hardware
floating-point, the situation is radically different, for in these cases special
arithmetic routines can be microcoded at the bit level for modulo arithmetic.
In particular the shifts apparently involved when multiplying by α can
be avoided by addressing the appropriate bits in the memory and any
multiplication by α^{kj} takes no longer than the single subtraction that is
necessary for normalisation. The comparison between modulo F_t and
floating-point arithmetic on the ICL DAP is shown in table 5.1. It is clear
that modulo arithmetic is about 10 times faster than floating-point on this
machine.
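As a concrete illustration of this modular arithmetic, the following free-form Fortran sketch (ours, for small t, and certainly not the bit-level microcode of the ICL DAP) multiplies a value by a power of two modulo F_t by repeated doubling; each doubling needs at most the single normalising subtraction referred to above.

    integer function mulpow2(x, s, t)
      ! Compute (x * 2**s) mod F_t, where F_t = 2**(2**t) + 1 is a Fermat number.
      ! For t <= 4 everything fits comfortably in a default integer.
      integer, intent(in) :: x, s, t
      integer :: ft, y, j
      ft = 2**(2**t) + 1
      y  = mod(x, ft)
      do j = 1, s
         y = 2*y                        ! shift left by one bit
         if (y >= ft) y = y - ft        ! single subtraction normalises the result
      end do
      mulpow2 = y
    end function mulpow2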
If the Rader transform is computed by the same technique as was used in
the PARAFT method for the Fourier transform (see §5.5.4), there will be
one parallel addition operation and one parallel multiplication operation (by
a power of two) for each of the log2n levels of the algorithm. The total number
of modulo F_t arithmetic operations, each with parallelism n, for the
k-dimensional array of n = Q^k = 2^{qk} processors of §5.5.5, is therefore
(5.116a)
The routing requirements are identical to those in the PARAFT algorithm,
therefore
(5.116b)
TABLE 5.1 Comparison between modulo F_t and floating-point
arithmetic on the ICL DAP.
Operation        Modulo F_t       Floating-point
+                    36                150
−                    30                150
×                    24§               250
Route                10                 10
and the ratio of the time spent in routing to that spent on arithmetic is, for
the Rader transform,
(5.116c)
where γ is the ratio of the time for a modulo F_t arithmetic operation to the
time for a unit routing operation. Referring to table 5.1, we see that γ = 3
for 33-bit working with t = 5 on the 64 × 64 ICL DAP (k = 2, Q = 64, q = 6),
leading to
(5.117a)
and
(5.117b)
Hence routings dominate the arithmetic time for the calculation of Rader
transforms on the ICL DAP and similar machines.
The time spent on routing will equal or exceed the time spent on arithmetic
when
(5.118a)
or for large Q when
(5.118b)
The curve for equality and the regions of the Q–γ plane in which routing
or arithmetic dominate are shown in figure 5.21. It can be seen that the linear
dimensions of processor arrays should be kept as small as possible if the
Rader transform is to be efficiently calculated. Since γ is likely to be less than
four for modulo arithmetic, linear dimensions greater than 32 will lead to
algorithms that are dominated by the time spent in rearranging the data in
store, rather than performing useful calculations on the data.
The above analysis is based on a transform equal to the size of the processor
array. However in solving a three-dimensional partial differential equation,
the transform is likely to be much larger than the array size, in which case
the overheads due to routing decrease dramatically. For example, the 261%
overhead given in equation (5.117b) for a 64 × 64 transform becomes only
25% for a 64 × 64 × 64 transform on a 64 × 64 ICL DAP (Jesshope 1980a).
Notwithstanding the fact that routing may dominate arithmetic, the overall
performance of the Rader transform far exceeds that of the Fourier transform
on machines such as the ICL DAP. If we compare the amount of arithmetic
needed to transform n integer values in the Rader transform with that needed
to transform n real values (i.e. n/2 complex values) in the Fourier transform,
524 PARALLEL ALGORITHMS
(5.119a)
where the coefficients A, B, C, D, E are arbitrary functions of position. This
equation encompasses the principal equations of mathematical physics and
engineering (the Helmholtz, Poisson, Laplace, Schrödinger and diffusion
equations) in the common coordinate systems (cartesian (x, y), polar (r, θ),
cylindrical (r, z), axisymmetric spherical (r, θ) and spherical surface (θ, φ)).
If the above equation is differenced on an n x n mesh of points using standard
procedures (see e.g. Forsythe and Wasow 1960), one obtains a set of
algebraic equations, each of which relates the values of the variables on five
neighbouring mesh points:
(5.119b)
where the integer subscripts p,q = 1 ,..., n, label the mesh points in the x and
y directions respectively. The coefficients a, b, c, d and e vary from mesh
point to mesh point and are related to the functions A, B, C, D, E and the
separations between the mesh points in a complicated way through the
particular difference approximation that may be used. The right-hand-side
variable f_pq is a linear combination of the values of ρ(x, y) at the mesh points
near (p, q). In the simplest case it is the value of ρ(x, y) at the mesh point (p, q).
Iterative procedures are defined by starting with a guess for the values of
φ_pq at all the mesh points, and using the difference equation (5.119b) as a
basis for calculating improved values. The process is repeated and, if
successful, the values of φ converge to the solution of equation (5.119b) at
all mesh points. In the simplest procedure the values of φ at all mesh points
are simultaneously adjusted to the values they would have by equation
(5.119b) if all the neighbouring values of φ are assumed to be correct, namely
φ at each mesh point is replaced by a new value:
(5.120)
Since the replacement is to take place simultaneously, all values of φ on the
right-hand side are ‘old’ values from the last iteration and the starred values
on the left-hand side are the ‘new’ and, hopefully, improved values.
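A minimal free-form Fortran sketch of one such sweep is given below; it is our own illustration, and the association of the coefficients a, b, c, d with particular neighbours is an assumption, since equation (5.119b) is not reproduced here.

    subroutine jacobi_sweep(n, a, b, c, d, e, f, phi, phinew)
      ! One Jacobi iteration (simultaneous displacements) for the five-point
      ! difference equation on an n x n mesh: every interior point is updated
      ! from the 'old' values of its neighbours, so all updates may proceed in
      ! parallel.  a, b, c, d are assumed to multiply the west, east, south and
      ! north neighbours and e the central point.
      integer, intent(in)  :: n
      real,    intent(in)  :: a(n,n), b(n,n), c(n,n), d(n,n), e(n,n), f(n,n), phi(n,n)
      real,    intent(out) :: phinew(n,n)
      integer :: p, q
      phinew = phi                               ! boundary values are left unchanged
      do q = 2, n-1
         do p = 2, n-1                           ! vectorisable inner loop
            phinew(p,q) = ( f(p,q) - a(p,q)*phi(p-1,q) - b(p,q)*phi(p+1,q)      &
                                   - c(p,q)*phi(p,q-1) - d(p,q)*phi(p,q+1) ) / e(p,q)
         end do
      end do
    end subroutine jacobi_sweep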
The above method of simultaneous displacements was first considered by
Jacobi (1845), and is often called the Jacobi method. It is ideally suited for
PARTIAL DIFFERENTIAL EQUATIONS 527
The above convergence factor can be used to calculate the number of
iterations required to reduce the error by a factor of 10^{-p}:
(5.121b)
† The sum over all mesh points of the square of the difference between the approximate
and exact solution of the equations.
528 PARALLEL ALGORITHMS
Thus a modest error reduction of 10^{-3} on a typical 128 × 128 mesh would
require about 24000 iterations. Such slow convergence makes the Jacobi
method useless for practical computation, even though it is highly suitable
for implementation on parallel computers.
The most commonly used iterative method on serial computers is the
method of successive over-relaxation by points or SOR. In this method a
weighted average of the ‘old’ and starred values is used as the ‘new’ value
according to
φ_pq^{new} = ω φ*_pq + (1 − ω) φ_pq^{old},   (5.122a)
where ω is a constant relaxation factor, normally in the range 1 ≤ ω ≤ 2, that
is chosen to improve the rate of convergence. For the model problem, it may
be shown that the best rate of convergence is obtained with:
(5.122b)
(5.123c)
530 PARALLEL ALGORITHMS
(5.125)
Equation (5.125) is a tridiagonal system for all the starred values along a
PARTIAL DIFFERENTIAL EQUATIONS 531
line, with the right-hand side depending on known values from the line above
and below. The Chebyshev accelerated successive line over-relaxation (SLOR)
proceeds line-wise using equations (5.122) and (5.123a), only now ρ, the
convergence factor of the Jacobi method performed line-wise, is, for the model
problem,
ρ = cos(π/n)/[2 − cos(π/n)] ≈ 1 − π²/n².   (5.126a)
Equation (5.122b) still applies with this revised value of ρ, leading to
with useful resolution are likely to have between 32 and 256 mesh points
along each dimension. Thus it is natural to match the parallelism of the
CRAY X-MP with a side of the mesh and use algorithms which have a
parallelism of n or n/2. On the other hand it is more natural to match the
parallelism of the ICL DAP with two dimensions of the problem and use
algorithms which have a parallelism of n2 or n2/ 2. This choice is made even
more compelling when one considers that the ICL DAP is wired as a
two-dimensional array of processors.
Having chosen the level of parallelism that is best suited to the computer,
one may ask whether Chebyshev SOR or Chebyshev SLOR is the best algorithm
to use. On the CRAY X-MP using n/2 parallelism, SOR will be the best
algorithm if
(5.128a)
where the factor takes into account the better convergence of the SLOR
method. Since t_2 = t_3 [see equations (5.124b) and (5.127a)], the above
condition (5.128a) can never arise, and we conclude that the SLOR will always
be the best on the CRAY X-MP. Put another way, the time per iteration is the
same when computed with a parallelism of n/2, and SLOR therefore wins
because of its faster convergence.
For the ICL DAP using n²/2 parallelism, we find that SOR will be the best
algorithm if
(5.128b)
or
(5.128c)
This condition is satisfied for n ≥ 2, that is to say for all useful meshes. We
thus find the SOR method the best implementation on the ICL DAP. This is
because the parallel solution of tridiagonal equations by cyclic reduction that
is required if we use the SLOR method introduces extra arithmetic that is not
needed in the simpler SOR scheme. The better convergence of SLOR is not
enough to make up for this.
The last iterative method to be considered is the alternating direction
implicit method or ADI. This is illustrated in figure 5.23(c and d). When it is
applied to the solution of the difference equation,
(5.129a)
by solving repeatedly
(5.129b)
(5.129c)
where n = 0, 1, ..., is the iteration number and the parameter r_n, which
changes every double sweep, is adjusted to improve the convergence of the
approximations φ^{(n)} to the solution φ. If we are concerned with the general
linear second-order PDE (equation (5.119a)), L_x and L_y are tridiagonal matrices.
Equation (5.129b) involves computing a right-hand side at a cost of seven
operations per mesh point, followed by the solution of n independent
tridiagonal systems (one for each horizontal line of the mesh) each of length
n (see figure 5.23(c)). Clearly these systems may be solved in parallel either
with a parallelism of n or n², as in the case of the SLOR method. The iteration
is completed by performing a similar process, only now the tridiagonal systems
are solved along every vertical line of the mesh (see figure 5.23(d)).
If a parallelism of n is used (e.g. on the CRAY X-MP) we use the MULTGE
method and the computer time for a complete iteration is proportional to
(5.130a)
On the other hand using a parallelism of n² (e.g. on the ICL DAP) and the
PARACRpar method the computer time is proportional to:
(5.130b)
The fact that data is first referenced by horizontal lines and then by vertical
lines may complicate the implementation of adi on some computers. If the
mesh is stored so that adjacent elements in vertical lines are adjacent in the
computer memory (FORTRAN columnar storage) then the solution of
equation (5.129b) will present no problems because the increment between
vector elements is unity (remember we run the vector across the systems
being solved). However, in solving equation (5.129c) the increment between
vector elements will be equal to the number of variables in a column. If n is
a power of two, memory-bank conflicts can be a problem even in computers
such as the CRAY X-MP, that allow increments other than unity. This may be
overcome by storing the mesh as though it had a column length one greater
than its actual length. On the other hand, on computers such as the CDC
CYBER 205 which only permit vectors with an increment of unity, the second
step of the ADI can only be performed after a rotation of the whole mesh in
the store. The cost of this manipulation may make ADI an unattractive method
on such machines, compared with the SLOR method that only references data
in the horizontal direction. Experience with ADI and SLOR is varied and there
534 PARALLEL ALGORITHMS
(5.131)
in which the coefficients are constant and given by: a_pq = b_pq = c_pq = d_pq = 1
and e_pq = −4. These techniques, often called rapid elliptic solvers or RES
algorithms, are an order of magnitude faster than the iterative methods
described in §5.6.1 and can be applied to meshes containing 10000 or more
points. RES methods are inherently highly parallel and therefore especially
well suited to exploiting the architecture of parallel computers. Rapid elliptic
solvers have been reviewed by Swarztrauber (1977) and by Hockney (1980),
and the performance of different algorithms compared by Hockney (1970,
1978) and Temperton (1979a) for serial computers, and Grosch (1979) and
Temperton (1979b, 1980) for parallel computers.
Rapid elliptic solvers are defined as those methods with an operation count
of order n² log2 n or better [some, e.g. FACR(l), are of order n² log2(log2 n)],
which immediately suggests the involvement of fast transform algorithms (see
§5.5). The simplest method is obtained by taking the double Fourier transform
of equation (5.131):
(5.132a)
where
(5.132b)
and
(5.132c)
PARTIAL DIFFERENTIAL EQUATIONS 535
and similarly for f̄_kl and f_pq. Equation (5.132a) permits the calculation of
the Fourier harmonic amplitudes of the solution, φ̄_kl, by dividing the
harmonics of the right-hand side, f̄_kl, by the known numerical factors in the
square brackets. The following method of multiple Fourier transform (MFT)
therefore suggests itself:
(a) Fourier analyse f_pq using the FFT
Both the double Fourier analysis and the synthesis can conveniently be
performed by first transforming all lines of data in the x direction as in
figure 5.23(c) and then transforming the resulting data by lines in the y
direction as in figures 5.23(d). This may clearly be seen by re-expressing
equation (5.132b) equivalently as
(5.134)
where the inner summation is the transform in x and the outer summation
the transform in y. We have assumed doubly periodic boundary conditions
in the above explanation. However, Dirichlet (given value) conditions can
be achieved if the finite sine transform is used and Neumann (given gradient)
conditions if the finite cosine transform is used. These refinements do not
affect the issues involved in parallel implementation.
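For the five-point Poisson problem with doubly periodic boundaries the factor in the square brackets of equation (5.132a) is, under the usual differencing, 2cos(2πk/n) + 2cos(2πl/n) − 4. Assuming that form (the equation itself is not reproduced here), step (b) of the MFT method is a single vectorisable division per harmonic, as in the following free-form Fortran sketch of ours.

    subroutine mft_divide(n, fhat, phihat)
      ! Divide each harmonic of the right-hand side by the assumed factor
      ! 2cos(2*pi*k/n) + 2cos(2*pi*l/n) - 4 to obtain the harmonics of the solution.
      integer, intent(in)  :: n
      complex, intent(in)  :: fhat(0:n-1,0:n-1)
      complex, intent(out) :: phihat(0:n-1,0:n-1)
      real, parameter :: pi = 3.1415926535897932
      real    :: denom
      integer :: k, l
      do l = 0, n-1
         do k = 0, n-1
            if (k == 0 .and. l == 0) then
               phihat(k,l) = (0.0, 0.0)   ! the mean harmonic is indeterminate for the
            else                          ! periodic problem and is conventionally set to zero
               denom = 2.0*cos(2.0*pi*k/n) + 2.0*cos(2.0*pi*l/n) - 4.0
               phihat(k,l) = fhat(k,l)/denom
            end if
         end do
      end do
    end subroutine mft_divide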
The line transforms in equation (5.134) are all independent and may be
performed in parallel with a parallelism of n or n2. In the first option we
perform n real transforms in parallel using the best serial algorithm, that is
to say the MULTFT method of equation (5.100a) with m = n. The total time
for the algorithm is therefore:
(5.135a)
(5.135b)
The factor of four in equation (5.135a) comes from the need to transform all
points four times—an analysis and synthesis in both x and y directions—and
the factor ½ arises because we are performing real rather than complex
transforms. Alternatively if we perform n transforms in parallel using the
parallel PARAFTpar algorithm then we have a parallelism of n² and
(5.136a)
(5.136b)
536 PARALLEL ALGORITHMS
(5.137b)
The regions of the (n1/2/n, n) plane which favour the use of the two algorithms,
using equation (5.137b), are shown in figure 5.24. We see that, for vectors of
any reasonable length (say n > 10), the PARAFTpar method is favoured if
(n1/2/n) ≥ 0.5. That is to say, not surprisingly, that if the computer ‘looks’
parallel when measured on the scale of the vector (n1/2 > n/2), then an
algorithm that is designed to maximise the extent of parallel operation is
favoured.
FIGURE 5.24 Regions of the (n1/2/n, n) plane favoured for the imple-
mentation of the MFT method by either the MULTFT algorithm or the
PARAFTpar algorithm. From equation (5.137b).
PARTIAL DIFFERENTIAL EQUATIONS 537
(5.138)
(5.139b)
on computers such as the ICL DAP. James and Parkinson (1980) have
assessed the effectiveness of the method in the solution of the Poisson equation
for a gravitational problem on a 65 x 65 x 65 mesh.
One of the earliest and most successful rapid elliptic solvers is the method
of Fourier analysis and cyclic reduction or FACR algorithm (Hockney 1965,
1970, Temperton 1980). This was devised to minimise the amount of
arithmetic on serial computers by reducing the amount of Fourier trans-
formation, and has an asymptotic operation count equal to 4.5n² log2(log2 n),
when optimally applied (see Hockney and Eastwood 1988). The FACR
algorithm also proves to be highly effective on parallel computers.
The discrete Poisson equation (5.131) may be rewritten line-wise as follows:
(5.140)
where the elements of the column vectors φ_q and f_q are, respectively, the
values of the potential and the right-hand side along the qth horizontal line
of the mesh. The matrix A is an n x n tridiagonal matrix with a diagonal of
—4 and immediate upper and lower diagonals of unity. It is a matrix that
represents the differencing of the differential equation in the x direction.
If we multiply every even-line equation like equation (5.140) by −A, and
add the equation for the odd line above and the odd line below, we obtain
a set of equations relating the even lines only, namely:
f_q^{(1)} = f_{q-1} − A f_q + f_{q+1}.   (5.141c)
This constitutes one level of cyclic reduction, line-wise, of the original
equations. The resulting equations (5.141a) are half in number (even lines
only) and have the same form as the original equations (5.140). The process
of cyclic reduction may therefore be repeated recursively, yielding reduced
sets of equations for every fourth, eighth, sixteenth line etc at levels r = 2,3,4
etc. At each level there is a new central matrix A^{(r)} that can be expressed as
a product of 2^r tridiagonal matrices:
(5.142)
(5.143c)
Having found the solution of every 2^l th line from equation (5.143c), the
intermediate lines can be filled in successively from the intermediate level
equations
(5.144)
It will be found, when applied to the n2^{-(r+1)} unknown lines, that the values
of φ on the right-hand side of equation (5.144) are values that have been
found at the previous deeper level. Because of the factorisation (5.142), the
solution of equation (5.144) requires the successive solution of 2^r tridiagonal
systems.
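The right-hand-side part of one such level of line-wise reduction, following equation (5.141c), can be sketched as below (in free-form Fortran, ours); Dirichlet conditions are assumed so that lines outside the mesh contribute nothing.

    subroutine facr_reduce_rhs(n, f, f1)
      ! One level of line-wise cyclic reduction of the right-hand sides:
      !   f1_q = f_(q-1) - A*f_q + f_(q+1)   for the even lines q,
      ! where A is tridiagonal with diagonal -4 and off-diagonals +1.
      integer, intent(in)  :: n
      real,    intent(in)  :: f(n,n)        ! f(p,q) = right-hand side at mesh point (p,q)
      real,    intent(out) :: f1(n,n/2)     ! reduced right-hand sides on the even lines
      real    :: af(n)
      integer :: p, q
      do q = 2, n-1, 2
         do p = 1, n                        ! af = A times column q of f
            af(p) = -4.0*f(p,q)
            if (p > 1) af(p) = af(p) + f(p-1,q)
            if (p < n) af(p) = af(p) + f(p+1,q)
         end do
         do p = 1, n                        ! all n elements of the line in parallel
            f1(p,q/2) = f(p,q-1) - af(p) + f(p,q+1)
         end do
      end do
    end subroutine facr_reduce_rhs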
The above algorithm is referred to as FACR(l) where the argument gives
the number of levels of cyclic reduction that are performed before the
equations are solved by Fourier analysis. It was shown numerically by
Hockney (1970) and proved analytically by Swarztrauber (1977) that there
is an optimum value l = l* ≈ log2(log2 n) which leads to the minimum total
number of arithmetic operations. As more cyclic reduction is performed
(increasing l) less Fourier transformation takes place (there are fewer lines
to do it on); however, it is necessary to solve more tridiagonal systems in
using equation (5.144) to fill in the intermediate lines. The exact position of
the optimum l* therefore depends on the relative efficiencies of the computer
codes that are used for the FFT and the solution of tridiagonal systems. A
better FFT code will lead to lower l* and a better tridiagonal solver to higher
l*. The best strategy is probably to write a computer code for general l and
measure the optimum value. This has been done by Temperton (1980) with
his code PSOLVE on the IBM 360/195 (optimum l* = 2), and by Temperton
(1979b) on the CRAY-1 (optimum l* = 2 in scalar mode, l* = 1 in vector
mode). It is evident that the introduction of parallelism into the implemen-
540 PARALLEL ALGORITHMS
FIGURE 5.25 The patterns of related data in the different steps of the
FACR(1) algorithm. The arrows link data that are related in either a
Fourier transformation or a tridiagonal system of equations.
figure 5.25 for the case of a FACR(1) algorithm, in which there is one level
of initial cyclic reduction.
The FACR(l) algorithm consists of five stages, and can be implemented
either to minimise the total amount of arithmetic, s (serial variant), or to
minimise the number of vector operations, q (parallel variant), as was
described in §5.1.6. The serial variant is called SERIFACR and the parallel
variant PARAFACR. We will now describe both methods, and then compare
the two, using the n1/2 method of algorithm analysis. The objective of the
analysis is first to find which variant is best on a particular computer, and
secondly to choose the optimum value of l. This will be done by drawing
the appropriate algorithmic phase diagrams (Hockney 1983).
number of points along one side of the mesh. The five stages of the FACR(l)
algorithm are:
(5.145a)
(5.145b)
(5.145c)
(5.145d)
The leading coefficient is taken as five, rather than eight, because the equations
(5.143b) have two coefficients that are unity (a_i = c_i = 1).
(d) FFT synthesis—on n2^{-l} lines in parallel using the MULTFT algorithm.
Each transform is real and of length n:
(5.145e)
(e) Filling-in—of the intermediate lines by solving equations (5.144),
involving (2 + 5 × 2^r)n operations per line on n2^{-(r+1)} lines, for r = l − 1,
l − 2, ..., 0
(5.145f)
(5.145g)
(5.146a)
PARTIAL DIFFERENTIAL EQUATIONS 543
(5.146b)
where the first square bracket is the normal serial computer operation count†
and the second square bracket takes into account the effect of implementation
on a parallel computer. We note that the first bracket has a minimum for
positive l ≈ log2(log2 n) but that the second bracket increases monotonically
with l. Thus, for parallel computers which have n1/2 > 0, the minimum in
t_SERIFACR moves to smaller l, as asserted earlier.
The equal performance line between the algorithm with l levels of reduction
and that with l + 1 is easily found to be given by
(5.147)
The form of equation (5.147) suggests that a suitable parameter plane for
the analysis of SERIFACR is the (n1/2/n, n) phase plane, and this is shown
in figure 5.26. The equal performance lines given by equation (5.147) divide
the plane into regions in which l = 0, 1, 2, 3 are the optimum choices. Lines
of constant value of n1/2 in this plane lie at 45° to the axes, and the lines for
n1/2 = 20, 100, 2048 are shown broken in figure 5.26. These lines are considered
typical for the behaviour, respectively, of the CRAY-1, the CYBER 205, and
the average performance of the ICL DAP. For practical mesh sizes (say
n < 500) we would expect to use l = 1 or 2 on the CRAY-1, l = 0 or 1 on the
CYBER 205, and l = 0 on the ICL DAP. The lower of the two values
for l applies to problems with n < 100. Temperton (1979b) has timed a
SERIFACR(l) program on the CRAY-1 and measured the optimum value
of l = 1 for n = 32, 64 and 128. This agrees with our figure except for n = 128,
where figure 5.26 predicts l = 2 as optimal. This discrepancy is probably
because Temperton uses the Buneman form of cyclic reduction (see Hockney
1970) which increases the computational cost of cyclic reduction and tends
to move the optimum value of Zto smaller values. For a given problem size
(value of n), figure 5.26 shows more serial computers (smaller n 1/2) to the
† Hockney (1970, 1980) quoted 4.5l for the leading term of the first bracket. This was
because scalar cyclic reduction (6 operations per point) was assumed for the
solution of the tridiagonal systems, instead of the Gaussian elimination assumed here
(5 operations per point). Other assumptions can make minor and unimportant
differences to this equation.
left and more parallel computers (larger n1/2) to the right. We see, therefore,
that the more parallel the computer, the smaller is the optimum value of l.
In the SERIFACR algorithm the vectors are laid out along one or other
side of the mesh and never exceed a vector length of n. It is an algorithm
suited to computers that perform well on such vectors, i.e. those that have
n1/2 < n, and/or which have a natural parallelism (or vector length) which
matches n. The latter statement refers to the fact that some computers (e.g.
CRAY X-MP) have vector registers capable of holding vectors of a certain
length (64 on the CRAY X-MP). There is then an advantage in using an
algorithm that has vectors of this length and therefore fits the hardware
design of the computer. For example, the SERIFACR algorithm would be
particularly well suited for solving a 64 x 64 Poisson problem on the
CRAY X-MP using vectors of maximum length 64, particularly as this
machine is working at better than 80% of its maximum performance for
vectors of this length. On other computers, such as the CYBER 205, there
are no vector registers and n1/2 ≈ 100. For these machines it is desirable to
increase the vector length as much as possible, preferably to thousands of
elements. This means implementing the FACR algorithm in such a way that
the parallelism is proportional to n2 rather than n. That is to say, the vectors
are matched to the size of the whole two-dimensional mesh, rather than to
one of its sides. The PARAFACR algorithm that we now describe is designed
to do this.
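The contrast in vector length between the two variants can be made concrete with a small sketch (the mesh size and the operations applied are arbitrary; only the vector lengths matter):

    # Vector lengths seen by the two variants on an n x n mesh (illustration only).
    import numpy as np

    n = 64
    mesh = np.zeros((n, n))

    # SERIFACR-style layout: work line by line, so each vector operation
    # involves a vector of length n.
    for line in mesh:            # n vector operations of length n
        line += 1.0

    # PARAFACR-style layout: treat the whole mesh as one long vector, so a
    # single vector operation involves n*n elements (attractive when n_1/2
    # is comparable to or larger than n).
    flat = mesh.reshape(-1)
    flat *= 2.0

    print(len(line), len(flat))  # 64 and 4096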
(5.148a)
(5.148e)
Summing the above, we find that the time per mesh point for the PARAFACR
algorithm is proportional to
(5.149a)
where
(5.149b)
(5.150a)
where
(5.150b)
is the best for large n1/2 (greater than about 0.4n, the more parallel computers).
Lines of constant n1/2 are shown for the CRAY-1 and CYBER 205. We
conclude that SERIFACR should be used on the CRAY-1, except for small
meshes with n < 64 when PARAFACR(2) is likely to be better. On the
CYBER 205, PARAFACR is preferred except for very large meshes when
SERIFACR(2) (300 < n < 1500) or SERIFACR(l) (n > 1500) is better.
serial algorithm and repeat this in all directions. Finally, we may transform
n² lines in parallel using the parallel algorithm PARAFT and obtain a
parallelism of n³. Similar considerations can be applied to the levels of
parallelism used in the implementation of the iterative methods. There is
clearly not space here to develop and contrast these alternatives; however,
all the necessary data and principles have been developed earlier in the
chapter for the reader to do this himself for his particular problem.
6 Technology and the Future
Over the last five years, since the publication of the first edition of this book,
we have seen major advances in the use of parallel processing. Up until that
time the dominant architectures in the supercomputer market were the
pipelined vector machines, such as the CRAY-1. The 1980s, however, have
seen many manufacturers turning to replication in order to meet computational
demands. All of the vector computer manufacturers of the late 1970s have
looked for more performance by including multiple processors, usually
coupled by shared memory. Another major development has been the
INMOS transputer and other similar single-chip solutions to the use of
multiple processors in a single system. The T800 transputer, for example, can
perform at a continuous rate of 1 Mflop/s, and can be connected into
four-connected networks of other transputers. The N-cube (Emmen 1986)
processor can perform at about 0.5 Mflop/s and can be connected in
ten-connected networks (hypercubes containing up to 1024 processors). In
both cases 1000 chips is not an excessive number to include in a single
system, and therefore these CMOS VLSI devices will compete in performance
with the more expensive ECL supercomputers.
Indeed, it is instructive to look at the technological aspects of the
improvements in supercomputer performance. In the CRAY-1 range of
machines, for example, although performance is up by a factor of six during
this period, only a factor of 1.5 of this is due to an improvement in clock
rates; the other factor of 4 comes from the use of multiple processors. This
amply illustrates the diminishing returns in very high-speed logic implemen-
tations of large monolithic machines. It is quite likely that gate delays in
this range of machines have improved by considerably more than the
1.5 times that the clock rate would imply; however, the length of wire in
this range of machines has not been significantly reduced. It is this that is
limiting the greater improvements in clock rate. The CRAY-2, by clever
refrigeration engineering, has been able to reduce its clock period because of a more
compact physical size. This engineering, however, does not tend to produce
cheap machines.
It can be said, therefore, that replication has become accepted as a necessary
vehicle (desirable or not) towards greater computational power. The purpose
of this chapter is to consider where we should go now that this barrier has
been overcome. We will restrict ourselves mainly to architectural considerations,
although the impact on algorithms and applications will not be ignored, as
the latter is the main driving force behind the continuing quest for more
computational power.
From the introduction to this book we have seen that increases in
computational power have averaged a factor of ten every five years or so
(see figure 1.1). This has been due to demand and there is no reason to
suspect that this demand should suddenly abate. In fact the signs indicate
that this demand for more computational power is increasing. The US
Department of Defense has launched a programme to develop VLSI processors,
with capacities of 3 × 10⁹ operations per second (Sumney 1980). Other
more important applications are also begging for more computational
power (Sugarman 1980), many of which are important to our life style (e.g.
modelling energy resources, weather and climate, and even people in
computer-assisted tomography).
Like all new developments, the use of parallelism in computer systems
started at the top end of the computer market. The scientific main-frame or
supercomputer, as it now seems to be called, is costly but provides a state-
of-the-art performance (currently around 500 Mflop/s). However, like all
developments that prove cost effective, such machines soon find niches in a more general
market and become accepted in a wider range of applications.
Applications which will provide large markets in the future have moved
away from the scientific and simulation areas and are likely to be concerned
with the more immediate aspects of many people’s lives. For example, expert
systems and database applications will become more frequently used. The
Japanese fifth-generation program, and following this the UK Alvey and
EEC ESPRIT programs, has provided a large amount of activity in this area.
Other examples are: human interfaces, such as speech input and natural
language understanding; image processing applications, for office systems or
factor automation; and of course other signal processing applications for
systems as diverse as mobile cellular radio and radar guidance systems.
There is no doubt that the provision of processing power for these
applications will come from parallel processing. It is also obvious that this
demand is being driven by the cheap processing power provided by the
high-volume VLSI end of the semiconductor market. We therefore direct most
of this chapter towards the technological trends, their effectiveness and
6.1 CHARACTERISATION
In the case of demand gates, power dissipation will depend on the average
clocking frequency per gate, f_c. Thus, assuming two logic transitions per cycle,
N_a will be limited by equation (6.3).
(6.3)
The average clocking frequency will generally be less than the system clock,
as not all gates will make transitions every clock period. This is particularly
important in memory technology, where only a few memory cells are accessed
in each memory cycle. However, because of circuit size, static memory cells
usually use loss load logic. Tables 6.1 and 6.2 summarise the above
relationships.
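The constraint summarised by those tables can be sketched as follows. The relation assumed here is that a package able to dissipate P_max watts can support N_a active demand gates of switching energy E_sw clocked at an average frequency f_c, with two transitions per cycle; the symbol P_max and all the numerical values are illustrative assumptions, not figures from the tables.

    # Sketch of the integration limit for demand gates (assumed relation):
    #     N_a <= P_max / (2 * f_c * E_sw)
    # Two logic transitions per cycle, each dissipating the switching energy E_sw.

    def max_active_gates(p_max_watts, f_c_hz, e_sw_joules):
        return p_max_watts / (2.0 * f_c_hz * e_sw_joules)

    # e.g. a 2 W package, 20 MHz average gate clocking, 1 pJ switching energy
    print(int(max_active_gates(2.0, 20e6, 1e-12)))   # about 50 000 gates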
There are other considerations which limit the number of gates on a single
chip, the most obvious being gate area. Gate area is a deciding factor for the
cost of the chip, since it measures the 'silicon real estate' required. However, as
minimum dimensions on devices approach the 1 µm barrier, area usage is
more dominated by interconnections, as will be discussed later.
Thus we now have three parameters which characterise a given technology:
gate delay t_d, power dissipation per gate P_D (alternatively the switching energy
E_sw = t_d P_D) and finally gate area. In the following discussion we will use
the switching energy E_sw (pJ).
TABLE 6.2 Levels of integration possible for a loss load gate technology
as a function of average gate power dissipation.
shown in figure 6.1. 'Free electrons' are introduced into the silicon crystal
lattice by n-type impurities and 'free holes' are created in the lattice by p-type
impurities. A hole is the absence of an electron; holes move and create
currents exactly as free electrons do, but of course in the opposite direction. They
are, however, less mobile. It is because the boundaries between these regions
can be controlled with great precision (~0.1 µm) when compared with planar
dimensions (~1 µm) that bipolar technologies tend to be faster. Their logic
circuits, however, are more complex.
There are many different technologies within these two broad categories,
depending on circuits, materials and processing features. For example, many
of the experimental technologies rely on novel processing steps. These reduce
device dimensions and consequently reduce power and increase speed.
Thus it can be seen that NMOS gates only draw significant power when their
output is low; also, transient power spikes are not significant.
The packing density of the basic NMOS inverter gate using 2 µm rules is
about 25 000 gates per mm², based on the geometries of an inverter, although
this density is rarely found in practice.
transistors in series. The NOR gate is constructed with series p-type and
parallel n-type transistors.
Another attractive feature of CMOS technology is that it only draws current
when switching (figure 6.5(b)). This is simply verified by noting that in any
complementary transistor pair, one device is always off. This means that the
power dissipation will be a function of the average clocking rate. This is a great
advantage in memory technology, where currently NMOS is approaching the
thermal barrier of about 1–2 W per package (Wollesen 1980). CMOS has
slightly poorer packing densities than NMOS and is a more expensive process.
However, CMOS is now more competitive and has become the major process
of the 1980s.
One disadvantage of all MOS processes, which is not shared by bipolar
devices, is that they cannot be run hot. MOS devices suffer a decrease in speed
of a factor of two over a 100°C temperature rise. This means that MOS
technologies are more firmly bound to the 1–2 W per package thermal
dissipation, whereas bipolar technologies are not.
Parallel computers, especially large replicated designs, are very well suited
to the continuous revolution that is taking place in the micro-electronics
industry. Very large-scale integration (VLSI), with 10⁵ or even 10⁶ gates on
a single integrated circuit, is now commonplace and the technology marches
on with the continuing advances in processing facilities, in particular the
resolution of printing or 'writing' of the circuits on the silicon slice.
The effects of scaling down device sizes have been a topic of much interest
and, for MOS transistors, rules which maintain well behaved devices have
been known for some time (Dennard et al 1974, Hoeneisen and Mead 1972).
For a scaling factor of K (> 1), if all horizontal and vertical distances are
scaled by 1/K and substrate doping levels are scaled by K, then using voltage
levels also scaled by 1/K the characteristics of devices should scale as given
in table 6.3 (Hayes 1980).
As we have already seen, it is desirable to decrease the propagation delay,
but not at the expense of the power dissipated. With these scaling rules, the
delay is scaled by 1/K and the power dissipated per device is scaled by a
factor of 1/K², both very favourable. However, it should be noted that the
packing density is increased by K², and thus the power density remains
the same. Thus it is possible to increase packing densities by device scaling
without the problems of approaching the thermal dissipation barrier.
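The scalings just quoted can be tabulated directly; the sketch below simply encodes the factors stated in the text for an arbitrary K, and is not a reproduction of table 6.3.

    # Constant-field scaling factors for a scaling factor K > 1, as quoted above.
    def scaled_characteristics(K):
        return {
            "linear dimension":  1 / K,
            "supply voltage":    1 / K,
            "gate delay":        1 / K,
            "power per device":  1 / K**2,
            "packing density":   K**2,
            "power density":     1.0,     # unchanged
            "current density":   K,
        }

    for name, factor in scaled_characteristics(2.0).items():
        print(f"{name:17s} x{factor:g}")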
There are problems in this scaling, as the current density increases by a
factor of K, which may cause reliability problems. If current densities become
too high, metal connections will migrate with the current flow. Other problems
occur because voltages must be scaled with devices, reducing the difference
between logic levels. Noise levels remain constant, owing to the thermal energy
of the discrete particles, and this can be a very critical problem. Scaling
devices without scaling voltages leads to a K³ increase in power density.
A final problem in device scaling occurs in the increased relative size and
delay of the devices required to drive the external environment. Thus the full
benefit of scaling may be found inside the chip but as soon as signals must
leave the chip, only diminishing returns for scaling will be observed.
Although scaling is very attractive and may seem to be unlimited, there
are certain fundamental limitations due to the quantum nature of physics.
These are discussed in some detail in Mead and Conway (1980). In practice
these limits should not be approached until devices shrink to 0.25 µm in size,
a further ten-fold reduction over current mainstream processes.
In contrast, the scaling for bipolar transistors is neither so rigorous nor
so well defined. Bipolar transistors may be scaled without scaling vertical
dimensions; indeed, with present technologies it would be very difficult to
scale this dimension to a very great extent. As transit times in bipolar
transistors are dependent on the vertical dimension, ECL propagation delay
times will not scale linearly with planar dimensions. However, there will be
some reduction in propagation delay due to capacitative effects, and power
scaling will be similar to MOS devices.
A paper by Hart et al (1979) has looked at the simulation of both bipolar
and MOS devices as they are scaled down. Their results are summarised in
figures 6.6 and 6.7. Experimental sub-micron devices which have already been
fabricated confirm these simulated trends (e.g. Fang and Rupprecht 1975,
Sakai et al 1979). How soon such devices will be in production is a difficult
question to answer. The seemingly simple scaling rules presuppose many
improvements in processing technologies, and there is a great deal of
development between the yields suitable for experimental devices and the
yields suitable for VLSI production.
similar to that used in the CRAY cabinet. It can be seen that one dimension
of the backwiring plane is reduced by a half. The ideal shape is one with
complete spherical symmetry as this minimises wire lengths. Perhaps the
CRAY-2 will come in this shape!
Other features in the CRAY-1 which minimise propagation delay are
matched transmission lines which are resistively terminated and the extremely
high chip packing densities. These features cause considerable cooling
problems, and in excess of 100 kW of heat power must be extracted from
the 100 or so cubic feet of cabinet space. To give some feeling for this power
density, imagine putting a 1 kW electric element into a biscuit tin and trying
to keep it just above room temperature. Even with such a feat of refrigeration
50% of the 12.5 ns clock period on the CRAY-1 is an allowance for
propagation delay.
To give an idea of the scales involved in this problem, consider the
following: in a perfect transmission line signals travel at the speed of light,
which is about 984 × 10⁶ ft/s, i.e. a little under 1 ft/ns. In practice, transmission lines
have capacitance and resistance, and the propagation of a signal will be
slowed and attenuated. For an RC network the diffusion equation describes the
propagation of signals:

    ∂²V/∂x² = RC ∂V/∂t    (6.4)

where R and C are the resistance and capacitance per unit length. The
diffusion delay varies quadratically with length for constant R and C, and is
the critical delay for on-chip signal propagation.
Returning to table 6.3, we see that although the RC delay remains constant
during scaling, this figure is based on a scaled-down wire. It is reasonable
to assume that the chip itself will remain the same size, so as to reap the
benefits of the increased packing density. If this is so, distances will be
relatively larger, with the net effect that tracks spanning the chip will be
slower to respond by a factor of K², using the same drive capability. Indeed
the situation is worse than this, for this absolute delay must really be compared
to the now faster gate delay, which has scaled by K⁻¹, giving an overall
perceived degradation of K³ in the responsiveness of a global wire. Although
the effects of this unfavourable scaling are only just being felt, this factor is
likely to have a major effect on the design of systems exploiting the new
technologies. They will be blindingly fast at a local level, but increasingly
sluggish at the global or off-chip levels. The implications of this are now
considered.
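Before doing so, the relative penalty can be put in numbers. The sketch below assumes exactly what is stated above: the chip edge stays the same size, so a chip-spanning track responds a factor K² more slowly for the same drive, while gate delay improves by K.

    # Perceived degradation of a global (chip-spanning) wire under scaling by K.
    def global_wire_penalty(K):
        wire_delay_factor = K**2       # RC diffusion delay of a full-length track
        gate_delay_factor = 1.0 / K    # gates become faster
        return wire_delay_factor / gate_delay_factor   # ratio grows as K**3

    for K in (1.5, 2.0, 4.0):
        print(K, global_wire_penalty(K))   # 3.375, 8.0, 64.0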
As processing techniques improve and more devices are integrated onto
single circuits, the scaling factors described above will have a major effect on
the architectures implemented, and indeed on the design methodologies. The
problems that must be overcome include:
The latter problem arises because of the constant view of the world that
is presented by external components, such as pads, pins, and circuit tracking.
It introduces a discrepancy between on- and off-chip performance. On-chip
circuits may cycle at 25–50 MHz, but it is difficult to cycle pad driver outputs
at this rate without using excessive power.
The problems associated with wire density can be illustrated by considering
the effects of scaling on the interconnection of abstract modules on a
chip. If we assume the same scaling as for the electrical parameters,
then we could obtain K² more modules of a given complexity on a
scaled chip. If we assume that each module is connected to every other
module on the chip, then for n modules n² wires are required. The
scaled circuit requires (nK²)² wires, giving a K⁴ increase in wire density
for a corresponding K² increase in module density. This is obviously
a worst case, as in general the modules will not all be fully connected.
However, the best case, which maintains a balance between wire and
gate density, would require each module to be connected to one and
only one other module. As mentioned earlier, the problem of communications
is thus traded between system and implementation levels; the linear network
that would result is only suitable for a few applications (see §3.3). However,
this is known to be a major problem, and it forces design styles and
architectures to be adapted to meet or ameliorate it. The solutions, as
in most engineering situations, are found in squeezing the problem on all
fronts, which may (for example) include the introduction of new interconnect
techniques employing optical connections (Goodman et al 1984).
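The worst-case count used in the argument above is easily reproduced; the module count below is arbitrary, and the exact number of wires n(n-1)/2 is used in place of the order-of-magnitude figure n².

    # Worst-case wiring growth: fully connected modules under scaling by K.
    def wires_fully_connected(n_modules):
        return n_modules * (n_modules - 1) // 2

    n, K = 10, 2
    n_scaled = n * K**2                      # K**2 more modules on the scaled chip
    print(wires_fully_connected(n))          # 45
    print(wires_fully_connected(n_scaled))   # 780: roughly K**4 = 16 times as many
                                             # wires for a K**2 gain in modules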
Looking at these problems in terms of design, we can find styles which
can be used to advantage at both the circuit and systems levels. At the circuit
level we can employ large and regular gate structures, which overcome the
problems of wire density. Effectively, they increase module size and ameliorate
the wire density problem by containing a regular network of local or bused
FIGURE 6.9 A comparison of different generations of microprocessors.
(a) An Intel 8080 8-bit microprocessor.
be seen that the introduction of registers for storing partial results within the
data path creates local temporal regions instead of a global temporal region.
The clocks required to synchronise the flow of data through the pipeline are
still global signals; however, their latency does not prejudice the operation
of the system, provided that any delays to different modules can be equalised.
Indeed, if required, the system may be made asynchronous or self-timed, by
providing a local agreement between stages as to when data should be passed.
This requires a handshake between adjacent modules. Asynchronous pipelines
are a programming technique described in §4.5.2.
Figure 6.11 illustrates the chip floor-plan of a typical processor array,
which uses replication as a means of improving performance. Again it can
be seen that connections between the replicated modules have been chosen
to reflect the planar nature of the medium and are therefore local. The same
arguments concerning control apply equally to the processor array structure.
FIGURE 6.11 Diagram showing the floor-plan and regular local
communications in the RPA PE chip.
Either a synchronous system is used, in which case a clock and even the
control word will be globally distributed from a single source, or the
system may be made self-timed, in which case each processor would have its
own clock and local control store. The former is typical of a SIMD computer
and the latter would correspond to a multi-transputer chip. Such designs
have been considered by INMOS, and it has been proposed that
they could be automatically generated from OCCAM programs by a silicon
compiler (Martin 1986).
It seems ironic that the problems found in the design of VLSI systems reflect
those in society at large. Chips are rapidly evolving into a two-class
society: in systems comprising many VLSI circuits there are the privileged,
who can communicate locally or on-chip, and the unfavoured, who must
resort to slow off-chip communications. With these inequalities the partitioning
of a system becomes very important. There are three main considerations
when partitioning a system into integrated circuits: yield, pin-out
and power dissipation, which are considered below. However, the problems
discussed above concerning the discrepancy between on- and off-chip performance will
also constrain the partitioning. Points of maximum bandwidth must now be
maintained on-chip.
It is interesting to note that components from most semiconductor
companies place the point of maximum bandwidth in a system off-chip,
and then add complexity to the system in order to minimise the bandwidth
required through this bottleneck. The interface is, of course, the memory-processor
interface, which carries code and data between memory parts and processor
parts in all microprocessor systems. The complexity introduced to combat
the excessive bandwidth demand at this point includes on-chip cache memories,
instruction prefetch and pipelining, and more complex instruction sets. There
are of course exceptions to this approach, which involve an integration
of memory and processor function, and a deliberate attempt to reduce
unnecessary complexity. A good example of this style can be found in the
INMOS transputer, described in §3.5.5.
6.6.1 Yield
Ideally one would like to implement the whole of a system onto a single
integrated circuit, as driving signals off-chip is slow, consumes a great deal
of power, is inherently unreliable and requires a large volume to implement.
However, defects are introduced into the silicon during processing which can
cause circuits not to work as expected. These defects include:
(a) crystal defects in the materials used;
(b) defects in the masks used to pattern the silicon;
(c) defects introduced during processing (e.g. foreign particles);
(d) defects introduced by handling (e.g. scratching, photoresist damage);
(e) defects introduced by uneven processing (e.g. metal thinning);
(f) pinholes between layers;
(g) crystal defects introduced during processing.
Because of the presence of these defects, only a fraction of the chips on a
processed wafer will be completely functional (assuming a correct design).
For example, consider a 3 inch wafer with 100 chip sites fabricated in a MOS
technology. A typical yield for such wafers would be around 30%, implying
an average of 30 working chips per wafer. The number of defects occurring
on a wafer is usually assumed to be distributed randomly and is expressed
as a number per cm². During the last 20 years, major advances have been
made in clean room technology, equipment used, and materials and masks.
As a consequence it is now possible to produce reasonable yields on chips
up to and even over 1 cm².
The simplest yield model assumes a random distribution of point defects
and, furthermore, assumes that a single defect anywhere on the chip will
cause that chip to fail. In this case the probability of finding any defects on
a given chip can be calculated using the Poisson distribution and the
parameters D (the defect density) and A (the area of the chip). The probability
of a chip being good using Poisson statistics is given by

    P(D, A) = exp(-DA).

This model, however, does not accurately reflect the real behaviour of a
fabrication process, although it gives a fair approximation in the region of
very low yield. Chips of an area many times 1/D will have a vanishingly
small probability of not containing a defect. Areas must be kept to a few
times 1/D to give reasonable yields. The defect densities found in a good
process vary from 1 to 5 cm⁻². Thus chips of around 1 cm² will give modest
yields.
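The model is simple enough to evaluate directly; the defect densities and chip areas used below are illustrative values within the 1 to 5 cm⁻² range quoted.

    # Poisson yield model: probability that a chip of area A (cm^2) is free of
    # randomly distributed point defects of density D (per cm^2).
    import math

    def poisson_yield(defect_density, area_cm2):
        return math.exp(-defect_density * area_cm2)

    for D in (1.0, 2.0, 5.0):
        for A in (0.25, 0.5, 1.0):
            print(f"D={D}/cm^2  A={A} cm^2  yield={poisson_yield(D, A):.0%}")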
In reality the defects are not randomly distributed and many more defects
will be found in the periphery of the wafer when compared with the centre.
Also the assumption that all defects can be modelled as points is not valid
and many defects are large compared with the feature sizes found on the
chip. These are called area defects. Mathematically, points are very much
easier to deal with than areas. Area defects may be modelled with modifications
6.6.2 Pin-out
Another limitation on the partitioning of systems is the number of pins that
may be provided between the integrated circuit and the outside world. This
is constrained by packaging and power dissipation requirements, and also
contributes to the bandwidth limitation at the chip boundary.
A good design for a VLSI chip is therefore a portion of the system that is
to some extent self-contained, and has as few wires as possible to the outside
world. However, this often conflicts with other design requirements for VLSI,
as found in partitioned, regularly connected structures such as grid-connected
processor arrays. The problem is that as more and more processors are
included onto a single silicon substrate, the bandwidth required between
chips also becomes larger. This increase can vary either with the area of
processors enclosed, as in the case of the external memory connections found in
many SIMD chips, or with the perimeter of the array, as in a partitioning of
the grid network.
External connections to a chip are expensive: on-chip pads consume a
relatively large area, and the size of a pad (100–150 µm square) must remain
constant, despite any scaling of the circuit. The driver circuits also remain
constant in size, to maintain drive capabilities. Pad drivers also consume a
large amount of power, which can also cause considerable noise on the supply
rails if many pads change state at the same time (as in a bus, for example).
Externally, large packages are expensive and consume valuable board area.
Although packaging technologies are now providing dense, low-area chip
carriers, such as pin grid arrays and surface mount packages, the density of
pins makes circuit boards more expensive owing to the additional layers required
to handle the track density.
Single-chip packaging is not the only boundary that can be drawn in
system partitioning, as many chips may be mounted on a substrate before
packaging. Manufacturers are experimenting with packages containing
hybrids for commercial use; they have been used in military applications for
some time. Examples are ceramic thick film and even metallised silicon.
These techniques are relatively new and still expensive when
compared with conventional packaging technologies; however, their obvious
advantages will force their development.
To find a good partitioning of a system, graph theory may be used. At a
given level of description, any system may be described by a connected graph,
where the system components are the nodes (these may be gates or functional
blocks, for example) and the wires connecting them are the edges of the
graph. A good system partitioning divides the graph into subgraphs such that
each subgraph contains a high degree of connectivity while the graph formed
by the partitioning has a low degree of connectivity.
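A minimal sketch of this idea in code: describe the system as a graph and count the edges cut by a candidate partition, since every cut edge implies at least one off-chip wire. The component names and the partitions tried are hypothetical.

    # Count the wires (graph edges) crossing a candidate chip partition.
    edges = [                     # hypothetical netlist of (component, component)
        ("alu", "regs"), ("regs", "ctrl"), ("alu", "ctrl"),
        ("ctrl", "mem"), ("mem", "cache"), ("cache", "ctrl"),
    ]

    def cut_size(on_chip, edges):
        a = set(on_chip)
        return sum((u in a) != (v in a) for u, v in edges)

    print(cut_size({"alu", "regs", "ctrl"}, edges))   # 2 edges cross the boundary
    print(cut_size({"alu", "mem"}, edges))            # 4: a worse partitioning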
The situation is not quite so simple as this, however, as no system has
only a single implementation and there are always a number of trade-offs
that may be made to reduce the pin-out of a given partitioning. For example,
at the expense of additional delay, signals may be encoded and then decoded
on-chip prior to use. Also a single pin may be used for many signals by
time-multiplexing the data.
The INMOS link implemented on the transputer provides a good example
of such a trade-off. The transputer has four links to connect it to other
transputers, each implemented as a pair of wires transmitting a byte of data
in an 11-bit data packet. The link is bidirectional and can transmit data and
acknowledge packets in any order. A large proportion of chip area has been
used to optimise the speed of this link (20 MHz). This is a good compromise,
however, as the alternative parallel implementation would have required at
least ten wires per link. Scaling this figure by a factor of eight to provide the
four bidirectional links gives a large difference in pin count. Thus, some speed
has been lost in return for a massive pin reduction and a modest use of chip
area, the non-critical resource.
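The pin-count arithmetic behind this compromise is easily reproduced. The figures used below (two wires per serial link, at least ten wires for a byte-wide parallel alternative scaled by eight for four bidirectional links, an 11-bit packet per data byte, 20 Mbit/s signalling) are those quoted above; ignoring the acknowledge traffic is a simplifying assumption.

    # Pin count: serial transputer links versus a parallel alternative.
    serial_pins = 4 * 2            # four links, each a pair of wires
    parallel_pins = 10 * 8         # >= ten wires per link, times eight for
                                   # four bidirectional links
    print(serial_pins, parallel_pins)        # 8 pins versus 80 pins

    # Rough useful data rate of one serial link.
    bits_per_byte_packet = 11
    link_rate_bits = 20e6
    print(round(link_rate_bits / bits_per_byte_packet / 1e6, 2), "Mbyte/s, roughly")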
i.e. printed circuit boards and forced air or convected cooling. Beyond this,
more sophisticated packaging technologies are required, such as hybrid
ceramic substrates, cooling fins, heat clamps, and liquid immersion technology.
All of these techniques increase system costs.
As an aside, low-voltage, low-temperature CMOS provides a very fast
technology, which is now comparable to ECL. ETA are building super-
computers from VLSI CMOS gate arrays, using liquid nitrogen temperatures
(77 K) and low voltages, and are obtaining 100–200 MHz system speeds.
In any system there is a dynamic power dissipation (P_d), which is due to
capacitative loading and is proportional to the number of gates (N_g), their
average frequency of operation ⟨f⟩, and the voltage through which they are
switched (V_DD):
A typical value for the effective gate load in an NMOS circuit is about
50 kΩ, giving a power dissipation for a single gate of about 0.1 mW.
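A common form for the dynamic term, assumed here for illustration, is P_d ≈ N_g ⟨f⟩ C V_DD², where C is the switched load capacitance per gate; the capacitance and voltage values in the sketch are illustrative and do not come from the text.

    # Dynamic power from capacitive switching (assumed form, illustrative values).
    def dynamic_power(n_gates, avg_freq_hz, c_load_farads, v_dd):
        return n_gates * avg_freq_hz * c_load_farads * v_dd**2

    # 10 000 gates, 5 MHz average activity, 0.1 pF per gate, 5 V supply
    print(dynamic_power(10_000, 5e6, 0.1e-12, 5.0), "W")   # 0.125 W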
(i) Memory
Memory is one of the structures best suited to WSI implementation. It is very
regular, consists of a very small replicated module (a transistor and capacitor
in dynamic RAM) and has regular interconnections, i.e. word and bit lines. It
is not surprising, therefore, that it has been one of the most successful products
to exploit redundancy. Although many experiments have been undertaken
to construct wafer-scale circuits, it is in the area of yield enhancement of
marginal state-of-the-art chips that the commercial successes in memory design have been
made. However, memory does not perform processing, and the
availability of cheap memory has perpetuated the von Neumann architecture,
with its separate processor, control and memory units.
is required from a system, where provision has been made to take the faulty
module off-line and repair it when detected.
proposed (Jesshope and Bentley 1986, 1987) and are being implemented in
silicon. These schemes would allow arrays of up to 16 × 16 RPA PEs to be
fabricated on a single 3 inch wafer, using 3 µm CMOS technology. More
aggressive design rules (1 µm, say) and a larger wafer (5 inches, say) could
be expected to give a 25-fold increase in density over the above estimate.
This would give up to 200 Mflop/s for floating-point operations and up to
2 Gflop/s for 16-bit integer operations: quite an impressive performance for
a computer that you could slip into your pocket!
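The 25-fold figure follows from simple area scaling; the sketch below reconstructs the arithmetic from the design rules and wafer sizes quoted above.

    # Density gain from tighter design rules and a larger wafer.
    feature_gain = (3 / 1) ** 2        # 3 um -> 1 um rules: 9x more PEs per area
    wafer_gain = (5 / 3) ** 2          # 3 inch -> 5 inch wafer: ~2.8x more area
    print(round(feature_gain * wafer_gain))   # about 25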
We do not have the space for a complete treatment of this subject here,
but for further reading those interested are referred to the recent book edited
by Jesshope and Moore (1986). This is the proceedings of a conference held
at Southampton University, at which no fewer than three prototype wafer
designs were displayed; one from Sinclair Research was demonstrated while
configuring itself. Like the Southampton design, this demonstration was a
memory circuit, but configured from a linear chain of cells, constructed from
a spiral of good sites on the wafer.
A.1 MISCELLANEOUS
<empty> ::=
<digit> ::= 0|1|2|3|4|5|6|7|8|9
<lower case letter> ::= a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z
<pipelined> ::= p|<empty>
<SI prefix> ::= K|M|G|T|<empty>
<unsigned integer> ::= <digit>|<unsigned integer><digit>
<power> ::= <unsigned integer>|<empty>|<lower case letter>
<multiplier> ::= <lower case letter>|<unsigned integer><SI prefix>|<unsigned integer>.<unsigned integer>|<comment>
<factor> ::= <multiplier><power>|<factor>*<factor>
<comment> ::= (<any sequence of symbols>)|<empty>
<statement separator> ::= ;
<statement> ::= <definition>|<highway definition>|<structure>
Examples
<empty> ::=
<digit> ::= 3; 9
<lower case letter> ::= c; z
<pipelined> ::= ; p
<unsigned integer> ::= 34; 128
<power> ::= ; 2; s
<multiplier> ::= n; 16; 8G; 9.5
<factor> ::= n*m; 64²; n*128*32²
<comment> ::= ; (bipolar ECL)
A.2 E UNITS
<E symbol> ::= B|Ch|D|E|F|IO|P|U|S
<E identifier> ::= <E symbol><pipelined>|<E identifier><digit>
<operation time in ns> ::= <multiplier>|<comment>
<bit width of operation> ::= <multiplier>|<bit width of operation>,<multiplier>|<comment>
<E unit> ::= <E identifier>^<bit width of operation>_<operation time in ns><comment>
Examples
<E identifier> ::= E; F12; Bp46
<operation time in ns> ::= t; (4 milliseconds)
<bit width of operation> ::= b; (4 Bytes); 16,32
<E unit> ::= E; F12(*); S(omega)
A.3 M UNITS
<M symbol> ::= M|O
<M identifier> ::= <M symbol><pipelined>|<M identifier><digit>
<access time in ns> ::= <multiplier>|<comment>
<number of words> ::= <multiplier>*|<number of words><number of words>|<comment>*
<bits in word accessed> ::= <multiplier>|<comment>
<size of memory> ::= <number of words><bits in word accessed>|<comment>
<M unit> ::= <M identifier>^<size of memory>_<access time in ns><comment>
Examples
<M identifier> ::= Mp; M1; M2; M3; O16
<access time in ns> ::= 100; (4 milliseconds)
<number of words> ::= n*; (1, 2 or 4 MBytes)*;
<bits in word accessed> ::= b; 32
<size of memory> ::= n*b; n*32; ; n*; *32; 2K*8*64
<M unit> ::= M; O16; M^(2K*8)(2716 EPROM);
A.4 COMPUTERS
Examples
<highway identifier> ::= H; H3 (twisted pair)
<time per word in ns> ::= t; 12.5; (1 ms)
<data bits> ::= 64; n
<address bits> ::= 16; a
<number of paths> ::= 4; n
<size of path> ::= 4*{64+16}; 4*64; 64; 64+16
<data bus> ::= ----- ; ---4*{64+16}---
<cross connection> ::= x; xxx
<connection> ::= --H3 (twisted pair)--; xx(Banyan Network)xx; --H3--
<no connection> ::= |
<simplex left> ::= <-----; <xx; <-; <x
<simplex right> ::= ----->; xx>; ->; x>
<duplex> ::= <->
<half duplex> ::= <-/->; <---H2---/---H3--->
<data path> ::= x
<highway definition> ::= H3 = {{-->, <-}/(~4)}
A.6 STRUCTURES
<unit> ::= <E unit>|<M unit>|<computer>|<lower case letter>
<primary> ::= <unit>|{<structure>}^<connectivity><pipelined>|<parallel structure><pipelined>
<secondary> ::= <primary>|<factor><primary>|<empty>
Examples
<unit> ::= E; C[64P]; a
<primary> ::= {-E-M-a}p; {-E1-, E2}; {E1/E2/E3}; IO
<secondary> ::= 16*32²P; 4{3F,2P}
<structure> ::= -E-M-a; {E-{-M1-M2-, ->}--M3}-a; {32²P}^nn
<concurrent list> ::= I, E, M
<sequential list> ::= 3E1/E2/E3
<parallel structure> ::= {I, E, M}; {3E1/E2/E3}; {-E-, |M-,-}
<definition> ::= 3E1 = {E(+), E(*), E(-)}; E2 = {F(*)/F(+)/B};
C = I[64P]^nn; M1 = {-M-->}-M
R eferences
The following is an alphabetical list of papers and books referred to in the text. The
abbreviations follow the British Standards Institution (1975) rules which are also an
American ANSI standard. They are used by INSPEC in Physics Abstracts (published
by the Institute of Electrical Engineers, London).
Arvind D K, Robinson I N and Parker I N 1983 A VLSI chip for real time image
processing Proc. IEEE Symp. on Circuits and Systems 405-8
Aspinall D 1977 Multi-micro systems Infotech State of the Art Conference: Future
Systems (Maidenhead: Infotech) vol. 2 45-62
------ 1984 Cyba-M Distributed Computing ed F B Chambers, D A Duce and
G P Jones (London: Academic) 267-76
Auerbach 1976a Auerbach Corporate EDP Library of Computer Technology Reports
(18 volumes) (Pennsauken, NJ: Auerbach Publishers Inc.)
------ 1976b Cray Research Inc. Cray-1 Auerbach Corporate EDP Library of Computer
Technology Reports (Pennsauken, NJ: Auerbach Publishers Inc.)
Augarten S 1983 State of the art (US: Tickner and Fields)
Ausländer L and Cooley J W 1986 On the development of fast parallel algorithms
for Fourier transforms and convolution Proc. Symp. Vector and Parallel Processors
for Scientific Calculation (Rome) 1985 (Rome: Accademia Nazionale dei Lincei)
Ausländer L, Cooley J W and Silberger A J 1984 Numerical stability of fast convolution
algorithms for digital filtering IEEE Workshop on VLSI Signal Processing 172-213
(London: IEEE)
Austin J H Jr 1979 The Burroughs Scientific Processor Infotech State of the Art
Report: Supercomputers vol. 2 ed C R Jesshope and R W Hockney (Maidenhead:
Infotech Int. Ltd) 1-31
Baba T 1987 Microprogrammable Parallel Computer (Cambridge, MA: MIT)
Babbage C 1822 A note respecting the application of machinery to the calculation
of astronomical tables Mem. Astron. Soc. 1 309
------ 1864 Passages from the Life of a Philosopher (Longman, Green, Longman,
Roberts and Green) Reprinted 1979 (New York: Augustus M Kelly)
Babbage H P 1910 Babbage’s Analytical Engine Mon. Not. Roy. Astron. Soc. 70
517-26, 645
Backus J W, Bauer F L, Green J, Katz C, McCarthy J, Naur P, Perlis A J,
Rutishauser H, Samelson K, Vauquois B, Wegstein J H, van Wijngaarden A and
Woodger M 1960 Report on the algorithmic language ALGOL60 Numer. Math.
2 106-36
Barlow R H, Evans D J, Newman I A and Woodward M C 1981 The NEPTUNE
parallel processing system Internal Report Department of Computer Studies,
Loughborough University, UK
Barlow R H, Evans D J and Shanehchi J 1982 Performance analysis of algorithms on
asynchronous parallel processors Comput. Phys. Commun. 26 233-6
Barnes G H, Brown R M, Kato M, Kuck D J, Slotnick D L and Stokes R A 1968
The ILLIAC IV computer IEEE Trans. Comput. C-17 746-57
Batcher K E 1968 Sorting networks and their applications AFIPS Conf. Proc. 32
307-14
------ 1979 The STARAN Computer Infotech State of the Art Report: Supercomputers
vol. 2 ed C R Jesshope and R W Hockney (Maidenhead: Infotech Int. Ltd) 33-49
------ 1980 Design of a massively parallel processor IEEE Trans. Comput. C-29
1-9
Bell C G and Newell A 1971 Computer Structures: Readings and Examples
(New York: McGraw-Hill)
Ben-Ari M 1982 Principles of Concurrent Programming (London: Prentice-Hall)
Benes V 1965 Mathematical Theory of Connecting Networks and Telephone Traffic
(New York: Academic)
Bentley L and Jesshope C R 1986 The implementation of a two-dimensional
redundancy scheme in a wafer scale, high speed disc memory Wafer Scale
Integration (Bristol: Adam Hilger)
Berg R O, Schmitz H G and Nuspl S J 1972 PEPE— an overview of architecture,
operation and implementation Proc. IEEE Natl. Electron. Conf. 27 312-7
Berger H H and Wiedmann S K 1972 Merged-transistor logic (MTL) - a low cost bipolar
logic concept IEEE J. Solid St. Circuits SC-7 340-6
Bergland G D 1968 A fast Fourier transform algorithm for real-valued series Commun.
Assoc. Comput. Mach. 11 703-10
Berney K 1984 IBM eyes niche in burgeoning supercomputer market Electronics July
12 45-6
Blackman R B and Tukey J W 1959 The Measurement of Power Spectra (New York:
Dover Publications Inc.)
Bloch E 1959 The engineering design of the STRETCH computer Proc. East. Joint
Comp. Conf. (New York: Spartan Books) 48-58
Booch G 1986 Object oriented development IEEE Trans. Software Eng. SE-12 211-21
Bossavit A 1984 The ‘Vector Machine’: an approach to simple programming on
CRAY-1 and other vector computers PDE Software: Modules, Interfaces and
Systems ed B Engquist and T Smedsaas (Amsterdam: Elsevier Science BV,
North-Holland) 103-21
Bourne S R 1982 The Unix System (International Computer Science Series) (London:
Addison-Wesley)
Bracewell R 1965 The Fourier Transform and Its Applications (New York:
McGraw-Hill)
Brigham E O 1974 The Fast Fourier Transform (Englewood Cliffs, NJ: Prentice-Hall)
British Standards Institution 1975 The abbreviation of titles of periodicals: part 2.
Word-abbreviation list Br. Stand. Specif. BS4148 Part 2
Brownrigg D R K 1975 Computer modelling of spiral structure in galaxies
PhD Thesis, University of Reading
Bruijnes H 1985 Anticipated performance of the CRAY-2 NM FECC Buffer 9 (6) 1-3
Bucher I Y 1984 The computational speed of supercomputers Supercomputers : Design
and Applications ed K Hwang (Silver Spring, MD: IEEE Comput. Soc.) 74-88
Bucher I Y and Simmons M L 1986 Performance assessment of supercomputers
Vector and Parallel Processors: Architecture, Applications and Performance Evaluation
ed M Ginsberg (Amsterdam: North-Holland) (Also preprint LA-UR-85-1505,
LANL, USA)
Budnik P and Kuck D J 1971 The organisation and use of parallel memories IEEE
Trans. Comput. C-20 1566-9
Buneman O 1969 A compact non-iterative Poisson-solver Stanford University Institute
for Plasma Research Report No 294
Burks A W 1981 Programming and structural changes in parallel computers Conpar 81
ed W Händler (Berlin: Springer) 1-24
Burks A W and Burks A R 1981 The ENIAC : first general-purpose electronic computer
Ann. Hist. Comput. 3 (4) 310-99
Burks A W, Goldstine H H and von Neumann J 1946 Preliminary discussion of the
logical design of an electronic computing instrument in John von Neumann,
Collected Works vol. 5 ed A H Taub (Oxford: Pergamon) 35-79
Burns A 1985 Concurrent programming in Ada (Ada Companion Series) (Cambridge:
Cambridge University Press)
Burroughs 1977a Burroughs scientific processor— file memory Burroughs Document
61391B
Energy, USA
Hunt D J 1978 UK Patent Application No. 45858/78
------ 1979 Application techniques for parallel hardware Infotech State of the Art
Report: Supercomputers vol. 2 ed C R Jesshope and R W Hockney (Maidenhead:
Infotech Int. Ltd) 205-19
Hwang K and Briggs F A 1984 Computer Architecture and Parallel Processing
(New York: McGraw-Hill)
Ibbett R N 1982 The Architecture of High Performance Computers (London:
Macmillan)
IBM 1980 Josephson computer technology IBM J. Res. Dev. 24 (2)
ICL 1979a DAP: FORTRAN language reference manual ICL Tech. Pub. TP 6918
------ 1979b DAP: Introduction to FORTRAN programming ICL Tech. Pub. TP 6755
------ 1979c DAP: APAL language ICL Tech. Pub. TP 6919
IEEE 1978 Special issue on fine line devices IEEE Trans. Electron. Devices ED-25 (4)
------ 1979 Special issue on VLSI IEEE Trans. Electron. Devices ED-26 (4)
------ 1983 A proposed standard for binary floating-point arithmetic (draft 10.0) IEEE
Computer Society Microprocessor standards Committee Task P754 Publication P754
(January)
Infotech 1976 Multiprocessor systems Infotech State of the Art Report: Multiprocessor
Systems ed C H White (Maidenhead: Infotech Int. Ltd)
INMOS 1984 Occam Programming Manual (Englewood Cliffs, NJ: Prentice-Hall)
------ 1985 IMSt414 Transputer Reference Manual (Bristol: INMOS Ltd)
------ 1986 Occam 2 Product Definition (Bristol: INMOS Ltd) (preliminary)
Iverson K E 1962 A Programming Language (London: Wiley)
------ 1979 Operators ACM Trans. Program. Lang. Syst. 1 161-76
Jacobi C G J 1845 Über eine neue Auflösungsart der bei der Methode der kleinsten
Quadrate vorkommenden linearen Gleichungen Astr. Nachr. 22 (523) 297-306
James R A and Parkinson D 1980 Simulation of galactic evolution on the ICL
distributed array processor I UCC Bull. Cambridge University Library 2 111-4
Jensen C 1978 Taking another approach to supercomputing Datamation 24 (2) 159-75
Jesshope C R 1980a The implementation of fast radix-2 transforms on array processors
IEEE Trans. Comput. C-29 20-7
------ 1980b Some results concerning data routing in array processors IEEE Trans.
Comput. C-29 659-62
------ 1980c Data routing and transpositions in processor arrays ICL Tech. J. 2 191-206
------ 1982 Programming with a high degree of parallelism in Fortran Comput. Phys.
Commun. 26 237-46
------ 1984 A reconfigurable processor array Supercomputers and Parallel Computation
ed D J Paddon (Oxford: Clarendon) 35-40
------ 1985 The RPA— optimising a processor array architecture for implementation
using VLSI Comput. Phys. Commun. 37 95-100
------ 1986a Computational physics and the need for parallelism Comput. Phys.
Commun. 41 363-75
------ 1986b Communications in wafer scale systems Wafer Scale Integration ed C R
Jesshope and W R Moore (Bristol: Adam Hilger) 65-71
------ 1986c Building and binding systems with transputers Proc. IBM Europe Inst.
1986 Parallel Computing (Austria) to be published
------ 1987a The RPA as an intelligent transputer memory system Systolic Arrays ed
Academic) 85-99
Kilburn T, Edwards D B G and Aspinall D 1960 A parallel arithmetic unit using a
saturated transistor fast-carry circuit Proc. IEE Part B 107 573-84
Kilburn T, Edwards D B G , Lanigan M J and Sumner F H 1962 One-level storage
system Inst. Radio Eng. Trans. EC-11 (2) 223-35 Reprinted in Bell and Newell
1971 Ch 23, 276-90
Kilburn T, Howarth D J, Payne R B and Sumner FH 1961 The Manchester University
ATLAS operating system Part I: Internal Organization Comput. J. 4 222-5
Kimura T 1979 Gauss-Jordan elimination by VLSI mesh-connected processors Infotech
State of the Art Report: Supercomputers vol. 2 ed C R Jesshope and R W Hockney
(Maidenhead: Infotech Int. Ltd) 273-90
Kogge P M 1981 The Architecture of Pipelined Computers (New York: McGraw-Hill)
Kogge P M and Stone H S 1973 A parallel algorithm for the efficient solution of a
general class of recurrence equations IEEE Trans. Comput. C-22 786-93
Kolba D P and Parks T W 1977 A prime factor FFT algorithm using high-speed
convolution IEEE Trans. Acoust. Speech Signal Process. 25 281-94
Kondo T, Nakashima T, Aoki M and Sudo T 1983 An LSI adaptive array processor
IEEE J-SSC 18 147-56
Korn D G and Lambiotte J J Jr 1979 Computing the fast Fourier transform on a
vector computer Math. Comput. 33 977-92
Kowalik J S (ed) 1984 High-Speed Computation (NATO ASI Series F: Computer
and System Sciences vol. 7) (Berlin: Springer)
------ (ed) 1985 Parallel MIMD Computation: HEP Supercomputer and its Applications
(Cambridge, MA: MIT)
Kowalik J S and Kumar S P 1985 Parallel algorithms for recurrence and tridiagonal
systems Parallel MIMD Computation: HEP Supercomputer and Applications
(Cambridge, MA: MIT) 295-307
Kuck D J 1968 ILLIAC IV software and application programming IEEE Trans.
Comput. C-17 758-70
------ 1977 A survey of parallel machine organisation and programming Comput. Surv.
9 29-59
------ 1978 The Structure of Computers and Computations vol. 1 (New York: Wiley) p 33
------ 1981 Automatic program restructuring for high-speed computation Lecture
Notes in Computer Science 111: Conpar 81 ed W Händler (Berlin: Springer) 66-77
Kuck D J, Lawrie D H and Sameh A (eds) 1977 High Speed Computers and Algorithm
Organization (New York: Academic)
Kulisch U W and Miranker W L (eds) 1983 A New Approach to Scientific Computation
(New York: Academic)
Kung H T 1980 The structure of parallel algorithms Advances in Computers ed Yovits
(New York: Academic) 65-112
Ladner R E and Fisher M J 1980 Parallel prefix computation J. Assoc. Comput.
Mach. 27 831-8
Lambiotte J R Jr and Voight R G 1975 The solution of tridiagonal linear systems
on the CDC STAR-100 computer ACM Trans. Math. Software 1 308-29
Lang T 1976 Interconnections between processors and memory modules using the
shuffle exchange network IEEE Trans. Comput. C-25 496-503
Lang T and Stone H S 1976 A shuffle-exchange network with simplified control IEEE
Trans. Comput. C-25 55-65
Larson J L 1984 An introduction to multitasking on the CRAY X-MP-2 multiprocessor
IEEE Comput. 17 (7) 62-9
------ 1985 Practical concerns in multitasking on the CRAY X-MP ECMWF Workshop
on Using Multiprocessors in Meteorological Models Report November 92-110
(Reading, UK: European Centre for Medium Range Weather Forecasts)
Lavington S H 1978 The Manchester Mark I and ATLAS: a historical perspective
Commun. Assoc. Comput. Mach. 21 4-12
Lawrie D H 1975 Access and alignment of data in an array processor IEEE Trans.
Comput. C-24 1145-55
Lawrie D H, Layman T, Baer D and Randal J M 1975 GLYPNIR— a programming
language for ILLIAC IV Commun. Assoc. Comput. Mach. 17 157-64
Lawson C, Hanson R, Kincaid D and Krogh F 1979 Basic linear algebra sub-programs
for Fortran usage Assoc. Comput. Mach. Trans. Math. Software 5 (3) 308-71
Lazou C 1986 Supercomputers and their Use (Oxford: Oxford University Press)
Lea R M 1986 WASP A WSI associative string processor for structured data
processing Wafer Scale Integration ed C R Jesshope and W R Moore (Bristol:
Adam Hilger)
Lee C Y and Pauli M C 1963 A content addressable distributed logic memory with
applications to information retrieval Proc. IEEE 51 924-32
Lee K Y, Abu-Sufah W and Kuck D J 1984 On modelling performance degradation
due to data movement in vector machines Proc. 1984 Int. Conf. Parallel Processing
(Silver Spring, MD: IEEE Comput. Soc.) 269-77
Lenfant J 1978 Parallel permutations of data: a Benes network control algorithm for
frequently used bijections IEEE Trans. Comput. C-27 637-47
Leondes C and Rubinoff M 1952 DINA - a digital analyser for Laplace, Poisson,
diffusion and wave equations AIEE Trans. Commun. and Electron. 71 303-9
Linback J R 1984 CMOS gates key to ‘affordable’ supercomputer Electronics Week
October 29 17-19
Lincoln N R 1982 Technology and design trade-offs in the creation of a modern
supercomputer IEEE Trans. Comput. C-31 349-62
Lint B and Agerwala T 1981 Communication issues in the design and analysis of
parallel algorithms IEEE Trans. Software Eng. SE-7 174-88
Lorin H 1972 Parallelism in Hardware and Software: Real and Apparent Concurrency
(Englewood Cliffs, NJ: Prentice-Hall)
Lubeck O, Moore J and Mendez P 1985 A benchmark comparison of three
supercomputers: Fujitsu VP-200, Hitachi S810/20 and Cray X-MP/2 IEEE
Computer 18 (12) 10-24
McClellan J H and Rader C M 1979 Number Theory in Digital Signal Processing
(Englewood Cliffs, NJ: Prentice-Hall)
McCrone J 1985 The dawning of a silent revolution Computing: The Magazine July
18 12-13
McIntyre D 1970 An introduction to the ILLIAC IV computer Datamation 16 (4) 60-7
McLaughlin R A 1975 The IBM 704: 36-bit floating-point money maker Datamation
21 (8) 45-50
McMahon F 1972 (Code and information available from L-35, LLNL, PO Box 808,
CA 94550, USA. See also Riganati and Schneck 1984.)
Madsen N and Rodrigue G 1976 A comparison of direct methods for tridiagonal
systems on the CDC STAR-100 Reprint UCRL-76993 Lawrence Livermore
Laboratory Rev. 1
Maples C, Rathbun W, Weaver D and Meng J 1981 The design of MIDAS— a
Modular Interactive Data Analysis System IEEE Trans. Nucl. Sci. NS-28
3746-53
10 (3) 72-80
------ 1979 Numerical aerodynamic simulation facility project Infotech State of the
Art Report: Supercomputers vol. 2 ed C R Jesshope and R W Hockney (Maidenhead:
Infotech Int. Ltd) 331-42
Stone H S 1970 A logic-in-memory computer IEEE Trans. Comput. C-19 73-8
------ 1971 Parallel processing with the perfect shuffle IEEE Trans. Computers C-20
153-61
------ 1973 An efficient parallel algorithm for the solution of a tridiagonal system of
equations J. Assoc. Comput. Mach. 20 27-38
------ 1975 Parallel tridiagonal solvers Assoc. Comput. Mach. Trans. Math. Software
1 289-307
Strakos Z 1985 Performance of the EC2345 array processor Computers and Artificial
Intelligence 4 273-84
------ 1987 Effectivity and optimizing of algorithms and programs on the host-
computer/ array-processor system Parallel Computing 4 189-207
Stumpff K 1939 Tafeln und Aufgaben zur Harmonischen Analyse und Periodogramm-
rechnung (Berlin: Springer)
Sugarman R 1980 Superpower computers IEEE Spectrum 17 (4) 28-34
Sumner F H (ed) 1982 State of the Art Report: Supercomputer Systems Technology
Series 10 No 6 (Maidenhead: Pergamon Infotech Ltd)
Sumner F H, Haley G and Chen E C Y 1962 The central control unit of the ATLAS
computer Proc. IF IP Congr. 657-62
Sumney L W 1980 VLSI with a vengeance IEEE Spectrum 17 (4) 24-7
Swan R J, Fuller S H and Siewiorek D P 1977 Cm*: a modular multi-microprocessor
Proc. National Computer Conference 46 637-44
Swarztrauber P N 1974 A direct method for the discrete solution of separable elliptic
equations SIAM J. Numer. Anal. 11 1136-50
------ 1977 The methods of cyclic reduction, Fourier analysis and the FACR algorithm
for the discrete solution of Poisson’s equation on a rectangle SIAM Rev. 19
490-501
------ 1979a The solution of tridiagonal systems on the CRAY-1 Infotech State of the
Art Report: Supercomputers vol. 2 ed C R Jesshope and R W Hockney (Maidenhead:
Infotech Int. Ltd) 343-59
------ 1979b A parallel algorithm for solving general tridiagonal equations Math.
Comput. 33 185-99
------ 1982 Vectorising FFTs Parallel Computations ed G Rodrigue (London:
Academic) 51-83
------ 1984 FFT algorithms for vector computers Parallel Computing 1 45-63
Sweet R A 1974 A generalised cyclic-reduction algorithm SIAM J. Numer. Anal. 11
506-20
------ 1977 A cyclic-reduction algorithm for solving block tridiagonal systems of
arbitrary dimension SIAM J. Numer. Anal. 14 706-19
Tamura H, Kamiya S and Ishigai T 1985 FACOM VP-100/200: Supercomputers
with ease of use Parallel Computing 2 87-107
Taylor G S 1983 Arithmetic on the ELXSI 6400 IEEE Proc. 6th Ann. Symp. on
Computer Architecture 110-5 (London: IEEE)
Tedd M, Crespi-Reghizzi S and Natali A 1984 Ada for Multiprocessors Ada Companion
Series (Cambridge: Cambridge University Press)
Temperton C 1977 Mixed radix fast Fourier transforms without reordering ECMWF
Report No. 3 European Centre for Medium-range Weather Forecasting, Shinfield
Institute)
------ 1978 USA standard FORTRAN USA X3.9 1978 (New York: USA Standards
Institute)
Vajtersic M 1982 Parallel Poisson and biharmonic solvers implemented on the EGPA
multi-processor Proc. 1982 Int. Conf. on Parallel Processing (Silver Spring, MD:
IEEE Comput. Soc.) 72-81
------ 1984 Parallel marching Poisson solvers Parallel Computing 1 325-30
VAPP 1982 Proceedings Vector and Parallel Processors in Computational Science
(Chester 1981) Comput. Phys. Commun. 26 217-479
------ 1985 Proceedings Vector and Parallel Processors in Computational Science
(Oxford 1984) Comput. Phys. Commun. 37 1-386
Varga R S 1962 Matrix Iterative Analysis (Englewood Cliffs, NJ: Prentice-Hall)
Vick C R and Cornell J A 1978 PEPE architecture— present and future AFIPS Conf.
Proc. 47 981-1002
Waksman A 1968 A permutation network J. Assoc. Comput. Mach. 15 159-63
Wang H H 1980 On vectorizing the fast Fourier transform BIT 20 233-43
------ 1981 A parallel method for tridiagonal equations ACM Trans. Math. Software
7 170-83
Ware F, Lin L, Wong R, Woo B and Hanson C 1984 Fast 64-bit chipset gangs up
for double precision floating-point work Electronics 57 (14) July 12, 99-103
Waser S 1978 High speed monolithic multipliers for real-time digital signal processing
IEEE Comput. 11 (10) 19-29
Watanabe T 1984 Architecture of supercomputers— NEC supercomputer SX system
NEC Res. Dev. 73 1-6
Watson I and Gurd J R 1982 A practical dataflow computer IEEE Computer 15 (2) 51-7
Watson W J 1972 The Texas Instruments advanced scientific computer COMPCON
'72 Digest 291-4
Wetherell C 1980a Array processing for FORTRAN Lawrence Livermore Laboratory
Computer Documentation UCID-30175
------ 1980b Design considerations for array processing languages Software Practice
and Experience 10 265-71
Whittaker E T and Robinson G 1944 Calculus of Observations (Glasgow: Blackie)
260-83
Widdoes L C Jr and Correll S 1979 The S-1 project: developing high performance
digital computers Energy Technol. Rev. (Lawrence Livermore Laboratory)
September 1-15
Wilkes M V and Renwick W 1949 The EDSAC, an electronic calculating machine
J. Sci. Instrum. 26 385-91
Wilkes M V, Wheeler D J and Gill S 1951 The Preparation of Programs for
an Electronic Digital Computer (Cambridge, MA: Addison-Wesley)
Wilkinson J H 1953 The pilot ACE Computer Structures: Readings and Examples Ch.
11 ed C G Bell and A Newell (New York: McGraw-Hill) 193-9
Williams F C and Kilburn T 1949 A storage system for use with binary digital
computing machines Proc. IEE 96 (3) 81-100
Williams S A 1979 The portability of programs and languages for vector and
array processors Infotech State of the Art Report: Supercomputers vol. 2 ed C R
Jesshope and R W Hockney (Maidenhead: Infotech Int. Ltd) 382-93
INDEX
  logic-in-memory, 28, 59
  multiple instruction stream/multiple data stream (MIMD), 57, 74, 77-81
  multiple instruction stream/single data stream (MISD), 57
  orthogonal, 28-30, 59
  parallel, 73-4
  pipelined, 5, 75, 82-4, 87
  reduction, 78
  ring, 44, 80-1
  serial von Neumann, 4, 56, 58, 73-4
  single instruction stream/multiple data stream (SIMD), 57
  single instruction stream/single data stream (SISD), 56-7
  spectrum of, 92-5
  unicomputers, 73-7
Connection Machine, 329-34
  performance, 334
Connection network, 259, 278
  generalised, 259
  programmable, 247
CONS, computer command, 413
Control Data Corporation (CDC), 16, 21, 44, 155, 185
Control
  asynchronous (a), 31, 70
  flow, 33, 248
  horizontal (h), 70, 75, 77
  issue-when-ready (r), 71, 75
  lock-step (l), 70-1
  organisation, 248
Convex C-1, 49-51
Convolution methods, 190, 537
Cooley J, 491
Cooley-Tukey FFT, 496-8
Cornell Advanced Scientific Computing Center, 46-7, 237
Cornell Center for Theory and Simulation in Science and Engineering, 47
Cornell University, 47, 118, 235
Cosmic Cube (Cal Tech), 14, 41-2, 80-1
Cray Research Incorporated, 118
Cray Seymour, 16, 18, 20, 118
CRAY-1, 4, 7-8, 18-19, 34, 57-8, 64, 71-2, 75, 77, 84-5, 93-4, 98, 117-18, 123, 246
CRAY-1M, 117
CRAY-1S, 19, 117
CRAY-2, 7, 20, 34, 117-18, 146-55
  (r∞, n1/2), 153-4
  architecture
    CPU, 152
    overall, 150
  background processor, 151-3
  chaining (absence of), 152
  common (shared) memory, 149
  foreground processor, 151
  liquid immersion cooling, 147-8
  memory (phased), 149-50
  module, 149
  performance, 153-4, 204-5
  Unix, 153
  vector registers, 151
CRAY-3, 20-1, 155
CRAY X-MP, 19-20, 34, 49-50, 85, 105, 117, 118-47, 149, 161, 180-1, 215, 246
  (r∞, n1/2), 142-4
  (r∞, s1/2), 144-6
  algebraic-style structural notation (ASN) description, 133
  buffer registers, 123-5
  chaining, 131-3
  compress instructions, 137-8
  DD-49 disc, 127
  events software, 141-2
  functional units, 127-30
  instruction buffers, 127
  instructions, 131, 134-8
  intercommunication section, 127
  I/O section, 127
  I/O Subsystem (IOS), 19, 119, 121, 125, 126-7
  locks software, 140-1
  parallelism, 5
  pipelined, 78
  simulation of, 253
  shared memory, 79-81
  switched, 78-9
Mini-DAP, 324-9
Ministry of International Trade and Industry (MITI), 32
Minisupercomputers, 49-53
MISD (multiple instruction stream/single data stream), 57
Mitsubishi Ltd, 33
Modcomp 7860, 40
Modula 2, 374
Modular redundancy, 577
Molecular dynamics, 490
Monte-Carlo scattering, 55, 101
Moore School of Engineering (Pennsylvania), 9
MOSFETs, 554, 556
Moto-oka, Professor, 32
Motorola 68000, 48
Motorola 68020, 48
Motorola 68881, 48
MPP (Goodyear Aerospace), 30, 81
MU5, 8
MULTFT algorithm, 509-11
Multiple Fourier transform (MFT), 534-8, 549
Multiprocessing, 5
Multiprocessors, 245-369
Multi-tasking, 101
Musical bits, 348
n1/2, see half-performance length
n1/2 method
  algorithm timing, 441-2
  phase diagrams, 443-4
  scalar and vector units, 445
  serial and parallel complexity, 442-3
  variants, 444-5
  variation of (r∞, n1/2), 445-6
  of vector (SIMD) algorithm analysis, 439-46
  vectorisation, 440-1
nb, see Vector breakeven length
N-cube/10, 42
NASA, 18, 27, 40
  Ames Research Laboratory, 20, 25
  Langley Research Laboratory, 21
NASF, 7
National Advanced Scientific Computing Centers, 185
National Aerospace Laboratory (Japan), 191
National Physical Laboratory, 11
Naval Research Laboratory (Washington), 22
NEC (Japan), 33, 199
NEC SX1/SX2, 23, 34, 199-201
  architecture, 200, 202
  cooling (water), 200-1
  FORTRAN, 200-1
  performance, 200, 204-5
  technology, 199-200
Network
  banyan, 279
  binary n-cube, 279
  binary-hypercube, 269
  control of, 277, 280, 282
  cube, 80-1
  mesh, 80-1
  hierarchical, 80-1
  ICL DAP, 268
  ILLIAC IV, 269
  multi-stage, 276-82
  nearest-neighbour, 268
  omega, 279
  perfect-shuffle, 269
  properties of, 271-6
  R, 279
  reconfigurable, 80-1
  ring, 268
  shuffle-exchange, 279
  single-stage, 267-76
  star, 80-1
  von Neumann, 251, 285-6, 331