A Massively Parallel MIMD
Implemented by SIMD
Hardware?
H. G. Dietz
W. E. Cohen
TR-EE 92-4
January 1992
Table of Contents
1. Introduction
1.1. Interpretation Overhead
1.2. Indirection
1.3. Enable Masking
1.4. Our Approach
2. Instruction Set Design
2.1. Memory Reference Model
2.1.1. Local Memory Model
2.1.2. Global Memory Model
2.2. Assembly Language Model
2.3. Prototype Instruction Set
3. Emulator Design
3.1. Shortening The Basic Cycle
3.2. Minimizing Operation Time
3.2.1. Maximizing Instruction Overlap
3.2.2. Reducing Operation Count
3.2.2.1. Subemulators
3.2.2.2. Frequency Biasing
4. Performance Evaluation
4.1. High-Level Language Peak MFLOPS
4.2. Emulation Overhead
4.3. A Many-Thread Example
4.3.1. The Program
4.3.2. Performance
5. Room for Improvement
5.1. Compiler (mimdc)
5.2. Assembler (mimda)
5.3. Emulator (mimd)
6. Conclusions
A Massively Parallel MIMD
Implemented By SIMD Hardware?
Abstract
Both conventional wisdom and engineering practice hold that a massively parallel MIMD machine should be constructed using a large number of independent processors and an asynchronous interconnection network. In this paper, we suggest that it may be beneficial to implement a massively parallel MIMD using microcode on a massively parallel SIMD microengine; the synchronous nature of the system allows much higher performance to be obtained with simpler hardware. The primary disadvantage is simply that the SIMD microengine must serialize execution of different types of instructions, but again the static nature of the machine allows various optimizations that can minimize this detrimental effect.

In addition to presenting the theory behind construction of efficient MIMD machines using SIMD microengines, this paper discusses how the techniques were applied to create a 16,384-processor shared memory barrier MIMD using a SIMD MasPar MP-1. Both the MIMD structure and benchmark results are presented. Even though the MasPar hardware is not ideal for implementing a MIMD and our microinterpreter was written in a high-level language (MPL), peak MIMD performance was 280 MFLOPS as compared to 1.2 GFLOPS for the native SIMD instruction set. Of course, comparing peak speeds is of dubious value; hence, we have also included a number of more realistic benchmark results.
† This work was supported in part by the Office of Naval Research (ONR) under grant number N00014-91-J-4013 and by the National Science Foundation (NSF) under award number 9015696-CDA.
1. Introduction
Before discussing how a highly efficient MIMD machine can be built using a SIMD microengine, it is useful to review the basic issues in interpreting MIMD instructions using a SIMD machine. In the simplest terms, the way in which one interprets a MIMD instruction set using SIMD hardware is to write a SIMD program that interpretively executes a MIMD instruction set. There is nothing particularly difficult about doing this; in fact, one could take a completely arbitrary MIMD instruction set and execute it on a SIMD machine.

For example, [WiH91] reported on a simple MIMD interpreter running on a MasPar MP-1 [Bla90]. Wilsey, et al., implemented an interpreter for the MINTABS instruction set and indicated that work was in progress on a similar interpreter for the MIPS R2000 instruction set. The MINTABS instruction set is very small (only 8 instructions) and is far from complete in that there is no provision for communication between processors, but it does provide basic MIMD execution. In fairness to [WiH91], their MIMD interpreter was built specifically for parallel execution of mutant versions of serial programs; no communication is needed for that application.

Such an interpreter has a data structure, replicated in each SIMD PE, that corresponds to the internal registers of each MIMD processor. Hence, the interpreter structure can be as simple as:
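a minimal fetch/decode/execute loop. The sketch below is in MPL-style code, assuming the names pc, ir, mem, MEMSIZE, NUM_OPS, and execute; the numbered steps are the ones referred to in the sections that follow.

	plural int pc;               /* each PE's MIMD program counter */
	plural int ir;               /* each PE's current opcode       */
	plural int mem[MEMSIZE];     /* each PE's local memory         */
	int op;

	for (;;) {
		/* step 1: fetch - every PE loads the opcode at its own PC */
		ir = mem[pc];
		pc = pc + 1;
		/* step 2: decode - the control unit steps through the opcodes */
		for (op = 0; op < NUM_OPS; op++) {
			/* step 3: execute - only PEs whose ir matches are enabled;
			   step 3b is the case where execute() must load or store */
			where (ir == op) {
				execute(op);
			}
		}
	}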
The only difficulty in implementing an interpreter with the above structure is that the simulated
machine will be very inefficient. There are several reasons for this inefficiency.
1.2. Indirection
Still more insidious is the fact that even step 1 of the above algorithm cannot be executed in parallel across all PEs in many SIMD computers. The next instruction for each PE could be at any location in the PE's local memory, and many SIMD machines do not allow multiple PEs to access different memory locations simultaneously. Hence, on such a SIMD machine, any parallel memory access made will take time proportional to the number of different PE addresses being fetched from¹. For example, this is the case on the TMC CM-1 [Hil87] and TMC CM-2 [Thi90]. Note that step 3b suffers the same difficulty if load or store operations must be performed.

Since many operations are limited by (local) memory access speed, inefficient handling of these memory operations can easily make MIMD interpretation on a SIMD machine infeasible. This overhead can be averted only if the SIMD hardware can indirectly access memory using an address in a PE register. Examples of SIMD machines with such hardware include the PASM Prototype [SiN90] and the MasPar MP-1 [Bla90].
1.3. Enable Masking

	where (ir == CMP) {
		/* executed only by PEs in which
		   ir has the value CMP; cc is
		   not accessed by other PEs
		*/
		cc = alu - mbr;
	}

	/* use C's bitwise logical operations so
	   that cc = alu - mbr in those PEs where
	   ir == CMP, and cc = cc in the others
	*/
	mask = -(ir == CMP);
	cc = ((cc & ~mask) | ((alu - mbr) & mask));
¹ Worse still, for some SIMD machines the technique used takes time proportional to the size of the address space which could be accessed.
which is relatively expensive. Notice that in addition to the bitwise operations, the above implementation requires a memory access (i.e., loading the value of cc) that would not be necessary for a machine that supports enable masking in hardware. Because masking is done for each simulated instruction, the masking cost effectively increases the basic interpretation overhead.

Examples of SIMD machines whose hardware can implement the appropriate masking include the TMC CM-1 and the MasPar MP-1.
[Figure 1: the MIMD machine as implemented on a SIMD microengine. A microcode decoder/control unit drives processor 0 through processor width-1; each processor has its own local memory, and all are connected through an interconnection network (global router) that makes the union of the local memories behave as a shared memory.]
1.4. Our Approach

In our system, as shown in figure 1, the MasPar's ACU (Array Control Unit) becomes our microcode decoder and control unit, synchronously managing the parallel system. The ACU memory is thus the microcode store (with virtual memory paging support). Each SIMD PE becomes an essentially complete MIMD processor, except that these processors do not have any local microcode control. The local memory for each PE functions identically in the MIMD organization, except that the union of the local memories, with the help of the global router network, forms a global shared memory. Note that even though global memory references must pass through processors, this is done transparently under microcode control.

Given an appropriate SIMD microengine, the only remaining difficulty is the emulator (interpreter) overhead associated with decoding and performing operations within each SIMD PE. By careful construction of the MIMD instruction set and optimization of the emulation algorithm, the effective interpreter overhead and number of instruction types can be reduced greatly.

The result is a MIMD emulation that typically achieves a large fraction of the peak speed that a pure SIMD instruction set would obtain using the same SIMD microengine. As a true microcoded implementation, it is possible that the MIMD machine would have peak performance virtually identical to that of the equivalent SIMD machine.

Unfortunately, there are a number of compromises in the implementation of our proof-of-concept prototype MIMD emulator as presented in this paper. By far the most important compromise is that rather than directly using the SIMD microengine, our current version of the emulator is written in MPL [Mas91], a C language dialect that is compiled into the MasPar's SIMD macroinstructions. This results in between about 1/5th and 1/40th of the peak SIMD performance when executing pure MIMD code.

While these numbers rank our prototype 16,384-processor shared memory barrier MIMD as a "marginal" supercomputer peaking in the low 100's of MIPS, and the MasPar MP-1 is cheap enough to even yield a reasonable MIPS/dollar rating using our current emulator, that is not our point. Our point is that, designing a SIMD microengine from scratch, the performance of this new type of MIMD implementation could be superior to that obtained by more conventional MIMD architectures.

The remainder of this paper explains the design, optimization, and prototype performance of a MIMD machine constructed using a SIMD microengine.
² This remarkably low ratio is due to the fact that the MasPar MP-1 router is fast and local memory accesses are slow due to 16-way sharing of local memory ports.
matter of overlapping all their constituent microinstructions except for the ALU operation. The problem is not lack of overlap, but rather the complexity of making the best choice among the many possible microinstruction overlaps. By designing the instruction set so that many microinstructions will form common subsequences, only a small amount of overhead is associated with having a larger instruction set.

Even if there are many instructions in the MIMD instruction set and there are few microinstructions in common, the emulation speed can be very good if only a few different instructions are to be executed in any given emulation cycle. For example, the MIMD instruction set supported by our prototype emulator contains 38 different instructions, many of which have little microinstruction overlap; however, even in the very asynchronous MIMD program given in section 4.3, an average of only 6.28 different types of MIMD instructions were executed in each emulator cycle. In section 3.2.2.2, we also describe how the number of different MIMD instruction types whose execution is attempted in each emulator cycle can be artificially reduced.

In fact, the techniques used are so effective that the instruction set size and instruction execution time have only indirect impact on emulation speed.

The single most severe constraint on instruction set size for the current MIMD emulator is the desire to minimize the time taken for instruction fetch from local memory; ideally, the instruction set would have no more than 256 instructions so that all opcodes will fit in 8 bits. Because of memory address alignment constraints³, the use of 8-bit opcodes makes it difficult to fetch more than an 8-bit immediate operand. Hence, our instruction set incorporates a constant pool that holds up to 256 32-bit values.
³ The MasPar MP-1 microengine has no alignment constraints, but the SIMD macroinstruction set unfortunately does.
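To make this concrete, one plausible encoding is sketched below in C; the 16-bit layout and all of the names are assumptions for illustration, since the text does not give the exact format.

	/* Hypothetical instruction format: an 8-bit opcode packed with an
	   8-bit operand that is either a small immediate or an index into
	   the 256-entry pool of 32-bit constants. */
	typedef unsigned short insn_t;        /* one 16-bit instruction      */
	#define OPCODE(i)  ((i) >> 8)         /* high byte: opcode           */
	#define IMMED(i)   ((i) & 0xff)       /* low byte: immediate/index   */

	int cpool[256];                       /* constant pool, 32 bits each */
	/* a full 32-bit operand is fetched as cpool[IMMED(insn)] */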
3. Emulator Design
Although the detailed design of the emulator is intertwined with the design of the instruction set and the SIMD microengine, for this paper we will make the simplifying assumption that the SIMD microengine is the machine on which we have implemented our prototype: the MasPar MP-1. Further, we will restrict the examples to the instruction set as given in table 1 and used in the prototype emulator.

The most important optimizations of the emulator can be grouped into two categories: reducing the emulation overhead by shortening the basic emulator cycle, and maximizing overlap (parallelism) in the emulated execution of different types of instructions.
2. Reduce the number of different operations that must be executed in an emulator cycle, ideally to just one operation, i.e., to SIMD code.
The following sections detail the methods used in the current emulator.
3.2.2.1. Subemulators
Because the microengine is completely synchronous, it is relatively easy to construct hardware that will allow the control unit to check to see if there exists a processor in which a particular value meets some condition. In the MasPar, this is implemented by an operation called globalor, which in just 10 clock ticks (less than 1μs) ORs together values from all the PEs. By carefully encoding the instruction set, we can use a globalor of the opcode values to index a control unit jump table to select the subemulator that understands only those instructions that could appear within the OR mask value.

Within the current emulator, there are 32 such "subemulators." Obviously, the 32 subemulators could not reasonably be generated by hand, so a C program was written to perform this task.
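Concretely, the dispatch might look like the following sketch, where globalor is the MasPar operation named above; the class-bit constant and the subemulator names are assumptions (the two case values are the ones discussed with listing 2 below).

	/* Select a subemulator: OR the opcodes of all PEs, then jump
	   through a table indexed by the class bits present this cycle. */
	mask = globalor(ir & CLASS_BITS);
	switch (mask) {
	case 0x14: subemulator_Immed_CPool();           break;
	case 0x1d: subemulator_Immed_NOS_CPool_Rare();  break;
	/* ...one case per reachable combination of class bits... */
	}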
In addition, the choice of how instructions are grouped together into subemulators should not be made at random. Instructions that share CSIs should be grouped together because factoring-out the CSIs will make those interpreters execute faster. Hence, the NOS, Immed, and CPool CSIs (described above) correspond to bit positions in both the opcode and the globalor mask. It is also useful to make similar divisions based on the expected cost (Slow) and frequency of execution (Rare), and these sets also correspond to opcode bits. The class membership of each instruction in our MIMD instruction set is given in table 1.
To illustrate the subinterpreter structure, we present a complete subemulator set. However,
to keep the size reasonable, we have restricted the subemulator set to cover only the instructions
used in the example in listing 1.
[Listing 1: a recursive integer factorial in MIMDC, together with the stack-based assembly code generated for it, which uses the instructions Const, LdL, Add, Mul, JumpF, Jump, and Ret and the labels _fact:, retlab0:, and L0:.]

	/* Recursive int factorial */
	int fact(int n)
	{
		if (n) {
			return (n * fact(n-1));
		}
		return (1);
	}
The complete subemulator set is given in listing 2. The Op- references are opcode values or bit masks; the M- references are macros that actually perform the corresponding operation. If we assume that all processors in the MIMD machine call _fact at the same time, all processors would simultaneously execute the Const instruction using the subinterpreter for classes Immed and CPool (case 0x14). Suppose that some processors wish to execute Mul (classes NOS and Rare) at the same time that others execute JumpF (Immed, NOS, and CPool); then the subemulator for Immed, NOS, CPool, and Rare would be executed (case 0x1d). Notice that the C program that builds the subemulators factors-out identical subemulators (e.g., case 0x19 and case 0x1b) to save ACU memory space.
A vaguely similar type of improvement is suggested in [NiT90]. Nilsson and Tanaka envision a set of subinterpreters such that each subinterpreter emulates only a single type of instruction and all subinterpreters are executed once per interpreter cycle. Using statistics, they change
the order of the subinterpreters to maximize the expected number of instructions executed per processor per interpreter cycle. E.g., if the subinterpreters are in the order A, B, C, then the instruction sequence B, A takes 2 cycles, but B, A would take only one cycle if the subinterpreters were ordered B, C, A. The problem is that this improvement is small and is essentially incompatible with factoring-out portions of the emulated operations (i.e., instruction fetch).
4. Performance Evaluation
Our first proof-of-concept MIMD system was implemented on the Purdue University Parallel Processing Laboratory's 16,384-PE MasPar MP-1 in July 1991, shortly after developing the CSI algorithm and prototype implementation. The current version (January 1992) of the MIMD system includes:

mimdc
	A compiler, written in C using PCCTS [PaD92]. The language is a parallel dialect of
	C called MIMDC. It supports most of the basic C constructs. Data values can be either int or float, and variables can be declared as mono (shared) or poly (private) [Phi89].

	There are actually two kinds of shared memory reference supported. The mono variables are replicated in each processor's local memory so that loads execute quickly, but stores involve a broadcast to update all copies. It is also possible to directly access poly values from other processors using "parallel subscripting".
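	For example, a statement of the following form (the exact MIMDC syntax shown is an assumption, reconstructed from the description that follows):

		x[i] = y[j] + z;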
	would use the values of i, j, and z on this processor to fetch the value of y from processor j, add z, and store the result into the x on processor i. In addition to using shared memory for synchronization, MIMDC supports barrier synchronization [DiS89] using a wait statement.
mimda
	An assembler, written in C. The stack-based MIMD assembly code output by mimdc is assembled to generate both a listing file and an Intel-format hex load module.

mimd
	The MIMD interpreter, written in MPL (MasPar's SIMD C dialect) with the aid of several specialized interpreter construction programs written in C and AWK. The structure of mimd was described in detail in section 3.
Benchmark programs were written in MIMDC and their performance was evaluated. A high-level language was used for the benchmarks because we feel that it both "keeps us honest" and provides a friendlier, more realistic, interface for program development.
4.1. High-Level Language Peak MFLOPS

	/* MIMDC version */
	main()
	{
		int count = 10000000/16384;
		float sum = 0.0;

		while (count) {
			/* do 100 float adds per loop... */
			sum = sum + sum + sum + sum + sum + sum + sum + sum + sum + sum +
			      sum + sum + sum + sum + sum + sum + sum + sum + sum + sum +
			      sum + sum + sum + sum + sum + sum + sum + sum + sum + sum +
			      sum + sum + sum + sum + sum + sum + sum + sum + sum + sum +
			      sum + sum + sum + sum + sum + sum + sum + sum + sum + sum +
			      sum + sum + sum + sum + sum + sum + sum + sum + sum + sum +
			      sum + sum + sum + sum + sum + sum + sum + sum + sum + sum +
			      sum + sum + sum + sum + sum + sum + sum + sum + sum + sum +
			      sum + sum + sum + sum + sum + sum + sum + sum + sum + sum +
			      sum + sum + sum + sum + sum + sum + sum + sum + sum + sum +
			      sum;
			count = count - 1;
		}
	}	/* mimd emulator automatically prints time */

	/* MPL (pure SIMD) version: identical, except that sum is declared
	   plural and the elapsed DPU time is printed directly */
	main()
	{
		int count = 10000000/16384;
		plural float sum = 0.0;

		while (count) {
			/* do 100 float adds per loop... */
			sum = sum + sum + /* ...100 adds, as above... */ sum;
			count = count - 1;
		}
		printf("Done: %gs DPU usage\n", dpuTimerElapsed());
	}
Because all processors execute the same code sequence in the MIMD program, only one instruction is executed in the emulator for each MIMD cycle, and processors are rarely idle. The other code, written in MPL, executes with all processors enabled and is completely SIMD. Neither program does any useful calculations, but the performance provides a good estimate of peak floating point speed. The emulator achieved 97.2 MFLOPS, or about 10% of the MPL program's 986 MFLOPS. Note that the MasPar's theoretical peak speed is 1,200 MFLOPS.
One way to reduce the overhead is to write the emulator in the MasPar's SIMD assembly language instead of in MPL; however, unless the emulator algorithm also is changed, the improvement would be quite small. This is partly because MPL is low-level enough (e.g., register declarations) to usually generate good code, and partly because we already use an AWK script to patch the few obvious blunders made by MPL.

The insight that could remove most of the 48μs overhead is that the MasPar's 32-bit RISC SIMD instruction set is implemented by microcode executed on 4-bit PEs. By implementing the MIMD emulator as a single new microcoded instruction, emulateMIMD, the emulation overhead per emulator cycle would almost certainly drop to less than 10μs.

Small additional improvements, at either the assembly or microcode level, could result from slightly altering the emulation algorithm. Essentially, MPL only allows structured mixing of control flow and enable masking; there are a few portions of the emulation that could profit from directly manipulating enable masks.
4.3.1. The Program

The program in listing 4 uses an exhaustive recursive search to determine, for all possible combinations of faulty arcs in the master graph, the total number of faulty states for which it is still possible to travel from node 0 to node 4. We do not claim that this is a good algorithm for this problem, but it is a good example of "true MIMD" code.

All 16,384 processors begin by executing main(). Each processor reads the master graph and modifies it to produce a unique faulty graph by removing arcs corresponding to 0 bits in the processor number (called this in MIMDC). Since 16,384 is 2¹⁴, we use a graph with 14 arcs so that each processor will have a unique task. The function foundpath() determines whether a path exists by a depth-first search. It returns as soon as it has found a node that it has already visited, has reached the destination, or has explored all arcs leaving the node. If it finds an arc that goes to a node it has not visited, it recursively calls itself with the unexplored node. When all faulty graphs have been explored, the reduceAdd() function uses barrier synchronization and distributed memory accesses to total the number of successful path traversals.
[Listing 4: graph.mc, "a simple little program to compute some basic fault tolerance properties." The listing declares the master graph as mono data (arrays master_from[LINKS] and master_to[LINKS], with mono int master_arcs = 14) and per-processor poly copies from[LINKS], to[LINKS], and been_there[LINKS]. In main(), each processor builds its faulty graph by shifting its processor number (mybits = mybits >> 1) and keeping only the arcs selected by 1 bits; processors with this < (1 << master_arcs) then call foundpath(0, 4) and sum the results with gotpath = reduceAdd(gotpath). In foundpath(int here, int there), the processor marks been_there[here] = 1 and scans the arcs leaving here (while (i < arcs), if (from[i] == here), if (been_there[to[i]] == 0)), recursing on any arc that reaches an unvisited node. reduceAdd(int val) performs a recursive-doubling summation using wait barriers and parallel subscripting of reduce_tmp. Finally, processor 0 speaks for all, printing "There were ", (1 << master_arcs), " networks checked.\n" and "Of these, ", gotpath, " could reach 4 from 0.\n".]
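Because the listing is only summarized above, a sketch of reduceAdd() is given here, reconstructed from the surviving fragments and the description in section 4.3.1; the exact statement forms are assumptions.

	reduceAdd(int val)
	{
		poly int reduce_tmp;
		poly int ri, rj, rk;

		/* Recursive doubling summation */
		reduce_tmp = val;
		ri = 1;
		while (ri < width) {
			wait;                              /* barrier: every copy stored */
			rk = (this + ri) & (width - 1);    /* partner processor number   */
			rj = reduce_tmp[rk];               /* parallel-subscript fetch   */
			wait;                              /* barrier: every fetch done  */
			reduce_tmp = reduce_tmp + rj;
			ri = ri << 1;
		}
		return (reduce_tmp);
	}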
Under the MIMD emulator, mimd, each processor executes its own path through the code. Hence, the execution of this program differs greatly from a path search program for a SIMD machine.
On average, there were 16.3 unique program counter (PC) values active for each emulation cycle, which is roughly equivalent to 16 completely different programs executing. This average was obtained in a program that has a total of only 234 instructions. The number of PCs is also limited by the wide range in processor work loads, which causes many processors to wait at the first barrier in reduceAdd(). Averages of over 50 different PC values have been seen on a version of graph.mc that builds many different permutations of the graph for each processor before summing the number of networks with paths between 0 and 4, but that program was
more complicated and the statistics were very similar. By any standard, the example code is very
dynamic.
4.3.2. Performance
Complete emulation statistics for the code of listing 4 are given in tables 2 and 3. For table 2, the interpret count gives the number of emulator cycles in which that instruction was executed; the execute count is the total number of times that instruction was executed. Table 3 shows the number of times each particular subemulator mask occurred.

The execution speed was determined in two steps. First, the code was timed using the hardware timer available on the MasPar (80ns per tick). Then, the code was run using an instrumented version of the emulator to get the total number of emulator cycles needed to complete the program, the number of cycles that had a particular instruction, the number of times each instruction was executed, and the total number of instructions run. The instrumented MIMD emulator takes longer to run the program, but preserves the relative time between processors, the number of instructions executed, and the number of emulator cycles in the MIMD program; this repeatability in itself is an important advantage of our MIMD implementation technique. The total number of instructions found by the instrumented emulator, divided by the run time of the uninstrumented emulator, yields the average number of instructions executed per second for the uninstrumented emulator.

For graph.mc, the average speed was 54.6 MIPS (excluding output from the print instructions at the end of the program). On average, each processor was executing about 3,300 MIMD instructions per second (54.6 MIPS divided across 16,384 processors). This seems to be a very low number, but the processors on the MP-1 are implemented by 4-bit slices and each cluster of 16 shares a single interface to its local memory. Assuming that all of the processors are totally asynchronous, the maximum rate at which they would be able to fetch an 8-bit instruction, execute a simple 32-bit operation, and update the program counter is 123,000 instructions per second. Thus, the MIMD emulator had a slowdown of less than a factor of 37 relative to the maximum performance that would be obtained if each processor had its own instruction decoder and control, which would imply many times more hardware to implement each processor. We suspect, but cannot yet prove, that the additional hardware would actually increase processor hardware complexity by more than a factor of 37 (primarily due to the complexity of floating point and network control). Also keep in mind that we are still talking about the MPL-coded emulator speed versus pure MIMD hardware....

It should also be noted that, although graph.mc does not use floating point, this makes little difference in performance. Actually, the 32-bit floating point instructions for multiplication and division take significantly less time than the 32-bit integer versions; this is due to the lower precision, just 24 bits of mantissa. Much of the run time of the emulator is due to decoding the instruction and fetching operands (as was shown in the gflop benchmark; see section 4.2).
This would allow the machine to more efficiently execute parts of algorithms that are inherently
SIMD, such as communication or reduction operations. The MIMD/SIMD switching will
vaguely resemble the facility provided in the PASM prototype [BeS91].
In the immediate future, the emulator will be modified to provide a rudimentary operating
system so that multiple users will be able to submit MIMD jobs asynchronously. In the current
version, the complete MIMD environment is set up when the emulator begins executing.
6. Conclusions

In this paper, we have presented the theory behind construction of efficient MIMD machines using SIMD microengines. Further, we have detailed how we created a 16,384-processor shared memory barrier MIMD using a SIMD MasPar MP-1, and we have given measured performance figures that validate the approach.

The MIMD emulation software discussed in this paper (mimdc, mimda, and mimd) is being set up as a public domain Beta test release, and will be available via an email server. The email address will appear in the final version of this paper.
References
[Adv89] Advanced Micro Devices, 29K Family 1990 Data Book, Sunnyvale, California, 1989.

[Bla90] T. Blank, "The MasPar MP-1 Architecture," 35th IEEE Computer Society International Conference (COMPCON), February 1990, pp. 20-24.

[BrN90] C.J. Brownhill and A. Nicolau, Percolation Scheduling for Non-VLIW Machines, Technical Report 90-02, University of California at Irvine, Irvine, California, January 1990.

[BeS91] T.B. Berg and H.J. Siegel, "Instruction Execution Trade-offs for SIMD vs. MIMD vs. Mixed Mode Parallelism," 5th International Parallel Processing Symposium, April 1991, pp. 301-308.

[Cra91] Cray Research Incorporated, The CRAY Y-MP C90 Supercomputer System, Eagan, Minnesota, 1991.

[Die91] H.G. Dietz, "Common Subexpression Induction," Midwest Society for Programming Languages and Systems (MWSPLS) Spring Meeting, Digital Computer Laboratory (DCL), University of Illinois at Urbana-Champaign, Urbana, Illinois, April 20, 1991.

[DiO90] H.G. Dietz, M.T. O'Keefe, and A. Zaafrani, "An Introduction to Static Scheduling for MIMD Architectures," Advances in Languages and Compilers for Parallel Processing, edited by A. Nicolau, D. Gelertner, T. Gross, and D. Padua, The MIT Press, Cambridge, Massachusetts, 1991, pp. 425-444.

[DiO92] H.G. Dietz, M.T. O'Keefe, and A. Zaafrani, "Static Scheduling for Barrier MIMD Architectures," The Journal of Supercomputing, accepted to appear.

[DiS89] H.G. Dietz, T. Schwederski, M.T. O'Keefe, and A. Zaafrani, "Static Synchronization Beyond VLIW," Supercomputing 1989, November 1989, pp. 416-425.

[Hil87] W.D. Hillis, "The Connection Machine," Scientific American, June 1987, pp. 108-115.

[Kla80] A.D. Klappholz, "An Improved Design for a Stochastically Conflict-Free Memory/Interconnection System," 14th Asilomar Conference on Circuits, Systems, and Computers, November 1980.

[Mas91] MasPar Computer Corporation, MasPar Programming Language (ANSI C compatible MPL) Reference Manual, Software Version 2.2, Document Number 9302-0001, Sunnyvale, California, November 1991.

[NiT90] M. Nilsson and H. Tanaka, "MIMD Execution by SIMD Computers," Journal of Information Processing, Information Processing Society of Japan, vol. 13, no. 1, 1990, pp. 58-61.

[PaD92] T.J. Parr, H.G. Dietz, and W.E. Cohen, "PCCTS Reference Manual (version 1.00)," ACM SIGPLAN Notices, accepted to appear, February 1992.

[Phi89] M.J. Phillip, "Unification of Synchronous and Asynchronous Models for Parallel Programming Languages," Master's Thesis, School of Electrical Engineering, Purdue University, West Lafayette, Indiana, June 1989.

[SiN90] H.J. Siegel, W.G. Nation, and M.D. Allemang, "The Organization of the PASM Reconfigurable Parallel Processing System," Ohio State Parallel Computing Workshop, Computer and Information Science Department, Ohio State University, Ohio, March 1990, pp. 1-12.

[Sto84] H.S. Stone, "Database Applications of the Fetch-And-Add Instruction," IEEE Transactions on Computers, July 1984, pp. 604-612.

[ThG87] S. Thakkar, P. Gifford, and G. Fielland, "Balance: A Shared Memory Multiprocessor System," International Conference on Supercomputing, May 1987, pp. 93-101.

[Thi90] Thinking Machines Corporation, Connection Machine Model CM-2 Technical Summary, version 6.0, Cambridge, Massachusetts, November 1990.

[WiH91] P.A. Wilsey, D.A. Hensgen, C.E. Slusher, N.B. Abu-Ghazaleh, and D.Y. Hollinden, "Exploiting SIMD Computers for Mutant Program Execution," Technical Report No. TR 133-11-91, Department of Electrical and Computer Engineering, University of Cincinnati, Cincinnati, Ohio, November 1991.