
Purdue University

Purdue e-Pubs
ECE Technical Reports Electrical and Computer Engineering

1-1-1992

A Massively Parallel MIMD Implemented by SIMD Hardware?

H. G. Dietz
Purdue University School of Electrical Engineering

W. E. Cohen
Purdue University School of Electrical Engineering

Follow this and additional works at: http://docs.lib.purdue.edu/ecetr

Dietz, H. G. and Cohen, W. E., "A Massively Parallel MIMD Implemented by SIMD Hardware?" (1992). ECE Technical Reports. Paper 280.
http://docs.lib.purdue.edu/ecetr/280

This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. Please contact [email protected] for
additional information.
A Massively Parallel MIMD
Implemented by SIMD
Hardware?

H. G. Dietz
W. E. Cohen

TR-EE 92-4
January 1992

†This work was supported by the Office of Naval Research (ONR)


under grant number N00014-91-J-4013 and by the National Science
Foundation (NSF) under award number 9015696-CDA.

Table of Contents

1. Introduction
   1.1. Interpretation Overhead
   1.2. Indirection
   1.3. Enable Masking
   1.4. Our Approach
2. Instruction Set Design
   2.1. Memory Reference Model
        2.1.1. Local Memory Model
        2.1.2. Global Memory Model
   2.2. Assembly Language Model
   2.3. Prototype Instruction Set
3. Emulator Design
   3.1. Shortening The Basic Cycle
   3.2. Minimizing Operation Time
        3.2.1. Maximizing Instruction Overlap
        3.2.2. Reducing Operation Count
               3.2.2.1. Subemulators
               3.2.2.2. Frequency Biasing
4. Performance Evaluation
   4.1. High-Level Language Peak MFLOPS
   4.2. Emulation Overhead
   4.3. A Many-Thread Example
        4.3.1. The Program
        4.3.2. Performance
5. Room for Improvement
   5.1. Compiler (mimdc)
   5.2. Assembler (mimda)
   5.3. Emulator (mimd)
6. Conclusions
A Massively Parallel MIMD
Implemented By SIMD Hardware?

H. G. Dietz and W. E. Cohen

Parallel Processing Laboratory


School of Electrical Engineering
Purdue University
West Lafayette, IN 47906
[email protected]

Abstract
Both conventional wisdom and engineering practice hold that a massively parallel MIMD
machine should be constructed using a large number of independent processors and an asynchro-
nous interconnection network. In this paper, we suggest that it may be beneficial to implement a
massively parallel MIMD using microcode on a massively parallel SIMD microengine; the syn-
chronous nature of the system allows much higher performance to be obtained with simpler
hardware. The primary disadvantage is simply that the SIMD microengine must serialize execution of different types of instructions, but again the static nature of the machine allows various optimizations that can minimize this detrimental effect.
In addition to presenting the theory behind construction of efficient MIMD machines using
SIMD microengines, this paper discusses how the techniques were applied to create a 16,384-
processor shared memory barrier MIMD using a SIMD MasPar MP-1. Both the MIMD structure
and benchmark results are presented. Even though the MasPar hardware is not ideal for imple-
menting a MIMD and our microinterpreter was written in a high-level language (MPL), peak
MIMD performance was 280 MFLOPS as compared to 1.2 GFLOPS for the native SIMD instruc-
tion set. Of course, comparing peak speeds is of dubious value; hence, we have also included a
number of more realistic benchmark results.

Keywords: MIMD, SIMD, Microcode, Compilers, Common Subexpression Induction.

† This work was supported in part by the Office of Naval Research (ONR) under grant number
N00014-91-J-4013 and by the National Science Foundation (NSF) under award number
9015696-CDA.


1. Introduction
Before discussing how a highly efficient MIMD machine can be built using a SIMD
microengine, it is useful to review the basic issues in interpreting MIMD instructions using a
SIMD machine. In the simplest terms, the way in which one interprets a MIMD instruction set
using SIMD hardware is to write a SIMD program that interpretively executes a MIMD instruc-
tion set. There is nothing particularly difficult about doing this; in fact, one could take a com-
pletely arbitrary MIMD instruction set and execute it on a SIMD machine.
For example, [WiH91] reported on a simple MIMD interpreter running on a MasPar MP-1
[Bla90]. Wilsey, et al., implemented an interpreter for the MINTABS instruction set and indicated that work was in progress on a similar interpreter for the MIPS R2000 instruction set. The MINTABS instruction set is very small (only 8 instructions) and is far from complete in that there is no provision for communication between processors, but it does provide basic MIMD execution. In fairness to [WiH91], their MIMD interpreter was built specifically for parallel execution of mutant versions of serial programs; no communication is needed for that application.
Such an interpreter has a data structure, replicated in each SIMD PE, that corresponds to the
internal registers of each MIMD processor. Hence, the interpreter structure can be as simple as:

Basic MIMD Interpreter Algorithm


1. Each PE fetches an "instruction" into its "instruction register" (IR) and updates its
"program counter" (PC).
2. Each PE decodes the "instruction" from its IR.
3. Repeat steps 3a-3c for each "instruction" type:
a) Disable all PEs where the IR holds an "instruction" of a different type.
b) Simulate execution of the "instruction" on the enabled PEs.
c) Enable all PEs.
4. Go to step 1.
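
The interpreter structure above can be written out as the following C-like SIMD pseudocode sketch (our illustration only; mem, decode, simulate, and NTYPES are hypothetical names, and where is the enable-masking construct discussed in section 1.3):

    /* basic MIMD interpreter: each PE holds its own pc and ir */
    for (;;) {
        ir = mem[pc];                     /* step 1: fetch "instruction"...   */
        pc = pc + 1;                      /* ...and update "program counter"  */
        type = decode(ir);                /* step 2: decode                   */
        for (t = 0; t < NTYPES; t++) {    /* step 3: serialize by type        */
            where (type == t) {           /* 3a: disable PEs of other types   */
                simulate(t);              /* 3b: execute on the enabled PEs   */
            }                             /* 3c: all PEs re-enabled on exit   */
        }
    }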

The only difficulty in implementing an interpreter with the above structure is that the simulated
machine will be very inefficient. There are several reasons for this inefficiency.

1.1. Interpretation Overhead


The most obvious problem is simply that interpretation implies some overhead for the interpreter; even MIMD hardware simulating a MIMD with a different instruction set would suffer this overhead. In addition, SIMD hardware can only simulate execution of one instruction type at a time; hence, the time to execute a simulated instruction is proportional to the sum of the execution times for each instruction type.


1.2. Indirection
Still more insidious is the fact that even step 1 of the above algorithm cannot be executed in parallel across all PEs in many SIMD computers. The next instruction for each PE could be at any location in the PE's local memory, and many SIMD machines do not allow multiple PEs to access different memory locations simultaneously. Hence, on such a SIMD machine, any parallel memory access made will take time proportional to the number of different PE addresses being fetched from¹. For example, this is the case on the TMC CM-1 [Hil87] and TMC CM-2 [Thi90]. Note that step 3b suffers the same difficulty if load or store operations must be performed.

¹ Worse still, for some SIMD machines the technique used takes time proportional to the size of the address space which could be accessed.

Since many operations are limited by (local) memory access speed, inefficient handling of these memory operations can easily make MIMD interpretation on a SIMD machine infeasible.
This overhead can be averted only if the SIMD hardware can indirectly access memory using an address in a PE register. Examples of SIMD machines with such hardware include the PASM prototype [SiN90] and the MasPar MP-1 [Bla90].

1.3. Enable Masking


It is also important to note that the above algorithm assumes that it is possible for PEs to enable and disable themselves (set their own masks). Although most SIMD computers have some ability to disable PEs, in many machines it is either difficult to have the PEs disable themselves (as opposed to having the control unit disable PEs, as in the PASM prototype [SiN90]) or some arithmetic instructions cannot be disabled because they occur in a coprocessor, as in the TMC CM-2 [Thi90]. In such cases, masking can be circumvented by the use of bitwise logical operations, e.g. a C-like SIMD code segment:

where (ir == CMP) {
    /* executed only by PEs in which
       ir has the value CMP; cc is
       not accessed by other PEs
    */
    cc = alu - mbr;
}

might be implemented by all PEs simultaneously executing the C code:

/* use C's bitwise logical operations so
   that cc = alu - mbr in those PEs where
   ir == CMP, and cc = cc in the others
*/
mask = -(ir == CMP);
cc = ((cc & ~mask) | ((alu - mbr) & mask));



which is relatively expensive. Notice that in addition to the bitwise operations, the above imple-
mentation requires a memory access (i.e., loading the value of cc) that would not be necessary
for a machine that supports enable masking in hardware. Because masking is done for each simu-
lated instruction, the masking cost effectively increases the basic interpretation overhead.
Examples of SIMD machines whose hardware can implement the appropriate masking
include the TMC CM-1 and the MasPar MP-1.

1.4. Our Approach


Now consider building a true MIMD machine using a specially designed SIMD microen-
gine instead of simply implementing an interpreter on top of an existing SIMD machine.
Just as building an efficient interpreter would be infeasible unless the SIMD machine has
hardware supporting both indirection and masking, the SIMD microengine must incorporate
hardware for these functions. However, if we are designing a SIMD microengine, it is inexpen-
sive to make it support both indirection and masking. How do we know this? Because the
MasPar MP-1's SIMD instruction set is actually implemented by microcode on a SIMD
microengine that supports both indirection and masking. We are not claiming that the
MasPar MP-1 hardware is our ideal SIMD microengine, but it is close enough to allow us to
implement a proof-of-concept MIMD emulation, as presented in this paper.

[Figure: a microcode decoder/control unit drives processor 0 through processor width-1, each with its own local memory; all are connected through the shared memory interconnection network (global router).]

Figure 1: Block Diagram of MIMD using SIMD microengine


In our system, as shown in figure 1, the MasPar's ACU (Array Control Unit) becomes our microcode decoder and control unit, synchronously managing the parallel system. The ACU memory is thus the microcode store (with virtual memory paging support). Each SIMD PE becomes an essentially complete MIMD processor, except that these processors do not have any local microcode control. The local memory for each PE functions identically in the MIMD organization, except that the union of the local memories, with the help of the global router network, forms a global shared memory. Note that even though global memory references must pass through processors, this is done transparently under microcode control.
Given an appropriate SIMD microengine, the only remaining difficulty is the emulator
(interpreter) overhead associated with decoding and performing operations within each SIMD PE.
By careful construction of the MIMD instruction set and optimization of the emulation algorithm,
the effective interpreter overhead and number of instruction types can be reduced greatly.
The result is a MIMD emulation that typically achieves a large fraction of the peak speed
that a pure SIMD instruction set would obtain using the same SIMD microengine. As a true
microcoded implementation, it is possible that the MIMD machine would have peak performance
virtually identical to the equivalent SIMD machine.
Unfortunately, there are a number of compromises in the implementation of our proof-of-concept prototype MIMD emulator as presented in this paper. By far the most important compromise is that rather than directly using the SIMD microengine, our current version of the emulator is written in MPL [Mas91], a C language dialect that is compiled into the MasPar's SIMD macroinstructions. This results in between about 1/5th and 1/40th of the peak SIMD performance when executing pure MIMD code.
While these numbers rank our prototype 16,384-processor shared memory barrier MIMD as a "marginal" supercomputer peaking in the low 100's of MIPS, and the MasPar MP-1 is cheap enough to even yield a reasonable MIPS/dollar rating using our current emulator, that is not our point. Our point is that, designing a SIMD microengine from scratch, the performance of this new type of MIMD implementation could be superior to that obtained by more conventional MIMD architectures.
The remainder of this paper explains the design, optimization, and prototype performance
of a MIMD machine constructed using a SIMD microengine.

2. Instruction Set Design


Although there are many factors influencing the design of an instruction set, here we are
concerned only with making the instruction set execute efficiently and be powerful enough to
encode reasonable programs.


2.1. Memory Reference Model


In many computers, execution speed is more often limited by memory reference time than
by the speed of arithmetic operations within a processor. The choice of memory reference model
is even more important in the design of a massively parallel machine:
1. Although each processor generally has local memory nearby, the bandwidth is usually
limited by VLSI pinout constraints and the need to minimize the number of memory
chips per processor. For example, on the MasPar MP-1 each group of 16 PEs shares a
single, time multiplexed, port to local memory.
2. Processors inevitably must communicate with each other, hence, some mechanism for
accessing data from another PE is needed. Massively parallel machines can have
massive amounts of memory distributed across all PEs; there is even a strong incen-
tive to spread local data across the machine simply because it might not fit in local
memory, which is typically small.
The following two sections address these issues.

2.1.1. Local Memory Model


The standard solution to the first problem is to use either registers or cache. Fortunately, machines like the MasPar MP-1 have many registers... unfortunately, the same register must be accessed by all enabled PEs. Without the ability for each PE to access a register of its choosing, it is impossible for the PE to efficiently implement a register oriented model; each "register" would have to be stored in local memory. Although the modification of the MasPar MP-1 hardware to support indirect register references (similar to those on the AMD 29K [Adv89]) would be relatively simple, as a practical matter, such a modification is beyond the scope of academic research.
Hence, we are forced to reduce the number of memory references by using an instruction set
in which every instruction accesses the same register for an operand. Either an accumulator-
based or stack cache-based scheme is viable; we used a stack cache in which the top element on
the stack is cached in a particular register. Larger stack cache sizes are impractical due to the
overhead in manipulating registers to appear as a stack cache. The single element stack cache
averts one operand fetch on all unary and binary operations.
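
To make the single-element stack cache concrete, here is a small sketch in C (our illustration, not the emulator's code; all names are hypothetical). Because tos lives in a register, a binary operation such as Add makes one local memory fetch instead of two:

    #include <stdint.h>

    #define STACK_DEPTH 256
    static int32_t stack[STACK_DEPTH];          /* deeper stack, in local memory  */
    static int32_t *sp = &stack[STACK_DEPTH];   /* points at next-on-stack (NOS)  */
    static int32_t tos;                         /* top of stack, cached register  */

    static void op_push(int32_t c)
    {
        *--sp = tos;        /* spill the old top of stack to memory */
        tos = c;            /* the new top stays in the register    */
    }

    static void op_add(void)
    {
        tos = *sp++ + tos;  /* only the NOS operand touches memory  */
    }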

2.1.2. Global Memory Model


Although many small MIMD computers allow all processors to share access to a common memory [ThG87][Cra91], it is very difficult to construct hardware that scales this feature up to thousands of processors. Hence, the primary question becomes whether one should try to make distributed memory hardware appear to software as slow shared memory or as distributed memory accessed by explicitly sending a message to the processor for which that memory is local.


There are two reasons that we use a shared memory model:


1. The shared memory model implies that all packets sent through the network at a given time operate on the same size data - one memory word. This implies that SIMD network control can be used without loss of efficiency [BeS91] - and at a great savings in network switch hardware complexity.
2. If an explicit, asynchronous, message-passing scheme were used, it would be necessary to both buffer messages and to interrupt the receiving processor to process them. These overheads would both complicate (i.e., slow) the emulator and result in longer instruction sequences that could not be executed in parallel on the SIMD microengine.
For these reasons, supporting a shared memory model is actually likely to be more efficient than using explicit message passing. Of course, one still should program so that most references will be to objects in local memory, because global references will be slower. On the 16,384 PE MasPar MP-1 using the global router network, the ratio between global and local references is approximately 10:1².

² This remarkably low ratio is due to the fact that the MasPar MP-1 router is fast and local memory accesses are slow due to 16-way sharing of local memory ports.
There is, however, one other difficulty that arises in the above treatment of shared memory:
if every processor wants to access the same shared memory location, this may cause network
contention that would serialize the operations. Effectively, this was the problem that inspired
"Repetition Filter Memory" [Kla80] and "Fetch-and-Op" [St0841 for shared memory MIMD
computers.
Surprisingly, this problem is much easier to solve when the network control is SIMD
[BeS91]. In effect, races can be resolved by the SIMD microengine's control unit - and the
resulting value can simply be broadcast. For example, the current MIMD emulator allows a
second type of shared memory which is implemented using a copy in each processor; loads are
local memory references, stores have races resolved by the control unit and the result broadcast to
all copies of the variable. The SIMD network control also makes "Fetch-and-Op" efficiently
implementable without additional hardware.
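
The replicated form of shared memory described above can be sketched as follows, in MPL-style pseudocode (our illustration; pick_one stands in for the control unit's race resolution and is hypothetical). Loads are purely local; a store lets the control unit pick one winning value and broadcast it to every copy:

    plural int copy_of_x;               /* one copy of shared x per PE */

    /* load: a purely local memory reference */
    plural int shared_load()
    {
        return copy_of_x;
    }

    /* store: the control unit resolves the race, then broadcasts */
    void shared_store(plural int storing, plural int val)
    {
        int winner;
        where (storing) {
            winner = pick_one(val);     /* e.g., value from one enabled PE */
        }
        copy_of_x = winner;             /* broadcast updates every copy    */
    }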

2.2. Assembly Language Model


In implementing a MIMD machine using a SIMD microengine, it seems that the ultimate
limit on performance must be the fact that the SIMD microengine must serialize execution of dif-
ferent types of MIMD instructions. Hence, one would expect that the slowdown for an emulated
MIMD would be roughly proportional to the sum of the execution times of all instructions in the
instruction set. Fortunately, this need not be the case, because:
1. Here we are talking about a SIMD microengine, and many of the microinstructions implementing different MIMD instructions are of the same type. Hence, it isn't a matter of not being able to overlap the MIMD Mul and Div instructions, but a



matter of overlapping all their constituent microinstructions except for the ALU
operation. The problem is not lack of overlap, but rather the complexity of making
the best choice among the many possible microinstruction overlaps. By designing the
instruction set so that many microinstructions will form common subsequences, only
a small amount of overhead is associated with having a larger instruction set.
2. Even if there are many instructions in the MIMD instruction set and there are few
microinstructions in common, the emulation speed can be very good if only a few dif-
ferent instructions are to be executed in any given emulation cycle. For example, the
MIMD instruction set supported by our prototype emulator contains 38 different
instructions, many of which have little microinstruction overlap; however, even in the
very asynchronous MIMD program given in section 4.3, there were only an average of
6.28 different types of MIMD instructions executed in each emulator cycle. In section
3.2.2.2, we also describe how the number of different MIMD instruction types whose
execution is attempted in each emulator cycle can be artificially reduced.
In fact, the techniques used are so effective that the instruction set size and instruction execution
time have only indirect impact on emulation speed.
The single most severe constraint on instruction set size for the current MIMD emulator is
the desire to minimize the time taken for instruction fetch from local memory - ideally, the
instruction set would have no more than 256 instructions so that all opcodes will fit in 8 bits.
Because of memory address alignment constraints³, the use of 8-bit opcodes makes it difficult to fetch more than an 8-bit immediate operand. Hence, our instruction set incorporates a constant pool that holds up to 256 32-bit values.

³ The MasPar MP-1 microengine has no alignment constraints, but the SIMD macroinstruction set unfortunately does.
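
As a sketch of this encoding (our illustration; the exact operand formats are assumptions beyond what is stated here), an instruction is an 8-bit opcode optionally followed by one byte that is either a sign-extended immediate or an index into the 256-entry pool:

    #include <stdint.h>

    static uint8_t code[65536];    /* instruction stream in PE local memory */
    static int32_t cpool[256];     /* constant pool of 32-bit values        */

    /* fetch the inline operand byte for an Immed- or CPool-class opcode */
    static int32_t fetch_operand(uint16_t *pc, int uses_cpool)
    {
        uint8_t byte = code[(*pc)++];
        if (uses_cpool)
            return cpool[byte];    /* CPool: the byte indexes the pool      */
        return (int8_t)byte;       /* Immed: sign-extended 8-bit value      */
    }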

2.3. Prototype Instruction Set


In the prototype MIMD emulator, we have implemented an instruction set that is as rich as
we felt was useful. Even as this paper is being written, we are considering a number of changes
including the addition of several new instructions.
A brief summary of the MIMD instruction set used in the current emulator appears in table 1. Mnemonics followed by i are operations using 8-bit immediate values, and those followed by c use 32-bit values taken from the constant pool. The processor number, this, and the number of processors, width, are actually special entries in the constant pool that are initialized at program load time; hence, they are accessed via Const instructions. The class membership column of this table is discussed in section 3.2.2.1.



Mnemonic  Description                        Class Membership

ShL       int shift left                     Op_NOS
ShR       int shift right                    Op_NOS Op_Rare
St        store                              Op_NOS
StD       store into distributed memory      Op_NOS Op_Slow Op_Rare
StL       store local                        Op_NOS
StS       store into shared memory           Op_NOS Op_Slow Op_Rare
Wait      wait for barrier synchronization   Op_Slow Op_Rare

Table 1: MIMD Instruction Set.


3. Emulator Design
Although the detailed design of the emulator is intertwined with the design of the instruc-
tion set and the SIMD microengine, for this paper we will make the simplifying assumption that
the SIMD microengine is the machine on which we have implemented our prototype: the MasPar
MP-1. Further, we will restrict the examples to the instruction set as given in table 1 and used in
the prototype emulator.
The most important optimizations of the emulator can be grouped into two categories:
reduction of the emulation overhead by shortening the basic emulator cycle or by maximizing
overlap (parallelism) in emulated execution of different types of instructions.

3.1. Shortening The Basic Cycle


There are many ways to reduce the basic emulator cycle time:
1. Keep processor state in microengine registers. In the case of our interpreter, the program counter (pc), instruction register (op), program relocation base address (addr), constant pool base address (cp), top-of-stack cache (tos), and various other internal variables are all kept in registers.
2. Don't use a linear sequence of enable-masking conditional tests to isolate an operation type. For our emulator, a helper program was written in C to automatically generate an optimal binary search tree for isolating operation types (a sketch of such a tree appears after this list).
3. Either don't use a high-level language or use it, but take steps to ensure good code will be generated. Our emulator is written in MPL and MasPar's MPL compiler usually generates fairly efficient code, but not always. In particular, the MPL compiler is obsessed with performing needless conversion of quantities from 8 to 32 bits - a painful error when the processors are based on 4-bit slices. We repair this code generation blunder by using an AWK script to recognize and remove the needless conversions from the assembly code for the main emulation loop.
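
The shape of such a generated tree is sketched below (our illustration with hypothetical opcode assignments; in the emulator each test is an enable-masking where, not a branch). Eight operation types are isolated in three comparisons rather than eight linear tests:

    if (op < 4) {
        if (op < 2) {
            if (op == 0) { /* M_Add */ } else { /* M_Sub */ }
        } else {
            if (op == 2) { /* M_Mul */ } else { /* M_Div */ }
        }
    } else {
        if (op < 6) {
            if (op == 4) { /* M_Ld */ } else { /* M_St */ }
        } else {
            if (op == 6) { /* M_Jump */ } else { /* M_Ret */ }
        }
    }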
In addition to the above, there are a number of minor coding tricks employed.

3.2. Minimizing Operation Time


With a small instruction set consisting of relatively cheap operations, the basic emulator
cycle time is more important than the serialization of execution of different operations. However,
a truly useful machine needs more operation types and must support at least a few expensive
operations (e.g., shared memory references). Hence, it is very important that there be some tech-
niques used to reduce the operation time.
There are two basic ways in which the operation time can be reduced:
1. Increase the overlap, at the microcode level, between the various instructions that are
to be executed.


2. Reduce the number of different operations that must be executed in an emulator cycle
-ideally to just one operation, i.e., to SIMD code.

The following sections detail the methods used in the current emulator.

3.2.1. Maximizing Instruction Overlap


The concept of maximizing the microcode overlap for a series of operations is not new. In
fact, it is probably the single most common hand optimization used in writing microcode or code
for a SIMD machine. Unfortunately, the process had not been formalized and automated until
very recently.
The new compiler optimization, called "Common Subexpression Induction" (CSI) [Die91], accepts multiple independent threads of code and outputs a reorganized version of the code that shares instructions across threads so that the minimum execution time is obtained.
Although the algorithm is far too complex to describe in this paper, the general flavor is that
operations from various threads are classified based on how they could be merged into single
instructions executed by multiple threads and then a heavily pruned search is executed to find the
minimum execution time code schedule using these merges. In fact, the development of the CSI
algorithm was the enabling technology that inspired our first MIMD interpreter.
Without the CSI algorithm, it is possible to find and factor-out common microinstruction
subsequences by hand only for very small, simple, instruction sets. For the MIMD emulator
presented here, hand tuning was inconvenient, but coding the emulator in MPL made it impossi-
ble to directly use our CSI tool (since the CSI tool generates unstructured control flow and mask-
ing). Hence, we used the CSI tool to locate the most advantageous subsequences and then hand
coded them in MPL.
These sequences included the basic instruction fetch and program counter increment, fetch-
ing the value for the next-on-stack (NOS), fetching the value for an 8-bit immediate (Immed),
and looking-up a 32-bit value in the constant pool (CPool). Without this factoring, the emulator
would be several times slower.
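
As a minimal illustration of this factoring (our sketch, not CSI tool output; nos, tos, mem, and sp are hypothetical names), consider emulating Add and Mul: the instruction fetch, program counter update, and NOS fetch are identical subsequences, so they run once for all relevant PEs and only the ALU operation is serialized:

    where ((op == OP_ADD) || (op == OP_MUL)) {
        nos = mem[sp]; sp = sp + 1;    /* common: fetch next-on-stack */
        where (op == OP_ADD) {
            tos = nos + tos;           /* serialized: Add ALU step    */
        }
        where (op == OP_MUL) {
            tos = nos * tos;           /* serialized: Mul ALU step    */
        }
    }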

3.2.2. Reducing Operation Count


The second "trick" in speeding up execution of multiple different operations involves the
observation that the emulator need only be able to decode the instructions which might be exe-
cuted in this cycle, rather than the entire instruction set. But how can we know which instruc-
tions might be executed in this cycle without actually decoding them first?
There are two answers, both of which are used heavily in the current emulator: subemula-
tors and frequency biasing.


3.2.2.1. Subemulators
Because the microengine is completely synchronous, it is relatively easy to construct hardware that will allow the control unit to check to see if there exists a processor in which a particular value meets some condition. In the MasPar, this is implemented by an operation called globalor, which in just 10 clock ticks (less than 1μs) ors together values from all the PEs. By carefully encoding the instruction set, we can use a globalor of the opcode values to index a control unit jump table to select the emulator that understands only those instructions that could appear within the or mask value.
Within the current emulator, there are 32 such "subemulators." Obviously, the 32 subemulators could not reasonably be generated by hand, so a C program was written to perform this task.
In addition, the choice of how instructions are grouped together into subemulators should not be made at random. Instructions that share CSIs should be grouped together because factoring-out the CSIs will make those interpreters execute faster. Hence, the NOS, Immed, and CPool CSIs (described above) correspond to bit positions in both the opcode and the globalor mask. It is also useful to make similar divisions based on the expected cost (Slow) and frequency of execution (Rare), and these sets also correspond to opcode bits. The class membership of each instruction in our MIMD instruction set is given in table 1.
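
Concretely, the case labels in listing 2 imply the following class-bit assignments (the symbolic values and jump_table are our reconstruction): each class owns one opcode bit, so oring the opcodes of all PEs yields a 5-bit index into the table of 32 subemulators:

    #define Op_Rare   0x01
    #define Op_Slow   0x02
    #define Op_CPool  0x04
    #define Op_NOS    0x08
    #define Op_Immed  0x10

    /* e.g., JumpF belongs to Immed, NOS, and CPool, so a cycle in which
       only JumpF is pending yields (globalor op) & opmask == 0x1c */
    subemulator = jump_table[(globalor op) & opmask];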
To illustrate the subinterpreter structure, we present a complete subemulator set. However,
to keep the size reasonable, we have restricted the subemulator set to cover only the instructions
used in the example in listing 1.


/*
 * Recursive int factorial
 */
int
fact(int n)
{
    if (n) {
        return(n * fact(n-1));
    }
    return(1);
}

_fact:
    LdL                 ; if (n)
    JumpF   L0
    LdL                 ; n (left operand of *)
    Const   retlab0     ; push return address
    LdL                 ; n
    Const   -1
    Add                 ; n - 1
    Jump    _fact       ; recursive call: fact(n-1)
retlab0:
    Mul                 ; n * fact(n-1)
    Ret
L0:
    Const   1           ; return(1)
    Ret

Listing 1: MIMD Factorial - C and Assembly Code

The complete subemulator set is given in listing 2. The Op_ references are opcode values or bit masks; the M_ references are macros that actually perform the corresponding operation. If we assume that all processors in the MIMD machine call _fact at the same time, all processors would simultaneously execute the Const instruction using the subinterpreter for classes Immed and CPool (case 0x14). Suppose that some processors wish to execute Mul (classes NOS and Rare) at the same time that others execute JumpF (Immed, NOS, and CPool); then the subemulator for Immed, NOS, CPool, and Rare would be executed (case 0x1d). Notice that the C program that builds the subemulators factors-out identical subemulators (e.g., case 0x19 and case 0x1b) to save ACU memory space.


switch ((globalor op) & opmask) {

case 0x0:   /* Opcodes in every class */
case 0x1:   /* Op_Rare */
case 0x2:   /* Op_Slow */
case 0x3:   /* Op_Slow Op_Rare */
case 0x4:   /* Op_CPool */
case 0x5:   /* Op_CPool Op_Rare */
case 0x6:   /* Op_CPool Op_Slow */
case 0x7:   /* Op_CPool Op_Slow Op_Rare */
    M_LdL
    break;

case 0x8:   /* Op_NOS */
case 0xa:   /* Op_NOS Op_Slow */
case 0xc:   /* Op_NOS Op_CPool */
case 0xe:   /* Op_NOS Op_CPool Op_Slow */
    if (op & Op_NOS) {
        M_NOS M_Add
    } else {
        M_LdL
    }
    break;

case 0x9:   /* Op_NOS Op_Rare */
case 0xb:   /* Op_NOS Op_Slow Op_Rare */
case 0xd:   /* Op_NOS Op_CPool Op_Rare */
case 0xf:   /* Op_NOS Op_CPool Op_Slow Op_Rare */
    if (op & Op_NOS) {
        M_NOS
        if (op <= Op_Add) {
            M_Add
        } else {
            M_Mul
        }
    } else {
        M_LdL
    }
    break;

case 0x10:  /* Op_Immed */
case 0x11:  /* Op_Immed Op_Rare */
case 0x12:  /* Op_Immed Op_Slow */
case 0x13:  /* Op_Immed Op_Slow Op_Rare */
    if (op & Op_Immed) {
        M_Immed M_Ret
    } else {
        M_LdL
    }
    break;

case 0x14:  /* Op_Immed Op_CPool */
case 0x15:  /* Op_Immed Op_CPool Op_Rare */
case 0x16:  /* Op_Immed Op_CPool Op_Slow */
case 0x17:  /* Op_Immed Op_CPool Op_Slow Op_Rare */
    if (op & Op_Immed) {
        M_Immed
        if (op & Op_CPool) {
            M_CPool
        }
        if (op <= Op_Ret) {
            if (op <= Op_Const) {
                M_Const
            } else {
                M_Ret
            }
        } else {
            if (op <= Op_Jump) {
                M_Jump
            } else {
                M_JumpF
            }
        }
    } else {
        M_LdL
    }
    break;

case 0x18:  /* Op_Immed Op_NOS */
case 0x1a:  /* Op_Immed Op_NOS Op_Slow */
    if (op & Op_NOS) {
        M_NOS M_Add
    } else {
        if (op & Op_Immed) {
            M_Immed
        }
        if (op <= Op_Ret) {
            M_Ret
        } else {
            M_LdL
        }
    }
    break;

case 0x19:  /* Op_Immed Op_NOS Op_Rare */
case 0x1b:  /* Op_Immed Op_NOS Op_Slow Op_Rare */
    if (op & Op_NOS) {
        M_NOS
        if (op <= Op_Add) {
            M_Add
        } else {
            M_Mul
        }
    } else {
        if (op & Op_Immed) {
            M_Immed M_Ret
        } else {
            M_LdL
        }
    }
    break;

case 0x1c:  /* Op_Immed Op_NOS Op_CPool */
case 0x1e:  /* Op_Immed Op_NOS Op_CPool Op_Slow */
    if (op & Op_NOS) {
        M_NOS
        if (op & Op_Immed) {
            M_Immed M_CPool M_JumpF
        } else {
            M_Add
        }
    } else {
        if (op & Op_Immed) {
            M_Immed
            if (op & Op_CPool) {
                M_CPool
            }
        }
        if (op <= Op_Ret) {
            if (op <= Op_Const) {
                M_Const
            } else {
                M_Ret
            }
        } else {
            if (op <= Op_Jump) {
                M_Jump
            } else {
                M_LdL
            }
        }
    }
    break;

case 0x1d:  /* Op_Immed Op_NOS Op_CPool Op_Rare */
case 0x1f:  /* Op_Immed Op_NOS Op_CPool Op_Slow Op_Rare */
    if (op & Op_NOS) {
        M_NOS
        if (op <= Op_JumpF) {
            if (op <= Op_Add) {
                M_Add
            } else {
                M_JumpF
            }
        } else {
            M_Mul
        }
    } else {
        if (op & Op_Immed) {
            M_Immed
            if (op & Op_CPool) {
                M_CPool
            }
        }
        if (op <= Op_Ret) {
            if (op <= Op_Const) {
                M_Const
            } else {
                M_Ret
            }
        } else {
            if (op <= Op_Jump) {
                M_Jump
            } else {
                M_LdL
            }
        }
    }
    break;
}
Listing 2: Example Subemulator Set

A vaguely similar type of improvement is suggested in [NiT90]. Nilsson and Tanaka envision a set of subinterpreters such that each subinterpreter emulates only a single type of instruction and all subinterpreters are executed once per interpreter cycle. Using statistics, they change


the order of the subinterpreters to maximize the expected number of instructions executed per
processor per interpreter cycle. E.g., if the subinterpreters are in the order A, B, C then the
instruction sequence B, A takes 2 cycles - but B, A would take only one cycle if the subinter-
preters were ordered B, C, A. The problem is that this improvement is small and is essentially
incompatible with factoring-out portions of the emulated operations (i.e., instruction fetch).

3.2.2.2. Frequency Biasing


The second way to reduce operation count is what we call frequency biasing. Suppose that a particular operation takes t1 ticks to execute and another operation takes t2 = 5*t1. If these two instructions were allowed to execute in each emulator cycle, the apparent execution time of both would be about 6*t1. Suppose that instead, we would allow up to five emulator cycles of the first instruction before attempting to execute the second instruction; the fast instruction will average one execution every 2 emulator cycles, and the slow instruction will average one execution every 10 cycles. This is essentially an instruction-level variation on the concept of shortest job first (SJF) scheduling, and yields the same benefits.
However, the benefits would be small were it not for an interesting property of most expen-
sive operations: if several expensive operations would have been executed just one or two emula-
tor cycles off from each other, delaying the operations will cause them to group together in
the same emulator cycle. For most operations, having more processors execute the operation
simultaneously does not significantly change the speed with which that operation is executed.
Hence, this "alignment" effect dramatically improves performance.
Notice that if we consider not two instructions, but two groups of instructions, the same
property holds.
In the current version of the emulator, only a small amount of frequency biasing is used. Instructions that are in either the Slow or CPool classes are only allowed to execute every other cycle. Of course, we need not be able to decode these instructions in the subemulator set that excludes these operations. Hence, there are actually two different subemulator sets, or a total of 64 subemulators, within the current emulator. Despite this, the complete emulator uses less than 80K bytes of ACU memory.
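
A sketch of the biased top-level loop (our illustration; run_subemulator is a hypothetical stand-in for the jump-table dispatch): on alternate cycles the Slow and CPool bits are masked out of the globalor result, so PEs holding those instructions simply wait for an unbiased cycle and tend to group together.

    int cycle = 0;
    for (;;) {
        int opmask = Op_Immed | Op_NOS | Op_CPool | Op_Slow | Op_Rare;
        if (cycle & 1)
            opmask &= ~(Op_Slow | Op_CPool);   /* biased cycle: these
                                                  classes must wait */
        run_subemulator((globalor op) & opmask);
        cycle = cycle + 1;
    }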

4. Performance Evaluation
Our first proof of concept MIMD system was implemented on the Purdue University Parallel Processing Laboratory's 16,384-PE MasPar MP-1 in July 1991, shortly after developing the CSI algorithm and prototype implementation. The current version (January 1992) of the MIMD system includes:
mimdc
    A compiler, written in C using PCCTS [PaD92]. The language is a parallel dialect of


C called MIMDC. It supports most of the basic C constructs. Data values can be either int or float, and variables can be declared as mono (shared) or poly (private) [Phi89].
    There are actually two kinds of shared memory reference supported. The mono variables are replicated in each processor's local memory so that loads execute quickly, but stores involve a broadcast to update all copies. It is also possible to directly access poly values from other processors using "parallel subscripting":

        x[|i] = y[|j] + z;

    would use the values of i, j, and z on this processor to fetch the value of y from processor j, add z, and store the result into the x on processor i. In addition to using shared memory for synchronization, MIMDC supports barrier synchronization [DiS89] using a wait statement.
mimda
    An assembler, written in C. The stack-based MIMD assembly code output by mimdc is assembled to generate both a listing file and an Intel-format hex load module.
mimd
    The MIMD interpreter, written in MPL (MasPar's SIMD C dialect) with the aid of several specialized interpreter construction programs written in C and AWK. The structure of mimd was described in detail in section 3.
Benchmark programs were written in MIMDC and their performance was evaluated. A high
level language was used for the benchmarks because we feel that it both "keeps us honest" and
provides a friendlier, more realistic, interface for program development.

4.1. High-Level Language Peak MFLOPS


Although we have measured the peak floating point performance of hand-coded MIMD pro-
grams at from 280 MFLOPS to over 350 MFLOPS, we felt that the fairest comparison would be
to take essentially SIMD codes, written in MIMDC and MPL, and compare the MFLOPS
obtained. Listing 3 shows these two equivalent programs.


/* Do 1 GFLOP worth of float adds.... (MIMDC version) */

main()
{
    int count = 10000000/16384;
    float sum = 0.0;

    while (count) {
        /* do 100 float adds per loop.... */
        sum = sum + sum + sum + sum + sum +
              sum + sum + sum + sum + sum +
              /* ...the chain continues for 100 adds in all... */
              sum + sum + sum + sum + sum + sum;
        count = count - 1;
    }
    /* mimd emulator automatically prints time */
}

/* Do 1 GFLOP worth of float adds.... (MPL version) */

extern double dpuTimerElapsed();

int
main()
{
    int count = 10000000/16384;
    plural float sum = 0.0;

    while (count) {
        /* do 100 float adds per loop... */
        sum = sum + sum + sum + sum + sum +
              sum + sum + sum + sum + sum +
              /* ...the chain continues for 100 adds in all... */
              sum + sum + sum + sum + sum + sum;
        count = count - 1;
    }
    printf("Done: %gs DPU usage\n", dpuTimerElapsed());
}

Listing 3: Peak FLOPS benchmark in MIMDC and MPL.

In the MIMD program, all processors execute the same code sequence, only one instruction is executed in the emulator for each MIMD cycle, and processors are rarely idle. The other code, written in MPL, executes with all processors enabled and is completely SIMD. Neither program does any useful calculations, but the performance provides a good estimate of peak floating point speed⁴. The emulator achieved 97.2 MFLOPS, or about 10% of the MPL program's 986 MFLOPS. Note that the MasPar's theoretical peak speed is 1,200 MFLOPS.

⁴ Note that the MasPar floating point operation time is not dependent on operand value, hence adding 0 values yields a valid time without the potential for overflow.

4.2. Emulation Overhead


The above numbers also allow us to compute something much more meaningful: the emulation overhead. Since our emulator records the number of emulation cycles executed, and we know that the actual operations must have taken the time that the MPL program ran for, we were able to determine that each emulator cycle had an overhead of about 48μs. This number was also confirmed by other benchmarks.
Aside from the fact that 48μs is surprisingly fast, it is important to note that most of this overhead could be eliminated by recoding the emulation in a different language. An obvious way


to reduce the overhead is to write the emulator in the MasPar's SIMD assembly language instead of in MPL; however, unless the emulator algorithm also is changed, the improvement would be quite small. This is partly because MPL is low-level enough (e.g., register declarations) to usually generate good code, and partly because we already use an AWK script to patch the few obvious blunders made by MPL.
The insight that could remove most of the 48μs overhead is that the MasPar's 32-bit RISC SIMD instruction set is implemented by microcode executed on 4-bit PEs. By implementing the MIMD emulator as a single new microcoded instruction, emulateMIMD, the emulation overhead per emulator cycle would almost certainly drop to less than 10μs.
Small additional improvements, at either the assembly or microcode level, could result by
slightly altering the emulation algorithm. Essentially, MPL only allows structured mixing of
control flow and enable masking; there are a few portions of the emulation that could profit from
directly manipulating enable masks.

4.3. A Many-Thread Example


While the above numbers are impressive, they should be impressive because for each emulator cycle, all MIMD threads were executing the same instruction taken from the same relative location in PE memory. Such code sequences are actually common in massively parallel MIMD code, but it is much more important that most cases typically encountered perform reasonably. In fact, the emulator structure is not designed to maximize best-case execution speed.
Recall that different instruction types execute serially in the SIMD microengine. Hence, a MIMD program that tends to have a wide range of different instructions being encountered within each emulator cycle should provide much poorer performance. These are also the cases that most of the emulator's optimizations attempt to improve. A MIMD program with this property makes a much tougher test case.

4.3.1. The Program


Unfortunately, most parallel benchmarks are more SIMD in nature, and would yield better
performance. Lacking a good "standard" example MIMD program, we selected a recursive
algorithm in which processors take radically different paths through the code based on their pro-
cessor numbers. The problem selected, and implemented in MIMDC, was a graph fault-tolerance
problem in which each of the 16,384 processors analyzed a unique graph derived from a master
graph. The master graph is shown in figure 2; each line represents two arcs, one in each direc-
tion.


Figure 2: Initial Graph for graph.mc.

The program in listing 4 uses an exhaustive recursive search to determine, for all possible combinations of faulty arcs in the master graph, the total number of faulty states for which it is still possible to travel from node 0 to node 4. We do not claim that this is a good algorithm for this problem, but it is a good example of "true MIMD" code.
All 16,384 processors begin by executing main(). Each processor reads the master graph and modifies it to produce a unique faulty graph by removing arcs corresponding to 0 bits in the processor number (called this in MIMDC). Since 16,384 is 2¹⁴, we use a graph with 14 arcs so that each processor will have a unique task. The function foundpath() determines whether a path exists by a depth first search. It returns as soon as it has found a node that it has already visited, has reached the destination, or has explored all arcs leaving the node. If it finds an arc that goes to a node it has not visited, it recursively calls itself with the unexplored node. When all faulty graphs have been explored, the reduceAdd() function uses barrier synchronization and distributed memory accesses to total the number of successful path traversals.


/*  A simple little program to compute some basic
    fault tolerance properties....
*/

mono int master_from[LINKS] = {
    0,1,0,2,0,3,0,4,1,2,2,4,3,4
};
mono int master_to[LINKS] = {
    1,0,2,0,3,0,4,0,2,1,4,2,4,3
};
mono int master_arcs = 14;

poly int from[LINKS];
poly int to[LINKS];
poly int been_there[LINKS];
poly int arcs = 0;

int
main()
{
    /* Initialize poly copies of the master graph
       with all arcs removed corresponding to the
       0 bits in the processor number
    */
    int i = 0;
    int mybits = this;
    int gotpath = 0;

    while (i < master_arcs) {
        if (mybits & 1) {
            from[arcs] = master_from[i];
            to[arcs] = master_to[i];
            arcs = arcs + 1;
        }
        mybits = mybits >> 1;
        i = i + 1;
    }

    /* Try to find the paths from node 0 to 4 */
    if (this < (1<<master_arcs)) {
        gotpath = foundpath(0, 4);
    } else {
        gotpath = 0;
    }
    gotpath = reduceAdd(gotpath);

    /* Let processor 0 speak for all. */
    if (this == 0) {
        print "There were ", (1<<master_arcs),
            " networks checked.\n";
        print "Of these, ", gotpath,
            " could reach 4 from 0.\n";
    }
}

int
foundpath(int here, int there)
{
    int i = 0;

    /* Are we there yet? */
    if (here == there) return(1);

    /* We are here. */
    been_there[here] = 1;

    /* Try each arc outta here... */
    while (i < arcs) {
        /* Found an arc out... */
        if (from[i] == here) {
            /* Have we been where it goes? */
            if (been_there[ to[i] ] == 0) {
                /* Nope.  Go there now.... */
                if (foundpath(to[i], there)) {
                    return(1);
                }
            }
        }
        /* No luck yet, try any other arcs... */
        i = i + 1;
    }
    /* Can't get there from here.... */
    return(0);
}

poly int reduce_tmp;
poly int rk;
poly int rj;
poly int ri;

int
reduceAdd(int val)
{
    /* Recursive doubling summation */
    ri = 1;
    reduce_tmp = val;
    while (ri < width) {
        wait;
        rk = (this + ri) % (width);
        rj = reduce_tmp[|rk];
        wait;
        reduce_tmp = reduce_tmp + rj;
        ri = ri << 1;
    }
    return(reduce_tmp);
}

Listing 4: Code for graph.mc.

Under the MIMD emulator, mimd, each processor executes its own path through the code. Hence, the execution of this program differs greatly from a path search program for a SIMD machine.
On average, there were 16.3 unique program counter (PC) values active for each emulation cycle, which is roughly equivalent to 16 completely different programs executing. This average was obtained in a program that only has a total of 234 instructions. The number of PCs is also limited by the wide range in processor work loads, which causes many processors to wait at the first barrier in reduceAdd(). Averages of over 50 different PC values have been seen on a version of graph.mc that builds many different permutations of the graph for each processor before summing the number of networks with paths between 0 and 4 - but that program was

more complicated and the statistics were very similar. By any standard, the example code is very
dynamic.

4.3.2. Performance
Complete emulation statistics for the code of listing 4 are given in tables 2 and 3. For table 2, the interpret count gives the number of emulator cycles in which that instruction was executed; the execute count is the total number of times that instruction was executed. Table 3 shows the number of times each particular subemulator mask occurred.
The execution speed was determined in two steps. First, the code was timed using the hardware timer available on the MasPar (80ns per tick). Then, the code was run using an instrumented version of the emulator to get the total number of emulator cycles needed to complete the program, number of cycles that had a particular instruction, number of each instruction executed, and the total number of instructions run. The instrumented MIMD emulator takes longer to run the program, but preserves the relative time between processors, number of instructions executed, and the number of emulator cycles in the MIMD program - this repeatability in itself is an important advantage of our MIMD implementation technique. The total number of instructions found by the instrumented emulator, divided by the run time of the uninstrumented emulator, yields the average number of instructions executed per second for the uninstrumented emulator.
For graph.mc, the average speed was 54.6 MIPS (excluding output from the print instructions at the end of the program). On average each processor was executing 3,300 MIMD instructions per second. This seems to be a very low number, but the processors on the MP-1 are implemented by 4-bit slices and each cluster of 16 shares a single interface to its local memory. Assuming that all of the processors are totally asynchronous, the maximum rate at which they would be able to fetch an 8-bit instruction, execute a simple 32-bit operation, and update the program counter is 123,000 instructions per second. Thus, the MIMD emulator had a slowdown of less than 37 times the maximum performance that would be obtained if each processor had its own instruction decoder and control - which would imply many times more hardware to implement each processor. We suspect, but cannot yet prove, that the additional hardware would actually increase processor hardware complexity by more than a factor of 37 (primarily due to the complexity of floating point and network control). Also keep in mind that we are still talking about the MPL-coded emulator speed versus pure MIMD hardware....
It should also be noted that, although graph.mc does not use floating point, this makes little difference in performance. Actually, the 32-bit floating point instructions for multiplication and division take significantly less time than the 32-bit integer versions; this is due to the lower precision - just 24 bits of mantissa. Much of the run time of the emulator is due to decoding the instruction and fetching operands (as was shown in the gflop benchmark; see section 4.2).


Operation   Interpret Count   Execute Count

Add         1805              1899520
And         105               229376
Const       1333              5257088
Eq          992               344320
GT          816               761472
Jump        889               773632
JumpF       1028              1373440
Ld          1945              2882560
LdD         14                229376
LdL         1980              2411904
Mod         14                229376
Pop         1                 16384
Push        2244              5637376
Ret         705               87424
ShL         1141              1131392
ShR         210               229376
St          1156              1336832
StL         1044              702592
Wait        1018              10197504
totals      18440             35730944

Table 2: Instruction Statistics for graph.mc.

Subemulator Mask                                 Interpret Count

0                                                99
Op_Slow Op_Rare                                  30
Op_NOS                                           118
Op_NOS Op_Rare                                   20
Op_NOS Op_Slow Op_Rare                           35
Op_Immed                                         104
Op_Immed Op_Rare                                 1
Op_Immed Op_Slow Op_Rare                         5
Op_Immed Op_CPool                                225
Op_Immed Op_CPool Op_Slow Op_Rare                26
Op_Immed Op_NOS                                  980
Op_Immed Op_NOS Op_Rare                          121
Op_Immed Op_NOS Op_Slow Op_Rare                  3
Op_Immed Op_NOS Op_CPool                         56
Op_Immed Op_NOS Op_CPool Op_Rare                 165
Op_Immed Op_NOS Op_CPool Op_Slow Op_Rare         947
total                                            2935

Table 3: Subemulator Statistics for graph.mc.


5. Room for Improvement


Although the current MIMD emulation system works quite well, there are quite a few
improvements that should be made. The following is a summary of a few of the more significant
possible enhancements.

5.1. Compiler (mimdc)


Aside from being a rather stupid compiler (i.e., the only optimization performed is constant folding), the compiler makes no attempt to perform code scheduling of any kind.
Within an individual processor, code scheduling can be used to improve performance by matching generated code sequences to the order in which different operations are encountered in the emulator. First, the compiler should attempt to be consistent in generating the same instruction pattern wherever possible. Second, because not all instructions are executed in every cycle, some permutations of an instruction sequence will have lower expected execution times than others.
Even greater performance improvements can be made by code scheduling across all processors. This involves complex timing analysis and static scheduling as a barrier MIMD architecture [DiO90][BrN90][DiO92], but a MIMD implemented with a SIMD microengine provides exactly the features needed for these optimizations to be applied.
The compiler is also guilty of a few obvious coding blunders. For example, while loops should be coded to only have one JumpF rather than a JumpF and a Jump per iteration.
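
The two code shapes are sketched below in the emulator's assembly mnemonics (our illustration; the label syntax follows listing 1). The current shape branches twice per iteration; the rotated shape, with the loop test compiled in negated form at the bottom, branches once:

    ; current shape: a JumpF and a Jump per iteration
    top:    ...             ; evaluate loop condition C
            JumpF   done
            ...             ; loop body
            Jump    top
    done:

    ; rotated shape: one JumpF per iteration
            ...             ; evaluate !C (condition compiled negated)
            JumpF   body    ; enter the loop if C holds
            Jump    done
    body:   ...             ; loop body
            ...             ; evaluate !C
            JumpF   body    ; loop again while C holds
    done: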

5.2. Assembler (mimda)


Although the assembler was constructed using a parameterized assembler (a local invention called ASA) that is capable of minimizing length of span-dependent instructions, we do not currently use this feature. Hence, the compiler often conservatively generates Const instructions for which the assembler blindly generates Const instructions. Instead, the assembler should recognize Const as the long form of the Push instruction, and should substitute Push wherever possible.

5.3. Emulator (mimd)


Throughout this paper we have noted a number of things about the emulator that mark it clearly as a proof-of-concept prototype rather than a "real" machine. Obvious improvements include rewriting the emulator in the MasPar microcode, or at least in MasPar assembly language, instead of MPL. There are also some optimizations that result in unstructured manipulation of control flow and masking, and these could not be done in MPL, so the emulation algorithm will change in future versions.
Various changes and additions to the instruction set are also likely. In particular, the barrier mechanism will be made more general (currently it is an SBM, but will be upgraded to a DBM [DiO90]) and some provision for explicitly switching to pure SIMD execution will be added.


This would allow the machine to more efficiently execute parts of algorithms that are inherently
SIMD, such as communication or reduction operations. The MIMD/SIMD switching will
vaguely resemble the facility provided in the PASM prototype [BeS91].
In the immediate future, the emulator will be modified to provide a rudimentary operating
system so that multiple users will be able to submit MIMD jobs asynchronously. In the current
version, the complete MIMD environment is set up when the emulator begins executing.

6. Conclusions
In this paper, we have presented the theory behind construction of efficient MIMD machines using SIMD microengines. Further, we have detailed how we created a 16,384-processor shared memory barrier MIMD using a SIMD MasPar MP-1, and we have given measured performance figures that validate the approach.
The MIMD emulation software discussed in this paper, mimdc, mimda, and mimd, are being set up as a public domain Beta test release, and will be available via an email server. The email address will appear in the final version of this paper.


References
[Adv89] Advanced Micro Devices, 29K Family 1990 Data Book, Sunnyvale, California,
1989.
[Bla90] T. Blank, "The MasPar MP-1 Architecture," 35th IEEE Computer Society International Conference (COMPCON), February 1990, pp. 20-24.
[BrN90] C.J. Brownhill and A. Nicolau, Percolation Scheduling for Non-VLIW Machines,
Technical Report 90-02, University of California at Irvine, Irvine, California, January 1990.
[BeS91] T.B. Berg and H.J. Siegel, "Instruction Execution Trade-offs for SIMD vs. MIMD
vs. Mixed Mode Parallelism," 5th International Parallel Processing Symposium,
April 1991, pp. 301-308.
[Cra91] Cray Research Incorporated, The CRAY Y-MP C90 Supercomputer System, Eagan,
Minnesota, 1991.
[Die91] H.G. Dietz, "Common Subexpression Induction," Midwest Society for Programming Languages and Systems (MWSPLS) Spring Meeting, Digital Computer
Laboratory (DCL), University of Illinois at Urbana-Champaign, Urbana, Illinois,
April 20, 1991.
[DiO90] H.G. Dietz, M.T. O'Keefe, and A. Zaafrani, "An Introduction to Static Scheduling
for MIMD Architectures," Advances in Languages and Compilers for Parallel Processing, edited by A. Nicolau, D. Gelernter, T. Gross, and D. Padua, The MIT Press,
Cambridge, Massachusetts, 1991, pp. 425-444.
[DiO92] H.G. Dietz, M.T. O'Keefe, and A. Zaafrani, "Static Scheduling for Barrier MIMD
Architectures," The Journal of Supercomputing, accepted to appear.
[DiS89] H.G. Dietz, T. Schwederski, M. T. O'Keefe, and A. Zaafrani, "Static Synchroniza-
tion Beyond VLIW," Supercomputing 1989, November 1989, pp. 416-425.
[Hil87] W.D. Hillis, "The Connection Machine," Scientific American, June 1987, pp. 108-115.
[Kla80] A.D. Klappholz, "An Improved Design for a Stochastically Conflict-Free
Memory/Interconnection System," 14th Asilomar Conference on Circuits, Systems,
and Computers, November 1980.
[Mas91] MasPar Computer Corporation, MasPar Programming Language (ANSI C compatible MPL) Reference Manual, Software Version 2.2, Document Number 9302-0001,
Sunnyvale, California, November 1991.
[NiT90] M. Nilsson and H. Tanaka, "MIMD Execution by SIMD Computers," Journal of
Information Processing, Information Processing Society of Japan, vol. 13, no. 1,
1990, pp. 58-61.
[PaD92] T.J. Parr, H.G. Dietz, and W.E. Cohen, "PCCTS Reference Manual (version 1.00),"
ACM SIGPLAN Notices, accepted to appear, February 1992.


[Phi89] M.J. Phillip, "Unification of Synchronous and Asynchronous Models for Parallel
Programming Languages," Master's Thesis, School of Electrical Engineering, Purdue University, West Lafayette, Indiana, June 1989.
[SiN90] H.J. Siegel, W.G. Nation, and M.D. Allemang, "The Organization of the PASM
Reconfigurable Parallel Processing System," Ohio State Parallel Computing
Workshop, Computer and Information Science Department, Ohio State University,
Ohio, March 1990, pp. 1-12.
[Sto84] H.S. Stone, "Database Applications of the Fetch-And-Add Instruction," IEEE Transactions on Computers, July 1984, pp. 604-612.
[ThG87] S. Thakkar, P. Gifford, and G. Fielland, "Balance: A Shared Memory Multiproces-
sor System," International Conference on Supercomputing, May 1987, pp. 93-101.
[Thi90] Thinking Machines Corporation, Connection Machine Model CM-2 Technical Summary, version 6.0, Cambridge, Massachusetts, November 1990.
[WiH91] P.A. Wilsey, D.A. Hensgen, C.E. Slusher, N.B. Abu-Ghazaleh, and D.Y. Hollinden,
"Exploiting SIMD Computers for Mutant Program Execution," Technical Report
No. TR 133-11-91, Department of Electrical and Computer Engineering, University
of Cincinnati, Cincinnati, Ohio, November 1991.
