CSA Notes Mod 6 - Part 2
• Traditionally, shared-memory multiprocessors like the Cray Y-MP were used to perform coarse-grain computations, in which each processor executed programs with tasks of a few seconds or longer.
• Message-passing multicomputers were used to execute medium-grain programs with task sizes of approximately 10 ms, as in the iPSC/1.
• In order to build MPP (Massive Parallel Processing) systems, we may have to explore a higher degree of parallelism by making the task grain size even smaller.
• Fine-grain parallelism was utilized in SIMD or data-parallel computers like the CM-2 and in the message-driven J-Machine and Mosaic C.
Fine-Grain Parallelism
Latency Analysis
The computing granularity and communication latency of leading early examples of multiprocessors, data-parallel computers, and medium- and fine-grain multicomputers are summarized in the table. Four attributes are identified to characterize these machines; only typical values are shown.
• The communication latency (Tc) measures the data or message transfer time on a system interconnect. This corresponds to the shared-memory access time on the Cray Y-MP, the time required to send a 32-bit value across the hypercube network in the CM-2, and the network latency on the iPSC/1 or J-Machine.
• The synchronization overhead (Ts) is the processing time required on a processor, a PE, or a processing node of a multicomputer for the purpose of synchronization.
o The sum Tc + Ts gives the total time required for IPC (interprocessor communication).
o The shared-memory Cray Y-MP had a short Tc but a long Ts. The SIMD machine CM-2 had a short Ts but a long Tc. The long latency of the iPSC/1 made it unattractive by fast-advancing standards. The MIT J-Machine was designed to make a major improvement in both of these communication delays.
• The grain size is measured by the execution time of a typical program, including both the computing time and the communication time involved. Supercomputers handle large-grain computations. Both the CM-2 and the J-Machine were designed as fine-grain machines. The iPSC/1 was a relatively medium-grain machine compared with the rest.
Fine grain leads to a much higher DOP but also to higher communication overhead.
SIMD machines used hardwired synchronization and massive parallelism to overcome the problems of long network latency and slow processor speed. Fine-grain multicomputers, like the J-Machine and Caltech Mosaic, were designed to lower both the grain size and the communication overhead compared to those of traditional multicomputers.
The architecture and building block of the MIT J-Machine, its instruction set, and system design considerations are described below.
The building block was the message-driven processor (MDP), a 36-bit microprocessor custom-designed for a fine-grain multicomputer.
J-Machine Architecture
Network addressing limited the size of the J-Machine to a maximum configuration of 65,536 nodes, corresponding to a three-dimensional mesh with 32 × 32 × 64 nodes. Hidden parts (nodes and links) are not shown, for purposes of clarity. Clearly, every node has a constant node degree of 6, and there are three rings crossing each node along the three dimensions. The end-around connections can be folded to balance the wire length on all channels.
The MDP chip included a processor, a 4096-word by 36-bit memory, and a built-in router with network ports. An on-chip memory controller with error checking and correction (ECC) capability permitted the local memory to be expanded to 1 million words by adding external DRAM chips. The processor was message-driven in the sense that it executed functions in response to messages via the dispatch mechanism. No receive instruction was needed.
• The MDP created a task to handle each arriving message. Messages carrying these tasks drove each
computation.
• The MDP was a general-purpose multicomputer processing node that provided the communication, synchronization, and global naming mechanisms required to efficiently support fine-grain, concurrent programming models.
• The grain size was as small as 8-word objects or 20-instruction tasks.
• Fine-grain programs typically execute from 10 to 100 instructions between communication and synchronization actions.
• MDP chips provided inexpensive processing nodes, available as plentiful VLSI commodity parts, with which to construct the Jellybean Machine (J-Machine) multicomputer.
• The MDP appeared as a component with a memory port, six two-way network ports, and a
diagnostic port.
• The memory port provided a direct interface to up to 1M words of ECC DRAM, consisting of 11 multiplexed address lines, a 12-bit data bus, and 3 control signals.
• The network ports connected MDPs together in a three-dimensional mesh network. Each of the ports corresponded to one of the six cardinal directions (+x, -x, +y, -y, +z, -z) and consisted of nine data and six control lines. Each port connected directly to the opposite port on an adjacent MDP.
• The diagnostic port could issue supervisory commands and read and write MDP memory from a
console processor (host). Using this port, a host could read or write at any location in the MDP's address space, as well as reset, interrupt, halt, or single-step the processor.
The chip included a conventional microprocessor with prefetch, control, register file and ALU (RALU), and memory blocks. The network communication subsystem comprised the routers and the network input and output interfaces. The arithmetic unit (ALU) provided addressing functions. The MDP also included a DRAM interface, a control clock, and a diagnostic interface.
Instruction-Set Architecture
Communication Support
• The MDP provided hardware support for end-to-end message delivery including formatting,
injection, delivery, buffer allocation, buffering, and task scheduling.
• The MDP transmitted a message using a series of SEND instructions, each of which injected one or two words into the network at either priority 0 or 1.
Consider the following MDP assembly code for sending a four-word message using three variants of the SEND
instruction.
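The original listing is not reproduced in these notes; based on the bullet-point description that follows, it would look roughly like the sketch below. The registers R0-R3 come from that description, while the exact operand syntax, the memory operand written here as [mem], and the trailing priority field (0 selecting priority 0) are assumptions rather than verified MDP syntax.

    SEND    R0,0          ; header: absolute <X, Y, Z> address of the destination node, taken from R0
    SEND2   R1,R2,0       ; first two words of the message, taken from R1 and R2
    SEND2E  R3,[mem],0    ; last two words: one from R3 and one from memory; the E form ends the message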
• The first SEND instruction reads the absolute address of the destination node in <X, Y, Z> format from R0 and forwards it to the network hardware.
• The SEND2 instruction reads the first two words of the message out of registers R1 and R2 and
enqueues them for transmission.
• The final instruction enqueues two additional words of data, one from R3 and one from memory.
• The use of the SEND2E instruction marks the end of the message and causes it to be transmitted into the
network.
• The J-Machine was a three-dimensional mesh with two-way channels, dimension-order routing, and blocking flow control.
• The MDP supported a broad range of parallel programming models, including shared memory, data parallel, dataflow, actor, and explicit message passing, by providing low-overhead primitive mechanisms for communication, synchronization, and naming.
• Its communication mechanisms permitted a user-level task on one node to send a message to any other node in a 4096-node machine in less than 2 microseconds.
• This process did not consume any processing resources on intermediate nodes, and it automatically
allocated buffer memory on the receiving node.
• On message arrival, the receiving node created and dispatched a task in less than 1 microsecond.
• Presence tags provided synchronization on all storage locations. Three separate register sets allowed fast
context switching.
• A translation mechanism maintained bindings between arbitrary names and values and supported a
global virtual address space.
• The J-Machine used wormhole routing and blocking flow control. A combining-tree approach was used
for synchronization.
• The routers formed the switches in a J-Machine network and delivered messages to their destinations.
• The MDP contained three independent routers.
• Each router contained two separate virtual networks with different priorities that shared the same
physical channels.
• The priority-1 network could preempt the wires even if the priority-0 network was congested or jammed.
• Each of the 18 router paths contained buffers, comparators, and output arbitration.
• On each data path, a comparator compared the head flit, which contained the destination address in that dimension, to the node coordinate. If the head flit did not match, the message continued in the current dimension; otherwise, it entered the next dimension.
• A message entering a dimension competed with messages continuing in that dimension at a two-to-one switch. Once a message was granted this switch, all other input was locked out for the duration of the message. Once the head flit of the message had set up the route, subsequent flits followed directly behind it. (With dimension-order routing, for example, a message first travels along x until its x coordinate matches, then along y, then along z before being delivered.)
Synchronization
The MDP synchronized using message dispatch and presence tags on all states. Because each message arrival
dispatched a process, messages could signal events on remote nodes. For example, in the following combining-
tree example, each COMBINE message signals its own arrival and initiates the COMBINE routine.
In response to an arriving message, the processor may set presence tags for task synchronization. For example,
access to the value produced by the combining tree may be synchronized by initially tagging as empty the
location that will hold this value. An attempt to read this location before the combining tree has written it will
raise an exception and suspend the reading task until the root of the tree writes the value.
A combining tree is shown in the figure. The tree sums results produced by a distributed computation. Each node sums the input values as they arrive and then passes a result message to its parent.
A pair of SEND instructions was used to send the COMBINE message to a node. Upon message arrival, the MDP buffered the message and created a task to execute the following COMBINE routine, written in MDP assembly code:
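The routine itself is not reproduced in these notes. The sketch below is purely illustrative: apart from the SEND/SEND2E instructions introduced earlier, the mnemonics, the addressing modes (A3 is assumed to point at the buffered message and A0 at the local combining node), and the field names VALUE, COUNT, PARENT, and HDR are assumptions chosen to mirror the description in the next paragraph, not actual MDP code.

    COMBINE:                           ; task dispatched on arrival of a COMBINE message
            MOVE    [1,A3],A0          ; load the combining-node pointer carried in the message
            MOVE    [2,A3],R1          ; load the arriving value carried in the message
            ADD     R1,[VALUE,A0],R1   ; add the value to the running sum
            MOVE    R1,[VALUE,A0]      ; store the new sum back into the node
            SUB     [COUNT,A0],1,R2    ; decrement the count of inputs still expected
            MOVE    R2,[COUNT,A0]
            BNZ     R2,DONE            ; inputs still missing: the task is done for now
            SEND    [PARENT,A0],0      ; count reached zero: address the parent node...
            SEND2E  [HDR,A0],R1,0      ; ...and send it a COMBINE message carrying the sum
    DONE:   SUSPEND                    ; end of task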
If the node was idle, execution of this routine began three cycles after message arrival. The routine loaded the combining-node pointer and value from the message, performed the required add and decrement, and, if Count reached zero, sent a message to its parent.
Research Issue
The J-Machine was an exploratory research project. Rather than being specialized for a single model of
computation, the MDP incorporated primitive mechanisms for efficient communication, synchronization and
naming. The machine was used as a platform for software experiments in fine-grain parallel programming.
Reducing the grain size of a program increases both the potential speedup due to parallel execution and the
potential overhead associated with parallelism. Special hardware mechanisms for reducing the overhead due to
communication, process switching, synchronization, and multithreading were therefore central to the design of the MDP. Software issues such as load balancing, scheduling, and locality also remained open questions.
The MIT research group led by Dally implemented two languages on the J-Machine: the actor language Concurrent Smalltalk and the dataflow language Id. The machine's mechanisms also supported dataflow and object-oriented programming models using a global name space. The use of a few simple mechanisms provided orders of magnitude lower communication and synchronization overhead than was possible with multicomputers built from then-available off-the-shelf microprocessors.
The Caltech Mosaic C was an experimental fine-grain multicomputer that employed single-chip nodes and advanced packaging technology to demonstrate the performance/cost advantages of fine-grain multicomputer architecture.
• The evolution from the Cosmic Cube to the Mosaic is one in which advances in technology are employed to reimplement nodes of a similar logical complexity that are faster and smaller, have lower power, and are less expensive.
• The progress in microelectronics over the preceding decade was such that Mosaic nodes were approximately 60 times faster, used approximately 20 times less power, were approximately 100 times smaller, and were (in constant dollars) approximately 25 times less expensive to manufacture than Cosmic Cube nodes.
Mosaic C Node
• The Mosaic C multicomputer node was a single 9.25 mm × 10.00 mm chip fabricated in a 1.2-µm-feature-size, two-level-metal CMOS process.
• At 5-V operation, the synchronous parts of the chip operated with large margins at a 30-MHz clock rate, and the chip dissipated approximately 0.5 W.
• The processor also included two program counters and two sets of general-purpose registers to allow
zero-time context switching between user programs and message handling.
• Thus, when the packet interface received a complete packet, received the header of a packet, completed the sending of a packet, exhausted the allocated space for receiving packets, or detected any of several other selectable events, it could interrupt the processor by switching it instantly to the message-handling context. Instead of several hundred instructions for handling a packet, the Mosaic typically required only about 10 instructions.
• The choice of a two-dimensional mesh for the Mosaic was based on a 1989 engineering analysis.
• Originally, a three-dimensional mesh network was planned.
• But the mutual fit of the two-dimensional mesh network and the circuit board medium provided high
packaging density and allowed the high-speed signals between the routers to be conveyed on shorter
wires.
• Sixty-four Mosaic chips were packaged by tape-automated bonding (TAB) in an 8 × 8 array on a circuit board.
• These boards allowed the construction of arbitrarily large, two-dimensional arrays of nodes using
stacking connectors.
• This style of packaging was meant to demonstrate some of the density, scaling, and testing advantages
of mesh-connected systems.
• Host-interface boards were also used to connect Mosaic arrays to workstations.
Charles Seitz determined that the most profitable niche and scaling track for the multicomputer, a highly scalable and economical MIMD architecture, was the fine-grain multicomputer. The Mosaic C demonstrated many of the advantages of this architecture, but the major part of the Mosaic experiment was to explore the programmability and application span of this class of machines.
(1) Single-chip nodes are a technologically attractive point in the design space of multicomputers. Constant-node-size scaling results in single-chip nodes of increasing memory size, processing capability, and communication bandwidth, in larger systems than centralized shared-memory multiprocessors.
(2) It was also forecast that constant-node-complexity scaling would allow a Mosaic 8 × 8 board to be implemented as a single chip, with about 20 times the performance per node, within 10 years.
Arvind and Iannucci identified memory latency and synchronization overhead as two fundamental issues in multiprocessing. Scalable multiprocessors must address the loss in processor efficiency in these cases. Using various latency-hiding mechanisms and multiple contexts per processor can make the conventional von Neumann architecture relatively expensive to implement, and only certain types of parallelism can be exploited efficiently.
The HEP/Tera computers offered an evolutionary step beyond the von Neumann architectures. Dataflow architectures represent a radical alternative to von Neumann architectures because they use dataflow graphs as their machine languages. Dataflow graphs, as opposed to conventional machine languages, specify only a partial order for the execution of instructions and thus provide opportunities for parallel and pipelined execution at the level of individual instructions.
Dataflow Graphs
We have seen a dataflow graph in Fig. 2.13. Dataflow graphs can be used as a machine language in dataflow computers. Another example of a dataflow graph (Fig. 9.31a) is given below.
Example
This dataflow graph shows how to obtain an approximation of cos x by the following power series computation.
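The series itself is not shown in these notes; the truncated power series presumably intended here (it needs exactly nine operators, matching the graph described next) is

    cos x ≈ 1 − x²/2! + x⁴/4! − x⁶/6!

Three multiplications generate x², x⁴, and x⁶ from x, three divisions apply the constant divisors 2!, 4!, and 6!, and three additions/subtractions accumulate the partial sums, giving the nine operator nodes.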
The corresponding dataflow graph consists of nine operators (actors or nodes). The edges in the graph
interconnect the operator nodes. The successive powers of x are obtained by repeated multiplications. The
constants (divisors) are fed into the nodes directly. All intermediate results are forwarded among the nodes.
• Static dataflow computers simply disallow more than one token to reside on any one arc; this is enforced by the firing rule.
• A node is enabled as soon as tokens are present on all input arcs and there is no token on any of its
output arcs.
• Jack Dennis proposed the very first static dataflow computer in 1974.
• The static firing rule is difficult to implement in hardware.
• Special feedback acknowledge signals are needed to ensure correct token passing between producing nodes and consuming nodes.
• Also, the static rule makes it very inefficient to process arrays of data. The number of acknowledge
signals can grow too fast to be supported by hardware.
• In dynamic dataflow computers each data token is tagged with a context descriptor, called a tagged token. The firing rule of tagged-token dataflow is changed to: a node is enabled as soon as tokens with identical tags are present at each of its input arcs. (The tag typically identifies the loop iteration or procedure activation that produced the token, so tokens belonging to different iterations can coexist on the same arc without being confused.)
• With tagged tokens, tag matching becomes necessary. Special hardware mechanisms are needed to
achieve this. In the rest of this section, we discuss only dynamic dataflow computers.
• Arvind of MIT pioneered the development of tagged token architecture for dynamic dataflow
computers.
Although data dependence does exist in dataflow graphs, it does not force unnecessary sequentialization, and dataflow computers schedule instructions according to the availability of the operands. Conceptually, tokens carrying values flow along the edges of the graph. Values or tokens may be memory locations. Each instruction waits for tokens on all inputs, consumes the input tokens, computes output values based on the input values, and produces tokens on its outputs. No further restriction on instruction ordering is imposed. No side effects are produced by the execution of instructions in a dataflow computer. Both dataflow graphs and machines implement only functional languages.
• The MIT Tagged Token Dataflow Architecture (TTDA) (Arvind et al., 1983), the Manchester Dataflow Computer (Gurd and Watson, 1982), and the ETL Sigma-1 (Hiraki and Shimada, 1987) were all pure dataflow computers.
• The TTDA was simulated but never built. The Manchester machine was actually built and became operational in mid-1982.
• The ETL Sigma-1 was developed at the Electrotechnical Laboratory, Tsukuba, Japan. It consisted of 128 PEs, fully synchronous with a 10-MHz clock. It implemented the I-structure memory proposed by Arvind. The full configuration became operational in 1987 and achieved a 170-Mflops performance. The major problem in using the Sigma-1 was the lack of a high-level language for users.
• Explicit-token-store machines were successors to the pure dataflow machines. The basic idea was to eliminate associative token matching: the waiting-token memory is directly addressed, with the use of full/empty bits. This idea was used in the MIT/Motorola Monsoon and in the ETL EM-4 system.
• Multithreading was supported in Monsoon using multiple register sets. Thread-based programming was conceptually introduced in Monsoon. The maximum configuration built consisted of eight processors and eight I-structure memory modules using an 8 × 8 crossbar network. It became operational in 1991.
• EM-4 was an extension of the Sigma-1. It was designed for 1024 nodes, but only an 80-node prototype became operational in 1990. The prototype achieved 815 MIPS in an 80 × 80 matrix multiplication benchmark.
• Hybrid architectures combine positive features from the von Neumann and dataflow architectures. The best research examples include the MIT P-RISC, the IBM Empire, and the MIT/Motorola *T.
• P-RISC was a "RISC-ified" dataflow architecture. It allowed tighter encodings of the dataflow graphs and produced longer threads for better performance. This was achieved by splitting "complex" dataflow instructions into separate "simple" component instructions that could be composed by the compiler. It used traditional instruction sequencing. It performed all intraprocessor communication via memory and implemented "joins" explicitly using memory locations.
• P-RISC replaced some of the dataflow synchronization with conventional program-counter-based synchronization. IBM Empire was a von Neumann/dataflow hybrid architecture under development at IBM, based on the thesis of Iannucci. The *T was a later effort at MIT joining the dataflow and von Neumann ideas; it is discussed below under The MIT/Motorola *T Prototype.
The ETL EM-4 Architecture
The internal design of the EM-4 processor chip and of the node memory is shown in Fig. 9.32a.
• The processor chip communicated with the network through a 3 × 3 crossbar switch unit.
• The processor and its memory were interfaced with the memory control unit. The memory was used to
hold programs (template segments) as well as tokens (operand segments, heaps, or frames) waiting to
be fetched.
• The processor consisted of six component units. The input buffer was used as a token store with a
capacity of 32 words. The fetch match unit fetched tokens from the memory and performed tag-
matching operations among the tokens fetched in. Instructions were directly fetched from the memory
through the memory controller.
The heart of the processor was the execution unit, which fetched instructions until the end of a thread. Instructions with matching tokens were executed. Instructions could emit tokens or write to registers. Instructions were fetched continually using traditional sequencing (PC + 1 or branch) until a "stop" flag was raised to indicate the end of a thread; then another pair of tokens was accepted. Each instruction in a thread specified the two sources for the next instruction in the thread.
Fig. 9.32a
Fig. 9.32b
The same idea was used as in Monsoon for token matching, but with different encoding. All data tokens were 32 bits, and instruction words were 33 bits. EM-4 supported remote loads and synchronizing loads. The full/empty bits present in memory words were used to synchronize remote loads associated with different threads.
The MIT/Motorola *T Prototype
Short messages were handled by the synchronization processor (sP); thus the dP (data processor) would not be disrupted by short messages. The memory controller handled requests for remote memory load or store, as well as the management of the node memory (64 Mbytes).
The network interface unit received or transmitted messages from or to the network, respectively, as illustrated in Fig. 9.33c. It should be noted that the sP was built as an on-chip SFU (special function unit) of the dP.