Instruction Level Parallelism Through Microthreading - A Scalable Approach To Chip Multiprocessors
Most microprocessor chips today use an out-of-order instruction execution mechanism. This mechanism
allows superscalar processors to extract reasonably high levels of instruction level parallelism (ILP).
be impossible to eliminate all dependencies between threads and hence synchronization is also required. The goal of this work therefore is to define a feasible architecture for a scalable CMP that is easy to program, that maximizes throughput for a given technology and that minimizes the communication and synchronization overheads between different threads. It should be noted that in this work the term thread is used to describe very small code fragments with minimal context.
Today Intel's Itanium-2 (Madison) microprocessor features over 410 million transistors in a 0.13 µm semiconductor process technology operating at a speed of 1.5 GHz. This is a dual-processor version of the previous Itanium processor (McKinley), which has an issue width of six. Moore's law would indicate that the billion-transistor chip will become feasible in 65 nm technology within the next 3 or 4 years [7]. Intel expects that the Itanium processor will reach 1.7 billion transistors at the end of 2005. The questions we must ask are where do we go from here and what is the best way to utilize this wealth
processors due to tag matching, wake-up signals to waiting instructions and selection mechanisms for issuing instructions. These delays increase quadratically for most building blocks with the instruction window size [12]. Finally, even with the implementation of a large instruction window, it is difficult for processors to find sufficient fine-grain parallelism, which has made most chip manufacturers like Compaq, Intel and Sun look at simultaneous multi-threading (SMT) [13] to expose more instruction level parallelism (ILP) through a combination of coarse- and fine-grain parallelism.
Future microprocessor technology needs efficient new architectures to achieve the demands of many grand-challenge applications, such as weather and environmental modelling, computational physics (on both sub-atomic and galactic scales) and biomolecular simulation. In the near future, it should be possible to integrate thousands of arithmetic units on a single chip [14] but out-of-order execution is not a good candidate, because of the limited concurrency it exposes and the non-
model, instructions are issued in-order from any of the threads allocated to it but the schedule of instructions executed is non-deterministic, being determined by data availability. Threads can be deterministically distributed to multiple pipelines based on a simple scheduling algorithm. The allocation of these threads is dynamic, being determined by resource availability, as the concurrency exposed is parametric and not limited by the hardware resources. The instruction issue schedule is also dynamic and requires linear hardware complexity to support it. Instructions can be issued from any microthread already allocated and active. If such an approach could also give a linear performance increase with the number of pipelines used, then it can provide a solution to both CMP and ILP scalability [19].
The performance of the microthreaded model is compared with a conventional pipeline in [20] for a variety of memory latencies. The results show that the microthreaded microprocessor always provides a superior performance to that of a non-threaded pipeline and consistently shows a higher IPC
or future file of check points and repairs is also required to re-order the completion of instructions before committing their results to the registers specified in the ISA in order to achieve a sequentially consistent state on exceptions.
Control speculation predicts branch targets based on the prior history for the same instruction. Execution continues along this path as if the prediction was correct, so that when the actual target is resolved, a comparison with the predicted target will either match, giving a correctly predicted branch, or not, in which case there was a misprediction. A misprediction can require many pipeline cycles to clean up and, in a wide-issue pipeline, this can lead to hundreds of instruction slots being unused, or to be more precise, if we focus on power, being used unnecessarily. Both prediction and cleaning up a misprediction require additional hardware, which consumes extra area and also power, often to execute instructions whose results are never used. It can therefore be described as wasteful of chip resources and, moreover, has unpredictable performance
the number of no-op instructions required is not known and most VLIW compilers will schedule load instructions using the cache-hit latency rather than the maximum latency. This means that the processor will stall on every cache miss. The alternative of scheduling all loads with the cache miss latency is not feasible for most programs because the maximum latency may not be known due to bus contention or memory port delays, and it also requires considerable ILP. This problem with non-determinism in cache access limits VLIW to cacheless architectures unless speculative solutions are embraced. This is a significant problem with modern technology, where processor speeds are significantly higher than memory speeds [19]. Also pure VLIW architectures are not good for general purpose applications, due to their lack of compatibility in binary code [23]. The most significant use of VLIW therefore is in embedded systems, where these constraints are both solved (i.e. single applications using small fast memories). A number of projects described below have attempted to apply speculation to VLIW in order
architectural support for control and data speculation through predicated instruction execution and binding pre-fetches of data into cache. In this architecture each operation is guarded by one of the predicate registers, each of which stores 1 bit that determines whether the results of the instruction are required or not. Predication is a form of delayed branch control and this bit is set based on a comparison operation. In effect, instructions are executed speculatively but state update is determined by the predicate bit so an operation is completed only if the value of its guard bit is true, otherwise the processor invalidates the operation. This is a form of speculation that executes both arms of some branches concurrently but this action restricts the effective ILP, depending on the density and nesting of branches.
Pre-fetching is achieved in a number of ways. For example, by an instruction identical to a load word instruction that does not perform a load but touches the cache and continues, setting in motion any required transactions and cache misses up the
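The guarded execution described above can be illustrated with a short sketch in C; the predicate, both arms and the final selection are modelled in ordinary code and are not drawn from any particular predicated ISA.

    #include <stdio.h>

    /* Both arms of the branch are evaluated, but only the operation whose
     * one-bit guard is true is allowed to commit its result; the other is
     * invalidated.  This mirrors the guarded execution described above.   */
    static int predicated_select(int a, int b)
    {
        int p = (a < b);      /* comparison operation sets the predicate bit */
        int t = a * 2;        /* arm guarded by  p                           */
        int f = b + 1;        /* arm guarded by !p                           */
        return p ? t : f;     /* state update gated by the guard bit         */
    }

    int main(void)
    {
        printf("%d\n", predicated_select(3, 5));  /* predicate true: prints 6 */
        return 0;
    }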
this approach is to expose a greater concurrency explicitly by the compiler.
The global control unit used in this architecture distributes the tasks among multiple parallel execution units. Each execution unit can fetch and execute only the instructions belonging to its assigned task. So, when a task misprediction is detected, all execution units between the incorrect speculation point and the later task are squashed [30]. Like superscalar, this can result in many wasted cycles, however, as the depth of speculation is much greater, the unpredictability in performance is correspondingly wider.
The benefit of this architecture over a superscalar architecture is that it provides more scalability. The large instruction window is divided into smaller instruction windows, one per processing unit, and each processing unit searches a smaller instruction window for independent instructions. This mitigates the problems of scaling instruction issue with issue width. The multiple tasks are derived from loops and function
described in this paper, although the processor was designed as a component of a large multi-computer and not as a general-purpose chip. The interleaved approach requires a large concurrency in order to maintain efficient pipeline utilization, as it must be filled with instructions from independent threads. Unlike the earlier approaches, Tera avoids this requirement using something called explicit-dependence lookahead, which uses an instruction tag of 3 bits that specifies how many instructions can be issued from the same stream before encountering a dependency on it. This minimizes the number of threads required to keep the pipeline running efficiently, which is ~70 in the case of memory accesses. It will be seen that microthreading uses a different approach that maintains full backward compatibility in the ISA, as well as in the pipeline structure.
Unlike IMT, which usually draws concurrency from ILP and loops, BMT usually exploits regular software threads. There have been many BMT proposals, see [17] and even some
the degree and characteristics of the parallelism. Executing multiple processes or threads in parallel is the most common way to extract high levels of parallelism but this requires concurrency in the source code of an application. Previous research has demonstrated that a CMP with four 2-issue processors will reach a higher utilization than an 8-issue superscalar processor [17]. Also, work described in [3] shows that a CMP with eight 2-issue superscalar processors would occupy the same area as a conventional 12-issue superscalar. The use of CMPs is a very powerful technique to obtain more performance in a power-efficient manner [38]. However, using superscalar processors as a basis for CMPs, with their complex issue window, large on-chip memory, large multi-ported register file and speculative execution, is not such a good strategy because of the scaling problems already outlined. It would be more efficient to use simpler in-order processors and exploit more concurrency at the CMP level, provided that this can be utilized by a sufficiently wide range of applications. This
in terms of reducing the number of registers or minimizing the number of read or write ports [14, 15, 41, 42].
Work done in [41] describes a bypass scheme to reduce the number of register file read ports by avoiding unnecessary register file reads for the cases where values are bypassed. In this scheme an extra bypass hint bit is added to each operand of instructions waiting in the issue window and a wake-up mechanism is issued to reduce register file read ports. As described in [43], this technique has two main problems. First, the scheme is only a prediction, which can be incorrect, requiring several additional repair cycles for recovery on misprediction. Second, because the bypass hint is not reset on every cycle, the hint is optimistic and can be incorrect if the source instruction has written back to the register file before the dependent instruction is issued. Furthermore, an extra pipeline stage is required to determine whether to read data operands from the bypass network or from the register file.
and explicitly routed network, operations must be scheduled by the compiler and routing information must also be generated with code for each processor in order to route results from one processor's register file to another. Although it may be possible to program streaming applications using such a model, in general, concurrency and scheduling cannot be defined statically.
Other previous work has described a distributed register file configuration [44] where a fully distributed register file organization is used in a superscalar processor. The architecture exploits a local register mapping table and a dedicated register transfer network to implement this configuration. This architecture requires an extra hardware recopy unit to handle the register file dispatch operations. Also, this architecture suffers from a delay penalty as the execution unit of an instruction that requires a value from a remote register file must stall until it is available. The authors have proposed an eager transfer mechanism to reduce this penalty, but this still suffers from an
read. However, it significantly increases the efficiency of the pipeline, especially when a large number of thread suspensions occur together, when the model resembles that of an IMT architecture. Only when the compiler can define a static schedule are instructions from the same thread scheduled in BMT mode. Exceptions to this are cache misses, iterative operations and inter-thread communications. There is one other situation where the compiler will flag a context switch and that is following any branch instruction. This allows execution to proceed non-speculatively, eliminates the branch prediction and cleanup logic and fills any control hazard bubbles with instructions from other threads, if any are active.
The model is defined incrementally and can be applied to any RISC or VLIW instruction set. See Table 1 for details of the extensions to an instruction set required to implement the concurrency controls required by this model. The incremental nature of the model allows a minimal backward compatibility, where existing binary code can execute unchanged on the
(Figure: register file and communication structure, showing the broadcast bus and arbiter, a ring to nearest neighbours, shared memory, and ports for register read, register write, remote register access, initialization and allocation.)
4.2.1. Thread creation
The microthreaded model defines explicit and parametric concurrency using the Cre instruction. This instruction broadcasts a pointer to the TCB and to all processors assigned to the current context; see [47] for details of dynamic processor allocation. The TCB contains parameters, which define a family of threads, i.e. the set of threads representing the loop body and the triple defining the loop. It also defines the dynamic resources required by each thread (its microcontext) in terms of local and shared registers. For loops which carry a dependency, it also defines a dependency distance, and optionally, pointers to a preamble and/or postamble thread, which can be used to set up and/or terminate a dependency chain. The dependency distance is a constant offset in the index space which defines regular loop-carried dependencies. A family of threads can be created without requiring a pipeline slot, as the create instruction is executed concurrently with a regular instruction in the IF stage of the pipeline. The TCB for our current work on implementation overheads is defined in Table 2.

TABLE 2. TCB containing parameters that describe a family of microthreads.
Threads:      Cardinality of the set of threads representing an iteration
Dependency:   Iteration offset for any loop-carried dependencies, e.g. a[i] := . . . a[i - d]
Preambles:    Number of iterations using preamble code
Postambles:   Number of iterations using postamble code
Start:        Start of loop index value
Limit:        Limit of loop index value
Step:         Step between loop indices
Locals:       Number of local registers dynamically allocated per iteration
Shareds:      Number of shared registers dynamically allocated per iteration
Pre-pointer:  One pointer per thread in set for preamble code
Main-pointer: One pointer per thread in set for main loop-body code
Post-pointer: One pointer per thread in set for postamble code
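The parameters of Table 2 can be collected into a single record, as in the following C sketch; the field names and widths are assumptions made for illustration rather than the layout used in our implementation.

    #include <stdint.h>

    /* Sketch of a thread control block (TCB) describing a family of
     * microthreads, following the parameters listed in Table 2.            */
    typedef struct {
        uint32_t  threads;      /* cardinality of the set of threads         */
        uint32_t  dependency;   /* iteration offset of loop-carried deps     */
        uint32_t  preambles;    /* number of iterations using preamble code  */
        uint32_t  postambles;   /* number of iterations using postamble code */
        int32_t   start;        /* start of loop index value                 */
        int32_t   limit;        /* limit of loop index value                 */
        int32_t   step;         /* step between loop indices                 */
        uint8_t   locals;       /* local registers allocated per iteration   */
        uint8_t   shareds;      /* shared registers allocated per iteration  */
        uintptr_t pre_pointer;  /* pointer to preamble code                  */
        uintptr_t main_pointer; /* pointer to main loop-body code            */
        uintptr_t post_pointer; /* pointer to postamble code                 */
    } tcb_t;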
A global scheduling algorithm determines which iterations will execute on which processors. This algorithm is built into the local scheduler but the parameters in the TCB and the number of processors used to execute the family may be dynamic. The concurrency described by this instruction is therefore parametric and may exceed the resources available in terms of registers and thread slots in the continuation queues. The register allocation unit in each local scheduler maintains the allocation state of all registers in each register file and this controls the creation of threads at a rate of one per pipeline cycle. Once allocated to a processor a thread runs to completion, i.e. until it encounters a Kill instruction and then terminates.
A terminated thread releases its resources so long as any dependent thread has also terminated. To do so before this may destroy data that has not yet been read by the dependent thread. Note that microthreads are usually (but not exclusively) very short sequences of instructions without internal loops.

4.2.2. Context-switching
The microthreaded context switching mechanism is achieved using the Swch instruction, which is acted upon in the first stage of the pipeline, giving a cycle-by-cycle interleaving if
necessary. When a Swch instruction is executed, the IF stage reads the next instruction from another ready thread, whose state is passed to the IF stage as a result of the context switch. As this action only requires the IF stage of the pipeline, it can be performed concurrently with an instruction from the base ISA, so long as the Swch instruction is pre-fetched with it.
The context switching mechanism is used to manage both control and data dependencies. It is used to eliminate control dependencies by context switching following every transfer of control, in order to keep the pipeline full without any branch prediction. This has the advantage that no instruction is executed speculatively and consequently, power is neither dissipated in making a prediction nor in executing instructions on the wrong dynamic path. Context switching also eliminates bubbles in the pipeline on data dependencies that have non-deterministic timing, such as loads from memory or thread-to-thread communication. Context switching provides an arbitrarily large tolerance to latency, determined by the size of the local
processor when managing loop-carried dependencies. This implements a scalable and distributed shared-register model between processors without using a single, multi-ported register file, which is known to be unscalable.
The use of dataflow synchronization between threads enables a policy of conservative instruction execution to be applied. When no microthreads are active because all are waiting on external events, such as load word requests, the pipeline will stall and, if the pipe is flushed completely, the scheduler will stop clocks and power down the processor, going into a standby mode in which it consumes minimal power. This is a major advantage of data-driven models. Conservative instruction execution policies conserve power in contrast to the eager policies used in out-of-order issue pipelines, which have no mechanisms to recognize such a conjunction of schedules. This will have a major impact on power conservation and efficiency.
Context switching and successful synchronization have no
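The behaviour of the fetch stage on a Swch can be sketched as follows; the continuation-queue representation and function names are assumptions made only for illustration and do not describe the implemented scheduler.

    #include <stddef.h>

    /* When the fetched instruction carries the Swch flag, the IF stage takes
     * the state of another ready thread from the continuation queue; if no
     * thread is ready the pipeline stalls and, once drained, the clocks can
     * be stopped until an asynchronous event reactivates a thread.          */
    typedef struct {
        unsigned pc;            /* program counter of the thread             */
        unsigned context_base;  /* base address of its micro-context         */
    } thread_state_t;

    #define QUEUE_SLOTS 64
    static thread_state_t ready_queue[QUEUE_SLOTS];  /* circular ready queue */
    static size_t head = 0, tail = 0;

    int fetch_next(thread_state_t *current, int swch_flagged)
    {
        if (!swch_flagged)
            return 1;                          /* keep fetching this thread  */
        if (head == tail)
            return 0;                          /* no ready thread: stall     */
        *current = ready_queue[head];          /* interleave another thread  */
        head = (head + 1) % QUEUE_SLOTS;
        return 1;
    }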
4.2.4. Thread termination
Thread termination in the microthreaded model is achieved through a Kill instruction, which of course causes a context switch as well as updating the microthread's state to killed. The resources of the killed thread are released at this stage, unless there is another thread dependent upon it, in which case its resources will not be released until the dependent thread has also been killed. (Note that this is the most conservative policy and more efficient policies may be implemented that detect when all loop-carried dependencies have been satisfied.)

4.3. Scalable instruction issue
Current microprocessors attempt to extract high levels of ILP by issuing independent instructions out of sequence. They do this most successfully by predicting loop branches and unrolling multiple iterations of a loop within the instruction window. The problem with this approach has already been described;
the dependency to be resolved before being able to issue any new instructions. In effect the instruction window in a microthreaded model is distributed to the whole of the architectural register set and only one link in the dependency graph for each fragment of code is ever exposed simultaneously. Moreover, no speculation is ever required and consequently, if the schedules are such that all processors would become inactive, then this state can be recognized and used to power down the processors to conserve energy.
Compare this to the execution in an out-of-order processor, where instructions are executed speculatively regardless of whether they are on the correct execution path. Although predictions are generally accurate in determining the execution path in loops, if the code within a loop contains unpredictable, data-dependent branches, this can result in a lot of energy being consumed for no useful work. Researchers now talk about 'breaking the dependency barrier' using data in addition to control speculation but what does this mean? Indices can be
On a context switch or kill, the instruction fetch stage is provided with the state of a new thread if any are active, otherwise the pipeline stalls for a few cycles to resolve the synchronization and if it fails, the pipeline simply stops. This action is simple, requires no additional flush or cleanup logic and most importantly, is conservative in its use of power. Note that by definition, when no local threads are active, the synchronization event has to be asynchronous and hence does not require any local clocks.
The state of a thread also includes its program counter, the base address of its micro-context and the base address and location of any micro-contexts it is dependent upon. The state also includes an implicit slot number, which is the address of the entry in the continuation queue and which uniquely identifies the thread on a given processor. The last field required is a link field, which holds a slot number for building linked lists of threads to identify empty slots, ready queues and an arbitrary number of continuation queues that support multiple
access patterns have the characteristics that they are written to infrequently but read from frequently. The address space in a conventional RISC ISA is partitioned so that the lower 16 registers form this global window. These are statically allocated for a given context and every thread can read and/or write to them. Note that the main thread has 32 statically allocated registers, 16 of which are visible to all microthreads as globals and 16 of which are visible only to the main thread. Each thread sees 32 registers. The lower 16 of these are the globals and these are shared by all threads and those in the upper half are local to a given thread.
The upper 16 registers are used to address the microcontext of each iteration in a family of threads. As each iteration shares common code, the address of each micro-context in the register file must be unique to that iteration. As we have seen, the base address of a thread's micro-context forms a part of its state. This immediately gives a means of implementing a distributed, shared-register model. We need to know the processor on which
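The register addressing implied by this partitioning can be sketched in C as follows; the division of the upper half of the window between a thread's own registers and its dependent window is shown only schematically and is an assumption of the sketch, not a statement of the implemented layout.

    #include <stdint.h>

    /* Translation from an architectural register number (0..31, as seen by a
     * microthread) to a location in the distributed register file.  The lower
     * 16 registers form the statically allocated global window ($G); the
     * upper 16 are addressed relative to the thread's dynamically allocated
     * micro-context.                                                         */
    typedef struct {
        uint32_t context_base;  /* base of this thread's own micro-context    */
        uint32_t dep_base;      /* base of the micro-context it depends upon  */
        uint32_t own_regs;      /* registers of the upper half that are local */
    } microcontext_t;

    uint32_t physical_register(const microcontext_t *mc, unsigned arch_reg)
    {
        if (arch_reg < 16)
            return arch_reg;                            /* $G: global window  */
        unsigned offset = arch_reg - 16;
        if (offset < mc->own_regs)
            return mc->context_base + offset;           /* $L/$S: own context */
        return mc->dep_base + (offset - mc->own_regs);  /* $D: remote context */
    }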
example, A[i] := . . . A[i - k] . . . where k is invariant of the loop. Such dependencies normally act as a deterrent to loop vectorization or parallelization but this is not so in this model, as the independent instructions in each loop can execute concurrently. This is the same ILP as is extracted from an out-of-order model.
Consider now the implementation of this basic model. It is straightforward to distribute the global register window and its characteristics suggest a broadcast bus as being an appropriate implementation. This requires that all processors executing a family of microthreads be defined prior to any loop invariants being written (or re-written) to the global window. The hardware then traps any writes to the global window and replicates the values using the broadcast bus to the corresponding location in all processors' global windows. As multiple threads may read the values written to the global register window, registers must support arbitrarily large continuation queues, bounded above only by the number of
thread's state must therefore contain the base address of its own micro-context for local reads and also the base address of any micro-context it is dependent upon. In the base-level model we present, only one other micro-context is accessed, at a constant offset in the index space.
In the second case, the producer and consumer iterations are scheduled to different processors. Now, the consumer's read to the D window will generate a remote request to the processor on which the producer iteration is running. Whereas in the first case a micro-context's D window is not physically allocated, in this second case it must be. It is used to cache a local copy of the remote micro-context's S window. It is also used to store the thread continuation locally. The communication is again asynchronous and independent of the pipeline operation. The consumer thread is suspended on its read to the D window until the data arrives from the remote processor. For this constant strided communication, iteration schedules exist that require only nearest neighbour communication in a ring
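A loop with the kind of constant-distance dependency referred to above is shown below; the kernel itself is a made-up example used only to illustrate the index relationship between producer and consumer iterations.

    /* A loop with a regular loop-carried dependency of constant distance K:
     * iteration i consumes the value produced by iteration i - K, so the
     * consumer iteration reads its $D window from the micro-context of the
     * producer iteration, while all other instructions of the different
     * iterations remain free to execute concurrently.                       */
    #define N 1024
    #define K 4                        /* dependency distance, loop invariant */

    void recurrence(double a[N], const double b[N])
    {
        for (int i = K; i < N; i++)
            a[i] = a[i - K] + b[i];    /* a[i] := ... a[i - k] ...            */
    }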
FIGURE 3. Microthreaded CMP architecture, showing communication structures and clocking domains.
this case a break instruction acquires the bus and terminates all other threads, allowing the winner to write its results back to the global state of the main context.
The second communication system is the shared-register ring network, which is used by processors to communicate results along a dependency chain. For the mode described, this requires only local connectivity between independently clocked processors.
All global communication systems are decoupled from the operation of the microthreaded pipeline and thread scheduling provides latency hiding during the remote access. This technique gives a microthreaded CMP a serious advantage as a long-term solution to silicon scaling.

5. ANALYSIS OF REGISTER FILE PORTS

In this section, an analysis of microthreaded register file ports is made in terms of the average number of accesses to each port of the register file in every pipeline cycle. This analysis is based on the hand compilation of a variety of loop kernels. The loops considered included a number of Livermore kernels, some of which are independent and some of which contain loop-carried dependencies. The set also includes both affine and non-affine loops, vector and matrix problems, and a recursive doubling algorithm. We have used loop kernels at this stage as we currently have no compiler to compile complete benchmarks. However, as the model only gains speedup via loops, we have chosen a broad set of representative loops from scientific and other applications. Analysis of complete programs and other standard benchmarks will be undertaken when a compiler we are developing is able to generate microthreaded code.
The results are based on a static analysis of the accesses to various register windows and investigate the average traffic on the microthreaded register file ports. The five types of register file ports are shown in Figure 2 and include pipeline ports (read, R, and write, W), the initialization port (I), the shared-dependent ports (Sd), the broadcast port (Br) and the write port that is required in the case of a cache miss (Wm). The goal of this analysis is to guide the implementation parameters of such a system. We aim to show that all accesses other than the synchronous pipeline ports can be implemented by a pair of read and write ports, with arbitration between the different sources. In this case a register file with five fixed ports would be sufficient for each of the processors in our CMP design.
The microthreaded pipeline uses three synchronous ports. These ports are used to access three classes of register windows, i.e. the $L, $S and $G register windows. If we assume that the average number of reads to the pipeline ports in each cycle is R and the average number of writes to the pipeline port in each cycle is W, then these values are defined by the following equations, where N_e is the total number of instructions executed.

R = \sum_{inst.} ( Read($L) + Read($S) + Read($G) ) / N_e        (1)

W = \sum_{inst.} ( Write($L) + Write($S) + Write($G) ) / N_e        (2)
TABLE 3. Average number of accesses to each class of register file port over a range of loop kernels, m = problem size.

Loop                             Ne                      R                          W                        I        Br                Sd
A: Partial Products              3m                      (4m - 3)/Ne                (2m - 1)/Ne              0.333    0                 (m - 1)/(M * Ne)
B: 2-D SOR                       5m - 2                  (8m - 15)/Ne               (4m - 4)/Ne              0.2      0                 (m - 2)/(M * Ne)
L3: Inner Product                4m + 4                  (5m + 3)/Ne                (4m + 1)/Ne              0.25     0                 m/(M * Ne)
L4: Banded Linear Equation       3m + 34                 (2.4m + 37.4)/Ne           (3m + 22)/Ne             0.2      4n/Ne             (1.8m + 1.8)/(M * Ne)
L5: Tri Diagonal Elimination     5m + 3                  7m/Ne                      4m/Ne                    0.25     0                 (m - 1)/(M * Ne)
L6: General Linear Recurrence    2.5m + 6.5m^1/2 - 5     (5.5m + 2.5m^1/2 - 5)/Ne   (3m + m^1/2 - 2)/Ne      0.1429   (m^1/2 - 1)n/Ne   (0.5m - 0.5m^1/2)/(M * Ne)
C: Pointer Chasing               14m + 5                 (9m + 3)/Ne                (6m + 2)/Ne              0.0714   n/Ne              m/(M * Ne)
The initialization port on the other hand is used in register allocation to initialize $L0 to the loop index. This port is accessed once when each iteration is allocated to a processor and so the average number of accesses to this port is constant and equal to the inverse of the number of instructions executed by the thread before it is killed, n_o. Therefore, if I is the average number of accesses to the initialization port per cycle, we can say that:

I = 1 / n_o        (3)

A dependent read to a remote processor uses a read port on the remote processor and a write port on the local processor, as well as a read to the synchronous pipeline port on the local processor. The average number of accesses to these ports per cycle is dependent on the type of scheduling algorithm used. If we use modulo scheduling, where M consecutive iterations are scheduled to one processor, then interprocessor communication is minimized. An equation for dependent reads and writes is given based on modulo scheduling, although we consider only the worst-case scenario. The average number of accesses per cycle to the dependent window is given below by Sd using the following equation, where M is the number of consecutive threads scheduled to one processor and N_e is the total number of instructions executed. It is clear that the worst case is where M = 1, i.e. iterations are distributed one per processor in a modulo manner.

Sd = \sum_{inst.} Read($D) / (M N_e)        (4)

The global write port is used to store data from the broadcast bus to the global window in every processor's local register file. If we assume that the average number of accesses per cycle to this port is Br, then Br can be obtained from the following equation, where N_e is the total number of instructions executed and n is the number of processors in the system. The result is proportional to the number of processors, as one write instruction will cause a write to every processor in the system.

Br = \sum_{inst.} Write($G) * n / N_e        (5)

Finally, the frequency of accesses to the port that is required for the deferred register write in the case of a cache miss can also be obtained. It is parameterized by cache miss rate in this static analysis and again we look at the worst case (100% miss rate). The average number of writes per cycle to the cache-miss port is given by Wm, which is given by the formula below, where Lw is the number of load instructions in each thread body, n_o is the number of instructions executed per thread body, and Cm is the cache miss rate. Again the average access to this port is constant for a given miss rate.

Wm = (Lw / n_o) * Cm        (6)

Table 3 shows the average number of accesses to each class of register file port over a range of loop kernels using the above formulae. The first seven kernels are dependent loops, where the dependencies are carried between iterations using registers. The last three are independent loops, where all iterations of the loop are independent of each other.
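The port access rates defined by Equations (1)-(6) can be computed directly from the static counts obtained by hand compilation; the following C sketch does so, with M, the number of processors n and the cache miss rate Cm passed in as arguments (the analysis above uses the worst cases M = 1 and a 100% miss rate).

    #include <stdio.h>

    /* Average accesses per pipeline cycle to each class of register file
     * port, following Equations (1)-(6).  The counts are static totals
     * obtained from hand compilation of a kernel.                         */
    typedef struct {
        double pipeline_reads;   /* total reads of $L, $S and $G             */
        double pipeline_writes;  /* total writes of $L, $S and $G            */
        double reads_D;          /* total reads of the $D window             */
        double writes_G;         /* total writes of the $G window            */
        double Ne;               /* total instructions executed              */
        double no;               /* instructions executed per thread body    */
        double Lw;               /* load instructions per thread body        */
    } kernel_counts_t;

    void port_rates(const kernel_counts_t *k, double M, double n, double Cm)
    {
        double R  = k->pipeline_reads  / k->Ne;     /* (1) pipeline read port  */
        double W  = k->pipeline_writes / k->Ne;     /* (2) pipeline write port */
        double I  = 1.0 / k->no;                    /* (3) initialization port */
        double Sd = k->reads_D / (M * k->Ne);       /* (4) shared-dependent    */
        double Br = k->writes_G * n / k->Ne;        /* (5) broadcast port      */
        double Wm = (k->Lw / k->no) * Cm;           /* (6) cache-miss write    */
        printf("R=%.3f W=%.3f I=%.3f Sd=%.3f Br=%.3f Wm=%.3f\n",
               R, W, I, Sd, Br, Wm);
    }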
(Figures 4-7: average accesses per cycle to the initialization (I), broadcast (B) and S/D ports, and to all write ports combined, plotted against normalised problem size (m/n).)
As described previously, each of the distributed register files has four sources for write accesses in addition to the pipeline ports. These are for the $G write, the initialization write, the $D return data and the write to the port that supports decoupled access to memory on a cache miss. Our analysis shows that the average accesses from these sources is much less than one access per cycle over all analysed loop kernels. This is shown in Figures 4-7, where accesses to initialization (I), broadcast (Br) and the network ports (Sd, shown as S/D) are given.
The four figures illustrate the scalability of the results (from n = 4 to n = 256 processors). Results are plotted against the normalized problem size, where m is the size of the problem in terms of the number of iterations, although not all iterations are executed concurrently in all codes (for example the recursive doubling algorithm has a sequence of concurrent loops varying by powers of 2 from 2 to m/2). Normalized problem size is therefore a measure of the number of iterations executed per processor. It can be seen that only accesses from the broadcast bus increase with the number of processors and even this is only significant where few iterations are mapped to each processor. Even in the case of 256 processors, providing we schedule more than a

of iterations (64K). Clearly this demonstrates the schedule

TABLE 4. Average number of accesses to all additional write ports for different number of processors, m/n = 8.
were set to match the cache line size and hence maximize hit rate.
The second set of results was undertaken to illustrate what effect, if any, the D-cache had on performance. In this set, the D-cache parameters were altered drastically to give a residual cache of just 1 Kbyte with direct-mapped cache lines (n.b. the register file size is substantially larger than this, at 8 Kbyte). Figure 9 shows these results. It can be seen that the number of instructions executed, time to solution and IPC are all virtually unchanged. The only significant change is in the cache hit rate, which is now much worse, varying from 40 up to 88% on a single processor. Indeed, the minimum time to solution and maximum IPC were observed, by a small margin, with 2048 processors using the residual D-cache.
We note that there are some caveats to these results. They are obtained from the execution of a single independent code kernel. Simulations of dependent loops show that speedup saturates, with the maximum speedup being determined by the ratio of independent to dependent instructions in these kernels. For example, a simulation of a naive multiply-accumulate implementation of matrix multiplication (a dependent loop) saturated with a speedup of between 3 and 4 and was achieved using only four processors. This represents the maximum instruction parallelism of the combined iterations. The results also ignore the scalar component of the program, which we cannot effectively evaluate until we have fully developed a microthreaded compiler. Finally, the memory model used in these simulations is non-blocking, as we do not have a realistic model fully simulated yet. The results use a pipelined memory with an 8-cycle start-up for the first word and 2 cycles per word to complete a cache line transfer. Note that this limitation is not a major issue with the kernel simulated, as data can be distributed in memory according to the local schedule and blocking would be unlikely. In the 64 Kbyte cache only 3% of memory accesses cause a request to second-level memory. The remaining cache misses are same-line misses due to the regularity of scheduling. We are currently working on a second-level memory architecture and will present full simulation results of this when we have a working compiler. The results presented here, however, show great promise for this approach.
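For reference, the pipelined memory timing quoted above can be written as a one-line latency model; whether the 8-cycle start-up already covers the first word is not stated, so the accounting below is an assumption.

    /* Cycles to transfer one cache line with the pipelined memory model used
     * in these simulations: an 8-cycle start-up for the first word and 2
     * cycles for each further word of the line (assumed accounting).        */
    unsigned line_transfer_cycles(unsigned words_per_line)
    {
        return 8u + 2u * (words_per_line - 1u);
    }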
7. CONCLUSIONS

The characteristics of advanced integrated circuits (ICs) will in future require powerful and scalable CMP architectures. However, current techniques like wide-issue, superscalar processors suffer from complexity in instruction issue and in the large multi-ported register file required. The complexity of these components grows at least quadratically with increasing issue width; also, execution of instructions using these techniques must proceed speculatively, which does not always
provide results for the power consumed. In addition, more on-chip memory is required in order to ameliorate the effects of the so-called 'memory wall'. These obstacles limit the processor's performance, by constraining parallelism or through having large and slow structures. In short, this approach does not provide scalability in a processor's performance, in the on-chip area and power dissipation.
An alternative solution which eliminates this complexity in instruction issue and the global register file, and avoids speculation, has been presented in this paper. The model is based on decomposing a sequential program into small fragments of code called microthreads, which are scheduled dynamically and which can communicate and synchronize with each other very efficiently. This process allows sequential code to be compiled for execution on scalable CMPs. Moreover, as the code is schedule invariant, the same code will execute on any number of processors limited only by problem size. The model exploits ILP within basic blocks and across loop bodies. In addition, this approach supports a pre-fetching mechanism that avoids many instruction-cache misses in the pipeline. The fully distributed register file configuration used in this approach has the additional advantage of full scalability in a CMP with the decoupling of all forms of communication from the pipeline's operation. This includes memory accesses and communication between micro-contexts.
The distributed implementation of a microthreaded CMP includes two forms of asynchronous communication. The first is the broadcast bus, used for creating threads and distributing invariants. The second is the shared-register ring network used to perform communication between the register files in the producer and consumer threads. The asynchronous implementation of the bus and switch provides many opportunities for power saving in large CMP systems. The decoupled approach to register-file design avoids a centralized register file organization and, as we have shown, requires a small, fixed number of ports to each processor's register file, regardless of the number of processors in the system.
An analysis of the register-file ports in terms of the frequency of accesses to each logical port is described in this paper. This analysis involved different types of dependent and independent loop kernels. The analysis illustrates a number of interesting issues, which can be summarized as follows:
• A single write port with arbitration between different sources is sufficient to support all non-pipeline writes. This port has an average access rate of <100% over normal operating conditions. This is true even in the case of a 100% cache-miss rate.
• A second port is required to handle reads to the $D window. The analysis shows that the average access to this port is <10% over all analysed loop kernels.
• As a consequence, the distributed register files require only five ports per processor and these ports are fixed regardless of the number of processors in the system. This
provides a scalable and efficient solution for large numbers of processors on-chip.
• Finally, the average accesses to all write ports do not exceed 100% even in the case of n = 256 processors. However, to deal with a large number of processors, the performance would degrade gracefully due to the inherent latency tolerance of the model. Eventually all threads would be suspended waiting for data and in this case the stalled pipeline(s) would free up contention to the non-pipeline write port.
Finally, we present results of the simulation of an independent loop kernel that clearly demonstrate schedule invariance of the binary code and linear speedup characteristics over a wide range of processors on which the kernel is scheduled. Clearly, a microthreaded CMP based on a fully distributed and scalable register file organization and asynchronous global communication buses is a good candidate for future CMPs.

Architectures and Compilation Techniques, Paris, France, October 12-18, pp. 130-135. IEEE Computer Society, Washington, DC.
[11] Olukotun, K., Nayfeh, B. A., Hammond, L., Wilson, K. and Chang, K. (1996) The case for a single-chip multiprocessor. In Proc. Seventh Int. Symp. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-7), Cambridge, MA, October 1-5, pp. 2-11. ACM Press, New York, NY.
[12] Palacharla, S., Jouppi, N. P. and Smith, J. (1997) Complexity-effective superscalar processors. In Proc. 24th Int. Symp. Computer Architecture, Denver, CO, June 1-4, pp. 206-218. ACM Press, New York, NY.
[13] Tullsen, D. M., Eggers, S. and Levy, H. M. (1995) Simultaneous multithreading: maximizing on-chip parallelism. In Proc. 22nd Annual Int. Symp. Computer Architecture, Santa Margherita Ligure, Italy, June 22-24, pp. 392-403. ACM Press, New York, NY.
[14] Rixner, S., Dally, W. J., Khailany, B., Mattson, P. R., Kapasi, U. J.
[24] Sudharsanan, S., Sriram, P., Frederickson, H. and Gulati, A. (2000) Image and video processing using MAJC 5200. In Proc. 2000 IEEE Int. Conf. Image Processing, Vancouver, BC, Canada, September 10-13, pp. 122-125. IEEE Computer Society, Washington, DC.
[25] Cintra, M. and Torrellas, J. (2002) Eliminating squashes through learning cross-thread violations in speculative parallelisation for multiprocessors. In Proc. 8th Int. Symp. High-Performance Computer Architecture, Boston, MA, February 2-6, pp. 43-54. IEEE Computer Society, Washington, DC.
[26] Cintra, M., Martinez, J. S. and Torrellas, J. (2000) Architecture support for scalable speculative parallelization in shared-memory multiprocessors. In Proc. Int. Symp. Computer Architecture, Vancouver, Canada, June 10-14, pp. 13-24. ACM Press, New York, NY.
[27] Terechko, A., Thenaff, E. L., Garg, M. J., Van Eijndhoven, J. V. and Corporaal, H. (2003) Inter-cluster communication models for clustered VLIW processors. In Proc. 9th Int. Symp. High-Performance Computer Architecture, Anaheim,
[37] Codrescu, L., Wills, D. S. and Meindl, J. D. (2001) Architecture of the Atlas Chip Multiprocessor: dynamically parallelising irregular applications. IEEE Trans. Comput., 50, 67-82.
[38] Diefendorff, K. (1999) Power4 focuses on memory bandwidth: IBM confronts IA-64, says ISA not important. Microprocessor Rep., 13, 11-17.
[39] Preston, R. P. et al. (2002) Design of an 8-wide superscalar RISC microprocessor with simultaneous multithreading. In Proc. 2002 IEEE Int. Solid-State Circuits Conf., San Francisco, CA, February 4-6, pp. 334-335. IEEE Solid-State Circuits, USA.
[40] Scott, L., Lee, L., Arends, J. and Moyer, B. (1998) Designing the low-power M-CORE architecture. In Proc. IEEE Power Driven Micro Architecture Workshop at ISCA98, Barcelona, Spain, June 28, pp. 145-150.
[41] Park, I., Powell, M. D. and Vijaykumar, T. N. (2002) Reducing register ports for higher speed and lower energy. In Proc. 35th Annual ACM/IEEE Int. Symp. Microarchitecture, Istanbul, Turkey, November 18-22, pp. 171-182. IEEE Computer Society, Los Alamitos, CA.