
© The Author 2005. Published by Oxford University Press on behalf of The British Computer Society. All rights reserved.

For Permissions, please email: [email protected]


Advance Access published on December 19, 2005 doi:10.1093/comjnl/bxh157

Instruction Level Parallelism through Microthreading—A Scalable Approach to Chip Multiprocessors

Kostas Bousias 1, Nabil Hasasneh 2 and Chris Jesshope 1,∗
1 Department of Computer Science, University of Amsterdam, NL
2 Department of Electronic Engineering, University of Hull, UK
∗ Corresponding author: [email protected]

Most microprocessor chips today use an out-of-order instruction execution mechanism. This mechanism allows superscalar processors to extract reasonably high levels of instruction level parallelism (ILP). The most significant problem with this approach is a large instruction window and the logic to support instruction issue from it. This includes generating wake-up signals to waiting instructions and a selection mechanism for issuing them. A wide issue width also requires a large multi-ported register file, so that each instruction can read and write its operands simultaneously. Neither structure scales well with issue width, leading to poor performance relative to the gates used. Furthermore, to obtain this ILP, the execution of instructions must proceed speculatively. An alternative, which avoids this complexity in instruction issue and eliminates speculative execution, is the microthreaded model. This model fragments sequential code at compile time and executes the fragments out of order while maintaining in-order execution within the fragments. The only constraints on the execution of fragments are the dependencies between them, which are managed in a distributed and scalable manner using synchronizing registers. The fragments of code are called microthreads and they capture ILP and loop concurrency. Fragments can be interleaved on a single processor to give tolerance to latency in operands or distributed to many processors to achieve speedup. The implementation of this model is fully scalable. It supports distributed instruction issue and a fully scalable register file, which implements a distributed, shared-register model of communication and synchronization between multiple processors on a single chip. This paper introduces the model, compares it with current approaches and presents an analysis of some of the implementation issues. It also presents results showing scalable performance with issue width over several orders of magnitude, from the same binary code.

Keywords: concurrency, CMP, microthreads, code fragments


Received 10 May 2005; revised 24 August 2005

1. INTRODUCTION

For many years now, researchers have been interested in the idea of achieving major increases in the computational power of computers by the use of chip multiprocessors (CMPs). Examples of CMP are the Compaq Piranha [1], Stanford Hydra [2] and Hammond et al. [3]. Several architectures have been proposed and some manufacturers have produced commercial designs, such as the IBM Power PC [4] and Sun's MAJC project [5]. Ideally, the performance of such systems should be directly proportional to the number of processors used, i.e. should be scalable. CMPs scale well, with the limit to scalability defined by Moore's law. We calculate that current technology could support hundreds of in-order processors and achieve significant speedup over current architectures that use implicit concurrency and achieve minimal speedup through concurrent instruction issue [6]. One of the major barriers to the use of CMPs is the problem of programming them without using explicit concurrency in the user code. Ideally they should be programmed using legacy sequential code.

In theory, there is no limit to the number of processors that can be used in a CMP provided that the concurrency derived from the sequential code scales with the problem size. The problem is how to split the code into a number of independent threads, schedule these on many processors and to do this with a low and scalable overhead in terms of the control logic and processor efficiency. In fact, on general-purpose code it will


be impossible to eliminate all dependencies between threads and hence synchronization is also required. The goal of this work therefore is to define a feasible architecture for a scalable CMP that is easy to program, that maximizes throughput for a given technology and that minimizes the communication and synchronization overheads between different threads. It should be noted that in this work the term thread is used to describe very small code fragments with minimal context.

Today Intel's Itanium-2 (Madison) microprocessor features over 410 million transistors in a 0.13 µm semiconductor process technology operating at a speed of 1.5 GHz. This is a dual-processor version of the previous Itanium processor (McKinley), which has an issue width of six. Moore's law would indicate that the billion-transistor chip will become feasible in 65 nm technology within the next 3 or 4 years [7]. Intel expects that the Itanium processor will reach 1.7 billion transistors at the end of 2005. The questions we must ask are where do we go from here and what is the best way to utilize this wealth of transistors, while maximizing performance and minimizing power dissipation?

It can be argued that the current trend of increasing the clock speed and using a large area to marginally improve the instructions executed per cycle (IPC) of the chip is a poor strategy for future generations of microprocessors and does not guarantee better performance. The use of aggressive clock speed as a means of achieving performance only exacerbates the memory wall and requires larger on-chip caches. More importantly, this strategy cannot be continued indefinitely as power density is a function of frequency and is becoming a critical design constraint. Using concurrency as a means of increasing performance without increasing power density is a much better strategy, so long as models and implementations can be found that are scalable. Conversely, by exploiting concurrency for constant performance, clock frequencies can be reduced, allowing voltages to be scaled down and gaining a quadratic decrease in power density with concurrency. Of course, this assumes completely scalable models and implementations.

Another problem area in future technology is the scaling of wire delays compared with gate delays. As transistor dimensions scale down, the number of gates which are reachable within the scaled clock is at best constant, which means that distributed rather than monolithic architectures need to be exploited [8].

Superscalar processors today issue up to eight instructions per clock cycle but instruction issue is not scalable [9] and a linear increase in parallelism requires at least a quadratic increase in area [10]. The logic required occupies ∼30% of the total chip area in a 6-way superscalar processor [11]. In addition, more and more area is being used for on-chip memory. Typically the second-level on-chip cache occupies 25–30% of the die area on a modern microprocessor and between 50 and 75% on the recently announced Itanium-2. Moreover, significant delay and power consumption are seen in high-issue-width processors due to tag matching, wake-up signals to waiting instructions and selection mechanisms for issuing instructions. These delays increase quadratically for most building blocks with the instruction window size [12]. Finally, even with the implementation of a large instruction window, it is difficult for processors to find sufficient fine-grain parallelism, which has made most chip manufacturers like Compaq, Intel and Sun look at simultaneous multi-threading (SMT) [13] to expose more instruction level parallelism (ILP) through a combination of coarse- and fine-grain parallelism.

Future microprocessor technology needs efficient new architectures to achieve the demands of many grand-challenge applications, such as weather and environmental modelling, computational physics (on both sub-atomic and galactic scales) and biomolecular simulation. In the near future, it should be possible to integrate thousands of arithmetic units on a single chip [14] but out-of-order execution is not a good candidate, because of the limited concurrency it exposes and the non-linear increase in hardware complexity of instruction issue with issue width. Instruction issue is not the only scaling problem; the complexity of the register file [15] and bypass logic [12] also scale badly, giving further barriers to a scalable architecture. Very-long instruction word (VLIW) architectures transfer the task of instruction scheduling from the hardware to the compiler, which avoids the scaling problems in instruction issue. However, in practice, many modern VLIW architectures end up requiring many of the same complex mechanisms as superscalar processors [16], such as branch prediction, speculative loads, pipeline interlocks, and new mechanisms like code compressors.

Multi-threading can expose higher levels of concurrency and can also hide latency by switching to a new thread when one thread stalls. SMT appears to be the most popular form of multi-threading. In this technique, instructions from multiple threads are issued to a single wide-issue processor out of programmed order using the same problematic structures we have already described. It increases the amount of concurrency exposed in out-of-order issue by working on multiple threads simultaneously. The main drawback to SMT is that it complicates the instruction issue stage, which is central for the multiple threads [17]. Scalability in instruction issue is no easier to achieve because of this and the other scalability problems remain unchanged. Thus SMT suffers from the same implementation problems [18] as superscalar processors.

An alternative approach to multi-threading that eliminates speculation and does provide scalable instruction issue is the microthreaded model. The threads in this model are small code fragments with an associated program counter. Little other state is required to manage them. The model is able to expose and support much higher levels of concurrency using explicit but dynamic controls. Like VLIW, the concurrency exposed comes from the compiler and expresses loops as well as basic block concurrency. Unlike VLIW, the concurrency is parametric and is created dynamically. In pipelines that execute this


model, instructions are issued in-order from any of the threads allocated to it but the schedule of instructions executed is non-deterministic, being determined by data availability. Threads can be deterministically distributed to multiple pipelines based on a simple scheduling algorithm. The allocation of these threads is dynamic, being determined by resource availability, as the concurrency exposed is parametric and not limited by the hardware resources. The instruction issue schedule is also dynamic and requires linear hardware complexity to support it. Instructions can be issued from any microthread already allocated and active. If such an approach could also give a linear performance increase with the number of pipelines used, then it can provide a solution to both CMP and ILP scalability [19].

The performance of the microthreaded model is compared with a conventional pipeline in [20] for a variety of memory latencies. The results show that the microthreaded microprocessor always provides a superior performance to that of a non-threaded pipeline and consistently shows a high IPC on a single-issue pipeline (0.85) even for highly dependent code and in the presence of very high memory latencies. The model naturally requires more concurrency to achieve the same asymptotic performance as memory delays are increased. It will be shown in this paper that applying this model to a CMP can also provide orders of magnitude speedup and IPCs of well over 1000 have been demonstrated. Note that both the instruction issue logic and the size of the register file in this model scale linearly with issue width.

2. CURRENT APPROACHES

2.1. Out-of-order execution

To achieve a higher performance, modern microprocessors use an out-of-order execution mechanism to keep multiple execution units as busy as possible. This is achieved by allowing instructions to be issued and completed out of the original program sequence as a means of exposing concurrency in a sequential instruction stream. More than one instruction can be issued in each cycle, but only independent instructions can be executed in parallel; other instructions must be kept waiting or, under some circumstances, can proceed speculatively.

Speculation refers to executing an instruction before it is known whether the results of the instruction will be used or not; this means that a guess is made as to the outcome of a control or data hazard as a means to continue executing instructions, rather than stalling the pipeline. Register renaming is also used to eliminate the artificial data-dependencies introduced by issuing instructions out of order. This also enables the extension of the architectural register set of the original ISA, which is necessary to support concurrency in instruction execution. Any concurrent execution of a sequential program will require some similar mechanism to extend the synchronization memory available to instructions. Speculative execution and out-of-order issue are used in superscalar processors to expose concurrency from sequential binary code. A reorder buffer or future file of check points and repairs is also required to re-order the completion of instructions before committing their results to the registers specified in the ISA in order to achieve a sequentially consistent state on exceptions.

Control speculation predicts branch targets based on the prior history for the same instruction. Execution continues along this path as if the prediction was correct, so that when the actual target is resolved, a comparison with the predicted target will either match, giving a correctly predicted branch, or not, in which case there was a misprediction. A misprediction can require many pipeline cycles to clean up and, in a wide-issue pipeline, this can lead to hundreds of instruction slots being unused, or to be more precise, if we focus on power, being used unnecessarily. Both prediction and cleaning up a misprediction require additional hardware, which consumes extra area and also power, often to execute instructions whose results are never used. It can therefore be described as wasteful of chip resources and, moreover, has unpredictable performance characteristics [21]. We will show that it is possible to obtain high performance without speculation and, moreover, save power in doing so.

As already noted in Section 1, as the issue width increases in an out-of-order, superscalar architecture, the size of the instruction window and associated logic increase quadratically, which results in a large percentage of the chip being devoted to instruction issue. The out-of-order execution mechanism therefore prevents concurrency from scaling with technology and will ultimately restrict the performance over time. The only reason for using this approach is that it provides an implicit mechanism to achieve concurrent execution from sequential binary code.

2.2. VLIW

An alternative and explicit approach to concurrency in instruction issue is VLIW, where multiple functional units are used concurrently as specified by a single instruction word. This usually contains a fixed number of operations that are fetched, decoded, issued and executed concurrently. To avoid control or data hazards, VLIW compilers must hoist later independent instructions into the VLIW or, if this is not possible, must explicitly add no-op instructions instead of relying on hardware to stall the instruction issue until the operands are ready. This can cause two problems: firstly, a stall in one instruction will stall the entire width of the instruction; secondly, adding no-op instructions increases the program size. In terms of performance, if the program size is large compared to the I-cache or TLB size, it may result in higher miss rates, which in turn degrades the performance of the processor [22]. It is not possible to identify all possible sources of pipeline stalls and their duration at compile time. For example, suppose a memory access causes a cache miss; this leads to a longer than expected stall. Therefore, in instructions with non-deterministic delay, like a load instruction to a cache hierarchy,


the number of no-op instructions required is not known and most VLIW compilers will schedule load instructions using the cache-hit latency rather than the maximum latency. This means that the processor will stall on every cache miss. The alternative of scheduling all loads with the cache-miss latency is not feasible for most programs because the maximum latency may not be known due to bus contention or memory port delays, and it also requires considerable ILP. This problem with non-determinism in cache access limits VLIW to cacheless architectures unless speculative solutions are embraced. This is a significant problem with modern technology, where processor speeds are significantly higher than memory speeds [19]. Also, pure VLIW architectures are not good for general purpose applications, due to their lack of compatibility in binary code [23]. The most significant use of VLIW therefore is in embedded systems, where these constraints are both solved (i.e. single applications using small fast memories). A number of projects described below have attempted to apply speculation to VLIW in order to solve the scheduling problems and one, the Transmeta Crusoe, has applied dynamic binary code translation to solve the backward compatibility problem.

The Sun MAJC 5200 [24] is a CMP based on four-way issue, VLIW pipelines. This architecture provides a set of predicated instructions to support control speculation. The MAJC architecture attempts to use speculative, thread-level parallelism (TLP) to support the multiple processors. This aggressively executes code in parallel that cannot be fully parallelized by the compiler [25, 26]. It requires new hardware mechanisms to eliminate most squashes due to data dependencies [25]. This method of execution is again speculative and can degrade the processor's performance when the speculation fails. MAJC replicates its shared registers in all pipelines to avoid sharing resources. From the implementation point of view, replicating the registers costs significant power and area [27] and also restricts the scalability. Furthermore, the MAJC compiler must know the instruction latencies before it can create a schedule. As described previously, it is not simple to detect all instructions' latencies due to the variety of the hardware communication overheads.

2.3. EPIC

Intel's explicitly parallel instruction computing (EPIC) architecture is another speculative evolution of VLIW, which also solves the forward (although not backward) code compatibility problem. It does this through the run-time binding of instruction words to execution units. The IA-64 [28] architecture supports binary code compatibility across a range of processor widths by utilizing instruction packets that are not determined by issue width. This means a scheduler is required to select instructions for execution on the available hardware from the current instruction packet. This gives more flexibility as well as supporting binary code compatibility across future generations of implementation. The IA-64 also provides architectural support for control and data speculation through predicated instruction execution and binding pre-fetches of data into cache. In this architecture each operation is guarded by one of the predicate registers, each of which stores 1 bit that determines whether the results of the instruction are required or not. Predication is a form of delayed branch control and this bit is set based on a comparison operation. In effect, instructions are executed speculatively but state update is determined by the predicate bit, so an operation is completed only if the value of its guard bit is true, otherwise the processor invalidates the operation. This is a form of speculation that executes both arms of some branches concurrently but this action restricts the effective ILP, depending on the density and nesting of branches.
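The effect of predication can be sketched in C. The functions below are ours, purely for illustration; real IA-64 predication guards individual machine operations rather than C statements, but the flattening of a branch into straight-line, guarded code is the same idea:

    #include <stdio.h>

    /* Branching form: only one arm executes, at the cost of a control hazard. */
    static int abs_diff_branching(int a, int b)
    {
        if (a > b)
            return a - b;
        return b - a;
    }

    /* If-converted form: both arms execute and a 1-bit predicate selects the
       result that is committed, so no branch is needed. */
    static int abs_diff_predicated(int a, int b)
    {
        int p  = (a > b);   /* comparison sets the predicate         */
        int t0 = a - b;     /* executed regardless of the predicate  */
        int t1 = b - a;     /* executed regardless of the predicate  */
        return p ? t0 : t1; /* commit is controlled by the predicate */
    }

    int main(void)
    {
        printf("%d %d\n", abs_diff_branching(7, 3), abs_diff_predicated(3, 7));
        return 0;
    }

As the paragraph above notes, this trades branch penalties for wasted work on the arm whose result is discarded.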
Pre-fetching is achieved in a number of ways. For example, by an instruction identical to a load word instruction that does not perform a load but touches the cache and continues, setting in motion any required transactions and cache misses up the hierarchy. These instructions are hoisted by the compiler up the instruction stream, not just within the same basic block. They can therefore tolerate high latencies in memory, providing the correct loads can be predicted. There are many more explicit controls on caching in the instruction set to attempt to manage the non-deterministic nature of large cache hierarchies. Problems again arise from the speculative nature of the solution. If for some reason the pre-fetch fails, either because of a conflict or insufficient delay between the pre-fetch and the genuine load word instruction, then a software interrupt is triggered, incurring a large delay and overhead.

EPIC compilers face a major problem in constructing a plan of execution; they cannot predict all conditional branches and know which execution path is taken [29]. To some extent this uncertainty is mitigated by predicated execution but, as already indicated, this is wasteful of resources and power and, like all speculative approaches, can cause unpredictability in performance. Although object code compatibility has been solved to some extent, the forward compatibility is only as good as the compiler's ability to generate good schedules in the absence of dynamic information. Also the code size problem is still a challenge facing the EPIC architecture [29].

2.4. Multiscalar

Another paradigm to extract even more ILP from sequential code is the multiscalar architecture. This architecture extends the concept of superscalar processors by splitting one wide processor into multiple superscalar processors. In a superscalar architecture, the program code has no explicit information regarding ILP; only the hardware can be employed to discover the ILP from the program. In multiscalar, the program code is divided into a set of tasks or code fragments, which can be identified statically by a combination of the hardware and software. These tasks are blocks in the control flow graph of the program and are identified by the compiler. The purpose of


this approach is to expose a greater concurrency explicitly by the compiler.

The global control unit used in this architecture distributes the tasks among multiple parallel execution units. Each execution unit can fetch and execute only the instructions belonging to its assigned task. So, when a task misprediction is detected, all execution units between the incorrect speculation point and the later task are squashed [30]. Like superscalar, this can result in many wasted cycles; however, as the depth of speculation is much greater, the unpredictability in performance is correspondingly wider.

The benefit of this architecture over a superscalar architecture is that it provides more scalability. The large instruction window is divided into smaller instruction windows, one per processing unit, and each processing unit searches a smaller instruction window for independent instructions. This mitigates the problems of scaling instruction issue with issue width. The multiple tasks are derived from loops and function calls, allowing the effective size of the instruction window to be extremely large. Note that not all instructions within this wide range are simultaneously being considered for execution [31]. This optimization of the instruction window is offset by a potentially large amount of communication, which may affect the overall system performance.

Communication arises because of dependencies between tasks; examples are loop-carried dependencies, function arguments and results. Results stored to registers, that are required by another task, are routed from one processor to another at run-time via a unidirectional ring network. Recovery from misspeculation is achieved by additional hardware that maintains two copies of the registers along with a set of register masks, in each processing unit [32]. In summary then, although the multiscalar approach mitigates against instruction window scaling, allowing wider issue width, in practice it requires many of the same complex mechanisms as superscalar and, being speculative, is unlikely to be able to perform as consistently as a scalable CMP.

2.5. Multi-threading

In order to improve processor performance, modern microprocessors try to exploit TLP through a multi-threading approach even at the same time as they exploit ILP. Multi-threading is a technique that tolerates delays associated with synchronizing, including synchronizing with remote memory accesses, by switching to a new thread when one thread stalls. Many forms of explicit multi-threading techniques have been described, such as interleaved multi-threading (IMT), blocked multi-threading (BMT) and SMT. A good survey of multi-threading is given in [17].

A number of supercomputers designed by Burton Smith have successfully exploited IMT; these include the Denelcor HEP and the Horizon, and culminated in the Tera architecture [33]. This approach is perhaps the closest to that of microthreading described in this paper, although the processor was designed as a component of a large multi-computer and not as a general purpose chip. The interleaved approach requires a large concurrency in order to maintain efficient pipeline utilization, as it must be filled with instructions from independent threads. Unlike the earlier approaches, Tera avoids this requirement using something called explicit-dependence lookahead, which uses an instruction tag of 3 bits that specifies how many instructions can be issued from the same stream before encountering a dependency on it. This minimizes the number of threads required to keep the pipeline running efficiently, which is ∼70 in the case of memory accesses. It will be seen that microthreading uses a different approach that maintains full backward compatibility in the ISA, as well as in the pipeline structure.

Unlike IMT, which usually draws concurrency from ILP and loops, BMT usually exploits regular software threads. There have been many BMT proposals, see [17], and even some commercial designs such as Sun's Niagara processor [34]. However the concurrency exposed in BMT architectures is limited, as resources such as register files must be duplicated to avoid excessive context switching times. This limits the applicability of BMT to certain classes of applications, such as servers.

SMT is probably the most popular and commercial form of multi-threading in use today. In this approach, multiple instructions from multiple threads provide ILP for multiple execution units in an out-of-order pipeline. Several recent architectures have either used or proposed SMT, such as the Hyper-Threading Technology in the Intel Xeon processor [35] and the Alpha 21464 [36]. As already described, the main problem with an SMT processor is that it suffers from the same scalability issues as a superscalar processor, i.e. layout blocks and circuit delays grow faster than linearly with issue width. In addition to this, multiple threads share the same level-1 I-cache, which can cause high cache miss rates, all of which provides limits to its ultimate performance [18].

2.6. Recent CMPs

From the above discussion we see that most current techniques for exploiting concurrency suffer from software and/or hardware difficulties, and the focus of research and development activity now seems to be on CMPs. These designs give a more flexible and scalable approach to instruction issue, freeing them to exploit Moore's law through system-level concurrency. Some applications can exploit such concurrency through the use of multi-threaded applications. Web and other servers are good examples; however, the big problem is how to program CMPs for general purpose computation and whether performance can ever be achieved from legacy sequential code, either in binary or even source form.

Several recent projects have investigated CMP designs [1, 2, 3, 37]. Typically, the efficiency of a CMP depends on


the degree and characteristics of the parallelism. Executing multiple processes or threads in parallel is the most common way to extract a high level of parallelism but this requires concurrency in the source code of an application. Previous research has demonstrated that a CMP with four 2-issue processors will reach a higher utilization than an 8-issue superscalar processor [17]. Also, work described in [3] shows that a CMP with eight 2-issue superscalar processors would occupy the same area as a conventional 12-issue superscalar. The use of CMPs is a very powerful technique to obtain more performance in a power-efficient manner [38]. However, using superscalar processors as a basis for CMPs, with their complex issue window, large on-chip memory, large multi-ported register file and speculative execution, is not such a good strategy because of the scaling problems already outlined. It would be more efficient to use simpler in-order processors and exploit more concurrency at the CMP level, provided that this can be utilized by a sufficiently wide range of applications. This is an active area of research in the compiler community and, until this problem is solved, CMPs based on user-level threads will only be used in applications which match this requirement, such as large server applications, where multiple service requests are managed by threads.

3. REGISTER FILES

The register file is a major design obstacle to scalable systems. All systems that implement concurrency require some form of synchronizing memory. In a dataflow architecture, this is the matching store; in an out-of-order issue processor it is the register file, supported by the instruction window, reservation stations and re-order buffer. To implement more concurrency and higher levels of latency tolerance, this synchronizing memory must be increased in size. This would not be a problem except that in centralized architectures, as issue width increases, the number of ports to this synchronizing memory must also increase. The problem is that the cell size grows quadratically with the number of ports or issue width. If N instructions can be issued in one cycle, then a central register file requires 2N read ports and N write ports to handle the worst-case scenario. This means that the register cell size grows quadratically with N. Moreover, as the number of registers also increases with the issue width, a typical scaling of register file area is as the cube of N.
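This scaling argument can be summarized in one line (a back-of-the-envelope sketch; constant factors and wiring overheads are omitted):

    \[ \text{ports} \propto N, \qquad A_{\mathrm{cell}} \propto \text{ports}^2 \propto N^2, \qquad \text{registers} \propto N \;\Longrightarrow\; A_{\mathrm{RF}} \propto N \cdot N^2 = N^3, \]

so doubling the issue width implies roughly an eight-fold increase in register file area.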
The register file in the proposed 8-way issue Alpha 21464 was a single 512-register file with 24 ports in total. It occupied an area of some five times the size of the L1 D-cache of 64 Kbyte [39]. Also, in Motorola's M-CORE architecture, the register file energy consumption can be 16% of the total processor power and 42% of the data path power [40]. It is clear therefore that the multi-ported register files in modern microprocessors consume significant power and die area. Several projects have investigated the register file problem in terms of reducing the number of registers or minimizing the number of read or write ports [14, 15, 41, 42].

Work done in [41] describes a bypass scheme to reduce the number of register file read ports by avoiding unnecessary register file reads for the cases where values are bypassed. In this scheme an extra bypass hint bit is added to each operand of instructions waiting in the issue window and a wake-up mechanism is used to reduce register file read ports. As described in [43], this technique has two main problems. First, the scheme is only a prediction, which can be incorrect, requiring several additional repair cycles for recovery on misprediction. Second, because the bypass hint is not reset on every cycle, the hint is optimistic and can be incorrect if the source instruction has written back to the register file before the dependent instruction is issued. Furthermore, an extra pipeline stage is required to determine whether to read data operands from the bypass network or from the register file.

Other approaches include a delayed write-back scheme [42], where a memory structure is used to delay the write-back results for a few cycles to reduce register file ports. The disadvantage of this scheme is that it is necessary to write the results both to the register file and the write-back queue concurrently to avoid consistency problems during register renaming. The authors propose an extension to this scheme to reduce the number of register write ports. However, this extension suffers from an IPC penalty and it degrades the pipeline performance. Furthermore, in this model, any branch mispredictions cause a pipeline stall and insufficient use of the delayed write-back queue. In fact, most previous schemes for minimizing the multi-ported register file have required changes in the pipeline design and do not enable full scalability. At best they provide a constant remission in the scalability of the register file.

Recently Rixner et al. [14] suggested several partitioning schemes for the register file from the perspective of streaming applications, including designs spanning a central register file through to a distributed register file organization. Their results, not surprisingly, show that a centralized register file is costly and scales as O(N^3), while in the distributed scheme, each ALU has its own port to connect to the local register files and another port to access other register files via a fast crossbar switch network. This partitioning proved to use less area and power and caused less delay compared with the purely global scheme, and was also shown to provide a scalable solution. The distributed configuration also has a smaller access time compared with the centralized organization. In this work, a central register file of 128 32-bit registers with 16 read ports and 8 write ports is compared to 8 local register files of 32 32-bit registers, each with 2 read ports, 1 write port, and 1 read/write port for external access. The result from their CACTI model showed a 47.8% reduction in access time for the distributed register file organization across all technologies [44].

It is not clear from this work whether the programming model for the distributed register file model is sufficiently general for most computations. With a distributed register file


and explicitly routed network, operations must be scheduled by the compiler and routing information must also be generated with code for each processor in order to route results from one processor's register file to another. Although it may be possible to program streaming applications using such a model, in general, concurrency and scheduling cannot be defined statically.

Other previous work has described a distributed register file configuration [44] where a fully distributed register file organization is used in a superscalar processor. The architecture exploits a local register mapping table and a dedicated register transfer network to implement this configuration. This architecture requires an extra hardware recopy unit to handle the register file dispatch operations. Also, this architecture suffers from a delay penalty as the execution unit of an instruction that requires a value from a remote register file must stall until it is available. The authors have proposed an eager transfer mechanism to reduce this penalty, but this still suffers from an IPC penalty and requires both central issue logic and global renaming.

In our research, it seems that only the microthreaded model provides sufficient information to implement a penalty-free distributed register file organization. Such a proposal is given in [6] where each processor in a CMP has its own register file in a shared register model. Accesses to remote data are described in the binary code and do not require speculative execution or routing. The decoupling is provided by a synchronization mechanism on registers and the routing is decoupled from the operation of the microthreaded pipeline, exploiting the same latency tolerance mechanisms as used for main memory access. Section 4.5 explains in more detail how the microthreaded CMP distributes data to the register files and hides the latency during a remote register file access.

4. THE MICROTHREADED MODEL

4.1. Overview

In this section we consider the microthreaded concurrency model in more detail and describe the features that support the implementation of a scalable CMP based on it. This model was first described in [45], and was then extended in [6, 19, 46] to support systems with multiple processors on-chip.

Like the Tera, this model combines the advantages of BMT and IMT but does so by explicitly interleaving microthreads on a cycle-by-cycle basis in a conventional pipeline. This is achieved using an explicit context switch instruction, which is acted upon in the first stage of the pipeline. Context switching is performed when the compiler cannot guarantee that data will be available to the current instruction and is used in conjunction with a synchronization mechanism on the register file that suspends the thread until the data becomes available. The context switch control is not strictly necessary, as this can be signalled from the synchronization failure on the register read. However, it significantly increases the efficiency of the pipeline, especially when a large number of thread suspensions occur together, when the model resembles that of an IMT architecture. Only when the compiler can define a static schedule are instructions from the same thread scheduled in BMT mode. Exceptions to this are cache misses, iterative operations and inter-thread communications. There is one other situation where the compiler will flag a context switch and that is following any branch instruction. This allows execution to proceed non-speculatively, eliminates the branch prediction and cleanup logic and fills any control hazard bubbles with instructions from other threads, if any are active.

The model is defined incrementally and can be applied to any RISC or VLIW instruction set. See Table 1 for details of the extensions to an instruction set required to implement the concurrency controls required by this model. The incremental nature of the model allows a minimal backward compatibility, where existing binary code can execute unchanged on the conventional pipeline, although without any of the benefits of the model being realized.

Microthreading defines ILP in two ways. Sets of threads can be specified where those threads generate MIMD concurrency within a basic block. Each thread is defined by a pointer to its first instruction and is terminated by one or more Kill instructions depending on whether it branches or not. Sets of threads provide concurrency on one pipeline and share registers. They provide latency tolerance through explicit context switching for data and control hazards. Iterators, on the other hand, define SPMD concurrency by exploiting a variety of loop structures, including for and while loops. Iterators give parametric concurrency by executing iterations in parallel subject to dataflow constraints. Independent loops have no loop-carried dependencies and can execute with minimal overhead on multiple processors. Dependent loops can also execute on multiple processors, exploiting instruction level concurrency, but during the execution of dependency chains activity will move from one processor to another and speedup will not be linear. Ideally dependency chains should execute with minimal latency, and parameters for the instruction in Table 1 allow dependencies to be bypassed on iterations executed on a single processor, giving the minimal latency possible, i.e. one pipeline cycle per link in the chain.
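The distinction between independent and dependent loops can be illustrated with two ordinary C loops (our example, not taken from the paper); the dependency distance d in the second loop corresponds to the Dependency field of the TCB in Table 2:

    #define N 1024
    #define D 4   /* dependency distance d */

    /* Independent loop: no loop-carried dependency, so every iteration can
       become its own microthread and the family can be spread freely over
       the processors of the CMP. */
    void independent(double *a, const double *b, const double *c)
    {
        for (int i = 0; i < N; i++)
            a[i] = b[i] + c[i];
    }

    /* Dependent loop: a[i] depends on a[i - D]. Iterations still execute
       concurrently, but each link in the dependency chain is resolved through
       a shared synchronizing register, so the chain itself is serialized. */
    void dependent(double *a)
    {
        for (int i = D; i < N; i++)
            a[i] = a[i - D] + 1.0;
    }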
Iterators share code between iterations and use a set of threads to define the loop body. This means that some form of context must be provided to differentiate multiple iterations executing concurrently. This is achieved by allocating registers to iterations dynamically. A family of threads, then, is defined by an iterator comprising a triple of start, step and limit over a set of threads. Information is also required that defines the micro-context associated with an iteration and, as each iteration is created, registers for its micro-context are allocated dynamically. To create a family of threads a single instruction is executed on one processor, which points to a thread control block (TCB) containing the above parameters. Iterations can


then be scheduled on one or more processors as required to achieve the desired performance.

Virtual concurrency on a single pipeline defines the latency that can be tolerated and is limited by the size of the local register file or continuation queue in the scheduler. The latter holds the minimal state associated with each thread. Both are related by the two characteristics of the code: the number of registers per micro-context and the cardinality of the set of threads defining the loop body. In this model, all threads are drawn from the same context and the only state manipulated in the architecture is the thread's execution state, its PC and some information about the location of its micro-context in the local register file. This mechanism removes any need to swap register values on a context switch.

Theoretically, physical concurrency is limited only by the silicon available to implement a CMP, as all structures supporting this model are scalable and are related to the amount of the virtual concurrency required for latency tolerance, i.e. register file, continuation queue and register allocation logic. Practically, physical concurrency will be limited by the extent of the loops that the compiler can generate, whether they are independent or contain loop-carried dependencies and, ultimately, the overheads in distribution and synchronization that frame the SPMD execution. Note that thread creation proceeds in two stages. A conceptual schedule is determined algorithmically on each processor following the creation of a family of microthreads but the actual thread creation, i.e. the creation of entries in the continuation queue, occurs over a period of time at the rate of one thread per cycle, keeping up with the maximum context-switch rate. This continues while resources are available.

Figure 1 shows a simple microthreaded pipeline with five stages and the required shared components used in this model. Notice that no additional stages are required for instruction issue, retiring instructions, or in routing data between processors' register files. Short pipelines provide low latency for global operations. Note that more pipeline stages could be used to reduce the clock period, as is the current trend in microprocessor architecture. However, as concurrency provides the most power-efficient solution to performance, it is moot whether this is a sound strategy. Two processors can give the same performance as one double-speed processor but do so with less power dissipated. Dynamic power dissipation is proportional to frequency f and V^2, but V can be reduced with f, giving a quadratic reduction of the energy required by exploiting concurrency in a computation, over at least some range of voltage scaling.
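As a worked illustration (ours, using the standard CMOS dynamic-power relation and ignoring leakage and threshold effects), with C the switched capacitance:

    \[ P_{\mathrm{dyn}} \propto C V^2 f, \qquad 2 \cdot C \left(\tfrac{V}{k}\right)^2 \tfrac{f}{2} \;=\; \frac{C V^2 f}{k^2}, \]

i.e. replacing one processor at frequency f by two processors at f/2, with the supply scaled from V to V/k (k > 1), delivers the same nominal throughput for roughly a k^2 reduction in dynamic power.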
The set of shared components used in this model to support the microthreaded CMP is minimal and their use is infrequent. These components are the broadcast bus, used to create a family of threads, and a ring network for register sharing. In a Cre instruction a pointer to the TCB is distributed to each processor, where the scheduler will use this information to determine the subset of the iterations it will execute. The broadcast bus is also used to replicate global state to each processor's local register file instead of accessing a centralized register file. Both operations are low in frequency and can be amortized over the execution of multiple iterations.

Replication of global variables is one of two mechanisms that allow the register file in the microthreaded model to be fully distributed between the multiple processors on a chip. The other is the shared-register ring network, which allows communications between pairs of threads to implement loop-carried dependencies. This communication between the shared and dependent threads is described in more detail below but uses the ring network only if two iterations are allocated to different processors. Note that schedules can be defined to minimize inter-processor communication; more importantly, this communication is totally decoupled from the pipeline's operation through the use of explicit context switching.

Both types of global communication, i.e. the broadcast bus and the ring network, are able to use asynchronous communications, creating independent clocking domains for each processor. Indeed, the broadcast mechanisms could be implemented at some level by eager propagation within the ring network.

4.2. Concurrency controls

The microthread model is a generic one, as it can be applied to any ISA, so long as its instructions are executed in-order. In addition, the model can be designed to maintain full backward compatibility, allowing existing binary code to run, without speedup [6], on a microthreaded pipeline. Binary compatibility with speedup can also be obtained using binary-to-binary translation to identify loops and dependencies and adding instructions to support the concurrent execution of those loops and/or the concurrency within the basic blocks.

Table 1 shows the five instructions required to support this model on an existing ISA. The instructions create a family of threads, explicitly context switch between threads, kill a thread, and two instructions provide for global synchronization. One is a barrier synchronization, the other a form of break instruction, which forces a break from a loop executed concurrently. These concurrency controls provide an efficient and flexible method to extract high levels of ILP from existing code. Each instruction will now be described in more detail.

TABLE 1. Concurrency-control instructions.

Instruction   Instruction behaviour
Cre           Creates a new family of threads
Swch          Causes a context switch to occur
Kill          Terminates the thread being executed
Bsync         Waits for all other threads to terminate
Brk           Terminates all other threads


[FIGURE 1. Microthreaded microprocessor pipeline. The figure shows the five pipeline stages (instruction fetch with context-switch and create-thread control, register read, ALU, data cache and register write-back), the scheduler with its I-cache and dynamic register allocation, decoupled load words to the shared memory, the broadcast bus with its arbiter, and the ring network to nearest neighbours for remote register access.]

4.2.1. Thread creation

The microthreaded model defines explicit and parametric concurrency using the Cre instruction. This instruction broadcasts a pointer to the TCB to all processors assigned to the current context; see [47] for details of dynamic processor allocation. The TCB contains parameters which define a family of threads, i.e. the set of threads representing the loop body and the triple defining the loop. It also defines the dynamic resources required by each thread (its micro-context) in terms of local and shared registers. For loops which carry a dependency, it also defines a dependency distance and, optionally, pointers to a preamble and/or postamble thread, which can be used to set up and/or terminate a dependency chain. The dependency distance is a constant offset in the index space which defines regular loop-carried dependencies. A family of threads can be created without requiring a pipeline slot, as the create instruction is executed concurrently with a regular instruction in the IF stage of the pipeline. The TCB for our current work on implementation overheads is defined in Table 2.

TABLE 2. TCB containing parameters that describe a family of microthreads.

Threads        Cardinality of the set of threads representing an iteration
Dependency     Iteration offset for any loop-carried dependencies, e.g. a[i] := ... a[i − d]
Preambles      Number of iterations using preamble code
Postambles     Number of iterations using postamble code
Start          Start of loop index value
Limit          Limit of loop index value
Step           Step between loop indices
Locals         Number of local registers dynamically allocated per iteration
Shareds        Number of shared registers dynamically allocated per iteration
Pre-pointer    One pointer per thread in set for preamble code
Main-pointer   One pointer per thread in set for main loop-body code
Post-pointer   One pointer per thread in set for postamble code
in the IF stage of the pipeline. The TCB for our current work Post-pointer One pointer per thread in set for postamble code
on implementation overheads is defined in Table 2.
A global scheduling algorithm determines which iterations
will execute on which processors. This algorithm is built
into the local scheduler but the parameters in the TCB and A terminated thread releases it resources so long as any
the number of processors used to execute the family may be dependent thread has also terminated. To do so before this
dynamic. The concurrency described by this instruction is may destroy data that has not yet been read by the dependent
therefore parametric and may exceed the resources available in thread. Note that microthreads are usually (but not exclusively)
terms of registers and thread slots in the continuation queues. very short sequences of instructions without internal loops.
The register allocation unit in each local scheduler maintains
the allocation state of all registers in each register file and this 4.2.2. Context-switching
controls the creation of threads at a rate of one per pipeline The microthreaded context switching mechanism is achieved
cycle. Once allocated to a processor a thread runs to completion, using the Swch instruction, which is acted upon in the first
i.e. until it encounters a Kill instruction and then terminates. stage of the pipeline, giving a cycle-by-cycle interleaving if


When a Swch instruction is executed, the IF stage reads the next instruction from another ready thread, whose state is passed to the IF stage as a result of the context switch. As this action only requires the IF stage of the pipeline, it can be performed concurrently with an instruction from the base ISA, so long as the Swch instruction is pre-fetched with it.
The context switching mechanism is used to manage both control and data dependencies. It is used to eliminate control dependencies by context switching following every transfer of control, in order to keep the pipeline full without any branch prediction. This has the advantage that no instruction is executed speculatively and consequently, power is neither dissipated in making a prediction nor in executing instructions on the wrong dynamic path. Context switching also eliminates bubbles in the pipeline on data dependencies that have non-deterministic timing, such as loads from memory or thread-to-thread communication. Context switching provides an arbitrarily large tolerance to latency, determined by the size of the local register file.

4.2.3. Thread synchronization
The only synchronizing memory in the microthreaded model is provided by the registers and this gives an efficient and scalable mechanism for synchronizing data dependencies. The synchronization is performed using two synchronization bits associated with every register, which differentiate between the following states: full, empty, waiting-local and waiting-remote. Registers are allocated to micro-contexts in the empty state and a read to an empty register will fail, resulting in a reference to the microthread that issued the instruction being stored in that register. This reference passes down the pipeline with each instruction executed. Using the continuation queue in the scheduler, lists of continuations may be suspended on a register, which is required when multiple threads are dependent on the value to be stored there. All registers therefore implement I-structures in a microthreaded microprocessor. In the full state, registers operate normally, providing data upon a register read and, if no synchronization is required, a register can be repeatedly written to without changing its synchronization state, to provide backward compatibility. The compiler can easily recognize the potential for a synchronization failure if a schedule for the dependency is not known at compile time. If so, it inserts a context switch on the dependent instruction. Examples include instructions dependent on a prior load word, on a value produced in another thread, or on a value produced in iterative CPU operations.
The register is set to one of the waiting states when it holds a continuation. Two kinds of continuation are distinguished: waiting-local, when the register holds the head of a list of continuations to local microthreads; and waiting-remote, when the register holds a remote request for data from another processor. The latter enables the micro-context for one iteration to be stored for read-only access on a remote processor when managing loop-carried dependencies. This implements a scalable and distributed shared-register model between processors without using a single, multi-ported register file, which is known to be unscalable.
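The read and write behaviour implied by these four states can be pictured with the following minimal sketch in Python; the class, method and scheduler names are our own illustration of the I-structure semantics described above, not part of any published implementation.

```python
# Minimal sketch of the I-structure semantics of a microthreaded register.
# States and behaviour follow the description in the text; all names are illustrative.

FULL, EMPTY, WAITING_LOCAL, WAITING_REMOTE = range(4)

class SyncRegister:
    def __init__(self):
        self.state = EMPTY            # registers are allocated to a micro-context empty
        self.value = None
        self.continuations = []       # suspended local threads or remote requests

    def read(self, thread):
        """A register read at the register-read stage of the pipeline."""
        if self.state == FULL:
            return self.value                      # normal, synchronous read
        # Synchronization failure: suspend the reading thread on this register.
        self.continuations.append(thread)
        self.state = WAITING_LOCAL
        return None                                # the instruction is reissued later

    def remote_request(self, processor_id):
        """A request for this value arriving from another processor."""
        if self.state == FULL:
            return self.value                      # forwarded back over the ring network
        self.continuations.append(('remote', processor_id))
        self.state = WAITING_REMOTE
        return None

    def write(self, value, scheduler):
        """A write fills the register and wakes any suspended consumers."""
        self.value = value
        for cont in self.continuations:
            scheduler.wake(cont)   # hypothetical scheduler hook: reactivate a local thread
                                   # or answer the pending remote request
        self.continuations.clear()
        self.state = FULL
```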
The use of dataflow synchronization between threads enables a policy of conservative instruction execution to be applied. When no microthreads are active because all are waiting for external events, such as load word requests, the pipeline will stall and, if the pipe is flushed completely, the scheduler will stop clocks and power down the processor, going into a standby mode in which it consumes minimal power. This is a major advantage of data-driven models. Conservative instruction execution policies conserve power, in contrast to the eager policies used in out-of-order issue pipelines, which have no mechanisms to recognize such a conjunction of schedules. This will have a major impact on power conservation and efficiency.
Context switching and successful synchronization have no overhead in terms of additional pipeline cycles. The context switch interleaves threads in the first stage of the pipeline, if necessary on a cycle-by-cycle basis. Synchronization occurs at the register-read stage and only if it fails will any exceptional action be triggered. On a synchronization failure, control for the instruction is mutated to store a reference to the microthread in the register being read. This means that the only overhead in supporting these explicit concurrency controls is the additional cycle required to reissue the failed instruction when the suspended thread is reactivated by the arrival of the data. Of course there are overheads in hardware, but this is true for any model.
The model also provides a barrier synchronization (Bsync) instruction, which suspends the issuing thread until all other threads have completed, and a Brk instruction, which explicitly kills all other threads leaving only the main thread. These instructions are required to provide bulk synchronization for memory consistency. There is no synchronization on main memory; only the registers are synchronizing. This means that two different microthreads in the same family may not read after write to the same location in memory, because the ordering of those operations cannot be guaranteed. It also means that any loop-carried dependencies must be compiled to use register variables. A partitioning of the micro-context supports this mechanism efficiently.
We do not address the design of the shared memory system in this paper, although we are currently investigating several approaches. Indeed the latency tolerance provided by this model makes the design of the memory system somewhat flexible. For example, a large, banked, multi-ported memory would give a solution that would provide all the buffering required for the large number of concurrent requests generated by this model. It is important to note that using in-order processors and a block-based memory consistency model, memory ordering does not pose the same problem as it does in an out-of-order processor.
4.2.4. Thread termination
Thread termination in the microthreaded model is achieved through a Kill instruction, which of course causes a context switch as well as updating the microthread's state to killed. The resources of the killed thread are released at this stage, unless there is another thread dependent upon it, in which case its resources will not be released until the dependent thread has also been killed. (Note that this is the most conservative policy and more efficient policies may be implemented that detect when all loop-carried dependencies have been satisfied.)

4.3. Scalable instruction issue
Current microprocessors attempt to extract high levels of ILP by issuing independent instructions out of sequence. They do this most successfully by predicting loop branches and unrolling multiple iterations of a loop within the instruction window. The problem with this approach has already been described; a large instruction window is required in order to find sufficient independent instructions and the logic associated with it grows at least with the square of the issue width.
If we compare this with what is happening in the microthreaded model, we see that almost exactly the same mechanism is being used to extract ILP, with one major difference: a microthreaded microprocessor executes fragments of the sequential program out of order. These fragments (the microthreads) are identified at compile time from loop bodies and conventional ILP and may execute in any order, subject only to dataflow constraints. Instructions within fragments, however, issue and complete in order. We have already seen that a context switch suspends a fragment at instructions whose operands have non-deterministic timing. The dependent instruction is issued and stores a pointer to its fragment if a register operand is found to be empty. Any suspended fragments are rescheduled when data is written to the waiting register. Thus only instructions up to the first dependency in each fragment (loop body) are issued and only that instruction will be waiting for the dependency to be resolved; all subsequent instructions in that pipeline will come from other fragments. In an out-of-order issue model the instruction window is filled with all instructions from each loop unrolled by branch prediction, because it knows nothing a priori about the instruction schedules.
Consider a computation that only ever contains one independent instruction per loop body of l instructions; then to get n-way issue, n iterations must be unrolled and the instruction window will contain n*l instructions for each n instructions issued. In comparison, the microthreaded model would issue the first n independent instructions from n threads (iterations), then it would issue the first dependent instructions from the same n threads before context switching. The next n instructions would then come from the next n iterations (threads). Synchronization, instead of taking place in a global structure with O(n^2) complexity, is distributed to n registers and has linear complexity. Each thread waits for the dependency to be resolved before being able to issue any new instructions. In effect the instruction window in a microthreaded model is distributed to the whole of the architectural register set and only one link in the dependency graph for each fragment of code is ever exposed simultaneously. Moreover, no speculation is ever required and consequently, if the schedules are such that all processors would become inactive, then this state can be recognized and used to power down the processors to conserve energy.
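As a concrete illustration of this difference, the small calculation below uses arbitrary example values for l and n (they are not taken from any experiment in this paper) to compare the window occupancy of the two approaches:

```python
# Illustrative comparison of out-of-order window occupancy with the number of
# instructions a microthreaded processor must expose to sustain the same issue rate.
l, n = 8, 4                       # example loop-body length and desired issue width

ooo_window = n * l                # out-of-order: n unrolled iterations of l instructions
microthreaded_exposed = n         # microthreaded: one exposed instruction per thread,
                                  # each synchronized on a single register

print(f"out-of-order window entries : {ooo_window}")            # 32
print(f"microthreaded exposed slots : {microthreaded_exposed}")  # 4
```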
Compare this to the execution in an out-of-order processor, where instructions are executed speculatively regardless of whether they are on the correct execution path. Although predictions are generally accurate in determining the execution path in loops, if the code within a loop contains unpredictable, data-dependent branches, this can result in a lot of energy being consumed for no useful work. Researchers now talk about 'breaking the dependency barrier' using data in addition to control speculation, but what does this mean? Indices can be predicted readily, but these are not true dependencies and do not constrain the microthreaded model; addresses based on those indices can also be predicted with a reasonable amount of accuracy, but again these do not constrain the microthreaded model. This leaves true computational data dependencies, which can only be predicted under very extraordinary circumstances. It seems therefore that there is no justification for consuming power in attempting data speculation.
Out-of-order issue has no global knowledge of concurrency or synchronization. Microthreading, on the other hand, is able to execute conservatively as it does have that global knowledge. Real dependencies are flagged by context switching, concurrency is exposed by dynamically executing parametric Cre instructions and the namespace for synchronization spans an entire loop, as registers are allocated dynamically. At any instant the physical namespace is determined by the registers that have been allocated to threads.

4.4. Thread state
When a thread is assigned resources by the scheduler, it is initially set to the waiting state, as it must wait for its code to be loaded into the I-cache before it can be considered active. A thread will go into a suspended state when it has been context switched, until either the register synchronization has been completed or the branch target has been defined, when it again goes into the waiting state. The scheduler generates a request to the I-cache to pre-fetch the required code for any thread that enters the waiting state. If the required code is available, then the I-cache acknowledges the scheduler immediately, otherwise not until the required code is in the cache. The thread's state becomes ready at this stage. A killed state is also required to indicate those threads that have completed but whose data may still be in use. At any time there is just one thread per processor which is in the running state; on start-up this will be the main thread.
On a context switch or kill, the instruction fetch stage is provided with the state of a new thread if any are active, otherwise the pipeline stalls for a few cycles to resolve the synchronization and, if it fails, the pipeline simply stops. This action is simple, requires no additional flush or cleanup logic and, most importantly, is conservative in its use of power. Note that by definition, when no local threads are active, the synchronization event has to be asynchronous and hence does not require any local clocks.
The state of a thread also includes its program counter, the base address of its micro-context and the base address and location of any micro-contexts it is dependent upon. The state also includes an implicit slot number, which is the address of the entry in the continuation queue and which uniquely identifies the thread on a given processor. The last field required is a link field, which holds a slot number for building linked lists of threads to identify empty slots, ready queues and an arbitrary number of continuation queues that support multiple continuations on different registers. The slot reference is used as a tag to the I-cache and is also passed through the pipeline and stored in the relevant operand register if a register read fails, where it forms the head of that continuation queue.
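The fields described above can be pictured as the following sketch of a continuation-queue entry; the field and type names are illustrative assumptions, as the paper does not prescribe an encoding:

```python
from dataclasses import dataclass
from enum import Enum

class ThreadState(Enum):          # the states described in Section 4.4
    WAITING = 0
    READY = 1
    RUNNING = 2
    SUSPENDED = 3
    KILLED = 4

@dataclass
class ContinuationQueueSlot:
    pc: int                       # program counter of the microthread
    context_base: int             # base address of its own micro-context
    dependent_base: int           # base address of the micro-context it depends upon
    dependent_proc: int           # processor holding that dependent micro-context
    link: int                     # slot number used to build empty/ready/continuation lists
    state: ThreadState = ThreadState.WAITING
    # The slot number itself is implicit: it is the index of this entry in the queue.
```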
4.5. Register file partitioning and distribution
We have already seen that Rixner et al. [14] have shown that a distributed register file architecture achieves better performance than a global solution and that it also provides superior scaling properties. Their work was based on streaming applications, where register sources and destinations are compiled statically. We will show that such a distributed organization can also be based on extensions to a general-purpose ISA with dynamic scheduling. The concept of a dynamic micro-context associated with parallelizing different iterations has already been introduced and is required in order to manage communications between micro-contexts in a scalable manner. It is necessary for the compiler to partition the micro-context into different windows representing different types of communication, and for the hardware to recognize these windows in order to trap a register read to an appropriate distributed communication mechanism.
A microthreaded compiler must recognize and identify four different types of communication pattern. There are a number of ways in which this partitioning can be encoded, and here we describe a simple and efficient scheme that supports a fully distributed register file based on a conventional RISC ISA, assuming a 5-bit register specifier and hence a 32-register address space per microthread (although not the same 32 registers for each thread).
The first register window is the global window (represented by $Gi). These registers are used to store loop invariants or any other data that is shared by all threads. In other models of concurrency these would represent broadcast data, which are written by one and read by many processes. Their access patterns have the characteristic that they are written to infrequently but read from frequently. The address space in a conventional RISC ISA is partitioned so that the lower 16 registers form this global window. These are statically allocated for a given context and every thread can read and/or write to them. Note that the main thread has 32 statically allocated registers, 16 of which are visible to all microthreads as globals and 16 of which are visible only to the main thread. Each thread sees 32 registers: the lower 16 of these are the globals, which are shared by all threads, and those in the upper half are local to a given thread.
The upper 16 registers are used to address the micro-context of each iteration in a family of threads. As each iteration shares common code, the address of each micro-context in the register file must be unique to that iteration. As we have seen, the base address of a thread's micro-context forms a part of its state. This immediately gives a means of implementing a distributed, shared-register model. We need to know the processor on which a thread is running and the base address of its micro-context in order to share its data. However, we can further partition the micro-context into a local part and a shared part to avoid too much additional complexity in implementing the pipeline.
Three register windows are mapped to the upper or dynamic half of the address space for each micro-context. These are the local window ($Li), the shared window ($Si) and the dependent window ($Di). Thus the sum of the sizes of these three windows must be 16. The local window stores values that are local to a given thread, for example values from indexed arrays used only in a single iteration. Reads and writes to the local window are all local to the processor a thread is running on and no distribution of the L window is therefore required. The S and D register windows provide the means of sharing a part of a micro-context between threads. The S window is written by one thread and is read by another thread using its D window.
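A sketch of how a 5-bit register specifier could be decoded against this partitioning is given below. The particular split of the dynamic half into 8 local, 4 shared and 4 dependent registers, and the base-displacement calculation, are illustrative assumptions consistent with the description, not a definitive encoding:

```python
# Illustrative decode of a 5-bit register specifier into the G/L/S/D windows.
# Assumed split of the 16 dynamic registers: 8 local, 4 shared, 4 dependent.
L_SIZE, S_SIZE, D_SIZE = 8, 4, 4

def decode(specifier, ctx_base, dep_base):
    """Return (window, physical_address) for a register specifier in 0..31."""
    if specifier < 16:                            # lower half: statically allocated globals
        return '$G', specifier
    offset = specifier - 16                       # upper half: dynamically allocated micro-context
    if offset < L_SIZE:
        return '$L', ctx_base + offset
    if offset < L_SIZE + S_SIZE:
        return '$S', ctx_base + offset
    # $D is read-only and maps onto the producer's $S window (possibly on another processor).
    return '$D', dep_base + L_SIZE + (offset - L_SIZE - S_SIZE)

# e.g. decode(20, ctx_base=256, dep_base=240) -> ('$L', 260)
```

Only the $D case requires the base address of another micro-context; all other windows resolve to the thread's own allocation, which is why the pipeline itself needs no extra ports for them.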
It should be noted that many different models can be supported by this basic mechanism. In this paper a simple model is described, but different models of communication with different constraints and solutions to resource deadlock can be implemented. The mechanism would even support a hierarchy of micro-contexts by allowing an iteration in one family of threads to create a subordinate family, where the dynamic part of the address space in the creating family becomes the static part in the subordinate family. This would support nested multi-dimensional loops as well as breadth-first recursion. There are difficulties, however, in resolving resource deadlock problems in all but the simplest models and these require further research to resolve.
In this paper we describe a simple model that supports a single level of loop, with communication being allowed only between iterations that differ by a create-time constant. An example of this type of communication can be found in loop-carried dependencies, where one iteration produces a value which is used by another iteration.
For example, A[i] := . . . A[i − k] . . ., where k is invariant of the loop. Such dependencies normally act as a deterrent to loop vectorization or parallelization, but this is not so in this model, as the independent instructions in each loop can execute concurrently. This is the same ILP as is extracted from an out-of-order model.
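A hypothetical kernel of this form is sketched below; only the read of A[i − k] carries a dependence between iterations, so every other instruction of each iteration is free to execute concurrently:

```python
# A loop with a constant-strided, loop-carried dependence: A[i] depends on A[i-k].
A = [float(i) for i in range(32)]   # example data
B = [1.0] * 32
k = 4                               # create-time constant: the dependency distance

for i in range(k, len(A)):
    # Only the value of A[i-k] must be waited for; the load of B[i] and the address
    # arithmetic are independent and overlap across microthreads. The compiler maps
    # the carried value to a shared ($S) register read through the consumer's $D window.
    A[i] = A[i - k] + B[i]
```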
Consider now the implementation of this basic model. It is straightforward to distribute the global register window and its characteristics suggest a broadcast bus as being an appropriate implementation. This requires that all processors executing a family of microthreads be defined prior to any loop invariants being written (or re-written) to the global window. The hardware then traps any writes to the global window and replicates the values using the broadcast bus to the corresponding location in all processors' global windows. As multiple threads may read the values written to the global register window, registers must support arbitrarily large continuation queues, bounded above only by the number of threads that can be active at any time on one processor.
The write to the global window can be from any processor and thus can be used to return a value from an iteration to the global state of a context. The write is also asynchronous and independent of pipeline operation, provided there is local buffering for the data in the event of a conflict on the bus. Contention for this bus should not occur regularly, as writes to globals are generally much less frequent than reads (by a factor proportional to the concurrency of the code). This is analysed later in the paper.
The distribution of S and D windows is a little more complex than the global window. Normally, a producer thread writes to its S window and the consumer reads from its D window, which maps in some sense onto the S window of the producer; we will return to this later. However, there is no restriction on a thread reading a register from its S window so long as data has already been written to it (it would deadlock otherwise). There is also no physical restriction on multiple writes to the S window, although this may introduce non-determinism if a synchronization is pending on it. As far as the hardware is concerned, therefore, the S window is identical to the L window, as all reads and writes to it are local and are mapped to the dynamic half of the register-address space. On the other hand, a thread may never write to its D window, which is strictly read-only. The hardware need only recognize reads to the D window in order to implement sharing between two different threads. In order to perform a read from a D window, a processor needs the location (processor id) and base address of the S window of the producer thread. There are two cases to consider in supporting the distribution of register files in the base-level model we have described.
The first and easiest case is when the consumer iteration is scheduled to the same processor as the producer. In this case a read to the D window can be implemented as a normal pipeline read by mapping the D window of the consumer micro-context onto the S window of the producer micro-context. The thread's state must therefore contain the base address of its own micro-context for local reads and also the base address of any micro-context it is dependent upon. In the base-level model we present, only one other micro-context is accessed, at a constant offset in the index space.
In the second case, the producer and consumer iterations are scheduled to different processors. Now, the consumer's read to the D window will generate a remote request to the processor on which the producer iteration is running. Whereas in the first case a micro-context's D window is not physically allocated, in this second case it must be. It is used to cache a local copy of the remote micro-context's S window. It is also used to store the thread continuation locally. The communication is again asynchronous and independent of the pipeline operation. The consumer thread is suspended on its read to the D window until the data arrives from the remote processor. For this constant-strided communication, iteration schedules exist that require only nearest-neighbour communication in a ring network to implement the distributed shared-register scheme.
Note that a request from the consumer thread may find an empty register, in which case the request gets suspended in the producer's S window until the required data has been produced. Thus a shared-register transaction may involve two continuations, a thread suspended in the D window of the consumer (waiting-local) and a remote request suspended in the S window of the producer (waiting-remote). As these states are mutually exclusive, the compiler must ensure that the producer thread does not suspend on one of its own S-window locations. This can happen if a load from memory to an S location is also used in the local thread. However, as dependencies are passed via register variables, this can only happen in the initialization of a dependency chain. This case can be avoided by loading to a location in the L window when the value is required locally and then copying it to the S window with a deterministic schedule.
The additional complexity required in this distributed register file implementation is 2 bits in each register to encode the four synchronization states (full, empty, waiting-local and waiting-remote); a small amount of additional logic to address the dynamically allocated registers using base-displacement addressing; and a simple state machine on each register port to implement the required action based on the synchronization state.
A method has now been described to distribute all classes of communication required in the base-level model. However, we must ensure that this distribution does not require us to implement register files locally that are not scalable. This requires the number of local ports in the register file to be constant. Accesses to L, S and local D windows require at most two read and one write port for a single-issue pipeline. The G window requires an additional write port independent of the pipeline ports. Finally, reads to a remote D window require one read port and one write port per processor. Contention for this port will depend on the pattern of dependencies, which for the model described is regular and hence evenly distributed with appropriate scheduling.
Each iteration is allocated a separate micro-context in the dynamic half of the register-address space and the first local register ($L0) is initialized by the scheduler to the loop index for that iteration, so this also requires a write port. Finally, a write port is required to support decoupled access to the memory on a cache miss, in order to avoid a bubble in the pipeline when the data becomes available.
Figure 2 is a block diagram of the microthreaded register file illustrating these ports. As shown, it has a maximum of eight local ports per processor. The register file could be implemented with just three ports by stalling the pipeline whenever one of the asynchronous reads or writes occurs, but this would degrade its performance significantly. An analysis of the accesses to these ports is given in Section 5 below, where we attempt to reduce these eight ports to five using contention, i.e. the three pipeline ports plus one additional read and one additional write port for all other cases.

FIGURE 2. Register-file ports analysed.
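A minimal sketch of this arbitration, in which the three pipeline ports stay dedicated and all other sources share one queued read and one queued write port, is given below; the class and its interface are illustrative only:

```python
from collections import deque

class SharedPortArbiter:
    """Sketch: non-pipeline register-file accesses ($G writes, $L0 initialization,
    remote $D traffic and deferred cache-miss writes) are queued and retired at most
    one per cycle, so the pipeline itself never has to stall for them."""

    def __init__(self):
        self.write_queue = deque()
        self.read_queue = deque()

    def request_write(self, source, address, value):
        self.write_queue.append((source, address, value))   # asynchronous, never blocks

    def request_read(self, source, address):
        self.read_queue.append((source, address))

    def cycle(self, register_file):
        # One shared write and one shared read are serviced per cycle.
        if self.write_queue:
            _source, address, value = self.write_queue.popleft()
            register_file[address] = value
        if self.read_queue:
            _source, address = self.read_queue.popleft()
            return register_file[address]
        return None
```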
4.5.1. Globally asynchronous locally synchronous communication
Modern synchronous CMP architectures are based on a single clock domain with global synchronization and control signals. The control signal distribution must be very carefully designed in order to meet the operation rate of each component used, and the larger the chip, the more power is required to distribute these signals. In fact, clock skew, and the large power consumption required to eliminate it, is one of the most significant problems in modern synchronous processor design.
Fully asynchronous design is difficult, but one promising technique is to use a Globally-Asynchronous, Locally-Synchronous (GALS) clocking scheme [48]. This approach promises to eliminate the global clocking problem and provides a significant power reduction over globally synchronous designs. It divides the system into multiple independent domains, which are independently clocked but which communicate in an asynchronous manner. A GALS system not only mitigates the clock distribution problem, the problem of clock skew and the resulting power consumption, it can also simplify the reuse of modules, as they have asynchronous interfaces that do not require redesign for timing issues when composed [49]. In CMP design, global communication is one of the most significant problems in both current and future systems [6], yet not every system can easily be decomposed into asynchronously communicating synchronous blocks; there must be a clear decoupling of local and remote activity. To achieve this the local activity should not be overly dependent on a remote communication. The model we have described has just this property; each processor is independent and when it does need to communicate with other processors, that communication occurs independently without limiting the local activity. In short, the local processor has tolerance to any latency involved in global communication, as in most circumstances it will have many other independent instructions it can process and, if this is not the case, it will simply switch off its clocks, reduce its voltage levels and wait until it has work to accomplish, dissipating minimal power.
The size of the synchronous block in a microthreaded CMP can be from a single processor upwards. The size of this block is a low-level design decision. The issue is that as technology continues to scale, this block size will scale down with the problems of signal propagation. Thus the model provides solutions to the end of scaling in silicon CMOS. Compare this with the current approach, which seeks to gain performance by clock speed in a single large wide-issue processor, where all strategies are working against the technology.

4.6. Microthreaded CMP architecture
A block diagram of a microthreaded CMP is shown in Figure 3. As shown, N microthreaded pipelines are connected by two shared communication systems. The first is a broadcast bus, for which there must be contention, although in practice the access to this bus is at a low frequency. It is used by one processor to create a family of threads, it will probably be used by the same processor to distribute any loop invariants and finally, if there is a scalar result, one processor may write values back to global locations. This situation occurs when searching an iteration space (it is the only situation where contention might be required), as a number of processors might find a solution simultaneously and attempt to write to the bus.
FIGURE 3. Microthreaded CMP architecture, showing communication structures and clocking domains.
In this case a break instruction acquires the bus and terminates all other threads, allowing the winner to write its results back to the global state of the main context.
The second communication system is the shared-register ring network, which is used by processors to communicate results along a dependency chain. For the model described, this requires only local connectivity between independently clocked processors.
All global communication systems are decoupled from the operation of the microthreaded pipeline and thread scheduling provides latency hiding during the remote access. This technique gives a microthreaded CMP a serious advantage as a long-term solution to silicon scaling.

5. ANALYSIS OF REGISTER FILE PORTS
In this section, an analysis of the microthreaded register file ports is made in terms of the average number of accesses to each port of the register file in every pipeline cycle. This analysis is based on the hand compilation of a variety of loop kernels. The loops considered include a number of Livermore kernels, some of which are independent and some of which contain loop-carried dependencies. They also include both affine and non-affine loops, vector and matrix problems, and a recursive doubling algorithm. We have used loop kernels at this stage as we currently have no compiler to compile complete benchmarks. However, as the model only gains speedup via loops, we have chosen a broad set of representative loops from scientific and other applications. Analysis of complete programs and other standard benchmarks will be undertaken when the compiler we are developing is able to generate microthreaded code.
The analysis is static, based on the accesses to the various register windows, and gives the average traffic on the microthreaded register file ports. The five types of register file port are shown in Figure 2 and include the pipeline ports (read R and write W), the initialization port (I), the shared-dependent ports (Sd), the broadcast port (Br) and the write port that is required in the case of a cache miss (Wm). The goal of this analysis is to guide the implementation parameters of such a system. We aim to show that all accesses other than those to the synchronous pipeline ports can be implemented by a pair of read and write ports, with arbitration between the different sources. In this case a register file with five fixed ports would be sufficient for each of the processors in our CMP design.
The microthreaded pipeline uses three synchronous ports. These ports are used to access three classes of register window, i.e. the $L, $S and $G register windows. If we assume that the average number of reads to the pipeline ports in each cycle is R and the average number of writes to the pipeline port in each cycle is W, then these values are defined by the following equations, where Ne is the total number of instructions executed:

R = Σ_inst (Read($L) + Read($S) + Read($G)) / Ne    (1)

W = Σ_inst (Write($L) + Write($S) + Write($G)) / Ne    (2)

The initialization port, on the other hand, is used in register allocation to initialize $L0 to the loop index. This port is accessed once when each iteration is allocated to a processor, and so the average number of accesses to this port is constant and equal to the inverse of the number of instructions executed by the thread before it is killed, n_o.
Therefore, if I is the average number of accesses to the initialization port per cycle, we can say that:

I = 1 / n_o    (3)

A dependent read to a remote processor uses a read port on the remote processor and a write port on the local processor, as well as a read to the synchronous pipeline port on the local processor. The average number of accesses to these ports per cycle is dependent on the type of scheduling algorithm used. If we use modulo scheduling, where M consecutive iterations are scheduled to one processor, then interprocessor communication is minimized. An equation for dependent reads and writes is given based on modulo scheduling, although we consider only the worst-case scenario. The average number of accesses per cycle to the dependent window is given by Sd in the following equation, where M is the number of consecutive threads scheduled to one processor and Ne is the total number of instructions executed. It is clear that the worst case is where M = 1, i.e. iterations are distributed one per processor in a modulo manner.

Sd = Σ_inst Read($D) / (M · Ne)    (4)

The global write port is used to store data from the broadcast bus to the global window in every processor's local register file. If we assume that the average number of accesses per cycle to this port is Br, then Br can be obtained from the following equation, where Ne is the total number of instructions executed and n is the number of processors in the system. The result is proportional to the number of processors, as one write instruction will cause a write to every processor in the system.

Br = Σ_inst Write($G) · n / Ne    (5)

Finally, the frequency of accesses to the port that is required for the deferred register write in the case of a cache miss can also be obtained. It is parameterized by cache miss rate in this static analysis and again we look at the worst case (100% miss rate). The average number of writes per cycle to the cache-miss port is given by Wm in the formula below, where Lw is the number of load instructions in each thread body, n_o is the number of instructions executed per thread body, and Cm is the cache miss rate. Again the average access to this port is constant for a given miss rate.

Wm = (Σ_inst Lw / n_o) · Cm    (6)
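Equations (1)-(6) can be collected into a small estimator such as the sketch below; the function and its example inputs are our own illustration with placeholder values, not one of the kernels of Table 3:

```python
# Sketch of the port-traffic estimates of equations (1)-(6). Inputs are aggregate
# operand counts for a compiled kernel; the example values below are placeholders.
def port_rates(reads_LSG, writes_LSG, reads_D, writes_G, loads_per_thread,
               insts_per_thread, Ne, n_procs, M=1, miss_rate=1.0):
    R  = reads_LSG / Ne                       # eq. (1): pipeline read ports
    W  = writes_LSG / Ne                      # eq. (2): pipeline write port
    I  = 1.0 / insts_per_thread               # eq. (3): initialization port
    Sd = reads_D / (M * Ne)                   # eq. (4): shared/dependent port, worst case M = 1
    Br = writes_G * n_procs / Ne              # eq. (5): broadcast (global write) port
    Wm = (loads_per_thread / insts_per_thread) * miss_rate   # eq. (6): cache-miss write port
    return R, W, I, Sd, Br, Wm

# Example with placeholder counts for a kernel executing Ne = 1000 instructions:
print(port_rates(reads_LSG=600, writes_LSG=350, reads_D=40, writes_G=3,
                 loads_per_thread=2, insts_per_thread=9, Ne=1000, n_procs=16))
```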
Table 3 shows the average number of accesses to each class of register file port over a range of loop kernels using the above formulae. The first seven kernels are dependent loops, where the dependencies are carried between iterations using registers. The last three are independent loops, where all iterations of the loop are independent of each other.

TABLE 3. Average number of accesses to each class of register file port over a range of loop kernels, m = problem size.

Loop | Ne | R | W | I | Br | Sd
A: Partial Products | 3m | (4m−3)/Ne | (2m−1)/Ne | 0.333 | 0 | (m−1)/(M·Ne)
B: 2-D SOR | 5m−2 | (8m−15)/Ne | (4m−4)/Ne | 0.2 | 0 | (m−2)/(M·Ne)
L3: Inner Product | 4m+4 | (5m+3)/Ne | (4m+1)/Ne | 0.25 | 0 | m/(M·Ne)
L4: Banded Linear Equation | 3m+34 | (2.4m+37.4)/Ne | (3m+22)/Ne | 0.2 | 4n/Ne | (1.8m+1.8)/(M·Ne)
L5: Tri-Diagonal Elimination | 5m+3 | 7m/Ne | 4m/Ne | 0.25 | 0 | (m−1)/(M·Ne)
L6: General Linear Recurrence | 2.5m+6.5m^(1/2)−5 | (5.5m+2.5m^(1/2)−5)/Ne | (3m+m^(1/2)−2)/Ne | 0.1429 | (m^(1/2)−1)n/Ne | (0.5m−0.5m^(1/2))/(M·Ne)
C: Pointer Chasing | 14m+5 | (9m+3)/Ne | (6m+2)/Ne | 0.0714 | n/Ne | m/(M·Ne)
L1: Hydro Fragment | 9m+5 | 15m/Ne | (8m+3)/Ne | 0.1111 | 3n/Ne | 0
L2: ICCG | 11m+2log m−21 | (17m−5log m−27)/Ne | (10m−5log m−12)/Ne | 0.0909 | (log m−1)n/Ne | 0
L7: Equation of State Fragment | 26m+5 | (43m+3)/Ne | (25m+3)/Ne | 0.0385 | 3n/Ne | 0
FIGURE 4. Average accesses per cycle on additional ports, n = 4 processors.

0.45 0.9
I
B
Average accesses per cycle (individual ports)

0.4 0.8
S/D
Average accesses per cycle (all writes)

All writes
0.35 0.7

0.3 0.6

0.25 0.5

0.2 0.4

0.15 0.3

0.1 0.2

0.05 0.1

0 0
0 5 10 15 20 25 30 35 40 45 50
Normalised problem size (m/n)

FIGURE 5. Average accesses per cycle on additional ports, n = 16 processors.

As described previously, each of the distributed register files has four sources of write accesses in addition to the pipeline ports. These are the $G write, the initialization write, the $D return data and the write to the port that supports decoupled access to memory on a cache miss. Our analysis shows that the average number of accesses from these sources is much less than one access per cycle over all analysed loop kernels. This is shown in Figures 4–7, where accesses to the initialization (I), broadcast (Br) and network ports (Sd, shown as S/D) are given. The four figures illustrate the scalability of the results (from n = 4 to n = 256 processors).
FIGURE 6. Average accesses per cycle on additional ports, n = 64 processors.

FIGURE 7. Average accesses per cycle on additional ports, n = 256 processors.
Results are plotted against the normalized problem size, where m is the size of the problem in terms of the number of iterations, although not all iterations are executed concurrently in all codes (for example, the recursive doubling algorithm has a sequence of concurrent loops varying by powers of 2 from 2 to m/2). Normalized problem size is therefore a measure of the number of iterations executed per processor. It can be seen that only accesses from the broadcast bus increase with the number of processors, and even this is only significant where few iterations are mapped to each processor.
Even in the case of 256 processors, providing we schedule more than a few iterations to each processor, the overall number of writes is <50%. Note that a register file of 512 registers supports at least 32 micro-contexts per processor. To put this in perspective, this means that on average a single port sharing all I, Br and Sd writes would be busy only 50% of the time. There may be peaks in the distribution of writes per cycle; however, all of these accesses are asynchronous and they can be queued without stalling the operation of any of the pipelines. This still leaves capacity to include writes from the decoupled-memory accesses.
The analysis of the decoupled-memory port also shows that the average number of accesses per cycle is small. If we assume a cache miss rate of 50%, then the average number of accesses is <20% over all loop kernels. Thus, a single write port would not be fully utilized at this miss rate. For completeness, Table 4 shows the average number of accesses per cycle to all write ports, including the Wm port, with a variable cache miss rate. This table is compiled for a normalized problem size of m/n = 8, which corresponds to fully utilizing a small register file. Actual cache miss rates for one of the kernels are given in the simulation results presented below.

TABLE 4. Average number of accesses to all additional write ports for different numbers of processors, m/n = 8.

Miss rate (%) | Wm port | All write ports (n = 4) | All write ports (n = 16) | All write ports (n = 64) | All write ports (n = 256)
10  | 0.038 | 0.405 | 0.453 | 0.517 | 0.634
20  | 0.076 | 0.443 | 0.491 | 0.555 | 0.672
30  | 0.113 | 0.480 | 0.528 | 0.592 | 0.709
40  | 0.151 | 0.518 | 0.566 | 0.630 | 0.747
50  | 0.189 | 0.556 | 0.604 | 0.668 | 0.785
60  | 0.227 | 0.594 | 0.642 | 0.706 | 0.823
70  | 0.265 | 0.632 | 0.680 | 0.744 | 0.861
80  | 0.302 | 0.669 | 0.717 | 0.781 | 0.898
90  | 0.340 | 0.707 | 0.755 | 0.819 | 0.936
100 | 0.378 | 0.744 | 0.785 | 0.857 | 0.974

6. CMP SIMULATION RESULTS
This section gives some preliminary simulation results, which show scalability of performance in executing a single loop kernel. Reference [47] also shows scalability of power for the same simulations. The results are preliminary as they cannot show the relationship between the scalar part of the code and the concurrent part, as a microthreaded compiler is required before we can simulate complete applications. The significant issue, however, is that all of these results were obtained from a single binary code that was hand compiled from the Livermore hydro fragment and executed on a microthreaded CMP using from 1 to 2048 processors. The code performs a fixed number of iterations (64K). Clearly this demonstrates the schedule invariance of microthreaded code.
The results are presented as functions of the number of processors on which the code was run and include: the number of cycles to execute the code; the number of instructions executed (note that failed synchronization will cause an increase in the overall number of instructions executed); IPC, which is computed from the previous two; the number of completely inactive cycles, when the pipe is empty and no threads are able to execute; and finally the L1 D-cache hit rate.
All results use identical processors with 1024 registers per processor and a 512-entry continuation queue. The I-cache is also identical in all results, a relatively small 2-Kbyte 8-way set associative cache with a 32-byte line size. The first set of results is for a relatively complex level-1 D-cache of 64 Kbyte, with 8-way associativity and a 64-byte line size. Figure 8 shows the results. The largest increase in instructions executed due to instruction reissue on failed synchronization is <4% and occurred at 256 processors. The maximum IPC was 1613 and represented a speedup of 1602 times the single-processor execution and occurred, as expected, with 2048 processors. This is nearly 80% efficient. Indeed the speedup was within 4% of the ideal up to 256 processors. The beginning of saturation in speedup is not unexpected given the fixed problem size. When using 2048 processors, for example, the processor's resources (registers and continuation queue slots) are only partially utilized and hence latency tolerance suffers. Also the fixed overhead of thread creation, including broadcast of the TCB pointer to all processors, is amortized over fewer cycles. Note that all simulations are started with cold caches. The overhead is represented by the relatively static 100 or so cycles during which the processors are inactive. As the solution time approaches this, the percentage overhead will increase rapidly. Finally it can be seen that the cache hit rate varies between 75 and 95%. The best hit rate is seen on one processor, but the worst occurs on ∼64 processors, where the resources start to become less efficiently used. Note that the local schedules were set to match the cache line size and hence maximize hit rate.
FIGURE 8. Characteristics of microthreaded kernel execution (64K D-cache).
The second set of results was undertaken to illustrate what effect, if any, the D-cache had on performance. In this set, the D-cache parameters were altered drastically to give a residual cache of just 1 Kbyte with direct-mapped cache lines (n.b. the register file size is substantially larger than this, at 8 Kbyte). Figure 9 shows these results. It can be seen that the number of instructions executed, the time to solution and the IPC are all virtually unchanged. The only significant change is in the cache hit rate, which is now much worse, varying from 40% up to 88% on a single processor. Indeed the minimal time to solution and maximum IPC were, by a small amount, observed with 2048 processors using the residual D-cache.
We note that there are some caveats to these results. They are obtained from the execution of a single independent code kernel. Simulations of dependent loops show that speedup saturates, with the maximum speedup being determined by the ratio of independent to dependent instructions in these kernels. For example, a simulation of a naive multiply-accumulate implementation of matrix multiplication (a dependent loop) saturated with a speedup of between 3 and 4, which was achieved using only four processors. This represents the maximum instruction parallelism of the combined iterations. The results also ignore the scalar component of the program, which we cannot effectively evaluate until we have fully developed a microthreaded compiler. Finally, the memory model used in these simulations is non-blocking, as we do not have a realistic model fully simulated yet. The results use a pipelined memory with an 8-cycle start-up for the first word and 2 cycles per word to complete a cache line transfer. Note that this limitation is not a major issue with the kernel simulated, as data can be distributed in memory according to the local schedule and blocking would be unlikely. In the 64-Kbyte cache only 3% of memory accesses cause a request to second-level memory. The remaining cache misses are same-line misses due to the regularity of scheduling. We are currently working on a second-level memory architecture and will present full simulation results of this when we have a working compiler. The results presented here, however, show great promise for this approach.
FIGURE 9. Characteristics of microthreaded kernel execution (1K D-cache).
7. CONCLUSIONS
The characteristics of advanced integrated circuits (ICs) will in future require powerful and scalable CMP architectures. However, current techniques like wide-issue, superscalar processors suffer from complexity in instruction issue and in the large multi-ported register file required. The complexity of these components grows at least quadratically with increasing issue width; also, execution of instructions using these techniques must proceed speculatively, which does not always provide results for the power consumed. In addition, more on-chip memory is required in order to ameliorate the effects of the so-called 'memory wall'. These obstacles limit the processor's performance, by constraining parallelism or through having large and slow structures. In short, this approach does not provide scalability in a processor's performance, in the on-chip area or in power dissipation.
An alternative solution, which eliminates this complexity in instruction issue and the global register file, and avoids speculation, has been presented in this paper. The model is based on decomposing a sequential program into small fragments of code called microthreads, which are scheduled dynamically and which can communicate and synchronize with each other very efficiently. This process allows sequential code to be compiled for execution on scalable CMPs. Moreover, as the code is schedule invariant, the same code will execute on any number of processors, limited only by problem size. The model exploits ILP within basic blocks and across loop bodies. In addition, this approach supports a pre-fetching mechanism that avoids many instruction-cache misses in the pipeline. The fully distributed register file configuration used in this approach has the additional advantage of full scalability in a CMP, with the decoupling of all forms of communication from the pipeline's operation. This includes memory accesses and communication between micro-contexts.
The distributed implementation of a microthreaded CMP includes two forms of asynchronous communication. The first is the broadcast bus, used for creating threads and distributing invariants. The second is the shared-register ring network, used to perform communication between the register files in the producer and consumer threads. The asynchronous implementation of the bus and switch provides many opportunities for power saving in large CMP systems. The decoupled approach to register-file design avoids a centralized register file organization and, as we have shown, requires a small, fixed number of ports to each processor's register file, regardless of the number of processors in the system.
An analysis of the register-file ports in terms of the frequency of accesses to each logical port is described in this paper. This analysis involved different types of dependent and independent loop kernels and illustrates a number of interesting issues, which can be summarized as follows:
• A single write port with arbitration between different sources is sufficient to support all non-pipeline writes. This port has an average access rate of <100% over normal operating conditions. This is true even in the case of a 100% cache-miss rate.
• A second port is required to handle reads to the $D window. The analysis shows that the average access to this port is <10% over all analysed loop kernels.
• As a consequence, the distributed register files require only five ports per processor and these ports are fixed regardless of the number of processors in the system. This provides a scalable and efficient solution for a large number of processors on-chip.
• Finally, the average accesses to all write ports do not exceed 100% even in the case of n = 256 processors. However, to deal with a larger number of processors, the performance would degrade gracefully due to the inherent latency tolerance of the model. Eventually all threads would be suspended waiting for data and in this case the stalled pipeline(s) would free up contention to the non-pipeline write port.
Finally, we present results of the simulation of an independent loop kernel, which clearly demonstrate schedule invariance of the binary code and linear speedup characteristics over a wide range of processors on which the kernel is scheduled. Clearly, a microthreaded CMP based on a fully distributed and scalable register file organization and asynchronous global communication buses is a good candidate for future CMPs.

REFERENCES
[1] Barroso, L. A., Gharachorloo, K., McNamara, R., Nowatzyk, A., Qadeer, S., Sano, B., Smith, S., Stets, S. and Verghese, B. (2000) Piranha: a scalable architecture based on single-chip multiprocessing. In Proc. 27th Annual Int. Symp. Computer Architecture, Vancouver, British Columbia, Canada, June 12–14, pp. 282–293. ACM Press, New York, NY.
[2] Hammond, L., Hubbert, B. A., Siu, M., Prabhu, M. K., Chen, M. and Olukotun, K. (2000) The Stanford Hydra CMP. IEEE Micro, 20, 71–84.
[3] Hammond, L., Nayfeh, B. A. and Olukotun, K. (1997) A single-chip multiprocessor. IEEE Comput. Soc., 30, 79–85.
[4] Tendler, J. M., Dodson, J. S., Fields, J. S., Le, H. and Sinharoy, B. (2002) Power4 system micro-architecture. IBM J. Res. Develop., 46, 5–25.
[5] Tremblay, M., Chan, J., Chaudhry, S., Conigliaro, A. W. and Tse, S. S. (2000) The MAJC architecture: a synthesis of parallelism and scalability. IEEE Micro, 20, 12–25.
[6] Jesshope, C. R. (2004) Scalable instruction-level parallelism. In Proc. Computer Systems: Architectures, Modeling and Simulation, 3rd and 4th Int. Workshops, SAMOS 2004, Samos, Greece, July 19–21, LNCS 3133, pp. 383–392. Springer.
[7] Bhandarkar, D. (2003) Billion transistor chips in mainstream enterprise platforms of the future. In Proc. 9th Int. Symp. High-Performance Computer Architecture, Anaheim, CA, February 8–12, p. 3. IEEE Computer Society, Washington, DC.
[8] Agarwal, V., Hrishikesh, M. S., Keckler, S. W. and Burger, D. (2000) Clock rate versus IPC: the end of the road for conventional microarchitectures. In Proc. 27th Annual Int. Symp. Computer Architecture, Vancouver, British Columbia, Canada, June 10–14, pp. 248–259. ACM Press, New York, NY.
[9] Onder, S. and Gupta, R. (2001) Instruction wake-up in wide issue superscalars. In Proc. 7th Int. Euro-Par Conf. on Parallel Processing, Manchester, UK, August 28–31, pp. 418–427. Springer-Verlag, London, UK.
[10] Onder, S. and Gupta, R. (1998) Superscalar execution with dynamic data forwarding. In Proc. Int. Conf. Parallel Architectures and Compilation Techniques, Paris, France, October 12–18, pp. 130–135. IEEE Computer Society, Washington, DC.
[11] Olukotun, K., Nayfeh, B. A., Hammond, L., Wilson, K. and Chang, K. (1996) The case for a single-chip multiprocessor. In Proc. Seventh Int. Symp. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-7), Cambridge, MA, October 1–5, pp. 2–11. ACM Press, New York, NY.
[12] Palacharla, S., Jouppi, N. P. and Smith, J. (1997) Complexity-effective superscalar processors. In Proc. 24th Int. Symp. Computer Architecture, Denver, CO, June 1–4, pp. 206–218. ACM Press, New York, NY.
[13] Tullsen, D. M., Eggers, S. and Levy, H. M. (1995) Simultaneous multithreading: maximizing on-chip parallelism. In Proc. 22nd Annual Int. Symp. Computer Architecture, Santa Margherita Ligure, Italy, June 22–24, pp. 392–403. ACM Press, New York, NY.
[14] Rixner, S., Dally, W. J., Khailany, B., Mattson, P. R., Kapasi, U. J. and Owens, J. D. (2000) Register organization for media processing. In Proc. Int. Symp. High Performance Computer Architecture, Toulouse, France, January 8–12, pp. 375–386. IEEE CS Press, Los Alamitos, CA.
[15] Balasubramonian, R., Dwarkadas, S. and Albonesi, D. (2001) Reducing the complexity of the register file in dynamic superscalar processors. In Proc. 34th Int. Symp. on Microarchitecture, Austin, TX, December 1–5, pp. 237–248. IEEE Computer Society, Washington, DC.
[16] Diefendorff, K. and Duquesne, Y. (2002) Complex SOCs require new architectures. EE Times. Available at http://www.eetimes.com/issue/se/OEG20020911S0076.
[17] Ungerer, T., Robec, B. and Silc, J. (2003) A survey of processors with explicit multithreading. ACM Comput. Surveys, 35, 29–63.
[18] Burns, J. and Gaudiot, J.-L. (2001) Area and system clock effects on SMT/CMP processors. In Proc. 2001 Int. Conf. Parallel Architectures and Compilation Techniques, Barcelona, Spain, September 8–12, pp. 211–218. IEEE Computer Society, Washington, DC.
[19] Jesshope, C. R. (2003) Multi-threaded microprocessors evolution or revolution. In Proc. 8th Asia-Pacific Conf. ACSAC'2003, Aizu, Japan, September 23–26, LNCS 2823, pp. 21–45. Springer, Berlin, Germany.
[20] Luo, B. and Jesshope, C. R. (2002) Performance of a micro-threaded pipeline. In Proc. 7th Asia-Pacific Conf. Computer Systems Architecture, Melbourne, Victoria, Australia, January 28–February 2, pp. 83–90. Australian Computer Society, Inc., Darlinghurst, Australia.
[21] Jesshope, C. R. (2001) Implementing an efficient vector instruction set in a chip multi-processor using micro-threaded pipelines. In Proc. ACSAC 2001, Gold Coast, Queensland, Australia, January 29–30, pp. 80–88. IEEE Computer Society, Los Alamitos, CA.
[22] Zhou, H. and Conte, T. M. (2002) Code Size Efficiency in Global Scheduling for VLIW/EPIC Style Embedded Processors. Technical Report, Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, NC.
[23] Hwang, K. (1993) Advanced Computer Architecture. MIT and McGraw-Hill, New York, St Louis, San Francisco.

[24] Sudharsanan, S., Sriram, P., Frederickson, H. and Gulati, A. (2000) Image and video processing using MAJC 5200. In Proc. 2000 IEEE Int. Conf. Image Processing, Vancouver, BC, Canada, September 10–13, pp. 122–125. IEEE Computer Society, Washington, DC.
[25] Cintra, M. and Torrellas, J. (2002) Eliminating squashes through learning cross-thread violations in speculative parallelisation for multiprocessors. In Proc. 8th Int. Symp. High-Performance Computer Architecture, Boston, MA, February 2–6, pp. 43–54. IEEE Computer Society, Washington, DC.
[26] Cintra, M., Martinez, J. S. and Torrellas, J. (2000) Architecture support for scalable speculative parallelization in shared-memory multiprocessors. In Proc. Int. Symp. Computer Architecture, Vancouver, Canada, June 10–14, pp. 13–24. ACM Press, New York, NY.
[27] Terechko, A., Thenaff, E. L., Garg, M. J., Van Eijndhoven, J. V. and Corporaal, H. (2003) Inter-cluster communication models for clustered VLIW processors. In Proc. 9th Int. Symp. High-Performance Computer Architecture, Anaheim, CA, February 8–12, pp. 354–364. IEEE Computer Society, Washington, DC.
[28] Halfhill, T. (1998) Inside IA-64. Byte Magaz., 23, 81–88.
[29] Schlansker, M. S. and Rau, B. R. (2000) EPIC: an architecture for instruction-level parallel processors. Compiler and Architecture Research, HPL-1999-111. HP Laboratories, Palo Alto.
[30] Sundararaman, K. and Franklin, M. (1997) Multiscalar execution along a single flow of control. In Proc. IEEE Int. Conf. Parallel Processing, Bloomington, IL, August 11–15, pp. 106–113. IEEE Computer Society, Washington, DC.
[31] Sohi, G. S., Breach, S. E. and Vijaykumar, T. N. (1995) Multiscalar processors. In Proc. 22nd Annual Int. Symp. Computer Architecture, S. Margherita Ligure, Italy, June 22–24, pp. 414–425. ACM Press, New York, NY.
[32] Breach, S. E., Vijaykumar, T. N. and Sohi, G. S. (1994) The anatomy of the register file in a multiscalar processor. In Proc. 27th Int. Symp. Microarchitecture, San Jose, CA, November 30–December 2, pp. 181–190. ACM Press, New York, NY.
[33] Alverson, R., Callahan, D., Cummings, D., Koblenz, B., Porterfield, A. and Smith, B. (1990) The Tera computer system. In Proc. 4th Int. Conf. Supercomputing, Amsterdam, The Netherlands, June 11–15, pp. 1–6. ACM Press, New York, NY.
[34] Kongetira, P., Aingaran, K. and Olukotun, K. (2005) Niagara: 32-way multithreaded Sparc processor. IEEE Comput. Soc., 25, 21–29.
[35] Marr, D. T., Binns, F., Hill, D. L., Hinton, G., Koufaty, D. A. and Upton, M. (2002) Hyper-threading technology architecture and microarchitecture. Intel Technol. J., 6, 4–15.
[36] Emer, J. (1999) Simultaneous multithreading: multiple Alpha's performance. Presentation at the Microprocessor Forum'99, MicroDesign Resources, San Jose, CA.
[37] Codrescu, L., Wills, D. S. and Meindl, J. D. (2001) Architecture of the Atlas Chip Multiprocessor: dynamically parallelising irregular applications. IEEE Comput. Soc., 50, 67–82.
[38] Diefendorff, K. (1999) Power4 focuses on memory bandwidth: IBM confronts IA-64, says ISA not important. Microprocessor Rep., 13, 11–17.
[39] Preston, R. P. et al. (2002) Design of an 8-wide superscalar RISC microprocessor with simultaneous multithreading. In Proc. 2002 IEEE Int. Solid-State Circuits Conf., San Francisco, CA, February 4–6, pp. 334–335. IEEE Solid-State Circuits, USA.
[40] Scott, L., Lee, L., Arends, J. and Moyer, B. (1998) Designing the low-power M-CORE architecture. In Proc. IEEE Power Driven Micro Architecture Workshop at ISCA98, Barcelona, Spain, June 28, pp. 145–150.
[41] Park, I., Powell, M. D. and Vijaykumar, T. N. (2002) Reducing register ports for higher speed and lower energy. In Proc. 35th Annual ACM/IEEE Int. Symp. Microarchitecture, Istanbul, Turkey, November 18–22, pp. 171–182. IEEE Computer Society, Los Alamitos, CA.
[42] Kim, N. S. and Mudge, T. (2003) Reducing register ports using delayed write-back queues and operand pre-fetch. In Proc. 17th Annual Int. Conf. Supercomputing, San Francisco, CA, June 23–26, pp. 172–182. ACM Press, New York, NY.
[43] Tseng, J. H. and Asanovic, K. (2003) Banked multiported register files for high-frequency superscalar microprocessors. In Proc. 30th Int. Symp. Computer Architecture, San Diego, CA, June 9–11, pp. 62–71. ACM Press, New York, NY.
[44] Bunchua, S., Wills, D. S. and Wills, L. M. (2003) Reducing operand transport complexity of superscalar processors using distributed register files. In Proc. 21st Int. Conf. Computer Design, San Jose, CA, October 13–15, pp. 532–535. IEEE Computer Society, Los Alamitos, CA.
[45] Bolychevsky, A., Jesshope, C. R. and Muchnick, V. (1996) Dynamic scheduling in RISC architectures. IEE Proc. Comput. Digit. Tech., 143, 309–317.
[46] Jesshope, C. R. (2005) Micro-grids—the exploitation of massive on-chip concurrency. In Proc. HPC Workshop 2004, Grid Computing: A New Frontier of High Performance Computing, L. Grandinetti (ed.), Cetraro, Italy, May 31–June 3. Elsevier, Amsterdam.
[47] Bousias, K. and Jesshope, C. R. (2005) The challenges of massive on-chip concurrency. In Proc. Tenth Asia-Pacific Computer Systems Architecture Conference, Singapore, October 24–26, LNCS 3740, pp. 157–170. Springer-Verlag.
[48] Shapiro, D. (1984) Globally Asynchronous Locally Synchronous Circuits. PhD Thesis, Report No. STAN-CS-84-1026, Stanford University.
[49] Shengxian, Z., Li, W., Carlsson, J., Palmkvist, K. and Wanhammar, L. (2002) An asynchronous wrapper with novel handshake circuits for GALS systems. In Proc. IEEE 2002 Int. Conf. Communication, Circuits and Systems, Chengdu, China, June 29–July 1, pp. 1521–1525. IEEE Society, CA.