
Understanding EPIC Architectures and Implementations

Mark Smotherman
Dept. of Computer Science
Clemson University
Clemson, SC, 29634
[email protected]

Abstract- HP and Intel have recently introduced a new style of instruction set architecture called EPIC (Explicitly Parallel Instruction Computing), and a specific architecture called the IPF (Itanium Processor Family). This paper seeks to illustrate the differences between EPIC architectures and former styles of instruction set architectures such as superscalar and VLIW. Several aspects of EPIC architectures have already appeared in computer designs, and these precedents are noted. Opportunities for traditional instruction sets to take advantage of EPIC-like implementations are also examined.

1. Introduction

Instruction level parallelism (ILP) is the initiation and execution within a single processor of multiple machine instructions in parallel. ILP is becoming an increasingly important factor in computer performance. It was first used in the supercomputers of the 1960s, such as the CDC 6600 [34], IBM S/360 M91 [36], and IBM ACS [31], but such efforts were for the most part dropped in the 1970s due to an apparent lack of parallelism in programs generated by then-existing compilers [35] and due to the less attractive performance / implementation-complexity tradeoffs necessary for ILP as compared to simpler cache-based processors, such as the IBM S/360 M85, and as compared to multiprocessor systems.

By the 1980s and 1990s, instruction level parallelism once again became an important approach to computer performance. Alan Charlesworth, Josh Fisher, and Bob Rau were leaders in experimenting with VLIW (very long instruction word) architectures, in which sophisticated compilers uncovered independent instructions within a program and statically scheduled these as multiple concurrent operations in a single wide instruction word. Charlesworth led efforts at FPS (Floating Point Systems) for attached array processors programmed in VLIW style [4]. Fisher led efforts at Yale on a VLIW machine called the ELI-512 and later helped found Multiflow, which produced the Multiflow Trace line of computers [6]. Rau led efforts at TRW on the Polycyclic Processor and later helped found Cydrome, which produced the Cydra-5 computer [25]. Other early VLIW efforts include iWarp and CHoPP.

In contrast to these VLIW-based efforts, other companies were exploring techniques similar to those used in the 1960s where extra hardware would dynamically uncover and schedule independent operations. This approach was called "superscalar" (the term was coined by Tilak Agerwala and John Cocke of IBM) to distinguish it from both traditional scalar pipelined computers and vector supercomputers. In 1989, Intel introduced the first superscalar microprocessor, the i960CA [21], and IBM introduced the first superscalar workstation, the RS/6000 [33]. In 1993 Intel introduced the superscalar Pentium, and since the mid-1990s the AMD or Intel processor in your desktop or laptop has relied on both clock rate and the superscalar approach for performance.

After a few years of operation, Cydrome and Multiflow both closed their doors after failing to establish a large enough market presence in the crowded minisupercomputer market of the 1980s. HP hired Bob Rau and Mike Schlansker of Cydrome, and they began the FAST (Fine-grained Architecture and Software Technologies) research project at HP in 1989; this work later developed into HP's PlayDoh architecture. In 1990 Bill Worley at HP started the PA-Wide Word project (PA-WW, also known as SWS, SuperWorkStation). Josh Fisher, also hired by HP, made contributions to these projects [28].

In 1992, Worley recommended that HP seek a manufacturing partner for PA-WW, and in December 1993 HP approached Intel [8,28]. Cooperation between the two companies was announced in June 1994, and the companies made a joint presentation of their plans at the Microprocessor Forum in October 1997. The term EPIC (Explicitly Parallel Instruction Computing) was coined to describe the design philosophy and architecture style envisioned by HP, and the specific jointly designed instruction set architecture was named IA-64. More recently, Intel has preferred to use IPF (Itanium Processor Family) as the name of the instruction set architecture. Itanium is the name of the first implementation (it was previously called by the project codename Merced) [29], and currently Itanium-based systems can be purchased from HP, Dell, and Compaq, with many other system manufacturers committed to selling Itanium-based systems.

2. Three Major Tasks for ILP Execution

Processing instructions in parallel requires three major tasks: (1) checking dependencies between instructions to determine which instructions can be grouped together for parallel execution; (2) assigning instructions to the function units on the hardware; and (3) determining when instructions are initiated (i.e., start execution) [27]. (Note: This departs from the earlier Rau and Fisher paper [24]; the three tasks identified there are: determine dependencies, determine independencies, and bind resources.) Four major classes of ILP architectures can be differentiated by whether these tasks are performed by the hardware or the compiler.

                  Grouping   Fn unit asgn   Initiation
    Superscalar   Hardware   Hardware       Hardware
    EPIC          Compiler   Hardware       Hardware
    Dynamic VLIW  Compiler   Compiler       Hardware
    VLIW          Compiler   Compiler       Compiler

Table 1. Four Major Categories of ILP Architectures.

Table 1 identifies the four classes of ILP architectures that result from performing the three tasks either in hardware or the compiler. A superscalar processor is one with a traditional, sequential instruction set in which the semantics (i.e., meaning) of a program is based on a sequential machine model. That is, a program's results should be the same as if the instructions were individually processed on a sequential machine where one instruction must be completed before the next one is examined. A superscalar processor includes the necessary hardware to speed up program execution by fetching, decoding, issuing, executing, and completing multiple instructions each cycle, yet in such a way that the meaning of the program is preserved. The decoding and issuing of multiple instructions requires dependency-checking hardware for instruction grouping, decoding and routing hardware for assignment of instructions to function units, and register scoreboard hardware for timing the initiation of instruction execution. The dependency-checking hardware does not scale well (O(n²)) and has been seen as a limit to the width of multiple instruction issue in superscalars.

At the opposite extreme is VLIW. The three responsibilities for ILP are each assigned to the compiler. The implementation of a VLIW computer uses long instruction words that provide a separate operation for each function unit on each cycle (similar to horizontal microprogramming). The width of the instruction word depends on the number of function units; e.g., Multiflow produced machines with long instruction words up to 28 operations wide. Groups of independent operations are placed together into a single VLIW, and the operations are assigned to function units by position in the given fields within the long instruction word ("slotting"). The initiation timing is bound by the instruction word in which an operation appears; all operations in a VLIW start execution in parallel.

A sequence of long instruction words thus defines the plan of execution for a particular program on a particular implementation, the plan being specified by the sequence of VLIW instructions cycle by cycle [28]. It is the responsibility of the compiler to determine which operations can be grouped together and where they must be placed in this sequence of long instruction words. However, this also represents the Achilles heel of VLIW architectures: the problem of compatibility between implementations. Code compiled for one implementation with a certain set of function units with certain latencies will not execute correctly on a different implementation with a different set of function units and/or different latencies (although there have been studies directed at providing compatibility, e.g., see [7]). In contrast, compatibility is not a problem with superscalars, and this is a major reason for their popularity.

Table 1 suggests intermediate architectures between superscalars and VLIWs with varying amounts of compiler responsibility (see also [27]). If the compiler determines the grouping of independent instructions and communicates this via explicit information in the instruction set, we have what Fisher and Rau termed an "independence architecture" [24] or what is now known as the EPIC architecture style [28]. EPIC retains compatibility across different implementations as do superscalars but does not require the dependency-checking hardware of superscalars. In this manner, EPIC can be said to combine the best of both superscalar and VLIW architectures. The first EPIC architecture appears to be Burton Smith's Horizon (in 1988), which provided an explicit lookahead count field of the distance to the next dependent instruction [19], although Lee Higbie sketched an EPIC-like approach some ten years earlier (in 1978) in which concurrency control bits are added to the instruction format and set by the compiler or programmer [13].

Another category of ILP architecture is one in which the grouping and function unit assignment is done by the compiler, but the initiation timing of the operations is done by hardware scheduling. This style is called dynamic VLIW [26], and it has some advantage over traditional VLIW since it can respond to events at run time that cannot be handled by the compiler at compile time. For example, early VLIW designs did not include data caches, since a cache miss would disrupt the sequence of long instruction words by invalidating the compiler's assumption of latency
[Figure 1: diagram showing the division of the tasks of code generation, instruction grouping, function unit assignment, and initiation timing between the compiler and the hardware, with horizontal lines marking the four levels corresponding to superscalar, EPIC, dynamic VLIW, and VLIW architectures.]

Figure 1. Graphical Depiction of the Three Major Tasks.

for load instructions. Thus in a simple dynamic VLIW approach, we can add a load-miss interlock to otherwise bare hardware and stall the entire machine on a data cache miss.

Along these lines, Rau paid special attention to memory latency in the Cydra 5 design by use of a "memory collating buffer" which handled the early, and possibly out-of-order, arrival of values loaded from memory so as to preserve the static memory access latency assumptions made by the compiler; late arrivals delayed the entire machine [25,26]. (See also the discussion of LEQ semantics in [27].)

A more complicated architecture handles run-time events not by merely delaying the initiation of the next group, but by adding what is essentially dynamic scheduling hardware for the individual operations within each VLIW. (Although called dynamic VLIW by Rudd [26] and others, because of the complexity of the hardware this approach might actually be considered as a fifth category.) Instruction execution is split into two (or three) phases, with the first phase statically scheduled to read the registers, compute a result, and write the result to a temporary results buffer. The second phase will move results from the buffer into the register file. (Note the extra hardware and buffering, similar to what is found in a superscalar processor.) Rudd's simulations suggest that there is little performance to be gained from introducing this level of complexity [26].

Figure 1 is a revision of Figure 2 from Rau and Fisher [24] using the responsibilities identified above, and shows the three responsibilities as performed by the compiler or by the hardware. The horizontal lines demonstrate the four levels at which information about the program can be given to the hardware. At the top level, a traditional instruction set is used and the hardware must perform the three tasks. There is no information in the instruction set to convey independent instruction groups, function unit assignment, or instruction timing.

The dashed lines within the compiler box indicate that the compiler may go ahead and do all three tasks as required for best performance on a particular implementation, but supply the instructions in a less semantically-rich instruction set. In such a case, the hardware has to rediscover the independent groups among the instructions that the compiler has already arranged within the instruction stream, and it has to repeat the function unit assignment and instruction initiation timing.

As an example of the benefit of scheduling even for a superscalar processor, consider the HP PA-8000. It will run code generated for any PA-RISC 1.1 or 2.0 processor, but Holler reports that SPECint95 benchmarks ran 38% faster and SPECfp95 benchmarks ran 53% faster when specifically compiled for the PA-8000 as compared to running those same benchmarks compiled for the previous-generation PA-7200 [14].

The other levels in Figure 1 at which programs can be conveyed to the hardware add more information to the instruction set and thus require less hardware. For example, let us assume a machine with two load/store units, an integer ALU, and a branch unit, with latencies 2, 2, 1, 2, respectively. If we wish to perform a simple addition, C = A + B, the code given to a superscalar would be something like this:

    Load  R1,A
    Load  R2,B
    Add   R3,R1,R2
    Store C,R3
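The grouping task that must be performed for this four-instruction sequence can be sketched in a few lines of Python. This is only an illustrative sketch, not a model of any real issue logic: instructions are represented as made-up (name, destination, sources) tuples, and groups are formed greedily by checking register dependencies.

```python
# Illustrative sketch: greedily form issue groups of independent
# instructions by tracking the destination registers written so far
# in the current group (checks RAW and WAW hazards only; WAR hazards
# and structural limits on function units are ignored for simplicity).
def group_instructions(instrs):
    groups, current, dests = [], [], set()
    for name, dest, srcs in instrs:
        if any(s in dests for s in srcs) or dest in dests:
            groups.append(current)   # dependency found: close the group
            current, dests = [], set()
        current.append(name)
        dests.add(dest)
    if current:
        groups.append(current)
    return groups

# The C = A + B sequence from the text, as (name, dest, sources) tuples.
code = [
    ("Load R1,A",    "R1", []),
    ("Load R2,B",    "R2", []),
    ("Add R3,R1,R2", "R3", ["R1", "R2"]),
    ("Store C,R3",   "C",  ["R3"]),
]
print(group_instructions(code))
# -> [['Load R1,A', 'Load R2,B'], ['Add R3,R1,R2'], ['Store C,R3']]
```

The three resulting groups match the grouping discussed below: the two loads can issue together, followed by the add, then the store. An EPIC compiler computes this grouping once, statically, and encodes it in the instruction set (e.g., as lookahead counts or stops), whereas a superscalar repeats the computation in hardware every time the code is fetched.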
The hardware has to determine that the loads are independent and can be grouped together, while the add is dependent on both and must be in a separate group. Likewise the store is dependent on the add and must be placed in a third group. The hardware will assign the instructions to the different function units based on the operation codes, and the register scoreboarding hardware will govern the initiation of the add and store. (Note: because of special forwarding paths or cascaded function units, some superscalar processors like the IBM RS/6000 and the TI SuperSPARC can start the execution of certain pairs of dependent instructions at the same time.)

If we look at the corresponding VLIW program, we see that the compiler has completely planned out the grouping, function unit assignment, and initiation timing. This is the complete plan of execution and relies on a particular implementation and on particular latencies. Thus, this VLIW program would be invalid for an implementation with only one load/store unit, whereas the traditional code for a superscalar processor as given above would run without any changes being necessary.

    ld/st unit 0   ld/st unit 1   integer alu    branch unit
    Load R1,A      Load R2,B      nop            nop
    nop            nop            nop            nop
    nop            nop            Add R3,R1,R2   nop
    Store C,R3     nop            nop            nop

The VLIW program also illustrates a difficulty of low utilization of the long instruction word fields. Of the 16 total fields in the four long instruction words, 12 are empty and have a no-operation placed in them. Multiflow recognized this inefficiency and provided a compression scheme, in which VLIW programs existed on disk and in main memory in an encoded format [6]. Long instruction words from the program were expanded to the traditional fixed-length VLIW format when they were fetched into the instruction cache. Several VLIW processors now use similar compression schemes, including the Lucent StarCore, Philips TriMedia, Sun MAJC, and TI C6x processors.

If we add register scoreboarding to handle dynamic events like cache misses, either on a long-instruction-word-wide basis or on a function-unit-by-function-unit basis, we move to a dynamic VLIW architecture. Note that in this case we can omit the second long instruction word above, since it only has nops. This further improves the utilization of instruction memory. Two computers that can be placed in this category appeared in the late 1980s: the Intel i860 microprocessor [18] and the Apollo Domain DN10000 workstation. In each computer, an integer pipeline can run in parallel with a floating-point pipeline. The instruction formats in these computers include a bit to specify an integer and floating-point instruction pair that can be initiated in parallel (they are slotted in a fixed order in memory to correspond to the function unit assignment).

The EPIC style of instruction set for this example must have grouping information, such as a count of independent instructions. For example, if we add a Horizon-like lookahead count (given in parentheses for all but the last instruction), we obtain:

    (2) Load  R1,A
    (1) Load  R2,B
    (1) Add   R3,R1,R2
    (.) Store C,R3

The hardware can use the lookahead count to group the two loads together. Note that hardware function unit assignment and hardware instruction initiation timing are still required.

An analogy to these distinctions that might be helpful in presenting these ideas in the classroom is the example of designing and building a simple wooden stool. The design represents a program, and the construction and assembly of the stool (e.g., top, two legs, cross-brace) represent the operations. The designer will send the plan to the woodshop, which represents the processor. If there are several machines and workers in the woodshop (i.e., multiple function units), a shop foreman would set up a complete plan of building the stool. This plan would determine which parts could be constructed or assembled in parallel, which machines or tools would be used, and when construction or assembly activities would start. E.g.,

    table saw   band saw    hand saw   hammer
    Cut brace   Cut leg 0
    Cut top     Cut leg 1              Nail leg 0 to brace
                                       Nail leg 1 to brace
                                       Nail legs to top

To correspond to the superscalar approach, the designer walks to the wood shop and hands the design to the shop foreman, who must then do the planning there in the shop as construction proceeds. To correspond to the VLIW approach, the designer and the shop foreman are the same person; the design includes the detailed plan of building as illustrated above. This plan is necessarily shop-specific, since some shops might not have a band saw. Instead, in this second shop a hand saw must be used, and the time for cutting the legs will lengthen, thereby forcing the nailing to start later. The plan is thus not compatible across wood shops. (A dynamic VLIW analog might be where the time to complete hand sawing is unknown in the second shop and nailing activities start only when sawing on certain parts is completed.)

To complete the analogy for the EPIC approach, the dependencies can be given in the design. For instance, the cutting of the brace, top, and legs are all independent; but nailing cannot start until at least two parts are completed. The design and independence information make no assumptions about which particular tools will be present in a shop (i.e., at least one cutting-type tool is assumed, but the specific type is not necessarily set down in the plan), nor assumptions about the length of time required to construct or build. The independence information assists the foreman in the shop in setting up an efficient plan of building.

3. Characteristics of EPIC Architectures and Historical Precedents

3.1. Explicit Parallelism

As described above, explicit information on independent instructions in the program is a major distinguishing feature of EPIC architectures. In the IPF architecture, three 41-bit instructions are packaged together into a 128-bit "bundle", which is the unit of instruction fetch. A 5-bit template identifies the instruction type and any architectural stops between groups of instructions. In little-endian format, a bundle appears as:

    Instruction 2     Instruction 1     Instruction 0     Template
    (bits 127-87)     (bits 86-46)      (bits 45-5)       (bits 4-0)

Bundles can have zero, one, or at most two stops. Instruction groups (i.e., sets of independent instructions) thus can span instruction bundles. Nops may be needed to pad out the bundles in some cases. The instruction type (one of six types: integer alu, non-alu integer, memory, floating-point, branch, and extended) can help in function unit assignment and routing during decoding, but this information provides type information rather than specific function unit identification. Thus, it is not in the dynamic or traditional VLIW category. (Note that not all combinations of instruction type and stop boundaries are available -- encoding all 6³*2³ cases would have required an 11-bit template.) S. Vassiliadis at IBM proposed a similar instruction bundling scheme, called SCISM, in the early 1990s [37].

Schlansker and Rau list five other attributes of EPIC architectures beyond instruction grouping [28]. The first two deal with eliminating and/or speeding up branching, the third with cache locality management, and the final two with starting load instructions as early as possible.

3.2. Predicated execution

To avoid conditional branches, each instruction can be conditioned or predicated on a true/false value in a predicate register. Only those instructions with a true predicate are allowed to write into their destination registers. Thus, if-then-else sequences can be compiled without branches (called "if conversion"). Instructions from each side of the decision are predicated with one of two inversely-related predicate registers and can be executed in parallel. (If predicate values are available in time, an implementation can delete instructions with false predicates in the decode or issue stages.)

IPF provides 64 predicate registers. Each register can hold one bit (true or false) and is set by compare instructions. In the normal case, a compare instruction writes to two predicate registers, one with the result of the compare and one with the inverted result, so that if-converted code can make use of this register pair.

The idea of predication dates back to at least 1952, when the IBM 604 plugboard-controlled computer included a suppression bit in each instruction format and programmers could provide if-converted segments of code [2]. Predication has been an important part of several instruction sets, including Electrologica X8 (1965), IBM ACS (1967), ARM (1986), Cydra-5 (1988), and Multiflow (1990) [2,24]. Other instruction sets without extra bits to spare in instruction formats have added a conditional move instruction, which provides for "partial predication".

3.3. Unbundled branches

Conditional branches are composed of three separate actions: (1) making a decision to branch or not; (2) providing the target address; and (3) the actual change of the PC. By separating these actions, multiple comparisons can be made in parallel, earlier in the instruction stream. Moreover, multiple targets can be specified, and instructions can be prefetched from those paths. Thus, the change of the PC can be delayed until an explicit branch instruction or set of branch instructions, having the effect of a multiway, prioritized branch.

The IPF architecture uses the predicate registers to record the results of comparisons and includes eight branch registers for use in prefetching. Branch instructions in IPF are made conditional by use of a predicate and can specify a branch register (action 3) or relative address (actions 2+3).

The decomposition of branches into separate actions is an idea that has been independently rediscovered several times, but the decomposition into actions 1+2 as a branch-on-condition instruction and action 3 as a separate exit instruction that chose among the currently active targets was part of the IBM ACS-1 instruction set in the mid-1960s [31].

3.4. Compiler control of the memory hierarchy

EPIC architectures should be able to provide hints to the hardware about the probable latency of a load operation (i.e., where in the memory hierarchy a data value will be found) and the probable locality of a loaded or stored data item (i.e., where in the memory hierarchy to place a data value). These are hints rather than exact operation timings, so register interlocks or scoreboarding techniques are still used.

IPF provides hints as given in Table 2 and also provides prefetching stride information by use of a base-update addressing mode. Because of the low temporal locality of vector operands, data cache bypass was a feature of some
vector processors. The Intel i860 was perhaps the first processor to offer two types of scalar load instructions, one of which would bypass the cache [18]. In 1994, the HP 7200 included temporal locality hints as part of the normal load/store instructions [20]. Several instruction set architectures since that time have included locality hints, typically as part of software prefetch instructions (e.g., Alpha, MIPS, SPARC v.9).

    hint                              Store   Load   Fetch
    Temporal locality / L1            Yes     Yes    Yes
    No temporal locality / L1                 Yes    Yes
    No temporal locality / L2                        Yes
    No temporal locality / all levels Yes     Yes    Yes

Table 2. Cache Hints in IPF

3.5. Control speculation

To start loads (or other potentially long-running instructions) early, they must often be moved up beyond a branch. The problem with this approach occurs when the load (or other instruction) generates an exception. If the branch is taken, the load (or other instruction) would not have been executed in the original program and thus the exception should not be seen. To allow this type of code scheduling, an EPIC architecture should provide a speculative form of load (or other long-running instruction) and tagged operands. When the speculative instruction causes an exception, the exception is deferred by tagging the result with the required information. The exception is handled only when a nonspeculative instruction reads the tagged operand (in fact, multiple instructions may use the tagged operand in the meantime and merely pass the tag on). Thus, if the branch over which the instruction is moved is not taken, no exception occurs, thereby following the semantics of the original program.

IPF provides speculative load and speculation check instructions. Integer speculative loads set a NaT (Not a Thing) bit associated with the integer destination register when an exception is deferred. Floating-point speculative loads place a NaTVal (Not a Thing Value) code in the floating-point destination register when an exception is deferred. These bits and encoded values propagate through other instructions until a speculation check instruction is executed. At that point a NaT bit or NaTVal special value will raise the exception.

The need to start loads early can be seen as far back as Konrad Zuse's Z4 computer, constructed in Germany during the Second World War. The instruction stream was read two instructions in advance; and, if a load was encountered, it was started early [2]. The IBM Stretch (1961) used a separate index processor to pre-execute index-related instructions in the instruction stream and start loads early [3]. Moving instructions across branches was a vital aspect of Fisher's trace scheduling [24], and Ebcioglu discussed conditional execution of instructions based on branches in 1987 [9]. Multiflow introduced special non-faulting loads and used IEEE floating-point NaN propagation rules for supporting control speculation [6]. Deferring exceptions from speculative loads appears to have been first presented by Smith, Lam, and Horowitz in 1990 [30].

3.6. Data speculation

To be able to rearrange load and store instructions, the compiler must know the memory addresses to which the instructions refer. Because of aliasing, compilers are not always able to do this at compile time. In the absence of exact alias analysis, most compilers must settle for safe but slower (i.e., unreordered) code. EPIC architectures provide speculative loads that can be used when an alias situation is unlikely but still possible. A speculative load is moved earlier in the schedule to start the load as early as possible; and, at the place where the loaded value is needed, a data-verifying load is used instead. If no aliasing has occurred, then the value retrieved by the speculative load is used by the data-verifying load instruction. Otherwise, the data-verifying load reexecutes the load to obtain the new value. IPF provides advanced load and advanced load check instructions that use an Advanced Load Address Table (ALAT) to detect stores that invalidate advanced load values.

The IBM Stretch (1961) started loads early, as mentioned above. The lookahead unit checked the memory address of a store instruction against subsequent loads and on a match cancelled the load and forwarded the store value to the buffer reserved for the loaded value (only one outstanding store was allowed at a time) [3]. The CDC 6600 (1964) memory stunt box performed similar actions [34].

4. Alternate Translation Times

To this point, we have assumed a standard compilation model, which includes steps such as compilation, linking, loading, and execution. We have assumed that either the compiler or the hardware does the three major tasks of managing ILP. However, alternatives exist. For example, even within the compilation model, nontraditional points of translation and optimization have been proposed, such as reallocating registers and/or repositioning procedures at link time for better performance. Additionally, other nontraditional points are available during execution, such as software-based translation at page-fault time [7], hardware-based translation at icache-miss time [1,22], hardware-based capture and caching of parallel issue [11,23], or various dynamic optimizations during execution (e.g., software
[Figure 2: diagram showing code generation and instruction grouping performed by the compiler, with function unit assignment and initiation timing performed by a predecoder in the instruction unit; the contents of the executable file (a traditional instruction set) become the contents of the instruction cache (a dynamic VLIW format).]

Figure 2. Nontraditional Assignment of ILP Management Tasks.

based [10] or hardware-based [5]). At any of these additional points, translation from a traditional instruction set into an EPIC or VLIW internal format is possible. Indeed, a typical compiler optimization step of if-conversion has been proposed as a run-time action using either software [12] or hardware [17]. Transmeta provides a run-time software approach that translates x86 instructions into an internal VLIW format, which they call "code-morphing" and which includes data and control speculation [15,16].
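If-conversion, whether performed by the compiler or at run time, turns a branch hammock into predicated straight-line code. The following sketch (function names invented for illustration) shows the transformation in miniature: the branch's condition becomes a predicate, both arms are evaluated, and the predicate selects which result is kept, turning a control dependence into a data dependence.

```python
def original(a, b):
    # The branch hammock before if-conversion.
    if a > b:
        r = a - b
    else:
        r = b - a
    return r

def if_converted(a, b):
    # After if-conversion: the compare sets a predicate, both arms
    # execute unconditionally, and the predicate selects the result
    # (modeled here with a conditional expression).
    p = a > b        # cmp: predicate p (and implicitly its complement)
    t1 = a - b       # (p)  arm
    t2 = b - a       # (!p) arm
    return t1 if p else t2
```

The two functions agree on every input, but the converted form contains no branch to predict, so the block can be scheduled as straight-line code.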
An early example of the run-time hardware approach is the National Semiconductor Swordfish processor (1990). Instructions from a traditional instruction set were examined at instruction cache miss by a hardware predecoder. The predecoder checked the instruction types and stored pairs of instructions with a grouping bit for parallel issue in the instruction cache [32]. Register scoreboarding was still performed at decode time, so this scheme looks like a traditional superscalar processor from the outside but is actually a dynamic-VLIW processor internally. Figure 2 illustrates this approach.
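Predecode-time grouping of this kind can be sketched as follows. The instruction encoding (a tuple of opcode, destination, and sources) and the dependence rules are simplifications invented for this sketch; the actual Swordfish predecoder also checked instruction types against the available function units.

```python
# Sketch of predecode-time grouping: on an icache miss, a predecoder
# inspects adjacent instructions and stores a grouping bit alongside
# them, marking pairs that may issue in parallel.

def independent(first, second):
    """The second instruction may pair with the first if it neither
    reads the first's result (RAW) nor overwrites its destination (WAW)."""
    _, dest1, _ = first
    _, dest2, srcs2 = second
    return dest1 not in srcs2 and dest1 != dest2

def predecode(block):
    """Return cache-line contents as (instruction, group_bit) pairs.
    group_bit = True means 'issues in parallel with the previous one'."""
    lines = []
    i = 0
    while i < len(block):
        lines.append((block[i], False))
        if i + 1 < len(block) and independent(block[i], block[i + 1]):
            lines.append((block[i + 1], True))   # pair issues together
            i += 2
        else:
            i += 1
    return lines

prog = [
    ("add", "r1", ("r2", "r3")),
    ("mul", "r4", ("r5", "r6")),   # independent of the add: paired
    ("sub", "r7", ("r4", "r1")),   # reads r4 and r1: not paired
]
```

Because the grouping decision is cached with the instructions, it is paid once per icache miss rather than on every issue cycle, which is precisely the appeal of the dynamic-VLIW organization.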
5. Conclusions

EPIC architectures are a new style of instruction set for computers. They are the skillful combination of several preexisting ideas in computer architecture along with a nontraditional assignment of the responsibilities in ILP processing between the compiler and the hardware. As such, EPIC architectures can claim to combine the best attributes of superscalar processors (compatibility across implementations) and VLIW processors (efficiency, since less control logic is required). Through nontraditional translation, current traditional instruction sets can be used, but the combined hardware and software system can exploit the efficiency of VLIW and EPIC implementations.

References

[1] S. Banerjia, et al., "MPS: Miss-Path Scheduling for Multiple Issue Processors," IEEE Transactions on Computers, December 1998, pp. 1382-1397.
[2] G. Blaauw and F. Brooks, Jr., Computer Architecture: Concepts and Evolution. Reading, MA: Addison-Wesley, 1997.
[3] W. Buchholz (ed.), Planning a Computer System. New York: McGraw-Hill, 1962.
[4] A. Charlesworth, "An Approach to Scientific Array Processing: The Architectural Design of the AP-120B/FPS-164 Family," IEEE Computer, September 1981, pp. 18-27.
[5] Y. Chou, et al., "PipeRench Implementation of the Instruction Path Coprocessor," in Proceedings of Micro-33, December 2000, pp. 147-158.
[6] R. Colwell, et al., "A VLIW Architecture for a Trace Scheduling Compiler," IEEE Transactions on Computers, August 1988, pp. 967-979.
[7] T. Conte and S. Sathaye, "Dynamic Rescheduling: A Technique for Object Code Compatibility in VLIW Architectures," in Proceedings of Micro-28, Ann Arbor, November 1995, pp. 208-217.
[8] J. Crawford, "Introducing the Itanium Processors," IEEE Micro, September-October 2000, pp. 9-11.
[9] K. Ebcioglu, "A Compilation Technique for Software Pipelining of Loops with Conditional Jumps," in Proceedings of Micro-20, Colorado Springs, December 1987, pp. 69-79.
[10] K. Ebcioglu and E. Altman, "DAISY: Dynamic Compilation for 100% Architectural Compatibility," in Proceedings of ISCA-24, Denver, June 1997, pp. 26-37.
[11] M. Franklin and M. Smotherman, "A Fill-Unit Approach to Multiple Instruction Issue," in Proceedings of Micro-27, San Jose, December 1994, pp. 162-171.
[12] K. Hazelwood and T. Conte, "A Lightweight Algorithm for Dynamic If-Conversion During Dynamic Optimization," in Proceedings of PACT, 2000, pp. 71-80.
[13] L. Higbie, "Overlapped Operation with Microprogramming," IEEE Transactions on Computers, March 1978, pp. 270-275.
[14] A. Holler, "Compiler Optimizations for the PA-8000," in Proceedings of Compcon 97, San Jose, February 1997, pp. 87-94.
[15] E. Kelly, R. Cmelik, and M. Wing, "Memory Controller for a Microprocessor for Detecting a Failure of Speculation on the Physical Nature of a Component Being Addressed," US Patent 5,832,205.
[16] A. Klaiber, "The Technology Behind Crusoe Processors," Transmeta Corporation, January 2000.
[17] A. Klauser, et al., "Dynamic Hammock Predication for Non-predicated Instruction Set Architectures," in Proceedings of PACT, 1998, pp. 278-285.
[18] L. Kohn and N. Margulis, "Introducing the Intel i860 64-bit Microprocessor," IEEE Micro, August 1989, pp. 15-30.
[19] J. Kuehn and B. Smith, "The Horizon Supercomputing System: Architecture and Software," in Proceedings of Supercomputing 88, Orlando, November 1988, pp. 28-34.
[20] G. Kurpanek, et al., "PA 7200: A PA-RISC Processor with Integrated High-Performance MP Bus Interface," in Proceedings of Compcon 94, San Francisco, February 1994, pp. 375-382.
[21] S. McGeady, "The i960CA Superscalar Implementation of the 80960 Architecture," in Proceedings of Compcon 90, San Francisco, January 1990, pp. 232-239.
[22] K. Minagawa, M. Saito, and T. Aikawa, "Pre-decoding Mechanism for Superscalar Architecture," in Proceedings of IEEE Pacific Rim Conference on Communications, Victoria B.C., May 1991, pp. 21-24.
[23] R. Nair and M. Hopkins, "Exploiting Instruction Level Parallelism in Processors by Caching Scheduled Groups," in Proceedings of ISCA-24, June 1997, pp. 13-25.
[24] B. Rau and J. Fisher, "Instruction-Level Parallel Processing: History, Overview, and Perspective," Journal of Supercomputing, July 1993, pp. 9-50.
[25] B. Rau, et al., "The Cydra 5 Departmental Supercomputer," IEEE Computer, January 1989, pp. 12-35.
[26] K. Rudd, "VLIW Processors: Efficiently Exploiting Instruction Level Parallelism," Ph.D. dissertation, Computer Science Dept., Stanford University, 1999.
[27] M. Schlansker, et al., "Achieving High Levels of Instruction-Level Parallelism with Reduced Hardware Complexity," HP Labs Tech. Rept. HPL-96-120, November 1994.
[28] M. Schlansker and B. Rau, "EPIC: Explicitly Parallel Instruction Computing," IEEE Computer, February 2000, pp. 37-45. (See also "EPIC: An Architecture for Instruction-Level Parallel Processors," HP Labs Tech. Rept. HPL-1999-111, February 2000.)
[29] H. Sharangpani and K. Arora, "Itanium Processor Microarchitecture," IEEE Micro, September-October 2000, pp. 24-43.
[30] M. Smith, M. Lam, and M. Horowitz, "Boosting Beyond Static Scheduling in a Superscalar Processor," in Proceedings of ISCA-17, Seattle, May 1990, pp. 344-354.
[31] M. Smotherman, "IBM Advanced Computing Systems – A Secret 1960's Supercomputer Project," available online at http://www.cs.clemson.edu/~mark/acs.html
[32] M. Smotherman, "National Semiconductor Swordfish," available online at http://www.cs.clemson.edu/~mark/swordfish.html
[33] Special issue, "IBM RISC System/6000 Processor," IBM Journal of Research and Development, January 1990.
[34] J. Thornton, Design of a Computer – The Control Data 6600. Glenview, IL: Scott, Foresman, and Co., 1970.
[35] G. Tjaden and M. Flynn, "Detection and Parallel Execution of Independent Instructions," IEEE Transactions on Computers, October 1970, pp. 889-895.
[36] R. Tomasulo, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units," IBM Journal of Research and Development, January 1967, pp. 25-33.
[37] S. Vassiliadis, B. Blaner, and R. Eickemeyer, "SCISM: A Scalable Compound Instruction Set Machine," IBM Journal of Research and Development, January 1994, pp. 59-78.
