
From the companion CD-ROM to the IEEE CS Press book, "The Anatomy of a Microprocessor: A Systems Perspective," by Shriver & Smith.

Originally published in Proc. Computer Architecture, 1985, pp. 34–44. Copyright © 1985 IEEE. All rights reserved.

Implementation of Precise Interrupts in


Pipelined Processors

James E. Smith

Department of Electrical and Computer Engineering


University of Wisconsin-Madison
Madison, WI 53706

Andrew R. Pleszkun

Computer Sciences Department


University of Wisconsin-Madison
Madison, WI 53706

Abstract

An interrupt is precise if the saved process state corresponds with the sequential model of program execution where
one instruction completes before the next begins. In a pipelined processor, precise interrupts are difficult to achieve
because an instruction may be initiated before its predecessors have been completed. This paper describes and
evaluates solutions to the precise interrupt problem in pipelined processors.
The precise interrupt problem is first described. Then five solutions are discussed in detail. The first forces instructions to complete and modify the process state in architectural order. The other four allow instructions to complete in any order, but additional hardware is used so that a precise state can be restored when an interrupt occurs. All the methods are discussed in the context of a parallel pipeline structure. Simulation results based on the CRAY-1S scalar architecture are used to show that, at best, the first solution results in a performance degradation of about 16%. The remaining four solutions offer similar performance, and three of them result in as little as a 3% performance loss. Several extensions, including virtual memory and linear pipeline structures, are briefly discussed.

1. Introduction

Most current computer architectures are based on a sequential model of program execution in which an architectural program counter sequences through instructions one-by-one, finishing one before starting the next. In contrast, a high performance implementation may be pipelined, permitting several instructions to be in some phase of execution at the same time. The use of a sequential architecture and a pipelined implementation clash at the time of an interrupt; pipelined instructions may modify the process state in an order different from that defined by the sequential architectural model. At the time an interrupt condition is detected, the hardware may not be in a state that is consistent with any specific program counter value.

When an interrupt occurs, the state of an interrupted process is typically saved by the hardware, the software, or by a combination of the two. The process state generally consists of the program counter, registers, and memory. If the saved process state is consistent with the sequential architectural model, then the interrupt is precise. To be more specific, the saved state should satisfy the following conditions:
(1) All instructions preceding the instruction indicated by the saved program counter have been executed and have modified the process state correctly.

(2) All instructions following the instruction indicated by the saved program counter are unexecuted and have not modified the process state.

(3) If the interrupt is caused by an exception condition raised by an instruction in the program, the saved program counter points to the interrupted instruction. The interrupted instruction may or may not have been executed, depending on the definition of the architecture and the cause of the interrupt. Whichever is the case, the interrupted instruction has either completed or has not started execution.

If the saved process state is inconsistent with the sequential architectural model and does not satisfy the above conditions, then the interrupt is imprecise.

This paper describes and compares ways of implementing precise interrupts in pipelined processors. The methods used are designed to modify the state of an executing process in a carefully controlled way. The simpler methods force all instructions to update the process state in the architectural order. Other, more complex methods save portions of the process state so that the proper state may be restored by the hardware at the time an interrupt occurs.

1.1. Classification of Interrupts

We consider interrupts belonging to two classes:

(1) Program interrupts, sometimes referred to as "traps," result from exception conditions detected during fetching and execution of specific instructions. These exceptions may be due to software errors, for example trying to execute an illegal opcode, numerical errors such as overflow, or they may be part of normal execution, for example page faults.

(2) External interrupts are not caused by specific instructions and are often caused by sources outside the currently executing process, sometimes completely unrelated to it. I/O interrupts and timer interrupts are examples.

For a specific architecture, all interrupts may be defined to be precise or only a proper subset. Virtually every architecture, however, has some types of interrupts that must be precise. There are a number of conditions under which precise interrupts are either necessary or desirable:

1. For I/O and timer interrupts a precise process state makes restarting possible.

2. For software debugging it is desirable for the saved state to be precise. This information can be helpful in isolating the exact instruction and circumstances that caused the exception condition.

3. For graceful recovery from arithmetic exceptions, software routines may be able to take steps, re-scale floating point numbers for example, to allow a process to continue. Some end cases of modern floating point arithmetic systems might best be handled by software; gradual underflow in the proposed IEEE floating point standard [Stev81], for example.

4. In virtual memory systems precise interrupts allow a process to be correctly restarted after a page fault has been serviced.

5. Unimplemented opcodes can be simulated by system software in a way transparent to the programmer if interrupts are precise. In this way, lower performance models of an architecture can maintain compatibility with higher performance models using extended instruction sets.

6. Virtual machines can be implemented if privileged instruction faults cause precise interrupts. Host software can simulate these instructions and return to the guest operating system in a user-transparent way.

1.2. Historical Survey

The precise interrupt problem is as old as the first pipelined computer and is mentioned as early as Stretch [Buch62]. The IBM 360/91 [Ande67] was a well-known computer that produced imprecise interrupts under some circumstances, floating point exceptions, for example. Imprecise interrupts were a break with the IBM 360 architecture which made them even more noticeable. All subsequent IBM 360 and 370 implementations have used less aggressive pipeline designs where instructions modify the process state in strict program order and all interrupts are precise.* A more complete description of the method used in these "linear" pipeline implementations is in Section 8.4.

* Except for the models 95 and 195, which were derived from the original model 91 design. Also, the models 85 and 165 had imprecise interrupts for the case of protection exceptions and addressing exceptions caused by store operations.
Most pipelined implementations of general purpose architectures are similar to those used by IBM. These pipelines constrain all instructions to pass through the pipeline in order with a stage at the end where exception conditions are checked before the process state is modified. Examples include the Amdahl 470 and 580 [Amdh81, Amdh80] and the Gould/SEL 32/87 [Ward82].

The high performance CDC 6600 [Thor70], CDC 7600 [Bons69], and Cray Research [Russ78, Cray79] computers allow instructions to complete out of the architectural sequence. Consequently, they have some exception conditions that result in imprecise interrupts. In these machines, the advantages of precise interrupts have been sacrificed in favor of maximum parallelism and design simplicity. I/O interrupts in these machines are precise, and they do not implement virtual memory.

The CDC STAR-100 [HiTa72] and CYBER 200 [CDC81] series machines also allow instructions to complete out of order, and they do support virtual memory. In these machines the use of vector instructions further complicates the problem, and all the difficulties were not fully recognized until late in the development of the STAR-100. The eventual solution was the addition of an invisible exchange package [CDC81]. This captures machine-dependent state information resulting from partially completed instructions. A similar approach has more recently been suggested in MIPS [Henn82] where pipeline information is dumped at the time of an interrupt and restored to the pipeline when the process is resumed. This solution makes a process restartable, although it is arguable whether it has all the features and advantages of an architecturally precise interrupt. For example, it might be necessary to have implementation-dependent software sift through the machine-dependent state in order to provide complete debug information.

The recently-announced CDC CYBER 180/990 [CDC84] is a pipelined implementation of a new architecture that supports virtual memory, and offers roughly the same performance as a CRAY-1S. To provide precise interrupts, the CYBER 180/990 uses a history buffer, to be described later in this paper, where state information is saved just prior to being modified. Then when an interrupt occurs, this "history" information can be used to back the system up into a precise state.

1.3. Paper Overview

This paper concentrates on explaining and discussing basic methods for implementing precise interrupts in pipelined processors. We emphasize scalar architectures (as opposed to vector architectures) because of their applicability to a wider range of machines. Section 2 describes the model architecture to be used in describing precise interrupt implementations. The model architecture is very simple so that the fundamentals of the methods can be clearly described. Sections 3 through 6 describe methods for implementing precise interrupts. Section 3 describes a simple method that is easy to implement, but which reduces performance. It forces instructions to complete in architectural order, which sometimes inhibits the degree of parallelism in a pipelined system. Section 4 describes a higher performance variation where results may be bypassed to other instructions before the results are used to modify the process state. Sections 5 and 6 describe methods where instructions are allowed to complete in any order, but where state information is saved so that a precise state may be restored when an interrupt occurs. The descriptions of these methods assume that the only state information is the program counter, general purpose registers, and main memory. The methods are also discussed in the absence of a data cache. Section 7 presents simulation results. Experimental results based on these CRAY-1S simulations are presented and discussed. Section 8 contains a brief discussion of 1) saving additional state information, 2) supporting virtual memory, 3) precise interrupts when a data cache is used, and 4) linear pipeline structures.

2. Preliminaries

2.1. Model Architecture

For describing the various techniques, a model architecture is chosen so that the basic methods are not obscured by details and unnecessary complications brought about by a specific architecture.

We choose a register-register architecture where all memory accesses are through registers and all functional operations involve registers. In this respect it bears some similarity to the CDC and Cray architectures, but has only one set of registers. The load instructions are of the form: Ri = (Rj + disp). That is, the content of Rj plus a displacement given in the instruction are added to form an effective address. The content of the addressed memory location is loaded into Ri. Similarly, a store is of the form: (Rj + disp) = Ri, where Ri is stored at the address found by adding the content of Rj and a displacement. The functional instructions are of the form Ri = Rj op Rk, where op is the operation being performed. For unary operations, the degenerate form Ri = op Rk is used. Conditional instructions are of the form P = disp : Ri op Rj, where the displacement is the address of the branch target; op is a relational operator: =, >, <, etc.
The only process state in the model architecture consists of the program counter, the general purpose registers, and main memory. The architecture is simple, has a minimal amount of process state, can be easily pipelined, and can be implemented in a straightforward way with parallel functional units like the CDC and Cray architectures. Hence, implementing precise interrupts for the model architecture presents a realistic problem.
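The following C sketch is an illustration added for this edition, not part of the original paper; it restates the model architecture's four instruction forms and its minimal process state. All type and field names are hypothetical.

    #include <stdint.h>

    /* Hypothetical encoding of the model architecture's instruction forms:
       load Ri = (Rj + disp), store (Rj + disp) = Ri, functional Ri = Rj op Rk,
       and conditional branch P = disp : Ri op Rj. */
    typedef enum { OP_LOAD, OP_STORE, OP_FUNC, OP_BRANCH } op_class_t;

    typedef struct {
        op_class_t cls;
        int        op;          /* arithmetic or relational operator code */
        int        ri, rj, rk;  /* register designators                   */
        int32_t    disp;        /* displacement / branch target           */
    } instruction_t;

    /* The only process state assumed by the paper: the program counter,
       the general purpose registers, and main memory. */
    #define NREGS 8
    typedef struct {
        uint32_t pc;
        int32_t  reg[NREGS];
        int32_t  *memory;       /* main memory image */
    } process_state_t;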
Initially, we assume no operand cache. Similarly, condition codes are not used; they add other problems beyond precise interrupts when a pipelined implementation is used. Extensions for operand cache and condition codes are discussed in Section 8.

The implementation for the simple architecture is shown in Fig. 1. It uses an instruction fetch/decode pipeline which processes instructions in order. The last stage of the fetch/decode pipeline is an issue register where all register interlock conditions are checked. If there are no register conflicts, an instruction issues to one of the parallel functional units. Here, the memory access function is implemented as one of the functional units. The operand registers are read at the time an instruction issues. There is a single result bus that returns results to the register file. This bus may be reserved at the time an instruction issues or when an instruction is approaching completion. This assumes the functional unit times are deterministic. A new instruction can issue every clock period in the absence of register or result bus conflicts.

Example 1

To demonstrate how an imprecise process state may occur in our model architecture, consider the following section of code, which sums the elements of arrays A and B into array C. Consider the instructions in statements 6 and 7. Although the integer add which increments the loop count will be issued after the floating point add, it will complete before the floating point add. The integer add will therefore change the process state before an overflow condition is detected in the floating point add. In the event of such an overflow, there is an imprecise interrupt.

Figure 1. Pipelined implementation of our model architecture. Not shown is the result shift register used to
control the result bus.

      Instruction               Comments                 Execute Time

 0    R2 <- 0                   Init. loop index
 1    R0 <- 0                   Init. loop count
 2    R5 <- 1                   Loop inc. value
 3    R7 <- 100                 Maximum loop count
 4    L1: R1 <- (R2 + A)        Load A(I)                11 cp
 5    R3 <- (R2 + B)            Load B(I)                11 cp
 6    R4 <- R1 +f R3            Floating add              6 cp
 7    R0 <- R0 + R5             Inc. loop count           2 cp
 8    (R0 + C) <- R4            Store C(I)
 9    R2 <- R2 + R5             Inc. loop index           2 cp
10    P = L1 : R0 != R7         Cond. branch not equal
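The overlap can be made concrete with a small C program (an added illustration; the one-issue-per-clock-period assumption is ours) that computes when statements 6 and 7 finish:

    #include <stdio.h>

    int main(void) {
        /* Assumed issue pattern: one instruction issues per clock period. */
        int fadd_issue = 0, iadd_issue = 1;      /* statement 6, then 7    */
        int fadd_time  = 6, iadd_time  = 2;      /* execute times in cp    */

        int fadd_done = fadd_issue + fadd_time;  /* clock period 6         */
        int iadd_done = iadd_issue + iadd_time;  /* clock period 3         */

        /* The integer add updates R0 before the floating point add can
           report overflow, so an overflow interrupt would be imprecise.  */
        printf("floating add completes at cp %d\n", fadd_done);
        printf("integer  add completes at cp %d\n", iadd_done);
        return 0;
    }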

2.2. Interrupts Prior to Instruction Issue

Before proceeding with the various precise interrupt methods, we discuss interrupts that occur prior to instruction issue separately because they are handled the same way by all the methods.

In the pipeline implementation of Fig. 1, instructions stay in sequence until the time they are issued. Furthermore, the process state is not modified by an instruction before it issues. This makes precise interrupts a simple matter when an exception condition can be detected prior to issue. Examples of such exceptions are privileged instruction faults and unimplemented instructions. This class also includes external interrupts, which can be checked at the issue stage.

When such an interrupt condition is detected, instruction issuing is halted. Then, there is a wait while all previously issued instructions complete. After they have completed, the process is in a precise state, with the program counter value corresponding to the instruction being held in the issue register. The registers and main memory are in a state consistent with this program counter value.

Because exception conditions detected prior to instruction issue can be handled easily as described above, we will not consider them any further. Rather, we will concentrate on exception conditions detected after instruction issue.

3. In-order Instruction Completion

With this method, instructions modify the process state only when all previously issued instructions are known to be free of exception conditions. This section describes a strategy that is most easily implemented when pipeline delays in the parallel functional units are fixed. That is, they do not depend on the operands, only on the function. Thus, the result bus can be reserved at the time of issue.

First, we consider a method commonly used to control the pipelined organization shown in Fig. 1. This method may be used regardless of whether precise interrupts are to be implemented. However, the precise interrupt methods described in this paper are integrated into this basic control strategy. To control the result bus, a "result shift register" is used; see Fig. 2. Here, the stages are labeled 1 through n, where n is the length of the longest functional unit pipeline. An instruction that takes i clock periods reserves stage i of the result shift register at the time it issues. If the stage already contains valid control information, then issue is held until the next clock period, and stage i is checked once again. An issuing instruction places control information in the result shift register. This control information identifies the functional unit that will be supplying the result and the destination register of the result. This control information is also marked "valid" with a validity bit. Each clock period, the control information is shifted down one stage toward stage one. When it reaches stage one, it is used during the next clock period to control the result bus so that the functional unit result is placed in the correct result register.

Still disregarding precise interrupts, it is possible for a short instruction to be placed in the result pipeline in stage i when previously issued instructions are in stage j, j > i. This leads to instructions finishing out of the original program sequence. If the instruction at stage j eventually encounters an exception condition, the interrupt will be imprecise because the instruction placed in stage i will complete and modify the process state even though the sequential architecture model says i does not begin until j completes.
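A minimal C model of the result shift register described above is sketched below. It is an added illustration, not the paper's control logic; the entry fields simply mirror the control information named in the text (a validity bit, the functional unit, and the destination register).

    #include <stdbool.h>
    #include <string.h>

    #define RSR_STAGES 8   /* n = length of the longest functional unit pipeline */

    typedef struct {
        bool valid;
        int  func_unit;    /* unit that will drive the result bus */
        int  dest_reg;     /* register the result is written to   */
    } rsr_entry_t;

    static rsr_entry_t rsr[RSR_STAGES + 1];    /* stages 1..RSR_STAGES */

    /* Try to reserve stage i for an instruction that takes i clock periods.
       Returns false if the stage already holds valid control information, in
       which case issue is held and the stage is checked again next clock. */
    bool rsr_issue(int i, int func_unit, int dest_reg) {
        if (rsr[i].valid) return false;
        rsr[i] = (rsr_entry_t){ true, func_unit, dest_reg };
        return true;
    }

    /* Advance one clock period: stage 1 is used to control the result bus,
       then every entry shifts down one stage toward stage 1. */
    rsr_entry_t rsr_clock(void) {
        rsr_entry_t to_result_bus = rsr[1];
        memmove(&rsr[1], &rsr[2], (RSR_STAGES - 1) * sizeof(rsr_entry_t));
        rsr[RSR_STAGES].valid = false;
        return to_result_bus;   /* caller writes dest_reg if the entry is valid */
    }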

Figure 2. Result Shift Register

Example 2

If one considers the section of code presented in Example 1, and an initially empty result shift register (all the entries invalid), the floating point add would be placed in stage 6 while the integer add would be placed in stage 2. The result shift register entries shown in Fig. 2 reflect the state of the result shift register after the integer add issues. Notice that the floating point add entry is in stage 5 since one clock period has passed since it issued. As described above, this situation leads to instructions finishing out of the original program sequence.

3.1. Registers

To implement precise interrupts with respect to registers using the above pipeline control structure, the control should "reserve" stages i < j as well as stage j. That is, the stages i < j that were not previously reserved by other instructions are reserved, and they are loaded with null control information so that they do not affect the process state. This guarantees that instructions modifying registers finish in order.

There is logic on the result bus that checks for exception conditions in instructions as they complete. If an instruction contains a non-masked exception condition, then control logic "cancels" all subsequent instructions coming on the result bus so that they do not modify the process state.

Example 3

For our sample section of code given in Example 1, assuming the result shift register is initially empty, such a policy would have the floating point add instruction reserve stages 1 through 6 of the result shift register. When, on the next clock cycle, the integer add is in the issue register, it would normally issue and reserve stage 2. However, this is now prohibited from happening because stage 2 is already reserved. Thus, the integer add must wait at the issue stage until stage 2 of the result shift register is no longer reserved. This would be 5 clock periods after the issue of the floating point add.

A generalization of this method is to determine, if possible, that an instruction is free of exception conditions prior to the time it is complete. Only result shift register stages that will finish before exceptions are detected need to be reserved (in addition to the stage that controls the result).
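Building on the result shift register sketch above, the in-order reservation rule can be illustrated by a hypothetical issue routine (again a sketch, not the paper's control equations) that loads every still-free stage below j with null control information:

    #include <stdbool.h>

    #define RSR_STAGES 8
    typedef struct { bool valid; int func_unit; int dest_reg; } rsr_entry_t;
    #define NULL_ENTRY (-1)   /* null control info: affects no register */

    /* Issue for in-order completion: an instruction taking j clock periods
       reserves stage j, and every unreserved stage i < j is loaded with a
       null entry so no later, shorter instruction can finish ahead of it. */
    bool rsr_issue_in_order(rsr_entry_t rsr[], int j, int func_unit, int dest_reg) {
        if (rsr[j].valid) return false;               /* hold issue this cycle */
        for (int i = 1; i < j; i++)
            if (!rsr[i].valid)
                rsr[i] = (rsr_entry_t){ true, NULL_ENTRY, NULL_ENTRY };
        rsr[j] = (rsr_entry_t){ true, func_unit, dest_reg };
        return true;
    }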
3.2. Main Memory

Store instructions modify the portion of process state that resides in main memory. To implement precise interrupts with respect to memory, one solution is to force store instructions to wait for the result shift register to be empty before issuing. Alternatively, stores can issue and be held in the load/store pipeline until all preceding instructions are known to be exception-free. Then the store can be released to memory.

To implement the second alternative, recall that memory can be treated as a special functional unit. Thus, as with any other instruction, the store can make an entry in the result shift register. This entry is defined as a dummy store. The dummy store does not cause a result to be placed in the registers, but is used for controlling the memory pipeline. The dummy store is placed in the result shift register so that it will not reach stage 1 until the store is known to be exception-free. When the dummy store reaches stage 1, all previous instructions have completed without exceptions, and a signal is sent to the load/store unit to release the store to memory. If the store itself contains an exception condition, then the store is cancelled, all following load/store instructions are cancelled, and the store unit signals the pipeline control so that all instructions issued subsequent to the store are cancelled as they leave the result pipeline.
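A hedged sketch of the dummy-store mechanism follows; the callback names release_store and cancel_stores are invented for illustration and stand in for signals to the load/store unit.

    #include <stdbool.h>

    typedef struct {
        bool valid;
        bool dummy_store;   /* controls the memory pipeline, writes no register */
        int  dest_reg;
    } rsr_entry_t;

    /* Called when the entry in stage 1 is consumed.  If it is a dummy store
       and no earlier instruction raised an exception, the load/store unit is
       signalled to release the oldest held store to memory; otherwise the
       store (and everything behind it in the memory pipeline) is cancelled. */
    void stage1_retire(rsr_entry_t e, bool exception_seen,
                       void (*release_store)(void), void (*cancel_stores)(void)) {
        if (!e.valid || !e.dummy_store) return;
        if (exception_seen) cancel_stores();
        else                release_store();
    }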
3.3. Program Counter

To implement precise interrupts with respect to the program counter, the result shift register is widened to include a field for the program counter of each instruction (see Fig. 2). This field is filled as the instruction issues. When an instruction with an exception condition appears at the result bus, its program counter is available and becomes part of the saved state.

4. The Reorder Buffer

The primary disadvantage of the above method is that fast instructions may sometimes get held up at the issue register even though they have no dependencies and would otherwise issue. In addition, they block the issue register while slower instructions behind them could conceivably issue.

This leads us to a more complex, but more general solution. Instructions are allowed to finish out of order, but a special buffer called the reorder buffer is used to reorder them before they modify the process state.

4.1. Basic Method

The overall organization is shown in Fig. 3a. The reorder buffer, Fig. 3b, is a circular buffer with head and tail pointers. Entries between the head and tail are considered valid. At instruction issue time the next available reorder buffer entry, pointed to by the tail pointer, is given to the issuing instruction. The tail pointer value is used as a tag to identify the entry in the buffer reserved for the instruction. The tag is placed in the result shift register along with the other control information. The tail pointer is then incremented, modulo the buffer size. The result shift register differs from the one used earlier because there is a field containing a reorder tag instead of a field specifying a destination register.

Figure 3. (a) Reorder Buffer Organization. (b) Reorder Buffer and associated Result Shift Register.

When an instruction completes, both results and exception conditions are sent to the reorder buffer. The tag from the result shift register is used to guide them to the correct reorder buffer entry. When the entry at the head of the reorder buffer contains valid results (its instruction has finished), then its exceptions are checked. If there are none, the results are written into the registers. If an exception is detected, issue is stopped in preparation for the interrupt, and all further writes into the register file are inhibited.

Example 4

The entries in the reorder buffer and result shift register shown in Figure 3b reflect their state after the integer add from Example 2 has issued. Notice that the result shift register entries are very similar to those in Figure 2. The integer add will complete execution before the floating point add and its results will be placed in entry 5 of the reorder buffer. These results, however, will not be written into R0 until the floating point result, found in entry 4, has been placed in R4.
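The reorder buffer mechanics just described can be summarized in the following C sketch, added for illustration; the buffer size and field names are assumptions, and the buffer-full check at issue is omitted for brevity.

    #include <stdbool.h>
    #include <stdint.h>

    #define ROB_SIZE 8
    typedef struct {
        bool     valid;       /* result has come back                   */
        bool     exception;
        int      dest_reg;
        int32_t  result;
        uint32_t pc;          /* for the saved, precise program counter */
    } rob_entry_t;

    static rob_entry_t rob[ROB_SIZE];
    static int head, tail;    /* entries in [head, tail) are live        */

    /* Issue: hand out the tail entry; its index is the tag carried in the
       result shift register. */
    int rob_issue(int dest_reg, uint32_t pc) {
        int tag = tail;
        rob[tag] = (rob_entry_t){ false, false, dest_reg, 0, pc };
        tail = (tail + 1) % ROB_SIZE;
        return tag;
    }

    /* Completion: the tag guides the result and exception report to its entry. */
    void rob_complete(int tag, int32_t result, bool exception) {
        rob[tag].result    = result;
        rob[tag].exception = exception;
        rob[tag].valid     = true;
    }

    /* Retire: when the head entry has finished, write the register file in
       architectural order, or stop and inhibit further writes if an exception
       is flagged.  Returns true if an interrupt should be taken. */
    bool rob_retire(int32_t regfile[]) {
        if (head == tail || !rob[head].valid) return false;
        if (rob[head].exception) return true;   /* precise PC = rob[head].pc */
        regfile[rob[head].dest_reg] = rob[head].result;
        head = (head + 1) % ROB_SIZE;
        return false;
    }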
4.2. Main Memory

Preciseness with respect to memory is maintained in a manner similar to that in the in-order completion scheme (Section 3.2). The simplest method holds stores in the issue register until all previous instructions are known to be free of exceptions. In the more complex method, a store signal is sent to the memory pipeline as a "dummy" store is removed from the reorder buffer. Stores are allowed to issue, and block in the store pipeline prior to being committed to memory while they wait for their dummy counterpart.

4.3. Program Counter

To maintain preciseness with respect to the program counter, the program counter can be sent to a reserved space in the reorder buffer at issue time (shown in Figure 3b). While the program counter could be sent to the result shift register, it is expected that the result shift register will contain more stages than the reorder buffer and thus require more hardware. The length of the result shift register must be as long as the longest pipeline stage. As will be seen in Section 7, the number of entries in the reorder buffer can be quite small. When an instruction arrives at the head of the reorder buffer with an exception condition, the program counter found in the reorder buffer entry becomes part of the saved precise state.

4.4. Bypass Paths

While an improvement over the method described in Section 3, the reorder buffer still suffers a performance penalty. A computed result that is generated out of order is held in the reorder buffer until previous instructions, finishing later, have updated the register file. An instruction dependent on a result being held in the reorder buffer cannot issue until the result has been written into the register file.

The reorder buffer may, however, be modified to minimize some of the drawbacks of finishing strictly in order. For results to be used early, bypass paths may be provided from the entries in the reorder buffer to the register file output latches; see Fig. 4. These paths allow data being held in the reorder buffer to be used in place of register data. The implementation of this method requires comparators for each reorder buffer stage and operand designator. If an operand register designator of an instruction being checked for issue matches a register designator in the reorder buffer, then a multiplexer is set to gate the data from the reorder buffer to the register output latch. In the absence of other issue blockage conditions, the instruction is allowed to issue, and the data from the reorder buffer is used prior to being written into the register file.

Figure 4. Reorder Buffer Method with Bypasses.

There may be bypass paths from some or all of the reorder buffer entries. If multiple bypass paths exist, it is possible for more than one destination entry in the reorder buffer to correspond to a single register. Clearly only the latest reorder buffer entry that corresponds to an operand designator should generate a bypass path to the register output latch. To prevent multiple bypassing of the same register, when an instruction is placed in the reorder buffer, any entries with the same destination register designator must be inhibited from matching a bypass check.

When bypass paths are added, preciseness with respect to the memory and the program counter does not change from the previous method.

The greatest disadvantage with this method is the number of bypass comparators needed and the amount of circuitry required for the multiple bypass check. While the circuitry is conceptually simple, there is a great deal of it.
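A software analogue of the bypass check, added here for illustration only, is sketched below: scanning from the youngest live entry downward means the most recent matching entry wins, which plays the role of inhibiting older entries with the same destination register from matching.

    #include <stdbool.h>
    #include <stdint.h>

    #define ROB_SIZE 8
    typedef struct { bool valid; int dest_reg; int32_t result; } rob_entry_t;

    /* Operand fetch with bypass: the most recent live entry whose destination
       matches the operand designator is gated to the register file output
       latch in place of the stale register value (one comparator per entry
       per operand in hardware). */
    int32_t read_operand(const rob_entry_t rob[], int head, int tail,
                         int reg, const int32_t regfile[], bool *stall) {
        *stall = false;
        int end = (head + ROB_SIZE - 1) % ROB_SIZE;
        for (int i = (tail + ROB_SIZE - 1) % ROB_SIZE; i != end;
             i = (i + ROB_SIZE - 1) % ROB_SIZE) {
            if (rob[i].dest_reg == reg) {
                if (rob[i].valid) return rob[i].result;  /* bypass from buffer */
                *stall = true;                           /* result not ready   */
                return 0;
            }
        }
        return regfile[reg];                             /* no pending writer  */
    }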
5. History Buffer

The methods presented in this section and the next are intended to reduce or eliminate performance losses experienced with a simple reorder buffer, but without all the control logic needed for multiple bypass paths. Primarily, these methods place computed results in a working register file, but retain enough state information so a precise state can be restored if an exception occurs.

Fig. 5a illustrates the history buffer method. The history buffer is organized in a manner very similar to the reorder buffer. At issue time, a buffer entry is loaded with control information, as with the reorder buffer, but the value of the destination register (soon to be overwritten) is also read from the register file and written into the buffer entry. Results on the result bus are written directly into the register file when an instruction completes. Exception reports come back as an instruction completes and are written into the history buffer. As with the reorder buffer, the exception reports are guided to the proper history buffer entry through the use of tags found in the result shift register. When the history buffer contains an element at the head that is known to have finished without exceptions, the history buffer entry is no longer needed and that buffer location can be re-used (the head pointer is incremented). As with the reorder buffer, the history buffer can be shorter than the maximum number of pipeline stages. If all history buffer entries are used (the buffer is too small), issue must be blocked until an entry becomes available. Hence the buffer should be long enough so that this seldom happens. The effect of the history buffer on performance is determined in Section 7.

Example 5

The entries in the history buffer and result shift register shown in Fig. 5b correspond to our code in Example 1, after the integer add has issued. The only differences between this and the reorder buffer method shown in Fig. 3b are the addition of an "old value" field in the history buffer and a "destination register" field in the result shift register. The result shift register now looks like the one shown in Fig. 2.

When an exception condition arrives at the head of the buffer, the buffer is held, instruction issue is immediately halted, and there is a wait until pipeline activity completes. The active buffer entries are then emptied from tail to head, and the history values are loaded back into their original registers. The program counter value found in the head of the history buffer is the precise program counter.

To make main memory precise, when a store entry emerges from the buffer, it sends a signal that another store can be committed to memory. Stores can either wait in the issue register or can be blocked in the memory pipeline, as in the previous methods.

The extra hardware required by this method is in the form of a large buffer to contain the history information. Also the register file must have three read ports since the destination value as well as the source operands must be read at issue time. There is a slight problem if the basic implementation has a bypass of the result bus around the register file. In such a case, the bypass must also be connected into the history buffer.
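The save-and-rollback behavior of the history buffer can be sketched in C as follows (an added illustration; the entry layout and names are assumptions):

    #include <stdbool.h>
    #include <stdint.h>

    #define HB_SIZE 8
    typedef struct {
        bool     done, exception;
        int      dest_reg;
        int32_t  old_value;   /* destination value read at issue time */
        uint32_t pc;
    } hb_entry_t;

    static hb_entry_t hb[HB_SIZE];
    static int head, tail;

    /* Issue: record the register value about to be overwritten (this is why
       the register file needs a third read port). */
    int hb_issue(int dest_reg, const int32_t regfile[], uint32_t pc) {
        int tag = tail;
        hb[tag] = (hb_entry_t){ false, false, dest_reg, regfile[dest_reg], pc };
        tail = (tail + 1) % HB_SIZE;
        return tag;
    }

    /* Recovery: once pipeline activity has drained, empty the buffer from
       tail back to head, restoring each old value; the PC at the head is the
       precise program counter.  Results themselves went straight into the
       register file as instructions completed. */
    uint32_t hb_recover(int32_t regfile[]) {
        while (tail != head) {
            tail = (tail + HB_SIZE - 1) % HB_SIZE;
            regfile[hb[tail].dest_reg] = hb[tail].old_value;
        }
        return hb[head].pc;   /* precise PC */
    }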
6. Future File

The future file method (Fig. 6) is similar to the history buffer method; however it uses two separate register files. One register file reflects the state of the architectural (sequential) machine. This file will be referred to as the architectural file. A second register file is updated as soon as instructions finish and therefore runs ahead of the architectural file (i.e. it reflects the future with respect to the architectural file). This future file is the working file used for computation by the functional units.

Instructions are issued and results are returned to the future file in any order, just as in the original pipeline model. There is also a reorder buffer that receives results at the same time they are written into the future file. When the head pointer finds a completed instruction (a valid entry), the result associated with that entry is written in the architectural file.

Figure 5. (a) History Buffer Organization. (b) History Buffer and associated Result Shift Register.

Figure 6. Future File Organization.

Example 6

If we consider the code in Example 1 again, there is a period of time when the architectural file and the future file contain different entries. With this method, an instruction may finish out of order, so when the integer add finishes, the future file contains the new contents of R0. The architectural file however does not, and the new contents of R0 are buffered in the reorder buffer entry corresponding to the integer add. Between the time the integer add finishes and the time the floating point add finishes, the two files are different. Once the floating point add finishes and its results are written into R4 of both files, R0 of the architectural file is written.

Just as with the pure reorder buffer method, program counter values are written into the reorder buffer at issue time. When the instruction at the head of the reorder buffer has completed without error, its result is placed in the architectural file. If it completed with an error, the register designators associated with the buffer entries between the head and tail pointers are used to restore values in the future file from the architectural file.*

The primary advantage of the future file method is realized when the architecture implements interrupts via an "exchange" where all the registers are automatically saved in memory and new ones are restored (as is done in CDC and Cray architectures). In this case, the architectural file can be stored away immediately; no restoring is necessary as in the history buffer method. There is also no bypass problem as with the history buffer method.

* The restoration is performed from the architectural file since the future file is the register file from which all execution takes place.
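For illustration, a C sketch of the future file's completion and retirement paths is given below (names and sizes are assumptions); the restore loop follows the footnote's rule of copying architectural values back into the future file.

    #include <stdbool.h>
    #include <stdint.h>

    #define NREGS    8
    #define ROB_SIZE 8
    typedef struct { bool valid, exception; int dest_reg; int32_t result; } rob_entry_t;

    static int32_t future_file[NREGS];        /* working file, written out of order  */
    static int32_t architectural_file[NREGS]; /* written only in architectural order */

    /* Completion: a result goes into the future file (and into the reorder
       buffer entry) as soon as the functional unit produces it. */
    void complete(rob_entry_t *e, int32_t result) {
        e->result = result;
        e->valid  = true;
        future_file[e->dest_reg] = result;
    }

    /* Retire at the head: copy the result to the architectural file.  On an
       exception, restore the future file from the architectural file for
       every register named between head and tail. */
    void retire_or_recover(rob_entry_t rob[], int *head, int tail) {
        while (*head != tail && rob[*head].valid) {
            if (rob[*head].exception) {
                for (int i = *head; i != tail; i = (i + 1) % ROB_SIZE)
                    future_file[rob[i].dest_reg] = architectural_file[rob[i].dest_reg];
                return;   /* architectural file already holds the precise state */
            }
            architectural_file[rob[*head].dest_reg] = rob[*head].result;
            *head = (*head + 1) % ROB_SIZE;
        }
    }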
7. Performance Evaluation

To evaluate the effectiveness of our precise interrupt schemes, we use a CRAY-1S simulation system developed at the University of Wisconsin [PaSm83]. This trace-driven simulator is extremely accurate, due to the highly deterministic nature of the CRAY-1S, and gives the number of clock periods required to execute a program.

The scalar portion of the CRAY-1S is very similar to the model architecture described in Section 2.1. Thus, casting the basic approaches into the CRAY-1S scalar architecture is straightforward.

For a simulation workload, the first fourteen Lawrence Livermore Loops [McMa72] were used. Because we are primarily interested in pipelined implementations of conventional scalar architectures, the loops were compiled by the Cray FORTRAN compiler with the vectorizer turned off.

In the preceding sections, five methods were described that could be used for guaranteeing precise interrupts. To evaluate the effect of these methods on system performance, the methods were partitioned into three groups. The first and second group respectively contain the in-order method and the simple reorder buffer method. The third group is composed of the reorder buffer with bypasses, the history buffer, and the future file. This partitioning was performed because the methods in the third group result in identical system performance. This is because the future file has a reorder buffer embedded as part of its implementation, and the history buffer length constrains performance in the same way as a reorder buffer: when the buffer fills, issue must stop. All the simulation results are reported as for the reorder buffer with bypasses. They apply equally well for the history buffer and future file methods.

The selection of a particular method depends not only on its effect on system performance but also the cost of implementation and the ease with which the precise CPU state can be restored.

For each precise interrupt method, two methods were described for handling stores. Simulations were run for each of these methods. For those methods other than the in-order completion method, the size of the reorder buffer is a parameter. Sizing the buffer with too few entries degrades performance since instructions that might issue could block at the issue register. The blockage occurs because there is no room for a new entry in the buffer.

Table 1 shows the relative performance of the In-order, Reorder Buffer, and Reorder Buffer with bypass methods when the stores are held until the result shift register is empty. The results in the table indicate the relative performance of these methods with respect to the CRAY-1S across the first 14 Lawrence Livermore Loops; real CRAY-1S performance is 1.0. A relative performance greater than 1.0 indicates a degradation in performance. The number of entries in the reorder buffer was varied from 3 to 10.

The simulation results for the In-order column are constant since this method does not depend on a buffer that reorders instructions. For all the methods, there is some performance degradation. Initially, when the reorder buffer is small, the In-order method produces the least performance degradation. A small reorder buffer (less than 3 entries) limits the number of instructions that can simultaneously be in some stage of execution. Once the reorder buffer size is increased beyond 3 entries, either of the other methods results in better performance. As expected, the reorder buffer with bypasses offers superior performance when compared with the simple reorder buffer. When the size of the buffer was increased beyond 10 entries, simulation results indicated no further performance improvements. (Simulations were also run for buffer sizes of 15, 16, 20, 25, and 60.) At best, one can expect a 12% performance degradation when using a reorder buffer with bypasses and the first method for handling stores.

Table 2 indicates the relative performance when stores issue and wait at the same memory pipeline stage as for memory bank conflicts in the original CRAY-1S. After issuing, stores wait for their counterpart dummy store to signal that all previously issued register instructions have finished. Subsequent loads and stores are blocked from issuing.

Table 1. Relative Performance for the first 14 Lawrence Livermore Loops, with stores blocked until the results pipeline is empty.

Number of Entries    In-order    Reorder Buffer    Reorder w/ Bypasses
        3             1.2322        1.3315              1.3069
        4             1.2322        1.2183              1.1743
        5             1.2322        1.1954              1.1439
        8             1.2322        1.1808              1.1208
       10             1.2322        1.1808              1.1208

Table 2. Relative Performance for the first 14 Lawrence Livermore Loops, with stores held in the memory pipeline after issue.

Number of Entries    In-order    Reorder Buffer    Reorder w/ Bypasses
        3             1.1560        1.3058              1.2797
        4             1.1560        1.1724              1.1152
        5             1.1560        1.1348              1.0539
        8             1.1560        1.1167              1.0279
       10             1.1560        1.1167              1.0279

As in Table 1, the In-order results are constant across all entries. For the simple reorder buffer, the buffer must have at least 5 entries before it results in better performance than the In-order method. The reorder buffer with bypasses, however, requires only 4 entries before it is performing more effectively than the In-order method. Just as in Table 1, having more than 8 entries in the reorder buffer does not result in improved performance. Comparing Table 1 to Table 2, the second method for handling stores offers a clear improvement over the first method. If the second method is used with an 8 entry reorder buffer that has bypasses, a performance degradation of only 3% is experienced.

Clearly there is a trade-off between performance degradation and the cost of implementing a method. For essentially no cost, the In-order method can be combined with the first method of handling stores. Selecting this 'cheap' approach results in a 23% performance degradation. If this degradation is too great, either the second store method must be used with the In-order method or one of the more complex methods must be used. If the reorder buffer method is used, one must use a buffer with at least 3 or 4 entries.

8. Extensions

In previous sections, we described methods that could be used to guarantee precise interrupts with respect to the registers, the main memory, and the program counter of our simple architectural model. In the following sections, we extend the previous methods to handle additional state information, virtual memory, a cache, and linear pipelines. Effectively, some of these machine features can be considered to be functional units with non-deterministic execution times.

8.1. Handling Other State Values

Most architectures have more state information than we have assumed in the model architecture. For example, a process may have state registers that point to page and segment tables, indicate interrupt mask conditions, etc. This additional state information can be precisely maintained with a method similar to that used for stores to memory. If using a reorder buffer, an instruction that changes a state register reserves a reorder buffer entry and proceeds to the part of the machine where the state change will be made. The instruction then waits there until receiving a signal to continue from the reorder buffer. When its entry arrives at the head of the buffer and is removed, then a signal is sent to cause the state change.

In architectures that use condition codes, the condition codes are state information. Although the problem condition codes present to conditional branches is not totally unrelated to the topic here, solutions to the branch problem are not the primary topic of this paper. It is assumed that the conditional branch problem has been solved in some way, e.g. [Ande67]. If a reorder buffer is being used, condition codes can be placed in the reorder buffer. That is, just as for data, the reorder buffer is made sufficiently wide to hold the condition codes. The condition code entry is then updated when the condition codes associated with the execution of an instruction are computed. Just as with data in the reorder buffer, a condition code entry is not used to change processor state until all previous instructions have completed without error (however condition codes can be bypassed to the instruction fetch unit to speed up conditional branches).

Extension of the history buffer and future file methods to handle condition codes is very similar to that of the reorder buffer. For the history buffer, the condition code settings at the time of instruction issue must be saved in the history buffer. The saved condition codes can then be used to restore the processor state when an exception is detected. Since the future file method uses a reorder buffer, the above discussion indicates how condition codes may be saved.
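As an added illustration, a reorder buffer entry widened with a condition-code field might look as follows; the field names are hypothetical.

    #include <stdbool.h>
    #include <stdint.h>

    /* Reorder buffer entry widened to carry condition codes along with the
       data result, as suggested above. */
    typedef struct {
        bool    valid, exception;
        bool    sets_cc;
        uint8_t cc;          /* condition codes computed by this instruction */
        int     dest_reg;
        int32_t result;
    } rob_cc_entry_t;

    /* Retiring an error-free head entry updates the architectural condition
       codes in order; a completed (but not yet retired) entry could still be
       bypassed to the instruction fetch unit to resolve a branch early. */
    void retire_cc(const rob_cc_entry_t *head_entry, uint8_t *arch_cc,
                   int32_t regfile[]) {
        if (!head_entry->valid || head_entry->exception) return;
        regfile[head_entry->dest_reg] = head_entry->result;
        if (head_entry->sets_cc) *arch_cc = head_entry->cc;
    }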
8.2. Virtual Memory

Virtual memory is a very important reason for supporting precise interrupts; it must be possible to recover from page faults. First, the address translation pipeline should be designed so that all the load/store instructions pass through it in order. This has been assumed throughout this paper. Depending on the method being used, the load/store instructions reserve time slots in the result pipeline and/or reorder buffer that are read no earlier than the time at which the instructions have been checked for exception conditions (especially page faults). For stores, these entries are not used for data; they are used just for exception reporting and/or holding a program counter value.

If there is an addressing fault, then the instruction is cancelled in the addressing pipeline, and all subsequent load/store instructions are cancelled as they pass through the addressing pipeline. This guarantees that no additional loads or stores modify the process state. The mechanisms described in the earlier sections for assuring preciseness with respect to registers guarantee that non-load/store instructions following the faulting load/store will not modify the process state; hence the interrupt is precise.

For example, if the reorder buffer method is being used, a page fault would be sent to the reorder buffer when it is detected.

The tag assigned to the corresponding load/store instruction guides it to the correct reorder buffer entry. The reorder buffer entry is removed from the buffer when it reaches the head. The exception condition in the entry causes all further entries of the reorder buffer to be discarded so that the process state is modified no further (no more registers are written). The program counter found in the reorder buffer entry is precise with respect to the fault.
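A sketch of how an address-translation stage might report a page fault through the reorder buffer tag and squash later memory operations is given below; this is illustrative only, and the flag and function names are invented.

    #include <stdbool.h>

    typedef struct { bool valid, exception; } rob_entry_t;

    static bool cancel_mem_ops;   /* set once an addressing fault is seen */

    /* Address translation stage of the load/store pipeline: on a page fault
       the instruction is cancelled, the fault is reported to its reorder
       buffer entry through the tag assigned at issue, and every subsequent
       load/store is cancelled as it passes through this stage. */
    bool translate(rob_entry_t rob[], int tag, bool page_fault) {
        if (cancel_mem_ops) return false;        /* squashed, no memory access */
        if (page_fault) {
            rob[tag].exception = true;           /* retire logic will see it   */
            rob[tag].valid = true;
            cancel_mem_ops = true;
            return false;
        }
        return true;                             /* safe to access memory      */
    }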
8.3. Cache Memory

Thus far we have assumed systems that do not use a cache memory. Inclusion of a cache in the memory hierarchy affects the implementation of precise interrupts. As we have seen, an important part of all the methods is that stores are held until all previous instructions are known to be exception-free. With a cache, stores may be made into the cache earlier, and for performance reasons should be. The actual updating of main memory, however, is still subject to the same constraints as before.

8.3.1. Store-through Caches

With a store-through cache, the cache can be updated immediately, while the store-through to main memory is handled as in previous sections. That is, all previous instructions must first be known to be exception-free. Load instructions are free to use the cached copy, however, regardless of whether the store-through has taken place. This means that main memory is always in a precise state, but the cache contents may "run ahead" of the precise state. If an interrupt should occur while the cache is potentially in such a state, then the cache should be flushed. This guarantees that prematurely updated cache locations will not be used. However, this can lead to performance problems, especially for larger caches.

Another alternative is to treat the cache in a way similar to the register files. One could, for example, keep a history buffer for the cache. Just as with registers, a cache location would have to be read just prior to writing it with a new value. This does not necessarily mean a performance penalty because the cache must be checked for a hit prior to the write cycle. In many high performance cache organizations, the read cycle for the history data could be done in parallel with the hit check. Each store instruction makes a buffer entry indicating the cache location it has written. The buffer entries can be used to restore the state of the cache. As instructions complete without exceptions, the buffer entries are discarded. The future file can be extended in a similar way.

8.3.2. Write-Back Cache

A write-back cache is perhaps the cache type most compatible with implementing precise interrupts. This is because stores in a write-back cache are not made directly to memory; there is a built-in delay between updating the cache and updating main memory. Before an actual write-back operation can be performed, however, the reorder buffer should be emptied or should be checked for data belonging to the line being written back. If such data should be found, the write-back must wait until the data has made its way into the cache. If a history buffer is used, either a cache line must be saved in the history buffer, or the write-back must wait until the associated instruction has made its way to the end of the buffer. Notice that in any case, the write-back will sometimes have to wait until a precise state is reached.

8.4. Linear Pipeline Structures

An alternative to the parallel functional unit organizations we have been discussing is a linear pipeline organization. Refer to Fig. 7. Linear pipelines provide a more natural implementation of register-storage architectures like the IBM 370. Here, the same instruction can access a memory operand and perform some function on it. Hence, these linear pipelines have an instruction fetch/decode phase, an operand fetch phase, and an execution phase, any of which may be composed of one or several pipeline stages.

In general, reordering instructions after execution is not as significant an issue in such organizations because it is natural for instructions to stay in order as they pass through the pipe. Even if they finish early in the pipe, they proceed to the end where exceptions are checked before modifying the process state. Hence, the pipeline itself acts as a sort of reorder buffer.

The role of the result shift register is played by the control information that flows down the pipeline alongside the data path. Program counter values for preciseness may also flow down the pipeline so that they are available should an exception arise.

Linear pipelines often have several bypass paths connecting intermediate pipeline stages. A complete set of bypasses is typically not used; rather, there is some critical subset selected to maximize performance while keeping control complexity manageable. Hence, using the terminology of this paper, linear pipelines achieve precise interrupts by using a reorder buffer method with bypasses.

Figure 7. Example of a linear pipeline implementation.

9. Summary and Conclusions

Five methods have been described that solve the precise interrupt problem. These methods were then evaluated through simulations of a CRAY-1S implemented with these methods. These simulation results indicate that, depending on the method and the way stores are handled, the performance degradation can range from about 23% to as little as 3%. It is expected that the cost of implementing these methods could vary substantially, with the method producing the smallest performance degradation probably being the most expensive. Thus, selection of a particular method will depend not only on the performance degradation, but on whether the implementor is willing to pay for that method.

It is important to note that some indirect causes for performance degradation were not considered. These include longer control paths that would tend to lengthen the clock period. Also, additional logic for supporting precise interrupts implies greater board area, which implies more wiring delays, which could also lengthen the clock period.

10. Acknowledgment

One of the authors (J. E. Smith) would like to thank R. G. Hintz and J. B. Pearson of the Control Data Corp., with whom he was associated during the development of the CYBER 180/990. This paper is based upon research supported by the National Science Foundation under grant ECS-8207277.
11. References

[Amdh81] Amdahl Corporation, "Amdahl 470V/8 Computing System Machine Reference Manual," publication no. G1014.0-03A, Oct. 1981.

[Amdh80] Amdahl Corporation, "580 Technical Introduction," 1980.

[Ande67] D.W. Anderson, F.J. Sparacio, and R.M. Tomasulo, "The IBM System/360 Model 91: Machine Philosophy and Instruction Handling," IBM Journal of Research and Development, V 11, January 1967, pp. 8-24.

[Bons69] P. Bonseigneur, "Description of the 7600 Computer System," Computer Group News, May 1969, pp. 11-15.

[Buch62] W. Bucholz, ed., Planning a Computer System, McGraw-Hill, New York, 1962.

[CDC84] Control Data Corporation, "CDC CYBER 180 Computer System Model 990 Hardware Reference Manual," pub. no. 60462090, 1984.

[CDC81] Control Data Corporation, "CDC CYBER 200 Model 205 Computer System Hardware Reference Manual," Arden Hills, MN, 1981.

[Cray79] Cray Research, Inc., "CRAY-1 Computer Systems, Hardware Reference Manual," Chippewa Falls, WI, 1979.

[Henn82] J. Hennessy et al., "Hardware/Software Tradeoffs for Increased Performance," Proc. Symp. Architectural Support for Programming Languages and Operating Systems, 1982, pp. 2-11.

[HiTa72] R.G. Hintz and D.P. Tate, "Control Data STAR-100 Processor Design," Proc. Compcon 72, 1972, pp. 1-4.

[McMa72] F.H. McMahon, "FORTRAN CPU Performance Analysis," Lawrence Livermore Laboratories, 1972.

[PaSm83] N. Pang and J.E. Smith, "CRAY-1 Simulation Tools," Tech. Report ECE-83-11, University of Wisconsin-Madison, Dec. 1983.

[Russ78] R.M. Russell, "The CRAY-1 Computer System," Comm. ACM, V 21, N 1, January 1978, pp. 63-72.

[Stev81] David Stevenson, "A Proposed Standard for Binary Floating Point Arithmetic," Computer, V 14, N 3, March 1981, pp. 51-62.

[Thor70] J.E. Thornton, Design of a Computer - The Control Data 6600, Scott, Foresman and Co., Glenview, IL, 1970.

[Ward82] William P. Ward, "Minicomputer Blasts Through 4 Million Instructions a Second," Electronics, Jan. 13, 1982, pp. 155-159.

