Precise Exceptions in Computer Architecture
IMPLEMENTATION OF PRECISE INTERRUPTS IN PIPELINED PROCESSORS
J. E. Smith and Andrew R. Pleszkun
Department of Electrical and Computer Engineering
University of Wisconsin-Madison
Madison, WI 53706
the interrupted instruction. The interrupted instruction may or may not have been executed, depending on the definition of the architecture and the cause of the interrupt. Whichever is the case, the interrupted instruction has either completed, or has not started execution.
Abstract
An interrupt is precise if the saved process state corresponds with the sequential model of program execution where one instruction completes before the next begins. In a pipelined processor, precise interrupts are difficult to achieve because an instruction may be initiated before its predecessors have been completed. This paper describes and evaluates solutions to the precise interrupt problem in pipelined processors.
If the saved process state is inconsistent with the sequential architectural model and does not satisfy the above conditions, then the interrupt is imprecise.
This paper describes and compares ways of implementing precise interrupts in pipelined processors. The methods used are designed to modify the state of an executing process in a carefully controlled way. The simpler methods force all instructions to update the process state in the architectural order. Other, more complex methods save portions of the process state so that the proper state may be restored by the hardware at the time an interrupt occurs.
1.1. Classification of Interrupts
We consider interrupts belonging to two classes.
1. Introduction
Most current computer architectures are based on a sequential model of program execution in which an architectural program counter sequences through instructions one-by-one, finishing one before starting the next. In contrast, a high performance implementation may be pipelined, permitting several instructions to be in some phase of execution at the same time. The use of a sequential architecture and a pipelined implementation clash at the time of an interrupt; pipelined instructions may modify the process state in an order different from that defined by the sequential architectural model. At the time an interrupt condition is detected, the hardware may not be in a state that is consistent with any specific program counter value.
For graceful recovery from arithmetic exceptions, software routines may be able to take steps, re-scale floating point numbers for example, to allow a process to continue. Some end cases of modern floating point arithmetic systems might best be handled by software; gradual underflow in the proposed IEEE floating point standard [Stev81], for example.
(1) All instructions preceding the instruction indicated by the saved program counter have been executed and have modified the process state correctly.
Unimplemented
by system software
in a
Virtual machines can be implemented if privileged instruction faults cause precise interrupts. Host software can simulate these instructions and return to the guest operating system in a user-transparent way.
1.2. Historical Survey
The precise interrupt problem is as old as the first pipelined computer and is mentioned as early as Stretch [Buch62]. The IBM 360/91 [Ande67] was a well-known computer that produced imprecise interrupts under some circumstances, floating point exceptions, for example. Imprecise interrupts were a break with the IBM 360 architecture, which made them even more noticeable. All subsequent IBM 360 and 370 implementations have used less aggressive pipeline designs where instructions modify the process state in strict program order, and all interrupts are precise. A more complete description of the method used in these linear pipeline implementations is in Section 8.4.
Most pipelined implementations of general purpose architectures are similar to those used by IBM. These pipelines constrain all instructions to pass through the pipeline in order with a stage at the end where exception conditions are checked before the process state is modified. Examples include the Amdahl 470 and 580 [Amdh81, Amdh80] and the Gould/SEL 32/87 [Ward82].
The recently-announced CDC CYBER 180/990 [CDC84] is a pipelined implementation of a new architecture that supports virtual memory, and offers roughly the same performance as a CRAY-1S. To provide precise interrupts, the CYBER 180/990 uses a history buffer, to be described later in this paper, where state information is saved just prior to being modified. Then when an interrupt occurs, this history information can be used to back the system up into a precise state.
1.3. Paper Overview
2. Preliminaries
2.1. Model Architecture
The implementation for the simple architecture is shown in Fig. 1. It uses an instruction fetch/decode pipeline which processes instructions in order. The last stage of the fetch/decode pipeline is an issue register where all register interlock conditions are checked. If there are no register conflicts, an instruction issues to one of the parallel functional units. Here, the memory access function is implemented as one of the functional units. The operand registers are read at the time an instruction issues. There is a single result bus that returns results to the register file. This bus may be reserved at the time an instruction issues or when an instruction is approaching completion. This assumes the functional unit times are deterministic. A new instruction can issue every clock period in the absence of register or result bus conflicts.
[Figure 1. Pipelined implementation of the simple architecture: instruction fetch/decode, functional units, register file.]
3. In-order Instruction Completion
With this method, instructions modify the process state only when all previously issued instructions are known to be free of exception conditions. This section describes a strategy that is most easily implemented when pipeline delays in the parallel functional units are fixed. That is, they do not depend on the operands, only on the function. Thus, the result bus can be reserved at the time of issue.
Example 1
                              Comments                 Execute Time
        R2 <- 0               Init. loop index
        R0 <- 0               Init. loop count
        R5 <- 1               Loop inc. value
        R7 <- 100             Maximum loop count
L1:     R1 <- (R2 + A)        Load A(I)                11cp
        R3 <- (R2 + B)        Load B(I)                11cp
        R4 <- R1 +f R3        Floating add             6cp
        R0 <- R0 + R5         Inc. loop count          2cp
        (R0 + C) <- R4        Store C(I)
        R2 <- R2 + R5         Inc. loop index          2cp
        P = L1 : R0 != R7     Cond. branch not equal
Still disregarding precise interrupts, it is possible for a short instruction to be placed in the result pipeline in stage i when previously issued instructions are in stage j, j > i. This leads to instructions finishing out of the original program sequence. If the instruction at stage j eventually encounters an exception condition, the interrupt will be imprecise because the instruction placed in stage i will complete and modify the process state even though the sequential architecture model says i does not begin until j completes.
before the floating point add. The integer add will therefore change the process state before an overflow condition is detected in the floating point add. In the event of such an overflow, there is an imprecise interrupt.
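The out-of-order completion described above can be illustrated with a minimal sketch. The issue cycles and latencies below (a floating add of 6 clock periods followed one cycle later by an integer add of 2 clock periods) are assumptions chosen for illustration, and the function names are hypothetical.

```python
def completion_order(instrs):
    """instrs: (name, issue_cycle, latency) tuples in program order.
    Returns the names ordered by the cycle in which each result is written."""
    finish = [(issue + latency, idx, name)
              for idx, (name, issue, latency) in enumerate(instrs)]
    return [name for _, _, name in sorted(finish)]


def imprecise_if_faults(instrs, faulting):
    """True if some later-issued instruction writes its result
    before the instruction named `faulting` finishes."""
    names = [name for name, _, _ in instrs]
    k = names.index(faulting)
    _, issue_k, lat_k = instrs[k]
    return any(issue + lat < issue_k + lat_k
               for _, issue, lat in instrs[k + 1:])


# Assumed timing: floating add issues at cycle 0, integer add at cycle 1.
program = [("float_add", 0, 6), ("int_add", 1, 2)]
order = completion_order(program)
hazard = imprecise_if_faults(program, "float_add")
print(order, hazard)
```

With these assumed timings the integer add writes its result first, so an overflow detected later in the floating add would leave the process state imprecise.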
2.2. Interrupts Prior to Instruction Issue
3.1. Registers
There is logic on the result bus that checks for exception conditions in instructions as they complete. If an instruction contains a non-masked exception condition, then control logic cancels all subsequent instructions coming on the result bus so that they do not modify the process state.
3.2. Main Memory
Store instructions modify the portion of process state that resides in main memory. To implement precise interrupts with respect to memory, one solution is to force store instructions to wait for the result shift register to be empty before issuing. Alternatively, stores can issue and be held in the load/store pipeline until all preceding instructions are known to be exception-free. Then the store can be released to memory.
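The second alternative, holding issued stores until their predecessors are known to be exception-free, might be sketched as follows. The class name, the use of sequence numbers to express program order, and the dictionary standing in for main memory are all assumptions made for this illustration.

```python
from collections import deque


class HeldStores:
    """Stores wait here until every earlier instruction is exception-free."""

    def __init__(self):
        self.queue = deque()  # (sequence_number, address, value), program order

    def hold(self, seq, address, value):
        self.queue.append((seq, address, value))

    def release(self, last_safe_seq, memory):
        """last_safe_seq: highest sequence number such that it and all
        earlier instructions finished without exceptions (assumed input)."""
        while self.queue and self.queue[0][0] - 1 <= last_safe_seq:
            _, address, value = self.queue.popleft()
            memory[address] = value

    def cancel(self):
        """On an exception, held stores never reach memory."""
        self.queue.clear()


memory = {}
held = HeldStores()
held.hold(seq=8, address="C+1", value=3.5)
held.release(last_safe_seq=5, memory=memory)   # predecessors 6 and 7 still pending
early = dict(memory)
held.release(last_safe_seq=7, memory=memory)   # all predecessors are now safe
```

In this sketch memory stays untouched after the first call and receives the store only after the second.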
[Figure 2. Result shift register.]
3.3. Program Counter
To implement precise interrupts with respect to the program counter, the result shift register is widened to include a field for the program counter of each instruction (see Fig. 2). This field is filled as the instruction issues. When an instruction with an exception condition appears at the result bus, its program counter is available and becomes part of the saved state.
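A rough sketch of a result shift register with a program counter field is given below. The issue rule used in `can_issue` (an instruction may not issue if an earlier one would finish at the same time or later) is an assumed way of forcing in-order completion, and the class, field names, latencies, and program counter values are hypothetical.

```python
class ResultShiftRegister:
    """Sketch: slots[k] holds the entry that reaches the result bus in k cycles."""

    def __init__(self, stages):
        self.slots = [None] * (stages + 1)  # index 0 is unused

    def can_issue(self, latency):
        # Assumed in-order rule: nothing issued earlier may finish later.
        return all(s is None for s in self.slots[latency:])

    def issue(self, latency, pc, dest, value, exception=False):
        assert self.can_issue(latency)
        self.slots[latency] = {"pc": pc, "dest": dest,
                               "value": value, "exception": exception}

    def tick(self, regs):
        """Advance one clock; return the saved PC if an exception reaches the bus."""
        done = self.slots[1]
        self.slots = [None] + self.slots[2:] + [None]
        if done is None:
            return None
        if done["exception"]:
            self.slots = [None] * len(self.slots)  # cancel subsequent instructions
            return done["pc"]
        regs[done["dest"]] = done["value"]
        return None


regs = {"R0": 0, "R4": 0.0}
rsr = ResultShiftRegister(stages=8)
rsr.issue(6, pc=6, dest="R4", value=None, exception=True)  # floating add overflows
saved_pc, issued_int, cycle = None, False, 0
while saved_pc is None and cycle < 10:
    saved_pc = rsr.tick(regs)
    cycle += 1
    if saved_pc is None and not issued_int and rsr.can_issue(2):
        rsr.issue(2, pc=7, dest="R0", value=1)  # integer add waits at issue
        issued_int = True
```

Here the integer add is held at issue until it can no longer overtake the floating add, and the faulting instruction's program counter is what the sketch returns as saved state.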
4. The Reorder Buffer
The primary disadvantage of the above method is that fast instructions may sometimes get held up at the issue register even though they have no dependencies and would otherwise issue. In addition, they block the issue register while slower instructions behind them could conceivably issue.
[Figure 3. Reorder buffer and associated result shift register.]
instruction is placed in the reorder buffer, any entries with the same destination register designator must be inhibited from matching a bypass check.
When bypass paths are added, preciseness with respect to the memory and the program counter does not change from the previous method.
The greatest disadvantage with this method is the number of bypass comparators needed and the amount of circuitry required for the multiple bypass check. While this circuitry is conceptually simple, there is a great deal of it.
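The bypass check can be pictured with a small sketch in which only the most recently allocated matching entry is allowed to supply an operand. The entry layout and the three-part return value are assumptions for illustration; hardware would perform the comparisons in parallel rather than in a loop.

```python
def bypass_lookup(entries, reg):
    """entries: reorder buffer entries from oldest to newest, each a dict
    with 'dest', 'valid', and 'value'. Returns (hit, ready, value)."""
    for entry in reversed(entries):           # newest entry first
        if entry["dest"] == reg:
            if entry["valid"]:
                return True, True, entry["value"]
            return True, False, None          # latest writer has not finished
    return False, False, None                 # no match: read the register file


entries = [
    {"dest": "R0", "valid": True, "value": 1},
    {"dest": "R4", "valid": False, "value": None},
    {"dest": "R0", "valid": True, "value": 2},
]
newest_r0 = bypass_lookup(entries, "R0")
pending_r4 = bypass_lookup(entries, "R4")
no_match = bypass_lookup(entries, "R7")
```

Older entries for the same register are skipped, which corresponds to inhibiting them from matching the bypass check.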
4.1. Basic Method
The entries in the reorder buffer and result shift register shown in Figure 3b reflect their state after the integer add from Example 2 has issued. Notice that the result shift register entries are very similar to those in Figure 2. The integer add will complete execution before the floating point add and its results will be placed in entry 5 of the reorder buffer. These results, however, will not be written into R0 until the floating point result, found in entry 4, has been placed in R4.
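The behaviour in this example, where a finished integer add waits in the buffer behind an unfinished floating add, can be sketched as a circular buffer that retires only from its head. The class and method names, the entry fields, and the flush-on-exception step are assumptions for illustration and are not taken from the figures.

```python
class ReorderBuffer:
    def __init__(self, size):
        self.entries = [None] * size
        self.head = 0
        self.tail = 0
        self.count = 0

    def allocate(self, pc, dest):
        """Called at issue; returns the tag (entry number), or None if full."""
        if self.count == len(self.entries):
            return None
        tag = self.tail
        self.entries[tag] = {"pc": pc, "dest": dest, "valid": False,
                             "value": None, "exception": False}
        self.tail = (self.tail + 1) % len(self.entries)
        self.count += 1
        return tag

    def write_result(self, tag, value=None, exception=False):
        e = self.entries[tag]
        e["valid"], e["value"], e["exception"] = True, value, exception

    def retire(self, regs):
        """Write completed head entries to the register file in program order.
        Returns the PC of a faulting instruction, else None."""
        while self.count and self.entries[self.head]["valid"]:
            e = self.entries[self.head]
            if e["exception"]:
                self.entries = [None] * len(self.entries)  # discard later results
                self.head = self.tail = self.count = 0
                return e["pc"]
            regs[e["dest"]] = e["value"]
            self.head = (self.head + 1) % len(self.entries)
            self.count -= 1
        return None


regs = {"R0": 0, "R4": 0.0}
rob = ReorderBuffer(8)
t_fadd = rob.allocate(pc=6, dest="R4")
t_iadd = rob.allocate(pc=7, dest="R0")
rob.write_result(t_iadd, 1)        # integer add finishes first
rob.retire(regs)
after_int_only = dict(regs)        # R0 is not yet updated
rob.write_result(t_fadd, 3.5)
rob.retire(regs)
```

The register file changes only once the floating add at the head has delivered its result, after which both entries retire in program order.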
5. History Buffer
The methods presented in this section and the next are intended to reduce or eliminate performance losses experienced with a simple reorder buffer, but without all the control logic needed for multiple bypass paths. Primarily, these methods place computed results in a working register file, but retain enough state information so a precise state can be restored if an exception occurs.
Fig. 5a illustrates the history buffer method. The history buffer is organized in a manner very similar to the reorder buffer. At issue time, a buffer entry is loaded with control information, as with the reorder buffer, but the value of the destination register (soon to be overwritten) is also read from the register file and written into the buffer entry. Results on the result bus are written directly into the register file when an instruction completes. Exception reports come back as an instruction completes and are written into the history buffer. As with the reorder buffer, the exception reports are guided to the proper history buffer entry through the use of tags found in the result shift register. When the history buffer contains an element at the head that is known to have finished without exceptions, the history buffer entry is no longer needed and that buffer location can be re-used (the head pointer is incremented). As with the reorder buffer, the history buffer can be shorter than the maximum number of pipeline stages. If all history buffer entries are used (the buffer is too small), issue must be blocked until an entry becomes available. Hence, the buffer should be long enough so that this seldom happens. The effect of the history buffer on performance is determined in Section 7.
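A minimal sketch of this scheme follows. The names are hypothetical, and the recovery step (restoring old values from the tail back to the head once a faulting entry reaches the head) is an assumption about how the saved values would be used; instructions still in flight at that moment are simply dropped in this sketch.

```python
from collections import deque


class HistoryBuffer:
    def __init__(self):
        self.entries = deque()  # oldest entry (head) on the left

    def issue(self, pc, dest, regs):
        """Save the value about to be overwritten, then return the entry."""
        entry = {"pc": pc, "dest": dest, "old": regs[dest],
                 "done": False, "exception": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, regs, value=None, exception=False):
        """Results go straight into the register file; exceptions are recorded."""
        entry["done"], entry["exception"] = True, exception
        if not exception:
            regs[entry["dest"]] = value

    def retire(self, regs):
        """Free finished head entries; on a faulting head, back up the state.
        Returns the PC of the faulting instruction, else None."""
        while self.entries and self.entries[0]["done"]:
            head = self.entries[0]
            if head["exception"]:
                while self.entries:              # newest first (assumed order)
                    e = self.entries.pop()
                    regs[e["dest"]] = e["old"]
                return head["pc"]
            self.entries.popleft()
        return None


regs = {"R0": 5, "R4": 0.0}
hb = HistoryBuffer()
fadd = hb.issue(pc=6, dest="R4", regs=regs)
iadd = hb.issue(pc=7, dest="R0", regs=regs)
hb.complete(iadd, regs, value=6)        # register file is updated immediately
r0_before_fault = regs["R0"]
hb.complete(fadd, regs, exception=True)
saved_pc = hb.retire(regs)
```

In the run above R0 is overwritten early by the integer add and then restored when the floating add reports its exception.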
Example
The entries in the history buffer and result shift register shown in Fig. 5b correspond to our code in Example 1, after the integer add has issued. The only differences between this and the reorder buffer method shown in Fig. 3b are the addition of an "old value" field in the history buffer and a "destination register" field in the result shift register. The result shift register now looks like the one shown in Fig. 2.
4.2. Bypass Paths
There may be bypass paths from some or all of the reorder buffer entries. If multiple bypass paths exist, it is possible for more than one destination entry in the reorder buffer to correspond to a single register. Clearly only the latest reorder buffer entry that corresponds to an operand designator should generate a bypass path to the register output latch. To prevent multiple bypassing of the same register, when an
6. Future File
Instructions are issued and results are returned to the future file in any order, just as in the original pipeline model. There is also a reorder buffer that receives results at the same time they are written into the future file. When the head pointer finds a completed instruction (a valid entry), the result associated with that entry is written in the architectural file.
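The interplay of the two register files can be sketched as below. The class and field names are hypothetical, and the recovery step, copying the architectural file back into the future file when a faulting entry reaches the head, is an assumption about how a precise state would be re-established.

```python
from collections import deque


class FutureFileMachine:
    def __init__(self, regs):
        self.architectural = dict(regs)  # updated in program order only
        self.future = dict(regs)         # updated as soon as results arrive
        self.rob = deque()

    def issue(self, pc, dest):
        entry = {"pc": pc, "dest": dest, "valid": False,
                 "value": None, "exception": False}
        self.rob.append(entry)
        return entry

    def complete(self, entry, value=None, exception=False):
        entry["valid"], entry["value"], entry["exception"] = True, value, exception
        if not exception:
            self.future[entry["dest"]] = value

    def retire(self):
        """Copy finished head entries into the architectural file.
        Returns the PC of a faulting instruction, else None."""
        while self.rob and self.rob[0]["valid"]:
            e = self.rob.popleft()
            if e["exception"]:
                self.rob.clear()
                self.future = dict(self.architectural)  # assumed recovery step
                return e["pc"]
            self.architectural[e["dest"]] = e["value"]
        return None


m = FutureFileMachine({"R0": 5, "R4": 0.0})
fadd = m.issue(pc=6, dest="R4")
iadd = m.issue(pc=7, dest="R0")
m.complete(iadd, 6)
future_r0, arch_r0 = m.future["R0"], m.architectural["R0"]
m.complete(fadd, exception=True)
saved_pc = m.retire()
```

Operands would be read from the future file, so no bypass comparators appear in the sketch; the architectural file lags behind and always reflects a sequential state.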
[Figure 5. History buffer and associated result shift register.]
7. Performance Evaluation
The simulation results for the In-order column are constant since this method does not depend on a buffer that reorders instructions. For all the methods, there is some performance degradation. Initially, when the reorder buffer is small, the In-order method produces the least performance degradation. A small reorder buffer (less than 3 entries) limits the number of instructions that can simultaneously be in some stage of execution. Once the reorder buffer size is increased beyond 3 entries, either of the other methods results in better performance. As expected, the reorder buffer with bypasses offers superior performance when compared with the simple reorder buffer. When the size of the buffer was increased beyond 10 entries, simulation results indicated no further performance improvements. (Simulations were also run for buffer sizes of 15, 16, 20, 25, and 60.) At best, one can expect a 12% performance degradation when using a reorder buffer with bypasses and the first method for handling stores.
... Livermore loops [McMa72] were used as the workload.
Number of Entries    In-order    Reorder    R w/BP
                     1.2322      1.3315     (illegible)
                     1.2322      1.2183     1.1743
                     1.2322      1.1954     1.1439
                     1.2322      1.1808     1.1208
                     1.2322      1.1808     1.1208

Number of Entries    In-order    Reorder    R w/BP
                     1.1560      1.1724     1.1152
                     1.1560      1.1348     1.0539
                     1.1560      1.1167     1.0279
10                   1.1560      1.1167     1.0279
8. Extensions
In previous sections, we described methods that could be used to guarantee precise interrupts with respect to the registers, the main memory, and the program counter of our simple architectural model. In the following sections, we extend the previous methods to handle additional state information, virtual memory, a cache, and linear pipelines. Effectively, some of these machine features can be considered to be functional units with non-deterministic execution times.
Part of all the methods is that stores are held until all previous instructions are known to be exception-free. With a cache, stores may be made into the cache earlier, and for performance reasons should be. The actual updating of main memory, however, is still subject to the same constraints as before.
8.1. Handling Other State Values
Most architectures have more state information than we have assumed in the model architecture. For example, a process may have state registers that point to page and segment tables, indicate interrupt mask conditions, etc. This additional state information can be precisely maintained with a method similar to that used for stores to memory. If using a reorder buffer, an instruction that changes a state register reserves a reorder buffer entry and proceeds to the part of the machine where the state change will be made. The instruction then waits there until receiving a signal to continue from the reorder buffer. When its entry arrives at the head of the buffer and is removed, then a signal is sent to cause the state change.
8.4. Linear Pipeline Structures
An alternative to the parallel functional unit organizations we have been discussing is a linear pipeline organization. Refer to Fig. 7.
8.3. Cache Memory
8.3.1. Store-through Caches
With a store-through cache, the cache can be updated immediately, while the store-through to main memory is handled as in previous sections. That is, all previous instructions must first be known to be exception-free. Load instructions are free to use the cached copy, however, regardless of whether the store-through has taken place. This means that main memory is always in a precise state, but the cache contents may "run ahead" of the precise state. If an interrupt should occur while the cache is potentially in such a state, then the cache should be flushed. This guarantees that prematurely updated cache locations will not be used. However, this can lead to performance problems, especially for larger caches.
Another alternative is to treat the cache in a way similar to the register files. One could, for example, keep a history buffer for the cache. Just as with registers, a cache location would have to be read just prior to writing it with a new value. This does not necessarily mean a performance penalty because the cache must be checked for a hit prior to the write cycle. In many high performance cache organizations, the read cycle for the history data could be done in parallel with the hit check. Each store instruction makes a buffer entry indicating the cache location it has written. The buffer entries can be used to restore the state of the cache. As instructions complete without exceptions, the buffer entries are discarded. The future file can be extended in a similar way.
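As a rough illustration of a history buffer for the cache, the sketch below logs the previous contents of each location a store overwrites and replays the log backwards to undo premature updates. The dictionary standing in for the cache, the sentinel for a location that held no prior value, and all names are assumptions.

```python
_ABSENT = object()  # marks a location that held nothing before the store


class CacheHistory:
    def __init__(self):
        self.log = []  # (location, old_value), oldest first

    def store(self, cache, location, value):
        """Read the old contents just before writing the new value."""
        self.log.append((location, cache.get(location, _ABSENT)))
        cache[location] = value

    def discard_oldest(self):
        """The oldest store is known to be exception-free; drop its entry."""
        self.log.pop(0)

    def restore(self, cache):
        """Undo all logged stores, newest first."""
        while self.log:
            location, old = self.log.pop()
            if old is _ABSENT:
                cache.pop(location, None)
            else:
                cache[location] = old


cache = {0x40: 1}
history = CacheHistory()
history.store(cache, 0x40, 2)
history.store(cache, 0x80, 9)
ran_ahead = dict(cache)
history.restore(cache)
```

After `restore` the cache holds the same contents it had before the two stores, so it does not need to be flushed.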
8.3.2. Write-Back Cache
8.2. Virtual Memory
[Figure 7. Linear pipeline structure; stages labeled OPERAND FETCH and EXECUTION, with Cache Memory.]
10. Acknowledgement
One of the authors (J. E. Smith) would like to thank R. G. Hintz and J. B. Pearson of the Control Data Corp. with whom he was associated during the development of the CYBER 180/990. This paper is based upon research supported by the National Science Foundation under grant ECS-8207217.
11. References
[Amdh80] Amdahl Corporation, 580 Technical Introduction, 1980.
[Amdh81] Amdahl Corporation, Amdahl 470V/8 Computing System Machine Reference Manual, publication no. G1014.0-03A, Oct. 1981.
[Ande67]
[Bons69] P. Bonseigneur, Description of the 7600 Computer System, Computer Group News, May 1969, pp. 11-15.
[Buch62] W. Buchholz, ed., New York, 1962.
[CDC84]
[Cray79]
[Henn82] J. Hennessy et al., Hardware/Software Tradeoffs for Increased Performance, Proc. Symp. on Architectural Support for Programming Languages and Operating Systems, April 1982, pp. 2-11.
[HiTa72]
[McMa72] F. H. McMahon, FORTRAN CPU Performance Analysis, Lawrence Livermore Laboratories, 1972.
[PaSm83]
[Russ78]
9. Summary and Conclusions
Five methods have been described that solve the precise interrupt problem. These methods were then evaluated through simulations of a CRAY-1S implemented with these methods. These simulation results indicate that, depending on the method and the way stores are handled, the performance degradation can range from 25% to 3%. It is expected that the cost of implementing these methods could vary substantially, with the method producing the smallest performance degradation probably being the most expensive. Thus, selection of a particular method will depend not only on the performance degradation, but on whether the implementor is willing to pay for that method.
It is important to note that some indirect causes for performance degradation were not considered. These include longer control paths that would tend to lengthen the clock period. Also, additional logic for supporting precise interrupts implies greater board area which implies more wiring delays which could also lengthen the clock period.
[CDC81]
[Thor70] J. E. Thornton, Design of a Computer - The Control Data 6600, Scott, Foresman and Co., Glenview, IL, 1970.
[Ward82]