Tuning the Pentium Pro Microarchitecture

David B. Papworth
Intel Corporation

This inside look at a large microprocessor development project reveals some of the reasoning (for goals, changes, trade-offs, and performance simulation) that lay behind its final form.

Designing a wholly new microprocessor is difficult and expensive. To justify this effort, a major new microarchitecture must improve performance one and a half or two times over the previous-generation microarchitecture, when evaluated on equivalent process technology. In addition, semiconductor process technology continues to evolve while the processor design is in progress. The previous-generation microarchitecture increases in clock speed and performance due to compactions and conversion to newer technology. A new microarchitecture must "intercept" the process technology to achieve a compounding of process and microarchitectural speedups.

The process technology, degree of pipelining, and amount of effort a team is willing to spend on circuit and layout issues determine the clock speed of a microarchitecture. Typically, a microarchitecture will start with the same clock speed as a prior microarchitecture (adjusted for process technology scaling). This enables the maximum reuse of past designs and circuits, and fits the new design to the existing product development tools and methodology. Performance enhancements should come primarily from the microarchitecture and not from clock speed enhancements per se.

Often, a new processor's die area is close to the maximum that can be manufactured. This design choice stems from marketplace competitiveness and efforts to get as much performance as possible in the new microarchitecture. While making the die smaller and cheaper and improving performance are desirable, it is generally not possible to achieve a 1.5-to-2-times-better performance goal without using at least 1.5 to 2 times a prior design's transistors.

Finally, new processor designs often incorporate new features. As the performance of the core logic improves, designs must continue to enhance the bus and cache architecture to keep pace with the core. Further, as other technologies (such as multiprocessing) mature, there is a natural tendency to draw them into the processor design as a way of providing additional features and value for the end user.

Mass-market designs
The large installed base and broad range of applications for the Intel architecture place additional constraints on the design, constraints beyond the purely academic ones of performance and clock frequency. We do not have the flexibility to control software applications, compilers, or operating systems in the same way system vendors can. We cannot remove obsolete features and must cater to a wide variety of coding styles. The processor must run thousands of shrink-wrapped applications, rather than only those compiled by a vendor-provided compiler running on a vendor-provided operating system on a vendor-provided platform. These limitations leave fewer avenues for workarounds and leave the processor exposed to a much greater variety of instruction sequences and boundary conditions.

Intel's architecture has accumulated a great deal of history and features in 15 years. The product must deliver world-class performance and also successfully identify and resolve compatibility issues. The microprocessor may be an assemblage of pieces from many different vendors, yet must function reliably and be easy for the general public to use.

Since a new design needs to be manufacturable in high volume from the very beginning, designers cannot allow the design to expand to the limits of the technology. It also must meet stringent environmental and
design-life limits. It must deliver high performance using a
set of motherboard components costing less than a few hun-
dred dollars.
Meeting these additional design constraints is critical for
business success. They add to the complexity of the project
and the total effort required, compared to a brand-new
instruction set architecture. The additional complexity results
in extra staffing and longer schedules. A vital ingredient in
long-term success is proper planning and management of
the demands of extra complexity; management must ensure
that the complexity does not impact the product's long-term
performance and availability.

Our first effort
After due consideration of the performance, area, and mass-market constraints, we knew we would have to implement out-of-order execution and register renaming to wring more instruction-level parallelism out of existing code. Further, the modest register file of the Intel architecture constrains any compiler. That is, it limits the amount of instruction reordering that a compiler can do to increase superscalar parallelism and basic block size. Clearly, a new microarchitecture would have to provide some way to escape the constraints of false dependencies and provide a form of dynamic code motion in hardware.

To conform to projected Pentium processor goals, we initially targeted a 100-MHz clock speed using 0.6-micron technology. Such a clock speed would have resulted in roughly a 10-stage pipeline. It would have much the same structure as the Pentium processor, with an extra decode stage added for more instruction decode bandwidth. It would also require extra stages to implement register renaming, runtime scheduling, and in-order retirement functions.

We expected a two-clock data cache access time (like the Pentium processor) and other core execution units that would strongly resemble the Pentium processor. The straw-man microarchitecture would have had the following components:

- a 100-MHz clock using 0.6-micron technology,
- a 10-stage pipeline,
- four-instruction decoding per clock cycle,
- four-micro-operation renaming and retiring per clock cycle,
- a 32-Kbyte level-1 (L1) instruction cache,
- a separate 32-Kbyte L1 data cache,
- two general load/store ports, and
- a total of 10 million transistors.

From the outset we planned to include a full-frequency, dedicated L2 cache, with some flavor of advanced packaging connecting the cache to the processor. Our intent was to enable effective shared-memory multiprocessing by removing the processor-to-L2 transactions from the traditional global interconnect, or front-side, bus, and to facilitate board and platform designs that could keep up with the high-speed processor. Remember that in 1990/1991, when we began the project, it was quite a struggle to build 50- and 66-MHz systems. It seemed prudent to provide for a package-level solution to this problem.

What we actually built
The actual Pentium Pro processor looks much different from our first straw man:

- a 150-MHz clock using 0.6-micron technology,
- a 14-stage pipeline,
- three-instruction decoding per clock cycle,
- three micro-operations (micro-ops) renamed and retired per clock cycle,
- an 8-Kbyte L1 instruction cache,
- an 8-Kbyte L1 data cache,
- one dedicated load port and one dedicated store port, and
- 5.5 million transistors.

The evolution process
Our first efforts centered on designing and simulating a high-performance dynamic-execution engine. We attacked the problems of renaming, scheduling, and dispatching, and designed core structures that implemented the desired functionality.

Circuit and layout studies overlapped this effort. We discovered that the basic out-of-order core and the functional units could run at a higher clock frequency than 100 MHz. In addition, instruction fetching and decoding in two pipeline stages and data cache access in two clock cycles were the main frequency limiters.

One of our first activities was to create a microarchitect's workbench. Basically, this was a performance simulator capable of modeling the general class of dynamic-execution microarchitectures. We didn't base this simulator on silicon structures or detailed modeling of any particular implementation. Instead, it took an execution trace as input and applied various constraints to each instruction, modeling the functions of decoding, renaming, scheduling, dispatching, and retirement. It processed one micro-operation at a time, from decoding until retirement, and at each stage applied the limitations extant in the design being modeled.

This simulator was very flexible in allowing us to model any number of possible architectures. Modifying it was much faster than modifying a detailed, low-level implementation, since there was no need for functional correctness or routing signals from one block to another. We set up this simulator to model our initial straw-man microarchitecture and then performed a sensitivity analysis of the major microarchitectural areas that affect performance, clock speed, and die area.
We simulated each change or potential change against at least 2 billion instructions from more than 200 programs. We studied the effects of L1 cache size, pipeline depth, branch prediction effectiveness, renaming width, reservation station depth and organization, and reorder buffer depth. Quite often, we found that our initial intuition was wrong and that every assumption had to be tested and tuned to what was proven to work.

[Acronyms used in the block diagram and sidebar: AGU, address generation unit; BIU, bus interface unit; BTB, branch target buffer; DCU, data cache unit; FEU, floating-point execution unit; ID, instruction decoder; IEU, integer execution unit; IFU, instruction fetch unit (includes I-cache); L2, level-2 cache; MIS, microinstruction sequencer; MIU, memory interface unit; MOB, memory reorder buffer; RAT, register alias table; ROB, reorder buffer; RRF, retirement register file; RS, reservation station.]

The trade-off
Based on our circuit studies, we explored what would happen if we boosted the core frequency by 1.5 times over our initial straw man. This required a few simple changes to the reservation station, but clearly we could build the basic core to operate at this frequency. It would allow us to retain single-cycle execution of basic arithmetic operations at a 50 percent higher frequency. The rest of the pipeline would then have to be retuned to use one-third fewer gates per clock than a comparable Pentium microarchitecture.
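The "one-third fewer gates" figure follows directly from the clock period; a one-line worked version (the gate count is a hypothetical stand-in for whatever a Pentium-class stage held):

    # 1.5x the frequency means each pipe stage gets 1/1.5 = 2/3 of the
    # old clock period, so roughly one-third fewer gate delays fit.
    old_gate_delays_per_stage = 30            # hypothetical budget
    print(old_gate_delays_per_stage / 1.5)    # 20.0: one-third fewer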
Pentium Pro (continued)
Following renaming, the operations wait in a 20-entry reservation station (RS) until all of their operands are data ready and a functional unit is available. As many as five micro-ops per clock can pass from the reservation station to the various execution units. These units perform the desired computation and send the result data back to data-dependent micro-ops in the reservation station, as well as storing the result in the reorder buffer (ROB).

The reorder buffer stores the individual micro-ops in the original program order and retires as many as three per clock into the retirement register file (RRF). This file examines each completed micro-op for the presence of faults or branch mispredictions, and aborts further retirement upon detecting such a discontinuity. This reimposes the sequential fault model and the illusion of a microarchitecture that executes each instruction in strict sequential order.

The L1 data cache unit (DCU) acts as one of the execution units. It can accept a new load or store operation every clock and has a data latency of three clocks for loads. It contains an 8-Kbyte, two-way associative cache array, plus fill buffers to buffer data and track the status of as many as four simultaneously outstanding data cache misses.

The bus interface unit (BIU) processes cache misses. This unit manages the L2 cache and its associated 64-bit, full-frequency bus, as well as the front-side system bus, which typically operates at a fraction of the core frequency, such as 66 MHz on a 200-MHz processor. Transactions on the front-side and dedicated buses are organized as 32-byte cache line transfers. The overall bus architecture permits multiple Pentium Pro processors to be interconnected on the front-side bus to form a glueless, symmetric, shared-memory multiprocessor system.

For a detailed discussion of the microarchitecture, see Colwell and Steck,1 the Intel Web site,2 and Gwennap.3

References
1. R. Colwell and R. Steck, "A 0.6-μm BiCMOS Microprocessor with Dynamic Execution," Proc. Int'l Solid-State Circuits Conf., IEEE, Piscataway, N.J., 1995, pp. 176-177.
2. http://www.intel.com/procs/p6/p6white/index.html (Intel's World Wide Web site).
3. L. Gwennap, "Intel's P6 Uses Decoupled Superscalar Design," Microprocessor Report, Vol. 9, No. 2, Feb. 16, 1995, pp. 9-15.

We first added a clock cycle to data cache lookup, changing it from two to three cycles. We used the performance simulator to model this change and discovered that it resulted in a 7 percent degradation (increase) in clock cycles per instruction.

The next change was to rework the instruction fetch/decode/rename pipeline, resulting in an additional two stages in the in-order fetch pipeline. Retirement also required one additional pipe stage. This lengthening resulted in a further clock-per-instruction (CPI) degradation of about 3 percent.

Finally, we pursued a series of logic simplifications to shorten critical speed paths. We applied less aggressive organizations to several microarchitecture areas and experienced another 1 percent in CPI loss.

The high-frequency microarchitecture completes instructions at a 50 percent higher rate than the lower frequency microarchitecture, but requires 11 percent more of the now-faster clocks per 100 instructions to enable this higher frequency. The net performance is (1.5/1.0) * (1.0/1.11) = 1.35, or a 35 percent performance improvement, a very significant gain compared to most microarchitecture enhancements.
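The net-performance arithmetic is worth pinning down, since the individual CPI losses compound multiplicatively. A few lines of Python, using only the percentages quoted above, reproduce the numbers:

    # The article's numbers: a 1.5x clock gain, paid for with CPI
    # degradations of 7% (data cache stage), 3% (fetch/retire stages),
    # and 1% (logic simplifications), compounding multiplicatively.
    freq_gain = 1.5
    cpi_penalties = [0.07, 0.03, 0.01]

    cpi_ratio = 1.0
    for p in cpi_penalties:
        cpi_ratio *= 1.0 + p          # 1.07 * 1.03 * 1.01 ~= 1.11

    speedup = freq_gain / cpi_ratio   # (f_new/f_old) * (CPI_old/CPI_new)
    print(f"CPI degradation ~{cpi_ratio - 1:.0%}, net speedup {speedup:.2f}x")
    # -> CPI degradation ~11%, net speedup 1.35x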
[Figure 1. Delivered performance versus clock frequency. The x-axis spans 100 to 150 MHz; numbered points mark the changes discussed below.]

In Figure 1 we see that performance generally improves as clock frequency increases. The improvement is not linear or monotonic, however. Designers must make a series of microarchitectural changes or trade-offs to enable the higher frequency. This in turn results in "jaggies," or a CPI degradation and performance loss at the point at which we make a change. The major drop shown at point 1 represents the 7 percent CPI loss due to the added data cache pipe stage. The series of minor deflections at points 2, 3, and 4 shows the effect of added front-end pipe stages. The overall trend does not continue indefinitely, as the CPI starts to roll off dramatically once the central core no longer maintains one-clock latency for simple operations.

The right of this graph shows a fairly clear performance win, assuming that one picks a point that is not in the valley of one of the CPI dips, and assuming that the project can absorb the additional effort and complexity required to hit higher clock speeds.
When we first laid out the straw man, we did not expect the CPI-clock frequency curve to have this shape. Our initial intuition suggested that the cost of an extra clock of latency on loads would be more severe than it actually is. Further, past industry experience suggests that high frequency at the expense of deep pipelines often results in a relative standstill in performance for general-purpose integer code. However, our performance simulator showed the graphed behavior and the performance win possible from higher clock speeds. Since we had carefully validated the simulator with micro-benchmarks (to ensure that it really modeled the effect in question), we were inclined to believe its results over our own intuition and went ahead with the modified design. We did, however, work to come up with qualitative explanations of the results, which follow.

Consider a program segment that takes 100 clock cycles at 100 MHz. The baseline microarchitecture takes 1 microsecond to execute this segment. We modify this baseline by adding an extra pipe stage to loads. If 30 percent of all operations are loads, this would add 30 clocks to the segment, which would then take 130 clocks to execute. If the extra pipe stage enables a 50 percent higher frequency, the total execution time becomes 130/150, or 0.867 microseconds. This amounts to a 15 percent performance improvement (1/0.867). This is certainly a higher performance microarchitecture, but hardly the 50 percent improvement one might naively expect from clock rate alone. Such a result is typical of past experience with in-order pipelines when we seek the CPI-frequency balance.

The Pentium Pro microarchitecture does not suffer this amount of CPI degradation from increased load latency because it buffers multiple loads, dispatches them out of order, and completes them out of order. About 50 percent of loads are not critical from a dataflow perspective. These loads (typically from the stack or from global data) have their address operands available early. The 20-entry reservation station in the Pentium Pro processor can buffer a large pool of micro-ops, and these "data-ready" load micro-ops can bubble up and issue well ahead of the critical-path operations that need their results. For this class of loads, an extra clock of load latency does not impact performance.

The remaining 50 percent of the loads have a frequent overlap of multiple critical-path loads. For example, the code fragment a = b + c might compile into the sequence

load b => r1
load c => r2
r1 plus r2 => r3

Both b and c are critical-path loads, but even if each takes an extra clock of latency, only one extra clock is added for both, assuming loads are pipelined and nonblocking. This blocking-factor effect varies, depending upon the program mix. But a rule of thumb for the Pentium Pro processor is that additional clocks of load latency cost approximately half of what they do in a strict in-order architecture.

Thus the 15 percent of micro-ops that are critical-path loads take an extra clock, but the overlap factor results in one half of the equivalent in-order penalty, or about 7.5 percent. This is close to what we measured in the detailed simulation of actual code.

Now let's look at the effect of additional fetch pipeline stages. If branches are perfectly predicted, the fetch pipeline can be arbitrarily long, at no performance cost. The cost of extra pipe stages comes only on branch mispredictions. If 20 percent of the micro-ops are branches, and branch prediction is 90 percent accurate, then two micro-ops in 100 are mispredictions. Each additional clock in the fetch pipeline will add one clock per misprediction. If the base CPI is about 1, we'll see about two extra clocks per 100 micro-ops, or an additional 2 percent per pipe stage. The actual penalty is somewhat less, because branch prediction is typically better than 90 percent, and there is a compound-interest effect: as the pipeline gets longer, the CPI degrades, and the cost of yet another pipe stage diminishes.
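These rules of thumb translate directly into arithmetic. The sketch below simply restates the fractions from the preceding paragraphs (30 percent loads, half of them critical-path, half of that cost hidden by overlap; 20 percent branches at 90 percent prediction accuracy), so the formulas are an illustrative reading rather than a quoted model:

    # In-order versus out-of-order cost of one extra clock of load latency.
    loads = 0.30        # fraction of micro-ops that are loads
    critical = 0.50     # fraction of loads on the dataflow critical path
    overlap = 0.50      # out-of-order overlap hides about half the cost

    in_order_penalty = loads                     # +30 clocks per 100 micro-ops
    ooo_penalty = loads * critical * overlap     # +7.5 clocks per 100 micro-ops
    print(f"in-order +{in_order_penalty:.0%} CPI, out-of-order +{ooo_penalty:.1%} CPI")

    # The 100-clock segment example: 130 clocks at a 1.5x faster clock.
    print(f"in-order net speedup: {(100 * 1.5) / 130:.2f}x")   # ~1.15x

    # Cost of one extra fetch pipe stage, paid only on mispredictions.
    branches, accuracy, base_cpi = 0.20, 0.90, 1.0
    extra = branches * (1 - accuracy) / base_cpi   # clocks per micro-op
    print(f"extra fetch stage ~+{extra:.0%} CPI")  # ~2% per added stage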
Clock frequency versus effort
This all sounds like a fine theoretical basis for building a faster processor, but it comes at a nontrivial cost. Using a higher clock frequency reduces the margin for error in any one pipe stage. The consequence of needing one too many gates to get the required functionality into a given pipe stage is a 9 to 10 percent performance loss, rather than a 4 to 5 percent loss. So the design team must make a number of small microarchitecture changes as the design matures, since it is impossible to perfectly anticipate every critical path and design an ideal pipeline. This results in rework and a longer project schedule. Further, with short pipe stages, many paths cannot absorb the overhead of logic synthesis, increasing the proportion of the chip for which we must hand-design circuits.

Higher clock speeds require much more hand layout and careful routing. The densities achievable by automatic placement and routing are often inadequate to keep parasitic delays within what will fit in a clock cycle. Beyond that, the processor spends a bigger fraction of each clock period on latched delay, set-up time, clock skew, and parasitics than with a slower, wider pipeline. This puts even more pressure on designers to limit the number of gates per pipe stage.

The higher performance that results from higher clock speeds places more pressure on the clock and power distribution budget. The shorter clock period is less able to absorb clock jitter and localized voltage sags, requiring very careful and detailed speed path simulations.

As long as a design team expects, manages, and supports this extra effort, clock speedups provide an excellent path to higher performance. Even if this comes at some CPI degradation, the end result is both a higher performance product and one that hits production earlier than one that attempts to retrofit higher clock frequency into a pipeline not designed for it.
The terms "architectural efficiency" or "performance at the same clock" are sometimes taken as metrics of goodness in and of themselves. Perhaps this is one way of apologizing for low clock rates, or a way to imply higher performance when the microarchitecture "someday" reaches a clock rate that is in fact unobtainable for that design with that process technology. Performance at the same clock is not a good microarchitectural goal if it means building bottlenecks into the pipeline that will forever impact clock frequency. Similarly, low latency by itself is not an important goal. Designers must consider the balance between latency, available parallelism in the application, and the impact on clock speed of forcing a lot of functionality into a short clock period.

It is equally meaningless to brag about high clock frequency without considering the CPI and other significant performance trade-offs made to achieve it. In designing the Pentium Pro microarchitecture, we balanced our efforts on increasing frequency and reducing CPI. As architects, we spent the same time working on clock frequency and layout issues as on refining parallel-execution techniques. The true measure of an architecture is delivered performance, which is clock speed/CPI, and not optimal CPI with low clock speed or great clock speed but poor CPI.

One final interesting result was that the dynamic-execution microarchitecture was actually a major enabler of higher clock frequency. In 1990, many pundits claimed that the complexity of out-of-order techniques would ultimately lead to a clock speed degradation, due to the second-order effects of larger die size and bigger projects with more players. In the case of the Pentium Pro processor, we often found that dynamic execution enabled us to add pipe stages to reduce the number of critical paths. We did not pay the kind of CPI penalty that an in-order microarchitecture would have suffered for the same change. By alleviating some of the datapath barriers to higher clock frequency, we could focus our efforts on the second-order effects that remained.

Tuning area and performance
Another critical tuning parameter is the trade-off between silicon area and CPU performance. As designers of a new microarchitecture, we are always tempted to add more capability and more features to try to hit as high a performance as possible. We try to guesstimate what will fit in a given level of silicon technology, but our early estimates are generally optimistic and the process is not particularly accurate.

As we continued to refine the Pentium Pro microarchitecture, we discovered that, by and large, most applications do not perform as well as possible, being unable to keep all of the functional units busy all of the time. At the same time, better understanding of layout issues revealed that the die size of the original microarchitecture was uncomfortably large for high-volume manufacturing.

We found that the deep buffering provided by the large, uniform reservation station allowed a time-averaging of functional-unit demand. Most program parallelism is somewhat bursty (that is, it occurs in nonuniform clumps spread through the application). The dynamic-execution architecture can average out bursty demands for functional units; it draws micro-ops from a large range of the program and dispatches them whenever they and a functional unit become ready. No particular harm comes from delaying any one micro-op, since a micro-op can execute in several different clocks without affecting the critical path through the flow graph. This contrasts with in-order superscalar approaches, which offer only one opportunity to execute an operation that will not result in adding a clock or more to the execution time. The in-order architecture runs in feast-or-famine mode, its multiple functional units idle much of the time and only coming into play when parallelism is instantaneously available to fit its available templates.

The same phenomenon occurs in the instruction decoder. A decoder for a complex instruction set will typically have restrictions (termed "templates" here) placed upon it. This refers to the number and type of instructions that can be decoded in any one clock period. The Pentium Pro's decoder operates to a 4-1-1 template. It decodes up to three instructions each clock, with the first decoder able to handle most instructions and the other two restricted to single dataflow nodes (micro-ops) such as loads and register-to-register operations. A hypothetical 4-2 template could decode up to two instructions per clock, with the second decoder processing stores and memory-to-register instructions as well as single micro-op instructions.

The Pentium Pro's instruction decoder has a six-micro-op queue on its output, and the reservation station provides a substantial amount of additional buffering. If a template restriction forces a realignment, and only two micro-ops are decoded in a clock, opportunities exist to catch up in subsequent clocks. At an average CPI of about 1, there is no long-term need to sustain a decode rate of three instructions per clock. Given adequate buffering and some overcapacity, the decoder can stay well ahead of the execution dataflow. The disparity between CPI and the maximum decode rate reduces the template restrictions to a negligible impact.
After observing these generic effects, we performed sensitivity studies on other microarchitecture aspects. We trimmed each area of the preliminary microarchitecture until we noted a small performance loss.

For example, we observed that each of the two load/store ports was used about 20 percent of the time. We surmised that changing to one dedicated load port and one dedicated store port should not have a large effect on performance.
The load port would operate about 30 percent of the time and the store port about 10 percent of the time. This proved to be the case, with less than a 1 percent performance loss for this change.

Changing from a 4-2-2-2 decode template (four instructions per clock) to a 4-2-2 template (three instructions per clock) also was a no-brainer, with no detectable performance change on 95 percent of programs examined.

We also changed the renaming and retirement ports from four micro-ops per clock to three, which resulted in a slightly larger, but still manageable, 2 percent performance loss.

Finally, we reduced the L1 cache size from 16 to 8 Kbytes. In doing this, we took advantage of the full-frequency dedicated bus we had already chosen. Since the L1 cache is backstopped by a full-bandwidth, three-clock L2 cache, the extra L1 misses that result from cutting the L1 cache size cause a relatively minor 1.5 percent performance loss.
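The cache cut follows the usual average-memory-access-time reasoning. The miss-rate figure below is invented for illustration (the article reports only the 1.5 percent net result), and the 0.5 overlap credit reuses the load-latency rule of thumb from earlier:

    # Rough cost of halving the L1 data cache in front of a three-clock,
    # full-bandwidth L2. The rates here are assumed, not measured.
    loads_per_100_uops = 30.0
    extra_miss_rate = 0.02     # assumed extra misses going from 16K to 8K
    l2_extra_clocks = 3        # clocks added when a load spills to the L2
    overlap = 0.50             # out-of-order overlap hides ~half the cost

    extra_clocks = loads_per_100_uops * extra_miss_rate * l2_extra_clocks * overlap
    print(f"~{extra_clocks / 100:.1%} performance loss")  # ~0.9%, the same
                                                          # order as the 1.5%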
The reduction from four- to three-way superscalar operation and the reduction in L1 cache size had some negative impact on chest-thumping bragging rights, but we could not justify the extra capacity by the delivered performance. Further, tenaciously holding on to extra logic would have resulted in significant negative consequences in die area, clock speed, and project schedule.

As the design progressed, we eventually found that even the first round of trimming was not enough. We had to make further reductions to keep die size in a comfortable range and, as it turned out later, maintain clock frequency. This required making further cuts, which resulted in detectable performance loss, rather than the truly negligible losses from the earlier changes.

We made two major changes. We cut back the decoder to a 4-1-1 template from a 4-2-2 template. This amounted to about a 3 percent performance loss. We also cut back the branch target buffer from 1,024 to 512 entries, which barely affected SPECint92 results (1 percent) but did hurt transaction processing (5 percent). It was emotionally difficult (at the time) for the microarchitecture team to accept the resulting performance losses, but these results turned out to be critical to keeping the die area reasonable and obtaining a high clock frequency. This kind of careful tuning and flexibility in product goals was essential to the ultimate success of the program.

It is important to consider the actual shape of the area-performance curve. Most high-end CPU designs operate well past the knee of this curve. Efficiency is not a particularly critical goal. For example, the market demands as much performance as possible from a given technology, even when that means using a great deal of silicon area for relatively modest incremental gains.

[Figure 2. Performance versus die area for different decoder designs. The x-axis shows relative area (25 to 200 percent); the y-axis shows relative performance.]

Figure 2 illustrates this effect. This graph charts the performance of various instruction decoder schemes coupled to a fixed execution core. All of the architectures discussed earlier are clearly well past the knee of the performance curve. Moving from point A (a 4-2-2-2 template) to point B (4-2-2) is clearly the right choice, since the performance curve is almost flat. Moving down to point C (4-1-1) shows a detectable performance loss, but it is hardly disastrous in the grand scheme of things. Point D (4-2, one we actually considered at one time) occupies an awkward valley in this curve, barely improved over point E (4-1) for significantly more area and noticeably lower performance than point C.

As we converted the microarchitecture to transistors and then to layout, execution quality became critical. All members of the project team participated in holding the line on clock frequency and area. Some acted as firefighters, handling the hundreds of minor emergencies that arose.

This phase is very critical in any major project. If a project takes shortcuts and slips clock frequency or validation to achieve earlier silicon, the design often contains bugs and suffers unrecoverable performance losses. This design phase determines a project's ultimate success. The best planned and most elegant microarchitecture will fail if the design team does not execute implementations well. As CPU architects, we were very fortunate to work with a strong design and layout team that could realize our shared vision in the resulting silicon.

THE PENTIUM PRO PROCESSOR ran DOS, Windows, and Unix within one week of receiving first silicon. We had most major operating systems running within one month. We made a series of small metal fixes to correct minor bugs and speed paths. The A2 material, manufactured using a 0.6-micron process, ran at 133 MHz with a production-quality test program, including 85-degree case temperature and 5 percent voltage margins.

The B0 stepping incorporated several microcode bug and speed path fixes for problems discovered on the A-step silicon, and added frequency ratios to the front-side bus. Our success with early Pentium Pro processor silicon, plus positive early data on the 0.35-micron process, encouraged us to retune the L2 access pipeline. Retuning allowed for dedicated bus frequencies in excess of 200 MHz. We added one clock to the L2 pipeline, splitting the extra time between address delivery and path while retaining the full-core clock frequency and the pipelined/nonblocking data access capability. The 0.6-micron B0 silicon became a 150-MHz, production-worthy part and met our goals for performance and frequency using 0.6-micron Pentium processor technology. We optically shrank the design to the newly available 0.35-micron process.
[Table 1. Pentium Pro performance: 0.6-micron, 150-MHz processor versus 0.35-micron, 200-MHz processor.]