Tuning the Pentium Pro Microarchitecture

David B. Papworth
Intel Corporation

This inside look at a large microprocessor development project reveals some of the reasoning (for goals, changes, trade-offs, and performance simulation) that lay behind its final form.

Designing a wholly new microprocessor is difficult and expensive. To justify this effort, a major new microarchitecture must improve performance one and a half or two times over the previous-generation microarchitecture, when evaluated on equivalent process technology. In addition, semiconductor process technology continues to evolve while the processor design is in progress. The previous-generation microarchitecture increases in clock speed and performance due to compactions and conversion to newer technology. A new microarchitecture must "intercept" the process technology to achieve a compounding of process and microarchitectural speedups.

The process technology, degree of pipelining, and amount of effort a team is willing to spend on circuit and layout issues determine the clock speed of a microarchitecture. Typically, a microarchitecture will start with the same clock speed as a prior microarchitecture (adjusted for process technology scaling). This enables the maximum reuse of past designs and circuits, and fits the new design to the existing product development tools and methodology. Performance enhancements should come primarily from the microarchitecture and not from clock speed enhancements per se.

Often, a new processor's die area is close to the maximum that can be manufactured. This design choice stems from marketplace competitiveness and efforts to get as much performance as possible in the new microarchitecture. While making the die smaller and cheaper and improving performance are desirable, it is generally not possible to achieve a 1.5-to-2-times-better performance goal without using at least 1.5 to 2 times a prior design's transistors.

Finally, new processor designs often incorporate new features. As the performance of the core logic improves, designs must continue to enhance the bus and cache architecture to keep pace with the core. Further, as other technologies (such as multiprocessing) mature, there is a natural tendency to draw them into the processor design as a way of providing additional features and value for the end user.

Mass-market designs
The large installed base and broad range of applications for the Intel architecture place additional constraints on the design, constraints beyond the purely academic ones of performance and clock frequency. We do not have the flexibility to control software applications, compilers, or operating systems in the same way system vendors can. We cannot remove obsolete features and must cater to a wide variety of coding styles. The processor must run thousands of shrink-wrapped applications, rather than only those compiled by a vendor-provided compiler running on a vendor-provided operating system on a vendor-provided platform. These limitations leave fewer avenues for workarounds and leave the processor exposed to a much greater variety of instruction sequences and boundary conditions.

Intel's architecture has accumulated a great deal of history and features in 15 years. The product must deliver world-class performance and also successfully identify and resolve compatibility issues. The microprocessor may be an assemblage of pieces from many different vendors, yet must function reliably and be easy for the general public to use.

Since a new design needs to be manufacturable in high volume from the very beginning, designers cannot allow the design to expand to the limits of the technology. It also must meet stringent environmental and
design-life limits. It must deliver high performance using a
set of motherboard components costing less than a few hun-
dred dollars.
Meeting these additional design constraints is critical for
business success. They add to the complexity of the project
and the total effort required, compared to a brand-new
instruction set architecture. The additional complexity results
in extra staffing and longer schedules. A vital ingredient in
long-term success is proper planning and management of
the demands of extra complexity; management must ensure
that the complexity does not impact the product's long-term
performance and availability.

Our first effort
After due consideration of the performance, area, and mass-market constraints, we knew we would have to implement out-of-order execution and register renaming to wring more instruction-level parallelism out of existing code. Further, the modest register file of the Intel architecture constrains any compiler. That is, it limits the amount of instruction reordering that a compiler can do to increase superscalar parallelism and basic block size. Clearly, a new microarchitecture would have to provide some way to escape the constraints of false dependencies and provide a form of dynamic code motion in hardware.

To conform to projected Pentium processor goals, we initially targeted a 100-MHz clock speed using 0.6-micron technology. Such a clock speed would have resulted in roughly a 10-stage pipeline. It would have much the same structure as the Pentium processor, with an extra decode stage added for more instruction decode bandwidth. It would also require extra stages to implement register renaming, runtime scheduling, and in-order retirement functions.

We expected a two-clock data cache access time (like the Pentium processor) and other core execution units that would strongly resemble the Pentium processor. The straw-man microarchitecture would have had the following components:

- a 100-MHz clock using 0.6-micron technology,
- a 10-stage pipeline,
- four-instruction decoding per clock cycle,
- four-micro-operation renaming and retiring per clock cycle,
- a 32-Kbyte level-1 (L1) instruction cache,
- a separate 32-Kbyte L1 data cache,
- two general load/store ports, and
- a total of 10 million transistors.

From the outset we planned to include a full-frequency, dedicated L2 cache, with some flavor of advanced packaging connecting the cache to the processor. Our intent was to enable effective shared-memory multiprocessing by removing the processor-to-L2 transactions from the traditional global interconnect, or front-side, bus, and to facilitate board and platform designs that could keep up with the high-speed processor. Remember that in 1990/1991, when we began the project, it was quite a struggle to build 50- and 66-MHz systems. It seemed prudent to provide for a package-level solution to this problem.

What we actually built
The actual Pentium Pro processor looks much different from our first straw man:

- a 150-MHz clock using 0.6-micron technology,
- a 14-stage pipeline,
- three-instruction decoding per clock cycle,
- three micro-operations (micro-ops) renamed and retired per clock cycle,
- an 8-Kbyte L1 instruction cache,
- an 8-Kbyte L1 data cache,
- one dedicated load port and one dedicated store port, and
- 5.5 million transistors.

The evolution process
Our first efforts centered on designing and simulating a high-performance dynamic-execution engine. We attacked the problems of renaming, scheduling, and dispatching, and designed core structures that implemented the desired functionality.

Circuit and layout studies overlapped this effort. We discovered that the basic out-of-order core and the functional units could run at a higher clock frequency than 100 MHz. In addition, instruction fetching and decoding in two pipeline stages and data cache access in two clock cycles were the main frequency limiters.

One of our first activities was to create a microarchitect's workbench. Basically, this was a performance simulator capable of modeling the general class of dynamic-execution microarchitectures. We didn't base this simulator on silicon structures or detailed modeling of any particular implementation. Instead, it took an execution trace as input and applied various constraints to each instruction, modeling the functions of decoding, renaming, scheduling, dispatching, and retirement. It processed one micro-operation at a time, from decoding until retirement, and at each stage applied the limitations extant in the design being modeled.

This simulator was very flexible in allowing us to model any number of possible architectures. Modifying it was much faster than modifying a detailed, low-level implementation, since there was no need for functional correctness or routing signals from one block to another. We set up this simulator to model our initial straw-man microarchitecture and then performed a sensitivity analysis of the major microarchitectural areas that affect performance, clock speed, and die area.
We simulated each change or potential change against at least 2 billion instructions from more than 200 programs. We studied the effects of L1 cache size, pipeline depth, branch prediction effectiveness, renaming width, reservation station depth and organization, and reorder buffer depth. Quite often, we found that our initial intuition was wrong and that every assumption had to be tested and tuned to what was proven to work.

[Acronyms used in the block diagram and sidebar: AGU, address generation unit; BIU, bus interface unit; BTB, branch target buffer; DCU, data cache unit; FEU, floating-point execution unit; ID, instruction decoder; IEU, integer execution unit; IFU, instruction fetch unit (includes I-cache); L2, level-2 cache; MIS, microinstruction sequencer; MIU, memory interface unit; MOB, memory reorder buffer; RAT, register alias table; ROB, reorder buffer; RRF, retirement register file; RS, reservation station.]

The trade-off
Based on our circuit studies, we explored what would happen if we boosted the core frequency by 1.5 times over our initial straw man. This required a few simple changes to the reservation station, but clearly we could build the basic core to operate at this frequency. It would allow us to retain single-cycle execution of basic arithmetic operations at a 50 percent higher frequency. The rest of the pipeline would then have to be retuned to use one-third fewer gates per clock than a comparable Pentium microarchitecture.
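The "one-third fewer gates" figure follows directly from the clock period; a one-line worked version (the gate count is a hypothetical stand-in for whatever a Pentium-class stage held):

    # 1.5x the frequency means each pipe stage gets 1/1.5 = 2/3 of the
    # old clock period, so roughly one-third fewer gate delays fit.
    old_gate_delays_per_stage = 30            # hypothetical budget
    print(old_gate_delays_per_stage / 1.5)    # 20.0: one-third fewer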
Pentium Pro (continued)
Following renaming, the operations wait in a 20-entry reservation station (RS) until all of their operands are data ready and a functional unit is available. As many as five micro-ops per clock can pass from the reservation station to the various execution units. These units perform the desired computation and send the result data back to data-dependent micro-ops in the reservation station, as well as storing the result in the reorder buffer (ROB).

The reorder buffer stores the individual micro-ops in the original program order and retires as many as three per clock into the retirement register file (RRF). This file examines each completed micro-op for the presence of faults or branch mispredictions, and aborts further retirement upon detecting such a discontinuity. This reimposes the sequential fault model and the illusion of a microarchitecture that executes each instruction in strict sequential order.

The L1 data cache unit (DCU) acts as one of the execution units. It can accept a new load or store operation every clock and has a data latency of three clocks for loads. It contains an 8-Kbyte, two-way associative cache array, plus fill buffers to buffer data and track the status of as many as four simultaneously outstanding data cache misses.

The bus interface unit (BIU) processes cache misses. This unit manages the L2 cache and its associated 64-bit, full-frequency bus, as well as the front-side system bus, which typically operates at a fraction of the core frequency, such as 66 MHz on a 200-MHz processor. Transactions on the front-side and dedicated buses are organized as 32-byte cache line transfers. The overall bus architecture permits multiple Pentium Pro processors to be interconnected on the front-side bus to form a glueless, symmetric, shared-memory multiprocessor system.

For a detailed discussion of the microarchitecture, see Colwell and Steck,1 the Intel Web site,2 and Gwennap.3

References
1. R. Colwell and R. Steck, "A 0.6-μm BiCMOS Microprocessor with Dynamic Execution," Proc. Int'l Solid-State Circuits Conf., IEEE, Piscataway, N.J., 1995, pp. 176-177.
2. http://www.intel.com/procs/p6/p6white/index.html (Intel's World Wide Web site).
3. L. Gwennap, "Intel's P6 Uses Decoupled Superscalar Design," Microprocessor Report, Vol. 9, No. 2, Feb. 16, 1995, pp. 9-15.

We first added a clock cycle to data cache lookup, changing it from two to three cycles. We used the performance simulator to model this change and discovered that it resulted in a 7 percent degradation (increase) in clock cycles per instruction.

The next change was to rework the instruction fetch/decode/rename pipeline, resulting in an additional two stages in the in-order fetch pipeline. Retirement also required one additional pipe stage. This lengthening resulted in a further clock-per-instruction (CPI) degradation of about 3 percent.

Finally, we pursued a series of logic simplifications to shorten critical speed paths. We applied less aggressive organizations to several microarchitecture areas and experienced another 1 percent in CPI loss.

The high-frequency microarchitecture completes instructions at a 50 percent higher rate than the lower frequency microarchitecture, but requires 11 percent more of the now-faster clocks per 100 instructions to enable this higher frequency. The net performance is (1.5/1.0) * (1.0/1.11) = 1.35, or a 35 percent performance improvement, a very significant gain compared to most microarchitecture enhancements.
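The net-performance arithmetic is worth pinning down, since the individual CPI losses compound multiplicatively. A few lines of Python, using only the percentages quoted above, reproduce the numbers:

    # The article's numbers: a 1.5x clock gain, paid for with CPI
    # degradations of 7% (data cache stage), 3% (fetch/retire stages),
    # and 1% (logic simplifications), compounding multiplicatively.
    freq_gain = 1.5
    cpi_penalties = [0.07, 0.03, 0.01]

    cpi_ratio = 1.0
    for p in cpi_penalties:
        cpi_ratio *= 1.0 + p          # 1.07 * 1.03 * 1.01 ~= 1.11

    speedup = freq_gain / cpi_ratio   # (f_new/f_old) * (CPI_old/CPI_new)
    print(f"CPI degradation ~{cpi_ratio - 1:.0%}, net speedup {speedup:.2f}x")
    # -> CPI degradation ~11%, net speedup 1.35x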
[Figure 1. Delivered performance versus clock frequency. The x-axis spans 100 to 150 MHz; numbered points mark the changes discussed below.]

In Figure 1 we see that performance generally improves as clock frequency increases. The improvement is not linear or monotonic, however. Designers must make a series of microarchitectural changes or trade-offs to enable the higher frequency. This in turn results in "jaggies," or a CPI degradation and performance loss at the point at which we make a change. The major drop shown at point 1 represents the 7 percent CPI loss due to the added data cache pipe stage. The series of minor deflections at points 2, 3, and 4 shows the effect of added front-end pipe stages. The overall trend does not continue indefinitely, as the CPI starts to roll off dramatically once the central core no longer maintains one-clock latency for simple operations.

The right of this graph shows a fairly clear performance win, assuming that one picks a point that is not in the valley of one of the CPI dips, and assuming that the project can absorb the additional effort and complexity required to hit higher clock speeds.
When we first laid out the straw man, we did not expect the CPI-clock frequency curve to have this shape. Our initial intuition suggested that the cost of an extra clock of latency on loads would be more severe than it actually is. Further, past industry experience suggests that high frequency at the expense of deep pipelines often results in a relative standstill in performance for general-purpose integer code. However, our performance simulator showed the graphed behavior and the performance win possible from higher clock speeds. Since we had carefully validated the simulator with micro-benchmarks (to ensure that it really modeled the effect in question), we were inclined to believe its results over our own intuition and went ahead with the modified design. We did, however, work to come up with qualitative explanations of the results, which follow.

Consider a program segment that takes 100 clock cycles at 100 MHz. The baseline microarchitecture takes 1 microsecond to execute this segment. We modify this baseline by adding an extra pipe stage to loads. If 30 percent of all operations are loads, this would add 30 clocks to the segment, which would then take 130 clocks to execute. If the extra pipe stage enables a 50 percent higher frequency, the total execution time becomes 130/150, or 0.867 microseconds. This amounts to a 15 percent performance improvement (1/0.867). This is certainly a higher performance microarchitecture, but hardly the 50 percent improvement one might naively expect from clock rate alone. Such a result is typical of past experience with in-order pipelines when we seek the CPI-frequency balance.

The Pentium Pro microarchitecture does not suffer this amount of CPI degradation from increased load latency because it buffers multiple loads, dispatches them out of order, and completes them out of order. About 50 percent of loads are not critical from a dataflow perspective. These loads (typically from the stack or from global data) have their address operands available early. The 20-entry reservation station in the Pentium Pro processor can buffer a large pool of micro-ops, and these "data-ready" load micro-ops can bubble up and issue well ahead of the critical-path operations that need their results. For this class of loads, an extra clock of load latency does not impact performance.

The remaining 50 percent of the loads have a frequent overlap of multiple critical-path loads. For example, the code fragment a = b + c might compile into the sequence

load b => r1
load c => r2
r1 plus r2 => r3

Both b and c are critical-path loads, but even if each takes an extra clock of latency, only one extra clock is added for both, assuming loads are pipelined and nonblocking. This blocking-factor effect varies, depending upon the program mix. But a rule of thumb for the Pentium Pro processor is that additional clocks of load latency cost approximately half of what they do in a strict in-order architecture.

Thus the 15 percent of micro-ops that are critical-path loads take an extra clock, but the overlap factor results in one half of the equivalent in-order penalty, or about 7.5 percent. This is close to what we measured in the detailed simulation of actual code.

Now let's look at the effect of additional fetch pipeline stages. If branches are perfectly predicted, the fetch pipeline can be arbitrarily long, at no performance cost. The cost of extra pipe stages comes only on branch mispredictions. If 20 percent of the micro-ops are branches, and branch prediction is 90 percent accurate, then two micro-ops in 100 are mispredictions. Each additional clock in the fetch pipeline will add one clock per misprediction. If the base CPI is about 1, we'll see about two extra clocks per 100 micro-ops, or an additional 2 percent per pipe stage. The actual penalty is somewhat less, because branch prediction is typically better than 90 percent, and there is a compound-interest effect: as the pipeline gets longer, the CPI degrades, and the cost of yet another pipe stage diminishes.
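These rules of thumb translate directly into arithmetic. The sketch below simply restates the fractions from the preceding paragraphs (30 percent loads, half of them critical-path, half of that cost hidden by overlap; 20 percent branches at 90 percent prediction accuracy), so the formulas are an illustrative reading rather than a quoted model:

    # In-order versus out-of-order cost of one extra clock of load latency.
    loads = 0.30        # fraction of micro-ops that are loads
    critical = 0.50     # fraction of loads on the dataflow critical path
    overlap = 0.50      # out-of-order overlap hides about half the cost

    in_order_penalty = loads                     # +30 clocks per 100 micro-ops
    ooo_penalty = loads * critical * overlap     # +7.5 clocks per 100 micro-ops
    print(f"in-order +{in_order_penalty:.0%} CPI, out-of-order +{ooo_penalty:.1%} CPI")

    # The 100-clock segment example: 130 clocks at a 1.5x faster clock.
    print(f"in-order net speedup: {(100 * 1.5) / 130:.2f}x")   # ~1.15x

    # Cost of one extra fetch pipe stage, paid only on mispredictions.
    branches, accuracy, base_cpi = 0.20, 0.90, 1.0
    extra = branches * (1 - accuracy) / base_cpi   # clocks per micro-op
    print(f"extra fetch stage ~+{extra:.0%} CPI")  # ~2% per added stage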
Clock frequency versus effort
This all sounds like a fine theoretical basis for building a faster processor, but it comes at a nontrivial cost. Using a higher clock frequency reduces the margin for error in any one pipe stage. The consequence of needing one too many gates to get the required functionality into a given pipe stage is a 9 to 10 percent performance loss, rather than a 4 to 5 percent loss. So the design team must make a number of small microarchitecture changes as the design matures, since it is impossible to perfectly anticipate every critical path and design an ideal pipeline. This results in rework and a longer project schedule. Further, with short pipe stages, many paths cannot absorb the overhead of logic synthesis, increasing the proportion of the chip for which we must hand-design circuits.

Higher clock speeds require much more hand layout and careful routing. The densities achievable by automatic placement and routing are often inadequate to keep parasitic delays within what will fit in a clock cycle. Beyond that, the processor spends a bigger fraction of each clock period on latched delay, set-up time, clock skew, and parasitics than with a slower, wider pipeline. This puts even more pressure on designers to limit the number of gates per pipe stage.

The higher performance that results from higher clock speeds places more pressure on the clock and power distribution budget. The shorter clock period is less able to absorb clock jitter and localized voltage sags, requiring very careful and detailed speed path simulations.

As long as a design team expects, manages, and supports this extra effort, clock speedups provide an excellent path to higher performance. Even if this comes at some CPI degradation, the end result is both a higher performance product and one that hits production earlier than one that attempts to retrofit higher clock frequency into a pipeline not designed for it.
The terms "architectural efficiency" or "performance at the same clock" are sometimes taken as metrics of goodness in and of themselves. Perhaps this is one way of apologizing for low clock rates, or a way to imply higher performance when the microarchitecture "someday" reaches a clock rate that is in fact unobtainable for that design with that process technology. Performance at the same clock is not a good microarchitectural goal if it means building bottlenecks into the pipeline that will forever impact clock frequency. Similarly, low latency by itself is not an important goal. Designers must consider the balance between latency, available parallelism in the application, and the impact on clock speed of forcing a lot of functionality into a short clock period.

It is equally meaningless to brag about high clock frequency without considering the CPI and other significant performance trade-offs made to achieve it. In designing the Pentium Pro microarchitecture, we balanced our efforts on increasing frequency and reducing CPI. As architects, we spent the same time working on clock frequency and layout issues as on refining parallel-execution techniques. The true measure of an architecture is delivered performance, which is clock speed/CPI, and not optimal CPI with low clock speed or great clock speed but poor CPI.

One final interesting result was that the dynamic-execution microarchitecture was actually a major enabler of higher clock frequency. In 1990, many pundits claimed that the complexity of out-of-order techniques would ultimately lead to a clock speed degradation, due to the second-order effects of larger die size and bigger projects with more players. In the case of the Pentium Pro processor, we often found that dynamic execution enabled us to add pipe stages to reduce the number of critical paths. We did not pay the kind of CPI penalty that an in-order microarchitecture would have suffered for the same change. By alleviating some of the datapath barriers to higher clock frequency, we could focus our efforts on the second-order effects that remained.

Tuning area and performance
Another critical tuning parameter is the trade-off between silicon area and CPU performance. As designers of a new microarchitecture, we are always tempted to add more capability and more features to try to hit as high a performance as possible. We try to guesstimate what will fit in a given level of silicon technology, but our early estimates are generally optimistic and the process is not particularly accurate.

As we continued to refine the Pentium Pro microarchitecture, we discovered that, by and large, most applications do not perform as well as possible, being unable to keep all of the functional units busy all of the time. At the same time, better understanding of layout issues revealed that the die size of the original microarchitecture was uncomfortably large for high-volume manufacturing.

We found that the deep buffering provided by the large, uniform reservation station allowed a time-averaging of functional-unit demand. Most program parallelism is somewhat bursty (that is, it occurs in nonuniform clumps spread through the application). The dynamic-execution architecture can average out bursty demands for functional units; it draws micro-ops from a large range of the program and dispatches them whenever they and a functional unit become ready. No particular harm comes from delaying any one micro-op, since a micro-op can execute in several different clocks without affecting the critical path through the flow graph. This contrasts with in-order superscalar approaches, which offer only one opportunity to execute an operation that will not result in adding a clock or more to the execution time. The in-order architecture runs in feast-or-famine mode, its multiple functional units idle much of the time and only coming into play when parallelism is instantaneously available to fit its available templates.

The same phenomenon occurs in the instruction decoder. A decoder for a complex instruction set will typically have restrictions (termed "templates" here) placed upon it. This refers to the number and type of instructions that can be decoded in any one clock period. The Pentium Pro's decoder operates to a 4-1-1 template. It decodes up to three instructions each clock, with the first decoder able to handle most instructions and the other two restricted to single dataflow nodes (micro-ops) such as loads and register-to-register operations. A hypothetical 4-2 template could decode up to two instructions per clock, with the second decoder processing stores and memory-to-register instructions as well as single micro-op instructions.

The Pentium Pro's instruction decoder has a six-micro-op queue on its output, and the reservation station provides a substantial amount of additional buffering. If a template restriction forces a realignment, and only two micro-ops are decoded in a clock, opportunities exist to catch up in subsequent clocks. At an average CPI of about 1, there is no long-term need to sustain a decode rate of three instructions per clock. Given adequate buffering and some overcapacity, the decoder can stay well ahead of the execution dataflow. The disparity between CPI and the maximum decode rate reduces the template restrictions to a negligible impact.
After observing these generic effects, we performed sensitivity studies on other microarchitecture aspects. We trimmed each area of the preliminary microarchitecture until we noted a small performance loss.

For example, we observed that each of the two load/store ports was used about 20 percent of the time. We surmised that changing to one dedicated load port and one dedicated store port should not have a large effect on performance.
The load port would operate about 30 percent of the time and the store port about 10 percent of the time. This proved to be the case, with less than a 1 percent performance loss for this change.

Changing from a 4-2-2-2 decode template (four instructions per clock) to a 4-2-2 template (three instructions per clock) also was a no-brainer, with no detectable performance change on 95 percent of programs examined.

We also changed the renaming and retirement ports from four micro-ops per clock to three, which resulted in a slightly larger, but still manageable, 2 percent performance loss.

Finally, we reduced the L1 cache size from 16 to 8 Kbytes. In doing this, we took advantage of the full-frequency dedicated bus we had already chosen. Since the L1 cache is backstopped by a full-bandwidth, three-clock L2 cache, the extra L1 misses that result from cutting the L1 cache size cause a relatively minor 1.5 percent performance loss.
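The cache cut follows the usual average-memory-access-time reasoning. The miss-rate figure below is invented for illustration (the article reports only the 1.5 percent net result), and the 0.5 overlap credit reuses the load-latency rule of thumb from earlier:

    # Rough cost of halving the L1 data cache in front of a three-clock,
    # full-bandwidth L2. The rates here are assumed, not measured.
    loads_per_100_uops = 30.0
    extra_miss_rate = 0.02     # assumed extra misses going from 16K to 8K
    l2_extra_clocks = 3        # clocks added when a load spills to the L2
    overlap = 0.50             # out-of-order overlap hides ~half the cost

    extra_clocks = loads_per_100_uops * extra_miss_rate * l2_extra_clocks * overlap
    print(f"~{extra_clocks / 100:.1%} performance loss")  # ~0.9%, the same
                                                          # order as the 1.5%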
The reduction from four- to three-way superscalar operation and the reduction in L1 cache size had some negative impact on chest-thumping bragging rights, but we could not justify the extra capacity by the delivered performance. Further, tenaciously holding on to extra logic would have resulted in significant negative consequences in die area, clock speed, and project schedule.

As the design progressed, we eventually found that even the first round of trimming was not enough. We had to make further reductions to keep die size in a comfortable range and, as it turned out later, maintain clock frequency. This required making further cuts, which resulted in detectable performance loss, rather than the truly negligible losses from the earlier changes.

We made two major changes. We cut back the decoder to a 4-1-1 template from a 4-2-2 template. This amounted to about a 3 percent performance loss. We also cut back the branch target buffer from 1,024 to 512 entries, which barely affected SPECint92 results (1 percent) but did hurt transaction processing (5 percent). It was emotionally difficult (at the time) for the microarchitecture team to accept the resulting performance losses, but these results turned out to be critical to keeping the die area reasonable and obtaining a high clock frequency. This kind of careful tuning and flexibility in product goals was essential to the ultimate success of the program.

It is important to consider the actual shape of the area-performance curve. Most high-end CPU designs operate well past the knee of this curve. Efficiency is not a particularly critical goal. For example, the market demands as much performance as possible from a given technology, even when that means using a great deal of silicon area for relatively modest incremental gains.

[Figure 2. Performance versus die area for different decoder designs. The x-axis shows relative area (25 to 200 percent); the y-axis shows relative performance.]

Figure 2 illustrates this effect. This graph charts the performance of various instruction decoder schemes coupled to a fixed execution core. All of the architectures discussed earlier are clearly well past the knee of the performance curve. Moving from point A (a 4-2-2-2 template) to point B (4-2-2) is clearly the right choice, since the performance curve is almost flat. Moving down to point C (4-1-1) shows a detectable performance loss, but it is hardly disastrous in the grand scheme of things. Point D (4-2, one we actually considered at one time) occupies an awkward valley in this curve, barely improved over point E (4-1) for significantly more area and noticeably lower performance than point C.

As we converted the microarchitecture to transistors and then to layout, execution quality became critical. All members of the project team participated in holding the line on clock frequency and area. Some acted as firefighters, handling the hundreds of minor emergencies that arose.

This phase is very critical in any major project. If a project takes shortcuts and slips clock frequency or validation to achieve earlier silicon, the design often contains bugs and suffers unrecoverable performance losses. This design phase determines a project's ultimate success. The best planned and most elegant microarchitecture will fail if the design team does not execute implementations well. As CPU architects, we were very fortunate to work with a strong design and layout team that could realize our shared vision in the resulting silicon.

THE PENTIUM PRO PROCESSOR ran DOS, Windows, and Unix within one week of receiving first silicon. We had most major operating systems running within one month. We made a series of small metal fixes to correct minor bugs and speed paths. The A2 material, manufactured using a 0.6-micron process, ran at 133 MHz with a production-quality test program, including 85-degree case temperature and 5 percent voltage margins.

The B0 stepping incorporated several microcode bug and speed path fixes for problems discovered on the A-step silicon, and added frequency ratios to the front-side bus. Our success with early Pentium Pro processor silicon, plus positive early data on the 0.35-micron process, encouraged us to retune the L2 access pipeline. Retuning allowed for dedicated bus frequencies in excess of 200 MHz. We added one clock to the L2 pipeline, splitting the extra time between address delivery and path while retaining the full-core clock frequency and the pipelined/nonblocking data access capability. The 0.6-micron B0 silicon became a 150-MHz, production-worthy part and met our goals for performance and frequency using 0.6-micron Pentium processor technology. We optically shrank the design to the newly available 0.35-micron process.
[Table 1. Pentium Pro performance: 0.6-micron, 150-MHz processor versus 0.35-micron, 200-MHz processor.]