
Review: Summary of Pipelining Basics
° Pipelines pass control information down the pipe just as data moves down the pipe
° Forwarding/stalls are handled by local control
° Hazards limit performance
• Structural: need more HW resources
• Data: need forwarding, compiler scheduling
• Control: early evaluation of the PC, delayed branch, prediction
° Increasing the length of the pipe increases the impact of hazards; pipelining helps instruction bandwidth, not latency
° Interrupts, the instruction set, and FP make pipelining harder
° Compilers reduce the cost of data and control hazards
• Load delay slots
• Branch delay slots
• Branch prediction
cs 152 L1 5 .1 DAP Fa97, U.CB
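The three data-hazard classes in the summary above can be sketched as a small check over register sets. This is a toy model of my own (instructions as (destinations, sources) sets), not code from the lecture:

```python
# Toy model (not from the lecture): classify the data hazards that the
# second of two nearby instructions has on the first, given each
# instruction as a (dest_regs, source_regs) pair of sets.
def classify_hazards(first, second):
    d1, s1 = first
    d2, s2 = second
    hazards = set()
    if d1 & s2:
        hazards.add("RAW")   # second reads what first writes
    if d1 & d2:
        hazards.add("WAW")   # both write the same register
    if s1 & d2:
        hazards.add("WAR")   # second writes what first reads
    return hazards

# add r1,r2,r3 followed by sub r4,r1,r3 -> RAW hazard on r1
add_instr = ({"r1"}, {"r2", "r3"})
sub_instr = ({"r4"}, {"r1", "r3"})
```

In a real pipeline only instructions close enough to overlap matter; the check above ignores distance for simplicity.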
Recap: Pipeline Hazards
° Structural hazard: two instructions need the same resource in the same cycle (e.g. an instruction fetch colliding with a memory operand fetch)
° Control hazard: the pipeline fetches past a jump or branch before its outcome is known
° Data hazards between instructions whose IF-DCD-EX-Mem-WB stages overlap:
• RAW (read after write)
• WAW (write after write)
• WAR (write after read)
Recap: Data Hazards
° Avoid some “by design”
• eliminate WAR by always fetching operands early (DCD) in the pipe
• eliminate WAW by doing all WBs in order (last stage, static)
° Detect and resolve the remaining ones
• stall or forward (if possible)


Recap: Exception Problem
° Exceptions/interrupts: 5 instructions executing in a 5-stage pipeline
• How to stop the pipeline?
• Restart?
• Who caused the interrupt?
Stage  Problem interrupts occurring
IF     Page fault on instruction fetch; misaligned memory access; memory-protection violation
ID     Undefined or illegal opcode
EX     Arithmetic exception
MEM    Page fault on data fetch; misaligned memory access; memory-protection violation; memory error
° What if a Load takes a data page fault while an Add takes an instruction page fault?
° Solution 1: interrupt vector/instruction
The Big Picture: Where are We Now?
° The Five Classic Components of a Computer: Processor (Control + Datapath), Memory, Input, Output
° Today’s Topics:
• Recap last lecture
• Review MIPS R3000 pipeline
• Advanced Pipelining
• SuperScalar


FYI: MIPS R3000 Clocking Discipline
° 2-phase non-overlapping clocks (phi1, phi2)
° A pipeline stage is two (level-sensitive) latches, one per phase; the phi1/phi2 latch pair together behaves like an edge-triggered register


MIPS R3000 Instruction Pipeline
Stages and resource usage: Inst Fetch (TLB, I-Cache) | Decode / Reg. Read (RF) | ALU / E.A. (Operation, E.A. TLB) | Memory (D-Cache) | Write Reg (WB)
Register file writes occur in phase 1 and reads in phase 2 of a cycle => eliminates the bypass from the WB stage


Recall: Data Hazard on r1
Instruction sequence (each instruction after the first reads r1):
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or  r8,r1,r9
  xor r10,r1,r11
With the MIPS R3000 pipeline (write in phase 1, read in phase 2), there is no need to forward from the WB stage.


MIPS R3000 Multicycle Operations
Ex: Multiply, Divide, Cache Miss
° Stall all stages above the multicycle operation in the pipeline
° Drain (bubble) the stages below it
° Use the control word of local stage state to step through the multicycle operation


6.8 Superscalar and Dynamic Pipelining
This and the next section are brief overviews of advanced topics. More info in Computer Architecture: A Quantitative Approach, 2nd edition.
For faster processors:
° Superpipelining: longer pipelines. Some recent microprocessors have gone to pipelines with 8 or more stages.
° Superscalar: replicate the internal components of the computer so that it can issue a varying no. of instrs/cycle (1 to 6). The instr execution rate can exceed the clock rate, i.e. CPI < 1. Some suggest the inverted metric IPC (instrs/cycle).
• Parallelism and dependencies determined/resolved by HW
• IBM PowerPC 604, Sun UltraSparc, DEC Alpha 21164, HP 7100
° Dynamic Pipeline Scheduling: the HW schedules around stalls, so that later instrs that are ready for execution can proceed in parallel.
Getting CPI < 1: Issuing Multiple Instructions/Cycle
° Superscalar (SS) MIPS: 2 instructions, 1 ALU or branch & 1 load or store
• Fetch 64 bits/clock cycle; ALU or branch on the left, LW or SW on the right
• Can only issue the 2nd instruction if the 1st instruction issues. The HW makes this decision dynamically, issuing only the 1st instr if conditions are not met.
• More ports for the regs file: may need 2 regs for the ALU operation and 2 for a store, plus 1 write port for the ALU and 1 for a load. Also 1 more adder for effective-address calculations for loads and stores.
Type             Pipe stages
ALU instruction  IF ID EX MEM WB
LW  instruction  IF ID EX MEM WB
ALU instruction     IF ID EX MEM WB
LW  instruction     IF ID EX MEM WB
ALU instruction        IF ID EX MEM WB
LW  instruction        IF ID EX MEM WB
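The dual-issue rule above can be sketched as a predicate. The instruction encoding (op-class, destination, source set) is my own simplification, not the lecture's:

```python
# Sketch (assumed encoding, not from the lecture): decide whether the
# 2-wide SS MIPS can issue an instruction pair in the same cycle.
# Slot 1 must hold an ALU op or branch, slot 2 a load or store, and the
# memory instruction must not read a register the ALU op writes.
def can_dual_issue(first, second):
    op1, dest1, srcs1 = first
    op2, dest2, srcs2 = second
    if op1 not in {"alu", "branch"}:
        return False                 # wrong kind on the left
    if op2 not in {"lw", "sw"}:
        return False                 # wrong kind on the right
    if dest1 is not None and dest1 in srcs2:
        return False                 # RAW between the pair: issue 1st only
    return True

# addu $t0,$t1,$t2 paired with sw $t3,0($t4): independent -> dual issue
# addu $t0,$t1,$t2 paired with sw $t0,0($t4): sw reads $t0 -> single issue
```

When the predicate fails, the HW issues only the first instruction, exactly as the slide describes.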
Superscalar Datapath
[Figure: the single-issue datapath (PC, instruction memory, register file, ALU, sign extension, data memory) extended with a second instruction path and a second ALU.]
Superscalar additions: 32 more bits from instr memory, 2 read ports + 1 write port for the regs file, 1 more ALU (top ALU for address calculation, bottom ALU for all else).
Superscalar Characteristics
° Loads have a latency of 1 cycle: if the next instr uses the load’s result, it must stall. In SS, the 1-cycle load delay expands to cover the next 2 instrs (the whole issue pair in the next slot).
° Performance improvement: e.g. a 1000 MHz four-way superscalar microprocessor can execute at a peak rate of 4 billion instrs/second, with a best CPI of 0.25. Today’s superscalar machines try to schedule 2 to 6 instrs in each pipe stage.
° If instrs in the instr stream are dependent or don’t meet certain criteria, only the first few (maybe just the first) instrs in the sequence are issued.
° More ambitious compiler or HW scheduling techniques are needed, as well as more complex instr decoding, to effectively exploit the parallelism available in SS.
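The peak-rate arithmetic above can be checked directly (the 1000 MHz and four-way figures are the slide's example values):

```python
# The slide's peak-rate arithmetic, checked numerically.
clock_hz = 1000e6                              # 1000 MHz clock
issue_width = 4                                # four-way superscalar
peak_instrs_per_sec = clock_hz * issue_width   # peak: 4 billion instrs/s
best_cpi = 1 / issue_width                     # best case: CPI = 0.25
```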
Scheduling Code for Superscalar
Reorder the following instrs to avoid as many stalls as possible.

Loop: lw   $t0, 0($s1)       # $t0 is first array element
      addu $t0, $t0, $s2     # add scalar in $s2
      sw   $t0, 0($s1)       # store result
      addi $s1, $s1, -4      # decrement pointer
      bne  $s1, $zero, Loop  # branch if $s1 != 0

The 1st 3 instrs have data dependencies, and so do the last 2.
Best Scheduling Solution

ALU or branch instr        Data transfer instr    Clock cycle
Loop:                      lw  $t0, 0($s1)        1
addi $s1, $s1, -4                                 2
addu $t0, $t0, $s2                                3
bne $s1, $zero, Loop       sw  $t0, 4($s1)        4

Only one pair of instrs executes in superscalar mode.
4 cycles / loop iteration => 4 cycles / 5 instrs => CPI = 0.8 (not good compared to the best case of CPI = 0.5).
To get more performance from loops that access arrays => Loop Unrolling: make multiple copies of the loop body, and schedule instrs from different iterations together.
Unrolled Loop that Minimizes Stalls for Superscalar
4 copies to schedule without delays (assumes the loop index is a multiple of 4).
ALU or branch instr        Data transfer instr    Clock cycle
Loop: addi $s1, $s1, -16   lw $t0, 0($s1)         1
                           lw $t1, 12($s1)        2
addu $t0, $t0, $s2         lw $t2, 8($s1)         3
addu $t1, $t1, $s2         lw $t3, 4($s1)         4
addu $t2, $t2, $s2         sw $t0, 16($s1)        5
addu $t3, $t3, $s2         sw $t1, 12($s1)        6
                           sw $t2, 8($s1)         7
bne $s1, $zero, Loop       sw $t3, 4($s1)         8
Since the 1st pair decrements $s1 by 16, the addresses loaded are the original value of $s1, then this address - 4, - 8, and - 12.
12 of 14 instrs execute in superscalar mode.
8 cycles / 4 loop iterations = 2 cycles/iteration (without unrolling: 4 cycles/iteration) => a factor of 2 improvement, from reducing loop-control instrs + SS execution.
Overhead: 4 temp regs rather than 1.
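The CPI arithmetic from the two scheduling slides above can be checked numerically (the cycle and instruction counts are the slides' own):

```python
# CPI arithmetic from the scheduling slides: the scheduled (but not
# unrolled) loop runs 5 instrs in 4 cycles; the 4x-unrolled loop runs
# 14 instrs in 8 cycles.
def cpi(cycles, instrs):
    return cycles / instrs

scheduled_cpi = cpi(4, 5)            # 0.8, vs the ideal 0.5
unrolled_cpi = cpi(8, 14)            # ~0.57, closer to the ideal
cycles_per_iter_unrolled = 8 / 4     # 2 cycles/iter vs 4 without unrolling
```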
Performance Improvement Limitations
° Pipelining and SS increase peak instr throughput.
° While the ALU/data-transfer split is simple for the HW, you get a CPI of 0.5 only for programs with:
• Exactly 50% ALU and branch operations
• No hazards
° If more instructions issue at the same time, there is greater difficulty of decode and issue
• Even 2-scalar => examine 2 opcodes and 6 register specifiers, & decide if 1 or 2 instructions can issue
° Longer pipelines & wider SS issue => more pressure on compiler scheduling to deliver the potential performance of the HW.
° Compiler writers must understand the pipeline to generate appropriate code and achieve the best performance.
Multiple Pipes / Harder Superscalar
[Figure: two instruction registers (IR0, IR1) feeding a shared register file and two parallel execute/memory pipes.]
Issues:
• Reg. file ports
• Detecting data dependences
• Bypassing
• RAW hazards
• WAR hazards
• Multiple load/store ops?
• Branches


Limits of Superscalar
° Data + control dependencies + instr latencies => an upper limit on delivered performance.
° Designers must guarantee correct execution of all instr sequences.
° VLIW (Very Long Instr Word): several instrs are issued during each cycle, as in SS, but here the compiler guarantees that there are no dependencies between instrs that issue at the same time and that there are sufficient HW resources to execute them (simplifies instr decode and issue logic). Tradeoff: instruction space for simple decoding.
• The long instruction word has room for many operations
• By definition, all the operations the compiler puts in the long instruction word can execute in parallel
• E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch
- 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide
• Needs a compiling technique that schedules across several branches
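The compiler's VLIW packing job above can be sketched greedily. The slot types and one-slot-per-type word shape are my assumptions for illustration, not a real VLIW format:

```python
# Hedged sketch of VLIW packing: greedily pack operations into long
# words. Each op is (slot_type, dest, sources); a word holds at most one
# op per slot type, and ops in one word must be mutually independent.
SLOTS = ("int", "mem", "branch")

def pack_vliw(ops):
    words, current, written = [], {}, set()
    for op in ops:
        slot, dest, srcs = op
        independent = not (set(srcs) & written) and dest not in written
        if slot in current or not independent:
            words.append(current)        # close word, start a new one
            current, written = {}, set()
        current[slot] = op
        if dest is not None:
            written.add(dest)
    if current:
        words.append(current)
    return words
```

For example, an int op writing `a`, an independent memory op, then an int op reading `a` pack into two words: the dependent op cannot share a word with its producer.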


SS vs. VLIW
° SS processors can run unchanged binary machine code that runs on more traditional architectures.
° VLIW works well when the source code of the programs is available, so that the programs can be recompiled.


Dynamic Pipeline Scheduling
Tries to find later instrs to execute while waiting for a stall to be resolved.
The pipeline is divided into 3 major units:
° instr fetch and issue unit: fetches, decodes and sends instrs to the corresponding functional unit of the execute stage.
° execute units: each one has buffers called reservation stations that hold the operands and the operation.
° commit unit: decides when it is safe to put a result into the regs file or memory.
To make programs behave as if they were running on a simple nonpipelined computer, the IF/ID unit must issue instrs in order, and the commit unit must write the results to regs and memory in program execution order: in-order completion.
If an exception occurs in this conservative mode, the computer can point to the last instr executed, and only instrs issued before the faulty one will have updated regs and memory.
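The execute units' reservation stations described above can be sketched minimally. The class shape and method names are my assumptions; real stations also hold operand values and functional-unit state:

```python
# Minimal sketch of a reservation station: an instruction issues in
# order, but may start executing only once every operand it is waiting
# on has been produced (signalled here by a result-tag broadcast).
class ReservationStation:
    def __init__(self):
        self.waiting = []              # (name, needed_tags, op)

    def issue(self, name, needed, op):
        self.waiting.append((name, set(needed), op))

    def wake_up(self, produced_tag):
        """A result is broadcast: clear that dependency everywhere."""
        for _, needed, _ in self.waiting:
            needed.discard(produced_tag)

    def ready(self):
        """Names of instrs whose operands are all available."""
        return [name for name, needed, _ in self.waiting if not needed]
```

With a slow DIVD producing F0, an ADDD waiting on F0 sits in its station while an independent SUBD is ready at once; after the F0 broadcast, both are ready.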
Dynamically Scheduled Pipeline
[Figure: instruction fetch and decode unit (in-order issue) feeding reservation stations in front of the functional units (integer, integer, floating point, load/store; out-of-order execute), which feed the commit unit (in-order commit).]


Difficulties of Dynamic Pipeline Scheduling
Functional units are free to start and finish whenever they want.
° Out-of-order completion (more radical): allows the commit to be out of order => introduces imprecise interrupts.
° Dynamic scheduling is normally combined with branch prediction (speculative execution) => the commit unit must be able to discard all results in the execution unit due to instrs executed after a mispredicted branch.
° Dynamic scheduling is also combined with SS execution, so each unit may be issuing or committing 4 to 6 instrs each cycle.
HW Schemes: Instruction Parallelism
° Why in HW at run time?
• Works when we can’t know the real dependences at compile time
• Compiler simpler
• Code for one machine runs well on another
• Hides memory latency
• Speculatively execute instrs while waiting for potential hazards to be resolved
° Key idea: allow instructions behind a stall to proceed
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F12,F8,F14
(SUBD does not depend on the DIVD, so it need not wait behind the stalled ADDD.)
• Enables out-of-order execution => out-of-order completion
° The ID stage checked for both structural and data hazards
HW Schemes: Instruction Parallelism
° Out-of-order execution divides the ID stage:
1. Issue: decode instructions, check for structural hazards
2. Read operands: wait until no data hazards, then read operands
° Scoreboards allow an instruction to execute whenever 1 & 2 hold, without waiting for prior instructions
° CDC 6600: in-order issue, out-of-order execution, out-of-order commit (also called completion)
Dynamic machines predict program flow, looking at the instrs in multiple segments to see which to execute next, and then speculatively executing instrs based on the predictions and the instr dependencies.


Example Architecture
DEC Alpha 21264: deep pipelines, SS and dynamic pipelining
• 4 instrs fetched / cycle
• out-of-order execution
• in-order completion
• simple integer and FP pipelines: 9 stages
==> 600 MHz in 1997
Compare with the Cray T-90 supercomputer in 1997: 455 MHz. Clock rate isn’t the only performance parameter, but this is still impressive.


Commit Unit
Controls updates to the regs file and memory.
Some dynamically scheduled machines update the regs file immediately during execution. Others keep a copy of the regs file, and the actual update to the regs file occurs later as part of the commit.
For memory, there is normally a store buffer (or write buffer). The commit unit allows a store to write to memory from the buffer when the buffer has a valid address and valid data, and when the store is no longer dependent on predicted branches.
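The store-buffer rule above can be sketched directly. The field names and the single `speculative` flag are my simplifications (real buffers track individual unresolved branches):

```python
# Hedged sketch of the commit unit's store buffer: a store drains to
# memory only when its address and data are both valid and it no longer
# depends on an unresolved predicted branch.
class StoreBuffer:
    def __init__(self):
        self.entries = []   # dicts with keys: addr, data, speculative

    def add(self, addr=None, data=None, speculative=True):
        self.entries.append(
            {"addr": addr, "data": data, "speculative": speculative})

    def drain(self, memory):
        """Write every complete, non-speculative store to memory."""
        remaining = []
        for e in self.entries:
            if (e["addr"] is not None and e["data"] is not None
                    and not e["speculative"]):
                memory[e["addr"]] = e["data"]
            else:
                remaining.append(e)      # not safe to release yet
        self.entries = remaining
```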


Scoreboard Implications
° Out-of-order completion => WAR, WAW hazards?
° Solutions for WAR:
• Queue both the operation and copies of its operands
• Read registers only during the Read Operands stage
° For WAW, must detect the hazard: stall until the other instruction completes.
° Need to have multiple instructions in the execution phase => multiple execution units or pipelined execution units.
° The scoreboard keeps track of dependencies and the state of operations.


Dynamic Branch Prediction
° Solution: a 2-bit scheme that changes the prediction only after two mispredictions in a row
° Four states: Predict Taken (strong), Predict Taken (weak), Predict Not Taken (weak), Predict Not Taken (strong). A taken branch (T) moves the state toward strong-taken; a not-taken branch (NT) moves it toward strong-not-taken.
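The 2-bit scheme above is equivalently a saturating counter, sketched here (the encoding of the four states as 0-3 is a common convention, not taken from the slide):

```python
# The slide's 2-bit scheme as a saturating counter: states 0-1 predict
# not taken, states 2-3 predict taken. The prediction flips only after
# two mispredictions in a row.
class TwoBitPredictor:
    def __init__(self, state=3):       # start at strongly taken
        self.state = state

    def predict(self):
        return self.state >= 2         # True = predict taken

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)
```

This is why the scheme suits loop branches: a single not-taken loop exit nudges a strongly-taken predictor to weakly-taken, so the prediction is still "taken" when the loop re-enters.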


BHT Accuracy
° Mispredict because either:
• Wrong guess for that branch
• Got the branch history of the wrong branch when indexing the table


Need Address @ Same Time as Prediction
° Branch Target Buffer (BTB): the branch address indexes the table to get the prediction AND the branch target address (if taken)
• Each entry holds the branch prediction: predicted PC, taken or not taken
• Note: must check for a branch match now, since we can’t use the wrong branch’s address
° Return instruction addresses are predicted with a stack
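The branch-match check above can be sketched with a direct-mapped table. The table size and field layout are my assumptions for illustration:

```python
# Sketch of a direct-mapped Branch Target Buffer: the fetch PC indexes
# the table, and a stored tag guards against using another branch's
# entry (the "must check for a branch match" note on the slide).
class BTB:
    def __init__(self, entries=16):
        self.entries = entries
        self.table = [None] * entries  # each: (tag, predicted_pc, taken)

    def lookup(self, pc):
        entry = self.table[pc % self.entries]
        if entry and entry[0] == pc:   # tag match: really this branch?
            _, target, taken = entry
            return target, taken
        return None                    # miss: just fetch pc + 4

    def update(self, pc, target, taken):
        self.table[pc % self.entries] = (pc, target, taken)
```

Two branches that alias to the same slot (e.g. PCs 0x40 and 0x50 with 16 entries) are disambiguated by the tag, so a lookup never returns the wrong branch's target.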
Dynamic Branch Prediction Summary
° Branch History Table: 2 bits for loop accuracy
° Branch Target Buffer: include branch address & prediction


HW Support for More ILP
° Need a HW buffer for the results of uncommitted instructions: the reorder buffer
• The reorder buffer can be an operand source
• Once an operand commits, its result is found in the register file
• 3 fields: instr. type, destination, value
• Use the reorder buffer number instead of the reservation station to name a result
[Figure: FP op queue and FP registers feeding reservation stations in front of two FP adders, with the reorder buffer alongside.]
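The three-field reorder buffer above, with in-order commit on top of out-of-order completion, can be sketched as follows (the head-pointer implementation is my own choice):

```python
# Sketch of a reorder buffer with the slide's 3 fields (instr type,
# destination, value). Results complete in any order, but only the
# oldest uncommitted entry may retire, and only once its value arrived.
class ReorderBuffer:
    def __init__(self):
        self.entries = []   # in program order: [type, dest, value]
        self.head = 0       # oldest uncommitted instruction

    def allocate(self, instr_type, dest):
        self.entries.append([instr_type, dest, None])
        return len(self.entries) - 1      # ROB number names the result

    def complete(self, rob_num, value):
        self.entries[rob_num][2] = value  # may happen out of order

    def commit(self, regs):
        """Retire finished instructions strictly from the head."""
        while (self.head < len(self.entries)
                and self.entries[self.head][2] is not None):
            _, dest, value = self.entries[self.head]
            regs[dest] = value
            self.head += 1
```

If a younger instruction finishes first, its result waits in the buffer: the register file is untouched until every older instruction has also finished, which is what keeps interrupts precise.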


6.9 Dynamic Scheduling in PowerPC 604 and Pentium Pro
° Both: in-order issue, out-of-order execution, in-order commit
[Figure: PC, instruction cache and branch prediction feed an instruction queue and decode/dispatch unit; six reservation stations front the functional units (branch, integer, integer, complex integer, floating point, load/store), which connect to the data cache, register file, reorder buffer, and commit unit.]
PPro: a central reservation station for any functional unit, with one bus shared by a branch and an integer unit.
Dynamic Scheduling in PowerPC 604 and Pentium Pro

Parameter                            PPC            PPro
Max. instructions issued/clock       4              3
Max. instr. complete exec./clock     6              5
Max. instr. committed/clock          6              3
Instructions in reorder buffer       16             40
Number of rename buffers             12 Int / 8 FP  40
Number of reservation stations       12             20
No. integer functional units (FUs)   2              2
No. floating point FUs               1              1
No. branch FUs                       1              1
No. complex integer FUs              1              0
No. memory FUs                       1              1 load + 1 store


Dynamic Scheduling in Pentium Pro and PowerPC 604
° Both use a 512-entry branch history table for branch prediction, to predict branches and speculatively execute instrs after a predicted branch.
° The dispatcher sends each instr and its operands to the reservation station of one of the six FUs.
° The dispatcher also places an entry for the instr in the reorder buffer of the commit unit.
° An instr cannot issue unless there is space available both in an appropriate reservation station and in the reorder buffer.


Dynamic Scheduling in Pentium Pro and PowerPC
° PPro doesn’t pipeline 80x86 instructions directly
° The PPro decode unit translates the Intel instructions into 72-bit micro-operations (similar to MIPS instrs: 2 source regs and 1 destination reg)
° It sends the micro-operations to the reorder buffer & reservation stations
° It takes 1 clock cycle to determine the length of an 80x86 instruction + 2 more to create the micro-operations
° Most instructions translate to 1 to 4 micro-operations
° Complex 80x86 instructions are executed by a conventional microprogram (8K x 72 bits) that issues long sequences of micro-operations
6.10 Pipeline Implementation Issues
° The correct implementation of pipelines is not easy. Predicting all possible conflicts and their solutions, such as forwarding, and debugging the design is hard.
° The correct choice of a pipelining solution is technology dependent. E.g., when the no. of transistors on chips and the speed of transistors made a 5-stage pipeline the best solution, delayed branches were a simple solution to control hazards. With longer pipelines, SS execution and dynamic scheduling, the delayed branch is now redundant. As transistors became cheaper and logic became much faster than memory, multiple FUs and dynamic pipelining made more sense.
° It is important to consider instr set design:
• Instrs should have approximately the same length and running times.
• Addressing modes should be kept simple.


Pipeline Implementation Issues
° Increasing the pipeline depth doesn’t always increase performance:
• Data hazards may become more frequent, increasing the no. of stalls, increasing time/instr and decreasing performance.
• Control hazards mean that increasing pipeline depth results in slower branches, increasing the number of clock cycles for the program.
• Pipeline register overhead can limit the decrease in clock period obtained by further pipelining: a larger percentage of the cycle is spent on setting pipeline registers.
[Figure: relative performance (0.0 to 3.0) vs. pipeline depth (1, 2, 4, 8, 16).]
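The three effects above can be combined into a toy analytical model of my own (the work, latch-overhead and stall parameters are illustrative, not the book's numbers) showing why relative performance eventually falls as depth grows:

```python
# Toy model (mine, not the book's): each extra stage shortens the logic
# slice per cycle but adds fixed latch overhead, while hazard stalls
# grow with depth. Performance is relative to the depth-1 design.
def relative_performance(depth, work=10.0, latch=0.5, stall=0.15):
    cycle = work / depth + latch            # ns: logic slice + latch
    cpi = 1.0 + stall * depth               # hazards cost more when deep
    base = (work + latch) * (1.0 + stall)   # time/instr at depth 1
    return base / (cycle * cpi)
```

With these parameters the curve rises through moderate depths and turns back down for very deep pipes, the qualitative shape of the figure above.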
Limits to Multi-Issue Machines
° Need about (pipeline depth x no. of functional units) independent instrs. Difficulties in building the HW:
° Duplicate FUs to get parallel execution
° Increase ports to the register file
° Increase ports to memory
° SS decoding and its impact on clock rate and pipeline depth
° Limitations specific to either the SS or VLIW implementation
• Decode/issue in SS
• VLIW code size: unrolled loops + wasted fields in the VLIW




Issues in Pipelined Design
° Pipelining: issue one instruction per cycle. Limitation: issue rate, FU stalls, FU depth
° Super-pipeline: issue one instruction per (fast) cycle; the ALU takes multiple cycles. Limitation: clock skew, FU stalls, FU depth
° Super-scalar: issue multiple scalar instructions per cycle. Limitation: hazard resolution
° VLIW (“EPIC”): each instruction specifies multiple scalar operations; the compiler determines the parallelism. Limitation: packing
° Vector operations: each instruction specifies a series of identical operations. Limitation: applicability


6.11 Pipeline Summary
° Pipelining improves the average execution time per instr (instr throughput), not individual instr execution time (latency).
° Compared to the single-cycle datapath, this amounts to reducing the clock cycle time. We started with this implementation.
° Compared to the multi-cycle datapath, this amounts to reducing CPI.
° The latency of each instr is similar to the multi-cycle implementation.
° Instr latencies introduce difficulties due to dependencies in programs. The machine must wait the full instr latency for a hazard to be resolved.
° The frequency of control dependencies can be reduced by branch prediction HW and compiler scheduling.
Pipeline Summary
° Pipelines pass control information down the pipe just as data moves down the pipe
° Forwarding/stalls are handled by local control. Forwarding reduces latencies due to data hazards.
° Exceptions stop the pipeline
° The MIPS I instruction set architecture made the pipeline visible (delayed branch, delayed load)
° More performance comes from deeper pipelines and parallelism
° Superscalar and VLIW
• CPI < 1
• Dynamic issue vs. static issue
• The more instructions that issue at the same time, the larger the penalty of hazards
Pipeline Summary
° Longer pipelines, SS and dynamic scheduling have recently sustained the 60% per year processor performance increase seen since 1986.
° At first it seemed that the choice was between the highest clock-rate processors and the most sophisticated SS processors. The Alpha 21264 proved it is possible to do both (issues 6 instrs/cycle, out-of-order execution, and 600 MHz in 1997).
° With such advances in processing, Amdahl’s Law suggests that the bottleneck will be the memory system.
° An alternative to trying to exploit more parallelism at the instruction level (ILP) in uniprocessors is to use multiprocessors, which exploit parallelism (topic of Chapter 9).


3 Recent Machines

                Alpha 21164      Pentium II       HP PA-8000
Year            1995             1996             1996
Clock           600 MHz (’97)    300 MHz (’97)    236 MHz (’97)
Cache           8K/8K/96K/2M     16K/16K/0.5M     0/0/4M
Issue rate      2 int + 2 FP     3 instr (x86)    4 instr
Pipe stages     7-9              12-14            7-9
Out-of-order    6 loads          40 instr (µop)   56 instr
Rename regs     none             40               56


Final Pipelined Datapath and Control
[Figure: the complete 5-stage pipelined datapath with IF/ID, ID/EX, EX/MEM and MEM/WB registers; a hazard detection unit and forwarding unit; IF.Flush, ID.Flush and EX.Flush signals; exception (Cause/Except) logic; and the usual PC, instruction memory, register file, sign extension, ALU and data memory, with control signals RegWrite, MemWrite, MemRead, MemtoReg, ALUSrc, ALUOp and RegDst.]


Performance Comparisons for Different Datapaths
[Figure: clock rate (slower to faster) vs. instruction throughput (instructions per clock cycle, or 1/CPI; slower to faster). The single-cycle datapath (section 5.3) has a slower clock; the multicycle datapath (section 5.4) and pipelined datapath (Chapter 6) have faster clocks, with the pipelined datapath achieving the higher instruction throughput.]


Relationships Between Datapaths
[Figure: hardware (shared to specialized) vs. clock cycles of latency for an instruction (1 to several). The single-cycle datapath (section 5.3): specialized hardware, 1 cycle of latency. The pipelined datapath (Chapter 6): specialized hardware, several cycles. The multicycle datapath (section 5.4): shared hardware, several cycles.]


6.12 Historical Perspective
A rough timeline (machines and innovations as given on the slide):
• 1961: Stretch (IBM 7030), ~100x the IBM 704: instruction pipelining, instruction buffering; microprogramming; hardwired control
• Load/store ISAs: CDC 6600, 7600, Cray-1, ...
• 1966: IBM 360/91, 25x the basic model: dynamic instruction scheduling with extensive pipelining
• 1967: virtual memory (Multics, GE-645, IBM 360/67, ...)
• Cache (IBM 360/85, ...)
• 1980s: RISC pipelines, vector processors (MIPS, SPARC, IBM RS6000)
• Early 1990s: RISC superscalars (IBM Power 1 and PowerPC)
• Today
(The slide also lists memory and bus parameters for several of these machines, e.g. 780ns memory and an 8x16b bus in the Stretch era, 960ns memory and a 4x16b bus with the 2KB cache, a 60ns TLB, and 32KB caches at 60-160ns later on.)