
Review: Summary of Pipelining Basics
° Pipelines pass control information down the pipe just as data moves down the pipe
° Forwarding/stalls are handled by local control
° Hazards limit performance
• Structural: need more HW resources
• Data: need forwarding, compiler scheduling
• Control: early evaluation of the PC, delayed branch, prediction
° Increasing the length of the pipe increases the impact of hazards; pipelining helps instruction bandwidth, not latency
° Interrupts, the instruction set, and FP make pipelining harder
° Compilers reduce the cost of data and control hazards
• Load delay slots
• Branch delay slots
• Branch prediction
cs 152 L1 5 .1 DAP Fa97, U.CB
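The three data-hazard classes in the summary above can be sketched as a small check over register sets. This is a toy model of my own (instructions as (destinations, sources) sets), not code from the lecture:

```python
# Toy model (not from the lecture): classify the data hazards that the
# second of two nearby instructions has on the first, given each
# instruction as a (dest_regs, source_regs) pair of sets.
def classify_hazards(first, second):
    d1, s1 = first
    d2, s2 = second
    hazards = set()
    if d1 & s2:
        hazards.add("RAW")   # second reads what first writes
    if d1 & d2:
        hazards.add("WAW")   # both write the same register
    if s1 & d2:
        hazards.add("WAR")   # second writes what first reads
    return hazards

# add r1,r2,r3 followed by sub r4,r1,r3 -> RAW hazard on r1
add_instr = ({"r1"}, {"r2", "r3"})
sub_instr = ({"r4"}, {"r1", "r3"})
```

In a real pipeline only instructions close enough to overlap matter; the check above ignores distance for simplicity.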
Recap: Pipeline Hazards
° Structural hazard: two instructions need the same resource in the same cycle (e.g. an instruction fetch colliding with a memory operand fetch)
° Control hazard: the pipeline fetches past a jump or branch before its outcome is known
° Data hazards between instructions whose IF-DCD-EX-Mem-WB stages overlap:
• RAW (read after write)
• WAW (write after write)
• WAR (write after read)
Recap: Data Hazards
° Avoid some “by design”
• eliminate WAR by always fetching operands early (DCD) in the pipe
• eliminate WAW by doing all WBs in order (last stage, static)
° Detect and resolve the remaining ones
• stall or forward (if possible)


Recap: Exception Problem
° Exceptions/interrupts: 5 instructions executing in a 5-stage pipeline
• How to stop the pipeline?
• Restart?
• Who caused the interrupt?
Stage  Problem interrupts occurring
IF     Page fault on instruction fetch; misaligned memory access; memory-protection violation
ID     Undefined or illegal opcode
EX     Arithmetic exception
MEM    Page fault on data fetch; misaligned memory access; memory-protection violation; memory error
° What if a Load takes a data page fault while an Add takes an instruction page fault?
° Solution 1: interrupt vector/instruction
The Big Picture: Where are We Now?
° The Five Classic Components of a Computer: Processor (Control + Datapath), Memory, Input, Output
° Today’s Topics:
• Recap last lecture
• Review MIPS R3000 pipeline
• Advanced Pipelining
• SuperScalar


FYI: MIPS R3000 Clocking Discipline
° 2-phase non-overlapping clocks (phi1, phi2)
° A pipeline stage is two (level-sensitive) latches, one per phase; the phi1/phi2 latch pair together behaves like an edge-triggered register


MIPS R3000 Instruction Pipeline
Stages and resource usage: Inst Fetch (TLB, I-Cache) | Decode / Reg. Read (RF) | ALU / E.A. (Operation, E.A. TLB) | Memory (D-Cache) | Write Reg (WB)
Register file writes occur in phase 1 and reads in phase 2 of a cycle => eliminates the bypass from the WB stage


Recall: Data Hazard on r1
Instruction sequence (each instruction after the first reads r1):
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or  r8,r1,r9
  xor r10,r1,r11
With the MIPS R3000 pipeline (write in phase 1, read in phase 2), there is no need to forward from the WB stage.


MIPS R3000 Multicycle Operations
Ex: Multiply, Divide, Cache Miss
° Stall all stages above the multicycle operation in the pipeline
° Drain (bubble) the stages below it
° Use the control word of local stage state to step through the multicycle operation


6.8 Superscalar and Dynamic Pipelining
This and the next section are brief overviews of advanced topics. More info in Computer Architecture: A Quantitative Approach, 2nd edition.
For faster processors:
° Superpipelining: longer pipelines. Some recent microprocessors have gone to pipelines with 8 or more stages.
° Superscalar: replicate the internal components of the computer so that it can issue a varying no. of instrs/cycle (1 to 6). The instr execution rate can exceed the clock rate, i.e. CPI < 1. Some suggest the inverted metric IPC (instrs/cycle).
• Parallelism and dependencies determined/resolved by HW
• IBM PowerPC 604, Sun UltraSparc, DEC Alpha 21164, HP 7100
° Dynamic Pipeline Scheduling: the HW schedules around stalls, so that later instrs that are ready for execution can proceed in parallel.
Getting CPI < 1: Issuing Multiple Instructions/Cycle
° Superscalar (SS) MIPS: 2 instructions, 1 ALU or branch & 1 load or store
• Fetch 64 bits/clock cycle; ALU or branch on the left, LW or SW on the right
• Can only issue the 2nd instruction if the 1st instruction issues. The HW makes this decision dynamically, issuing only the 1st instr if conditions are not met.
• More ports for the regs file: may need 2 regs for the ALU operation and 2 for a store, plus 1 write port for the ALU and 1 for a load. Also 1 more adder for effective-address calculations for loads and stores.
Type             Pipe stages
ALU instruction  IF ID EX MEM WB
LW  instruction  IF ID EX MEM WB
ALU instruction     IF ID EX MEM WB
LW  instruction     IF ID EX MEM WB
ALU instruction        IF ID EX MEM WB
LW  instruction        IF ID EX MEM WB
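The dual-issue rule above can be sketched as a predicate. The instruction encoding (op-class, destination, source set) is my own simplification, not the lecture's:

```python
# Sketch (assumed encoding, not from the lecture): decide whether the
# 2-wide SS MIPS can issue an instruction pair in the same cycle.
# Slot 1 must hold an ALU op or branch, slot 2 a load or store, and the
# memory instruction must not read a register the ALU op writes.
def can_dual_issue(first, second):
    op1, dest1, srcs1 = first
    op2, dest2, srcs2 = second
    if op1 not in {"alu", "branch"}:
        return False                 # wrong kind on the left
    if op2 not in {"lw", "sw"}:
        return False                 # wrong kind on the right
    if dest1 is not None and dest1 in srcs2:
        return False                 # RAW between the pair: issue 1st only
    return True

# addu $t0,$t1,$t2 paired with sw $t3,0($t4): independent -> dual issue
# addu $t0,$t1,$t2 paired with sw $t0,0($t4): sw reads $t0 -> single issue
```

When the predicate fails, the HW issues only the first instruction, exactly as the slide describes.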
Superscalar Datapath
[Figure: the single-issue datapath (PC, instruction memory, register file, ALU, sign extension, data memory) extended with a second instruction path and a second ALU.]
Superscalar additions: 32 more bits from instr memory, 2 read ports + 1 write port for the regs file, 1 more ALU (top ALU for address calculation, bottom ALU for all else).
Superscalar Characteristics
° Loads have a latency of 1 cycle: if the next instr uses the load’s result, it must stall. In SS, the 1-cycle load delay expands to cover the next 2 instrs (the whole issue pair in the next slot).
° Performance improvement: e.g. a 1000 MHz four-way superscalar microprocessor can execute at a peak rate of 4 billion instrs/second, with a best CPI of 0.25. Today’s superscalar machines try to schedule 2 to 6 instrs in each pipe stage.
° If instrs in the instr stream are dependent or don’t meet certain criteria, only the first few (maybe just the first) instrs in the sequence are issued.
° More ambitious compiler or HW scheduling techniques are needed, as well as more complex instr decoding, to effectively exploit the parallelism available in SS.
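The peak-rate arithmetic above can be checked directly (the 1000 MHz and four-way figures are the slide's example values):

```python
# The slide's peak-rate arithmetic, checked numerically.
clock_hz = 1000e6                              # 1000 MHz clock
issue_width = 4                                # four-way superscalar
peak_instrs_per_sec = clock_hz * issue_width   # peak: 4 billion instrs/s
best_cpi = 1 / issue_width                     # best case: CPI = 0.25
```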
Scheduling Code for Superscalar
Reorder the following instrs to avoid as many stalls as possible.

Loop: lw   $t0, 0($s1)       # $t0 is first array element
      addu $t0, $t0, $s2     # add scalar in $s2
      sw   $t0, 0($s1)       # store result
      addi $s1, $s1, -4      # decrement pointer
      bne  $s1, $zero, Loop  # branch if $s1 != 0

The 1st 3 instrs have data dependencies, and so do the last 2.
Best Scheduling Solution

ALU or branch instr        Data transfer instr    Clock cycle
Loop:                      lw  $t0, 0($s1)        1
addi $s1, $s1, -4                                 2
addu $t0, $t0, $s2                                3
bne $s1, $zero, Loop       sw  $t0, 4($s1)        4

Only one pair of instrs executes in superscalar mode.
4 cycles / loop iteration => 4 cycles / 5 instrs => CPI = 0.8 (not good compared to the best case of CPI = 0.5).
To get more performance from loops that access arrays => Loop Unrolling: make multiple copies of the loop body, and schedule instrs from different iterations together.
Unrolled Loop that Minimizes Stalls for Superscalar
4 copies to schedule without delays (assumes the loop index is a multiple of 4).
ALU or branch instr        Data transfer instr    Clock cycle
Loop: addi $s1, $s1, -16   lw $t0, 0($s1)         1
                           lw $t1, 12($s1)        2
addu $t0, $t0, $s2         lw $t2, 8($s1)         3
addu $t1, $t1, $s2         lw $t3, 4($s1)         4
addu $t2, $t2, $s2         sw $t0, 16($s1)        5
addu $t3, $t3, $s2         sw $t1, 12($s1)        6
                           sw $t2, 8($s1)         7
bne $s1, $zero, Loop       sw $t3, 4($s1)         8
Since the 1st pair decrements $s1 by 16, the addresses loaded are the original value of $s1, then this address - 4, - 8, and - 12.
12 of 14 instrs execute in superscalar mode.
8 cycles / 4 loop iterations = 2 cycles/iteration (without unrolling: 4 cycles/iteration) => a factor of 2 improvement, from reducing loop-control instrs + SS execution.
Overhead: 4 temp regs rather than 1.
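The CPI arithmetic from the two scheduling slides above can be checked numerically (the cycle and instruction counts are the slides' own):

```python
# CPI arithmetic from the scheduling slides: the scheduled (but not
# unrolled) loop runs 5 instrs in 4 cycles; the 4x-unrolled loop runs
# 14 instrs in 8 cycles.
def cpi(cycles, instrs):
    return cycles / instrs

scheduled_cpi = cpi(4, 5)            # 0.8, vs the ideal 0.5
unrolled_cpi = cpi(8, 14)            # ~0.57, closer to the ideal
cycles_per_iter_unrolled = 8 / 4     # 2 cycles/iter vs 4 without unrolling
```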
Performance Improvement Limitations
° Pipelining and SS increase peak instr throughput.
° While the ALU/data-transfer split is simple for the HW, you get a CPI of 0.5 only for programs with:
• Exactly 50% ALU and branch operations
• No hazards
° If more instructions issue at the same time, there is greater difficulty of decode and issue
• Even 2-scalar => examine 2 opcodes and 6 register specifiers, & decide if 1 or 2 instructions can issue
° Longer pipelines & wider SS issue => more pressure on compiler scheduling to deliver the potential performance of the HW.
° Compiler writers must understand the pipeline to generate appropriate code and achieve the best performance.
Multiple Pipes / Harder Superscalar
[Figure: two instruction registers (IR0, IR1) feeding a shared register file and two parallel execute/memory pipes.]
Issues:
• Reg. file ports
• Detecting data dependences
• Bypassing
• RAW hazards
• WAR hazards
• Multiple load/store ops?
• Branches


Limits of Superscalar
° Data + control dependencies + instr latencies => an upper limit on delivered performance.
° Designers must guarantee correct execution of all instr sequences.
° VLIW (Very Long Instr Word): several instrs are issued during each cycle, as in SS, but here the compiler guarantees that there are no dependencies between instrs that issue at the same time and that there are sufficient HW resources to execute them (simplifies instr decode and issue logic). Tradeoff: instruction space for simple decoding.
• The long instruction word has room for many operations
• By definition, all the operations the compiler puts in the long instruction word can execute in parallel
• E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch
- 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide
• Needs a compiling technique that schedules across several branches
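The compiler's VLIW packing job above can be sketched greedily. The slot types and one-slot-per-type word shape are my assumptions for illustration, not a real VLIW format:

```python
# Hedged sketch of VLIW packing: greedily pack operations into long
# words. Each op is (slot_type, dest, sources); a word holds at most one
# op per slot type, and ops in one word must be mutually independent.
SLOTS = ("int", "mem", "branch")

def pack_vliw(ops):
    words, current, written = [], {}, set()
    for op in ops:
        slot, dest, srcs = op
        independent = not (set(srcs) & written) and dest not in written
        if slot in current or not independent:
            words.append(current)        # close word, start a new one
            current, written = {}, set()
        current[slot] = op
        if dest is not None:
            written.add(dest)
    if current:
        words.append(current)
    return words
```

For example, an int op writing `a`, an independent memory op, then an int op reading `a` pack into two words: the dependent op cannot share a word with its producer.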


SS vs. VLIW
° SS processors can run unchanged binary machine code that runs on more traditional architectures.
° VLIW works well when the source code of the programs is available, so that the programs can be recompiled.


Dynamic Pipeline Scheduling
Tries to find later instrs to execute while waiting for a stall to be resolved.
The pipeline is divided into 3 major units:
° instr fetch and issue unit: fetches, decodes and sends instrs to the corresponding functional unit of the execute stage.
° execute units: each one has buffers called reservation stations that hold the operands and the operation.
° commit unit: decides when it is safe to put a result into the regs file or memory.
To make programs behave as if they were running on a simple nonpipelined computer, the IF/ID unit must issue instrs in order, and the commit unit must write the results to regs and memory in program execution order: in-order completion.
If an exception occurs in this conservative mode, the computer can point to the last instr executed, and only instrs issued before the faulty one will have updated regs and memory.
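The execute units' reservation stations described above can be sketched minimally. The class shape and method names are my assumptions; real stations also hold operand values and functional-unit state:

```python
# Minimal sketch of a reservation station: an instruction issues in
# order, but may start executing only once every operand it is waiting
# on has been produced (signalled here by a result-tag broadcast).
class ReservationStation:
    def __init__(self):
        self.waiting = []              # (name, needed_tags, op)

    def issue(self, name, needed, op):
        self.waiting.append((name, set(needed), op))

    def wake_up(self, produced_tag):
        """A result is broadcast: clear that dependency everywhere."""
        for _, needed, _ in self.waiting:
            needed.discard(produced_tag)

    def ready(self):
        """Names of instrs whose operands are all available."""
        return [name for name, needed, _ in self.waiting if not needed]
```

With a slow DIVD producing F0, an ADDD waiting on F0 sits in its station while an independent SUBD is ready at once; after the F0 broadcast, both are ready.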
Dynamically Scheduled Pipeline
[Figure: instruction fetch and decode unit (in-order issue) feeding reservation stations in front of the functional units (integer, integer, floating point, load/store; out-of-order execute), which feed the commit unit (in-order commit).]


Difficulties of Dynamic Pipeline Scheduling
Functional units are free to start and finish whenever they want.
° Out-of-order completion (more radical): allows the commit to be out of order => introduces imprecise interrupts.
° Dynamic scheduling is normally combined with branch prediction (speculative execution) => the commit unit must be able to discard all results in the execution unit due to instrs executed after a mispredicted branch.
° Dynamic scheduling is also combined with SS execution, so each unit may be issuing or committing 4 to 6 instrs each cycle.
HW Schemes: Instruction Parallelism
° Why in HW at run time?
• Works when we can’t know the real dependences at compile time
• Compiler simpler
• Code for one machine runs well on another
• Hides memory latency
• Speculatively execute instrs while waiting for potential hazards to be resolved
° Key idea: allow instructions behind a stall to proceed
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F12,F8,F14
(SUBD does not depend on the DIVD, so it need not wait behind the stalled ADDD.)
• Enables out-of-order execution => out-of-order completion
° The ID stage checked for both structural and data hazards
HW Schemes: Instruction Parallelism
° Out-of-order execution divides the ID stage:
1. Issue: decode instructions, check for structural hazards
2. Read operands: wait until no data hazards, then read operands
° Scoreboards allow an instruction to execute whenever 1 & 2 hold, without waiting for prior instructions
° CDC 6600: in-order issue, out-of-order execution, out-of-order commit (also called completion)
Dynamic machines predict program flow, looking at the instrs in multiple segments to see which to execute next, and then speculatively executing instrs based on the predictions and the instr dependencies.


Example Architecture
DEC Alpha 21264: deep pipelines, SS and dynamic pipelining
• 4 instrs fetched / cycle
• out-of-order execution
• in-order completion
• simple integer and FP pipelines: 9 stages
==> 600 MHz in 1997
Compare with the Cray T-90 supercomputer in 1997: 455 MHz. Clock rate isn’t the only performance parameter, but this is still impressive.


Commit Unit
Controls updates to the regs file and memory.
Some dynamically scheduled machines update the regs file immediately during execution. Others keep a copy of the regs file, and the actual update to the regs file occurs later as part of the commit.
For memory, there is normally a store buffer (or write buffer). The commit unit allows a store to write to memory from the buffer when the buffer has a valid address and valid data, and when the store is no longer dependent on predicted branches.
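The store-buffer rule above can be sketched directly. The field names and the single `speculative` flag are my simplifications (real buffers track individual unresolved branches):

```python
# Hedged sketch of the commit unit's store buffer: a store drains to
# memory only when its address and data are both valid and it no longer
# depends on an unresolved predicted branch.
class StoreBuffer:
    def __init__(self):
        self.entries = []   # dicts with keys: addr, data, speculative

    def add(self, addr=None, data=None, speculative=True):
        self.entries.append(
            {"addr": addr, "data": data, "speculative": speculative})

    def drain(self, memory):
        """Write every complete, non-speculative store to memory."""
        remaining = []
        for e in self.entries:
            if (e["addr"] is not None and e["data"] is not None
                    and not e["speculative"]):
                memory[e["addr"]] = e["data"]
            else:
                remaining.append(e)      # not safe to release yet
        self.entries = remaining
```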


Scoreboard Implications
° Out-of-order completion => WAR, WAW hazards?
° Solutions for WAR:
• Queue both the operation and copies of its operands
• Read registers only during the Read Operands stage
° For WAW, must detect the hazard: stall until the other instruction completes.
° Need to have multiple instructions in the execution phase => multiple execution units or pipelined execution units.
° The scoreboard keeps track of dependencies and the state of operations.


Dynamic Branch Prediction
° Solution: a 2-bit scheme that changes the prediction only after two mispredictions in a row
° Four states: Predict Taken (strong), Predict Taken (weak), Predict Not Taken (weak), Predict Not Taken (strong). A taken branch (T) moves the state toward strong-taken; a not-taken branch (NT) moves it toward strong-not-taken.
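The 2-bit scheme above is equivalently a saturating counter, sketched here (the encoding of the four states as 0-3 is a common convention, not taken from the slide):

```python
# The slide's 2-bit scheme as a saturating counter: states 0-1 predict
# not taken, states 2-3 predict taken. The prediction flips only after
# two mispredictions in a row.
class TwoBitPredictor:
    def __init__(self, state=3):       # start at strongly taken
        self.state = state

    def predict(self):
        return self.state >= 2         # True = predict taken

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)
```

This is why the scheme suits loop branches: a single not-taken loop exit nudges a strongly-taken predictor to weakly-taken, so the prediction is still "taken" when the loop re-enters.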


BHT Accuracy
° Mispredict because either:
• Wrong guess for that branch
• Got the branch history of the wrong branch when indexing the table


Need Address @ Same Time as Prediction
° Branch Target Buffer (BTB): the branch address indexes the table to get the prediction AND the branch target address (if taken)
• Each entry holds the branch prediction: predicted PC, taken or not taken
• Note: must check for a branch match now, since we can’t use the wrong branch’s address
° Return instruction addresses are predicted with a stack
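The branch-match check above can be sketched with a direct-mapped table. The table size and field layout are my assumptions for illustration:

```python
# Sketch of a direct-mapped Branch Target Buffer: the fetch PC indexes
# the table, and a stored tag guards against using another branch's
# entry (the "must check for a branch match" note on the slide).
class BTB:
    def __init__(self, entries=16):
        self.entries = entries
        self.table = [None] * entries  # each: (tag, predicted_pc, taken)

    def lookup(self, pc):
        entry = self.table[pc % self.entries]
        if entry and entry[0] == pc:   # tag match: really this branch?
            _, target, taken = entry
            return target, taken
        return None                    # miss: just fetch pc + 4

    def update(self, pc, target, taken):
        self.table[pc % self.entries] = (pc, target, taken)
```

Two branches that alias to the same slot (e.g. PCs 0x40 and 0x50 with 16 entries) are disambiguated by the tag, so a lookup never returns the wrong branch's target.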
Dynamic Branch Prediction Summary
° Branch History Table: 2 bits for loop accuracy
° Branch Target Buffer: include branch address & prediction


HW Support for More ILP
° Need a HW buffer for the results of uncommitted instructions: the reorder buffer
• The reorder buffer can be an operand source
• Once an operand commits, its result is found in the register file
• 3 fields: instr. type, destination, value
• Use the reorder buffer number instead of the reservation station to name a result
[Figure: FP op queue and FP registers feeding reservation stations in front of two FP adders, with the reorder buffer alongside.]
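The three-field reorder buffer above, with in-order commit on top of out-of-order completion, can be sketched as follows (the head-pointer implementation is my own choice):

```python
# Sketch of a reorder buffer with the slide's 3 fields (instr type,
# destination, value). Results complete in any order, but only the
# oldest uncommitted entry may retire, and only once its value arrived.
class ReorderBuffer:
    def __init__(self):
        self.entries = []   # in program order: [type, dest, value]
        self.head = 0       # oldest uncommitted instruction

    def allocate(self, instr_type, dest):
        self.entries.append([instr_type, dest, None])
        return len(self.entries) - 1      # ROB number names the result

    def complete(self, rob_num, value):
        self.entries[rob_num][2] = value  # may happen out of order

    def commit(self, regs):
        """Retire finished instructions strictly from the head."""
        while (self.head < len(self.entries)
                and self.entries[self.head][2] is not None):
            _, dest, value = self.entries[self.head]
            regs[dest] = value
            self.head += 1
```

If a younger instruction finishes first, its result waits in the buffer: the register file is untouched until every older instruction has also finished, which is what keeps interrupts precise.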


6.9 Dynamic Scheduling in PowerPC 604 and Pentium Pro
° Both: in-order issue, out-of-order execution, in-order commit
[Figure: PC, instruction cache and branch prediction feed an instruction queue and decode/dispatch unit; six reservation stations front the functional units (branch, integer, integer, complex integer, floating point, load/store), which connect to the data cache, register file, reorder buffer, and commit unit.]
PPro: a central reservation station for any functional unit, with one bus shared by a branch and an integer unit.
Dynamic Scheduling in PowerPC 604 and Pentium Pro

Parameter                            PPC            PPro
Max. instructions issued/clock       4              3
Max. instr. complete exec./clock     6              5
Max. instr. committed/clock          6              3
Instructions in reorder buffer       16             40
Number of rename buffers             12 Int / 8 FP  40
Number of reservation stations       12             20
No. integer functional units (FUs)   2              2
No. floating point FUs               1              1
No. branch FUs                       1              1
No. complex integer FUs              1              0
No. memory FUs                       1              1 load + 1 store


Dynamic Scheduling in Pentium Pro and PowerPC 604
° Both use a 512-entry branch history table for branch prediction, to predict branches and speculatively execute instrs after a predicted branch.
° The dispatcher sends each instr and its operands to the reservation station of one of the six FUs.
° The dispatcher also places an entry for the instr in the reorder buffer of the commit unit.
° An instr cannot issue unless there is space available both in an appropriate reservation station and in the reorder buffer.


Dynamic Scheduling in Pentium Pro and PowerPC
° PPro doesn’t pipeline 80x86 instructions directly
° The PPro decode unit translates the Intel instructions into 72-bit micro-operations (similar to MIPS instrs: 2 source regs and 1 destination reg)
° It sends the micro-operations to the reorder buffer & reservation stations
° It takes 1 clock cycle to determine the length of an 80x86 instruction + 2 more to create the micro-operations
° Most instructions translate to 1 to 4 micro-operations
° Complex 80x86 instructions are executed by a conventional microprogram (8K x 72 bits) that issues long sequences of micro-operations
6.10 Pipeline Implementation Issues
° The correct implementation of pipelines is not easy. Predicting all possible conflicts and their solutions, such as forwarding, and debugging the design is hard.
° The correct choice of a pipelining solution is technology dependent. E.g., when the no. of transistors on chips and the speed of transistors made a 5-stage pipeline the best solution, delayed branches were a simple solution to control hazards. With longer pipelines, SS execution and dynamic scheduling, the delayed branch is now redundant. As transistors became cheaper and logic became much faster than memory, multiple FUs and dynamic pipelining made more sense.
° It is important to consider instr set design:
• Instrs should have approximately the same length and running times.
• Addressing modes should be kept simple.


Pipeline Implementation Issues
° Increasing the pipeline depth doesn’t always increase performance:
• Data hazards may become more frequent, increasing the no. of stalls, increasing time/instr and decreasing performance.
• Control hazards mean that increasing pipeline depth results in slower branches, increasing the number of clock cycles for the program.
• Pipeline register overhead can limit the decrease in clock period obtained by further pipelining: a larger percentage of the cycle is spent on setting pipeline registers.
[Figure: relative performance (0.0 to 3.0) vs. pipeline depth (1, 2, 4, 8, 16).]
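The three effects above can be combined into a toy analytical model of my own (the work, latch-overhead and stall parameters are illustrative, not the book's numbers) showing why relative performance eventually falls as depth grows:

```python
# Toy model (mine, not the book's): each extra stage shortens the logic
# slice per cycle but adds fixed latch overhead, while hazard stalls
# grow with depth. Performance is relative to the depth-1 design.
def relative_performance(depth, work=10.0, latch=0.5, stall=0.15):
    cycle = work / depth + latch            # ns: logic slice + latch
    cpi = 1.0 + stall * depth               # hazards cost more when deep
    base = (work + latch) * (1.0 + stall)   # time/instr at depth 1
    return base / (cycle * cpi)
```

With these parameters the curve rises through moderate depths and turns back down for very deep pipes, the qualitative shape of the figure above.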
Limits to Multi-Issue Machines
° Need about (pipeline depth x no. of functional units) independent instrs. Difficulties in building the HW:
° Duplicate FUs to get parallel execution
° Increase ports to the register file
° Increase ports to memory
° SS decoding and its impact on clock rate and pipeline depth
° Limitations specific to either the SS or VLIW implementation
• Decode/issue in SS
• VLIW code size: unrolled loops + wasted fields in the VLIW




Issues in Pipelined Design
° Pipelining: issue one instruction per cycle. Limitation: issue rate, FU stalls, FU depth
° Super-pipeline: issue one instruction per (fast) cycle; the ALU takes multiple cycles. Limitation: clock skew, FU stalls, FU depth
° Super-scalar: issue multiple scalar instructions per cycle. Limitation: hazard resolution
° VLIW (“EPIC”): each instruction specifies multiple scalar operations; the compiler determines the parallelism. Limitation: packing
° Vector operations: each instruction specifies a series of identical operations. Limitation: applicability


6.11 Pipeline Summary
° Pipelining improves the average execution time per instr (instr throughput), not individual instr execution time (latency).
° Compared to the single-cycle datapath, this amounts to reducing the clock cycle time. We started with this implementation.
° Compared to the multi-cycle datapath, this amounts to reducing CPI.
° The latency of each instr is similar to the multi-cycle implementation.
° Instr latencies introduce difficulties due to dependencies in programs. The machine must wait the full instr latency for a hazard to be resolved.
° The frequency of control dependencies can be reduced by branch prediction HW and compiler scheduling.
Pipeline Summary
° Pipelines pass control information down the pipe just as data moves down the pipe
° Forwarding/stalls are handled by local control. Forwarding reduces latencies due to data hazards.
° Exceptions stop the pipeline
° The MIPS I instruction set architecture made the pipeline visible (delayed branch, delayed load)
° More performance comes from deeper pipelines and parallelism
° Superscalar and VLIW
• CPI < 1
• Dynamic issue vs. static issue
• The more instructions that issue at the same time, the larger the penalty of hazards
Pipeline Summary
° Longer pipelines, SS and dynamic scheduling have recently sustained the 60% per year processor performance increase seen since 1986.
° At first it seemed that the choice was between the highest clock-rate processors and the most sophisticated SS processors. The Alpha 21264 proved it is possible to do both (issues 6 instrs/cycle, out-of-order execution, and 600 MHz in 1997).
° With such advances in processing, Amdahl’s Law suggests that the bottleneck will be the memory system.
° An alternative to trying to exploit more parallelism at the instruction level (ILP) in uniprocessors is to use multiprocessors, which exploit parallelism (topic of Chapter 9).


3 Recent Machines

                Alpha 21164      Pentium II       HP PA-8000
Year            1995             1996             1996
Clock           600 MHz (’97)    300 MHz (’97)    236 MHz (’97)
Cache           8K/8K/96K/2M     16K/16K/0.5M     0/0/4M
Issue rate      2 int + 2 FP     3 instr (x86)    4 instr
Pipe stages     7-9              12-14            7-9
Out-of-order    6 loads          40 instr (µop)   56 instr
Rename regs     none             40               56


Final Pipelined Datapath and Control
[Figure: the complete 5-stage pipelined datapath with IF/ID, ID/EX, EX/MEM and MEM/WB registers; a hazard detection unit and forwarding unit; IF.Flush, ID.Flush and EX.Flush signals; exception (Cause/Except) logic; and the usual PC, instruction memory, register file, sign extension, ALU and data memory, with control signals RegWrite, MemWrite, MemRead, MemtoReg, ALUSrc, ALUOp and RegDst.]


Performance Comparisons for Different Datapaths
[Figure: clock rate (slower to faster) vs. instruction throughput (instructions per clock cycle, or 1/CPI; slower to faster). The single-cycle datapath (section 5.3) has a slower clock; the multicycle datapath (section 5.4) and pipelined datapath (Chapter 6) have faster clocks, with the pipelined datapath achieving the higher instruction throughput.]


Relationships Between Datapaths
[Figure: hardware (shared to specialized) vs. clock cycles of latency for an instruction (1 to several). The single-cycle datapath (section 5.3): specialized hardware, 1 cycle of latency. The pipelined datapath (Chapter 6): specialized hardware, several cycles. The multicycle datapath (section 5.4): shared hardware, several cycles.]


6.12 Historical Perspective
A rough timeline (machines and innovations as given on the slide):
• 1961: Stretch (IBM 7030), ~100x the IBM 704: instruction pipelining, instruction buffering; microprogramming; hardwired control
• Load/store ISAs: CDC 6600, 7600, Cray-1, ...
• 1966: IBM 360/91, 25x the basic model: dynamic instruction scheduling with extensive pipelining
• 1967: virtual memory (Multics, GE-645, IBM 360/67, ...)
• Cache (IBM 360/85, ...)
• 1980s: RISC pipelines, vector processors (MIPS, SPARC, IBM RS6000)
• Early 1990s: RISC superscalars (IBM Power 1 and PowerPC)
• Today
(The slide also lists memory and bus parameters for several of these machines, e.g. 780ns memory and an 8x16b bus in the Stretch era, 960ns memory and a 4x16b bus with the 2KB cache, a 60ns TLB, and 32KB caches at 60-160ns later on.)