
Chapter Six: 2004 Morgan Kaufmann Publishers

Pipelining improves performance by increasing instruction throughput. Ideal speedup is the number of stages in the pipeline. Do we achieve this? What makes it easy: all instructions are the same length, there are just a few instruction formats, and memory operands appear only in loads and stores. What makes it hard: structural hazards (suppose we had only one memory), control hazards (need to worry about branch instructions), and data hazards (an instruction depends on a previous instruction).

Uploaded by

samquickly
Copyright
© Attribution Non-Commercial (BY-NC)

Chapter Six

2004 Morgan Kaufmann Publishers

Pipelining
Improve performance by increasing instruction throughput
[Figure: program execution order (in instructions) against time, 200 to 1800 ps, for lw $1, 100($0); lw $2, 200($0); lw $3, 300($0). Each instruction passes through Instruction fetch, Reg, ALU, Data access, Reg. Top: without pipelining, a new instruction starts every 800 ps. Bottom: with pipelining, a new instruction starts every 200 ps.]

Note: timing assumptions changed for this example

Ideal speedup is number of stages in the pipeline. Do we achieve this?
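A minimal sketch of the arithmetic behind that question, using the numbers in the figure: 800 ps per instruction without pipelining and a 200 ps clock with pipelining. The five-stage count is read off the stage labels, and the function names are illustrative only.

```python
def nonpipelined_time_ps(n, instr_ps=800):
    # Each instruction runs to completion before the next one starts.
    return n * instr_ps

def pipelined_time_ps(n, stages=5, cycle_ps=200):
    # The first instruction needs `stages` cycles; each later one
    # finishes one cycle after its predecessor.
    return (stages + n - 1) * cycle_ps

def speedup(n):
    return nonpipelined_time_ps(n) / pipelined_time_ps(n)

print(nonpipelined_time_ps(3), pipelined_time_ps(3))  # 2400 1400
print(round(speedup(3), 2))                           # 1.71
print(round(speedup(1_000_000), 2))                   # 4.0
```

Under these assumptions the speedup for a long instruction stream approaches 800/200 = 4 rather than 5: the stages are not perfectly balanced, so the clock is set by the slowest stage.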



Pipelining
What makes it easy
- all instructions are the same length
- just a few instruction formats
- memory operands appear only in loads and stores

What makes it hard?
- structural hazards: suppose we had only one memory
- control hazards: need to worry about branch instructions
- data hazards: an instruction depends on a previous instruction

We'll build a simple pipeline and look at these issues.

We'll talk about modern processors and what really makes it hard:
- exception handling
- trying to improve performance with out-of-order execution, etc.

Basic Idea
- IF: Instruction fetch
- ID: Instruction decode / register file read
- EX: Execute / address calculation
- MEM: Memory access
- WB: Write back

[Figure: single-cycle datapath laid out left to right by stage: PC, instruction memory, and the PC + 4 adder; register file and sign extend (16 to 32 bits); ALU, shift left 2, and the branch-target adder; data memory; write data path back to the register file.]

What do we need to add to actually split the datapath into stages?



Pipelined Datapath

[Figure: the same datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB inserted between the stages.]

Can you find a problem even if there are no dependencies? What instructions can we execute to manifest the problem?

Corrected Datapath

[Figure: corrected pipelined datapath with the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers.]


Graphically Representing Pipelines


Time (in clock cycles) across, program execution order (in instructions) down:

                  CC 1   CC 2   CC 3   CC 4   CC 5   CC 6   CC 7
lw $1, 100($0)    IM     Reg    ALU    DM     Reg
lw $2, 200($0)           IM     Reg    ALU    DM     Reg
lw $3, 300($0)                  IM     Reg    ALU    DM     Reg

Can help with answering questions like:
- how many cycles does it take to execute this code?
- what is the ALU doing during cycle 4?
- use this representation to help understand datapaths
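The two questions can be answered mechanically once the diagram is treated as a simple offset rule: instruction i (counting from 0) enters the pipeline in cycle i + 1 and moves one stage per cycle. The sketch below assumes no stalls; the helper names are made up for illustration.

```python
# Stage labels as drawn on the slide; "Reg" appears twice
# (register read in ID, register write in WB).
STAGES = ["IM", "Reg", "ALU", "DM", "Reg"]

def stage_in_cycle(instr_index, cycle):
    """Stage occupied by instruction `instr_index` (0-based) in clock
    cycle `cycle` (1-based), or None if it is not in the pipeline."""
    offset = cycle - 1 - instr_index
    if 0 <= offset < len(STAGES):
        return STAGES[offset]
    return None

def total_cycles(num_instructions):
    # Fill the pipeline once, then one instruction completes per cycle.
    return len(STAGES) + num_instructions - 1

def instruction_in_alu(program, cycle):
    index = cycle - 1 - STAGES.index("ALU")
    return program[index] if 0 <= index < len(program) else None

program = ["lw $1, 100($0)", "lw $2, 200($0)", "lw $3, 300($0)"]
print(total_cycles(len(program)))      # 7
print(instruction_in_alu(program, 4))  # lw $2, 200($0)
```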


Pipeline Control
[Figure: pipelined datapath with the control signals identified: PCSrc, RegWrite, ALUSrc, ALUOp, RegDst, Branch, MemRead, MemWrite, and MemtoReg, plus the ALU control block. The instruction fields used are Instruction [15-0], Instruction [20-16], and Instruction [15-11].]


Pipeline control
We have 5 stages. What needs to be controlled in each stage?
- Instruction Fetch and PC Increment
- Instruction Decode / Register Fetch
- Execution
- Memory Stage
- Write Back

How would control be handled in an automobile plant?
- a fancy control center telling everyone what to do?
- should we use a finite state machine?


Pipeline Control
Pass control signals along just like the data
Control lines by stage (EX = execution/address calculation, MEM = memory access, WB = write-back):

              EX stage control lines            MEM stage control lines      WB stage control lines
Instruction   RegDst  ALUOp1  ALUOp0  ALUSrc    Branch  MemRead  MemWrite    RegWrite  MemtoReg
R-format      1       1       0       0         0       0        0           1         0
lw            0       0       0       1         0       1        0           1         1
sw            X       0       0       1         0       0        1           0         X
beq           X       0       1       0         1       0        0           0         X

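[Figure: the Control unit produces the EX, M, and WB signal groups from the instruction in IF/ID. ID/EX holds WB, M, and EX; EX/MEM holds WB and M; MEM/WB holds WB.]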
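One way to picture "pass control signals along just like the data" is a lookup that produces all nine signals in ID, followed by pipeline registers that each drop the group the previous stage has used. The sketch below encodes the values from the control table; the dictionary layout and function names are assumptions made for illustration, not the hardware's actual structure.

```python
X = "X"  # don't care

# Control values per instruction class, grouped by the stage that uses them.
CONTROL = {
    "R-format": {"EX": {"RegDst": 1, "ALUOp1": 1, "ALUOp0": 0, "ALUSrc": 0},
                 "M": {"Branch": 0, "MemRead": 0, "MemWrite": 0},
                 "WB": {"RegWrite": 1, "MemtoReg": 0}},
    "lw":       {"EX": {"RegDst": 0, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1},
                 "M": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
                 "WB": {"RegWrite": 1, "MemtoReg": 1}},
    "sw":       {"EX": {"RegDst": X, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1},
                 "M": {"Branch": 0, "MemRead": 0, "MemWrite": 1},
                 "WB": {"RegWrite": 0, "MemtoReg": X}},
    "beq":      {"EX": {"RegDst": X, "ALUOp1": 0, "ALUOp0": 1, "ALUSrc": 0},
                 "M": {"Branch": 1, "MemRead": 0, "MemWrite": 0},
                 "WB": {"RegWrite": 0, "MemtoReg": X}},
}

def decode(instruction_class):
    """Control fields written into ID/EX at the end of the ID stage."""
    return {group: dict(signals)
            for group, signals in CONTROL[instruction_class].items()}

def to_ex_mem(id_ex):
    # The EX group has been consumed; only M and WB travel on.
    return {"M": id_ex["M"], "WB": id_ex["WB"]}

def to_mem_wb(ex_mem):
    # The M group has been consumed; only WB travels on.
    return {"WB": ex_mem["WB"]}

id_ex = decode("lw")
mem_wb = to_mem_wb(to_ex_mem(id_ex))
print(mem_wb)  # {'WB': {'RegWrite': 1, 'MemtoReg': 1}}
```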


Datapath with Control


[Figure: pipelined datapath with the Control unit and the WB, M, and EX control fields carried in the ID/EX, EX/MEM, and MEM/WB pipeline registers. Signals shown include PCSrc, ALUSrc, Branch, ALUOp, MemRead, and RegDst; instruction fields shown are Instruction [15-0], Instruction [20-16], and Instruction [15-11].]


Dependencies
Problem with starting next instruction before first is finished
- dependencies that go backward in time are data hazards

Time (in clock cycles) across, program execution order (in instructions) down:

                       CC 1   CC 2   CC 3   CC 4   CC 5    CC 6   CC 7   CC 8   CC 9
Value of register $2:  10     10     10     10     10/20   20     20     20     20
sub $2, $1, $3         IM     Reg    ALU    DM     Reg
and $12, $2, $5               IM     Reg    ALU    DM      Reg
or $13, $6, $2                       IM     Reg    ALU     DM     Reg
add $14, $2, $2                             IM     Reg     ALU    DM     Reg
sw $15, 100($2)                                    IM      Reg    ALU    DM     Reg
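A small sketch of how the backward-in-time dependencies in this sequence can be found without forwarding. It assumes the five-stage timing used throughout (registers read in the instruction's second cycle, written in its fifth) and that a write in the first half of a cycle is visible to a read in the second half of the same cycle, which is how the 10/20 entry in CC 5 is read here. The tuple encoding of instructions is hypothetical.

```python
# (text, destination register or None, source registers)
program = [
    ("sub $2, $1, $3",  "$2",  ["$1", "$3"]),
    ("and $12, $2, $5", "$12", ["$2", "$5"]),
    ("or $13, $6, $2",  "$13", ["$6", "$2"]),
    ("add $14, $2, $2", "$14", ["$2", "$2"]),
    ("sw $15, 100($2)", None,  ["$15", "$2"]),
]

def data_hazards(program):
    """Pairs (producer, consumer) where the consumer reads a register
    before the producer has written it back."""
    hazards = []
    for i, (text, dest, _) in enumerate(program):
        if dest is None:
            continue
        write_cc = i + 5                 # WB stage of instruction i
        for j in range(i + 1, len(program)):
            read_cc = j + 2              # ID stage of instruction j
            if dest in program[j][2] and read_cc < write_cc:
                hazards.append((text, program[j][0]))
            if program[j][1] == dest:    # a later write supersedes this one
                break
    return hazards

for producer, consumer in data_hazards(program):
    print(producer, "->", consumer)
# sub $2, $1, $3 -> and $12, $2, $5
# sub $2, $1, $3 -> or $13, $6, $2
```

Under these assumptions only `and` and `or` read the stale value; `add` reads in CC 5, the same cycle in which `sub` writes, and `sw` reads one cycle later.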


Software Solution
Have compiler guarantee no hazards
Where do we insert the nops?

sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)

Problem: this really slows us down!
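As a rough illustration of what such a compiler pass would do, the sketch below pads the sequence with nops until every consumer is at least three instructions behind its producer. That gap of two intervening instructions assumes no forwarding and a register file that writes in the first half of a cycle and reads in the second half. The instruction tuples and `insert_nops` are hypothetical.

```python
# (text, destination register or None, source registers)
program = [
    ("sub $2, $1, $3",  "$2",  ["$1", "$3"]),
    ("and $12, $2, $5", "$12", ["$2", "$5"]),
    ("or $13, $6, $2",  "$13", ["$6", "$2"]),
    ("add $14, $2, $2", "$14", ["$2", "$2"]),
    ("sw $15, 100($2)", None,  ["$15", "$2"]),
]

def insert_nops(program, gap=2):
    """Return the instruction texts with nops inserted so that at least
    `gap` instructions separate a register write from any later read."""
    out = []
    last_write = {}  # register -> position in `out` of its latest writer
    for text, dest, sources in program:
        earliest = max((last_write[s] + gap + 1
                        for s in sources if s in last_write), default=0)
        while len(out) < earliest:
            out.append("nop")
        out.append(text)
        if dest is not None:
            last_write[dest] = len(out) - 1
    return out

for line in insert_nops(program):
    print(line)
```

With these assumptions two nops land between `sub` and `and`, so five useful instructions occupy seven issue slots.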


Forwarding
Use temporary results, don't wait for them to be written
- register file forwarding to handle read/write to same register
- ALU forwarding

Time (in clock cycles) across, program execution order (in instructions) down:

                       CC 1   CC 2   CC 3   CC 4   CC 5    CC 6   CC 7   CC 8   CC 9
Value of register $2:  10     10     10     10     10/20   20     20     20     20
Value of EX/MEM:       X      X      X      20     X       X      X      X      X
Value of MEM/WB:       X      X      X      X      20      X      X      X      X
sub $2, $1, $3         IM     Reg    ALU    DM     Reg
and $12, $2, $5               IM     Reg    ALU    DM      Reg
or $13, $6, $2                       IM     Reg    ALU     DM     Reg
add $14, $2, $2                             IM     Reg     ALU    DM     Reg
sw $15, 100($2)                                    IM      Reg    ALU    DM     Reg

what if this $2 was $13?


Forwarding
The main idea (some details not shown)

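[Figure: forwarding hardware. Multiplexers controlled by ForwardA and ForwardB sit in front of the two ALU inputs, with inputs from ID/EX, EX/MEM, and MEM/WB. The forwarding unit takes Rs and Rt from ID/EX together with EX/MEM.RegisterRd and MEM/WB.RegisterRd.]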
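Among the details the diagram leaves out is the selection logic itself. The sketch below follows the usual textbook conditions: forward from EX/MEM when it will write the register being read, otherwise from MEM/WB, and never forward register 0. The RegWrite inputs and the 0b10 / 0b01 / 0b00 mux encodings are assumptions that do not appear on the slide.

```python
def forward_select(src, ex_mem_regwrite, ex_mem_rd, mem_wb_regwrite, mem_wb_rd):
    """Mux select for one ALU operand whose source register number is `src`."""
    # EX hazard: the newest value is the ALU result sitting in EX/MEM.
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == src:
        return 0b10
    # MEM hazard: use MEM/WB only when EX/MEM does not hold a newer value.
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == src:
        return 0b01
    return 0b00  # no hazard: use the value read from the register file

def forwarding_unit(id_ex_rs, id_ex_rt,
                    ex_mem_regwrite, ex_mem_rd,
                    mem_wb_regwrite, mem_wb_rd):
    forward_a = forward_select(id_ex_rs, ex_mem_regwrite, ex_mem_rd,
                               mem_wb_regwrite, mem_wb_rd)
    forward_b = forward_select(id_ex_rt, ex_mem_regwrite, ex_mem_rd,
                               mem_wb_regwrite, mem_wb_rd)
    return forward_a, forward_b

# and $12, $2, $5 in EX while sub $2, $1, $3 is in MEM:
print(forwarding_unit(2, 5, True, 2, False, 0))   # (2, 0)
# or $13, $6, $2 in EX while `and` is in MEM and `sub` is in WB:
print(forwarding_unit(6, 2, True, 12, True, 2))   # (0, 1)
```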


Can't always forward


Load word can still cause a hazard: an instruction tries to read a register following a load instruction that writes to the same register.
Time (in clock cycles) across, program execution order (in instructions) down:

                   CC 1   CC 2   CC 3   CC 4   CC 5   CC 6   CC 7   CC 8   CC 9
lw $2, 20($1)      IM     Reg    ALU    DM     Reg
and $4, $2, $5            IM     Reg    ALU    DM     Reg
or $8, $2, $6                    IM     Reg    ALU    DM     Reg
add $9, $4, $2                          IM     Reg    ALU    DM     Reg
slt $1, $6, $7                                 IM     Reg    ALU    DM     Reg

Thus, we need a hazard detection unit to stall the instruction that follows the load



Stalling
We can stall the pipeline by keeping an instruction in the same stage
[Figure: multiple-clock-cycle diagram, CC 1 to CC 10, for lw $2, 20($1); "and becomes nop" (the bubble); add $4, $2, $5; or $8, $2, $6; add $9, $4, $2. The instructions after the load are delayed by one cycle.]


Hazard Detection Unit


Stall by letting an instruction that won't write anything go forward
[Figure: pipelined datapath with the hazard detection unit and the forwarding unit. The hazard detection unit takes ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs, and IF/ID.RegisterRt; a multiplexer after the Control unit can select 0 for the control fields entering ID/EX.]
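The inputs labelled on the hazard detection unit suggest the load-use test sketched below: stall when the instruction in EX is a load whose destination (its Rt field) matches either source register of the instruction in ID. The function names are illustrative, and zeroing the control fields stands in for the multiplexer that selects 0.

```python
def load_use_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """True when the instruction in ID needs the value that the load
    currently in EX has not yet fetched from memory."""
    return bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)

def controls_into_id_ex(stall, decoded_controls):
    # On a stall every control signal is forced to 0, so the bubble
    # moves down the pipeline without writing anything.
    if stall:
        return {name: 0 for name in decoded_controls}
    return dict(decoded_controls)

# lw $2, 20($1) in EX; and $4, $2, $5 in ID (rs = 2, rt = 5)
stall = load_use_stall(True, 2, 2, 5)
print(stall)  # True
print(controls_into_id_ex(stall, {"RegWrite": 1, "MemRead": 0, "MemWrite": 0}))
```

In the full design the stall also has to hold the PC and IF/ID so the stalled instruction is decoded again in the next cycle; that part is not modelled here.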


Branch Hazards
When we decide to branch, other instructions are in the pipeline!
Time (in clock cycles) across, program execution order (in instructions) down; the number before each instruction is its address:

                      CC 1   CC 2   CC 3   CC 4   CC 5   CC 6   CC 7   CC 8   CC 9
40 beq $1, $3, 28     IM     Reg    ALU    DM     Reg
44 and $12, $2, $5           IM     Reg    ALU    DM     Reg
48 or $13, $6, $2                   IM     Reg    ALU    DM     Reg
52 add $14, $2, $2                         IM     Reg    ALU    DM     Reg
72 lw $4, 50($7)                                  IM     Reg    ALU    DM     Reg

We are predicting branch not taken
- need to add hardware for flushing instructions if we are wrong
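A short sketch of why the flush cost depends on where the branch is resolved. It assumes one instruction is fetched per cycle and that the pipeline keeps fetching sequentially (predict not taken) until the stage that decides the branch; the function name is made up.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def wrong_path_instructions(resolve_stage):
    """Instructions fetched after a taken branch before its outcome is
    known; these are the ones that have to be flushed."""
    # The branch is fetched in cycle 1 and reaches STAGES[k] in cycle k + 1,
    # so k further instructions have been fetched by then.
    return STAGES.index(resolve_stage)

print(wrong_path_instructions("MEM"))  # 3
print(wrong_path_instructions("ID"))   # 1
```

Resolving in MEM matches the diagram, where the instructions at 44, 48, and 52 enter the pipeline before the lw at 72 is fetched in CC 5. Resolving in ID would leave a single instruction to flush.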

Flushing Instructions
[Figure: datapath with an IF.Flush signal into the IF/ID register, the branch comparison and branch-target adder in the ID stage, and the hazard detection unit and forwarding unit.]

Note: we've also moved branch decision to ID stage



Branches
- If the branch is taken, we have a penalty of one cycle
- For our simple design, this is reasonable
- With deeper pipelines, penalty increases and static branch prediction drastically hurts performance
- Solution: dynamic branch prediction

[Figure: state diagram with two "Predict taken" states and two "Predict not taken" states; the transitions between them are labeled Taken and Not taken.]

A 2-bit prediction scheme
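One common way to realise a 2-bit scheme is a saturating counter, sketched below. This is an assumption about the encoding: the slide's state diagram may wire its transitions slightly differently, but the key property is the same, namely that a strongly held prediction has to be wrong twice before it flips. Class and function names are illustrative.

```python
class TwoBitPredictor:
    """States 0 and 1 predict not taken; states 2 and 3 predict taken."""

    def __init__(self, state=3):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

def count_mispredictions(outcomes, state=3):
    predictor = TwoBitPredictor(state)
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses

# A loop branch taken nine times and then not taken, run three times over.
loop = ([True] * 9 + [False]) * 3
print(count_mispredictions(loop))  # 3
```

In this made-up trace the predictor misses only on each loop exit; the single not-taken outcome does not disturb the prediction for the next run of the loop.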



Branch Prediction
Sophisticated Techniques:
- A branch target buffer to help us look up the destination
- Correlating predictors that base prediction on global behavior and recently executed branches (e.g., prediction for a specific branch instruction based on what happened in previous branches)
- Tournament predictors that use different types of prediction strategies and keep track of which one is performing best
- A branch delay slot which the compiler tries to fill with a useful instruction (make the one cycle delay part of the ISA)

Branch prediction is especially important because it enables other more advanced pipelining techniques to be effective!

Modern processors predict correctly 95% of the time!


Improving Performance
Try and avoid stalls! E.g., reorder these instructions:

lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1)
sw $t0, 4($t1)
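To see what reordering buys, the sketch below counts load-use stalls in a sequence, assuming the simple pipeline stalls one cycle whenever an instruction uses a register loaded by the instruction immediately before it. Swapping the two stores is one possible reordering, safe here because they write different addresses. The tuple encoding and function name are hypothetical.

```python
# (opcode, destination register or None, source registers)
original = [
    ("lw", "$t0", ["$t1"]),
    ("lw", "$t2", ["$t1"]),
    ("sw", None,  ["$t2", "$t1"]),
    ("sw", None,  ["$t0", "$t1"]),
]
# Same four instructions with the two stores swapped.
reordered = [original[0], original[1], original[3], original[2]]

def load_use_stalls(program):
    """Count adjacent pairs where a load is immediately followed by an
    instruction that reads the loaded register."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        curr_sources = curr[2]
        if prev_op == "lw" and prev_dest in curr_sources:
            stalls += 1
    return stalls

print(load_use_stalls(original))   # 1
print(load_use_stalls(reordered))  # 0
```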

Dynamic Pipeline Scheduling
- Hardware chooses which instructions to execute next
- Will execute instructions out of order (e.g., doesn't wait for a dependency to be resolved, but rather keeps going!)
- Speculates on branches and keeps the pipeline full (may need to rollback if prediction incorrect)

Trying to exploit instruction-level parallelism


Advanced Pipelining
- Increase the depth of the pipeline
- Start more than one instruction each cycle (multiple issue)
- Loop unrolling to expose more ILP (better scheduling)
- Superscalar processors
  - DEC Alpha 21264: 9 stage pipeline, 6 instruction issue
- All modern processors are superscalar and issue multiple instructions usually with some limitations (e.g., different pipes)
- VLIW: very long instruction word, static multiple issue (relies more on compiler technology)

This class has given you the background you need to learn more!


Chapter 6 Summary
Pipelining does not improve latency, but does improve throughput

[Figure: two charts, each placing Single-cycle (Section 5.4), Multicycle (Section 5.5), Pipelined, Deeply pipelined, Multiple-issue pipelined (Section 6.9), and Multiple issue with deep pipeline (Section 6.10). Surviving axis labels: Slower to Faster; Instructions per clock (IPC = 1/CPI); Use latency in instructions, from 1 to Several.]
