SRM Pipelining 05

Chapter 5 discusses pipelining in CPU architecture, highlighting its impact on performance through parallel execution of instructions. It covers various hazards that can occur during pipelining, including structure, data, and control hazards, and presents solutions such as interlocking and forwarding to mitigate these issues. The chapter also emphasizes the importance of instruction set architecture (ISA) design in facilitating efficient pipelining.

Chapter 5: Pipelining

Introduction
• CPU performance factors
– Instruction count
• Determined by ISA and compiler
– CPI and Cycle time
• Determined by CPU hardware
• We will examine two CPU implementations
– A simplified version
– A more realistic and pipelined version
• A simple subset that shows the most important aspects
– Memory reference: ld/lw, sd/sw
– Arithmetic-logical: add, sub, and, and or
– Conditional branch: beq (branch if equal)

§4.5 An Overview of Pipelining
Pipelining Analogy
• Laundry example: Ann, Brian, Cathy, and Dave
  each have one load of clothes
  to wash, dry, fold, and put away
  – Washer takes 30 minutes
  – Dryer takes 30 minutes
  – "Folder" takes 30 minutes
  – "Putter" takes 30 minutes
• One load: 120 minutes
Pipelining: It's Natural!
• Pipelined laundry: overlapping execution
  – Parallelism improves performance
■ Four loads: speedup = 8/3.5 = 2.3
■ Non-stop: speedup = 2n/(0.5n + 1.5) ≈ 4 = number of stages

Important to note:
Each load still takes 120 minutes.
The improvement is in throughput for the 4 loads.
It is more complicated if stages take different
amounts of time.
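The two speedup figures above can be checked with a short calculation. This Python sketch is illustrative only: the function name and the choice to express the 30-minute stages in hours are my own.

```python
# Laundry pipelining: 4 stages (wash, dry, fold, put away),
# each 0.5 h, so one load takes 2 h sequentially.
def speedup(n_loads, n_stages=4, stage_hours=0.5):
    sequential = n_loads * n_stages * stage_hours             # 2n hours
    # Pipelined: first load takes all stages, then one more load
    # finishes every stage time -> 0.5n + 1.5 hours for 4 stages.
    pipelined = n_stages * stage_hours + (n_loads - 1) * stage_hours
    return sequential / pipelined

print(round(speedup(4), 1))       # -> 2.3  (= 8 / 3.5)
print(round(speedup(10**6), 3))   # -> 4.0  (approaches the stage count)
```

For large n the ratio 2n / (0.5n + 1.5) tends to 4, matching the "speedup ≈ number of stages" claim.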
RISC-V Pipeline
Five stages, one step per stage
1. IF: Instruction Fetch from memory
2. ID: Instruction Decode & register read
3. EX: Execute operation or calculate address
4. MEM: Access memory operand
5. WB: Write result Back to register

Graphical Representation of Instruction Pipeline

• IF: Instruction Fetch from memory


– Box representing instruction memory
– Right-half shade representing usage of IM at the second half of the cycle
• ID: Instruction Decode & register read
– Box representing register
– Right-half shade representing usage (read) of Register at the second half of the
cycle
• EX: Execute operation or calculate address
– Shade representing usage
• MEM: Access memory operand (only for load/store)
– White background representing NOT used by add instruction in this example
• WB: Write result Back to register (only for load and AL instructions)
– Box representing register
– Left-half shade representing write to register at the first half of the cycle
Classic 5-Stage Pipeline for a RISC
• In each cycle, hardware
initiates a new instruction
and executes some part of
five different instructions:
– Simple

Clock number
Instruction number 1 2 3 4 5 6 7 8 9
Instruction i IF ID EX MEM WB
Instruction i+1 IF ID EX MEM WB
Instruction i+2 IF ID EX MEM WB
Instruction i+3 IF ID EX MEM WB
Instruction i+4 IF ID EX MEM WB
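The staircase shape of the table above can be generated mechanically. The Python sketch below (names are my own) prints one row per instruction, assuming a classic 5-stage pipeline with no stalls.

```python
# Prints a staircase pipeline diagram: instruction i occupies
# stage s in cycle i + s (1-indexed cycles in the header).
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def diagram(n_instructions):
    n_cycles = n_instructions + len(STAGES) - 1
    header = "instr " + " ".join(f"{c:>4}" for c in range(1, n_cycles + 1))
    rows = [header]
    for i in range(n_instructions):
        cells = [""] * n_cycles
        for s, name in enumerate(STAGES):
            cells[i + s] = name          # instruction i, stage s, cycle i+s
        rows.append(f"i+{i:<4}" + " ".join(f"{c:>4}" for c in cells))
    return "\n".join(rows)

print(diagram(5))
```

Five instructions occupy 5 + 5 - 1 = 9 cycles, matching the clock numbers 1–9 in the table.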
Pipeline Performance

• Assume time for stages is


– 100ps for register read or write
– 200ps for other stages
• Compare pipelined datapath with single-cycle datapath

Pipeline Performance
• Single-cycle (Tc = 800ps): 2400ps for three instructions
vs
• Pipelined (Tc = 200ps): 1400ps for three instructions
• For a large number of instructions, say 1M, the speedup is
  ~= 800ps/200ps = 4
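A quick sanity check of these numbers, assuming the 800 ps / 200 ps cycle times above and five pipeline stages (function names are my own):

```python
# Total execution time for n instructions on each datapath.
def single_cycle(n, tc=800):
    return n * tc                      # one 800 ps cycle per instruction

def pipelined(n, tc=200, stages=5):
    # First instruction takes `stages` cycles; each later one adds 1.
    return (stages + n - 1) * tc

print(single_cycle(3))                       # -> 2400 (ps)
print(pipelined(3))                          # -> 1400 (ps)
n = 10**6
print(single_cycle(n) / pipelined(n))        # close to 4 (= 800/200)
```

The pipeline fill time (four extra cycles) is why three instructions see only 2400/1400 ≈ 1.7x, while a million instructions approach the full 4x.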
Pipeline Speedup
• We execute billions of instructions, so throughput is what matters
• Pipelining doesn't help the latency of a single instruction
• Potential speedup = number of pipeline stages
• Unbalanced lengths of pipeline stages reduce the speedup


Pipelining and ISA Design
• RISC ISA designed for pipelining
– All instructions are 32-bits
• Easier to fetch and decode in one cycle
• c.f. x86: 1- to 17-byte instructions

– Few and regular instruction formats


• Can decode and read registers in one step

– Load/store addressing
• Can calculate address in 3rd stage, access memory in 4th stage

– Alignment of memory operands


• Memory access takes only one cycle

Hazards

• Situations that prevent starting the next instruction in the
  next cycle

• Structural hazards
  – A required resource is busy
• Data hazards
  – Need to wait for a previous instruction to complete its data
    read/write
• Control hazards (caused by a branch or jump)
  – Deciding on the control action depends on a previous instruction
Structural Hazards
• Conflict for use of a resource
– Find a situation in laundry example?
• In the RISC-V pipeline with a single memory,
  IF and MEM conflict
  – A load/store requires a memory access
  – An instruction fetch would have to stall for
    that cycle
• The stall causes a pipeline "bubble"
• Hence, pipelined datapaths require
  separate instruction/data memories
  – Or separate instruction/data caches
One Memory Port → Structural Hazard
(Figure: pipeline diagram over cycles 1–7. A load followed by
Instr 1–4, each flowing through Ifetch, Reg, ALU, DMem, Reg.
With a single memory port, the load's DMem access in cycle 4
conflicts with Instr 3's Ifetch in the same cycle.)
One Memory Port / Structural Hazard
(Figure: the same pipeline diagram with only one memory.
Instr 3's Ifetch stalls for one cycle, inserting a bubble that
propagates through every stage of the pipeline.)

How do you "bubble" the pipe? → Insert a no-op
Summary of Structural Hazards
• To address a structural hazard, provide separate memories for
  instructions and data
• However, this increases cost
  – E.g., pipelining functional units or duplicating resources is
    expensive
• If the structural hazard is rare, it may not be
  worth the cost to avoid it
Data Hazards
• An instruction needs data produced
by a previous instruction
– Read-After-Write (RAW) data dependency
add x1, x2, x3
sub x4, x1, x5

– sub would read the old value of x1 at cycle 3, before add
  writes the new value back in cycle 5
Data Hazards and Solution #1: Interlocking
• An instruction needs data produced by a previous instruction
– Read-After-Write (RAW) data dependency
add x1, x2, x3
sub x4, x1, x5
• Interlock: hardware detects the dependency and
  – Inserts no-op instructions (e.g., "add x0, x0, x0") as bubbles
  – Wastes 400ps: two bubbles in between, since sub must wait
    two stages for add to write its result x1 to the register file

add x1, x2, x3
(two-cycle delay on x1)
sub x4, x1, x5


Solution #2: Forwarding (aka Bypassing)
• Use the result right after it is computed, instead of waiting for it to be
  stored in a register
– add produces the result at the end of its EXE stage
– sub uses the result at the beginning of its EXE stage, which is right after the cycle
for add’s EXE
– Requires extra connections in the datapath

Load-Use Data Hazard
• A load produces its result only after the MEM stage
  – sub would use the result at the beginning of its EXE stage, which is
    in the same cycle as the load's MEM, so forwarding is not possible
• Can't avoid the stall by forwarding for load-use
  – The value is not yet computed when it is needed
  – Can't forward backward in time!

One cycle delay!

Code Scheduling to Avoid Stalls (Software
Solution)
• Reorder code to avoid use of load result in the next
instruction
• C code for a = b + e; c = b + f;
Before (13 cycles):      After scheduling (11 cycles):
ld  x1, 0(x31)           ld  x1, 0(x31)
ld  x2, 8(x31)           ld  x2, 8(x31)
(stall)                  ld  x4, 16(x31)
add x3, x1, x2           add x3, x1, x2
sd  x3, 24(x31)          sd  x3, 24(x31)
ld  x4, 16(x31)          add x5, x1, x4
add x5, x1, x4           sd  x5, 32(x31)
(stall)
sd  x5, 32(x31)
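The 13-vs-11 cycle counts can be reproduced by counting load-use stalls. The Python sketch below assumes full forwarding, so only an instruction immediately following a load that produces one of its sources costs a stall; the (op, dest, sources) tuple encoding is a hypothetical representation, not a real assembler.

```python
def cycles(prog, stages=5):
    """Cycle count for a 5-stage pipeline with full forwarding,
    where only a load-use pair costs one stall cycle.
    prog: list of (op, dest, sources) tuples."""
    stalls = 0
    for prev, cur in zip(prog, prog[1:]):
        if prev[0] == "ld" and prev[1] in cur[2]:   # load feeding next instr
            stalls += 1
    # n instructions + (stages - 1) fill cycles + stall cycles
    return len(prog) + (stages - 1) + stalls

before = [
    ("ld", "x1", []), ("ld", "x2", []),
    ("add", "x3", ["x1", "x2"]), ("sd", None, ["x3"]),
    ("ld", "x4", []),
    ("add", "x5", ["x1", "x4"]), ("sd", None, ["x5"]),
]
# Rescheduled: hoist the third load above the first add.
after = [before[0], before[1], before[4],
         before[2], before[3], before[5], before[6]]
print(cycles(before), cycles(after))   # -> 13 11
```

Moving ld x4 up separates each load from its consumer by one instruction, which forwarding can cover, removing both stalls.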

To Check Cycles Delayed and How Forward Works
in Different Cases
• In the 5-stage pipeline, check whether the results can be
generated before it is being used
– If so, forwarding
– If not, stall
• Load-Use
• Produce-Store
– sw rs2, offset(rs1)
• sw needs rs1 to be ready at the EXE stage
• sw needs rs2 to be ready at the MEM stage
Case 1: EXE-to-EXE forwarding     Case 2: EXE-to-MEM forwarding
add x9, x7, x8                    add x9, x7, x8
sw  x10, 32(x9)                   sw  x9, 32(x31)
(x9 needed as address in EXE)     (x9 needed as store data in MEM)
2-cycle delay if no forwarding    2-cycle delay if no forwarding
No delay with forwarding          No delay with forwarding
Control Hazards
• Branch determines flow of control
– Fetching next instruction depends on branch outcome
– Pipeline might fetch an incorrect instruction in the cycle after a
  beq instruction is fetched
• The pipeline is still working on the ID stage of the branch

• In RISC-V pipeline
– Need to compare registers and compute target early in the
pipeline
– Add hardware to do it in ID stage

Stall on Branch
• Wait until branch outcome determined before fetching next
instruction
– One cycle stall (bubble) if branch condition is determined at ID
stage
– Two cycles stall if branch condition is determined at EXE stage

One cycle delay!

Branch Prediction
• Longer pipelines can’t readily determine branch outcome
early
– Stall penalty becomes unacceptable
• Predict outcome of branch
– Only stall if prediction is wrong
• In pipeline
– Can predict branches not taken
– Fetch instruction after branch, with no delay

RISC-V with Predict Not Taken
(Figure: two pipeline diagrams. When the prediction is correct,
the instruction after the branch completes with no penalty; when
it is incorrect, the wrongly fetched instruction is flushed and
the target instruction is fetched instead.)
More-Realistic Branch Prediction
• Static branch prediction
– Based on typical branch behavior
– Example: loop and if-statement branches
• Predict backward branches taken
• Predict forward branches not taken

• Dynamic branch prediction


– Hardware measures actual branch behavior
• e.g., record recent history of each branch
– Assume future behavior will continue the trend
• When wrong, stall while re-fetching, and update history

Pipeline Summary

The BIG
Picture
• Pipelining improves performance by increasing instruction
throughput
– Executes multiple instructions in parallel
– Each instruction has the same latency
• Subject to hazards
– Structure, data, control
• Instruction set design affects complexity of pipeline
implementation

Pipeline Execution Diagram: Steps
1. Identify RAW dependencies between two instructions that are adjacent or
   have one instruction in between
   – AL-use: 2-cycle delay without forwarding, no delay with forwarding
   – Load-use: 2-cycle delay without forwarding, 1-cycle delay with forwarding
     • With forwarding, rescheduling the load can eliminate even that 1-cycle
       delay
   – No need to look for RAW dependencies between instructions that are far
     apart (>= 2 instructions in between)
     • Thus only check pairs of instructions that execute one after another or
       have one other instruction in between
2. Identify branch instructions
   – 1-cycle delay (or 2-cycle delay) depending on the implementation (question)
3. Pipeline diagrams (4 situations)
   – a) No pipeline at all: one cycle per stage, no overlap
   – b) Pipeline with no forwarding: 2-cycle delay for AL-use, load-use, and
     beq (EXE outcome)
   – c) Pipeline with forwarding: 1-cycle delay for load-use, 2-cycle delay
     for beq
   – d) Pipeline with forwarding and load-use rescheduling: reschedule
     instructions to eliminate the 1-cycle load-use delay
• No two instructions can be in the same stage in the same cycle
   – That would be a structural hazard

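Step 1 above can be mechanized: scan each pair of instructions that are adjacent or one apart and report the register they share. A Python sketch follows; the (op, dest, sources) tuple encoding is hypothetical, and intervening redefinitions of a register are ignored for simplicity.

```python
def raw_hazards(prog):
    """Find RAW dependencies that can delay a 5-stage pipeline:
    only pairs that are adjacent or have one instruction between
    them matter (step 1 above). prog: list of (op, dest, sources).
    Returns (writer_idx, reader_idx, register, is_load_use)."""
    hazards = []
    for i, (op, dest, _) in enumerate(prog):
        if dest is None:                       # e.g. stores write no register
            continue
        for j in range(i + 1, min(i + 3, len(prog))):   # distance 1 or 2 only
            if dest in prog[j][2]:
                hazards.append((i, j, dest, op == "lw"))
    return hazards

# Fragment of the loop body from the example above:
prog = [
    ("add", "x7", ["x22", "x6"]),
    ("lw",  "x9", ["x7"]),
    ("lw",  "x10", ["x7"]),
    ("add", "x9", ["x10", "x9"]),
]
for i, j, reg, load_use in raw_hazards(prog):
    print(i, j, reg, "load-use" if load_use else "AL-use")
```

On this fragment the checker reports the same dependencies as the table on the next slide: x7 feeding both loads, and two load-use hazards into the add.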
for (i=1; i<M-1; i++) B2[i] = B[i-1] + B[i] + B[i+1];
• Base address B and B2 are in register x22 and x23. i is stored in
register x5, M is stored in x4. Using beq (==) for (<)
add x5, x0, 1 // i = 1
add x22, x4, -1 // loop bound x22 has M-1
LOOP: beq x5, x22, Exit
slliw x6, x5, 2 // x6 now stores i*4; slliw is i << 2 (shift left logical)
add x7, x22, x6 // x7 now stores address of B[i].
lw x9, 0(x7) // load B[i] from memory location (x7+0) to x9
lw x10, -4(x7) // load B[i-1] to x10
add x9, x10, x9 // x9 = B[i] + B[i-1]
lw x10, 4(x7) //load B[i+1] to x10
add x9, x10, x9 // x9 = B[i-1] + B[i] + B[i+1]
add x8, x23, x6 // x8 now stores the address of B2[i]
sw x9, 0(x8) // store value for B2[i] from register x9 to memory (x8+0)
addi x5, x5, 1 // i++
beq x0, x0, LOOP
Exit:
for (i=1; i<M-1; i++) B2[i] = B[i-1] + B[i] + B[i+1];
• Base address B and B2 are in register x22 and x23. i is stored in
register x5, M is stored in x4. Using beq (==) for (<)
1. add x5, x0, 1
2. add x22, x4, -1
3. LOOP: beq x5, x22, Exit
4. slliw x6, x5, 2
5. add x7, x22, x6
6. lw x9, 0(x7)
7. lw x10, -4(x7)
8. add x9, x10, x9
9. lw x10, 4(x7)
10. add x9, x10, x9
11. add x8, x23, x6
12. sw x9, 0(x8)
13. addi x5, x5, 1
14. beq x0, x0, LOOP
15. Exit:

RAW Dependencies
Writer               Reader              Register  # instr. between  Load-use
add x5, x0, 1        beq x5, x22, Exit   x5        1
add x22, x4, -1      beq x5, x22, Exit   x22       0
slliw x6, x5, 2      add x7, x22, x6     x6        0
add x7, x22, x6      lw x9, 0(x7)        x7        0
add x7, x22, x6      lw x10, -4(x7)      x7        1
lw x9, 0(x7)         add x9, x10, x9     x9        1                 Y
lw x10, -4(x7)       add x9, x10, x9     x10       0                 Y
lw x10, 4(x7)        add x9, x10, x9     x10       0                 Y
add x9, x10, x9      sw x9, 0(x8)        x9        1
add x8, x23, x6      sw x9, 0(x8)        x8        0
addi x5, x5, 1       beq x5, x22, Exit   x5        1
Examples
Writer               Reader              Register  # instr. between  Load-use
add x5, x0, 1        beq x5, x22, Exit   x5        1
add x22, x4, -1      beq x5, x22, Exit   x22       0
slliw x6, x5, 2      add x7, x22, x6     x6        0
add x7, x22, x6      lw x9, 0(x7)        x7        0
add x7, x22, x6      lw x10, -4(x7)      x7        1
lw x9, 0(x7)         add x9, x10, x9     x9        1                 Y
lw x10, -4(x7)       add x9, x10, x9     x10       0                 Y
lw x10, 4(x7)        add x9, x10, x9     x10       0                 Y
add x9, x10, x9      sw x9, 0(x8)        x9        1
add x8, x23, x6      sw x9, 0(x8)        x8        0
addi x5, x5, 1       beq x5, x22, Exit   x5        1
§4.9 Exceptions
Exceptions and Interrupts

• “Unexpected” events requiring change


in flow of control
– Different ISAs use the terms differently
• Exception
– Arises within the CPU
• e.g., undefined opcode, overflow, syscall, …

• Interrupt
– From an external I/O controller
• Dealing with them without sacrificing performance is
hard

Handling Exceptions
• In MIPS, exceptions managed by a System Control
Coprocessor (CP0)
• Save PC of offending (or interrupted) instruction
– In MIPS: Exception Program Counter (EPC)
• Save indication of the problem
– In MIPS: Cause register
– We’ll assume 1-bit
• 0 for undefined opcode, 1 for overflow
• Jump to handler at 8000 0180

An Alternate Mechanism
• Vectored Interrupts
– Handler address determined by the cause
• Example:
– Undefined opcode: C000 0000
– Overflow: C000 0020
– …: C000 0040
• Instructions either
– Deal with the interrupt, or
– Jump to real handler

Handler Actions
• Read cause, and transfer to relevant handler
• Determine action required
• If restartable
– Take corrective action
– use EPC to return to program
• Otherwise
– Terminate program
– Report error using EPC, cause, …

Exceptions in a Pipeline
• Another form of control hazard
• Consider overflow on add in EX stage
add $1, $2, $1
– Prevent $1 from being clobbered
– Complete previous instructions
– Flush add and subsequent instructions
– Set Cause and EPC register values
– Transfer control to handler
• Similar to mispredicted branch
– Use much of the same hardware

1-Bit Predictor: Shortcoming
• Inner loop branches mispredicted twice!
outer: …

inner: …

beq …, …, inner

beq …, …, outer

■ Mispredict as taken on the last iteration of the inner loop
■ Then mispredict as not taken on the first iteration of the
  inner loop the next time around
2-Bit Predictor
• Only change prediction on two successive mispredictions
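A minimal model of the 2-bit saturating counter, assuming states 0–1 predict not-taken and 2–3 predict taken (the class name and state encoding are my own):

```python
class TwoBitPredictor:
    """2-bit saturating counter: two successive mispredictions
    are needed before the prediction flips."""
    def __init__(self, state=0):
        self.state = state            # 0 = strongly NT ... 3 = strongly T

    def predict(self):
        return self.state >= 2        # True = predict taken

    def update(self, taken):
        if taken:
            self.state = min(self.state + 1, 3)
        else:
            self.state = max(self.state - 1, 0)

# Inner loop: branch taken 3 times then falls through, run twice.
p = TwoBitPredictor(state=3)
mispredicts = 0
for _ in range(2):
    for taken in [True, True, True, False]:
        if p.predict() != taken:
            mispredicts += 1
        p.update(taken)
print(mispredicts)    # -> 2
```

Only the two loop-exit branches mispredict; a 1-bit predictor on the same pattern would mispredict four times (exit and re-entry of each pass), which is the shortcoming the previous slide describes.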

Calculating the Branch Target
• Even with predictor, still need to calculate the target address
– 1-cycle penalty for a taken branch
• Branch target buffer
– Cache of target addresses
– Indexed by PC when instruction fetched
• If hit and instruction is branch predicted taken, can fetch target
immediately
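The branch target buffer can be pictured as a small direct-mapped cache indexed by low bits of the PC. This Python sketch is illustrative only: the entry count, indexing scheme, and class name are assumptions, not a real design.

```python
class BranchTargetBuffer:
    """Direct-mapped cache of branch target addresses, indexed by
    the PC when the instruction is fetched."""
    def __init__(self, entries=16):
        self.entries = entries
        self.table = {}               # index -> (tag = full PC, target)

    def lookup(self, pc):
        """Return the cached target on a hit, else None (miss)."""
        index = (pc >> 2) % self.entries   # drop byte offset, take low bits
        entry = self.table.get(index)
        if entry and entry[0] == pc:       # tag check against full PC
            return entry[1]
        return None

    def insert(self, pc, target):
        self.table[(pc >> 2) % self.entries] = (pc, target)

btb = BranchTargetBuffer()
btb.insert(0x1000, 0x2000)
print(hex(btb.lookup(0x1000) or 0))   # -> 0x2000 (hit: fetch target now)
print(btb.lookup(0x1004))             # -> None (miss: fetch fall-through)
```

On a hit for a branch predicted taken, the fetch stage can redirect to the cached target in the very next cycle, avoiding the 1-cycle taken-branch penalty.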

Dynamic Branch Prediction

• In deeper and superscalar pipelines, branch penalty is


more significant
• Use dynamic prediction
– Branch prediction buffer (aka branch history table)
– Indexed by recent branch instruction addresses
– Stores outcome (taken/not taken)
– To execute a branch
• Check table, expect the same outcome
• Start fetching from fall-through or target
• If wrong, flush pipeline and flip prediction

