5 Pipelining

The document discusses pipelining in computer processors. It describes the concept of pipelining including its benefits and challenges. It provides examples of pipelined execution and calculations of speedup from pipelining. It also discusses different types of hazards that can occur in pipelined systems including structural, data, and control hazards.

Uploaded by kholood badea

Chapter 4

Pipelining
Basic Computer Architecture
What is computer architecture?
- Instruction set architecture: what to do
- Computer organization: how to do it (datapath and control)

Chapter 4 — The Processor — 2


Instruction Set Architecture
Instruction set architecture for MIPS (the subset used in this chapter):
- Arithmetic-logical instructions: add R3, R2, R1
- Data transfer instructions: lw R2, offset(R1) and sw R2, offset(R1)
- Branch instructions: beq rs, rt, offset


Computer Organization
Sequential (single-cycle) execution:
- One instruction is fetched from instruction memory
- All the steps in its execution are completed
- Only then is the next instruction fetched
In other words, a new instruction cannot be fetched from memory until the previous instruction has completed its execution (instruction 1, then instruction 2, then instruction 3, one after another).


Single Cycle
Why a single-cycle implementation is not used today:
- Inefficient: every instruction gets a single cycle of the same length
- The longest instruction path determines the clock cycle
- CPI is 1, but the clock cycle is so long that overall performance is poor
We need another implementation technique that is more efficient and has higher throughput: pipelining.


Pipelining
- The next instruction is fetched from memory before the previous instruction has completed its execution
- In other words, instruction execution is overlapped
Why is pipelining used? To improve performance (higher throughput).


Pipelining: The Laundry Analogy
Each load of laundry goes through four stages, each taking 30 min.
- Original (sequential): each load takes 2 h, so 4 loads take 8 h
- Improved (pipelined): a new load enters a stage as soon as it is free, so 4 loads take 2 h + 3 × 0.5 h = 3.5 h
- Speedup for four loads = 8 / 3.5 ≈ 2.3
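The laundry arithmetic above can be sketched in a few lines (a minimal sketch, not from the slides; the variable names are my own):

```python
# 4 stages of 0.5 h each, 4 loads of laundry.
stages, stage_h, loads = 4, 0.5, 4

sequential_h = loads * stages * stage_h       # each load runs alone
pipelined_h = (stages + loads - 1) * stage_h  # loads overlap stage by stage

print(sequential_h)                # 8.0
print(pipelined_h)                 # 3.5
print(sequential_h / pipelined_h)  # speedup ≈ 2.29
```

The pipelined term `(stages + loads - 1)` says: the first load fills the pipeline in 4 stage-times, then each remaining load finishes one stage-time later.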
Pipelining vs Performance
What is performance?
- Latency (response time): how long it takes to do a single task
- Throughput: total work done (all tasks) per unit time
Does pipelining increase latency or throughput? Only throughput.


MIPS Pipeline
Five stages, one step per stage:
1. IF (Fetch): instruction fetch from memory
2. ID (Decode): instruction decode & register read
3. EX (Execute): execute operation or calculate address
4. MEM (Memory): access data memory
5. WB (Writeback): write result back to register
Focus on 8 instructions: lw, sw, add, sub, AND, OR, slt, beq


Graphical Representation of MIPS 5-stage Pipeline
(Figure: register read is drawn as right-half shading and register write as left-half shading around the ALU; the memory box has a white background for add because add does not access memory.)


Pipeline Performance
Assume that:
- Time for register read and write is 100 ps
- Time for any other stage is 200 ps

Inst.                             | Inst. fetch | Reg. read | ALU op | Mem. access | Reg. write | Tot. time
lw                                | 200 ps      | 100 ps    | 200 ps | 200 ps      | 100 ps     | 800 ps
sw                                | 200 ps      | 100 ps    | 200 ps | 200 ps      |            | 700 ps
R-format (add, sub, AND, OR, slt) | 200 ps      | 100 ps    | 200 ps |             | 100 ps     | 600 ps
Branch (beq)                      | 200 ps      | 100 ps    | 200 ps |             |            | 500 ps


Pipeline Performance
Single-cycle (clock cycle time Tc = 800 ps):
- Time between the 1st and 4th instructions = 3 × 800 = 2400 ps
Pipelined (Tc = 200 ps):
- Time between the 1st and 4th instructions = 3 × 200 = 600 ps
The clock cycle design must allow for the slowest case: the slowest instruction (800 ps for lw; see the table above) in the single-cycle design, and the slowest stage (200 ps) in the pipelined design.


Pipeline Performance
With a 5-stage pipeline, executing n instructions with no stalls takes 5 + (n − 1) clock cycles:
- 1 instruction: CC = 5
- 5 instructions: CC = 9
- 100 instructions: CC = 104
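The cycle counts above follow one formula, sketched here (helper name is my own, not from the slides):

```python
# Cycles to run n instructions through a k-stage pipeline with no stalls:
# the first instruction takes k cycles to drain through all stages, and
# each later instruction completes exactly one cycle after the previous.
def pipeline_cc(n, k=5):
    return k + (n - 1)

print(pipeline_cc(1))    # 5
print(pipeline_cc(5))    # 9
print(pipeline_cc(100))  # 104
```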


Example 1
- Consider a non-pipelined machine with 8 execution stages, each 20 ns long.
- The time between two instructions: 20+20+20+20+20+20+20+20 = 160 ns
- Suppose we introduce pipelining on this machine.
- The time between two instructions becomes 20 ns
- The speedup obtained from pipelining: 160 / 20 = 8


Example 2
- Consider a non-pipelined machine with 10 execution stages of lengths 10, 20, 20, 30, 10, 10, 50, 45, 20, 10 ns.
- The time between two instructions on this machine: 10+20+20+30+10+10+50+45+20+10 = 225 ns
- Suppose we introduce pipelining on this machine.
- The time between two instructions becomes 50 ns (the clock cycle design must allow for the slowest stage)
- Speedup = 225 / 50 = 4.5


Pipeline Speedup

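The slide's figure is not reproduced here; the standard textbook relation it illustrates, for ideally balanced stages, is:

```latex
\text{Time between instructions}_{\text{pipelined}}
  = \frac{\text{Time between instructions}_{\text{nonpipelined}}}{\text{Number of pipe stages}}
```

With unbalanced stages, the pipelined time is set by the slowest stage instead, so the real speedup is less than the number of stages (as in Example 2).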


Example 3
- In a non-pipelined machine the time between instructions is 200 ns. Suppose we use pipelining with four balanced stages. What is the time between instructions after pipelining?
- Time between instructions after pipelining = 200 / 4 = 50 ns
- Notice: speedup = 200 / 50 = 4 = number of pipeline stages
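All three examples reduce to one computation, sketched below (the helper name is my own, not from the slides): non-pipelined time is the sum of the stage times, pipelined time is the longest stage.

```python
# Speedup from pipelining, one instruction issued per clock: the
# pipelined clock must fit the slowest stage, so
# speedup = (sum of stage times) / (longest stage time).
def pipeline_speedup(stage_ns):
    return sum(stage_ns) / max(stage_ns)

print(pipeline_speedup([20] * 8))              # Example 1: 8.0
print(pipeline_speedup([10, 20, 20, 30, 10,
                        10, 50, 45, 20, 10]))  # Example 2: 4.5
print(pipeline_speedup([50] * 4))              # Example 3: 4.0
```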


Pipelining and ISA Design
The MIPS ISA was designed for pipelining:
- All instructions are 32 bits: easier to fetch and decode in one cycle (cf. x86: 1- to 17-byte instructions)
- Few and regular instruction formats: can decode and read registers in one step
- Load/store addressing: can calculate the address in the 3rd stage and access memory in the 4th stage
- Alignment of memory operands: a memory access takes only one cycle


Pipelining Hazards
- There are situations in pipelining when the next instruction cannot execute in the following clock cycle. These events are called hazards.
- In other words, any condition that causes a pipeline to stall is called a hazard.
- There are three types of hazards:
  - Structural hazards: a required resource is busy
  - Data hazards: need to wait for a previous instruction to complete its data read/write
  - Control hazards: deciding on a control action depends on a previous instruction


Structural Hazards
- Due to a conflict over the use of a resource: a required resource is busy (e.g. using a washer-dryer combination in the laundry analogy)
- Assume a MIPS pipeline with a single memory:
  - A load/store requires a data access
  - The instruction fetch would have to stall (wait) for that cycle
  - This would cause a pipeline "bubble"
- Hence, pipelined datapaths require separate instruction and data memories




Data Hazards
- Data hazards arise from the dependence of one instruction on an earlier one that is still in the pipeline
- The pipeline must stall (wait) for the previous instruction to complete its data read/write:

  add $s0, $t0, $t1
  sub $t2, $s0, $t3

(Figure: the value of $s0 is written back only in the final stage of add, but sub needs it in its register-read stage, where it is not yet available, so sub must wait until it becomes available.)
Data Hazards: Example
- An instruction depends on completion of data access by a previous instruction:

  add $s0, $t0, $t1
  sub $t2, $s0, $t3

- To resolve this hazard: (1) stall until the hazard is resolved (but this hurts CPI), since the result is written back only in the fifth stage


Data Hazards: (2) Forwarding (Bypassing)
- Use the result as soon as it is computed; don't wait for it to be stored in a register
- Requires extra connections in the datapath (a hardware solution)
- Valid only if the destination stage is later in time than the source stage
- Can't prevent all pipeline stalls


Load-Use Data Hazard
- Can't always avoid stalls by forwarding
- If the value is not yet computed when it is needed, forwarding can't help: we can't forward backward in time!


Load-Use Data Hazard
- So we have to stall for one cycle on a load-use data hazard


(3) Code Scheduling to Avoid Stalls
- Reorder code to avoid using a load result in the next instruction (a software solution)
- C code for A = B + E; C = B + F; (forwarding is adopted here):

        lw  $t1, 0($t0)
        lw  $t2, 4($t0)
stall → add $t3, $t1, $t2
        sw  $t3, 12($t0)
        lw  $t4, 8($t0)
stall → add $t5, $t1, $t4
        sw  $t5, 16($t0)

13 cycles


(3) Code Scheduling to Avoid Stalls
- Reordering the code (moving the third lw up) removes both load-use stalls; forwarding handles the remaining dependences:

lw  $t1, 0($t0)
lw  $t2, 4($t0)
lw  $t4, 8($t0)
add $t3, $t1, $t2
sw  $t3, 12($t0)
add $t5, $t1, $t4
sw  $t5, 16($t0)

11 cycles instead of 13


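The 13-vs-11-cycle counts can be reproduced with a small model (a sketch, not from the slides; instruction tuples and the helper name are my own, and source lists are simplified to the registers that matter for load-use detection):

```python
def pipeline_cycles(program, stages=5):
    """Cycle count for a 5-stage pipeline with forwarding: the only
    remaining stall is the load-use hazard, one bubble whenever an
    instruction uses a register loaded by the instruction just before it."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        op, dest, _ = prev
        if op == "lw" and dest in cur[2]:  # cur[2] = source registers
            stalls += 1
    return len(program) + (stages - 1) + stalls

# (op, destination, sources) -- base register $t0 omitted for brevity.
original = [
    ("lw",  "$t1", ()),
    ("lw",  "$t2", ()),
    ("add", "$t3", ("$t1", "$t2")),
    ("sw",  None,  ("$t3",)),
    ("lw",  "$t4", ()),
    ("add", "$t5", ("$t1", "$t4")),
    ("sw",  None,  ("$t5",)),
]
# Move the third lw up, as on the slide.
reordered = [original[i] for i in (0, 1, 4, 2, 3, 5, 6)]

print(pipeline_cycles(original))   # 13
print(pipeline_cycles(reordered))  # 11
```

7 instructions need 7 + 4 = 11 cycles to drain a 5-stage pipeline; the original order adds two load-use bubbles, the reordered code none.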


Control Hazards
- Also called branch hazards, because they are due to branch instructions
- A branch determines the flow of control, so fetching the next instruction depends on the branch outcome
- The pipeline can't always fetch the correct instruction: a control hazard occurs when the proper instruction is not fetched
- What is the solution?


Control Hazards: (1) Stall on Branch
- Wait until the branch outcome is determined before fetching the next instruction; this means waiting until stage 4 (stall 3 cycles)
- Advantage: simple in both hardware and software


Control Hazards: (2) Adding Extra Hardware
- Let's assume we put enough extra hardware into the second pipeline stage (ID) so that we can:
  - test the registers (a comparator)
  - calculate the branch address (an adder)
  - update the PC
- Even with this extra hardware, we still have to wait until stage 2 (stall 1 cycle)
Control Hazards: (3) Branch Prediction
- Longer pipelines can't determine the branch outcome early, so the stall penalty becomes unacceptable
- Solution: predict the outcome of the branch, and stall only if the prediction is wrong
- In the MIPS pipeline we can predict branches as not taken and fetch the instruction after the branch with no delay


MIPS with Predict Not Taken
(Figure: when the prediction is correct, i.e. the branch is not taken, the pipeline proceeds with no delay; when the prediction is incorrect, i.e. the branch is taken, the wrongly fetched instruction is discarded.)
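The average cost of predict-not-taken can be sketched as follows (the frequencies are illustrative assumptions, not from the slides; the 1-cycle penalty assumes the branch is resolved in ID, as with the extra hardware on the previous slide):

```python
# Assumed workload numbers (hypothetical): 20% of instructions are
# branches, and 40% of branches are taken, i.e. mispredicted under
# predict-not-taken.
branch_frac = 0.20
taken_frac = 0.40
penalty = 1  # bubble cycles per misprediction, branch resolved in ID

# Average CPI = base CPI of 1 plus the misprediction bubbles.
cpi = 1 + branch_frac * taken_frac * penalty
print(cpi)
```

With these numbers the pipeline averages 1.08 cycles per instruction; a deeper pipeline with a larger penalty would raise this quickly, which is why prediction accuracy matters.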


Check Yourself (HW)

http://www.edumips.org
http://www.ecs.umass.edu/ece/koren/architecture/windlx/main.html
Pipeline Summary
- Pipelining improves performance by increasing instruction throughput: it executes multiple instructions in parallel, while each instruction keeps the same latency
- Pipelining hazards: structural, data, control
- Instruction set design affects the complexity of the pipeline implementation
