Microprocessors Pipelining Slides

Pipelining improves CPU performance by allowing subsequent instructions to begin executing before previous instructions finish. It works by splitting instruction execution into stages and allowing the next instruction to begin the first stage while the current instruction progresses through later stages. However, pipelining introduces hazards that can reduce performance if not addressed. There are three types of hazards: structural hazards due to insufficient hardware resources, data hazards when instructions depend on previous results, and control hazards due to branches. Stalls may be needed to resolve hazards which reduces the benefits of pipelining.

Uploaded by

Hamza Hassan

Pipelining

• Every instruction consists of five basic steps

– Instruction fetch (IF)
– Instruction decode (ID)
– Execute (EX)
– Memory and I/O access (MEM)
– Write back (WB): storing the result in a register
Pipelining
• Pipelining is the process of fetching the next
instruction while the current instruction is being
executed
• The 80x86 family supports a pipelined architecture
• Pipelining improves the throughput of the system
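The throughput gain can be made concrete with a small cycle count. This is a minimal sketch, assuming an ideal five-stage pipeline in which every stage takes one clock cycle and no stalls occur; the function names are illustrative only.

```python
def unpipelined_cycles(n_instructions: int, n_stages: int = 5) -> int:
    """Each instruction runs all of its stages before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions: int, n_stages: int = 5) -> int:
    """The first instruction fills the pipeline; after that, one finishes per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


for n in (1, 5, 100):
    print(n, unpipelined_cycles(n), pipelined_cycles(n))
```

For a single instruction both counts are the same, so pipelining does not shorten the latency of one instruction; the benefit appears only as more instructions overlap.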
Pipeline Hazards (1)
• Pipeline Hazards are situations that prevent the next
instruction in the instruction stream from executing in its
designated clock cycle
• Hazards reduce the performance from the ideal speedup
gained by pipelining
• Three types of hazards
– Structural hazards
• Arise from resource conflicts when the hardware can’t support all possible
combinations of overlapping instructions
– Data hazards
• Arise when an instruction depends on the result of a previous instruction in
a way that is exposed by the overlapping of instructions in the pipeline
– Control hazards
• Arise from the pipelining of branches and other instructions that change the
PC (Program Counter)
Pipeline Hazards (2)
• Hazards in the pipeline can force the pipeline to stall
• Eliminating a hazard often requires that some
instructions in the pipeline be allowed to proceed
while others are delayed
– When an instruction is stalled, instructions issued later
than the stalled instruction are also stalled, while the ones
issued earlier must continue.
Structural Hazards (1)
• If a certain combination of instructions can’t be
accommodated because of resource conflicts, the machine is
said to have a structural hazard
• It can arise when:
– Some resource has not been duplicated enough to allow all
combinations of instructions in the pipeline to execute
– For example: a machine may have only one register file write port,
but under certain conditions the pipeline might want to perform
two writes in one clock cycle; this generates a structural hazard
• When a sequence of instructions encounters this hazard, the pipeline stalls
one of the instructions until the required unit is available
• Such stalls increase the cycles per instruction (CPI) above the ideal value of 1
for pipelined machines
Structural Hazards (2)

• Consider a Von Neumann architecture (same memory for instructions and data)
Structural Hazards (3)

• Stall cycle added (commonly called pipeline bubble)

Structural Hazards (4)

Instruction       Clock number
number            1     2     3     4     5     6     7     8     9     10    11    12
load              IF    ID    EX    MEM   WB
Instruction i+1         IF    ID    EX    MEM   WB
Instruction i+2               IF    ID    EX    MEM   WB
Instruction i+3                     stall stall stall IF    ID    EX    MEM   WB
Instruction i+4                                             IF    ID    EX    MEM   WB
Instruction i+5                                                   IF    ID    EX    MEM
• Another way to represent the stall: no instruction is
initiated in clock cycles 4, 5 and 6
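The conflict behind this stall can be checked mechanically. The sketch below assumes, as the diagram implies, that every instruction uses the single shared memory in both its IF and its MEM stage, and that an instruction issued in cycle c reaches MEM in cycle c + 3. The function name and the issue-cycle lists are illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def memory_conflicts(issue_cycles):
    """Return the cycles in which one instruction is in IF while another is in MEM."""
    if_cycles = set(issue_cycles)
    mem_offset = STAGES.index("MEM")  # MEM is three stages after IF
    mem_cycles = {c + mem_offset for c in issue_cycles}
    return sorted(if_cycles & mem_cycles)


ideal_issue = [1, 2, 3, 4, 5, 6]    # one instruction fetched every cycle
stalled_issue = [1, 2, 3, 7, 8, 9]  # nothing fetched in cycles 4, 5 and 6

print(memory_conflicts(ideal_issue))
print(memory_conflicts(stalled_issue))
```

With one fetch per cycle the single memory is requested twice in cycles 4, 5 and 6; skipping the fetch in exactly those cycles removes every conflict.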
Structural Hazards (5)
• Both IF and MEM use the cache
– Solution: use separate caches for code and data
• Both ID and WB use the register file port
– Solution: in the first half of the clock cycle use the register file for ID
(reading from the registers) and in the second half of the cycle
use it for WB (writing into the registers)
Structural Hazards (6)

Instruction       Clock number
number            1     2     3     4     5     6     7     8     9     10
load              IF    ID    EX    MEM   WB
Instruction i+1         IF    ID    EX    MEM   WB
Instruction i+2               IF    ID    EX    MEM   WB
Instruction i+3                     IF    ID    EX    MEM   WB
Instruction i+4                           IF    ID    EX    MEM   WB
Instruction i+5                                 IF    ID    EX    MEM   WB
• After resolving the issue
Data Hazards (1)
• Data hazards occur when the pipeline changes the
order of read/write accesses to operands so that the
order differs from the order seen by sequentially
executing instructions on an un-pipelined machine
• Consider the execution of following instructions, on
our pipelined example processor:
– ADD R1, R2, R3
– SUB R4, R1, R5
– AND R6, R1, R7
– OR R8, R1, R9
– XOR R10, R1, R11
Data Hazards (2)

• The use of the result of the ADD instruction causes a hazard, since register R1 is
not written until after the instructions that immediately follow have read it.
Software Solution
• The compiler may insert NOPs between the dependent
instructions
– In the previous example, CC3, CC4 and CC5 must be
NOPs, so that at the end of CC5 the result is available in
R1 for the later instructions.
• Software optimization
– The compiler may rearrange independent instructions in
order to reduce the number of NOPs
Forwarding
• It’s a hardware solution
• Forwarding works as follows:
1. The ALU result from the EX/MEM register is always
fed back to the ALU input latches
2.If the forwarding hardware detects that the previous
ALU operation has written the register corresponding
to a source for the current ALU operation, control
logic selects the forwarded result as the ALU input,
rather than the value read from the register file.
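The selection in step 2 behaves like a multiplexer in front of each ALU input. The sketch below is only an illustration of that rule: the `ExMemLatch` type, the `select_alu_input` function and the register values are hypothetical, and only forwarding from the EX/MEM register is modelled.

```python
from typing import Dict, NamedTuple, Optional


class ExMemLatch(NamedTuple):
    dest_reg: Optional[str]  # register written by the previous ALU operation, if any
    alu_result: int          # value waiting in the EX/MEM pipeline register


def select_alu_input(src_reg: str, regfile: Dict[str, int], ex_mem: ExMemLatch) -> int:
    """Pick the forwarded result when the previous ALU op wrote this source register."""
    if ex_mem.dest_reg is not None and ex_mem.dest_reg == src_reg:
        return ex_mem.alu_result  # forwarded value; the register file copy is stale
    return regfile[src_reg]       # no dependence: use the value read in ID


# Hypothetical register contents; R1 has not been written back yet.
regfile = {"R1": 0, "R2": 10, "R3": 5, "R5": 4}
ex_mem = ExMemLatch("R1", regfile["R2"] + regfile["R3"])  # ADD R1, R2, R3

# SUB R4, R1, R5 executes in the next cycle.
a = select_alu_input("R1", regfile, ex_mem)
b = select_alu_input("R5", regfile, ex_mem)
print(a - b)
```

Without the forwarding path, SUB would subtract from the stale value of R1 held in the register file.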
Data Hazards (3)

• Eliminate the stalls for the hazard involving SUB and AND
instructions using a technique called forwarding
Data Hazards (4)

• A store requires its operand during MEM; forwarding of that operand is shown here.
– The result of the load is forwarded from the MEM/WB register to the memory
input to be stored
– In addition, the ALU output is forwarded to the ALU input for address calculation
for both the load and the store
Data Hazards Classification
• Depending on the order of read and write accesses in the
instructions, data hazards can be classified into three types.
• Consider two instructions i and j, with i occurring before j.
Possible data hazards:
– RAW (Read After Write)
• True dependency
• j tries to read a source before i writes to it, so j incorrectly gets the old
value
• That is, the read takes place before the write
– WAW (Write After Write)
• j tries to write an operand before it is written by i. The writes end up being
performed in the wrong order, with i overwriting the value written by j, so the
destination holds the value written by i rather than the one written by
j
• Present in pipelines that write in more than one pipe stage
– WAR (Write After Read)
• Anti-dependency
• j tries to write a destination before it is read by i, so instruction i
incorrectly gets the new value
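The three cases can be told apart just by comparing register names. The following sketch uses a hypothetical `classify_hazards` function and a simple (destination, sources) encoding. It reports the dependences that could become hazards; whether they actually do depends on the pipeline.

```python
def classify_hazards(i, j):
    """i and j are (dest, sources) pairs, with instruction i occurring before j."""
    i_dest, i_srcs = i
    j_dest, j_srcs = j
    hazards = set()
    if i_dest is not None and i_dest in j_srcs:
        hazards.add("RAW")  # j reads what i writes
    if i_dest is not None and i_dest == j_dest:
        hazards.add("WAW")  # both write the same register
    if j_dest is not None and j_dest in i_srcs:
        hazards.add("WAR")  # j writes what i still has to read
    return hazards


add = ("R1", ("R2", "R3"))  # ADD R1, R2, R3
print(classify_hazards(add, ("R4", ("R1", "R5"))))  # SUB R4, R1, R5
print(classify_hazards(add, ("R1", ("R4", "R5"))))  # SUB R1, R4, R5
print(classify_hazards(add, ("R2", ("R4", "R5"))))  # SUB R2, R4, R5
```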
Data Hazards Requiring Stalls (1)
• Unfortunately not all data hazards can be handled by
forwarding. Consider the following sequence:
– LW R1, 0(R2)
– SUB R4, R1, R5
– AND R6, R1, R7
– OR R8, R1, R9
• The problem with this sequence is that the load
does not have its data until the end of its
MEM stage.
Data Hazards Requiring Stalls (2)

• The load instruction can forward its result to the AND and OR
instructions, but not to the SUB instruction, since that would mean
forwarding the result in “negative” time
Data Hazards Requiring Stalls (3)

• The load interlock causes a stall to be inserted at clock cycle 4,
delaying the SUB instruction and those that follow by one cycle.
– This delay allows the value to be successfully forwarded in the next clock
cycle
Data Hazards Requiring Stalls (4)
LW  R1, 0(R2)    IF ID EX MEM WB
SUB R4, R1, R5   IF ID stall stall EX MEM WB
AND R6, R1, R7   IF ID EX MEM WB
OR  R8, R1, R9   IF ID EX MEM WB

*stall – pause the current stage and all subsequent stages

Compiler Scheduling for Data Hazards (1)
• Consider a typical code sequence, such as A = B + C

LW  R1, B        IF ID EX MEM WB
LW  R2, C        IF ID EX MEM WB
ADD R3, R1, R2   IF ID stall EX MEM WB
SW  A, R3        IF ID EX MEM WB

• The ADD instruction must be stalled to allow the load of C to complete

Compiler Scheduling for Data Hazards (2)
• Rather than just allowing the pipeline to stall, the
compiler can try to schedule the instructions to avoid
the stalls by rearranging the code
– The compiler tries to avoid generating code
with a load followed immediately by a use of the load
destination register
– This technique is called pipeline scheduling or
instruction scheduling, and it is widely used in
modern compilers
Instruction scheduling example
• Generate code for our example processor that avoids
pipeline stalls from the following sequence:
– A = B + C
– D = E - F
• Solution
1. LW Rb, B // result of a load is available after the MEM stage
2. LW Rc, C
3. ADD Ra, Rb, Rc
4. SW A, Ra
5. LW Re, E
6. LW Rf, F
7. SUB Rd, Re, Rf
8. SW D, Rd
In   1   2   3   4   5   6   7   8   9   10  11  12  13  14
1    IF  ID  EX  M   WB
2        IF  ID  EX  M   WB
3            IF  ID  ID  EX  M   WB
4                    IF  ID  EX  M   WB
5                        IF  ID  EX  M   WB
6                            IF  ID  EX  M   WB
7                                IF  ID  ID  EX  M   WB
8                                    IF  IF  ID  EX  M   WB
Rearrange Instructions
1. LW Rb, B // result of a load is available after the MEM stage
2. LW Rc, C
3. LW Re, E
4. LW Rf, F
5. ADD Ra, Rb, Rc
6. SW A, Ra // stores need their operand at the MEM stage
7. SUB Rd, Re, Rf
8. SW D, Rd
In   1   2   3   4   5   6   7   8   9   10  11  12    Instruction
1    IF  ID  EX  M   WB                                LW Rb, B
2        IF  ID  EX  M   WB                            LW Rc, C
3            IF  ID  EX  M   WB                        LW Re, E
4                IF  ID  EX  M   WB                    LW Rf, F
5                    IF  ID  EX  M   WB                ADD Ra, Rb, Rc
6                        IF  ID  EX  M   WB            SW A, Ra
7                            IF  ID  EX  M   WB        SUB Rd, Re, Rf
8                                IF  ID  EX  M   WB    SW D, Rd
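The benefit of the rearrangement can be checked by counting load-use stalls in both orderings. This is a rough sketch under stated assumptions rather than a model of any real compiler: full forwarding, one stall cycle when a load is immediately followed by an ALU instruction that reads the loaded register, and no stall for a store, which needs its data only in MEM. The tuple encoding and function names are illustrative.

```python
# Assumed model: five stages, full forwarding, one stall per load-use pair,
# and stores that need their data register only in the MEM stage.
N_STAGES = 5


def count_load_use_stalls(program):
    """program is a list of (opcode, dest, sources) tuples in issue order."""
    stalls = 0
    for (p_op, p_dest, _), (c_op, _, c_srcs) in zip(program, program[1:]):
        if p_op == "LW" and c_op != "SW" and p_dest in c_srcs:
            stalls += 1
    return stalls


def total_cycles(program):
    """Pipeline fill time, plus one cycle per further instruction, plus stalls."""
    return N_STAGES + len(program) - 1 + count_load_use_stalls(program)


original = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("ADD", "Ra", ("Rb", "Rc")),
    ("SW", None, ("Ra",)), ("LW", "Re", ()), ("LW", "Rf", ()),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]
# Same instructions with all four loads hoisted to the front.
rearranged = [original[k] for k in (0, 1, 4, 5, 2, 3, 6, 7)]

print(count_load_use_stalls(original), total_cycles(original))
print(count_load_use_stalls(rearranged), total_cycles(rearranged))
```

Under these assumptions the original order pays two stall cycles (before ADD and before SUB), while the rearranged order pays none.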
Control Hazards (1)
• Can cause a greater performance loss than data hazards
• When a branch is executed, it may or may not change the
PC (the PC gives the address of the next instruction)
– If a branch changes the PC to its target address, then it is a
taken branch
– If a branch doesn’t change the PC to its target address, then it is a
not-taken branch
• If instruction i is a taken branch, then the value of the PC will
not change until the end of the MEM stage of the instruction’s
execution in the pipeline
– A simple method to deal with branches is to stall the pipeline as soon
as we detect a branch, until we know the result of the branch

Branch Instruction     IF    ID    EX    MEM   WB
Branch Successor             IF    stall stall IF    ID    EX    MEM   WB
Branch Successor + 1                                 IF    ID    EX    MEM   WB
Branch Successor + 2                                       IF    ID    EX    MEM

• A branch causes a 3-cycle stall in our example processor pipeline
– One cycle is a repeated IF, necessary if the branch is
taken. If the branch is not taken, this IF is redundant
– Two idle cycles
Control Hazards (3)
• The three clock cycles lost for every branch are a
significant loss
– With a 30% branch frequency, the machine with branch
stalls achieves only about half of the ideal speedup from
pipelining
• The number of clock cycles in a branch stall can be
reduced in two steps:
– Find out whether the branch is taken or not at an early stage of the
pipeline
– Compute the taken PC (the address of the branch target)
earlier
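The "about half" figure follows from the effective CPI. A minimal sketch, assuming an ideal CPI of 1 and that branch stalls are the only source of lost cycles; the function name is illustrative.

```python
def cpi_with_branch_stalls(branch_frequency, stall_cycles, ideal_cpi=1.0):
    """Effective CPI when each branch adds a fixed number of stall cycles."""
    return ideal_cpi + branch_frequency * stall_cycles


cpi = cpi_with_branch_stalls(0.30, 3)
print(f"CPI = {cpi:.2f}")
print(f"fraction of ideal speedup = {1.0 / cpi:.3f}")

# If the penalty could be cut to a single cycle, the loss would be much smaller.
print(f"CPI with a 1-cycle penalty = {cpi_with_branch_stalls(0.30, 1):.2f}")
```

A CPI of 1.9 instead of 1 means the machine reaches roughly 53% of the ideal pipelined speedup, which is why shortening the branch penalty matters.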
Control Hazards (4)
• The stall from branch hazards is reduced by moving the branch
calculation into the ID phase of the pipeline.

• A separate adder computes the branch target address during
ID. Because the branch target addition happens during ID, it will
happen for all instructions.

• The selection of the sequential PC or the branch target PC still
occurs during IF, but now it uses values from the ID phase, rather than
from the EX/MEM register.

• If the branch is taken, clear the IF/ID register (the instruction just fetched could be
wrong)
                    1    2    3    4    5    6    7
BEQ R1, R2, L       IF   ID   EX   M    WB
ADD R3, R0, R3           IF   NOP  NOP
SUB R4, R5, R6
L: OR R3, R2, R4              IF   ID   EX   M    WB
• All the stages in the pipeline must be
synchronized
• Ideally all the stages take a similar time to
complete their task, but in reality that is not the case.
• For example
– IF 10 ns
– ID 5 ns
– EX 10 ns
– MEM 5 ns
– WB 5 ns
• Stages are synchronized through a fixed clock
• Each stage must complete within the clock period
• Every stage must pass on its result at the clock
edge
• The clock rate is chosen such that the longest
stage finishes in 1 clock period
• In the previous example, the maximum time was 10 ns, so the
clock frequency = 1/(10 ns) = 100 MHz
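The clock choice above can be written out as a short calculation. This sketch uses the stage times from the example; the function names and the per-stage idle time are illustrative additions.

```python
stage_ns = {"IF": 10, "ID": 5, "EX": 10, "MEM": 5, "WB": 5}


def clock_period_ns(stage_latencies):
    """The clock must be long enough for the slowest stage."""
    return max(stage_latencies.values())


def frequency_mhz(period_ns):
    """A period in nanoseconds corresponds to 1000 / period megahertz."""
    return 1000.0 / period_ns


period = clock_period_ns(stage_ns)
print(period, "ns ->", frequency_mhz(period), "MHz")

# Time each stage spends idle, waiting for the clock edge.
idle_ns = {stage: period - t for stage, t in stage_ns.items()}
print(idle_ns)
```

The idle times show the cost of unbalanced stages: ID, MEM and WB each sit idle for half of every cycle.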
Exercise
The 5 stages of the processor have the following
latencies
Processor   Fetch    Decode   Execute   Memory   Write back
A           200 ps   400 ps   350 ps    500 ps   100 ps

Clock frequency for Processor A = 1/(500 ps) = 2000 MHz

Latency = time taken by one instruction
= 500 × 5 = 2500 ps
Assume there is a 20 ps overhead in each stage/unit; then the clock period
becomes 500 + 20 = 520 ps and the latency becomes
5 × 520 = 2600 ps
Throughput = 1 instruction per 520 ps
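The exercise arithmetic can be checked with a few lines. This is a sketch assuming the per-stage overhead is simply added to the longest stage latency; the function name is illustrative.

```python
def pipeline_timing(stage_ps, overhead_ps=0):
    """Return (clock period, instruction latency) in picoseconds."""
    period = max(stage_ps) + overhead_ps  # the slowest stage sets the clock
    latency = period * len(stage_ps)      # one instruction passes through every stage
    return period, latency


processor_a = [200, 400, 350, 500, 100]  # Fetch, Decode, Execute, Memory, Write back

for overhead in (0, 20):
    period, latency = pipeline_timing(processor_a, overhead)
    freq_mhz = 1e6 / period              # 1 / (period in ps), expressed in MHz
    throughput = 1e12 / period           # instructions per second, one per cycle
    print(f"overhead={overhead} ps: period={period} ps, latency={latency} ps, "
          f"clock={freq_mhz:.0f} MHz, throughput={throughput:.3e} instr/s")
```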
