Lecture-4-08 01 2025

The document discusses the MIPS32 datapath and pipelining concepts, detailing the execution stages of instructions in a non-pipelined and pipelined architecture. It highlights the importance of overlapping execution stages to improve performance, as well as the challenges and solutions related to hardware conflicts during pipelining. Additionally, it emphasizes the design simplicity of the control unit due to the regularity of the MIPS instruction set.

Uploaded by

Munesh Meena

LECTURE 4: REVIEWING MIPS DATAPATH-2 & PIPELINING CONCEPTS

Example of Instruction Execution

ADD R2, R5, R10
IF   IR ← Mem [PC];
     NPC ← PC + 4;
ID   A ← Reg [rs];
     B ← Reg [rt];
EX   ALUOut ← A + B;
MEM  PC ← NPC;
WB   Reg [rd] ← ALUOut;

ADDI R2, R5, 150
IF   IR ← Mem [PC];
     NPC ← PC + 4;
ID   A ← Reg [rs];
     Imm ← (IR[15])^16 ## IR[15..0]   (bit 15 replicated 16 times, then bits 15..0: sign extension)
EX   ALUOut ← A + Imm;
MEM  PC ← NPC;
WB   Reg [rt] ← ALUOut;
Example of Instruction Execution

LW R2, 200(R6)
IF   IR ← Mem [PC];
     NPC ← PC + 4;
ID   A ← Reg [rs];
     Imm ← (IR[15])^16 ## IR[15..0]
EX   ALUOut ← A + Imm;
MEM  PC ← NPC;
     LMD ← Mem [ALUOut];
WB   Reg [rt] ← LMD;

SW R3, 25(R10)
IF   IR ← Mem [PC];
     NPC ← PC + 4;
ID   A ← Reg [rs];
     B ← Reg [rt];
     Imm ← (IR[15])^16 ## IR[15..0]
EX   ALUOut ← A + Imm;
MEM  PC ← NPC;
     Mem [ALUOut] ← B;
WB   (no operation)

BEQZ R3, Label
IF   IR ← Mem [PC];
     NPC ← PC + 4;
ID   A ← Reg [rs];
     Imm ← (IR[15])^16 ## IR[15..0]
EX   ALUOut ← NPC + (Imm << 2);
     cond ← (A == 0)
MEM  PC ← NPC;
     if (cond) PC ← ALUOut;
WB   (no operation)
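The five steps above can be traced with a small software model. The following is only a sketch under stated assumptions: instructions are passed as pre-decoded dictionaries (a hypothetical format, not real MIPS32 encodings), memory is a Python dict, and 32-bit overflow is ignored. The function name run_instruction and the state layout are illustrative choices, not part of the lecture.

```python
# Toy model of the five execution steps (IF, ID, EX, MEM, WB).
# Assumptions: pre-decoded dict instructions, no 32-bit wraparound.

def run_instruction(instr, state):
    regs, mem = state["regs"], state["mem"]
    op = instr["op"]

    # IF: the instruction arrives already decoded; compute NPC.
    npc = state["pc"] + 4

    # ID: read the source registers and the immediate.
    a = regs[instr.get("rs", 0)]
    b = regs[instr.get("rt", 0)]
    imm = instr.get("imm", 0)

    # EX: ALU operation, plus the zero test used by BEQZ.
    cond = False
    if op == "ADD":
        alu_out = a + b
    elif op in ("ADDI", "LW", "SW"):
        alu_out = a + imm
    elif op == "BEQZ":
        alu_out = npc + (imm << 2)
        cond = (a == 0)
    else:
        raise ValueError(f"unsupported op: {op}")

    # MEM: update the PC and access data memory for loads and stores.
    state["pc"] = alu_out if cond else npc
    lmd = None
    if op == "LW":
        lmd = mem.get(alu_out, 0)
    elif op == "SW":
        mem[alu_out] = b

    # WB: write the result to rd (ADD) or rt (ADDI, LW).
    if op == "ADD":
        regs[instr["rd"]] = alu_out
    elif op == "ADDI":
        regs[instr["rt"]] = alu_out
    elif op == "LW":
        regs[instr["rt"]] = lmd
    return state


state = {"pc": 0, "regs": [0] * 32, "mem": {}}
state["regs"][5], state["regs"][10] = 7, 3
run_instruction({"op": "ADD", "rd": 2, "rs": 5, "rt": 10}, state)     # ADD R2, R5, R10
run_instruction({"op": "ADDI", "rt": 2, "rs": 5, "imm": 150}, state)  # ADDI R2, R5, 150
```

Note that SW and BEQZ fall through the WB step without writing any register, matching the empty WB entries above.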
MIPS32 Design of Datapath
◦ Here we will design the data path for the five steps mentioned earlier for executing MIPS32 instructions.
◦ Assume that there is no pipelining.
◦ Also known as single-cycle implementation.
◦ Only after one instruction is finished can the next instruction start.
◦ Later on we shall extend the data path for a pipelined implementation.
◦ We shall discuss various pipelining-related issues and techniques for faster execution of instructions.
The IF Stage

[Figure: IF-stage datapath showing PC, Instruction Memory, IR, the +4 adder and NPC.]

IR ← Mem [PC];
NPC ← PC + 4;

• 32-bit PC
• 32-bit NPC
• 32-bit IR
• 32-bit Adder
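The ID Stage

[Figure: ID-stage datapath showing IR, the rs/rt/rd fields, the Register Bank, the Sign Extend unit, the A, B and IMM registers, and the write input from WB.]

A ← Reg [rs];
B ← Reg [rt];
IMM ← (IR[15])^16 ## IR[15..0]
IMM1 ← IR[25..0] ## 00
The IMM1 calculation is not shown.

• 32-bit A
• 32-bit B
• 32-bit IMM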
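The two immediate computations of the ID stage are plain bit manipulations. A minimal sketch follows, assuming the instruction word is held as a Python integer; the function names sign_extend16, jump_offset and to_signed32 are hypothetical helpers introduced only for illustration.

```python
def sign_extend16(ir):
    """IMM = (IR[15])^16 ## IR[15..0], returned as a 32-bit pattern."""
    imm16 = ir & 0xFFFF                     # IR[15..0]
    sign = (imm16 >> 15) & 1                # IR[15]
    upper = 0xFFFF if sign else 0x0000      # IR[15] replicated 16 times
    return (upper << 16) | imm16


def jump_offset(ir):
    """IMM1 = IR[25..0] ## 00, i.e. the 26-bit field shifted left by 2."""
    return (ir & 0x03FFFFFF) << 2


def to_signed32(x):
    """Interpret a 32-bit pattern as a two's-complement integer."""
    return x - (1 << 32) if x & 0x80000000 else x
```

For example, a 16-bit field of 0xFFFC is extended to 0xFFFFFFFC, which to_signed32 reads as -4.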
The EX Stage

[Figure: EX-stage datapath showing a MUX selecting between NPC and A, a MUX selecting between B and IMM (with a << 2 shifter), the ALU with its Func input, ALUOut, and the "=0?" test producing cond.]

Memory Reference:
  ALUOut ← A + IMM;

Register-Register ALU Instruction:
  ALUOut ← A func B;

Register-Immediate ALU Instruction:
  ALUOut ← A func IMM;

Branch:
  ALUOut ← NPC + (IMM << 2);
  cond ← (A op 0);

• 32-bit ALUOut
• 1-bit cond
• 32-bit 2:1 MUX
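The role of the two input multiplexers can be sketched in a few lines. This is an illustrative model only: the select encoding (0 picks A or B, 1 picks NPC or IMM) is an assumption, the ALU function table is a small hypothetical subset, and the caller is expected to pass IMM already shifted left by 2 for branches.

```python
import operator

# Hypothetical subset of ALU functions, keyed by name.
ALU_FUNCS = {
    "add": operator.add,
    "sub": operator.sub,
    "and": operator.and_,
    "or": operator.or_,
}


def ex_stage(a, b, imm, npc, mux_alu1, mux_alu2, alu_func):
    """Return (ALUOut, cond) for one EX step."""
    in1 = npc if mux_alu1 else a    # assumed encoding: 1 selects NPC, 0 selects A
    in2 = imm if mux_alu2 else b    # assumed encoding: 1 selects IMM, 0 selects B
    alu_out = ALU_FUNCS[alu_func](in1, in2) & 0xFFFFFFFF  # keep 32 bits
    cond = (a == 0)                 # the "=0?" test on A
    return alu_out, cond


# Register-register add: A + B
print(ex_stage(7, 3, 0, 4, 0, 0, "add"))
# Branch target: NPC + (IMM << 2), with A == 0 so cond is True
print(ex_stage(0, 0, 2 << 2, 20, 1, 1, "add"))
```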
The MEM Stage

[Figure: MEM-stage datapath showing a MUX, controlled by cond, selecting between NPC and ALUOut for the PC, and the Data Memory addressed by ALUOut with write data from B and read data into LMD.]

Load instruction:
  PC ← NPC;
  LMD ← Mem [ALUOut];

Store instruction:
  PC ← NPC;
  Mem [ALUOut] ← B;

Branch instruction:
  if (cond) PC ← ALUOut;
  else PC ← NPC;

Other instructions:
  PC ← NPC;

• 32-bit LMD
• 32-bit 2:1 MUX
The WB Stage

[Figure: WB-stage datapath showing a MUX selecting between LMD and ALUOut for the write port of the register bank.]

Register-Register ALU Instruction:
  Reg [rd] ← ALUOut;

Register-Immediate ALU Instruction:
  Reg [rt] ← ALUOut;

Load Instruction:
  Reg [rt] ← LMD;

• 32-bit 2:1 MUX


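Putting it all together

[Figure: complete non-pipelined MIPS32 datapath combining the five stages: PC, +4 adder, NPC, Instruction Memory, IR, Register Bank, Sign Extend, A, B, IMM, ALU input MUXes, ALU, ALUOut, "=0" test producing cond, Data Memory, LMD, and the PC and write-back MUXes.]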
Simplicity of the Control Unit Design
◦ Due to the regularity in instruction encoding and simplicity of the instruction set, the design of the control unit becomes very easy.
◦ Control signals in the data path:
a) LoadPC
b) LoadNPC
c) ReadIM
d) LoadIR
e) ReadRegPort1
f) ReadRegPort2
g) LoadA
h) LoadB
i) LoadIMM
j) MuxALU1
k) MuxALU2
l) ALUfunc
m) LoadALUOut
n) MuxPC
o) ReadDM
p) WriteDM
q) LoadLMD
r) MuxWB
s) WriteReg
Example: Control Signal for ADD R1, R2, R3

Stage  Micro-operations        Control signals
IF     IR ← Mem [PC];          ReadIM, LoadIR, LoadNPC
       NPC ← PC + 4;
ID     A ← Reg [rs];           ReadRegPort1, ReadRegPort2, LoadA, LoadB
       B ← Reg [rt];
EX     ALUOut ← A + B;         ALUfunc = add, MuxALU1 = 0, MuxALU2 = 0, LoadALUOut
MEM    PC ← NPC;               LoadPC
WB     Reg [rd] ← ALUOut;      MuxWB = 1, WriteReg
Example: Control Signal for LW R1, 25(R2)

Stage  Micro-operations                   Control signals
IF     IR ← Mem [PC];                     ReadIM, LoadIR, LoadNPC
       NPC ← PC + 4;
ID     A ← Reg [rs];                      ReadRegPort1, LoadA, LoadIMM
       Imm ← (IR[15])^16 ## IR[15..0]
EX     ALUOut ← A + Imm;                  ALUfunc = add, MuxALU1 = 0, MuxALU2 = 1, LoadALUOut
MEM    PC ← NPC;                          LoadPC, ReadDM, LoadLMD
       LMD ← Mem [ALUOut];
WB     Reg [rt] ← LMD;                    MuxWB = 0, WriteReg
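Because the control signals depend only on the instruction type and the stage, the control unit can be viewed as a lookup table. The sketch below encodes the two examples above as Python dictionaries and lists the signals on which ADD and LW differ. The grouping of signals by stage follows the two examples; the table layout and the helper names asserted and diff are illustrative assumptions.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

# Control signals per stage for the two worked examples.
CONTROL = {
    "ADD": {
        "IF": {"ReadIM": 1, "LoadIR": 1, "LoadNPC": 1},
        "ID": {"ReadRegPort1": 1, "ReadRegPort2": 1, "LoadA": 1, "LoadB": 1},
        "EX": {"ALUfunc": "add", "MuxALU1": 0, "MuxALU2": 0, "LoadALUOut": 1},
        "MEM": {"LoadPC": 1},
        "WB": {"MuxWB": 1, "WriteReg": 1},
    },
    "LW": {
        "IF": {"ReadIM": 1, "LoadIR": 1, "LoadNPC": 1},
        "ID": {"ReadRegPort1": 1, "LoadA": 1, "LoadIMM": 1},
        "EX": {"ALUfunc": "add", "MuxALU1": 0, "MuxALU2": 1, "LoadALUOut": 1},
        "MEM": {"LoadPC": 1, "ReadDM": 1, "LoadLMD": 1},
        "WB": {"MuxWB": 0, "WriteReg": 1},
    },
}


def asserted(instr):
    """Flatten the per-stage table into one {signal: value} mapping."""
    flat = {}
    for stage in STAGES:
        flat.update(CONTROL[instr][stage])
    return flat


def diff(instr1, instr2):
    """Signals whose values differ between two instructions."""
    a, b = asserted(instr1), asserted(instr2)
    return {s: (a.get(s), b.get(s))
            for s in sorted(set(a) | set(b)) if a.get(s) != b.get(s)}


for signal, values in diff("ADD", "LW").items():
    print(signal, values)
```

The IF-stage signals are identical for both instructions, which reflects the point that fetch does not depend on the instruction type.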
Pipelining
◦ Non-overlapped execution results in a purely sequential flow of execution.
◦ If the stages use different functional units, we can overlap the stages of execution, resulting in better utilization of the functional units.
Analogy
◦ Consider a laundry service that performs these activities: Wash, Dry, Iron, Pack.
◦ Normally these activities are performed in sequential order.
◦ Note that while we wash the clothes, the other resources are not utilized. What if the laundry service could instead work like this:

Sequential (one set of clothes at a time):
Wash  Dry  Iron  Pack

Overlapping of different functional units (five sets of clothes):

       T1    T2    T3    T4    T5    T6    T7    T8
Set 1  Wash  Dry   Iron  Pack
Set 2        Wash  Dry   Iron  Pack
Set 3              Wash  Dry   Iron  Pack
Set 4                    Wash  Dry   Iron  Pack
Set 5                          Wash  Dry   Iron  Pack

◦ Here we are able to wash, dry, iron and pack 5 sets of clothes in 8 units of time.
◦ Mechanism: partitioning the activities into stages.
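The 8 time units quoted for the laundry can be checked with simple counting, assuming every activity takes exactly one time unit. The function names below are illustrative.

```python
def sequential_time(num_jobs, num_stages):
    """Each job runs all stages before the next job starts."""
    return num_jobs * num_stages


def overlapped_time(num_jobs, num_stages):
    """The first job takes num_stages units; each later job finishes one unit after the previous one."""
    return num_stages + (num_jobs - 1)


# Five sets of clothes, four activities (Wash, Dry, Iron, Pack).
print(sequential_time(5, 4))   # 20 time units without overlap
print(overlapped_time(5, 4))   # 8 time units with overlap
```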
Pipelining in a Computer
◦ Now we will look at how to partition a computational activity into k stages.
◦ Objective of Pipelining
1. A nominal increase in the cost of implementation
2. Ideal Speedup = k
◦ Pipelining can be applied in
◦ Instruction: Several instructions executed in some sequence.
◦ Arithmetic computation: Same operation carried out on several data sets.
◦ Memory access: Several memory accesses to consecutive locations are made.
◦ Basic requirements for pipelining the MIPS32 data path:
◦ We should be able to start a new instruction every clock cycle.
◦ Each of the five steps mentioned before (IF, ID, EX, MEM and WB) becomes a pipeline stage.
◦ Each stage must finish its execution within one clock cycle.
◦ Since the execution of several instructions is overlapped, we must ensure that there is no conflict during the execution.
◦ Simplicity of the MIPS32 instruction set makes this evaluation quite easy.
◦ We shall discuss these issues in detail.
◦ Consider time T to execute each stage.
◦ In a non-pipelined computer (IF + ID + EX + MEM + WB), a single instruction requires 5T to complete its execution.
◦ Therefore, the time required to execute n instructions = 5Tn

[Figure: the five pipeline stages IF, ID, EX, MEM, WB, each with delay T, separated by latches with delay δ.]

In the pipelined version, each stage takes T + δ, where δ is the latch delay:

Time to execute n instructions = (4 + n)(T + δ) ≈ (4 + n)T, if T >> δ

Ideal Speedup = 5Tn / ((4 + n)T) ≈ 5, for large n.
In practice, due to various conflicts, speedup is much less.
Alternatively, pipeline speedup

◦ Pipeline speedup = $\dfrac{\text{time non-pipelined}}{\text{time pipelined}}$
◦ Consider executing n instructions on a k-stage pipelined processor:
◦ Non-pipelined processor: $kn$
◦ Pipelined processor: $(k-1)+n$
◦ Speedup = $\dfrac{kn}{(k-1)+n}$
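The speedup expression is easy to evaluate numerically. The sketch below is a direct transcription of kn / ((k − 1) + n), extended with an optional latch delay δ as in the (4 + n)(T + δ) estimate; the function name and default arguments are illustrative.

```python
def pipeline_speedup(n, k, T=1.0, delta=0.0):
    """Speedup of a k-stage pipeline over non-pipelined execution of n instructions."""
    non_pipelined = k * n * T                  # each instruction takes k stages of time T
    pipelined = (k - 1 + n) * (T + delta)      # k-1 cycles to fill, then one result per cycle
    return non_pipelined / pipelined


for n in (1, 4, 100, 10**6):
    print(n, round(pipeline_speedup(n, k=5), 4))
```

For a single instruction the speedup is 1, and as n grows it approaches k = 5. A non-zero δ lowers the limit to kT / (T + δ).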
Achieving Speedup
◦ Options
◦ Replication: increase the hardware by a factor of n to achieve a speedup of n
◦ Cost is high
◦ Partitioning the processing into k stages
◦ Cost is less
                Clock cycles
Instructions    1    2    3    4    5    6    7    8
i               IF   ID   EX   MEM  WB
i+1                  IF   ID   EX   MEM  WB
i+2                       IF   ID   EX   MEM  WB
i+3                            IF   ID   EX   MEM  WB

Instr-i completes in cycle 5, Instr-i+1 in cycle 6, Instr-i+2 in cycle 7, and Instr-i+3 in cycle 8.
                Clock cycles
Instructions    1    2    3    4    5    6    7    8
i               IF   ID   EX   MEM  WB
i+1                  IF   ID   EX   MEM  WB
i+2                       IF   ID   EX   MEM  WB
i+3                            IF   ID   EX   MEM  WB

Hardware conflict examples in pipelining:

• IF & MEM: In clock cycle 4, both instructions i and i+3 access memory.
• Solution: use separate instruction and data caches.

• ID & WB: In clock cycle 5, both instructions i and i+3 access the register bank.
• Solution: allow both read and write access to registers in the same clock cycle.
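The two conflicts can be found mechanically from the timing diagram. The sketch below assumes an ideal pipeline in which instruction i+j enters IF in cycle j+1, and it pessimistically treats every MEM stage as a memory access and every WB stage as a register write. The resource names and function names are illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

# Hardware resource touched by each stage (EX uses only the ALU).
RESOURCE = {
    "IF": "memory",
    "MEM": "memory",
    "ID": "register bank",
    "WB": "register bank",
}


def stage_in_cycle(offset, cycle):
    """Stage occupied by instruction i+offset in a 1-based clock cycle, or None."""
    pos = cycle - 1 - offset
    return STAGES[pos] if 0 <= pos < len(STAGES) else None


def conflicts(num_instr, cycle):
    """Resources used by more than one instruction in the given cycle."""
    users = {}
    for offset in range(num_instr):
        stage = stage_in_cycle(offset, cycle)
        resource = RESOURCE.get(stage)
        if resource:
            users.setdefault(resource, []).append((offset, stage))
    return {res: u for res, u in users.items() if len(u) > 1}


print(conflicts(4, 4))   # instruction i in MEM and i+3 in IF share memory
print(conflicts(4, 5))   # instruction i in WB and i+3 in ID share the register bank
```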
Advantages of Pipelining

◦ In the non-pipelined version, the execution time of an instruction is equal to the combined delay of the five stages (say, 5T).
◦ In the pipelined version, once the pipeline is full, one instruction gets executed after every T time.
◦ Assuming all stage delays are equal (equal to T), and neglecting latch delay.
◦ However, due to various conflicts between instructions (called hazards), we cannot achieve the ideal performance.
◦ Several techniques have been proposed to improve the performance.
◦ To be discussed.
Observation: IF and MEM conflict
◦ To support overlapped execution, peak memory bandwidth must be
increased 5 times over that required for the non-pipelined version.
◦ An instruction fetch occurs every clock cycle.
◦ Also there can be two memory accesses per clock cycle (one for instruction
and one for data).
◦ Separate instruction and data caches are typically used to support this.

[Figure: the CPU connected to a separate I-Cache and D-Cache.]
Observation: ID and WB conflict
◦ The register bank is accessed both in the stages ID and WB.
◦ ID requires 2 register reads, and WB requires 1 register write.
◦ We thus have the requirement of 2 reads and 1 write in every clock cycle.
◦ Two register reads can be supported by having two register read ports.
◦ Simultaneous reads and a write may result in clashes (e.g., the same register is read and written).
◦ Solution adopted in MIPS32 pipeline is to perform the write during the first
half of the clock cycle, and the reads during the second half of the clock cycle.

[Figure: three consecutive clock cycles, each split into a Write half followed by a Reads half.]

[Figure: REGISTER BANK with Read Port 1 (Src Reg 1, 5 bits; Reg Data, 32 bits), Read Port 2 (Src Reg 2, 5 bits; Reg Data, 32 bits) and a Write Port (Dest Reg, 5 bits; Reg Data, 32 bits).]
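The write-then-read ordering within a clock cycle can be illustrated with a behavioural model. This is a sketch, not a hardware description: the class name RegisterBank and the method clock_cycle are hypothetical, and the two half-cycles are modelled simply as two sequential steps.

```python
class RegisterBank:
    """32 registers with one write port and two read ports."""

    def __init__(self):
        self.regs = [0] * 32

    def clock_cycle(self, write=None, reads=()):
        # First half of the cycle: perform the write from the WB stage.
        if write is not None:
            dest, value = write
            self.regs[dest] = value & 0xFFFFFFFF
        # Second half of the cycle: perform the reads for the ID stage,
        # which therefore observe the value just written.
        return tuple(self.regs[r] for r in reads)


bank = RegisterBank()
# WB writes register 2 while ID reads registers 2 and 5 in the same cycle.
print(bank.clock_cycle(write=(2, 42), reads=(2, 5)))
```

Because the write happens first, the read of register 2 in the same cycle returns the new value rather than the stale one.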
Updating PC
◦Since a new instruction is fetched every clock cycle, it is
required to increment the PC on each clock.
◦ PC updating has to be done during IF stage itself, as otherwise the
next instruction cannot be fetched.
◦ In the non-pipelined version discussed earlier, this was done during
the MEM stage.
Basic Performance Issues in a Pipeline
[Figure: the stages IF, ID, EX, MEM and WB with registers between them.]

◦ Register stages are inserted between pipeline stages, which increases the execution time of an individual instruction.
◦ Because of overlapped execution of instructions, throughput increases.
◦ The clock period T has to be chosen suitably:
◦ Slowest stage in the pipeline.
◦ Clock skew and jitter.
◦ Register setup time: minimum time the register input must be held
stable before the active clock edge arrives.
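A rough way to see how these factors combine is to add the overheads to the slowest stage delay. This is a simplified model stated as an assumption (real designs also account for register clock-to-output delay), and the stage delays below are made-up numbers used only for illustration.

```python
def clock_period(stage_delays, skew=0.0, setup=0.0):
    """Simplified estimate: slowest stage plus clock skew/jitter plus register setup time."""
    return max(stage_delays) + skew + setup


# Hypothetical stage delays in ns for IF, ID, EX, MEM, WB.
delays = [1.0, 0.8, 1.0, 0.9, 0.7]
period = clock_period(delays, skew=0.15, setup=0.10)

non_pipelined_latency = sum(delays)           # one instruction, no pipeline registers
pipelined_latency = len(delays) * period      # one instruction passing through all five stages

print(period, non_pipelined_latency, pipelined_latency)
```

With these numbers a single instruction takes longer in the pipeline than without it, yet one instruction completes every clock period once the pipeline is full, which is why throughput still improves.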
Example 1
◦Consider the 5-stage MIPS32 pipeline, with the following
features:
◦ Pipeline clock rate of 1GHz (i.e. 1 ns clock cycle time).
◦ For a non-pipelined implementation, ALU operations and branches
take 4 cycles, while memory operations take 5 cycles.
◦ Relative frequencies of ALU operations, branches and memory
operations are 50%, 15%, and 35% respectively.
◦ In the pipelined implementation, due to clock skew and setup time,
the clock cycle time increases by 0.25 ns.
◦ Calculate the estimated speedup of the pipelined implementation in a
steady state.
Solution
a) For non-pipelined processor:
◦ Average instruction execution time = Clock cycle time × Average CPI
= 1 ns × (0.50 × 4 + 0.15 × 4 + 0.35 × 5) = 4.35 ns
b) For pipelined processor:
◦ Clock cycle time = 1 + 0.25 = 1.25 ns
◦ In the steady state, one instruction gets executed every clock cycle, so the average instruction execution time is 1.25 ns.
◦ Speedup = 4.35 / 1.25 = 3.48
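The arithmetic of Example 1 can be reproduced in a few lines. The variable and function names are illustrative; the numbers are those given in the example.

```python
def average_cpi(mix):
    """Weighted average of cycles per instruction over (frequency, cycles) pairs."""
    return sum(freq * cycles for freq, cycles in mix)


# (relative frequency, cycles) for ALU operations, branches, memory operations.
mix = [(0.50, 4), (0.15, 4), (0.35, 5)]

non_pipelined_ns = 1.0 * average_cpi(mix)   # 1 ns clock cycle time
pipelined_ns = 1.0 + 0.25                   # one instruction per 1.25 ns cycle in steady state
speedup = non_pipelined_ns / pipelined_ns

print(round(non_pipelined_ns, 2), pipelined_ns, round(speedup, 2))
```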
Revisiting Micro-operations for Non-pipelined MIPS32

IF
  IR ← Mem [PC];
  NPC ← PC + 4;

ID
  A ← Reg [rs];
  B ← Reg [rt];
  IMM ← (IR[15])^16 ## IR[15..0]
  IMM1 ← IR[25..0] ## 00

EX
  Memory Reference:
    ALUOut ← A + IMM;
  Register-Register ALU Instruction:
    ALUOut ← A func B;
  Register-Immediate ALU Instruction:
    ALUOut ← A func IMM;
  Branch:
    ALUOut ← NPC + (IMM << 2);
    cond ← (A op 0);

MEM
  Load instruction:
    PC ← NPC;
    LMD ← Mem [ALUOut];
  Store instruction:
    PC ← NPC;
    Mem [ALUOut] ← B;
  Branch instruction:
    if (cond) PC ← ALUOut;
    else PC ← NPC;
  Other instructions:
    PC ← NPC;

WB
  Register-Register ALU Instruction:
    Reg [rd] ← ALUOut;
  Register-Immediate ALU Instruction:
    Reg [rt] ← ALUOut;
  Load Instruction:
    Reg [rt] ← LMD;
Putting it all together

[Figure: complete non-pipelined MIPS32 datapath, as shown earlier, combining the IF, ID, EX, MEM and WB stages.]
Thank You
