CSCE 5610 Computer System Architecture: Instruction Level Parallelism

The document discusses the inefficiencies of single-cycle implementations in computer architecture, highlighting that all instructions must take the same time due to the longest execution path. It contrasts this with pipelining, which allows for overlapping instruction execution, significantly improving performance. The document also illustrates the stages of instruction execution and the benefits of utilizing functional units more effectively.


CSCE 5610
Computer System Architecture

Instruction Level Parallelism

Single-Cycle Implementation

Why is the single-cycle implementation not used today?

● It requires the same clock-cycle length for every instruction.
● The clock cycle is determined by the longest possible path (the worst-case scenario!).
● The load instruction (lw) uses five functional units in series:
  Inst. Mem -> Reg. File -> ALU -> Data Mem. -> Reg. File

A Relevant Question

• Assuming you’ve got:
  — One washer (takes 30 minutes)
  — One dryer (takes 40 minutes)
  — One “folder” (takes 20 minutes)
• It takes 90 minutes to wash, dry, and fold 1 load of laundry.
  — How long do 4 loads take?

The Slow Way

[Timeline: 6 PM to midnight; the four loads run back to back, 30 + 40 + 20 minutes each]

• If each load is done sequentially, it takes 6 hours.


Laundry Pipelining

• Start each load as soon as possible
  — Overlap loads

[Timeline: loads overlap; one 30-minute wash, four back-to-back 40-minute dryer slots, one 20-minute fold]

• Pipelined laundry takes 3.5 hours

Instruction Execution Review

• Executing a MIPS instruction can take up to five steps.

  Step                Name   Description
  Instruction Fetch   IF     Read an instruction from memory.
  Instruction Decode  ID     Read source registers and generate control signals.
  Execute             EX     Compute an R-type result or a branch outcome.
  Memory              MEM    Read or write the data memory.
  Writeback           WB     Store a result in the destination register.

• However, as we saw, not all instructions need all five steps.

  Instruction   Steps required
  beq           IF  ID  EX
  R-type        IF  ID  EX  WB
  sw            IF  ID  EX  MEM
  lw            IF  ID  EX  MEM  WB
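
The two laundry totals quoted above can be checked with a quick calculation. This is a minimal sketch in Python, assuming the 30/40/20-minute unit times and four loads from the slides; the pipelined total is limited by the 40-minute dryer.

```python
# Laundry timing: sequential vs. pipelined (all times in minutes).
WASH, DRY, FOLD = 30, 40, 20
LOADS = 4

# Sequential: each load finishes completely before the next one starts.
sequential_min = LOADS * (WASH + DRY + FOLD)       # 4 * 90 = 360 min

# Pipelined: loads overlap, so after the first wash the dryer sets the pace.
pipelined_min = WASH + LOADS * DRY + FOLD          # 30 + 160 + 20 = 210 min

print(sequential_min / 60, "hours vs.", pipelined_min / 60, "hours")   # 6.0 vs. 3.5
```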

Single-cycle Datapath Diagram

[Single-cycle datapath diagram: PC, instruction memory (2ns), register file (1ns), ALU (2ns), data memory (2ns), sign extend, and the associated control signals]

• How long does it take to execute each instruction?

Single-cycle Review

• All five execution steps occur in one clock cycle.
• This means the cycle time must be long enough to accommodate all the steps of the most complex instruction—a “lw” in our instruction set.
  — If the register file has a 1ns latency and the memories and ALU have a 2ns latency, “lw” will require 8ns.
  — Thus all instructions will take 8ns to execute.
• Each hardware element can only be used once per clock cycle.
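
As a quick check on the 8ns figure, the sketch below just sums the unit latencies that lw passes through in series, using the 1ns/2ns numbers stated above.

```python
# Single-cycle lw latency: the five units are used back to back within one cycle.
unit_latency_ns = {
    "instruction memory": 2,
    "register read":      1,
    "ALU":                2,
    "data memory":        2,
    "register write":     1,
}

lw_ns = sum(unit_latency_ns.values())   # 2 + 1 + 2 + 2 + 1 = 8 ns
print(f"lw needs {lw_ns} ns, so every instruction gets an {lw_ns} ns clock cycle")
```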


Example: Instruction Fetch (IF)

• Let’s quickly review how lw is executed in the single-cycle datapath.
• We’ll ignore PC incrementing and branching for now.
• In the Instruction Fetch (IF) step, we read the instruction memory.

[Single-cycle datapath diagram]

Instruction Decode (ID)

• The Instruction Decode (ID) step reads the source registers from the register file.

[Single-cycle datapath diagram]

Execute (EX)

• The third step, Execute (EX), computes the effective memory address from the source register and the instruction’s constant field.

[Single-cycle datapath diagram]

Memory (MEM)

• The Memory (MEM) step involves reading the data memory at the address computed by the ALU.

[Single-cycle datapath diagram]
Writeback (WB)

• Finally, in the Writeback (WB) step, the memory value is stored into the destination register.

[Single-cycle datapath diagram]

A Bunch of Lazy Functional Units

• Notice that each execution step uses a different functional unit.
• In other words, the main units are idle for most of the 8ns cycle!
  — The instruction RAM is used for just 2ns at the start of the cycle.
  — Registers are read once in ID (1ns), and written once in WB (1ns).
  — The ALU is used for 2ns near the middle of the cycle.
  — Reading the data memory only takes 2ns as well.
• That’s a lot of hardware sitting around doing nothing.

Putting Those Slackers to Work

• We shouldn’t have to wait for the entire instruction to complete before we can reuse the functional units.
• For example, the instruction memory is free in the Instruction Decode step as shown below, so...

[Single-cycle datapath diagram: the instruction memory is idle during the Instruction Decode (ID) step]

Decoding and Fetching Together

• Why don’t we go ahead and fetch the next instruction while we’re decoding the first one?

[Datapath diagram: fetch the 2nd instruction while decoding the 1st instruction]
Executing, Decoding and Fetching

• Similarly, once the first instruction enters its Execute stage, we can go ahead and decode the second instruction.
• But now the instruction memory is free again, so we can fetch the third instruction!

[Datapath diagram: fetch the 3rd instruction, decode the 2nd, and execute the 1st]

Break Datapath into 5 Stages

• Each stage has its own functional units.
• Each stage can execute in 2ns.

[Datapath diagram divided into the five stages IF, ID, EX, MEM, and WB, each taking 2ns]

Pipelining Loads

                      Clock cycle
                      1    2    3    4    5    6    7    8    9
  lw $t0, 4($sp)      IF   ID   EX   MEM  WB
  lw $t1, 8($sp)           IF   ID   EX   MEM  WB
  lw $t2, 12($sp)               IF   ID   EX   MEM  WB
  lw $t3, 16($sp)                    IF   ID   EX   MEM  WB
  lw $t4, 20($sp)                         IF   ID   EX   MEM  WB

A Pipeline Diagram

                      Clock cycle
                      1    2    3    4    5    6    7    8    9
  lw $t0, 4($sp)      IF   ID   EX   MEM  WB
  sub $v0, $a0, $a1        IF   ID   EX   MEM  WB
  and $t1, $t2, $t3             IF   ID   EX   MEM  WB
  or $s0, $s1, $s2                  IF   ID   EX   MEM  WB
  add $sp, $sp, -4                       IF   ID   EX   MEM  WB

• A pipeline diagram shows the execution of a series of instructions.
  — The instruction sequence is shown vertically, from top to bottom.
  — Clock cycles are shown horizontally, from left to right.
  — Each instruction is divided into its component stages. (We show five stages for every instruction, which will make the control unit easier.)
• This clearly indicates the overlapping of instructions. For example, there are three instructions active in the third cycle above.
  — The “lw” instruction is in its Execute stage.
  — Simultaneously, the “sub” is in its Instruction Decode stage.
  — Also, the “and” instruction is just being fetched.
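
A pipeline diagram like the one above is easy to generate programmatically. The sketch below is illustrative only; the instruction strings and the fixed five-stage sequence come from the diagram above, and every instruction is assumed to enter the pipeline one cycle after the previous one.

```python
# Print a simple pipeline diagram: one row per instruction, one column per cycle.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(instructions):
    cycles = len(instructions) + len(STAGES) - 1
    print(" " * 20 + "".join(f"{c:>5}" for c in range(1, cycles + 1)))
    for i, name in enumerate(instructions):
        # Instruction i enters IF in cycle i + 1, so pad i empty columns first.
        cells = [""] * i + STAGES
        print(f"{name:<20}" + "".join(f"{s:>5}" for s in cells))

pipeline_diagram([
    "lw $t0, 4($sp)",
    "sub $v0, $a0, $a1",
    "and $t1, $t2, $t3",
    "or $s0, $s1, $s2",
    "add $sp, $sp, -4",
])
```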
Pipeline Terminology

                      Clock cycle
                      1    2    3    4    5    6    7    8    9
  lw $t0, 4($sp)      IF   ID   EX   MEM  WB
  sub $v0, $a0, $a1        IF   ID   EX   MEM  WB
  and $t1, $t2, $t3             IF   ID   EX   MEM  WB
  or $s0, $s1, $s2                  IF   ID   EX   MEM  WB
  add $sp, $sp, -4                       IF   ID   EX   MEM  WB
  (cycles 1-4: filling; cycle 5: full; cycles 6-9: emptying)

• The pipeline depth is the number of stages—in this case, five.
• In the first four cycles here, the pipeline is filling, since there are unused functional units.
• In cycle 5, the pipeline is full. Five instructions are being executed simultaneously, so all hardware units are in use.
• In cycles 6-9, the pipeline is emptying.

Pipelining Performance

[The same five-load pipeline diagram as before]

• Execution time on an ideal pipeline:
  — time to fill the pipeline + one cycle per instruction
  — N instructions -> 4 cycles + N cycles, or (2N + 8) ns for a 2ns clock period
• Compare with other implementations:
  — Single cycle: N cycles, or 8N ns for an 8ns clock period
• How much faster is pipelining for N = 1000?
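
Working the N = 1000 question out with the formulas above (4 cycles to fill, a 2ns pipelined clock, and an 8ns single-cycle clock), a minimal sketch:

```python
# Ideal pipelined vs. single-cycle execution time for N instructions.
N = 1000
FILL_CYCLES, PIPE_CLOCK_NS, SINGLE_CLOCK_NS = 4, 2, 8

pipelined_ns    = (FILL_CYCLES + N) * PIPE_CLOCK_NS   # (2N + 8) ns = 2008 ns
single_cycle_ns = N * SINGLE_CLOCK_NS                 # 8N ns       = 8000 ns

print(single_cycle_ns / pipelined_ns)                 # ~3.98, close to 4x
```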

Pipelining Other Instruction Types

• R-type instructions only require 4 stages: IF, ID, EX, and WB
  — We don’t need the MEM stage
• What happens if we try to pipeline loads with R-type instructions?

                      Clock cycle
                      1    2    3    4    5    6    7    8    9
  add $sp, $sp, -4    IF   ID   EX   WB
  sub $v0, $a0, $a1        IF   ID   EX   WB
  lw $t0, 4($sp)                IF   ID   EX   MEM  WB
  or $s0, $s1, $s2                   IF   ID   EX   WB
  lw $t1, 8($sp)                          IF   ID   EX   MEM  WB

Important Observation

• Each functional unit can only be used once per instruction.
• Each functional unit must be used at the same stage for all instructions. See the problem if:
  — Load uses the Register File’s Write Port during its 5th stage
  — R-type uses the Register File’s Write Port during its 4th stage
• In the diagram above, the lw $t0 and the or both reach WB in cycle 7, so they would need the register file’s write port at the same time (the sketch below walks through the same schedule).
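
The write-port clash can also be found mechanically. The sketch below schedules the same five instructions (stage lists as in the diagram above, each instruction entering IF one cycle after the previous one) and reports any cycle where two instructions both reach WB.

```python
# Detect register-file write-port conflicts when 4-stage R-type instructions
# are pipelined with 5-stage loads and no NOP stages are inserted.
from collections import defaultdict

program = [
    ("add $sp, $sp, -4",  ["IF", "ID", "EX", "WB"]),
    ("sub $v0, $a0, $a1", ["IF", "ID", "EX", "WB"]),
    ("lw $t0, 4($sp)",    ["IF", "ID", "EX", "MEM", "WB"]),
    ("or $s0, $s1, $s2",  ["IF", "ID", "EX", "WB"]),
    ("lw $t1, 8($sp)",    ["IF", "ID", "EX", "MEM", "WB"]),
]

writers = defaultdict(list)   # cycle number -> instructions writing registers then
for start, (name, stages) in enumerate(program, start=1):
    writers[start + stages.index("WB")].append(name)

for cycle, names in sorted(writers.items()):
    if len(names) > 1:
        print(f"cycle {cycle}: write-port conflict between {names}")
# Prints: cycle 7: write-port conflict between ['lw $t0, 4($sp)', 'or $s0, $s1, $s2']
```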
A solution: Insert NOP stages

• Enforce uniformity
  — Make all instructions take 5 cycles.
  — Make them have the same stages, in the same order.
• Some stages will do nothing for some instructions.

  R-type   IF   ID   EX   NOP   WB

                      Clock cycle
                      1    2    3    4    5    6    7    8    9
  add $sp, $sp, -4    IF   ID   EX   NOP  WB
  sub $v0, $a0, $a1        IF   ID   EX   NOP  WB
  lw $t0, 4($sp)                IF   ID   EX   MEM  WB
  or $s0, $s1, $s2                   IF   ID   EX   NOP  WB
  lw $t1, 8($sp)                          IF   ID   EX   MEM  WB

• Stores and Branches have NOP stages, too…

  store    IF   ID   EX   MEM   NOP
  branch   IF   ID   EX   NOP   NOP

Pipeline Registers

• We’ll add intermediate registers to our pipelined datapath too.
• There’s a lot of information to save, however. We’ll simplify our diagrams by drawing just one big pipeline register between each stage.
• The registers are named for the stages they connect:

  IF/ID    ID/EX    EX/MEM    MEM/WB

• No register is needed after the WB stage, because after WB the instruction is done.
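
As a rough picture of “a lot of information to save”, here is a sketch of what the four pipeline registers might hold. Only the register names IF/ID, ID/EX, EX/MEM, and MEM/WB come from the slides; the individual fields are illustrative assumptions.

```python
# Illustrative pipeline register contents (field names are assumptions, not from the slides).
from dataclasses import dataclass, field

@dataclass
class IF_ID:
    instruction: int = 0            # the fetched 32-bit instruction word

@dataclass
class ID_EX:
    reg_data_1: int = 0             # value read from source register 1
    reg_data_2: int = 0             # value read from source register 2
    sign_ext_imm: int = 0           # sign-extended constant field
    control: dict = field(default_factory=dict)   # EX/MEM/WB control signals

@dataclass
class EX_MEM:
    alu_result: int = 0             # ALU result or effective memory address
    write_data: int = 0             # value to store (for sw)
    control: dict = field(default_factory=dict)   # remaining MEM/WB control signals

@dataclass
class MEM_WB:
    mem_data: int = 0               # value read from data memory (for lw)
    alu_result: int = 0             # ALU result (for R-type instructions)
    control: dict = field(default_factory=dict)   # remaining WB control signals
```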

Pipelined Datapath

[Pipelined datapath diagram]

What about Control Signals?

• The control signals are generated in the same way as in the single-cycle processor—after an instruction is fetched, the processor decodes it and produces the appropriate control values.
• Control signals can be categorized by the pipeline stage that uses them.
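
One way to picture “categorized by the pipeline stage that uses them” is to group the control signal names from the datapath diagram by stage. The grouping below is a plausible sketch, not something the slides spell out:

```python
# Control signals grouped by the stage that consumes them (an illustrative grouping).
control_by_stage = {
    "EX":  ["RegDst", "ALUSrc", "ALUOp"],     # pick the ALU inputs and operation
    "MEM": ["MemRead", "MemWrite", "PCSrc"],  # data-memory access and next-PC select
    "WB":  ["RegWrite", "MemToReg"],          # write the register file from memory or the ALU
}
```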
Pipelined Datapath and Control

[Pipelined datapath and control diagrams]

An Example Execution Sequence

• Here’s a sample sequence of instructions to execute (addresses in decimal):

  1000: lw  $8, 4($29)
  1004: sub $2, $4, $5
  1008: and $9, $10, $11
  1012: or  $16, $17, $18
  1016: add $13, $14, $0

• We’ll make some assumptions, just so we can show actual data values.
  — Each register contains its number plus 100. For instance, register $8 contains 108, register $29 contains 129, and so forth.
  — Every data memory location contains 99.
• Our pipeline diagrams will follow some conventions.
  — An X indicates values that aren’t important, like the constant field of an R-type instruction.
  — Question marks ??? indicate values we don’t know, usually resulting from instructions coming before and after the ones in our example.

Cycle 1 (Filling) through Cycle 9

[Per-cycle datapath diagrams: Cycle 1 (Filling), Cycles 2-4, Cycle 5 (Full), Cycle 6 (Emptying), Cycles 7-9]
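
Under the stated assumptions (register $n holds n + 100, every data memory location holds 99), the values for the first instruction can be worked out directly. A minimal sketch:

```python
# Initial machine state from the assumptions above.
regs = {n: n + 100 for n in range(32)}   # register $n contains n + 100

def mem(addr):
    return 99                            # every data memory location contains 99

# 1000: lw $8, 4($29) -- compute the effective address, then load.
address = regs[29] + 4                   # 129 + 4 = 133
regs[8] = mem(address)                   # $8 becomes 99
print(address, regs[8])                  # 133 99
```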
That’s a Lot of Diagrams There

                      Clock cycle
                      1    2    3    4    5    6    7    8    9
  lw $t0, 4($sp)      IF   ID   EX   MEM  WB
  sub $v0, $a0, $a1        IF   ID   EX   MEM  WB
  and $t1, $t2, $t3             IF   ID   EX   MEM  WB
  or $s0, $s1, $s2                  IF   ID   EX   MEM  WB
  add $t5, $t6, $0                       IF   ID   EX   MEM  WB

• Compare the last nine slides with the pipeline diagram above.
  — You can see how instruction executions are overlapped.
  — Each functional unit is used by a different instruction in each cycle.
  — The pipeline registers save control and data values generated in previous clock cycles for later use.
  — When the pipeline is full in clock cycle 5, all of the hardware units are utilized. This is the ideal situation, and what makes pipelined processors so fast.

Performance Revisited

• Assuming the following functional unit latencies:

  Inst mem   Reg Read   ALU   Data Mem   Reg Write
  3ns        2ns        2ns   3ns        2ns

• What is the cycle time of a single-cycle implementation?
  — What is its throughput (how many instructions are finished in a unit of time)?
• What is the cycle time of an ideal pipelined implementation?
  — What is its steady-state throughput?
• How much faster is pipelining?
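
The answers follow from the latency table: the single-cycle clock must cover the sum of the unit latencies, while an ideal pipelined clock only has to cover the slowest unit. A minimal sketch (the next two slides confirm the 12ns, 3ns, and roughly 4x figures):

```python
# Cycle time and steady-state throughput with the 3/2/2/3/2 ns latencies above.
latencies_ns = [3, 2, 2, 3, 2]          # Inst mem, Reg read, ALU, Data mem, Reg write

single_cycle_ns = sum(latencies_ns)     # 12 ns cycle -> 1 instruction per 12 ns
pipelined_ns    = max(latencies_ns)     #  3 ns cycle -> 1 instruction per 3 ns (steady state)

print(single_cycle_ns, pipelined_ns, single_cycle_ns / pipelined_ns)   # 12 3 4.0
```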

Ideal Speedup

[The same five-instruction pipeline diagram as above]

• In our pipeline, we can execute up to five instructions simultaneously.
  — This implies that the maximum speedup is 5 times.
  — In general, the ideal speedup equals the pipeline depth.
• Why was our speedup on the previous slide only “4” times?
  — The pipeline stages are imbalanced: register file and ALU operations can be done in 2ns, but we must stretch that out to 3ns to keep the ID, EX, and WB stages synchronized with IF and MEM.
  — Balancing the stages is one of the many hard parts in designing a pipelined processor.

The Pipelining Paradox

[The same five-instruction pipeline diagram as above]

• Pipelining does not improve the execution time of any single instruction. Each instruction here actually takes longer to execute than in a single-cycle datapath (15ns vs. 12ns)!
• Instead, pipelining increases the throughput, or the amount of work done per unit time. Here, several instructions are executed together in each clock cycle.
• The result is improved execution time for a sequence of instructions, such as an entire program.
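
The 15ns-vs-12ns comparison, and why throughput still wins, can be checked with the same numbers. A minimal sketch, assuming a 3ns pipelined clock, a 12ns single-cycle clock, and a long run of instructions:

```python
# Per-instruction latency vs. whole-program time for a 5-stage pipeline.
DEPTH, PIPE_CLOCK_NS, SINGLE_CLOCK_NS = 5, 3, 12
N = 1000                                                # instructions in the program

latency_pipelined_ns = DEPTH * PIPE_CLOCK_NS            # 15 ns each -- worse than 12 ns
program_single_ns    = N * SINGLE_CLOCK_NS              # 12000 ns total
program_pipelined_ns = (DEPTH - 1 + N) * PIPE_CLOCK_NS  # 3012 ns total -- far better

print(latency_pipelined_ns, program_single_ns, program_pipelined_ns)
```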
