Chapter 12 Performance of Single-Cycle and Multi-Cycle Data Path
We just saw a single-cycle datapath and control unit for our simple MIPS-
based instruction set.
Single-cycle implementation: An implementation in which every
instruction is executed in one clock cycle. While easy to understand, it is
too slow to be practical.
The single-cycle design again…
The example add from last time
Consider the instruction add $s4, $t1, $t2.
How the add goes through the datapath
[Figure: the single-cycle datapath executing add $s4, $t1, $t2. The instruction fields are I[25-21] = 01001 ($t1), I[20-16] = 01010 ($t2) and I[15-11] = 10100 ($s4). The values 00...01 and 00...10 are read from the register file, and the ALU result 00...11 is written to the destination register.]
The slowest instruction...
If all instructions must complete within one clock cycle, then the cycle
time has to be large enough to accommodate the slowest instruction.
For example, lw $t0, -4($sp) needs 8ns, assuming the delays shown here.
  reading the instruction memory      2ns
  reading the base register $sp       1ns
  computing memory address $sp-4      2ns
  reading the data memory             2ns
  storing data back to $t0            1ns
  total                               8ns
[Figure: the single-cycle datapath annotated with the assumed delays: 2 ns for the instruction memory, 1 ns for the register file, 2 ns for the ALU and 2 ns for the data memory. The remaining elements are marked 0 ns.]
...determines the clock cycle time
If we make the cycle time 8ns, then every instruction takes 8ns, even
if it doesn’t need that much time. The clock rate is 1 / 8ns = 125MHz.
For example, the instruction add $s4, $t1, $t2 really needs just 6ns.
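The link between the slowest instruction and the clock rate can be checked with a few lines of Python. This is only a sketch: the delays are the ones assumed for the lw example, and the names DELAY_NS and PATHS are made up for illustration.

```python
# Component delays in ns, taken from the lw example above.
DELAY_NS = {"imem": 2, "reg_read": 1, "alu": 2, "dmem": 2, "reg_write": 1}

# Functional units each instruction passes through, in order (illustrative names).
PATHS = {
    "lw":  ["imem", "reg_read", "alu", "dmem", "reg_write"],
    "add": ["imem", "reg_read", "alu", "reg_write"],
}

def latency_ns(instr):
    """Time an instruction actually needs: the sum of the delays on its path."""
    return sum(DELAY_NS[unit] for unit in PATHS[instr])

cycle_ns = max(latency_ns(i) for i in PATHS)   # the slowest instruction sets the clock
clock_mhz = 1000 / cycle_ns                    # a period in ns gives a rate in GHz; x1000 for MHz

print(latency_ns("lw"), latency_ns("add"))     # 8 6
print(cycle_ns, clock_mhz)                     # 8 125.0
print(cycle_ns - latency_ns("add"))            # 2 (ns of idle time on every add)
```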
How bad is this?
With these same component delays, a sw instruction would need 7ns, and
beq would need just 5ns.
Let’s consider the instruction mix of the gcc program.
  Instruction    Frequency
  Arithmetic     48%
  Loads          22%
  Stores         11%
  Branches       19%
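To put a number on the waste, each instruction class can be weighted by its frequency. The sketch below assumes a hypothetical machine whose clock could stretch to fit each instruction exactly (6ns for arithmetic, 8ns for loads, 7ns for stores, 5ns for branches, the figures quoted above) and compares it with the fixed 8ns cycle. The variable names are illustrative.

```python
# gcc instruction mix (percent) and the time each class really needs (ns).
MIX_PERCENT = {"arithmetic": 48, "loads": 22, "stores": 11, "branches": 19}
NEEDED_NS = {"arithmetic": 6, "loads": 8, "stores": 7, "branches": 5}
SINGLE_CYCLE_NS = 8  # every instruction is charged the lw time

# Average time per instruction if each one took only what it needs.
ideal_avg_ns = sum(MIX_PERCENT[c] * NEEDED_NS[c] for c in MIX_PERCENT) / 100

print(round(ideal_avg_ns, 2))                     # 6.36
print(round(SINGLE_CYCLE_NS / ideal_avg_ns, 2))   # 1.26
```

Under these assumptions, the fixed 8ns clock is about 1.26 times slower than the ideal variable-length clock.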
Disadvantage of single-cycle implementation
The clock cycle is determined by the longest possible path, which does
not belong to the most common instruction. This type of implementation
violates the idea of making the common case fast.
It may also be wasteful with respect to area, because functional units
such as adders must be duplicated; they cannot be shared within a single
clock cycle.
This is also why we used a Harvard architecture with two memories; you
can’t easily read two addresses from the same memory in one cycle.
Example:
— We’ve made very optimistic assumptions about memory latency:
• Main memory accesses on modern machines take >50ns.
For comparison, an ALU on the Pentium 4 takes ~0.3ns.
— Our worst case cycle (loads/stores) includes 2 memory accesses
• A modern single-cycle implementation would be stuck at <10MHz.
• Caches will improve common case access time, not worst case.
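The <10MHz bound follows directly from the two memory accesses. A quick check, assuming exactly 50ns per access and ignoring every other delay (the constant names are illustrative):

```python
MEM_ACCESS_NS = 50        # lower bound quoted above for a main memory access
ACCESSES_PER_CYCLE = 2    # instruction fetch plus data access for a load or store

min_cycle_ns = MEM_ACCESS_NS * ACCESSES_PER_CYCLE   # ignores the ALU and register file
max_clock_mhz = 1000 / min_cycle_ns

print(min_cycle_ns, max_clock_mhz)   # 100 10.0
```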
A multistage approach to instruction execution
A multicycle implementation fixes some shortcomings in the single-cycle
implementation.
— Faster instructions are not held back by slower ones.
— The clock cycle time can be decreased.
— We don’t have to duplicate any hardware units.
A multicycle processor requires a somewhat simpler datapath which we’ll
see today, but a more complex control unit that we’ll see later.
The clock cycle
Things are simpler if we assume that each “stage” takes one clock cycle.
— This means instructions will require multiple clock cycles to execute.
— But since a single stage is fairly simple, the cycle time can be low.
For the proposed execution stages below and the sample datapath delays
shown earlier, each stage needs 2ns at most.
— This accounts for the slowest devices, the ALU and data memory.
— A 2ns clock cycle time corresponds to a 500MHz clock rate!
Cost benefits
As an added bonus, we can eliminate some of the extra hardware from
the single-cycle datapath.
— We will restrict ourselves to using each functional unit once per cycle,
just like before.
— But since instructions require multiple cycles, we could reuse some
units in a different cycle during the execution of a single instruction.
For example, we could use the same ALU:
— to increment the PC (first clock cycle), and
— for arithmetic operations (third clock cycle).
Two extra adders
The extra single-cycle adders
[Figure: the single-cycle datapath, with the two extra adders at the top: one adds 4 to the PC, the other adds the sign-extended offset, shifted left 2, to form the branch target.]
Our new adder setup
We can eliminate both extra adders in a multicycle datapath, and instead
use just one ALU, with multiplexers to select the proper inputs.
A 2-to-1 mux ALUSrcA sets the first ALU input to be the PC or a register.
A 4-to-1 mux ALUSrcB selects the second ALU input from among:
— the register file (for arithmetic operations),
— a constant 4 (to increment the PC),
— a sign-extended constant (for effective addresses), and
— a sign-extended and shifted constant (for branch targets).
This permits a single ALU to perform all of the necessary functions.
— Arithmetic operations on two register operands.
— Incrementing the PC.
— Computing effective addresses for lw and sw.
— Adding a sign-extended, shifted offset to (PC + 4) for branches.
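The two multiplexers can be modelled in a few lines of Python. This is a sketch, not the actual hardware description: the select encodings (0 = PC for ALUSrcA, and 0 to 3 for ALUSrcB in the order listed above) are assumptions, and the function names are made up for illustration.

```python
def sign_extend16(imm):
    """Interpret the low 16 bits of imm as a two's-complement number."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm

def alu_inputs(alu_src_a, alu_src_b, pc, reg_a, reg_b, imm16):
    """Return the two ALU operands chosen by the ALUSrcA and ALUSrcB muxes."""
    first = pc if alu_src_a == 0 else reg_a
    second = [
        reg_b,                      # 0: register file (arithmetic)
        4,                          # 1: constant 4 (PC increment)
        sign_extend16(imm16),       # 2: sign-extended constant (lw/sw address)
        sign_extend16(imm16) << 2,  # 3: sign-extended, shifted constant (branch target)
    ][alu_src_b]
    return first, second

# PC increment: PC + 4
print(sum(alu_inputs(0, 1, pc=100, reg_a=0, reg_b=0, imm16=0)))        # 104
# Effective address: offset -4 from a base register holding 2000
print(sum(alu_inputs(1, 2, pc=0, reg_a=2000, reg_b=0, imm16=0xFFFC)))  # 1996
# Branch target: the PC already holds PC + 4, word offset 3
print(sum(alu_inputs(0, 3, pc=104, reg_a=0, reg_b=0, imm16=3)))        # 116
```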
The multicycle adder setup highlighted
[Figure: the multicycle datapath so far, highlighting the single ALU and its input multiplexers: ALUSrcA (PC or register data 1) and ALUSrcB (register data 2, the constant 4, the sign-extended constant, or the sign-extended constant shifted left 2).]
Eliminating a memory
Similarly, we can get by with one unified memory, which will store both
program instructions and data (a Princeton architecture).
This memory is used in both the instruction fetch and data access stages,
and the address could come from either:
— the PC register (when we’re fetching an instruction), or
— the ALU output (for the effective address of a lw or sw).
We add another 2-to-1 mux, IorD, to decide whether the memory is being
accessed for instructions or for data.
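The address selection can be sketched as a simple function. The encoding (0 for an instruction fetch, 1 for a data access) and the toy memory contents are assumptions made for illustration.

```python
def memory_address(i_or_d, pc, alu_out):
    """IorD mux: 0 selects the PC (instruction fetch), 1 the ALU output (data access)."""
    return pc if i_or_d == 0 else alu_out

# One unified memory holding both code and data (toy contents, address -> value).
memory = {0: "lw $t0, -4($sp)", 1996: 42}

print(memory[memory_address(0, pc=0, alu_out=1996)])   # lw $t0, -4($sp)
print(memory[memory_address(1, pc=0, alu_out=1996)])   # 42
```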
The new memory setup highlighted
[Figure: the multicycle datapath so far, highlighting the single memory and the IorD multiplexer that selects its address from the PC or the ALU result.]
Intermediate registers
Sometimes we need the output of a functional unit in a later clock cycle
during the execution of one instruction.
— The instruction word fetched in stage 1 determines the destination of
the register write in stage 5.
— The ALU result for an address computation in stage 3 is needed as the
memory address for lw or sw in stage 4.
These outputs will have to be stored in intermediate registers for future
use. Otherwise they would probably be lost by the next clock cycle.
— The instruction read in stage 1 is saved in the instruction register.
— Register file outputs from stage 2 are saved in registers A and B.
— The ALU output will be stored in a register ALUOut.
— Any data fetched from memory in stage 4 is kept in the Memory data
register, also called MDR.
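To see why each intermediate register is needed, here is a sketch that steps lw $t0, -4($sp) through the five stages, one assignment per latch. The register and memory contents are invented for the example, and the instruction is stored already decoded to keep the code short.

```python
regs = {"$sp": 2000, "$t0": 0}
memory = {0: ("lw", "$t0", "$sp", -4), 1996: 42}   # instruction pre-decoded for brevity
pc = 0

# Stage 1: fetch. The instruction is latched in IR; the PC is incremented.
IR = memory[pc]
pc = pc + 4

# Stage 2: decode and register read. The base register value is latched in A.
op, rt, rs, offset = IR
A = regs[rs]

# Stage 3: the ALU computes the effective address, latched in ALUOut.
ALUOut = A + offset

# Stage 4: memory access. The loaded word is latched in MDR.
MDR = memory[ALUOut]

# Stage 5: write back. IR (from stage 1) still names the destination register.
regs[rt] = MDR

print(ALUOut, regs["$t0"], pc)   # 1996 42 4
```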
The final multicycle datapath
[Figure: the final multicycle datapath, with the intermediate registers (instruction register, memory data register, A, B, ALUOut) and the control signals PCWrite, IorD, MemRead, MemWrite, IRWrite, RegDst, RegWrite, MemToReg, ALUSrcA, ALUSrcB, ALUOp and PCSource.]
Register write control signals
We have to add a few more control signals to the datapath.
Since instructions now take a variable number of cycles to execute, we
cannot update the PC on each cycle.
— Instead, a PCWrite signal controls the loading of the PC.
— The instruction register also has a write signal, IRWrite. We need to
keep the instruction word for the duration of its execution, and must
explicitly re-load the instruction register when needed.
The other intermediate registers, MDR, A, B and ALUOut, will store data
for only one clock cycle at most, and do not need write control signals.
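The difference between a register with a write signal and one without can be sketched as follows. The Register class and the three-cycle loop are hypothetical; they only illustrate that the PC holds its value while PCWrite is deasserted, whereas a register such as ALUOut is reloaded on every cycle.

```python
class Register:
    """A register that only changes when its write signal is asserted at the clock edge."""
    def __init__(self, value=0):
        self.value = value

    def clock(self, data_in, write=True):
        if write:
            self.value = data_in

pc = Register(0)
alu_out = Register()          # no write control: reloaded every cycle

# Three cycles of one instruction: PCWrite is asserted only in the first.
for cycle, pc_write in enumerate([True, False, False]):
    pc.clock(pc.value + 4, write=pc_write)
    alu_out.clock(cycle)      # stand-in for whatever the ALU produced this cycle

print(pc.value, alu_out.value)   # 4 2
```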
The single-cycle datapath; what is the cycle time?
[Figure: the single-cycle datapath annotated with delays of 2ns for each memory, 1ns for the register file and 2ns for the ALU.]
[Figure: the multicycle datapath annotated with the same delays: 2ns for the memory, 1ns for the register file and 2ns for the ALU.]
Comparing cycle times
The clock period has to be long enough to allow all of the required work
to complete within the cycle.
In the single-cycle datapath, the “required work” was just the complete
execution of any instruction.
— The longest instruction, lw, requires 8ns.
— So the clock cycle time has to be 8ns, for a 125MHz clock rate.
For the multicycle datapath, the “required work” is only a single stage.
— The longest delay is 2ns, for both the ALU and the memory.
— So our cycle time has to be 2ns, or a clock rate of 500MHz.
— The register file needs only 1ns, but it must wait an extra 1ns to stay
synchronized with the other functional units.
The single-cycle cycle time is limited by the slowest instruction, whereas
the multicycle cycle time is limited by the slowest functional unit.
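The two rules can be written as two max() calls. The sketch below reuses the delays and instruction times quoted in these slides; the dictionary names are illustrative.

```python
UNIT_DELAY_NS = {"memory": 2, "register file": 1, "ALU": 2}
INSTR_LATENCY_NS = {"lw": 8, "sw": 7, "add": 6, "beq": 5}   # single-cycle totals from earlier

single_cycle_ns = max(INSTR_LATENCY_NS.values())   # slowest instruction
multicycle_ns = max(UNIT_DELAY_NS.values())        # slowest functional unit

for name, ns in [("single-cycle", single_cycle_ns), ("multicycle", multicycle_ns)]:
    print(f"{name}: {ns}ns cycle, {1000 / ns:.0f}MHz")
# single-cycle: 8ns cycle, 125MHz
# multicycle: 2ns cycle, 500MHz
```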
Comparing instruction execution times
In the single-cycle datapath, each instruction needs an entire clock cycle,
or 8ns, to execute.
With the multicycle CPU, different instructions need different numbers of
clock cycles, and hence different amounts of time.
— A branch needs 3 cycles, or 3 x 2ns = 6ns.
— Arithmetic and sw instructions each require 4 cycles, or 8ns.
— Finally, a lw takes 5 cycles, or 10ns.
We can make some observations about performance already.
— Loads take longer with this multicycle implementation (10ns instead of
8ns), arithmetic and sw instructions take the same 8ns as before, and
branches are faster (6ns).
— So if our program has many branches and not too many loads, then we
should see an increase in performance.
The gcc example
Let’s assume the instruction mix of the gcc program again.
  Instruction    Frequency
  Arithmetic     48%
  Loads          22%
  Stores         11%
  Branches       19%
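With the cycle counts given earlier (4 for arithmetic and stores, 5 for loads, 3 for branches) and a 2ns clock, the average time per instruction for this mix can be worked out as below. This is a sketch under those assumptions only; the variable names are illustrative.

```python
MIX_PERCENT = {"arithmetic": 48, "loads": 22, "stores": 11, "branches": 19}
CYCLES = {"arithmetic": 4, "loads": 5, "stores": 4, "branches": 3}
MULTICYCLE_NS = 2
SINGLE_CYCLE_NS = 8

# Average cycles per instruction (CPI), weighted by the instruction mix.
cpi = sum(MIX_PERCENT[c] * CYCLES[c] for c in MIX_PERCENT) / 100
multicycle_avg_ns = cpi * MULTICYCLE_NS

print(round(cpi, 2))                 # 4.03
print(round(multicycle_avg_ns, 2))   # 8.06
print(SINGLE_CYCLE_NS)               # 8
```

Under these assumptions the two designs come out almost even (8.06ns against 8ns per instruction). A mix with fewer loads or more branches would tip the balance toward the multicycle design.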
Summary
A single-cycle CPU has two main disadvantages.
— The cycle time is limited by the worst case latency.
— It requires more hardware than necessary.
A multicycle processor splits instruction execution into several stages.
— Instructions only execute as many stages as required.
— Each stage is relatively simple, so the clock cycle time is reduced.
— Functional units can be reused on different cycles.
We made several modifications to the single-cycle datapath.
— The two extra adders and one memory were removed.
— Multiplexers were inserted so the ALU and memory can be used for
different purposes in different execution stages.
— New registers are needed to store intermediate results.
We will look at the pipelined approach to the datapath next week.