Chapter 04
The Processor
Instruction Decode and Operand Fetch
Read one or two registers, using fields of the instruction to select the registers from the register file (RF)
Execute
Use the ALU, depending on instruction class, to calculate
Arithmetic result
Memory address for load/store
Branch target address
Access data memory only for load/store
Result Store
Write the ALU result or memory data back into a register, using fields of the instruction to select the register
Next Instruction
PC = target address or PC + 4
Datapath vs Controller
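[Figure: the controller sends control signals to control points in the datapath]
Combinational building blocks (32-bit operands A and B):
Adder: Sum = A + B, with CarryIn and Carry out
MUX: Y = S ? A : B, where S is the Select input
ALU: Result = F(A, B), where F is chosen by the 4-bit ALU control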
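The three combinational elements above can be modeled in a few lines. This is only a sketch: the names `adder`, `mux`, `alu`, and the `ALU_OPS` table are hypothetical, and the string operation names stand in for the 4-bit ALU control encoding, which the slides do not spell out here.

```python
MASK32 = 0xFFFFFFFF  # keep every value within 32 bits


def adder(a, b, carry_in=0):
    """Return (sum, carry_out) for 32-bit operands: Sum = A + B."""
    total = (a & MASK32) + (b & MASK32) + carry_in
    return total & MASK32, total >> 32


def mux(select, a, b):
    """Y = S ? A : B, following the convention used on the slide."""
    return a if select else b


# Hypothetical operation table standing in for the 4-bit ALU control.
ALU_OPS = {
    "and": lambda a, b: a & b,
    "or": lambda a, b: a | b,
    "add": lambda a, b: (a + b) & MASK32,
    "sub": lambda a, b: (a - b) & MASK32,
}


def alu(op, a, b):
    """Result = F(A, B); the second value is the Zero flag."""
    result = ALU_OPS[op](a & MASK32, b & MASK32)
    return result, result == 0
```

Because each function depends only on its current inputs, the model has no state, which is what makes these elements combinational.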
Sequential Elements (1/2)
D-type flip-flop: stores data in a circuit
Uses a clock signal to determine when to update the
stored value
Edge-triggered: update when Clk changes from 0 to 1
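[Figure: D flip-flop with inputs D and Clk and output Q, with its timing diagram]
[Figure: D flip-flop with a Write enable; Q is updated only when Write is asserted]
[Figure: PC as a 32-bit register, incremented by 4 for the next instruction]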
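The edge-triggered behavior can be sketched as a small state-holding class. The class name `DFlipFlop` and the `tick` method are hypothetical; the optional `write` argument models the write-enable variant of the register.

```python
class DFlipFlop:
    """Edge-triggered storage element with an optional write enable."""

    def __init__(self, q=0):
        self.q = q            # stored value, visible on output Q
        self._prev_clk = 0    # last clock level seen, used to detect edges

    def tick(self, clk, d, write=1):
        """Present clk and d; update Q only on a 0-to-1 clock edge
        while write is asserted. Returns the current Q."""
        if self._prev_clk == 0 and clk == 1 and write:
            self.q = d
        self._prev_clk = clk
        return self.q
```

Holding `clk` at 1 while `d` changes leaves `q` untouched, which is the difference between an edge-triggered element and a level-sensitive latch.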
Step 3a: R-Format Instructions
Read two register operands
Perform arithmetic/logical operation
Write register result
Instruction fields (I-format):
Field   op      rs      rt      immediate
Bits    31:26   25:21   20:16   15:0
Width   6 bits  5 bits  5 bits  16 bits
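Selecting registers "using fields of the instruction" amounts to shifting and masking the 32-bit instruction word. The sketch below extracts the op, rs, rt, and immediate fields at the bit positions listed above; the function name `decode_i_format` is hypothetical.

```python
def decode_i_format(word):
    """Split a 32-bit instruction word into op/rs/rt/immediate fields."""
    return {
        "op": (word >> 26) & 0x3F,   # bits 31:26, 6 bits
        "rs": (word >> 21) & 0x1F,   # bits 25:21, 5 bits
        "rt": (word >> 16) & 0x1F,   # bits 20:16, 5 bits
        "immediate": word & 0xFFFF,  # bits 15:0, 16 bits
    }


# Build a word for a load (opcode 35) with rs = 1, rt = 10, offset = 20.
word = (35 << 26) | (1 << 21) | (10 << 16) | 20
fields = decode_i_format(word)
```

In hardware no shifting happens at all: each field is simply a bundle of wires taken from the instruction register.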
[Figure: register file (Read register 1 and 2, Write register, Write data, Read data 1 and 2, RegWrite), ALU (4-bit ALU operation, Zero, ALU result), data memory (Address, Write data, Read data, MemRead, MemWrite), and a 16-to-32-bit sign-extend unit]
R-Type/Load/Store Datapath
Sign extension: the sign-bit wire is replicated into the upper 16 bits
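Replicating the sign bit can be written out directly. This sketch assumes a 16-bit input and a 32-bit output, as in the 16-to-32 sign-extend unit of the datapath; the function name `sign_extend16` is hypothetical.

```python
def sign_extend16(imm16):
    """Extend a 16-bit value to 32 bits by replicating bit 15."""
    imm16 &= 0xFFFF
    sign = (imm16 >> 15) & 1
    upper = 0xFFFF if sign else 0x0000  # 16 copies of the sign bit
    return (upper << 16) | imm16
```

For example, `sign_extend16(0xFFFC)` gives `0xFFFFFFFC`, so the 16-bit encoding of -4 stays -4 in 32 bits.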
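[Figure: branch target logic: an adder computes PC + 4, the offset is shifted left 2 and added to it, and a mux selects the next PC]
[Figure: single-cycle R-type datapath with ideal instruction memory, PC, a register file of 32 32-bit registers (Rw, Ra, Rb, busA, busB, busW, RegWr), and a 32-bit ALU (ALUctr, Result); an annotation marks where the register write occurs]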
The Critical Path
Register file and ideal memory:
During read, behave as combinational logic:
Address valid => Output valid after access time
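[Figure: critical path from PC through the register file of 32 32-bit registers and the ALU to ideal data memory; PC, register file, and memory are clocked by Clk]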
Worst Case Timing (Load)
Clk
PC: Old Value -> New Value (after Clk-to-Q)
Rs, Rt, Rd, Op, Func: Old Value -> New Value (after Instruction Memory Access Time)
ALUctr: Old Value -> New Value (after Delay through Control Logic)
[Figure: instruction memory supplies Op, Funct, Rs <21:25>, Rt <16:20>, Rd <11:15>, and Imm16 <0:15> to Control, which drives PCsrc, RegDst, ALUSrc, MemWr, MemtoReg, RegWr, MemRd, and ALUctr into the Datapath to control the flow of data; the Datapath returns Equal]
Branch: F = subtract
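Instruction formats:
Load/Store   35 or 43   rs      rt      address
             31:26      25:21   20:16   15:0
Branch       4          rs      rt      address
             31:26      25:21   20:16   15:0
Jump         2          address
             31:26      25:0
[Figure: PCSrc mux selecting between PC + 4 and the branch target, which is computed by shifting the offset left 2 and adding it to PC + 4]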
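The next-PC choices implied by these formats can be sketched as below. The branch path follows the adder, sign-extend, and shift-left-2 elements named in the datapath. The jump path uses the standard MIPS rule of combining the upper 4 bits of PC + 4 with the 26-bit address shifted left 2, which the slides do not state explicitly here. All function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def branch_target(pc, imm16):
    """PC + 4 plus the sign-extended 16-bit offset shifted left 2."""
    offset = imm16 & 0xFFFF
    if offset & 0x8000:          # negative offset: undo two's complement
        offset -= 0x10000
    return (pc + 4 + (offset << 2)) & MASK32


def jump_target(pc, addr26):
    """Upper 4 bits of PC + 4 joined with the 26-bit address shifted left 2."""
    return ((pc + 4) & 0xF0000000) | ((addr26 & 0x3FFFFFF) << 2)


def next_pc(pc, pc_src, imm16):
    """PCSrc mux: 1 selects the branch target, 0 selects PC + 4."""
    return branch_target(pc, imm16) if pc_src else (pc + 4) & MASK32
```

A branch at address 40 with offset 7 targets 40 + 4 + 28 = 72, and with PCSrc = 0 the same instruction simply falls through to 44.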
[Figure: a block of acyclic combinational logic between two storage elements ==> the same logic split into acyclic combinational logic (A) and (B), with a storage element inserted between them]
§4.6 An Overview of Pipelining
Pipelining Analogy
Pipelined laundry: overlapping execution
Parallelism improves performance
Four loads:
Speedup = 8/3.5 ≈ 2.3
Non-stop (n loads):
Speedup = 2n/(0.5n + 1.5) ≈ 4 = number of stages
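Both speedup figures come from the same formula. The sketch below assumes four laundry stages of 0.5 hours each, which is what the 2n and 0.5n + 1.5 terms imply; the function name `laundry_speedup` is hypothetical.

```python
def laundry_speedup(loads, stages=4, stage_hours=0.5):
    """Speedup of pipelined over sequential laundry for a number of loads."""
    # Sequential: every load runs through all stages before the next starts.
    sequential = loads * stages * stage_hours
    # Pipelined: the first load fills the pipeline, then one load
    # finishes every stage_hours.
    pipelined = (stages - 1) * stage_hours + loads * stage_hours
    return sequential / pipelined


four_loads = laundry_speedup(4)        # 8 / 3.5
many_loads = laundry_speedup(10**6)    # approaches the number of stages
```

The fill time of 1.5 hours is a fixed cost, so it dominates for four loads and becomes negligible as the number of loads grows.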
[Figure: single-cycle datapath with instruction memory, registers, ALU, data memory, and sign extend, showing the control signals PCSrc, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, and the 4-bit ALU operation]
Pipeline Performance
Assume time for stages is
100ps for register read or write
200ps for other stages
Compare pipelined datapath with single-cycle
datapath
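The comparison can be worked through numerically. This sketch assumes the usual five stages (IF, ID, EX, MEM, WB), with register read happening in ID and register write in WB; the stage names and function names are assumptions, while the 100 ps and 200 ps figures are the ones given above.

```python
# Stage latencies in picoseconds, from the assumptions above.
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}


def single_cycle_period(stages=STAGE_PS):
    """The clock must fit the slowest instruction, which uses every stage."""
    return sum(stages.values())


def pipelined_period(stages=STAGE_PS):
    """The clock must fit the slowest single stage."""
    return max(stages.values())


def total_time(n_instructions, pipelined, stages=STAGE_PS):
    """Time to finish n instructions, ignoring hazards."""
    if pipelined:
        cycles = len(stages) + n_instructions - 1
        return cycles * pipelined_period(stages)
    return n_instructions * single_cycle_period(stages)
```

Under these assumptions the single-cycle clock is 800 ps and the pipelined clock is 200 ps, so three instructions take 2400 ps without pipelining and 1400 ps with it.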
MIPS ISA Designed for Pipelining
All instructions are 32 bits
Pipelining a RISC ISA is easier than pipelining a CISC ISA
Pipelining lw Instructions
[Figure: pipeline diagram over Cycles 1 to 7; Load, Instr 1, Instr 2, Instr 3, and Instr 4, in instruction order, each pass through Mem, Reg, ALU, Mem, Reg, starting one cycle apart]
EX:
latch into PC
WB: NOP
Clock cycle            CC 1  CC 2  CC 3  CC 4  CC 5    CC 6  CC 7  CC 8  CC 9
Value of register $2   10    10    10    10    10/-20  -20   -20   -20   -20
Program execution order (in instructions)
sub $2, $1, $3 IM Reg DM Reg
[Figure: pipelined datapath showing NextPC, Registers, ID/EX, ALU, EX/MEM, Data Memory, MEM/WR, Immediate, and the operand muxes]
In MIPS pipeline
[Figure: pipeline diagram; each instruction passes through Ifetch, Reg, ALU, DMem, Reg, starting one cycle after the previous one]
14: and r2,r3,r5
18: or r6,r1,r7
22: add r8,r1,r9
36: xor r10,r1,r11
[Figure: pipeline timing when the branch prediction is correct and when it is incorrect]
branch instruction
sequential successor 1
sequential successor 2
........
sequential successor n
branch target if taken
Branch delay of length n: the n sequential successors are executed whether or not the branch is taken.
Right-to-left flows (from MEM and WB) lead to hazards
[Figure: single-cycle datapath divided into pipeline stages, with PC, instruction memory, registers, sign extend, ALU, and data memory]
ID Stage of lw
Ex: lw rt,rs,imm16
A = Reg[IR[25-21]]; B = Sign-ext(IR[15-0])
[Figure: pipelined datapath with lw in the instruction decode stage]
WB Stage of lw
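Ex: lw rt,rs,imm16
Reg[IR[20-16]] = MDR
Wrong IR!! By the time lw reaches write back, IR holds a later instruction. Who will supply this address?
[Figure: pipelined datapath with lw in the write back stage]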
Example
lw $10, 20($1)
sub $11, $2, $3
add $12, $3, $4
lw $13, 24($1)
add $14, $5, $6
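The five instructions above write $10 through $14 and read only $1 through $6, so none depends on another and they can enter the pipeline one per cycle. The sketch below lays out which stage each instruction occupies in each clock cycle under that no-stall assumption; the stage names IF, ID, EX, MEM, WB and the helper names are assumptions.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

PROGRAM = [
    "lw $10, 20($1)",
    "sub $11, $2, $3",
    "add $12, $3, $4",
    "lw $13, 24($1)",
    "add $14, $5, $6",
]


def pipeline_chart(program, stages=STAGES):
    """Map each instruction to {cycle: stage}, assuming no stalls."""
    chart = []
    for issue, instr in enumerate(program):
        # Instruction number `issue` enters the first stage in cycle issue + 1.
        row = {issue + k + 1: stage for k, stage in enumerate(stages)}
        chart.append((instr, row))
    return chart


def format_chart(chart):
    """Render the chart as text, one instruction per line."""
    last_cycle = max(max(row) for _, row in chart)
    lines = []
    for instr, row in chart:
        cells = [row.get(c, "").ljust(4) for c in range(1, last_cycle + 1)]
        lines.append(instr.ljust(18) + "".join(cells).rstrip())
    return "\n".join(lines)


print(format_chart(pipeline_chart(PROGRAM)))
```

With five instructions and five stages the last instruction writes back in cycle 9.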
[Figure: control signals such as ExtOp, ALUSrc, and ALUOp are carried through the IF/ID, ID/Ex, Ex/MEM, and MEM/WB registers across the ID, EX, MEM, and WB stages]
Do forward only if (1) data hazard conditions are true; and (2)
the forwarding instruction will write to a register and the
destination register is not $zero
Check if EX/MEM.RegWrite is active for 1a/1b and if
MEM/WB.RegWrite is active for 2a/2b
Mux control     Source   Explanation
ForwardA = 00   ID/EX    The first ALU operand comes from the register file.
ForwardB = 00   ID/EX    The second ALU operand comes from the register file.
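The forwarding conditions can be sketched as a small function. Only the 00 encoding appears above; the 10 (forward from EX/MEM) and 01 (forward from MEM/WB) encodings, the dictionary layout of the pipeline registers, and the function name are assumptions that follow the common textbook forwarding unit.

```python
def forwarding_unit(id_ex_rs, id_ex_rt, ex_mem, mem_wb):
    """Return (ForwardA, ForwardB) mux controls.

    ex_mem and mem_wb are dicts with "RegWrite" and "Rd" keys
    (a hypothetical layout for the pipeline registers).
    """
    def select(src):
        # 1a/1b: the instruction in EX/MEM writes a non-$zero register
        # that matches this ALU source.
        if ex_mem["RegWrite"] and ex_mem["Rd"] != 0 and ex_mem["Rd"] == src:
            return 0b10
        # 2a/2b: same check against MEM/WB, used only when the more
        # recent EX/MEM result does not already apply.
        if mem_wb["RegWrite"] and mem_wb["Rd"] != 0 and mem_wb["Rd"] == src:
            return 0b01
        return 0b00  # operand comes from the register file via ID/EX

    return select(id_ex_rs), select(id_ex_rt)
```

Checking EX/MEM first means the most recent result wins when both pipeline registers hold a value for the same destination.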
Need to stall for one cycle
Stall inserted here
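A one-cycle stall of this kind is what a hazard detection unit inserts when forwarding alone cannot help. The sketch below assumes the stall in question is the load-use case, where a load in EX is followed by an instruction in ID that reads the loaded register; the signal names follow the common textbook hazard detection unit and the function name is hypothetical.

```python
def load_use_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """Return True when the instruction in ID must wait one cycle.

    The instruction in EX is a load (MemRead asserted) and its
    destination rt matches a source register of the instruction in ID.
    """
    return bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)
```

When the function returns True, the usual response is to hold the PC and IF/ID register and set the control values entering ID/EX to 0, which creates the bubble.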
Or, more
accurately…
Pipelined Control with Forwarding Unit and Hazard Detection Unit
Original Datapath
Flush these instructions (set control values to 0)
Clock Cycle 4 after …
One pipeline bubble on a taken branch
Data Hazards for Branch -- I
If the comparison registers are a destination of 1st and
2nd preceding ALU instruction
IF ID EX MEM WB
beq stalled IF ID
beq stalled IF ID
beq stalled ID
[Figure: pipeline diagram in instruction order: add, beq, misc, lw; each passes through Mem, Reg, ALU, Mem, Reg]
0 clock cycle penalty per branch instruction if an instruction can be found to fill the delay slot
1. A is the best choice, fills delay slot & reduces instruction count (IC)
2. In B, the sub instruction may need to be copied, increasing IC
3. In B and C, must be okay to execute sub when branch fails
Delayed-Branch Scheduling Schemes
outer: …
…
inner: …
… T NT
beq …, …, inner
…
beq …, …, outer
Note: when the exception is not vectored, a single entry point for all
exceptions is used.
Re-startable exception
E.g. cache miss
Refetch then execute the instruction from scratch
Clock 6
Clock 7
Address   Instruction type   Pipeline stages
n         ALU/branch         IF ID EX MEM WB
n + 12    Load/store         IF ID EX MEM WB
n + 16    ALU/branch         IF ID EX MEM WB
n + 20    Load/store         IF ID EX MEM WB
Hold pending
operands
[Figure: Performance (vs. VAX-11/780) from 1978 to 2006 on a log scale from 1 to 1000; growth of 25%/year, then 52%/year, then ??%/year; annotated with pipelining, data locality, and parallel processing]
Chapter 4 — The Processor — 165
§4.11 Real Stuff: The ARM Cortex-A8 and Intel Core i7 Pipelines
Cortex A8 and Intel i7
Processor                       ARM A8                   Intel Core i7 920
Market                          Personal Mobile Device   Server, cloud
Thermal design power            2 Watts                  130 Watts
Clock rate                      1 GHz                    2.66 GHz
Cores/Chip                      1                        4
Floating point?                 No                       Yes
Multiple issue?                 Dynamic                  Dynamic
Peak instructions/clock cycle   2                        4
Pipeline stages                 14                       14
Pipeline schedule               Static in-order          Dynamic out-of-order with speculation
Branch prediction               2-level                  2-level
1st level caches/core           32 KiB I, 32 KiB D       32 KiB I, 32 KiB D
2nd level caches/core           128-1024 KiB             256 KiB
3rd level caches (shared)       -                        2-8 MB