The Processor: CPU Performance Factors
The Hardware/Software Interface
Chapter 4
The Processor
§4.1 Introduction
Introduction
- CPU performance factors
  - Instruction count
    - Determined by ISA and compiler
  - CPI and cycle time
    - Determined by CPU hardware
- We will examine two LEGv8 implementations
  - A simplified version
  - A more realistic pipelined version
- Simple subset, shows most aspects
  - Memory reference: LDUR, STUR
  - Arithmetic/logical: ADD, SUB, AND, ORR
  - Control transfer: CBZ, B
CPU Overview
Control
Combinational Elements
- AND-gate
  - Y = A & B
- Adder
  - Y = A + B
- Multiplexer
  - Y = S ? I1 : I0
- Arithmetic/Logic Unit
  - Y = F(A, B)
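The four combinational elements can be modelled as pure functions of their inputs. The following is a minimal sketch, not part of the slides: the function names, the 64-bit width, and the set of ALU operation names are assumptions chosen for illustration.

```python
MASK64 = (1 << 64) - 1  # assumed 64-bit datapath width


def and_gate(a, b):
    """Y = A & B"""
    return a & b


def adder(a, b):
    """Y = A + B, wrapped to the assumed datapath width."""
    return (a + b) & MASK64


def mux(s, i0, i1):
    """Y = S ? I1 : I0"""
    return i1 if s else i0


def alu(f, a, b):
    """Y = F(A, B); the operation names are hypothetical labels for F."""
    ops = {
        "and": a & b,
        "or": a | b,
        "add": (a + b) & MASK64,
        "sub": (a - b) & MASK64,
    }
    return ops[f]
```

Each output depends only on the current inputs, which is what distinguishes these elements from the clocked registers that follow.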
[Figure: register with data input D, clock input Clk, and output Q]
Sequential Elements
- Register with write control
  - Only updates on clock edge when write control input is 1
  - Used when stored value is required later
[Figure: register with inputs D, Write, and Clk, and output Q]
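The write-controlled register can be sketched as a small class whose state changes only when a clock edge arrives with Write asserted. The class and method names here are hypothetical, chosen only for illustration.

```python
class Register:
    """Edge-triggered register with a write-control input (illustrative model)."""

    def __init__(self, value=0):
        self.q = value  # stored value, visible on output Q

    def clock_edge(self, d, write):
        """Model one clock edge: capture D only when write is 1."""
        if write:
            self.q = d
        return self.q
```

Calling `clock_edge(7, write=0)` leaves Q unchanged, while `clock_edge(7, write=1)` updates it, so the stored value stays available for later cycles.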
[Figure: instruction fetch. The PC register is incremented by 4 to address the next 32-bit instruction.]
R-Format Instructions
- Read two register operands
- Perform arithmetic/logical operation
- Write register result
Branch Instructions
- Read register operands
- Compare operands
  - Use ALU, subtract and check Zero output
- Calculate target address
  - Sign-extend displacement
  - Shift left 2 places (word displacement)
  - Add to PC + 4
    - Already calculated by instruction fetch
[Figure annotation: sign-bit wire replicated]
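The target-address steps (sign-extend, shift left 2, add) can be sketched as below. This is an illustration under assumptions: the 19-bit default field width and the 64-bit address mask are not stated on the slide, and `base` stands for whichever PC value the datapath feeds to the branch adder.

```python
def sign_extend(value, bits):
    """Replicate the sign bit of a `bits`-wide field into a Python int."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def branch_target(base, disp, bits=19):
    """Sign-extend the displacement, shift left 2 (word offset), add to base.

    `bits=19` is an assumed field width; the result is wrapped to 64 bits.
    """
    return (base + (sign_extend(disp, bits) << 2)) & ((1 << 64) - 1)
```

For example, a displacement of 3 words from base 0x1000 gives 0x100C, and an all-ones 19-bit field (that is, -1) gives 0x0FFC.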
Full Datapath
ALU Control
- Assume 2-bit ALUOp derived from opcode
  - Combinational logic derives ALU control

opcode  ALUOp  Operation                   Opcode field  ALU function  ALU control
LDUR    00     load register               XXXXXXXXXXX   add           0010
STUR    00     store register              XXXXXXXXXXX   add           0010
CBZ     01     compare and branch on zero  XXXXXXXXXXX   pass input b  0111
R-type  10     add                         100000        add           0010
               subtract                    100010        subtract      0110
               AND                         100100        AND           0000
               ORR                         100101        OR            0001
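The table can be read as a two-level decode: ALUOp alone settles loads, stores, and CBZ, and only R-type instructions need to look further. A minimal sketch of that logic follows; the function name is hypothetical, and for simplicity it keys R-type instructions on the operation name from the table rather than on the opcode-field bits.

```python
# ALU control codes as listed in the table
ADD, SUB, AND, ORR, PASS_B = "0010", "0110", "0000", "0001", "0111"

# R-type operation name -> ALU control (names taken from the Operation column)
R_TYPE = {"add": ADD, "subtract": SUB, "AND": AND, "ORR": ORR}


def alu_control(alu_op, operation=None):
    """Return the 4-bit ALU control string for a 2-bit ALUOp."""
    if alu_op == "00":      # LDUR / STUR: add for address calculation
        return ADD
    if alu_op == "01":      # CBZ: pass input b
        return PASS_B
    if alu_op == "10":      # R-type: depends on the specific operation
        return R_TYPE[operation]
    raise ValueError(f"unsupported ALUOp {alu_op!r}")
```

The "don't care" opcode fields in the table correspond to the branches that ignore `operation`.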
Load Instruction
Performance Issues
- Longest delay determines clock period
  - Critical path: load instruction
  - Instruction memory → register file → ALU → data memory → register file
- Not feasible to vary period for different instructions
- Violates design principle
  - Making the common case fast
- We will improve performance by pipelining
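To see why the load sets the clock period, one can sum the delay of each unit along every instruction's path and take the maximum. The sketch below uses hypothetical unit latencies (200 ps for memories and ALU, 100 ps for register access); they are assumptions, chosen so that the load path adds up to the 800 ps single-cycle period shown on the Pipeline Performance slide.

```python
# Assumed unit latencies in picoseconds (not given on this slide)
LATENCY_PS = {
    "instr_mem": 200,
    "reg_read": 100,
    "alu": 200,
    "data_mem": 200,
    "reg_write": 100,
}

# Units each instruction class passes through, in order
PATHS = {
    "load": ["instr_mem", "reg_read", "alu", "data_mem", "reg_write"],
    "store": ["instr_mem", "reg_read", "alu", "data_mem"],
    "r_format": ["instr_mem", "reg_read", "alu", "reg_write"],
    "branch": ["instr_mem", "reg_read", "alu"],
}


def path_delay(kind):
    """Total delay along one instruction class's path."""
    return sum(LATENCY_PS[unit] for unit in PATHS[kind])


def clock_period():
    """Single-cycle clock period: the longest path over all classes."""
    return max(path_delay(kind) for kind in PATHS)
```

Under these assumptions every instruction, including a 500 ps branch, is charged the full 800 ps of the load.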
- Four loads:
  - Speedup = 8/3.5 = 2.3
- Non-stop:
  - Speedup = 2n/(0.5n + 1.5) ≈ 4 = number of stages
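The two speedup figures follow from one formula. As a sketch, assume four stages of 0.5 time units each, which is what the 2n and 0.5n + 1.5 terms imply; the function and parameter names are illustrative.

```python
def speedup(n, stages=4, stage_time=0.5):
    """Speedup of a pipeline over sequential execution for n loads."""
    sequential = n * stages * stage_time        # 2n for the defaults
    pipelined = (stages + n - 1) * stage_time   # 0.5n + 1.5 for the defaults
    return sequential / pipelined
```

With n = 4 this gives 8/3.5, and as n grows the fill time of 1.5 units becomes negligible, so the ratio approaches the number of stages.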
LEGv8 Pipeline
- Five stages, one step per stage
1. IF: Instruction fetch from memory
2. ID: Instruction decode & register read
3. EX: Execute operation or calculate address
4. MEM: Access memory operand
5. WB: Write result back to register
Pipeline Performance
Single-cycle (Tc = 800 ps)
Structure Hazards
- Conflict for use of a resource
- In LEGv8 pipeline with a single memory
  - Load/store requires data access
  - Instruction fetch would have to stall for that cycle
    - Would cause a pipeline “bubble”
- Hence, pipelined datapaths require separate instruction/data memories
  - Or separate instruction/data caches
- In LEGv8 pipeline
  - Need to compare registers and compute target early in the pipeline
  - Add hardware to do it in ID stage
Stall on Branch
- Wait until branch outcome determined before fetching next instruction
[Figure: pipelined datapath. Right-to-left flow (in the MEM and WB stages) leads to hazards.]
Pipeline Operation
- Cycle-by-cycle flow of instructions through the pipelined datapath
  - “Single-clock-cycle” pipeline diagram
    - Shows pipeline usage in a single cycle
    - Highlight resources used
  - cf. “multi-clock-cycle” diagram
    - Graph of operation over time
- We’ll look at “single-clock-cycle” diagrams for load & store
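The two diagram styles are two views of the same schedule: one row per instruction over time, or one snapshot of all stages in a given cycle. A minimal sketch, assuming an ideal five-stage pipeline with no stalls (function names are illustrative):

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def multi_cycle_diagram(instructions):
    """One row per instruction; column c holds the stage occupied in cycle c+1."""
    total = len(instructions) + len(STAGES) - 1
    rows = []
    for i, name in enumerate(instructions):
        cells = [""] * total
        for s, stage in enumerate(STAGES):
            cells[i + s] = stage  # instruction i enters IF in cycle i+1
        rows.append((name, cells))
    return rows


def single_cycle_view(instructions, cycle):
    """Map each busy stage to the instruction in it during `cycle` (1-based)."""
    view = {}
    for i, name in enumerate(instructions):
        s = cycle - 1 - i
        if 0 <= s < len(STAGES):
            view[STAGES[s]] = name
    return view
```

For a load followed by a store, `single_cycle_view(["LDUR", "STUR"], 3)` shows the load in EX while the store is in ID.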
[Figure annotation: wrong register number]
Pipelined Control
- Control signals derived from instruction
  - As in single-cycle implementation
Forwarding Paths
[Figure annotation: stall inserted here]
[Figure annotation: flush these instructions (set control values to 0)]
2-Bit Predictor
- Only change prediction on two successive mispredictions
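One common way to get this behaviour is a 2-bit saturating counter. The sketch below is that variant, offered as an illustration rather than as the exact state machine in the slide's figure; the class name and state encoding are assumptions.

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, state=0):
        self.state = state

    def predict(self):
        """Return True when the prediction is 'taken'."""
        return self.state >= 2

    def update(self, taken):
        """Move one step toward the actual outcome, saturating at 0 and 3."""
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)
```

Starting from the strongly-taken state, a single not-taken outcome (such as a loop exit) leaves the prediction at taken; it takes a second consecutive misprediction to flip it.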
§4.9 Exceptions
Exceptions and Interrupts
- “Unexpected” events requiring change in flow of control
  - Different ISAs use the terms differently
- Exception
  - Arises within the CPU
    - e.g., undefined opcode, overflow, syscall, …
- Interrupt
  - From an external I/O controller
- Dealing with them without sacrificing performance is hard
An Alternate Mechanism
- Vectored Interrupts
  - Handler address determined by the cause
- Exception vector address to be added to a vector table base register:
  - Unknown Reason: 00 0000 (binary)
  - Overflow: 10 1100 (binary)
  - …: 11 1111 (binary)
- Instructions either
  - Deal with the interrupt, or
  - Jump to real handler
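The vectored scheme amounts to adding a cause-specific offset to the vector table base. A minimal sketch using the two offsets named on the slide; the dictionary keys, the function name, and the base address in the usage note are hypothetical.

```python
# Vector offsets from the slide (6-bit values)
VECTOR_OFFSET = {
    "unknown_reason": 0b000000,
    "overflow": 0b101100,
}


def handler_address(base, cause):
    """Handler entry point: vector table base register plus the cause's offset."""
    return base + VECTOR_OFFSET[cause]
```

With an assumed base of 0x8000_0000, an overflow would vector to 0x8000_002C, where the code either handles the event or jumps to the real handler.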
Exceptions in a Pipeline
- Another form of control hazard
- Consider overflow on add in EX stage
    ADD X1, X2, X1
  - Prevent X1 from being clobbered
Exception Properties
- Restartable exceptions
  - Pipeline can flush the instruction
  - Handler executes, then returns to the instruction
    - Refetched and executed from scratch
- PC saved in ELR register
  - Identifies causing instruction
  - Actually PC + 4 is saved
    - Handler must adjust
Exception Example
Multiple Exceptions
- Pipelining overlaps multiple instructions
  - Could have multiple exceptions at once
- Simple approach: deal with exception from earliest instruction
  - Flush subsequent instructions
  - “Precise” exceptions
- In complex pipelines
  - Multiple instructions issued per cycle
  - Out-of-order completion
  - Maintaining precise exceptions is difficult!
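The simple approach can be sketched as a scan over the in-flight instructions in program order. This is an illustration only: the function name, the tuple layout, and the result dictionary are assumptions, and real hardware does this with pipeline control signals rather than lists.

```python
def resolve_exceptions(in_flight):
    """Pick the exception of the earliest instruction.

    `in_flight` is a list of (name, exception_or_None) in program order,
    oldest first. Returns None when no instruction has raised an exception.
    """
    for idx, (name, exc) in enumerate(in_flight):
        if exc is not None:
            return {
                "handle": (name, exc),                          # earliest faulting instruction
                "complete": [n for n, _ in in_flight[:idx]],    # older instructions finish
                "flush": [n for n, _ in in_flight[idx + 1:]],   # younger ones are discarded
            }
    return None
```

Because everything older completes and everything younger is flushed, the handler sees a state consistent with sequential execution, which is what makes the exception precise.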
Speculation
- “Guess” what to do with an instruction
  - Start operation as soon as possible
  - Check whether guess was right
    - If so, complete the operation
    - If not, roll back and do the right thing
- Common to static and dynamic multiple issue
- Examples
  - Speculate on branch outcome
    - Roll back if path taken is different
  - Speculate on load
    - Roll back if location is updated
Scheduling Example
- Schedule this for dual-issue LEGv8
Loop: LDUR X0, [X20,#0]    // X0 = array element
      ADD  X0, X0, X21     // add scalar in X21
      STUR X0, [X20,#0]    // store result
      SUBI X20, X20, #4    // decrement pointer
      CMP  X20, X22        // compare pointer with X22
      BGT  Loop            // branch if X20 > X22

      ALU/branch           Load/store           cycle
Loop: nop                  LDUR X0, [X20,#0]    1
      SUBI X20, X20, #4    nop                  2
      ADD  X0, X0, X21     nop                  3
      CMP  X20, X22        nop                  4
      BGT  Loop            STUR X0, [X20,#4]    5
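The quality of a dual-issue schedule can be summarised as instructions per cycle (IPC). The sketch below assumes the six loop instructions are issued over five cycles, with `nop` marking an empty slot; the list layout and function name are illustrative.

```python
# (ALU/branch slot, load/store slot) for each cycle; "nop" marks an empty slot
SCHEDULE = [
    ("nop", "LDUR"),
    ("SUBI", "nop"),
    ("ADD", "nop"),
    ("CMP", "nop"),
    ("BGT", "STUR"),
]


def ipc(schedule):
    """Useful instructions issued per cycle."""
    issued = sum(slot != "nop" for cycle in schedule for slot in cycle)
    return issued / len(schedule)
```

Under that assumption the loop reaches 6/5 = 1.2 IPC against a peak of 2, because the dependences between the load, add, and store leave most load/store slots empty.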
[Figure annotation: hold pending operands]
Register Renaming
- Reservation stations and reorder buffer effectively provide register renaming
- On instruction issue to reservation station
  - If operand is available in register file or reorder buffer
    - Copied to reservation station
    - No longer required in the register; can be overwritten
  - If operand is not yet available
    - It will be provided to the reservation station by a function unit
    - Register update may not be required
Speculation
- Predict branch and continue issuing
  - Don’t commit until branch outcome determined
- Load speculation
  - Avoid load and cache miss delay
    - Predict the effective address
    - Predict loaded value
    - Load before completing outstanding stores
    - Bypass stored values to load unit
  - Don’t commit load until speculation cleared
Power Efficiency
- Complexity of dynamic scheduling and speculation requires power
- Multiple simpler cores may be better

Microprocessor  Year  Clock Rate  Pipeline Stages  Issue Width  Out-of-order/Speculation  Cores  Power
i486            1989  25 MHz      5                1            No                        1      5 W
Pentium         1993  66 MHz      5                2            No                        1      10 W
Pentium Pro     1997  200 MHz     10               3            Yes                       1      29 W
P4 Willamette   2001  2000 MHz    22               3            Yes                       1      75 W
P4 Prescott     2004  3600 MHz    31               3            Yes                       1      103 W
Core            2006  2930 MHz    14               4            Yes                       2      75 W
UltraSparc III  2003  1950 MHz    14               4            No                        1      90 W
UltraSparc T1   2005  1200 MHz    6                1            No                        8      70 W
Core i7 Pipeline
Performance Impact
Pitfalls
- Poor ISA design can make pipelining harder
  - e.g., complex instruction sets (VAX, IA-32)
    - Significant overhead to make pipelining work
    - IA-32 micro-op approach
  - e.g., complex addressing modes
    - Register update side effects, memory indirection
  - e.g., delayed branches
    - Advanced pipelines have long delay slots