740 Fall10 Lecture4 Afterlecture Pipelining
Computer Architecture
Lecture 4: Pipelining
2
Review: Other ISA-level Tradeoffs
Load/store vs. Memory/Memory
Condition codes vs. condition registers vs. compare&test
Hardware interlocks vs. software-guaranteed interlocking
VLIW vs. single instruction
0, 1, 2, 3 address machines
Precise vs. imprecise exceptions
Virtual memory vs. not
Aligned vs. unaligned access
Supported data types
Software vs. hardware managed page fault handling
Granularity of atomicity
Cache coherence (hardware vs. software)
…
3
Review: The Von-Neumann Model
[Block diagram: MEMORY (Mem Addr Reg), PROCESSING UNIT (ALU, TEMP), CONTROL UNIT (IP, Inst Register), INPUT, OUTPUT]
4
Review: The Von-Neumann Model
Stored program computer (instructions in memory)
One instruction at a time
Sequential execution
Unified memory
The interpretation of a stored value depends on the control signals
Execution time = time/program
= (# instructions/program) x (# cycles/instruction) x (time/cycle)
Instructions/program determined by: Algorithm, Program, ISA, Compiler
Cycles/instruction determined by: ISA, Microarchitecture
Time/cycle determined by: Microarchitecture, Logic design, Circuit implementation, Technology
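To make the decomposition concrete, here is a minimal back-of-the-envelope sketch in Python (not from the lecture; the instruction count, CPI, and clock period are made-up illustrative values):

def execution_time(insts_per_program, cycles_per_inst, seconds_per_cycle):
    # Execution time = (# instructions/program) x (# cycles/instruction) x (time/cycle)
    return insts_per_program * cycles_per_inst * seconds_per_cycle

# Example: 1 billion dynamic instructions, CPI of 1.5, 1 GHz clock (1 ns per cycle)
print(execution_time(1_000_000_000, 1.5, 1e-9))   # 1.5 seconds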
7
Improving Performance (Reducing Exec Time)
Reducing instructions/program
More efficient algorithms and programs
Better ISA?
8
Other Performance Metrics: IPS
Machine A: 10 billion instructions per second
Machine B: 1 billion instructions per second
Which machine has higher performance?
9
Other Performance Metrics: FLOPS
Machine A: 10 billion FP instructions per second
Machine B: 1 billion FP instructions per second
Which machine has higher performance?
10
Other Performance Metrics: Perf/Frequency
SPEC/MHz
Remember: Execution time = time/program = 1/Performance

Performance/Frequency = (time/cycle) / (time/program)
= (time/cycle) / [(# instructions/program) x (# cycles/instruction) x (time/cycle)]
= 1 / (# cycles/program)
What is wrong with comparing only “cycle count”?
Unfairly penalizes machines with high frequency
For machines of equal frequency, fairly reflects performance assuming an equal amount of “work” is done
Fair if used to compare two different same-ISA processors on the same binaries
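A small illustrative calculation (my own numbers, not from the slides) of why raw cycle count can mislead across machines with different frequencies:

# Two hypothetical machines running the same binary (same ISA, same work).
# Machine A runs at 2 GHz but needs more cycles (e.g., memory latency spans more cycles);
# Machine B runs at 1 GHz and needs fewer cycles.
cycles_a, freq_a = 12e9, 2e9
cycles_b, freq_b = 8e9, 1e9

time_a = cycles_a / freq_a   # 6.0 seconds
time_b = cycles_b / freq_b   # 8.0 seconds

# By cycle count alone B looks better (8e9 < 12e9), yet A finishes the program sooner.
print(time_a, time_b)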
11
An Example
[Diagram: original execution time split into a fraction (1 - f) that is unaffected and a fraction f that is enhanced by a factor S, so time_enhanced consists of (1 - f) and f/S]
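The diagram is Amdahl's Law; a minimal sketch of the resulting overall speedup, with placeholder values for f and S:

def overall_speedup(f, s):
    # A fraction f of the original execution time is sped up by a factor s;
    # the remaining (1 - f) is unaffected.
    return 1.0 / ((1.0 - f) + f / s)

# Example: enhance 90% of the execution time by 10x
print(overall_speedup(0.9, 10))   # about 5.26x, far from 10x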
13
Microarchitecture Design Principles
Bread and butter design
Spend time and resources where it matters (i.e., improving what the machine is designed to do)
Common case vs. uncommon case
Balanced design
Balance instruction/data flow through uarch components
Design to eliminate bottlenecks
14
Cycle Time (Frequency) vs. CPI (IPC)
Usually at odds with each other
Why?
Memory access latency: Increased frequency increases the number of cycles it takes to access main memory
15
Intro to Pipelining (I)
Single-cycle machines
Each instruction executed in one cycle
The slowest instruction determines cycle time
Multi-cycle machines
Instruction execution divided into multiple cycles
Fetch, decode, eval addr, fetch operands, execute, store result
Advantage: the slowest “stage” determines cycle time
Microcoded machines
Microinstruction: Control signals for the current cycle
Microcode: Set of all microinstructions needed to implement instructions → Translates each instruction into a set of microinstructions
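A rough sketch contrasting the single-cycle and multi-cycle rules above; the per-stage latencies are invented for illustration, and the slowest instruction is assumed to use all stages:

# Hypothetical per-stage latencies in ns:
# fetch, decode, eval addr, fetch operands, execute, store result
stage_ns = [200, 100, 100, 150, 250, 100]

# Single-cycle machine: the slowest full instruction must fit in one cycle
single_cycle_ns = sum(stage_ns)                   # 900 ns cycle time

# Multi-cycle machine: the slowest stage sets the cycle time;
# an instruction spends one cycle per stage it actually uses
multi_cycle_ns = max(stage_ns)                    # 250 ns cycle time
longest_inst_ns = multi_cycle_ns * len(stage_ns)  # 1500 ns if all 6 stages are used

print(single_cycle_ns, multi_cycle_ns, longest_inst_ns)

The multi-cycle machine pays off when many instructions skip stages, since shorter instructions finish in fewer of the short cycles.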
16
Microcoded Execution of an ADD
ADD DR ← SR1, SR2
[Diagram: MEMORY (Mem Addr Reg, Mem Data Reg), DATAPATH (ALU, GP Registers), CONTROL UNIT (Control Signals, Inst Pointer, Inst Register)]
Fetch: (What if this is SLOW?)
MAR ← IP
MDR ← MEM[MAR]
IR ← MDR
Decode:
Control Signals ← DecodeLogic(IR)
Execute:
TEMP ← SR1 + SR2
Store result (Writeback):
DR ← TEMP
IP ← IP + 4
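A toy Python walk-through of the same micro-sequenced steps (my own encoding of the instruction and register contents, purely for illustration):

# Toy state: memory holds the ADD instruction at the current instruction pointer.
MEM = {0x1000: ("ADD", "DR", "SR1", "SR2")}
regs = {"SR1": 5, "SR2": 7, "DR": 0}
IP = 0x1000

# Fetch
MAR = IP
MDR = MEM[MAR]
IR = MDR

# Decode: stand-in for DecodeLogic(IR) producing control signals
opcode, dst, src1, src2 = IR

# Execute
TEMP = regs[src1] + regs[src2]

# Store result (writeback), then advance the instruction pointer
regs[dst] = TEMP
IP = IP + 4

print(regs["DR"], hex(IP))   # 12 0x1004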
17
Intro to Pipelining (II)
In the microcoded machine, some resources are idle in different stages of instruction processing
Fetch logic is idle when ADD is being decoded or executed
Pipelined machines
Use idle resources to process other instructions
Each stage processes a different instruction
When decoding the ADD, fetch the next instruction
Think “assembly line”
Pipelined vs. multi-cycle machines
Advantage: Improves instruction throughput (reduces CPI)
Disadvantage: Requires more logic, higher power consumption
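An idealized cycle-count comparison of the two styles (no stalls; the numbers are illustrative):

def multicycle_cycles(n_insts, n_stages):
    # One instruction at a time: the next instruction starts only after the previous finishes
    return n_insts * n_stages

def pipelined_cycles(n_insts, n_stages):
    # Fill the pipeline once, then (ideally) one instruction completes per cycle
    return n_stages + (n_insts - 1)

n, k = 4, 4   # four independent ADDs, four stages (F, D, E, W)
print(multicycle_cycles(n, k))  # 16 cycles
print(pipelined_cycles(n, k))   # 7 cycles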
18
A Simple Pipeline
19
Execution of Four Independent ADDs
Multi-cycle: 4 cycles per instruction
F D E W
        F D E W
                F D E W
                        F D E W
Time
20
Issues in Pipelining: Increased CPI
Data dependency stall: what if the next ADD depends on the previous instruction's result?
ADD R3 ← R1, R2    F D E W
ADD R4 ← R3, R7    F D D E W
LD  R3 ← R2(0)     F D E M W
ADD R4 ← R3, R7    F D E E M W
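A tiny helper (my own illustration, not from the slides) that counts stall cycles in the stage strings above: a repeated stage letter means the instruction was held in that stage for an extra cycle.

def stalls(stage_string):
    # "F D D E W" -> held in D for one extra cycle -> 1 stall
    stages = stage_string.split()
    return sum(1 for a, b in zip(stages, stages[1:]) if a == b)

print(stalls("F D E W"))       # 0
print(stalls("F D D E W"))     # 1 (waiting for the older ADD's result)
print(stalls("F D E E M W"))   # 1 (waiting for the load's result)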
21
Implementing Stalling
Hardware-based interlocking
Common way: scoreboard
i.e., a valid bit associated with each register in the register file (see the sketch below)
Valid bits also associated with each forwarding/bypass path
[Diagram: Instruction Cache → Register File → several Func Units]
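A minimal sketch of valid-bit (scoreboard-style) interlocking as described above; the class and method names are mine, and forwarding-path valid bits are left out for brevity:

class Scoreboard:
    def __init__(self, num_regs):
        # One valid bit per register: True means the register file holds the up-to-date value
        self.valid = [True] * num_regs

    def can_issue(self, srcs):
        # Stall (return False) if any source register is still being produced
        # by an in-flight instruction, i.e., its valid bit is clear
        return all(self.valid[r] for r in srcs)

    def issue(self, dst):
        # Destination becomes invalid until writeback completes
        self.valid[dst] = False

    def writeback(self, dst):
        self.valid[dst] = True

sb = Scoreboard(8)
sb.issue(dst=3)                     # ADD R3 <- R1, R2 issues; R3 is now pending
print(sb.can_issue(srcs=[3, 7]))    # False: ADD R4 <- R3, R7 must stall
sb.writeback(dst=3)
print(sb.can_issue(srcs=[3, 7]))    # True: the stall is resolved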
22
Data Dependency Types
23
Issues in Pipelining: Increased CPI
Control dependency stall: what to fetch next
24