Onur 447 Spring15 Lecture13 Ooo and Dataflow Afterlecture
Computer Architecture
Lecture 13: Out-of-Order Execution
and Data Flow
Pipelining
Out-of-Order Execution
Lab 2 Grade Distribution
Avg 70.5, Med 96.9, Stdev 39.1, Max 100, Min 50.8
[Histogram: number of students per grade range]
Lab 2 Extra Credits
Complete and correct:
Terence An
Jared Choi
Almost correct:
Pete Ehrett
Xiaofan Li
Amanda Marano
Ashish Shrestha
Almost-1 correct:
Sohil Shah
Readings Specifically for Today
Smith and Sohi, “The Microarchitecture of Superscalar
Processors,” Proceedings of the IEEE, 1995
More advanced pipelining
Interrupt and exception handling
Out-of-order and superscalar execution concepts
Readings for Next Lecture
SIMD Processing
Basic GPU Architecture
Other execution models: VLIW, DAE, Systolic Arrays
Recap of Last Lecture
Maintaining Speculative Memory State (Ld/St Ordering)
Out of Order Execution (Dynamic Scheduling)
Link Dependent Instructions: Renaming
Buffer Instructions: Reservation Stations
Track Readiness of Source Values: Tag (and Value) Broadcast
Schedule/Dispatch: Wakeup and Select
Tomasulo’s Algorithm
OoO Execution Exercise with Code Example: Cycle by Cycle
OoO Execution with Precise Exceptions
Questions on OoO Implementation
Where is data stored? Single physical register file vs. reservation stations
Critical path, renaming IDs, …
OoO Execution as Restricted Data Flow
Reverse Engineering the Data Flow Graph
Review: In-order vs. Out-of-order Dispatch
In order dispatch + precise exceptions:
IMUL R3 ← R1, R2    F D E E E E R W
ADD R3 ← R3, R1     F D STALL E R W
ADD R1 ← R6, R7     F STALL D E R W
IMUL R5 ← R6, R8    F D E E E E E R W
ADD R7 ← R3, R5     F D STALL E R W
16 vs. 12 cycles (in-order vs. out-of-order dispatch)
Review: Out-of-Order Execution with Precise Exceptions
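[Figure: out-of-order pipeline with precise exceptions. Fetch (F) and decode (D) feed a SCHEDULE stage, which dispatches to the integer add, integer mul, FP mul, and load/store units (each with its own number of E cycles). A TAG and VALUE broadcast bus returns results to the scheduler, and a REORDER stage (R) precedes writeback (W).]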
Review: Summary of OOO Execution Concepts
Register renaming eliminates false dependencies, enables
linking of producer to consumers
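The renaming idea can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: the `rename` function, the tag names `T0`, `T1`, ... and the tuple format are assumptions made for this example. Each destination gets a fresh tag, and each source is looked up in a register alias table (RAT), so a consumer is linked to its most recent producer and reuse of an architectural register name no longer creates a WAR or WAW dependence.

```python
def rename(program):
    """Rename destinations to fresh tags; sources read the current RAT mapping.

    program: list of (op, dest, src1, src2) using architectural register names.
    Returns a list of (op, dest_tag, src1, src2) where each source is either a
    producer's tag or, if no producer is in flight, the register name itself.
    """
    rat = {}       # architectural register -> tag of its latest producer
    renamed = []
    for i, (op, dest, s1, s2) in enumerate(program):
        # Look up sources before updating the RAT for this instruction's dest.
        srcs = tuple(rat.get(s, s) for s in (s1, s2))
        tag = f"T{i}"   # hypothetical tag naming: one fresh tag per instruction
        rat[dest] = tag
        renamed.append((op, tag, *srcs))
    return renamed


program = [
    ("MUL", "R3", "R1", "R2"),
    ("ADD", "R3", "R3", "R1"),   # true dependence on the MUL, and WAW on R3
    ("ADD", "R1", "R6", "R7"),   # WAR on R1 with the two instructions above
]
for line in rename(program):
    print(line)
```

After renaming, the second instruction reads tag `T0` (the true dependence), while the third instruction writes `T2` and no longer conflicts with earlier readers of R1.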
Review: Our Example
MUL R3 ← R1, R2
ADD R5 ← R3, R4
ADD R7 ← R2, R6
ADD R10 ← R8, R9
MUL R11 ← R7, R10
ADD R5 ← R5, R11
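The data flow graph hidden in this six-instruction example can be recovered mechanically. The sketch below is an illustration under stated assumptions: the latencies in `LATENCY` (6 cycles for MUL, 4 for ADD) are assumed values, and the helper names `dataflow_graph` and `finish_times` are hypothetical. It links each source register to its last writer, then computes the earliest finish time of each instruction if only true dependences (and no resource limits) constrain execution.

```python
LATENCY = {"MUL": 6, "ADD": 4}   # assumed execute latencies, for illustration only

program = [
    ("MUL", "R3", ("R1", "R2")),
    ("ADD", "R5", ("R3", "R4")),
    ("ADD", "R7", ("R2", "R6")),
    ("ADD", "R10", ("R8", "R9")),
    ("MUL", "R11", ("R7", "R10")),
    ("ADD", "R5", ("R5", "R11")),
]


def dataflow_graph(program):
    """Return, for each instruction index, the sorted indices of its producers."""
    last_writer = {}   # register -> index of the most recent instruction writing it
    deps = []
    for i, (_, dest, srcs) in enumerate(program):
        deps.append(sorted({last_writer[s] for s in srcs if s in last_writer}))
        last_writer[dest] = i
    return deps


def finish_times(program, deps):
    """Earliest finish time per instruction, limited only by true dependences."""
    finish = []
    for (op, _, _), producers in zip(program, deps):
        start = max((finish[p] for p in producers), default=0)
        finish.append(start + LATENCY[op])
    return finish


deps = dataflow_graph(program)
print(deps)
print(finish_times(program, deps))
```

Under these assumed latencies the last instruction's finish time is the length of the critical path through the graph.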
Review: State of RAT and RS in Cycle 7
Restricted Data Flow
An out-of-order machine is a “restricted data flow” machine
Dataflow-based execution is restricted to the microarchitecture
level
ISA is still based on von Neumann model (sequential
execution)
Memory Dependence Handling (I)
Need to obey memory dependences in an out-of-order
machine
and need to do so while providing high performance
Approaches
Conservative: Stall the load until all previous stores have
computed their addresses (or even retired from the machine)
Aggressive: Assume load is independent of unknown-address
stores and schedule the load right away
Intelligent: Predict (with a more sophisticated predictor) if the
load is dependent on the/any unknown address store
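The three approaches differ only in what they do when an older store's address is still unknown. The sketch below makes that explicit. It is a simplified model, not a description of any real scheduler: the function `may_issue_load`, its parameters, and the `predict_dependent` callback are hypothetical names, and a load that matches a known older store address is simply held back here rather than forwarded.

```python
def may_issue_load(load_addr, older_stores, policy, predict_dependent=None):
    """Decide whether a load may be scheduled now.

    older_stores: addresses of older in-flight stores; None means the store's
        address has not been computed yet.
    policy: "conservative", "aggressive", or "intelligent".
    predict_dependent: hypothetical predictor callback (load_addr -> bool),
        used only by the "intelligent" policy.
    """
    # A known older store to the same address is a real dependence under every
    # policy; in this simplified model the load just waits for it.
    if any(addr == load_addr for addr in older_stores if addr is not None):
        return False
    if all(addr is not None for addr in older_stores):
        return True                      # every older address is known and differs
    if policy == "conservative":
        return False                     # stall until all older addresses are known
    if policy == "aggressive":
        return True                      # assume independence, schedule right away
    if policy == "intelligent":
        return not predict_dependent(load_addr)
    raise ValueError(f"unknown policy: {policy}")


print(may_issue_load(0x40, [0x10, None], "conservative"))   # False
print(may_issue_load(0x40, [0x10, None], "aggressive"))     # True
```

The aggressive and intelligent policies can guess wrong, so a real machine pairing them needs a recovery mechanism; the conservative policy does not.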
Handling of Store-Load Dependences
A load’s dependence status is not known until all previous store
addresses are available.
How does the OOO engine treat the scheduling of a load instruction wrt
previous stores?
Option 1: Assume load dependent on all previous stores
Memory Disambiguation (I)
Option 1: Assume load dependent on all previous stores
+ No need for recovery
-- Too conservative: delays independent loads unnecessarily
Data Forwarding Between Stores and Loads
We cannot update memory out of program order
Need to buffer all store and load instructions in instruction window
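A store buffer with forwarding can be sketched as follows. This is a minimal model under assumptions: the class `StoreBuffer` and its method names are hypothetical, all buffered store addresses are taken to be known, and every load is assumed to be younger than every buffered store. Stores sit in the buffer until they retire in program order; a load first searches the buffer, youngest store first, and only then reads memory.

```python
class StoreBuffer:
    """Holds stores in program order until they retire; loads search it first."""

    def __init__(self, memory):
        self.memory = memory    # dict: address -> value (committed state)
        self.pending = []       # list of (address, value), oldest first

    def store(self, addr, value):
        # The store is only buffered; memory is not updated yet.
        self.pending.append((addr, value))

    def load(self, addr):
        # Forward from the youngest matching store, else read committed memory.
        for st_addr, value in reversed(self.pending):
            if st_addr == addr:
                return value
        return self.memory.get(addr, 0)

    def retire_oldest(self):
        # Memory is updated strictly in program order, one store at a time.
        addr, value = self.pending.pop(0)
        self.memory[addr] = value


mem = {0x100: 1}
sb = StoreBuffer(mem)
sb.store(0x100, 5)
sb.store(0x100, 9)
print(sb.load(0x100), mem[0x100])   # forwarded value vs. committed value
sb.retire_oldest()
print(mem[0x100])
```

The load sees the youngest buffered value while memory still holds the old one, which is the point of forwarding: correctness for the load without updating memory out of program order.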
Food for Thought for You
Many other design choices
Smith and Sohi, “The Microarchitecture of Superscalar Processors,” Proc. IEEE, Dec. 1995.
A Modern OoO Design: Intel Pentium 4
Boggs et al., “The Microarchitecture of the Pentium 4 Processor,” Intel Technology Journal, 2001.
Intel Pentium 4 Simplified
Mutlu+, “Runahead Execution,” HPCA 2003.
Alpha 21264
MIPS R10000
Yeager, “The MIPS R10000 Superscalar Microprocessor,” IEEE Micro, April 1996.
IBM POWER4
Tendler et al., “POWER4 system microarchitecture,” IBM J R&D, 2002.
IBM POWER4
2 cores, out-of-order execution
100-entry instruction window in each core
8-wide instruction fetch, issue, execute
Large, local+global hybrid branch predictor
1.5MB, 8-way L2 cache
Aggressive stream based prefetching
IBM POWER5
Kalla et al., “IBM Power5 Chip: A Dual-Core Multithreaded Processor,” IEEE Micro 2004.
Recommended Readings
Out-of-order execution processor designs
Other Approaches to Concurrency
(or Instruction Level Parallelism)
Approaches to (Instruction-Level) Concurrency
Pipelining
Out-of-order execution
Dataflow (at the ISA level)
SIMD Processing (Vector and array processors, GPUs)
VLIW
Decoupled Access Execute
Systolic Arrays
Data Flow:
Exploiting Irregular Parallelism
Remember: State of RAT and RS in Cycle 7
Remember: Dataflow Graph
Review: More on Data Flow
In a data flow machine, a program consists of data flow
nodes
A data flow node fires (is fetched and executed) when all its
inputs are ready
i.e. when all inputs have tokens
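The firing rule can be illustrated with a small sketch. The `Node` class and its `receive` method are hypothetical names chosen for this example, and the node is assumed to have exactly two input ports, "L" and "R". A token delivered to a port is held until the other port also has a token; at that point the node fires and the tokens are consumed.

```python
import operator


class Node:
    """A two-input dataflow node that fires once both input ports hold a token."""

    def __init__(self, fn):
        self.fn = fn
        self.tokens = {}   # port ("L" or "R") -> value

    def receive(self, port, value):
        """Deliver a token; return the result if the node fires, else None."""
        self.tokens[port] = value
        if len(self.tokens) == 2:          # all inputs have tokens: fire
            result = self.fn(self.tokens["L"], self.tokens["R"])
            self.tokens.clear()            # firing consumes the input tokens
            return result
        return None


add = Node(operator.add)
first = add.receive("R", 4)    # only one token present, so the node does not fire
second = add.receive("L", 3)   # second token arrives, so the node fires
print(first, second)
```

Note that the order in which the two tokens arrive does not matter; only their availability does.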
Data Flow Nodes
Dataflow Nodes (II)
[Figure: dataflow node types, including a + operator and conditional nodes with T/F ports]
Dataflow Graphs
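{x = a + b;
 y = b * 7
 in
 (x-y) * (x+y)}
Values in dataflow graphs are represented as tokens
token <ip, p, v> = <instruction ptr, port, data>
[Figure: dataflow graph with inputs a and b. Node 1 (+) produces x and node 2 (*7) produces y; both feed node 3 (-) and node 4 (+), whose results are multiplied to produce OUT.]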
Control Flow vs. Data Flow
Data Flow Characteristics
Data-driven execution of instruction-level graphical code
Nodes are operators
Arcs are data (I/O)
As opposed to control-driven execution
[Figure: token format with a frame pointer and an instruction pointer (tag or context ID)]
An Example Frame and Execution
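Program (ip, op, r, destinations):
1  +  1  3L, 4L
2  *  2  3R, 4R
3  -  3  5L
4  +  4  5R
5  *  5  out
Token: <fp, ip, p, v>
Frame: slots 1 to 5; slot 3 holds a waiting left operand (L, 7)
[Figure: dataflow graph with inputs a and b; nodes 1 (+) and 2 (*7) produce x and y, which feed nodes 3 (-) and 4 (+), which in turn feed node 5 (*)]
Need to provide storage for only one operand/operator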
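Frame-based execution of the five-instruction program on this slide can be sketched as below. Several simplifications are assumptions of this example and not part of the slide: there is a single frame, so the frame pointer fp is dropped from tokens; every node is treated as a two-input operator, with the constant 7 delivered to node 2 as an ordinary token; and the names `PROGRAM` and `run` are hypothetical. The first operand to arrive for an instruction is parked in that instruction's frame slot; when the second arrives, the node fires and sends its result to its destinations.

```python
import operator
from collections import deque

# ip -> (operation, frame slot r, destinations as (ip, port)); None = program output.
PROGRAM = {
    1: (operator.add, 1, [(3, "L"), (4, "L")]),
    2: (operator.mul, 2, [(3, "R"), (4, "R")]),
    3: (operator.sub, 3, [(5, "L")]),
    4: (operator.add, 4, [(5, "R")]),
    5: (operator.mul, 5, [None]),
}


def run(a, b):
    """Execute (x-y)*(x+y) with x = a+b, y = b*7 by token passing."""
    frame = {}   # slot r -> (port, value) of the single waiting operand
    queue = deque([(1, "L", a), (1, "R", b), (2, "L", b), (2, "R", 7)])
    outputs = []
    while queue:
        ip, port, value = queue.popleft()
        fn, r, dests = PROGRAM[ip]
        if r not in frame:                 # first operand: park it in the frame slot
            frame[r] = (port, value)
            continue
        _, other = frame.pop(r)            # partner is present: the node fires
        left, right = (value, other) if port == "L" else (other, value)
        result = fn(left, right)
        for dest in dests:
            if dest is None:
                outputs.append(result)
            else:
                queue.append((dest[0], dest[1], result))
    return outputs


print(run(3, 4))
```

Each frame slot holds at most one operand at a time, which is why storage for only one operand per operator is enough in this scheme.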
Monsoon Dataflow Processor [ISCA 1990]
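[Figure: Monsoon pipeline. Instruction Fetch (ip indexes Code; instruction format: op r d1,d2), Operand Fetch (fp+r indexes Frames), ALU, and Form Token stages, with a Token Queue and Network connections on both sides.]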
A Dataflow Processor
MIT Tagged Token Data Flow Architecture
Wait-Match Unit: tries to match an incoming token (and its context ID) with a waiting token that has the same instruction address
Success: Both tokens are forwarded and the instruction is fetched
Fail: The incoming token is stored in Waiting Token Memory and a bubble is inserted
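The matching behavior described here can be sketched with a dictionary standing in for the waiting token memory. This is an illustration only: the class `WaitMatchUnit`, its `accept` method, and the token layout `(context_id, ip, port, value)` are assumptions made for this example, and a Python dict lookup is a stand-in for whatever matching hardware the real machine uses.

```python
class WaitMatchUnit:
    """Pairs tokens that carry the same (context ID, instruction address) tag."""

    def __init__(self):
        self.waiting = {}   # (context_id, ip) -> token waiting for its partner

    def accept(self, token):
        """token: (context_id, ip, port, value).

        Returns the matched pair on success, or None (a bubble) on failure.
        """
        tag = token[:2]
        partner = self.waiting.pop(tag, None)
        if partner is None:
            self.waiting[tag] = token   # fail: store in waiting token memory
            return None
        return (partner, token)         # success: forward both tokens


wm = WaitMatchUnit()
print(wm.accept((0, 3, "L", 7)))    # no partner yet
print(wm.accept((1, 3, "R", 28)))   # same instruction, different context: no match
print(wm.accept((0, 3, "R", 28)))   # same context and instruction: match
```

Including the context ID in the tag is what keeps tokens from different activations of the same instruction from being paired with each other.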
TTDA Data Flow Example
Manchester Data Flow Machine
Data Flow Advantages/Disadvantages
Advantages
Very good at exploiting irregular parallelism
Only real dependencies constrain processing
Disadvantages
Debugging difficult (no precise state)
Interrupt/exception handling is difficult (what is precise state
semantics?)
Implementing dynamic data structures difficult in pure data
flow models
Too much parallelism? (Parallelism control needed)
High bookkeeping overhead (tag matching, data storage)
Instruction cycle is inefficient (delay between dependent
instructions), memory locality is not exploited
Combining Data Flow and Control Flow
Can we get the best of both worlds?
Two possibilities
Data Flow Summary
Availability of data determines order of execution
A data flow node fires when its sources are ready
Programs represented as data flow graphs (of nodes)
Data Flow at the ISA level has not been (as) successful
Further Reading on Data Flow
ISA level dataflow
Gurd et al., “The Manchester prototype dataflow computer,”
CACM 1985.
Microarchitecture-level dataflow:
Patt, Hwu, Shebanow, “HPS, a new microarchitecture: rationale
and introduction,” MICRO 1985.
Patt et al., “Critical issues regarding HPS, a high performance
microarchitecture,” MICRO 1985.
Hwu and Patt, “HPSm, a high performance restricted data
flow architecture having minimal functionality,” ISCA 1986.