Lect5 PDF
Superpipeline
Dependency issues
Parallel instruction execution
Superscalar Architecture
A superscalar processor is designed to improve the
performance of the execution of scalar instructions.
A scalar is a variable that can hold only one atomic
value at a time, e.g., an integer or a real.
A scalar architecture processes one data item at a
time, as in the computers we discussed up till now.
Examples of non-scalar variables:
Arrays
Matrices
Records
Superscalar Architecture (Cont’d)
In a superscalar architecture (SSA), several scalar
instructions can be initiated simultaneously and
executed independently.
Pipelining allows also several instructions to be
executed at the same time, but they have to be in
different pipeline stages at a given moment.
SSA includes all features of pipelining but, in addition,
there can be several instructions executing
simultaneously in the same pipeline stage.
SSA therefore introduces a new level of parallelism,
called instruction-level parallelism.
Motivation
Most operations are on scalar quantities (about 80%).
Speeding up these operations will lead to a large overall performance
improvement.
Superpipelining
Superpipelining is based on dividing the stages of a pipeline into
several sub-stages, and thus increasing the number of
instructions which are handled by the pipeline at the same time.
For example, by dividing each stage into two sub-stages, a
pipeline can perform at twice the speed in the ideal situation.
Many pipeline stages may perform tasks that require less than half
a clock cycle.
No duplication of hardware is needed for these stages.
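As a rough illustration of the timing argument above, here is a hypothetical sketch; the function and its parameters are my own, assuming perfectly balanced stages and no hazards:

```python
# Hypothetical model of ideal (super)pipeline timing.
# Assumes perfectly balanced stages and no hazards or stalls.
def cycles(n_instr, stages, substages=1):
    # Filling the pipeline takes `stages` base clock cycles; after that,
    # one instruction finishes every 1/substages of a base cycle.
    return stages + (n_instr - 1) / substages

plain = cycles(100, 4)       # plain 4-stage pipeline
super2 = cycles(100, 4, 2)   # superpipeline of degree 2
print(plain, super2, plain / super2)
```

For long instruction sequences the ratio approaches the degree of superpipelining, here 2.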
Superpipelining (Cont’d)
For a given architecture and the corresponding instruction set
there is an optimal number of pipeline stages/sub-stages.
Increasing the number of stages/sub-stages over this limit
reduces the overall performance:
Overhead of data buffering between the stages.
Not all stages can be divided into (equal-length) sub-stages.
The hazards will be more difficult to resolve.
The clock skew problem.
More complex hardware.
Superpipeline of degree 2
A sub-stage often takes
half a clock cycle to
finish.
Superscalar of degree 2
Two instructions are
executed concurrently
in each pipeline stage.
Duplication of hardware
is required by definition.
Superpipelined Superscalar Design
[Timing diagram: instructions I1–I12 pass through sub-staged FI, DI, EI, WO stages. Superpipeline of degree 3 and superscalar of degree 4: 12 times speed-up over the base machine; 48 times speed-up over sequential execution.]
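The quoted speed-ups follow from simple arithmetic; this is my own back-of-the-envelope check, not code from the lecture:

```python
# Back-of-the-envelope check of the quoted speed-ups (my own arithmetic).
stages = 4            # FI, DI, EI, WO
superpipeline = 3     # sub-stages per stage
superscalar = 4       # instructions handled per stage per (sub-)cycle

# Steady state: the base pipeline retires 1 instruction per cycle, while
# this machine retires superpipeline * superscalar per base cycle.
speedup_over_base = superpipeline * superscalar
print(speedup_over_base)        # 12

# Purely sequential execution needs `stages` cycles per instruction.
speedup_over_sequential = speedup_over_base * stages
print(speedup_over_sequential)  # 48
```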
An SSA Example
[Block diagram: the instruction cache feeds the instruction fetch unit and instruction buffer; a decode, rename & dispatch unit places instructions into the instruction window (queues, reservation stations, etc.); the instruction issuing unit sends them to two integer units and two floating-point units, which access the register file, data cache, and memory; finally, results are committed.]
Parallel Execution Limitation
The situations which prevent instructions from being executed in
parallel by an SSA are very similar to those which prevent efficient
execution on a pipelined architecture (pipeline hazards):
Resource conflicts.
Control (procedural) dependency.
Data dependencies.
Resource Conflicts
Several instructions compete for the same hardware
resource at the same time.
e.g., two arithmetic instructions need the same
floating-point unit for execution.
Similar to structural hazards in a pipeline.
Procedural Dependency
The presence of branches creates major problems in
achieving optimal parallelism.
We cannot execute instructions after a branch in parallel
with instructions before the branch.
Similar to control hazards in a pipeline.
Data Conflicts
Caused by data dependencies between instructions in
the program.
Similar to data hazards in a pipeline.
Window of Execution
Due to data dependencies, only some of the instructions are
candidates for parallel execution.
In order to find instructions to issue in parallel, the processor
has to select from a sufficiently large instruction sequence.
There are usually many data dependencies in a short instruction
sequence.
Window of Execution Example
L2: move r3,r7
load r8,(r3)       r8 := a[i]
add r3,r3,#4       r3 := r3+4
load r9,(r3)       r9 := a[i+1]
ble r8,r9,L3       jump if r8 <= r9
move r3,r7
store r9,(r3)      a[i] := r9
add r3,r3,#4
store r8,(r3)      a[i+1] := r8
add r5,r5,#1       change++
Basic Blocks
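The division into basic blocks can also be sketched programmatically; the splitter and its instruction encoding below are my own illustration, not part of the lecture:

```python
# Toy basic-block splitter (hypothetical helper).
# An instruction is (label_or_None, opcode, operand_string).
def basic_blocks(instrs, branch_ops=("ble", "jump")):
    """A block starts at a labelled instruction and ends at a branch."""
    blocks, cur = [], []
    for label, op, args in instrs:
        if label and cur:          # a label starts a new block
            blocks.append(cur)
            cur = []
        cur.append((label, op, args))
        if op in branch_ops:       # a branch ends the current block
            blocks.append(cur)
            cur = []
    if cur:
        blocks.append(cur)
    return blocks

prog = [
    ("L2", "move", "r3,r7"),
    (None, "load", "r8,(r3)"),
    (None, "add", "r3,r3,#4"),
    (None, "load", "r9,(r3)"),
    (None, "ble", "r8,r9,L3"),
    (None, "move", "r3,r7"),
    (None, "store", "r9,(r3)"),
]
print(len(basic_blocks(prog)))   # 2: up to the ble, and after it
```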
Data Dependencies
All instructions in the window of execution may begin
execution, subject to data dependence and resource
constraints.
True Data Dependency Example
L2: move r3,r7
load r8,(r3)
add r3,r3,#4
load r9,(r3)
ble r8,r9,L3
Output Dependency
An output dependency exists if two instructions are writing into
the same location.
If the second instruction writes before the first one, an error
occurs:
MUL R4,R3,R1 (R4 := R3 * R1)
. . .
ADD R4,R2,R5 (R4 := R2 + R5)
L2: move r3,r7
load r8,(r3)
add r3,r3,#4
load r9,(r3)
ble r8,r9,L3
Anti-dependency
An anti-dependency exists if an instruction uses a location as
an operand while a following one is writing into that location.
If the first one is still using the location when the second one
writes into it, an error occurs:
MUL R4,R3,R1 (R4 := R3 * R1)
. . .
ADD R3,R2,R5 (R3 := R2 + R5)
L2: move r3,r7
load r8,(r3)
add r3,r3,#4
load r9,(r3)
ble r8,r9,L3
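The three dependency kinds can be summarized in a small classifier; the function and the (writes, reads) encoding are my own illustration, reusing the MUL/ADD examples from the slides:

```python
# Toy dependency classifier (hypothetical helper).
# An instruction is (writes, reads), each a set of register names.
def classify(first, second):
    """Return the set of dependencies from `first` to `second`."""
    deps = set()
    w1, r1 = first
    w2, r2 = second
    if w1 & r2:
        deps.add("true (RAW)")    # second reads what first writes
    if w1 & w2:
        deps.add("output (WAW)")  # both write the same location
    if r1 & w2:
        deps.add("anti (WAR)")    # second writes what first reads
    return deps

# MUL R4,R3,R1 then ADD R4,R2,R5: both write R4 -> output dependency
print(classify(({"R4"}, {"R3", "R1"}), ({"R4"}, {"R2", "R5"})))
# MUL R4,R3,R1 then ADD R3,R2,R5: R3 read, then written -> anti-dependency
print(classify(({"R4"}, {"R3", "R1"}), ({"R3"}, {"R2", "R5"})))
```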
Output and Anti-Dependencies (Cont’d)
Output dependencies and anti-dependencies can usually be
eliminated by using additional registers.
This technique is called register renaming.
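A minimal sketch of the renaming idea, assuming a pool of free physical registers; the function and names are hypothetical, not the lecture's algorithm:

```python
# Minimal register-renaming sketch (hypothetical code).
# An instruction is (dest, srcs); every destination gets a fresh
# physical register, which removes WAW and WAR hazards while
# preserving true (RAW) dependencies via the mapping.
def rename(instrs, free_regs):
    mapping = {}              # architectural -> current physical register
    out = []
    free = list(free_regs)
    for dest, srcs in instrs:
        srcs = tuple(mapping.get(s, s) for s in srcs)  # patch later reads
        phys = free.pop(0)    # allocate a fresh register for the result
        mapping[dest] = phys
        out.append((phys, srcs))
    return out

# MUL R4,R3,R1 ; ADD R4,R2,R5 -> the two writes to R4 no longer collide
renamed = rename([("R4", ("R3", "R1")), ("R4", ("R2", "R5"))], ["P1", "P2"])
print(renamed)
```

A true dependency survives renaming: an instruction reading R4 after the MUL would be rewritten to read P1, the MUL's physical destination.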
Effect of Dependencies
[Figure: timing effect of data, procedural, and resource dependencies on parallel execution.]
Lecture 5: Superscalar Processors
Division and Decoupling
To increase ILP, we should divide the instruction execution
into smaller tasks and decouple them. In particular, we
have three important activities:
Instruction issue: an instruction is initiated and starts execution.
Instruction completion: an instruction has completed its specified operations.
Instruction commit: the results of the instruction's operations are
written back to the register files or cache.
The machine state is changed.
In-Order Issue with In-Order Completion
Instructions are issued in exact program order, and completed in
the same order (with parallel issue and completion, of course!).
An instruction cannot be issued before the previous one has been
issued;
An instruction cannot be completed before the previous one has
been completed.
To guarantee in-order completion, an instruction will stall when
there is a conflict or when a unit requires more than one cycle
to execute.
Example:
Assume a processor that can issue and decode two instructions
per cycle, that has three functional units (two single-cycle integer
units and a two-cycle floating-point unit), and that can complete
and write back two results per cycle.
And an instruction sequence with the characteristics given in the
next slide.
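A much-simplified simulation of this example machine; this is my own sketch, which assumes the functional units are pipelined and ignores the two-results-per-cycle write-back limit:

```python
# Simplified in-order issue / in-order completion simulator (my own sketch).
# Instructions are "int" (1-cycle, two units) or "fp" (2-cycle, one unit).
def schedule(instrs):
    """Return the cycle in which each instruction completes."""
    lat = {"int": 1, "fp": 2}
    units = {"int": 2, "fp": 1}
    complete = []
    cycle = 1
    issued = {"int": 0, "fp": 0, "total": 0}
    for op in instrs:
        # In-order issue: wait for an issue slot (max 2/cycle) and a unit.
        while issued["total"] >= 2 or issued[op] >= units[op]:
            cycle += 1
            issued = {"int": 0, "fp": 0, "total": 0}
        issued[op] += 1
        issued["total"] += 1
        done = cycle + lat[op]
        # In-order completion: stall until the previous result is written.
        if complete and done < complete[-1]:
            done = complete[-1]
        complete.append(done)
    return complete

print(schedule(["fp", "int"]))   # the int result waits for the fp result
```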
IOI with IOC Discussion
The processor detects and handles (by stalling) true
data dependencies and resource conflicts.
As instructions are issued and completed in their strict order,
the exploited parallelism is very much dependent on the way
the program has been written or compiled.
E.g., if I3 and I6 switch positions, the pairs I4/I6 and I3/I5
can be executed in parallel (see the following slides).
To exploit such parallelism improvement, the compiler
needs to perform elaborate data-flow analysis.
IOI with IOC Discussion (Cont’d)
Out-of-Order Issue with Out-of-Order Completion
Speedup w/o Procedural Dependencies
Summary
The following techniques are main features for superscalar processors:
Several pipelined units which are working in parallel;
Out-of-order issue and out-of-order completion;
Register renaming.