
Instruction-Level Parallelism

dynamic scheduling
prepared and instructed by
Shmuel Wimer
Eng. Faculty, Bar-Ilan University

May 2015


Dynamic Scheduling
Dynamic scheduling rearranges instruction execution to reduce stalls while maintaining data flow and exception behavior.
•  Enables handling cases where dependences are unknown at compile time (e.g., memory references).
•  Simplifies the compiler.
•  Allows the processor to tolerate cache miss delays by executing other code while waiting for the miss to resolve.
•  Allows code compiled for one pipeline to run efficiently on a different pipeline.
•  Significantly increases hardware complexity.

In an ordinary pipeline, instructions are issued and executed in order.
• If an instruction is stalled in the pipeline, no later instruction can proceed.
• If instruction j depends on instruction i, all instructions after j must be stalled until i finishes and j can execute.
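The slide's code figure is not reproduced in this text; the classic sequence that matches the discussion (a reconstruction after Hennessy and Patterson, not necessarily the original slide content) is:

DIV.D F0,F2,F4
ADD.D F10,F0,F8
SUB.D F12,F8,F14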
SUB.D cannot execute: the dependence of ADD.D on DIV.D stalls the pipeline, even though SUB.D is independent of every instruction currently in it.
To execute SUB.D we separate instruction issue (at ID) into two parts: checking for structural hazards and waiting for the absence of a data hazard.
Out-Of-Order Execution
Instructions are still issued in order, but start execution as soon as their data operands are available.
Such a pipeline does out-of-order (OOO) execution,
implying out-of-order completion.
OOO introduces the possibility of WAR and WAW hazards, which do not exist in an in-order pipeline.
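The code figure is again missing from the text; the standard sequence matching the discussion below (a reconstruction, after Hennessy and Patterson) is:

DIV.D F0,F2,F4
ADD.D F6,F0,F8
SUB.D F8,F10,F14
MUL.D F6,F10,F8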
ADD.D and SUB.D are antidependent (both name F8). Executing SUB.D before ADD.D, which must wait for F0, violates the antidependence and causes a WAR hazard.


Likewise, to avoid violating the output dependence on F6 by MUL.D, WAW hazards must be handled. Register renaming avoids these hazards.
OOO completion must preserve exception behavior, so that exceptions happen exactly as they would in order.
• No instruction generates an exception until the processor
knows that the instruction raising the exception will be
executed.
OOO splits the ID stage into two stages:
1. Issue—Decode instructions, check for structural hazards.
2. Read operands—Wait until no data hazards, then read
operands.
An IF stage preceding the issue stage fetches instructions into either an instruction register or a queue of pending instructions; instructions are issued from there.
The EX stage follows the read operands stage and may
take multiple cycles, depending on the operation.
The pipeline allows simultaneous execution of multiple
instructions.
• Without it a major advantage of OOO is lost.
• Requires multiple functional units.
Instructions are issued in-order, but can enter execution
out of order. There are two OOO techniques:
scoreboarding and Tomasulo’s algorithm.
Tomasulo’s Dynamic Scheduling
Invented for the IBM 360/91 FPU by Robert Tomasulo.
• Minimizes RAW hazards by tracking when operands are available.
• Minimizes WAR and WAW hazards by register renaming.
We assume the existence of an FPU and a load-store unit, and use the MIPS ISA.
Register renaming eliminates WAR and WAW hazards:
• Rename all destination registers, including those with a pending read or write for an earlier instruction.
• OOO writes then do not affect instructions that depend on an earlier value of an operand.
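The slide's code figure is not in the text. The classic sequence this discussion matches (a reconstruction, after Hennessy and Patterson), first as written and then renamed with temporaries S and T:

Before renaming:
DIV.D F0,F2,F4
ADD.D F6,F0,F8
S.D   F6,0(R1)
SUB.D F8,F10,F14
MUL.D F6,F10,F8

After renaming:
DIV.D F0,F2,F4
ADD.D S,F0,F8
S.D   S,0(R1)
SUB.D T,F10,F14
MUL.D F6,F10,T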
This code includes potential WAW and WAR hazards: the antidependence of SUB.D on ADD.D (a WAR hazard), and a WAW hazard on F6 if MUL.D finishes before ADD.D. These are name dependences, in contrast to the true data dependences in the code.
Name dependences can be eliminated by register renaming; any subsequent use of F8 must then be replaced by the renamed register T.
This is very difficult for the compiler (branches may intervene).
Tomasulo’s algorithm can handle renaming across
branches.
Register renaming is provided by the Reservation Stations (RSs), which buffer the operands of instructions waiting to issue.
An RS fetches and buffers an operand as soon as it is available, eliminating the need to get it from the Register File (RF).
Pending instructions designate the RSs that will provide their operands. At issue, pending operands are renamed from RF specifiers to RS names.
When successive writes to a register (WAW) overlap in execution, only the last one actually updates the RF.
There can be more RSs than real registers, so renaming can eliminate hazards that a compiler could not.
Unlike an ordinary pipelined processor, where hazard detection and execution control were centralized, they are now distributed.
The information held at each RS of a functional unit determines
when an instruction can start execution at that unit.
RSs pass results directly to the functional units that require them through the Common Data Bus (CDB), rather than going through the RF.
A pipeline supporting multiple execution units and issuing multiple instructions per clock cycle requires more than one CDB.
(Figure: the Tomasulo-based FP unit. The instruction queue serves as the IF/ID register.)
Instructions are sent from the instruction unit into a queue, from which they issue in FIFO order.
RSs include the operations and the actual operands,
together with information for hazard detection and
resolution.

Load buffers:
1. hold the components of the effective address until
it is computed,
2. track outstanding loads waiting on memory, and
3. hold the results of completed loads, waiting for the
CDB.
Store buffers:
1. hold the components of the effective address until
it is computed,
2. hold the destination addresses of outstanding
stores waiting for the data value to store, and
3. hold the address and data to store until the memory
unit is available.
All results of the FPU and load unit are put on the CDB,
which goes to the FP registers, to the RSs and to the
store buffers.
The adder also implements subtraction, and the multiplier also implements division.
The Steps of an Instruction

1. Issue
Get the next instruction from the head of the queue. Instructions are maintained in FIFO order and hence issue in order.
If there is an empty matching RS, issue the instruction to that RS, together with its operands if they are currently in the RF.
If there is no empty matching RS, there is a structural hazard: the instruction stalls until an RS is freed.



If the operands are not in the RF, keep track of the functional units that will produce them. This step renames registers, eliminating WAR and WAW hazards.
2. Execute
If an operand is not yet available, monitor the CDB for it.
When it becomes available, the operand is placed in any RS awaiting it. When all the operands are available, the operation is executed.
By delaying operations until all their operands are available, RAW hazards are avoided.
Several instructions could become ready in the same clock cycle. Independent units can start execution in the same cycle. If several instructions are ready for the same FPU, the choice among them can be arbitrary.
Loads and stores require a two-step execution process. The first step computes the effective address when the base register is available; the address is placed in the load or store buffer.
A load is executed as soon as the memory unit is available.
Stores wait for the value to be stored before being sent to the memory unit.
Loads and stores are maintained in program order to prevent hazards through memory.
To preserve exception behavior, no instruction is allowed to initiate execution until all branches preceding it in program order have completed.
This guarantees that only instructions that would really be executed raise an exception. If BP is used, the processor must know that the prediction is correct before allowing instructions after the branch (in program order) to execute.
If the processor records the exception, it can allow execution past the branch, raising the exception only when the instruction enters write results.
Speculation will provide a more complete solution.
3. Write Results
When the result is available, put it on the CDB and from
there into the RF and any RSs waiting for the result.
Stores are buffered in the store buffer until both the value to be stored and the store address are available.
The result is then written as soon as the memory unit is
free.
The Reservation Station Data Structure
Each RS has seven fields:
• Op – The operation to perform on the source operands S1 and S2.
• Qj, Qk – The RSs that will produce the corresponding source operand. A value of zero indicates that the operand is already available in Vj or Vk, or is unnecessary.
• Vj, Vk – The values of the source operands.
• A – Holds information for the memory address calculation for a load or store.
• Busy – This RS and its functional unit are occupied.
 
Each RF entry has one field:
• Qi – The number of the RS containing the operation whose result should be stored into this register. Qi = 0 means that no active instruction is computing a result destined for this register; the register contents are a valid value.
Each load and store buffer has a field A, which holds the effective address once the first execution step (of the two) is completed.
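As a minimal C sketch, the per-entry state just described could be captured as follows (field names follow the text; the types are illustrative assumptions):

typedef struct {
    int    Op;       /* operation to perform on the source operands        */
    int    Qj, Qk;   /* tags of the RSs producing the sources; 0 = ready   */
    double Vj, Vk;   /* source operand values (valid when the Q tag is 0)  */
    long   A;        /* address information for a load or store            */
    int    Busy;     /* this RS and its functional unit are occupied       */
} ReservationStation;

typedef struct {
    int Qi;          /* RS computing this register; 0 = register value valid */
} RegisterStatus;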
Tomasulo’s scheme has two major advantages:
1. the distribution of the hazard detection logic, and
2. the elimination of stalls for WAW and WAR hazards.
Example: What are the contents of Tomasulo’s data structures when the first load has completed and written its result?
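The code for this example does not appear in the text; the matching classic sequence (a reconstruction, after Hennessy and Patterson) is:

L.D   F6,32(R2)
L.D   F2,44(R3)
MUL.D F0,F2,F4
SUB.D F8,F2,F6
DIV.D F10,F0,F6
ADD.D F6,F8,F2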
(Instruction status table not reproduced. Note that the instruction status table is bookkeeping for the discussion; it is not part of the hardware.)


(Reservation station and register status tables not reproduced.)


The WAR hazard involving F6 is eliminated in one of two ways.
If the L.D has completed, the Vk field of DIV.D stores the result, so DIV.D is independent of ADD.D (as shown in the instruction status).
If the L.D had not completed, the Qk field of DIV.D would point to the Load1 RS, and DIV.D would still be independent of ADD.D.
In either case ADD.D can issue and execute without affecting DIV.D.
Example: Assume the following latencies: load 1 cycle, add 2 cycles, multiply 6 cycles, and divide 12 cycles. What do the status tables look like when MUL.D is ready to write its result?
(Instruction status table with per-instruction latencies, and the reservation station and register status tables, not reproduced.)
The ADD.D has completed because the operands of DIV.D were already copied, thereby avoiding the WAR hazard on F6.
Even if the load of F6 were delayed, the add into F6 could be executed without triggering a WAW hazard.
Tomasulo Algorithm Details
Issue, FP operation. Wait until: the matching reservation station is empty.
rs and rt are the source registers and rd is the destination register. r is the reservation station or buffer that the instruction is assigned to. Regs is the register file; RegisterStat is the register status.
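A sketch of the issue bookkeeping in C-like form, reconstructed from the standard formulation (rs, rt, rd, and r as defined above; array bounds and types are assumptions):

if (RegisterStat[rs].Qi != 0)                 /* source rs still in flight   */
    RS[r].Qj = RegisterStat[rs].Qi;           /* record the producing RS     */
else { RS[r].Vj = Regs[rs]; RS[r].Qj = 0; }   /* value ready: copy it        */
if (RegisterStat[rt].Qi != 0)
    RS[r].Qk = RegisterStat[rt].Qi;
else { RS[r].Vk = Regs[rt]; RS[r].Qk = 0; }
RS[r].Busy = 1;
RegisterStat[rd].Qi = r;                      /* rename rd to station r      */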
 
If the operands are available in the registers, they are stored in the V fields. Otherwise, the Q fields are set to indicate the RSs that will produce the values needed as source operands.
The instruction waits at the RS until both its operands are available, indicated by zeros in its Q fields.
The Q fields are set to zero either when this instruction is issued, or when an instruction on which this instruction depends completes and does its write back.
When an instruction has finished execution and the
CDB is available, it can do its write back.



 
For loads and stores, imm, the sign-extended immediate field, is placed in the A field at issue.


Execute, FP operation. Wait until: RS[r].Qj == 0 and RS[r].Qk == 0. Action: compute the result; the operands are in Vj and Vk.
Execute, load-store step 1. Wait until: RS[r].Qj == 0 and r is the head of the load-store queue. Action: compute the effective address, RS[r].A = RS[r].Vj + RS[r].A.
Execute, load step 2. Wait until: load step 1 complete. Action: read from Mem[RS[r].A].
All the buffers, registers, and RSs whose Qj or Qk value equals the tag of the completing RS update their values from the CDB and mark those Q fields with zero to indicate that the values have been received.
Write result, FP operation or load. Wait until: execution complete at r and CDB available. Action: broadcast the result on the CDB; every register x with RegisterStat[x].Qi == r is updated (Regs[x] = result, RegisterStat[x].Qi = 0), every RS waiting on tag r copies the result into the matching V field and clears the Q field, and RS[r].Busy is cleared.
Write result, store. Wait until: execution complete at r and RS[r].Qk == 0. Action: Mem[RS[r].A] = RS[r].Vk; RS[r].Busy is cleared.
The CDB broadcasts its result to many destinations in a single clock cycle. If the waiting instructions have their operands, they can all begin execution on the next clock cycle.
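A sketch of the Write Result broadcast in C-like form (a reconstruction of the standard bookkeeping; NUM_REGS and NUM_RS are assumed constants):

for (int x = 0; x < NUM_REGS; x++)            /* registers waiting on tag r  */
    if (RegisterStat[x].Qi == r) { Regs[x] = result; RegisterStat[x].Qi = 0; }
for (int x = 0; x < NUM_RS; x++) {            /* RSs waiting on tag r        */
    if (RS[x].Qj == r) { RS[x].Vj = result; RS[x].Qj = 0; }
    if (RS[x].Qk == r) { RS[x].Vk = result; RS[x].Qk = 0; }
}
RS[r].Busy = 0;                               /* station is free again       */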
Loads go through two steps in Execute, and stores
perform slightly differently during Write Result, where
they may have to wait for the value to store.
To preserve exception behavior, instructions should not
be allowed to execute if a branch that is earlier in
program order has not yet completed.
Because program order is not maintained after the issue stage, this restriction is usually implemented by preventing any instruction from leaving the issue step if there is a pending branch in the pipeline.
We will later remove this restriction.



A Loop Example
The power of Tomasulo’s algorithm in handling WAR
and WAW hazards is demonstrated in loops.
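The loop itself is not shown in this text; the matching classic loop (a reconstruction, after Hennessy and Patterson) multiplies an array by the scalar in F2:

Loop: L.D    F0,0(R1)
      MUL.D  F4,F0,F2
      S.D    F4,0(R1)
      DADDUI R1,R1,#-8
      BNE    R1,R2,Loop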

If branches are predicted taken, using RSs allows multiple executions of the loop to proceed at once.
The loop is unrolled dynamically by HW, using the RSs
obtained by renaming to act as additional registers.
No need for compiler unrolling.
Let all the instructions in two successive iterations be
issued, but assume that none of the operations within
the loop has completed.
(Instruction status table not reproduced.)
The integer ALU operation is ignored, and it is assumed the branch was predicted taken.
(Reservation station and register status tables not reproduced.)


Two copies of the loop could be sustained with a CPI close to 1.0, provided the multiply completes in 4 clock cycles.
With a 6-cycle multiply, an additional iteration must be in flight before steady state is reached, requiring more FP RSs.
Loads and stores can safely be done OOO if they access different addresses.
If they access the same address and the load precedes the store, interchanging their order results in a WAR hazard. If the store precedes the load, interchanging their order results in a RAW hazard.
Similarly, interchanging two stores to the same address
results in a WAW hazard.
 
To determine whether a load can be executed, the processor checks whether any uncompleted preceding store (in program order) shares the same memory address.
Once a load has completed its effective address calculation (the A field), conflicts are detected by examining the A field of all active store buffers.
If a conflict is found, the load is not sent to the load buffer until the conflicting store completes.
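As a minimal sketch (assuming a StoreBuffer type with the Busy and A fields described earlier; not the actual hardware logic), the conflict check could look like this in C:

/* A load may dispatch only if no active store buffer holds the same address. */
int load_may_dispatch(long load_addr, const StoreBuffer sb[], int n) {
    for (int i = 0; i < n; i++)
        if (sb[i].Busy && sb[i].A == load_addr)
            return 0;   /* conflict: wait for the store to complete */
    return 1;
}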



 
The processor must therefore have computed the A field of every earlier memory operation. A simple solution is to perform the effective address calculations (the A field) in program order.
Stores operate similarly, except that the processor must check for conflicts in both the load and store buffers.
A store must wait until there are no unexecuted loads or stores that are earlier in program order and share the same memory address.
Notice that loads can be reordered freely relative to each other. (Why?)



Dynamic scheduling yields very high performance, provided branches are predicted accurately. The major drawback is the HW complexity.
Each RS must contain a high-speed associative buffer and complex control logic.
A single CDB is a bottleneck; more CDBs can be added. Since each CDB must interact with each RS, the associative tag-matching HW must be duplicated at each RS for each CDB.
Summary: Tomasulo’s alg. combines two techniques:
renaming of the ISA registers to a larger set, and
buffering of source operands from the RF.
Tomasulo’s scheme, invented for the IBM 360/91, has been widely adopted in multiple-issue processors since the 1990s.
It can achieve high performance without requiring the compiler to target code to a specific pipeline structure.
Caches, with their inherently unpredictable delays, are one of the major motivations for dynamic scheduling.
OOO execution allows the processor to continue
executing instructions while awaiting the completion of
a cache miss, hiding some of the cache miss penalty.
Dynamic scheduling is a key component of speculation
(discussed next).
Hardware-Based Speculation
Hardware speculation extends the ideas of dynamic
scheduling.
Branch prediction (BP) reduces the direct stalls
attributable to branches, but is insufficient to generate
the desired amount of ILP.
Exploiting more parallelism requires overcoming the limitation of control dependence.

It is done by speculating on the outcome of branches


and executing the program as if our guesses were
correct.



Speculation combines three key ideas: dynamic BP, speculation, and dynamic scheduling.
Dynamic BP chooses which instructions to execute speculatively, allowing the execution of instructions before control dependences are resolved.

Speculation fetches, issues, and executes instructions,


as if BP were always correct, unlike dynamic scheduling
which only fetches and issues such instructions.
A mechanism to handle the situation where the
speculation is incorrect is required (undo).



An undo capability is required to cancel the effects of
an incorrectly speculated sequence.
Dynamic scheduling without speculation only partially
overlaps basic blocks because it requires that a branch
be resolved before actually executing any instructions
in the successor basic code block.
Dynamic scheduling with speculation deals with the
scheduling of different combinations of basic code
blocks.
HW-based speculation is essentially a data-flow
execution: Operations execute as soon as their
operands are available.
An instruction is executed and bypasses its results to other instructions.
It however does not perform any updates that cannot
be undone (writing to RF or MEM), until it is known to
be no longer speculative.
This additional step in the execution sequence is called
instruction commit.
Instructions may finish execution considerably before they are ready to commit.
Speculation allows instructions to execute OOO but it
forces them to commit in order.



The commit phase requires a special set of buffers holding the results of instructions that have finished execution but have not yet committed.
This buffer is called the reorder buffer (ROB). It is also
used to pass results among instructions that may be
speculated.
The ROB holds the result of an instruction between
completion and commitment.
The ROB is a source of operands for instructions in the
interval between completion and commitment, just as
the RSs provide operands in Tomasulo’s algorithm.
In Tomasulo’s algorithm, once an instruction writes its
result, any subsequently issued instructions will find the
result in the RF.
With speculation, the RF is not updated until the
instruction commits.
The ROB is similar to the store buffer in Tomasulo’s
algorithm. The function of the store buffer is integrated
into the ROB for simplicity.
Each entry in the ROB contains four fields:
The instruction type field indicates whether the
instruction is a branch, a store, or a register operation
(ALU, load).
The destination field supplies the register number (for
loads and ALU operations) or the memory address (for
stores) where the instruction result should be written.
The value field holds the value of the instruction result
until the instruction commits.
The ready field indicates that the instruction has
completed execution, and the value is ready.
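A minimal C sketch of a ROB entry with these four fields (the enum and types are illustrative assumptions):

typedef struct {
    enum { ROB_BRANCH, ROB_STORE, ROB_REGOP } type; /* instruction type      */
    long   dest;    /* register number (load/ALU) or memory address (store)  */
    double value;   /* the result, held until the instruction commits        */
    int    ready;   /* nonzero once execution has completed                  */
} ROBEntry;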



(Figure: FP unit supporting speculation.)


The Four Steps of Instruction Execution
Issue. Get an instruction from the instruction queue.
Issue it if there is an empty RS and an empty slot in the
ROB, otherwise instruction issue is stalled.
Send the operands to the RS if they are available in either the RF or the ROB. Update the control entries to indicate the buffers are in use.
The number of the ROB entry allocated for the result is
also sent to the RS, so that it can be used to tag the
result when it is placed on the CDB.
Notice that the ROB is a queue. Its update at Issue
ensures in-order commitment.
Execute. If one or more of the operands is not yet
available, monitor the CDB while waiting for the
register to be computed. This step checks for RAW
hazards.
When both operands are available at RS, execute the
operation.
Instructions may take multiple clock cycles in this stage,
and loads still require two steps in this stage.
Stores need only have the base register available at this
step, since execution for a store at this point is only
effective address calculation.



Write result. Write it on the CDB (with the ROB tag sent
when the instruction issued) and from the CDB into the
ROB and any RS waiting for this result.
Mark the RS as available.
Special actions are required for stores. If the value to be
stored is available, it is written into the Value field of
the ROB entry for the store.
If not available yet, the CDB is monitored until that
value is broadcast, at which time the Value field of the
ROB entry of the store is updated.



Commit. Commitment follows one of three sequences (a sketch in C follows the list).
1) A branch reaches the head of the ROB. If the prediction is correct, the branch finishes. If it was incorrect (wrong speculation), the ROB is flushed and execution restarts at the correct successor of the branch.
2) Normal commit occurs when an instruction reaches
the head of the ROB and its result is present in the
buffer. The processor then updates the RF with the
result and removes the instruction from the ROB.
3) Store is similar except that MEM is updated rather
than RF.
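As a sketch, the commit logic at the head of the ROB could look like this in C (the mispredicted flag, the memory and register arrays, and the flush helper are illustrative assumptions):

ROBEntry *h = &rob[head];
if (h->ready) {                               /* head finished executing?       */
    if (h->type == ROB_BRANCH) {
        if (h->mispredicted)
            flush_rob_and_restart_fetch();    /* wrong speculation: undo        */
    } else if (h->type == ROB_STORE) {
        memory[h->dest] = h->value;           /* MEM updated only at commit     */
    } else {
        Regs[h->dest] = h->value;             /* RF updated only at commit      */
    }
    head = (head + 1) % ROB_SIZE;             /* reclaim entry: in-order commit */
}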



When an instruction commits, its ROB entry is reclaimed and the RF or MEM destination is updated, eliminating the need for the ROB entry.
If the ROB fills, issuing instructions stops until an entry
is made free, thus enforcing in-order commitment.

Example: How do the tables look when MUL.D is ready to commit? (This is the same example discussed for Tomasulo’s algorithm.)


(Status tables not reproduced.)
The ROB entries are allocated at the issue stage, hence the numbering #1, #2, #4, and so on.
Although the SUB.D (#4) has completed execution, it
does not commit until the MUL.D (#3) commits.
Issue, all instructions. Wait until: reservation station r and ROB entry b are both available. Action: fetch available operands from the RF or the ROB into the V fields, or record the producing ROB entries in the Q fields; set the RS's Dest field to b and mark the ROB entry busy with its destination recorded. For FP operations and loads, RegisterStat of the destination register records ROB entry b; for stores, the value to be stored is tracked like a source operand.


Execute, FP operation. Wait until: RS[r].Qj == 0 and RS[r].Qk == 0. Action: compute the result from Vj and Vk.
Execute, load step 1. Wait until: RS[r].Qj == 0 and no earlier, unresolved store may conflict with the load address. Action: compute the effective address.
Execute, load step 2. Wait until: load step 1 complete. Action: read from Mem[RS[r].A].
Execute, store. Wait until: RS[r].Qj == 0 (base register available). Action: compute the effective address into the store's ROB entry.


Write result, all except store. Wait until: execution done at RS[r] and CDB available. Action: broadcast the result on the CDB, tagged with ROB entry b; waiting RSs and the ROB entry pick it up; free RS[r].
Write result, store. Wait until: execution done at RS[r] and RS[r].Qk == 0. Action: write the value RS[r].Vk into the Value field of the store's ROB entry; free RS[r].


Commit. Wait until: the instruction is at the head h of the ROB and ROB[h].ready == yes. Action: for a register operation, Regs[dest] = ROB[h].Value; for a store, Mem[ROB[h].Destination] = ROB[h].Value; for a mispredicted branch, flush the ROB and restart fetch at the correct successor. The ROB entry h is then freed.


A Loop Example

Assume that all the instructions in the loop (the same loop as in the earlier loop example) have been issued twice, and that the L.D and MUL.D from the first iteration have committed and all other instructions have completed execution.
Since only the FP pipeline is considered, assume the
effective address (R1) for the store is computed by the
time the instruction is issued.
(Status tables not reproduced.)


Register values and memory values are not written until an instruction commits, enabling speculative actions to be undone upon misprediction.
Suppose the BNE is not taken the first time (the loop body executes only once). The instructions prior to the branch commit when each reaches the head of the ROB.
When the branch reaches the ROB head, the ROB is cleared and instruction fetch begins from the other path.
In practice, speculative processors try to recover as early as possible after a branch is mispredicted.



Recovery is done by clearing the ROB of all entries that appear after the mispredicted branch, allowing those that were in the ROB before the branch to continue.
Fetch restarts at the correct branch successor.
Exceptions are handled by not recognizing the exception until the raising instruction is ready to commit.
If a speculated instruction raises an exception, the exception is recorded in the ROB and is flushed along with the instruction when the ROB is cleared.



 
In a speculative processor, the RSs and the register status fields contain the same basic information as in Tomasulo’s scheme.
The differences are that RS numbers are replaced with ROB entry numbers in the Qj and Qk fields, as well as in the register status fields, and that a Dest field is added to each RS, designating the ROB entry that is the destination for the result produced by that RS entry.
The key difference from Tomasulo is that no instruction
after the earliest uncommitted instruction is allowed to
commit (complete).



This implies that a processor with a ROB can dynamically execute code while maintaining precise exceptions.
If MUL.D caused an exception, it
waits until it reaches the ROB’s
head and takes the exception,
flushing all pending instructions
from the ROB. Because
commitment happens in order, this
yields a precise exception.
Tomasulo’s algorithm, in contrast, would complete the SUB.D and ADD.D, so F8 and F6 would be overwritten before MUL.D raises its exception, yielding an imprecise exception.
Multiple Issue
Dynamic scheduling and speculation can achieve an ideal CPI of one.
We would like to decrease the CPI to less than one, but cannot if only one instruction issues per clock cycle.
A VLIW (very long instruction word) multiple-issue processor issues a fixed number of instructions, formatted either as one large instruction or as a fixed instruction packet.
VLIW processors are inherently statically scheduled by
the compiler.



VLIWs use multiple, independent functional units, issuing multiple operations by placing them in a single instruction.
Intel Itanium I and II contain six operations per
instruction packet.
We consider a VLIW processor with instructions that
contain five operations: one integer operation (which
could also be a branch), two FP operations, and two
memory references.
The instruction has a 16- to 24-bit field for each unit, yielding an instruction length of 80–120 bits.
Example: Suppose we have a VLIW that could issue two
memory references, two FP operations, and one integer
operation or branch in every clock cycle.
 
Show an unrolled (compiler) version of the loop for
such a processor. Unroll to minimize stalls. Ignore
delayed branches.
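The loop is not reproduced in the text; the matching classic loop (a reconstruction, after Hennessy and Patterson) adds the scalar in F2 to each array element:

Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)
      DADDUI R1,R1,#-8
      BNE    R1,R2,Loop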



(Figure: the VLIW schedule, annotated with the load hazard and the addition latency, is not reproduced.)
The VLIW instructions occupying the inner loop replace the unrolled sequence (23 operations). The code takes 9 cycles, assuming no branch delay; the issue rate is 2.5 operations per cycle.
The efficiency, that is, the percentage of available slots containing an operation, is about 60%.
To achieve this issue rate requires a larger number of
registers than MIPS would normally use in this loop.
The VLIW code requires at least eight FP registers, while
the same code for the base MIPS can use two FP
registers or five when unrolled and scheduled.



Multiple Issue + Dynamic Scheduling + Speculation
Putting multiple issue, dynamic scheduling, and speculation together yields the microarchitecture used in modern microprocessors.
Consider an issue rate of two instructions per clock; this is conceptually no different from modern processors that issue more instructions per clock.
Assume separate integer and FP units, each able to initiate an operation on every clock. The pipeline can issue any combination of two instructions in a clock.
Tomasulo’s scheme is extended to support the integer unit, the FPU, and speculative execution.
Combining speculative dynamic scheduling with multiple issue requires the ability to complete and commit multiple instructions per clock.
Example: Execute the following code on a two-issue
processor, once without speculation and once with
speculation.
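The code is not reproduced in the text; the matching classic integer loop (a reconstruction, after Hennessy and Patterson) increments each element of an array:

Loop: LD     R2,0(R1)     ; R2 = array element
      DADDIU R2,R2,#1     ; increment R2
      SD     R2,0(R1)     ; store result
      DADDIU R1,R1,#8     ; increment pointer
      BNE    R2,R3,Loop   ; branch if not last element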



There are separate integer units for address calculation,
ALU operations, and branch condition evaluation.
Create a table for the first three iterations of this loop
for both processors. Assume that up to two instructions
of any type can commit per clock.



(Table: issue, memory access, and CDB-write cycles for the first three iterations without speculation. The L.D of each subsequent iteration cannot start execution until the preceding branch resolves.)


(Table: issue, execute, memory read, CDB-write, and commit cycles for the first three iterations with speculation. The L.D of each subsequent iteration starts execution as soon as R1's value is written to the CDB.)


Because the completion rate on the non-speculative pipeline falls behind the issue rate, the non-speculative pipeline will stall after a few more iterations are issued.

The branch can be a critical performance limiter; speculation helps significantly.
The third branch executes in clock cycle 13 on the speculative processor, versus clock cycle 19 on the non-speculative pipeline.


