
Instruction Level Parallelism

How fast can a program be run on a processor with instruction-level parallelism? That depends on:

 The potential parallelism in the program.
 The available parallelism on the processor.
 Our ability to extract parallelism from the original sequential program.
 Our ability to find the best parallel schedule given scheduling constraints.

If all the operations in a program are highly dependent upon one another, then no amount of hardware or parallelization techniques can make the program run fast in parallel.
Processor Architectures
 Processors that exploit ILP usually issue several operations in a single clock cycle.
 In fact, it is possible for a machine to issue just
one operation per clock and yet achieve
instruction level parallelism using the concept of
pipelining.
 Every processor, be it a high-performance
supercomputer or a standard machine, uses an
instruction pipeline.
 With an instruction pipeline, a new instruction
can be fetched every clock while preceding
instructions are still going through the pipeline.
 A simple 5-stage instruction pipeline fetches the instruction (IF), decodes it (ID), executes the operation (EX), accesses memory (MEM), and writes back the result (WB).
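
To make the overlap concrete, here is a minimal sketch (not from the original slides; the diagram format is illustrative) that prints which stage each instruction of such a pipeline occupies on each clock, assuming one instruction issued per clock and no stalls:

    # A 5-stage pipeline: each instruction enters IF one clock after its
    # predecessor and then advances one stage per clock.
    STAGES = ["IF", "ID", "EX", "MEM", "WB"]

    def pipeline_diagram(n_instructions):
        width = n_instructions + len(STAGES) - 1   # total clocks used
        for i in range(n_instructions):
            row = ["   ."] * width
            for s, stage in enumerate(STAGES):
                row[i + s] = f"{stage:>4}"         # instruction i is in stage s at clock i+s
            print(f"i{i}: " + "".join(row))

    pipeline_diagram(4)

Once the pipeline is full, one instruction completes every clock, even though each individual instruction takes five clocks end to end.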
Pipelined Execution
 Some instructions take several clocks to execute.
 When a memory access hits in the cache, it usually takes several
clocks for the cache to return the data.
 The execution of an instruction is pipelined if succeeding instructions
not dependent on the result are allowed to proceed.
 Thus, even if a processor can issue only one operation per clock,
several operations might be in their execution stages at the same
time.
 If the deepest execution pipeline has n stages, potentially n
operations can be in flight at the same time.
 Note that not all instructions are fully pipelined.
 While floating-point adds and multiplies often are fully pipelined,
floating-point divides, being more complex and less frequently
executed, often are not.
Multiple Instruction Issue
 By issuing several operations per clock, processors can keep even more operations in
flight.
 The largest number of operations that can be executed simultaneously can be computed
by multiplying the instruction issue width by the average number of stages in the
execution pipeline.
 Like pipelining, parallelism on multiple-issue machines can be managed either by software
or hardware.
 Machines that rely on software to manage their parallelism are known as VLIW (Very-Long-
Instruction-Word) machines, while those that manage their parallelism with hardware are
known as superscalar machines.
 VLIW machines, as their name implies, have wider than normal instruction words that
encode the operations to be issued in a single clock.
 The compiler decides which operations are to be issued in parallel and encodes the
information in the machine code explicitly.
 Superscalar machines, on the other hand, have a regular instruction set with an ordinary
sequential-execution semantics.
 Superscalar machines automatically detect dependences among instructions and issue
them as their operands become available. Some processors include both VLIW and
superscalar functionality.
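
To make the bound concrete (the figures here are hypothetical): a machine that issues 4 operations per clock and whose execution pipelines average 5 stages can have up to 4 × 5 = 20 operations in flight at the same time.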
Code Scheduling Constraints

 Control-dependence constraints: All the operations executed in the original program must be executed in the optimized one.
 Data-dependence constraints: The operations in the optimized
program must produce the same results as the corresponding ones in
the original program.
 Resource constraints: The schedule must not oversubscribe the
resources on the machine.

These scheduling constraints guarantee that the optimized program produces the same results as the original.
Data Dependence
 True dependence: read after write. If a write is followed by a read of
the same location, the read depends on the value written; such a
dependence is known as a true dependence.
 Antidependence: write after read. If a read is followed by a write to the same location, we say that there is an antidependence from the read to the write.
 Output dependence: write after write. Two writes to the same
location share an output dependence. If the dependence is violated,
the value of the memory location written will have the wrong value
after both operations are performed.
Antidependences and output dependences are referred to as storage-related dependences.
These are not true dependences and can be eliminated by using different locations to store different values.
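
A minimal sketch (the instruction encoding and names are assumptions for illustration) that classifies the dependence between two instructions from their read and write sets:

    # Each instruction is modeled as a pair (writes, reads) of location sets.
    def classify(first, second):
        w1, r1 = first
        w2, r2 = second
        deps = []
        if w1 & r2:
            deps.append("true (read after write)")     # second reads what first wrote
        if r1 & w2:
            deps.append("anti (write after read)")     # second overwrites what first read
        if w1 & w2:
            deps.append("output (write after write)")  # both write the same location
        return deps

    # ST b, R1 (writes b, reads R1) followed by LD R1, c (writes R1, reads c):
    print(classify(({"b"}, {"R1"}), ({"R1"}, {"c"})))  # ['anti (write after read)']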
Antidependences
 Antidependences are not real dependences, in the sense that they do
not arise from the flow of data.
 They are due to a single location being used to store different values. Most of the time, antidependences can be removed by renaming locations, e.g. registers.
 In the example below, reuse of register R1 creates an antidependence between a read of R1 and a later instruction that overwrites it; the dependence can be removed by renaming the second use of R1.
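
Since the original figure is not reproduced here, the following hypothetical sequence illustrates the idea:

    # Hypothetical three-address code, written as (destination, sources) pairs.
    before = [
        ("R1", ["a"]),    # LD R1, a
        ("b",  ["R1"]),   # ST b, R1  -- reads R1
        ("R1", ["c"]),    # LD R1, c  -- overwrites R1: antidependence with the ST
    ]
    # Renaming the second use of R1 to R2 removes the antidependence, so the
    # store and the second load may be reordered or executed in parallel.
    after = [
        ("R1", ["a"]),
        ("b",  ["R1"]),
        ("R2", ["c"]),    # LD R2, c
    ]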
Instruction ordering

 When a compiler emits the instructions corresponding to a program, it imposes a total order on them.
 However, that order is usually not the only valid one, in the sense
that it can be changed without modifying the program’s behavior.
 For example, if two instructions i1 and i2 appear sequentially in that
order and are independent, then it is possible to swap them.
 Among all the valid permutations of the instructions composing a
program — i.e. those that preserve the program’s behavior — some
can be more desirable than others.
 For example, one order might lead to a faster program on some
machine, because of architectural constraints. The aim of instruction
scheduling is to find a valid order that optimizes some metric, like
execution speed.
Instruction Scheduling Example

[Figure: parallel evaluation of the expression (a + b) + c + (d + e)]
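
Assuming additions may be reassociated and two independent additions can issue in the same clock, the expression can be evaluated in three steps instead of four:

    # Hypothetical values; the point is the dependence structure, not the numbers.
    a, b, c, d, e = 1, 2, 3, 4, 5
    # step 1: two independent additions, can issue together
    t1 = a + b
    t2 = d + e
    # step 2: depends on t1
    t3 = t1 + c
    # step 3: depends on t3 and t2
    t4 = t3 + t2
    assert t4 == a + b + c + d + e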
Dependence Graph

 The dependence graph is a directed graph representing dependences among instructions.
 Its nodes are the instructions to schedule, and there is an edge from node n1 to node n2 iff the instruction of n2 depends on n1.
 Any topological sort of the nodes of this graph represents a valid way to schedule the instructions.
Dependence Graph Example
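
The original example figure is not reproduced here; as a substitute, this sketch (instruction names are illustrative) builds the dependence graph of the expression above and prints one valid topological order:

    # Edges point from each instruction to the instructions that depend on it.
    succs = {
        "t1 = a + b":  ["t3 = t1 + c"],
        "t2 = d + e":  ["t4 = t3 + t2"],
        "t3 = t1 + c": ["t4 = t3 + t2"],
        "t4 = t3 + t2": [],
    }

    def topo_order(succs):
        # Kahn's algorithm: repeatedly emit a node with no unemitted predecessors.
        indeg = {n: 0 for n in succs}
        for n in succs:
            for m in succs[n]:
                indeg[m] += 1
        ready = [n for n in succs if indeg[n] == 0]
        order = []
        while ready:
            n = ready.pop()
            order.append(n)
            for m in succs[n]:
                indeg[m] -= 1
                if indeg[m] == 0:
                    ready.append(m)
        return order

    print(topo_order(succs))  # one valid schedule among several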
Difficulty of scheduling

 Optimal instruction scheduling is NP-complete.
 As always, this implies that we will use techniques based on heuristics to find a good — but sometimes not optimal — solution to that problem.
 List scheduling is a technique to schedule the instructions of a single
basic block.
 Its basic idea is to simulate the execution of the instructions, and to try to schedule instructions only when all their operands can be used without stalling the pipeline.
List Scheduling Algorithm

 The list scheduling algorithm maintains two lists:
 Ready is the list of instructions that could be scheduled without stall, ordered by priority.
 Active is the list of instructions that are being executed.
 At each step, the highest-priority instruction from Ready is scheduled and moved to Active, where it stays for a time equal to its delay.
 Before scheduling is performed, renaming is done to remove all antidependences that can be removed.
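
A minimal sketch of the algorithm (simplified to a single-issue machine; the data structures and names are assumptions, not the original slides' code):

    import heapq

    def list_schedule(nodes, succs, preds, delay, priority):
        """nodes: instruction ids; succs/preds: dependence edges;
        delay[n]: latency in clocks; priority[n]: larger issues earlier."""
        waiting = {n: len(preds[n]) for n in nodes}    # unscheduled predecessors
        ready = [(-priority[n], n) for n in nodes if waiting[n] == 0]
        heapq.heapify(ready)
        active = []                                    # (finish clock, node)
        schedule, clock = {}, 0
        while ready or active:
            if ready:
                _, n = heapq.heappop(ready)            # highest-priority ready node
                schedule[n] = clock
                heapq.heappush(active, (clock + delay[n], n))
            clock += 1                                 # advance (stall if nothing ready)
            while active and active[0][0] <= clock:    # retire finished instructions
                _, done = heapq.heappop(active)
                for m in succs[done]:
                    waiting[m] -= 1
                    if waiting[m] == 0:
                        heapq.heappush(ready, (-priority[m], m))
        return schedule                                # node -> issue clock

The heaps make both picking the highest-priority ready instruction and retiring the earliest-finishing active instruction cheap.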
Prioritizing instructions

 Nodes (i.e. instructions) are sorted by priority in the ready list.
 Several schemes exist to compute the priority of a node, which can be equal to:
 The length of the longest latency-weighted path from it to a root of the dependence graph,
 The number of its immediate successors,
 The number of its descendants,
 Its latency, etc.
Unfortunately, no single scheme is better for all cases.
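
As an illustration of the first scheme, the longest latency-weighted path can be computed bottom-up over the dependence graph; a sketch reusing the succs shape from the earlier example, with an assumed delay map:

    from functools import lru_cache

    def latency_weighted_priority(succs, delay):
        # Longest latency-weighted path from each node to the end of the graph.
        @lru_cache(maxsize=None)
        def path(n):
            if not succs[n]:
                return delay[n]
            return delay[n] + max(path(m) for m in succs[n])
        return {n: path(n) for n in succs}

Instructions on the critical path receive the highest priorities, so they are scheduled as early as possible.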
Scheduling Conflicts
 It is hard to decide whether scheduling should be done before or after
register allocation.
 If register allocation is done first, it can introduce antidependences
when reusing registers.
 If scheduling is done first, register allocation can introduce spill code, destroying the schedule.
 Solution: schedule first, then allocate registers, and schedule once more if spilling was necessary.
