Instruction Level Parallelism
The document discusses instruction-level parallelism (ILP) and its impact on program execution speed, emphasizing the importance of potential parallelism in both the program and processor. It explains concepts like pipelining, multiple instruction issue, and scheduling constraints that affect how instructions can be executed concurrently. Additionally, it addresses the complexities of data dependencies and the challenges of optimal instruction scheduling, which is NP-complete, suggesting heuristic approaches like list scheduling to manage these issues.
How fast can a program be run on a processor with instruction-level parallelism? The answer depends on:
– the potential parallelism in the program,
– the available parallelism on the processor,
– our ability to extract parallelism from the original sequential program,
– our ability to find the best parallel schedule given the scheduling constraints.
If all the operations in a program are highly dependent upon one another, then no amount of hardware or parallelization techniques can make the program run fast in parallel.

Processor Architectures

A processor usually issues several operations in a single clock cycle. In fact, it is possible for a machine to issue just one operation per clock and yet achieve instruction-level parallelism through pipelining. Every processor, be it a high-performance supercomputer or a standard machine, uses an instruction pipeline: a new instruction can be fetched every clock while preceding instructions are still going through the pipeline. A simple 5-stage instruction pipeline first fetches the instruction (IF), decodes it (ID), executes the operation (EX), accesses memory (MEM), and writes back the result (WB).

Pipelined Execution

Some instructions take several clocks to execute; even when a memory access hits in the cache, it usually takes several clocks for the cache to return the data. The execution of an instruction is pipelined if succeeding instructions not dependent on its result are allowed to proceed. Thus, even if a processor can issue only one operation per clock, several operations might be in their execution stages at the same time; the sketch below makes this overlap concrete. If the deepest execution pipeline has n stages, potentially n operations can be in flight at once. Note that not all instructions are fully pipelined: floating-point adds and multiplies often are, but floating-point divides, being more complex and less frequently executed, often are not.

Multiple Instruction Issue

By issuing several operations per clock, processors can keep even more operations in flight. The largest number of operations that can be executed simultaneously is the instruction issue width multiplied by the average number of stages in the execution pipeline. Like pipelining, parallelism on multiple-issue machines can be managed either by software or by hardware. Machines that rely on software to manage their parallelism are known as VLIW (Very Long Instruction Word) machines, while those that manage it with hardware are known as superscalar machines. VLIW machines, as their name implies, have wider-than-normal instruction words that encode the operations to be issued in a single clock; the compiler decides which operations are issued in parallel and encodes that information explicitly in the machine code. Superscalar machines, on the other hand, have a regular instruction set with ordinary sequential-execution semantics: they automatically detect dependences among instructions and issue each instruction as its operands become available. Some processors include both VLIW and superscalar functionality.
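To make the pipeline overlap concrete, here is a minimal sketch, not from the original slides, that prints which stage of a classic 5-stage in-order pipeline each instruction occupies on every clock; the instruction names and the no-stall assumption are illustrative.

    # Timeline of a 5-stage in-order pipeline (illustrative sketch).
    # With no stalls, a new instruction enters IF each clock, so up to
    # five instructions are in flight at once.

    STAGES = ["IF", "ID", "EX", "MEM", "WB"]

    def pipeline_timeline(instructions):
        """Map each instruction to {clock: stage}, assuming no stalls."""
        timeline = {}
        for i, instr in enumerate(instructions):
            # Instruction i occupies stage s during clock i + s.
            timeline[instr] = {i + s: stage for s, stage in enumerate(STAGES)}
        return timeline

    instrs = ["i1", "i2", "i3", "i4"]
    timeline = pipeline_timeline(instrs)
    last = len(instrs) + len(STAGES) - 2
    for instr in instrs:
        row = [timeline[instr].get(c, "...") for c in range(last + 1)]
        print(instr, " ".join(f"{s:>4}" for s in row))

Running it shows the overlap: at clock 3, i1 is already in MEM while i4 is only being fetched (IF).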
Code Scheduling Constraints

– Control-dependence constraints: all the operations executed in the original program must be executed in the optimized one.
– Data-dependence constraints: the operations in the optimized program must produce the same results as the corresponding ones in the original program.
– Resource constraints: the schedule must not oversubscribe the resources of the machine.
These scheduling constraints guarantee that the optimized program produces the same results as the original.

Data Dependence

– True dependence (read after write, RAW): if a write is followed by a read of the same location, the read depends on the value written.
– Antidependence (write after read, WAR): if a read is followed by a write to the same location, there is an antidependence from the read to the write.
– Output dependence (write after write, WAW): two writes to the same location share an output dependence; if it is violated, the location will hold the wrong value after both operations are performed.

Antidependences and output dependences are referred to as storage-related dependences. They are not true dependences and can be eliminated by using different locations to store different values.

Antidependences

Antidependences are not real dependences, in the sense that they do not arise from the flow of data; they are due to a single location being used to store different values. Most of the time, storage-related dependences can be removed by renaming locations, e.g. registers. In the original slide's example (the figure is not reproduced here), two memory load instructions both write R1, creating a storage-related (WAW) dependence that can be removed by renaming the second definition of R1; the sketch below reconstructs the idea.
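The following is a hypothetical reconstruction of that renaming example (the figure is unavailable, so the registers and instructions are assumptions): instructions are modeled as (destination, sources) pairs, and a small helper classifies the dependence between two of them.

    # Hypothetical reconstruction of the renaming example; the original
    # figure is unavailable, so these instructions are illustrative.
    # Each instruction is a (destination, sources) pair.

    before = [
        ("R1", []),      # LD  R1, a
        ("R3", ["R1"]),  # ADD R3, R1, 4
        ("R1", []),      # LD  R1, b   <- second write to R1: WAW dependence
        ("R4", ["R1"]),  # ADD R4, R1, 8
    ]

    after = [
        ("R1", []),      # LD  R1, a
        ("R3", ["R1"]),  # ADD R3, R1, 4
        ("R2", []),      # LD  R2, b   <- renamed: the WAW dependence is gone
        ("R4", ["R2"]),  # ADD R4, R2, 8
    ]

    def dependence(first, second):
        """Classify how a later instruction depends on an earlier one."""
        d1, s1 = first
        d2, s2 = second
        if d1 in s2:
            return "RAW (true dependence)"
        if d2 in s1:
            return "WAR (antidependence)"
        if d1 == d2:
            return "WAW (output dependence)"
        return None

    print(dependence(before[0], before[2]))  # WAW (output dependence)
    print(dependence(after[0], after[2]))    # None: renaming removed it

Renaming the second load's destination to the fresh register R2 removes the storage-related dependence, so the two loads may be reordered or overlapped.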
Instruction ordering

When a compiler emits the instructions corresponding to a program, it imposes a total order on them. However, that order is usually not the only valid one, in the sense that it can be changed without modifying the program's behavior. For example, if two instructions i1 and i2 appear sequentially in that order and are independent, it is possible to swap them. Among all the valid permutations of the instructions composing a program, i.e. those that preserve the program's behavior, some can be more desirable than others: one order might lead to a faster program on some machine because of architectural constraints. The aim of instruction scheduling is to find a valid order that optimizes some metric, like execution speed.

Instruction Scheduling Example

Consider the expression (a + b) + c + (d + e). Evaluated as one long left-to-right chain, each addition waits for the previous one; a parallel evaluation computes the independent subexpressions a + b and d + e at the same time, as the sketch below shows.
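A minimal sketch of the difference, modeling expressions as nested pairs and counting the cycles needed when independent adds may run in the same cycle (the one-cycle-per-add latency and the availability of two adders are assumptions):

    # Expression trees as nested pairs; leaves are variable names.
    def depth(expr):
        """Cycles needed with enough adders: the height of the tree."""
        if isinstance(expr, str):
            return 0
        left, right = expr
        return 1 + max(depth(left), depth(right))

    chain = (((("a", "b"), "c"), "d"), "e")   # ((((a + b) + c) + d) + e)
    tree  = ((("a", "b"), "c"), ("d", "e"))   # ((a + b) + c) + (d + e)
    print(depth(chain))  # 4 cycles: every add waits for the previous one
    print(depth(tree))   # 3 cycles: a + b and d + e run in parallel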
Dependence Graph

The dependence graph is a directed graph representing dependences among instructions. Its nodes are the instructions to schedule, and there is an edge from node n1 to node n2 iff the instruction of n2 depends on that of n1. Any topological sort of the nodes of this graph represents a valid way to schedule the instructions. (The slide's worked dependence-graph example is a figure that is not reproduced here; the sketch below builds a small graph instead.)
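In place of the missing figure, here is a minimal sketch (the instruction sequence is an assumption) that builds a dependence graph with Python's standard graphlib module and prints one valid topological order:

    from graphlib import TopologicalSorter  # standard library, Python 3.9+

    # An edge n1 -> n2 means n2 depends on n1; graphlib takes the mapping
    # from each node to the set of its predecessors.
    deps = {
        "LD R1, a":       set(),
        "LD R2, b":       set(),
        "ADD R3, R1, 4":  {"LD R1, a"},
        "ADD R4, R2, 8":  {"LD R2, b"},
        "MUL R5, R3, R4": {"ADD R3, R1, 4", "ADD R4, R2, 8"},
    }

    # Any topological sort of the graph is a valid schedule.
    print(list(TopologicalSorter(deps).static_order()))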
Difficulty of scheduling

Optimal instruction scheduling is NP-complete. As always, this implies that we will use techniques based on heuristics to find a good, but sometimes not optimal, solution to the problem. List scheduling is such a technique for scheduling the instructions of a single basic block. Its basic idea is to simulate the execution of the instructions, and to try to schedule an instruction only when all of its operands can be used without stalling the pipeline.
List Scheduling Algorithm

The list scheduling algorithm maintains two lists:
– Ready, the list of instructions that could be scheduled without a stall, ordered by priority;
– Active, the list of instructions that are currently being executed.
At each step, the highest-priority instruction from Ready is scheduled and moved to Active, where it stays for a time equal to its delay. Before scheduling is performed, renaming is done to remove all the antidependences that can be removed. A sketch of the algorithm follows this list.
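Below is a minimal single-issue list scheduler (a sketch: the instruction names, latencies, and single-issue assumption are illustrative, not from the slides). The priority used is the latency-weighted length of the longest path to the end of the block, one of the schemes discussed next.

    # preds[n] = instructions n depends on; latency[n] = cycles n executes.
    preds = {
        "a": set(), "b": set(),
        "c": {"a"}, "d": {"a", "b"},
        "e": {"c", "d"},
    }
    latency = {"a": 3, "b": 3, "c": 1, "d": 1, "e": 1}
    succs = {n: {m for m in preds if n in preds[m]} for n in preds}

    def priority(n):
        # Longest latency-weighted path from n to the end of the block.
        return latency[n] + max((priority(s) for s in succs[n]), default=0)

    ready = {n for n in preds if not preds[n]}  # operands available now
    active = {}                                 # instruction -> finish cycle
    done, cycle, schedule = set(), 0, []

    while ready or active:
        # Retire instructions whose results become available this cycle.
        for n, finish in list(active.items()):
            if finish <= cycle:
                del active[n]
                done.add(n)
                # A successor is ready once all its predecessors are done.
                ready |= {s for s in succs[n] if preds[s] <= done}
        if ready:  # issue at most one instruction per cycle
            n = max(ready, key=priority)
            ready.remove(n)
            active[n] = cycle + latency[n]
            schedule.append((cycle, n))
        cycle += 1

    print(schedule)  # e.g. [(0, 'a'), (1, 'b'), (3, 'c'), (4, 'd'), (5, 'e')]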
Prioritizing instructions

Nodes (i.e. instructions) are sorted by priority in the Ready list. Several schemes exist to compute the priority of a node, which can be equal to:
– the length of the longest latency-weighted path from it to a root of the dependence graph,
– the number of its immediate successors,
– the number of its descendants,
– its latency,
– etc.
Unfortunately, no single scheme is better in all cases.

Scheduling Conflicts

It is hard to decide whether scheduling should be done before or after register allocation: if register allocation is done first, it can introduce antidependences when registers are reused; if scheduling is done first, register allocation can introduce spill code that destroys the schedule. The solution is to schedule first, then allocate registers, and schedule once more if spilling was necessary; the sketch below outlines this phase ordering.
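A rough sketch of that phase ordering (the pass functions here are stubs standing in for real compiler passes, not an actual API):

    # Hypothetical driver showing the schedule / allocate / reschedule order.

    def schedule(block):
        """Stand-in for list scheduling (see the sketch above)."""
        return block

    def allocate_registers(block):
        """Stand-in for register allocation; returns (block, spilled?)."""
        return block, False  # pretend no spill code was needed

    def compile_block(block):
        block = schedule(block)                     # schedule virtual registers
        block, spilled = allocate_registers(block)  # may insert spill code
        if spilled:
            block = schedule(block)                 # repair schedule around spills
        return block

    print(compile_block(["LD R1, a", "ADD R2, R1, 4"]))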