PCC-CS 402
Module 3
• In Dynamic Parallelism (DP), the processor decides at run-time which instructions can be
executed in parallel.
Example – Pentium Processor.
• The CPU examines the instruction stream and decides which instructions to issue in each cycle.
• The CPU resolves hazards at runtime using advanced techniques.
• The compiler can help by re-ordering instructions.
• It can be observed that Operation 3 depends upon the outcome of Operation 1 and
Operation 2.
• However, Operation 1 and Operation 2 do not have interdependencies; hence they can be performed in parallel.
• If we assume that each operation consumes one unit of time, then without ILP, 3 units of time are consumed, whereas with ILP only 2 units are consumed. Therefore the amount of ILP is 3/2.
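The three operations referred to above come from a figure that is not reproduced here; a minimal sketch consistent with the description (the variable names e1, e2, e3 are hypothetical) is:
e1 = a + b; // Operation 1
e2 = c + d; // Operation 2
e3 = e1 * e2; // Operation 3: needs the results of Operations 1 and 2
Operations 1 and 2 can execute in the same time unit and Operation 3 in the next, which gives the 3/2 figure above.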
Illustration 2
• The simplest and most common way to increase the amount of parallelism
available among instructions is to exploit parallelism among iterations of a
loop. This type of parallelism is often called loop-level parallelism.
• Example 1:
for (i=1; i<=1000; i= i+1)
x[i] = x[i] + y[i];
• This is a parallel loop. Every iteration of the loop can overlap with any
other iteration, although within each loop iteration there is little opportunity
for overlap.
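One way to picture this overlap is to unroll the loop (a sketch; in practice the compiler or hardware performs the unrolling). The two statements in the unrolled body touch different elements and can execute in parallel:
for (i=1; i<=1000; i= i+2)
{
x[i] = x[i] + y[i]; // iteration i
x[i+1] = x[i+1] + y[i+1]; // iteration i+1: independent of the line above
}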
• Example 2:
for (i=1; i<=100; i= i+1)
{
a[i] = a[i] + b[i]; //s1
b[i+1] = c[i] + d[i]; //s2
}
• Is this loop parallel? If not, how can it be made parallel?
• Statement s1 uses the value assigned in the previous iteration by statement s2, so there is a loop-carried
dependency between s1 and s2. Despite this dependency, this loop can be made parallel because the
dependency is not circular:
1. neither statement depends on itself
2. while s1 depends on s2, s2 does not depend on s1.
• A loop is parallel unless there is a cycle in the dependencies, since the absence of a cycle means that the
dependencies give a partial ordering on the statements.
• To expose the parallelism the loop must be transformed to conform to the partial order. Two observations are
critical to this transformation:
1. There is no dependency from s1 to s2, so interchanging the two statements will not affect the execution of s2.
2. On the first iteration of the loop, statement s1 depends on the value of b[1] computed prior to
initiating the loop.
• This allows us to replace the loop above with the following code sequence, which makes it possible to overlap the iterations of the loop:
a[1] = a[1] + b[1];
for (i=1; i<=99; i= i+1)
{
b[i+1] = c[i] + d[i];
a[i+1] = a[i+1] + b[i+1];
}
b[101] = c[100] + d[100];
• Example 3:
for (i=1; i<=100; i= i+1)
{
a[i+1] = a[i] + c[i]; //S1
b[i+1] = b[i] + a[i+1]; //S2
}
• This loop is not parallel because it has cycles in the dependencies: S1 depends on itself (a[i+1] uses a[i] from the previous iteration), S2 depends on itself (b[i+1] uses b[i]), and S2 also depends on S1.
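Unrolling two iterations (a sketch) makes the cycles visible: each statement needs a value produced by its own earlier copy, so successive iterations cannot be overlapped.
a[2] = a[1] + c[1]; // S1, i=1
b[2] = b[1] + a[2]; // S2, i=1
a[3] = a[2] + c[2]; // S1, i=2: needs a[2] computed by S1 in the previous iteration
b[3] = b[2] + a[3]; // S2, i=2: needs b[2] computed by S2 in the previous iteration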
Implementation of ILP
• Out-of-order execution, where instructions execute in any order that does not violate data dependencies.
Note that this technique is independent of both pipelining and superscalar execution.
Current implementations of out-of-order execution dynamically (i.e., while the program is
executing and without any help from the compiler) extract ILP from ordinary programs. An
alternative is to extract this parallelism at compile time and somehow convey this
information to the hardware.
• Register renaming, a technique used to avoid the unnecessary serialization of program operations imposed by the reuse of registers by those operations; it is used to enable out-of-order execution (a source-level sketch follows this list).
• Branch prediction, which is used to avoid stalling while control dependencies are resolved. Branch prediction is used together with speculative execution.
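A minimal source-level sketch of what register renaming removes (r1 and r2 stand for hypothetical registers; real renaming is performed by the hardware on machine registers, not by the programmer):
// Without renaming: reusing r1 creates a WAR hazard, so the
// third statement must wait until the second has read r1.
r1 = x + y;
z = r1 * 2;
r1 = p + q;
w = r1 * 3;
// With renaming: the second computation gets a fresh register r2,
// so the two pairs of statements can execute in parallel.
r1 = x + y;
z = r1 * 2;
r2 = p + q;
w = r2 * 3;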
Classification of ILP Architectures
• In contrast to a scalar processor that can execute at most one single instruction per clock
cycle, a superscalar processor can execute more than one instruction during a clock cycle
by simultaneously dispatching multiple instructions to different execution units on the
processor.
• It therefore allows for more throughput (the number of instructions that can be executed
in a unit of time) than would otherwise be possible at a given clock rate.
• Each execution unit is not a separate processor (or a core if the processor is a multi-core
processor), but an execution resource within a single CPU such as an arithmetic logic
unit.
Superscalar Processors
• Equally applicable to RISC & CISC, but more straightforward in RISC machines.
• Dependent upon:
• Instruction level parallelism possible
• Compiler based optimization
• Hardware support
• Limited by (a short sketch follows this list):
• Data dependency
• Procedural dependency
• Resource conflicts
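A short sketch (hypothetical variables) of the first and third limits:
t = a * b;
u = t + c; // data dependency: u needs t, so this cannot issue in the same cycle as the line above
v = a * b; // v and w are independent of each other,
w = c * d; // but if the CPU has only one multiplier they still conflict over that resource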
Very Long Instruction Word - VLIW
• The VLIW architecture is implemented through Static Scheduling. This means that scheduling decisions are not made at runtime by the processor but are handled by the compiler.
• The compiler takes the instructions that need to be handled and compiles them into Object Code, packing groups of independent operations together.
• The Object Code is then passed to the processor, which reads the operands from the Register File.
• It is this wide, packed instruction that is referred to as a Very Long Instruction Word.
• The compiler must guarantee that the instructions grouped together are independent, so that they can be executed in parallel.
• The Compiler pre-arranges the Object Code so the VLIW chip can quickly execute the
instructions in parallel.
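As a sketch, the compiler might pack the three operations from the dynamic-parallelism example at the start of this module (hypothetical e1, e2, e3) into two bundles; the slot layout below is hypothetical, and real VLIW formats differ:
// bundle 1: [ slot 1: e1 = a + b | slot 2: e2 = c + d | slot 3: (empty) ]
// bundle 2: [ slot 1: e3 = e1 * e2 | slot 2: (empty) | slot 3: (empty) ]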
VLIW limitations
• Superscalar Processors generally use Dynamic Scheduling, which shifts all of the ILP complexity onto the processor hardware.
• VLIW chips don’t need most of the complex circuitry that Superscalar chips must use to
coordinate parallel execution at runtime.
Array Processors
• Normal processor:
    execute 10 times:
        read the next instruction and decode it
        fetch this number, fetch that number
        add them
        put the result in the location
    end loop
• Array processor:
    read the instruction and decode it
    fetch these 10 numbers
    fetch those 10 numbers
    add them
    put the result in the location
How an Array Processor is beneficial
• In this configuration, the Attached Array Processor has an input/output interface to a common processor and another interface with a local memory.
• The local memory connects to the main memory with the help of a high-speed memory bus.
Why use Array Processors?