Module 5: Instruction-Level Parallelism and Pipelining
Instruction-Level Parallelism: Concepts and Challenges
• All processors since about 1985 use pipelining to
overlap the execution of instructions and improve
performance. This potential overlap among instructions
is called instruction-level parallelism (ILP), since the
instructions can be evaluated in parallel.
• There are two largely separable approaches to exploiting ILP: (1) an approach that relies on hardware to help discover and exploit the parallelism dynamically, and (2) an approach that relies on software technology to find the parallelism statically at compile time.
• The value of the CPI (cycles per instruction) for a pipelined processor is the sum of the base CPI and all contributions from stalls:

  Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
• The ideal pipeline CPI is a measure of the maximum
performance attainable by the implementation.
• By reducing each term on the right-hand side, we decrease the overall pipeline CPI or, alternatively, increase the IPC (instructions per clock); the worked example below puts numbers to this.
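• As a rough numeric illustration, the formula can be evaluated directly. The stall contributions below are assumed values for the sketch, not figures from the text:

#include <stdio.h>

/* Illustrative only: the stall contributions are assumed numbers. */
int main(void) {
    double ideal_cpi   = 1.00;  /* one instruction per cycle, ideally   */
    double structural  = 0.10;  /* assumed stall cycles per instruction */
    double data_hazard = 0.25;  /* assumed                              */
    double control     = 0.15;  /* assumed                              */

    double pipeline_cpi = ideal_cpi + structural + data_hazard + control;
    printf("Pipeline CPI = %.2f, IPC = %.2f\n",
           pipeline_cpi, 1.0 / pipeline_cpi);  /* prints 1.50 and 0.67 */
    return 0;
}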
What is Instruction-Level Parallelism?
• The simplest and most common way to increase the
ILP is to exploit parallelism among iterations of a
loop.
• This type of parallelism is often called loop-level
parallelism.
• Example of a loop that adds two 1000-element arrays
and is completely parallel:
for (i = 0; i <= 999; i = i + 1)
    x[i] = x[i] + y[i];
• Every iteration of the loop can overlap with any other iteration, although within each loop iteration there is little or no opportunity for overlap; the unrolled sketch below makes that independence concrete.
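• A minimal sketch, assuming the 1000-element arrays above hold doubles (the text does not state the element type): unrolling by four places independent iterations side by side, where a pipelined or multiple-issue processor can overlap them.

/* Unrolled by 4: the four adds touch disjoint elements, so the
 * hardware (or the compiler's scheduler) is free to overlap them.
 * Assumes the 1000-element size from the text (divisible by 4). */
void unrolled_add(double x[1000], const double y[1000]) {
    for (int i = 0; i < 1000; i += 4) {
        x[i]     = x[i]     + y[i];
        x[i + 1] = x[i + 1] + y[i + 1];
        x[i + 2] = x[i + 2] + y[i + 2];
        x[i + 3] = x[i + 3] + y[i + 3];
    }
}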
• An important alternative method for
exploiting loop-level parallelism is the use of
SIMD in both vector processors and Graphics
Processing Units (GPUs).
• A SIMD instruction exploits data-level parallelism by
operating on a small to moderate number of data items
in parallel (typically two to eight).
• A vector instruction exploits data-level parallelism by
operating on many data items in parallel using both
parallel execution units and a deep pipeline.
• For example, the above code sequence, which in simple form requires seven instructions per iteration (two loads, an add, a store, two address updates, and a branch) for a total of 7000 instructions, might execute in one-quarter as many instructions (about 1750) on a SIMD architecture where four data items are processed per instruction; a sketch of such a version follows.
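• A minimal sketch of the four-items-per-instruction case using x86 SSE intrinsics; the float element type and the function name are assumptions for illustration, since the text names no particular SIMD ISA:

#include <immintrin.h>

/* Four floats per instruction, mirroring the "one-quarter as many
 * instructions" estimate above. Assumes n is a multiple of 4; a
 * scalar cleanup loop would handle any remainder. */
void simd_add(float *x, const float *y, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 vx = _mm_loadu_ps(&x[i]);          /* load x[i..i+3] */
        __m128 vy = _mm_loadu_ps(&y[i]);          /* load y[i..i+3] */
        _mm_storeu_ps(&x[i], _mm_add_ps(vx, vy)); /* add and store  */
    }
}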
• On some vector processors, this sequence might take
only four instructions: two instructions to load the
vectors x and y from memory, one instruction to add
the two vectors, and an instruction to store back the
result vector.
• Of course, these instructions would be pipelined and have relatively long latencies, but these latencies may be overlapped; a rough C analogue of the vector style is sketched below.
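• A minimal sketch of the vector style using GCC/Clang vector extensions; this is an assumption for illustration, since the text names no particular vector ISA. The single wide add plays the role of the one vector add instruction, and a real vector machine would strip-mine the 1000 elements across its vector registers:

#include <string.h>

typedef double v8d __attribute__((vector_size(64))); /* 8 doubles wide */

void vector_add(double *x, const double *y, int n) {
    for (int i = 0; i < n; i += 8) {       /* assumes n % 8 == 0    */
        v8d vx, vy;
        memcpy(&vx, &x[i], sizeof vx);     /* "load vector x"       */
        memcpy(&vy, &y[i], sizeof vy);     /* "load vector y"       */
        vx = vx + vy;                      /* one add, 8 elements   */
        memcpy(&x[i], &vx, sizeof vx);     /* "store result vector" */
    }
}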
Data Dependences and Hazards