
Module 5: Instruction Level Parallelism and Pipelining
Instruction-Level Parallelism: Concepts and
Challenges
• All processors since about 1985 use pipelining to
overlap the execution of instructions and improve
performance. This potential overlap among instructions
is called instruction-level parallelism (ILP), since the
instructions can be evaluated in parallel.
• There are two largely separable approaches to
exploiting ILP: (1) an approach that relies on hardware
to help discover and exploit the parallelism
dynamically,
(2) an approach that relies on software technology to
find parallelism statically at compile time.
• The value of the CPI (cycles per instruction) for a
pipelined processor is the sum of the base CPI and all
contributions from stalls:
Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls.
• The ideal pipeline CPI is a measure of the maximum
performance attainable by the implementation.
• By reducing each of the terms of the right-hand side,
we decrease the overall pipeline CPI or, alternatively,
increase the IPC (instructions per clock).
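• As a worked example (the stall contributions here are assumed for illustration, not taken from the text): suppose the ideal pipeline CPI is 1.00 and measurement shows 0.05 structural stalls, 0.20 data hazard stalls, and 0.15 control stalls per instruction. Then
Pipeline CPI = 1.00 + 0.05 + 0.20 + 0.15 = 1.40, so IPC = 1/1.40 ≈ 0.71.
Halving the data hazard stalls to 0.10 would lower the CPI to 1.30 and raise the IPC to about 0.77.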
What is Instruction-Level Parallelism?
• The simplest and most common way to increase the
ILP is to exploit parallelism among iterations of a
loop.
• This type of parallelism is often called loop-level
parallelism.
• Example of a loop that adds two 1000-element arrays
and is completely parallel:
for (i=0; i<=999; i=i+1)
x[i] = x[i] + y[i];
• Every iteration of the loop can overlap with
any other iteration, although within each loop
iteration there is little or no opportunity for
overlap.
• An important alternative method for
exploiting loop-level parallelism is the use of
SIMD in both vector processors and Graphics
Processing Units (GPUs).
• A SIMD instruction exploits data-level parallelism by
operating on a small to moderate number of data items
in parallel (typically two to eight).
• A vector instruction exploits data-level parallelism by
operating on many data items in parallel using both
parallel execution units and a deep pipeline.
• For example, the above code sequence, which in simple
form requires seven instructions per iteration (two loads,
an add, a store, two address updates, and a branch) for a
total of 7000 instructions, might execute in one-quarter
as many instructions in some SIMD architecture where
four data items are processed per instruction.
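• To make that count concrete: 1000 iterations × 7 instructions = 7000 instructions in the scalar version; a SIMD version that processes four data items per instruction would execute roughly 7000/4 = 1750 instructions (assuming the loads, add, and store all operate four-wide and the overhead instructions scale down in proportion).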
• On some vector processors, this sequence might take
only four instructions: two instructions to load the
vectors x and y from memory, one instruction to add
the two vectors, and an instruction to store back the
result vector.
• Of course, these instructions would be pipelined and
have relatively long latencies, but these latencies
may be overlapped.
Data Dependences and Hazards

• In particular, to exploit instruction-level parallelism we must determine which instructions can be executed in parallel.
• If two instructions are parallel, they can execute
simultaneously in a pipeline of arbitrary depth without
causing any stalls, assuming the pipeline has sufficient
resources.
• If two instructions are dependent, they are not parallel
and must be executed in order, although they may often
be partially overlapped.
• The key in both cases is to determine whether an
instruction is dependent on another instruction.
Data Dependences (True Data Dependences)
• An instruction j is data dependent on instruction i if either of
the following holds:
■ Instruction i produces a result that may be used by instruction
j.
■ Instruction j is data dependent on instruction k, and
instruction k is data dependent on instruction i.
• The second condition simply states that one instruction is
dependent on another if there exists a chain of dependences
of the first type between the two instructions.
• This dependence chain can be as long as the entire program.
• For example, consider the following MIPS code
sequence that increments a vector of values in
memory (starting at 0(R1) and with the last element
at 8(R2)) by a scalar in register F2.
• Loop: L.D    F0,0(R1)    ;F0=array element
        ADD.D  F4,F0,F2    ;add scalar in F2
        S.D    F4,0(R1)    ;store result
        DADDIU R1,R1,#-8   ;decrement pointer 8 bytes
        BNE    R1,R2,Loop  ;branch R1!=R2
The data dependences in this code sequence involve
both floating-point data:
Loop: L.D    F0,0(R1)    ;F0=array element
      ADD.D  F4,F0,F2    ;add scalar in F2
      S.D    F4,0(R1)    ;store result

and integer data:

      DADDIU R1,R1,#-8   ;decrement pointer
                         ;8 bytes (per DW)
      BNE    R1,R2,Loop  ;branch R1!=R2
• In both of the above dependent sequences, each instruction depends on the previous one. In the corresponding figures, arrows show the order that must be preserved for correct execution: each arrow points from an instruction that must precede the instruction that the arrowhead points to.
• If two instructions are data dependent, they must
execute in order and cannot execute simultaneously or
be completely overlapped.
• The dependence implies that there would be a chain of
one or more data hazards between the two instructions.
• The presence of a data dependence in an instruction
sequence reflects a data dependence in the source
code from which the instruction sequence was
generated. The effect of the original data
dependence must be preserved.
• Dependences are a property of programs. Whether a
given dependence results in an actual hazard being
detected and whether that hazard actually causes a
stall are properties of the pipeline organization.
• A data dependence conveys three things: (1) the
possibility of a hazard, (2) the order in which results must
be calculated, and (3) an upper bound on how much
parallelism can possibly be exploited.
• A dependence can be overcome in two different ways:
(1) maintaining the dependence but avoiding a hazard,
and (2) eliminating a dependence by transforming the
code.
• Scheduling the code is the primary method used to avoid
a hazard without altering a dependence, and such
scheduling can be done both by the compiler and by the
hardware.
• A data value may flow between instructions either
through registers or through memory locations.
• Dependences that flow through memory locations are
more difficult to detect, since two addresses may refer
to the same location but look different:
• For example, 100(R4) and 20(R6) may be identical
memory addresses. In addition, the effective address of
a load or store may change from one execution of the
instruction to another (so that 20(R4) and 20(R4) may
be different), further complicating the detection of a
dependence.
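• A minimal C sketch of this aliasing problem (the function and variable names are hypothetical, chosen only for illustration):

void update(int *a, int *b) {
    *a = *a + 1;    /* read-modify-write through pointer a */
    *b = *b * 2;    /* read-modify-write through pointer b */
}

• If the caller passes the same address for a and b, the two statements are data dependent through memory and their order matters; if the addresses differ, they are independent. Because the compiler and hardware often cannot prove which case holds, they must conservatively assume a dependence.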
Name Dependences
• The second type of dependence is a name
dependence. A name dependence occurs when two
instructions use the same register or memory
location, called a name, but there is no flow of data
between the instructions associated with that name.
• There are two types of name dependences between
an instruction i that precedes instruction j in program
order:
• 1. An antidependence between instruction i and
instruction j occurs when instruction j writes a
register or memory location that instruction i reads.
The original ordering must be preserved to ensure
that i reads the correct value.
• 2. An output dependence occurs when instruction i
and instruction j write the same register or memory
location. The ordering between the instructions must
be preserved to ensure that the value finally written
corresponds to instruction j.
• Both antidependences and output dependences are
name dependences, as opposed to true data
dependences, since there is no value being
transmitted between the instructions. Because a
name dependence is not a true dependence,
instructions involved in a name dependence can
execute simultaneously or be reordered, if the name
(register number or memory location) used in the
instructions is changed so the instructions do not
conflict.
• This renaming can be more easily done for register operands,
where it is called register renaming. Register renaming can be
done either statically by a compiler or dynamically by the
hardware.
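• A minimal sketch of renaming, written as C-style pseudocode with registers shown as variables (the register numbers are hypothetical):

/* Before renaming: R1 is reused as a name. */
R1 = R2 + R3;    /* i: writes R1                               */
R4 = R1 * 2;     /*    reads R1 (true dependence on i)         */
R1 = R5 + R6;    /* j: output dependence (WAW) with i and an   */
                 /*    antidependence (WAR) with the read above */
R7 = R1 * 3;     /*    reads the value written by j            */

/* After renaming j's destination to a fresh register R8,
   only the true dependences remain, so the two chains can
   execute simultaneously or be reordered. */
R1 = R2 + R3;
R4 = R1 * 2;
R8 = R5 + R6;
R7 = R8 * 3;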
Data Hazards
• A hazard exists whenever there is a name or data dependence
between instructions, and they are close enough that the
overlap during execution would change the order of access to
the operand involved in the dependence.
• Because of the dependence, we must preserve what is called
program order—that is, the order that the instructions would
execute in if executed sequentially one at a time as
determined by the original source program.
• Detecting and avoiding hazards ensures that
necessary program order is preserved.
• Data hazards may be classified as one of three types,
depending on the order of read and write accesses in
the instructions.
• By convention, the hazards are named by the
ordering in the program that must be preserved by
the pipeline.
• Consider two instructions i and j, with i preceding j in
program order. The possible data hazards are,
❑ RAW (read after write)—j tries to read a source before i writes
it, so j incorrectly gets the old value. This hazard is the most
common type and corresponds to a true data dependence.
Program order must be preserved to ensure that j receives
the value from i.

❑ WAW (write after write)—j tries to write an operand before it is
written by i. The writes end up being performed in the wrong
order, leaving the value written by i rather than the value
written by j in the destination. This hazard corresponds to an
output dependence. WAW hazards are present only in
pipelines that write in more than one pipe stage or allow an
instruction to proceed even when a previous instruction is stalled.
❑ WAR (write after read)—j tries to write a destination
before it is read by i, so i incorrectly gets the new
value. This hazard arises from an antidependence (or
name dependence). WAR hazards cannot occur in
most static issue pipelines—even deeper pipelines or
floating-point pipelines—because all reads are early
and all writes are late. A WAR hazard occurs either
when there are some instructions that write results
early in the instruction pipeline and other
instructions that read a source late in the pipeline, or
when instructions are reordered. A compact sketch of
the three hazard types follows this list.
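• A compact sketch of the three hazard types, again as C-style pseudocode with registers shown as variables (hypothetical register numbers):

R1 = R2 + R3;    /* i writes R1                                      */
R4 = R1 + R5;    /* j reads R1: RAW—j must not read before i writes  */

R4 = R1 + R5;    /* i reads R1                                       */
R1 = R6 + R7;    /* j writes R1: WAR—j must not write before i reads */

R1 = R2 + R3;    /* i writes R1                                      */
R1 = R6 + R7;    /* j writes R1: WAW—j's write must be the final one */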
Control Dependences
• The last type of dependence is a control dependence. A
control dependence determines the ordering of an
instruction, i, with respect to a branch instruction so that
instruction i is executed in correct program order and only
when it should be.
• Every instruction, except for those in the first basic block of
the program, is control dependent on some set of branches,
and, in general, these control dependences must be
preserved to preserve program order.
• One of the simplest examples of a control dependence is the
dependence of the statements in the “then”part of an if
statement on the branch.
• For example, in the code segment
if p1 {
    S1;
}
if p2 {
    S2;
}
• S1 is control dependent on p1, and S2 is control dependent
on p2 but not on p1.
• In general, two constraints are imposed by control
dependences:
1. An instruction that is control dependent on a
branch cannot be moved before the branch so that
its execution is no longer controlled by the branch.
For example, we cannot take an instruction from
the then portion of an if statement and move it
before the if statement.
2. An instruction that is not control dependent on a
branch cannot be moved after the branch so that its
execution is controlled by the branch. For example,
we cannot take a statement before the if statement
and move it into the then portion.
• The control dependence is not the critical property
that must be preserved. Instead, the two properties
critical to program correctness—and normally
preserved by maintaining both data and control
dependences—are the exception behavior and the
data flow.
• Preserving the exception behavior means that any
changes in the ordering of instruction execution must
not change how exceptions are raised in the
program.
• Often this is relaxed to mean that the reordering of
instruction execution must not cause any new
exceptions in the program.
• A simple example shows how maintaining the control
and data dependences can prevent such situations.
• Consider this code sequence:
DADDU R2,R3,R4
BEQZ R2,L1
LW R1,0(R2)
L1:
• In this case, it is easy to see that if we do not maintain
the data dependence involving R2, we can change the
result of the program. Less obvious is the fact that if we
ignore the control dependence and move the load
instruction before the branch, the load instruction may
cause a memory protection exception.
• The second property preserved by maintenance of data
dependences and control dependences is the data flow.
The data flow is the actual flow of data values among
instructions that produce results and those that consume
them. Branches make the data flow dynamic, since they
allow the source of data for a given instruction to come
from many points.
• Program order is what determines which predecessor
will actually deliver a data value to an instruction.
Program order is ensured by maintaining the control
dependences.
For example, consider the following code fragment:
DADDU R1,R2,R3
BEQZ R4,L
DSUBU R1,R5,R6
L: ...
OR R7,R1,R8
• In this example, the value of R1 used by the OR
instruction depends on whether the branch is taken or
not. Data dependence alone is not sufficient to
preserve correctness. The OR instruction is data
dependent on both the DADDU and DSUBU
instructions, but preserving that order alone is
insufficient for correct execution.
• Instead, when the instructions execute, the data flow
must be preserved: If the branch is not taken, then the
value of R1 computed by the DSUBU should be used by
the OR, and, if the branch is taken, the value of R1
computed by the DADDU should be used by the OR. By
preserving the control dependence of the OR on the branch,
we prevent an illegal change to the data flow.
For similar reasons, the DSUBU instruction cannot
be moved above the branch. Speculation, which
helps with the exception problem, will also allow us
to lessen the impact of the control dependence while
still maintaining the data flow.
What is pipelining?
• Pipelining is an implementation technique whereby
multiple instructions are overlapped in execution; it
takes advantage of parallelism that exists among the
actions needed to execute an instruction.
• pipelining is the key implementation technique used
to make fast CPUs.
• In a computer pipeline, each step in the pipeline
completes a part of an instruction.
• The different steps are completing different parts of
different instructions in parallel. Each of these steps
is called a pipe stage or a pipe segment.
• The stages are connected one to the next to form a
pipe—instructions enter at one end, progress
through the stages, and exit at the other end.
• The throughput of an instruction pipeline is
determined by how often an instruction exits the
pipeline. Because the pipe stages are hooked
together, all the stages must be ready to proceed at
the same time, just as we would require in an
assembly line.
• The time required between moving an instruction
one step down the pipeline is a processor cycle.
Because all stages proceed at the same time, the
length of a processor cycle is determined by the time
required for the slowest pipe stage, just as in an auto
assembly line the longest step would determine the
time between advancing the line. In a computer, this
processor cycle is usually 1 clock cycle.
• The time per instruction on the pipelined processor
—assuming ideal conditions—is equal to
Time per instruction on unpipelined machine / Number of pipe stages
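• For example (assumed numbers): if an unpipelined implementation takes 5 ns per instruction and the logic divides evenly into 5 pipe stages, the ideal pipelined time per instruction is 5 ns / 5 = 1 ns, a fivefold improvement in throughput. Stage imbalance and pipeline overhead make the achieved figure somewhat larger, as discussed later.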
• Pipelining yields a reduction in the average
execution time per instruction.
• Relative to an unpipelined baseline, the reduction can be
viewed as decreasing the number of clock cycles per
instruction (CPI), as decreasing the clock cycle time, or as
a combination of the two.
A Simple Implementation of a RISC Instruction Set
• Consider a simple implementation in which every instruction
in this RISC subset takes at most 5 clock cycles.
• The 5 clock cycles are as follows.
• 1. Instruction fetch cycle (IF): Send the program counter
(PC) to memory and fetch the current instruction from
memory. Update the PC to the next sequential PC by
adding 4 (since each instruction is 4 bytes) to the PC.
2. Instruction decode/register fetch cycle (ID):
Decode the instruction and read the registers
corresponding to register source specifiers from
the register file. Do the equality test on the
registers as they are read, for a possible branch.
Sign-extend the offset field of the instruction in
case it is needed. Compute the possible branch
target address by adding the sign-extended offset
to the incremented PC.
• Decoding is done in parallel with reading registers,
which is possible because the register specifiers are at
a fixed location in a RISC architecture. This technique
is known as fixed-field decoding.
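• A minimal C sketch of fixed-field decoding for a 32-bit MIPS-style instruction word (the field positions follow the standard MIPS encoding; the helper names are illustrative):

#include <stdint.h>

/* Because the register specifiers sit at fixed bit positions,
   the register file can be read in parallel with full decoding. */
static inline uint32_t opcode(uint32_t inst) { return (inst >> 26) & 0x3F; }
static inline uint32_t rs(uint32_t inst)     { return (inst >> 21) & 0x1F; }
static inline uint32_t rt(uint32_t inst)     { return (inst >> 16) & 0x1F; }
static inline int32_t  imm16(uint32_t inst)  { return (int16_t)(inst & 0xFFFF); } /* sign-extended offset */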
• 3. Execution/effective address cycle (EX): The ALU operates
on the operands prepared in the prior cycle, performing one
of three functions depending on the instruction type.
• ■ Memory reference—The ALU adds the base register and
the offset to form the effective address.
• ■ Register-Register ALU instruction—The ALU
performs the operation specified by the ALU
opcode on the values read from the register
file.
• ■ Register-Immediate ALU instruction—The
ALU performs the operation specified by the
ALU opcode on the first value read from the
register file and the sign-extended immediate.
• 4. Memory access (MEM): If the instruction is a load,
the memory does a read using the effective address
computed in the previous cycle. If it is a store, then
the memory writes the data from the second register
read from the register file using the effective address.
• 5. Write-back cycle (WB):
• ■ Register-Register ALU instruction or load
instruction:
• Write the result into the register file, whether it
comes from the memory system (for a load) or from
the ALU (for an ALU instruction).
• In this implementation, branch instructions require 2
cycles, store instructions require 4 cycles, and all
other instructions require 5 cycles.
• Assuming a branch frequency of 12% and a
store frequency of 10%, a typical instruction
distribution leads to an overall CPI of 4.54.
This implementation, however, is not optimal
either in achieving the best performance or in
using the minimal amount of hardware given
the performance level.
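• The 4.54 figure follows directly from the cycle counts and frequencies above:
CPI = 0.12 × 2 (branches) + 0.10 × 4 (stores) + 0.78 × 5 (all other instructions)
    = 0.24 + 0.40 + 3.90 = 4.54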
The Classic Five-Stage Pipeline for
a RISC Processor
• We can pipeline the execution described above with
almost no changes by simply starting a new
instruction on each clock cycle.
• Each of the clock cycles from the previous section
becomes a pipe stage—a cycle in the pipeline. This
results in the execution pattern shown in Figure C.1,
which is the typical way a pipeline structure is drawn.
Although each instruction takes 5 clock cycles to
complete, during each clock cycle the hardware will
initiate a new instruction and will be executing some
part of the five different instructions.
• To start with, we have to determine what happens on
every clock cycle of the processor and make sure we
don’t try to perform two different operations with
the same data path resource on the same clock cycle.
• Thus, we must ensure that the overlap of instructions
in the pipeline cannot cause such a conflict.
• Figure C.2 shows a simplified version of a RISC data
path drawn in pipeline fashion.
• There are three observations on which this conflict-free overlap rests.
• First, we use separate instruction and data
memories, which we would typically
implement with separate instruction and data
caches.
• The use of separate caches eliminates a
conflict for a single memory that would arise
between instruction fetch and data memory
access.
• Second, the register file is used in two stages:
for reading in ID and for writing in WB.
These uses are distinct, so we simply show the
register file in two places. Hence, we need to
perform two reads and one write every clock cycle.
• To handle reads and a write to the same register (and
for another reason, which will become obvious
shortly), we perform the register write in the first
half of the clock cycle and the read in the second
half.
• Third, Figure C.2 does not deal with the PC. To
start a new instruction every clock, we must
increment and store the PC every clock, and
this must be done during the IF stage in
preparation for the next instruction.
• Furthermore, we must also have an adder to
compute the potential branch target during ID.
One further problem is that a branch does not
change the PC until the ID stage.
• We must also ensure that instructions in different
stages of the pipeline do not interfere with one
another.
• This separation is done by introducing pipeline
registers between successive stages of the pipeline,
so that at the end of a clock cycle all the results from
a given stage are stored into a register that is used as
the input to the next stage on the next clock cycle.
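• A minimal C sketch of the pipeline-register idea, written as a simulation-style model (the struct and field names are hypothetical, not taken from the text):

#include <stdint.h>

/* Latches between successive stages: written at the end of one
   clock cycle, read by the following stage on the next cycle. */
struct IF_ID  { uint32_t inst;     uint32_t pc_plus4; };
struct ID_EX  { uint32_t a, b;     int32_t  imm;        uint32_t rd; };
struct EX_MEM { uint32_t alu_out;  uint32_t store_data; uint32_t rd; };
struct MEM_WB { uint32_t result;   uint32_t rd; };

/* In a cycle-accurate simulator, all four latches are updated
   together at the clock edge (e.g., by keeping current/next
   copies), mirroring the edge-triggered hardware registers that
   keep the stages from interfering with one another. */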
Basic Performance Issues in Pipelining
• Pipelining increases the CPU instruction throughput
—the number of instructions
completed per unit of time—but it does not reduce
the execution time of an individual instruction.
• In fact, it usually slightly increases the execution time
of each instruction due to overhead in the control of
the pipeline.
• The fact that the execution time of each instruction
does not decrease puts limits on the practical depth
of a pipeline.
• In addition to limitations arising from pipeline
latency, limits arise from imbalance among the pipe
stages and from pipelining overhead. Imbalance
among the pipe stages reduces performance since
the clock can run no faster than the time needed for
the slowest pipeline stage.
• Pipeline overhead arises from the combination of pipeline
register delay and clock skew.
• The pipeline registers add setup time—the time that a
register input must be stable before the clock edge that
triggers a write occurs—plus propagation delay to the
clock cycle.
• Clock skew, which is the maximum delay between when the clock
arrives at any two registers, also contributes to the lower limit
on the clock cycle.
• Once the clock cycle is as small as the sum of the clock skew
and latch overhead, no further pipelining is useful, since there
is no time left in the cycle for useful work.
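• For example (assumed numbers): with 0.1 ns of latch overhead and 0.05 ns of clock skew, a 10 ns unpipelined datapath split into 10 balanced stages yields a cycle of 10/10 + 0.15 = 1.15 ns, a speedup of 10/1.15 ≈ 8.7 rather than 10. Splitting into 20 stages yields 0.5 + 0.15 = 0.65 ns and a speedup of 10/0.65 ≈ 15.4 out of an ideal 20; as the stages shrink, the fixed 0.15 ns consumes a growing fraction of every cycle.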
