Unit 3 - Advanced Computer Architecture
Types of pipeline:
Pipelines can be divided into two classes:
a) Static or Linear Pipelines: These pipelines can perform only one kind of operation (such as addition or multiplication) at a time.
The operation of a static pipeline can only be changed after the pipeline has been drained. (A pipeline is said
to be drained when the last input data leave the pipeline.) For example, consider a static pipeline that is able
to perform addition and multiplication. Each time that the pipeline switches from a multiplication operation to
an addition operation, it must be drained and set for the new operation. The performance of static pipelines is
severely degraded when the operations change often, since this requires the pipeline to be drained and
refilled each time. The output of the pipeline is produced by the last stage. For a diagram of a static (linear) pipeline, refer to Figure 3.1.
b) Dynamic or Nonlinear Pipelines: A dynamic pipeline can perform more than one operation at a time. To perform a particular operation on input data, the data must go through a certain sequence of
stages. For example, Figure 3.2 shows a three-stage dynamic pipeline that performs addition and
multiplication on different data at the same time. To perform multiplication, the input data must go through
stages 1, 2, and 3; to perform addition, the data only need to go through stages 1 and 3. Therefore, the first stage of the addition process can be performed on input data D1 at stage 1 while, at the same time, the last stage of the multiplication process is performed at stage 3 on different input data D2. Note that the time
interval between the initiation of the inputs D1 and D2 to the pipeline should be such that they do not reach
stage 3 at the same time; otherwise, there is a collision. In general, in dynamic pipelines the mechanism that
controls when data should be fed to the pipeline is much more complex than in static pipelines.
An instruction pipeline overlaps the processing of its stages for different instructions to achieve a
much lower total completion time, on average, for a series of instructions. As an example, consider Figure 3.4,
which shows the execution of four instructions in an instruction pipeline. During the first cycle, or clock pulse,
instruction i1 is fetched from memory. Within the second cycle, instruction i1 is decoded while instruction i2 is
fetched. This process continues until all the instructions are executed. The last instruction finishes the write-
back stage after the eighth clock cycle. Therefore, it takes 80 nanoseconds (ns) to complete execution of all four instructions, assuming a clock period of 10 ns.
The total completion time can also be obtained using equation (3.1), where m is the number of pipeline stages, n the number of instructions, and P the clock period:
Tpipe = m*P + (n-1)*P = 5*10 + (4-1)*10 = 80 ns.
Note that in a nonpipelined design the completion time will be much higher. Using equation (3.2),
Tseq = n*m*P = 4*5*10 = 200 ns.
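As a quick check of equations (3.1) and (3.2), here is a minimal Python sketch, using the example's values, that computes both completion times and the resulting speedup:

```python
# Pipelined vs. non-pipelined completion time for n instructions
# on m stages with clock period P (values from the example above).

def t_pipe(m, n, p):
    # First instruction takes m cycles; each of the remaining
    # n-1 instructions completes one cycle later.
    return m * p + (n - 1) * p

def t_seq(m, n, p):
    # Without pipelining, every instruction takes all m cycles.
    return n * m * p

m, n, p = 5, 4, 10                       # 5 stages, 4 instructions, 10 ns clock
print(t_pipe(m, n, p))                   # 80 ns
print(t_seq(m, n, p))                    # 200 ns
print(t_seq(m, n, p) / t_pipe(m, n, p))  # speedup = 2.5
```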
Even though pipelining speeds up the execution of instructions, it does pose potential problems. Some of
these problems and possible solutions are discussed next.
The Fetching Problem:
In general, supplying instructions rapidly through a pipeline is costly in terms of chip area.
Buffering the data to be sent to the pipeline is one simple way of improving the overall utilization of a pipeline.
The utilization of a pipeline is defined as the percentage of time that the stages of the pipeline are used over a
sufficiently long period of time. A pipeline is utilized 100% of the time when every stage is used (utilized)
during each clock cycle.
Occasionally, the pipeline has to be drained and refilled, for example, whenever an interrupt or a branch
occurs. The time spent refilling the pipeline can be minimized by loading instructions and data ahead of time into physically close buffers (such as on-chip caches) for immediate transfer into the pipeline. If
instructions and data for normal execution can be fetched before they are needed and stored in buffers, the
pipeline will have a continuous source of information with which to work. Prefetch algorithms are used to
make sure potentially needed instructions are available most of the time. Delays from memory access conflicts
can thereby be reduced if these algorithms are used, since the time required to transfer data from main
memory is far greater than the time required to transfer data from a buffer.
The Bottleneck Problem:
The bottleneck problem relates to the amount of load (work) assigned to a stage in the pipeline. If too much
work is applied to one stage, the time taken to complete an operation at that stage can become unacceptably
long. This relatively long time spent by the instruction at one stage will inevitably create a bottleneck in the
pipeline system. In such a system, it is better to remove the bottleneck that is the source of congestion. One
solution to this problem is to further subdivide the stage. Another solution is to build multiple copies of this
stage into the pipeline.
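As a rough illustration (the stage delays below are hypothetical), the following sketch shows how one slow stage fixes the cycle time and how subdividing or replicating it restores throughput:

```python
# The pipeline clock must accommodate the slowest stage, so one
# overloaded stage drags down throughput. Delays are in ns.

stages = [10, 10, 30, 10]      # stage 3 is the bottleneck
print(max(stages))             # cycle time = 30 ns

# Solution 1: subdivide the 30 ns stage into three 10 ns substages.
subdivided = [10, 10, 10, 10, 10, 10]
print(max(subdivided))         # cycle time = 10 ns

# Solution 2: replicate the slow stage three times. Each copy accepts
# a new operation every third cycle, so the aggregate rate is one
# result per 10 ns even though each copy still takes 30 ns.
copies = 3
print(30 / copies)             # effective initiation interval = 10 ns
```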
Pipelining hazards:
If an instruction is available, but cannot be executed for some reason, a hazard exists for that instruction.
These hazards create issuing problems; they prevent issuing an instruction for execution. Three types of
hazard are discussed here. They are called structural hazard, data hazard, and control hazard. A structural
hazard refers to a situation in which a required resource is not available (or is busy) for executing an
instruction. A data hazard refers to a situation in which there exists a data dependency (operand conflict) with
a prior instruction. A control hazard refers to a situation in which an instruction, such as branch, causes a
change in the program flow. Each of these hazards is explained next.
Structural Hazards:
A structural hazard occurs as a result of resource conflicts between instructions. One type of structural hazard
that may occur is due to the design of execution units. If an execution unit that requires more than one clock
cycle (such as multiply) is not fully pipelined or is not replicated, then a sequence of instructions that uses the
unit cannot be issued consecutively (one per clock cycle) for execution. Replicating and/or pipelining the execution unit increases the number of instructions that can be issued simultaneously.
Another type of structural hazard that may occur is due to the design of register files. If a register file does not
have multiple write (read) ports, multiple writes (reads) to (from) registers cannot be performed
simultaneously.
For example, under certain situations the instruction pipeline might want to perform two register writes in a
clock cycle. This may not be possible when the register file has only one write port. The effect of a structural
hazard can be reduced fairly simply by implementing multiple execution units and using register files with
multiple input/output ports.
Data Hazards: In a non-pipelined processor, the instructions are executed one by one, and the execution of an
instruction is completed before the next instruction is started. In this way, the instructions are executed in the
same order as the program. However, this may not be true in a pipelined processor, where instruction
executions are overlapped. An instruction may be started and completed before the previous instruction is
completed.
The data hazard, which is also referred to as the data dependency problem, comes about as a result of
overlapping (or changing the order of) the execution of data-dependent instructions. For example, in Figure
3.5 instruction i2 has a data dependency on i1 because it uses the result of i1 (i.e., the contents of register R2)
as input data. If the instructions were sent to a pipeline in the normal manner, i2 would be in the OF stage
before i1 passed through the WB stage.
This would result in using the old contents of R2 for computing a new value for R5, leading to an invalid result.
To have a valid result, i2 must not enter the OF stage until i1 has passed through the WB stage. In this way, as
is shown in Figure 3.6, the execution of i2 will be delayed for two clock cycles. In other words, instruction i2 is
said to be stalled for two clock cycles. Often, when an instruction is stalled, the instructions that are
positioned after the stalled instruction will also be stalled. However, the instructions before the stalled
instruction can continue execution. The delaying of execution can be accomplished in two ways. One way is to
delay the OF or IF stages of i2 for two clock cycles.
To insert a delay, an extra hardware component called a pipeline interlock can be added to the pipeline. A
pipeline interlock detects the dependency and delays the dependent instructions until the conflict is resolved.
Another way is to let the compiler solve the dependency problem.
During compilation, the compiler detects data dependencies between instructions. It then rearranges the instructions so that the dependency does not create a hazard. If it is not possible to rearrange the instructions, NOP (no operation) instructions are inserted to create the necessary delays.
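As a rough sketch of this compiler approach, the following Python fragment inserts NOPs whenever an instruction reads a register written by one of the previous two instructions; the two-cycle delay and the tuple encoding of instructions are assumptions for illustration only:

```python
# Each instruction is modeled as (destination register, source registers).
NOP = ("nop", [])

def insert_nops(program, delay=2):
    scheduled = []
    for dest, sources in program:
        # While any of the last `delay` scheduled instructions writes
        # a register this one reads, pad with NOPs to create the delay.
        while any(prev[0] in sources for prev in scheduled[-delay:]):
            scheduled.append(NOP)
        scheduled.append((dest, sources))
    return scheduled

prog = [("R2", ["R3", "R4"]),   # i1: R2 = R3 + R4
        ("R5", ["R2", "R1"])]   # i2: R5 = R2 + R1 (RAW on R2)
for inst in insert_nops(prog):
    print(inst)                 # i1, NOP, NOP, i2
```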
RAW: This type of data hazard was discussed previously; it refers to the situation in which i2 reads a data
source before i1 writes to it. This may produce an invalid result since the read must be performed after the
write in order to obtain a valid result.
For example, in the sequence
i1: Add R2, R3, R4 -- R2 = R3 + R4
i2: Add R5, R2, R1 -- R5 = R2 + R1
an invalid result may be produced if i2 reads R2 before i1 writes to it.
WAR: This refers to the situation in which i2 writes to a location before i1 reads it.
For example, in the sequence
i1: Add R2, R3, R4 -- R2 = R3 + R4
i2: Add R4, R5, R6 -- R4 = R5 + R6
an invalid result may be produced if i2 writes to R4 before i1 reads it; that is, instruction i1 might use the wrong value of R4.
WAW: This refers to the situation in which i2 writes to a location before i1 writes to it.
For example, in the sequence
i1: Add R2, R3, R4 -- R2 = R3 + R4
i2: Add R2, R5, R6 -- R2 = R5 + R6
the value of R2 is recomputed by i2. An invalid result may be produced if the order of the writes is reversed, that is, if i1 writes to R2 after i2, leaving R2 with the stale value from i1.
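The three hazard types can be summarized as simple set tests on the registers each instruction reads and writes. The following sketch (the set-based encoding is an assumption for illustration) classifies the three example pairs above:

```python
# Classify the data hazards between a pair of instructions, where
# i1 is earlier in program order and i2 is later.

def hazards(i1_reads, i1_writes, i2_reads, i2_writes):
    found = []
    if i1_writes & i2_reads:
        found.append("RAW")   # i2 reads what i1 writes
    if i1_reads & i2_writes:
        found.append("WAR")   # i2 writes what i1 reads
    if i1_writes & i2_writes:
        found.append("WAW")   # both write the same location
    return found

# i1: Add R2, R3, R4  /  i2: Add R5, R2, R1  -> RAW on R2
print(hazards({"R3", "R4"}, {"R2"}, {"R2", "R1"}, {"R5"}))
# i1: Add R2, R3, R4  /  i2: Add R4, R5, R6  -> WAR on R4
print(hazards({"R3", "R4"}, {"R2"}, {"R5", "R6"}, {"R4"}))
# i1: Add R2, R3, R4  /  i2: Add R2, R5, R6  -> WAW on R2
print(hazards({"R3", "R4"}, {"R2"}, {"R5", "R6"}, {"R2"}))
```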
Dependencies can be checked either statically at compile time by the compiler or dynamically at run time by hardware. At compile time, it is not always possible to determine the actual memory addresses of load and store instructions in order to resolve a possible dependency between them. During run time, however, the actual memory addresses are known, and dependencies between instructions can therefore be determined by checking them dynamically. In general, dynamic dependency checking has the advantage of being able to determine dependencies that are either impossible or hard to detect at compile time.
Here we will discuss the techniques for dynamic dependency checking. Two of the most commonly used
techniques are called Tomasulo's method and the scoreboard method.
Tomasulo’s Algorithm
The Tomasulo algorithm was first implemented in the IBM 360/91 floating-point unit, which came out three years after the CDC 6600. The scheme was intended to address several issues:
a) A small number of floating-point registers: the 360/91 had only 4 double-precision registers.
b) Long memory latency: this was just prior to the introduction of caches as a standard part of the memory hierarchy.
c) The cost effectiveness of functional unit hardware: with multiple copies of the same functional unit, some units were often underutilized.
d) The performance penalties of name dependencies, which lead to WAW and WAR hazards.
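The last issue is commonly addressed by register renaming, the idea underlying Tomasulo's algorithm. The following minimal sketch (the instruction encoding and tag names are hypothetical, and real hardware uses reservation stations rather than an explicit table) shows how fresh destination tags remove WAR and WAW dependencies while preserving true RAW dependencies:

```python
# Rename architectural destination registers to fresh physical tags.
def rename(program):
    mapping = {}                              # arch register -> current tag
    fresh = (f"p{i}" for i in range(100))     # supply of fresh tags
    renamed = []
    for dest, sources in program:
        srcs = [mapping.get(s, s) for s in sources]  # read current tags
        tag = next(fresh)                            # fresh destination tag
        mapping[dest] = tag
        renamed.append((tag, srcs))
    return renamed

prog = [("R2", ["R3", "R4"]),   # i1: R2 = R3 + R4
        ("R5", ["R2", "R1"]),   # i2: R5 = R2 + R1 (true RAW on R2, kept)
        ("R2", ["R6", "R7"])]   # i3: R2 = R6 + R7 (WAW with i1, removed)
for inst in rename(prog):
    print(inst)   # ('p0', [...]), ('p1', ['p0', 'R1']), ('p2', [...])
```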
The Branch Penalty:
A control hazard arises when a branch changes the program flow, and its average cost can be estimated as follows. Let Pb denote the probability that a given instruction is a branch, Pt the probability that the branch target is chosen, and c the extra cycles paid when the target is chosen. Then
Tave = Pb * (average number of cycles per branch instruction) + (1 - Pb) * (average number of cycles per nonbranch instruction),
where the average number of cycles per branch instruction is Pt*(1+c) + (1-Pt)*1. Assuming one cycle per nonbranch instruction,
Tave = Pb*[Pt*(1+c) + (1-Pt)*(1)] + (1-Pb)*(1) = 1 + c*Pb*Pt.
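A quick numeric check of the formula, with hypothetical values Pb = 0.2, Pt = 0.6, and c = 2:

```python
# Average cycles per instruction in the presence of branches.
pb, pt, c = 0.2, 0.6, 2
t_ave = pb * (pt * (1 + c) + (1 - pt)) + (1 - pb) * 1
print(t_ave)                 # 1.24
print(1 + c * pb * pt)       # 1.24, same value from the simplified form
```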
Branch Prediction:
In this type of design, the outcome of a branch decision is predicted before the branch is actually executed.
There are two types of predictions:
a) Static Branch Prediction: In static prediction, a fixed decision for prefetching one of the two paths is made
before the program runs. For example, a simple technique would be to always assume that the branch is
taken. This technique simply loads the program counter with the target address when a branch is
encountered. Another such technique is to automatically choose one path (sequential or target) for some
branch types and another for the rest of the branch types. If the chosen path is wrong, the pipeline is
drained and instructions corresponding to the correct path are fetched; the penalty is paid.
b) Dynamic Branch Prediction: In dynamic prediction, during the execution of the program the processor
makes a decision based on the past information of the previously executed branches. For example, a simple
technique would be to record the history of the last two paths taken by each branch instruction. If the last
two executions of a branch instruction have chosen the same path, that path will be chosen for the current
execution of the branch instruction. If the two paths do not match, one of the paths will be chosen
randomly.
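A minimal sketch of the two-outcome history scheme just described (the table-based bookkeeping and the branch address value are assumptions for illustration):

```python
import random

history = {}   # branch address -> list of up to two most recent outcomes

def predict(addr):
    h = history.get(addr, [])
    if len(h) == 2 and h[0] == h[1]:
        return h[0]                          # last two agree: follow that path
    return random.choice(["taken", "not taken"])  # otherwise choose randomly

def update(addr, outcome):
    h = history.setdefault(addr, [])
    h.append(outcome)
    if len(h) > 2:
        h.pop(0)                             # keep only the last two outcomes

for outcome in ["taken", "taken", "taken", "not taken"]:
    print(predict(0x400), "->", outcome)     # prediction vs. actual outcome
    update(0x400, outcome)
```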
Delayed Branching: The delayed branching scheme eliminates or significantly reduces the effect of the branch
penalty.
In this type of design, a certain number of instructions after the branch instruction are fetched and executed regardless of which path is chosen for the branch. For example, a processor with a branch delay of k executes a path containing the next k sequential instructions and then either continues on the same path or starts a new path from a new target address.
As often as possible, the compiler tries to fill the next k instruction slots after the branch with instructions that are independent of the branch instruction. NOP (no operation) instructions are placed in any remaining empty slots. As an example, assume k = 2 and a code sequence in which instruction i1 is independent of a branch instruction i3. The compiler modifies the code by moving i1 into the first delay slot after i3 and inserting a NOP into the second slot. After rescheduling, i1 executes regardless of the branch outcome.
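A small sketch of this rescheduling step (the instruction encoding and the dependence flag are simplifying assumptions; a real compiler performs full dependence analysis):

```python
# Move branch-independent instructions into the k delay slots after
# the branch, padding any remaining slots with NOPs.

def fill_delay_slots(before_branch, branch, k=2):
    independent = [i for i in before_branch if not i["depends_on_branch"]]
    kept = [i for i in before_branch if i["depends_on_branch"]]
    slots = independent[:k]                      # moved after the branch
    slots += [{"op": "nop"}] * (k - len(slots))  # pad the remaining slots
    return kept + independent[k:] + [branch] + slots

code = [{"op": "i1", "depends_on_branch": False},
        {"op": "i2", "depends_on_branch": True}]
branch = {"op": "i3 (branch)"}
for inst in fill_delay_slots(code, branch):
    print(inst["op"])    # i2, i3 (branch), i1, nop
```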
Superscalar processor design:
A superscalar processor issues several instructions per clock cycle into multiple parallel pipelines. This requires being able to fetch instructions for that many pipes at once and makes inter-pipeline interlocking necessary. Typical programs expose only a limited amount of instruction-level parallelism, and it will be very difficult to exploit even a majority of this parallelism. Nonetheless, obtaining a factor of 4 to 6 boost in performance is quite significant, especially as processor speeds approach their limits.
Going beyond a single instruction stream and allowing multiple tasks (or threads) to operate at the same time can enable greater system throughput. Because threads are naturally independent of one another at the fine-grained level, we can select instructions from different streams to fill pipeline slots that would otherwise go vacant when issuing from a single thread. This, in turn, makes it useful to add more functional units. We shall explore these multithreaded architectures further later in the course.
Superpipeline processor design: The superpipeline approach achieves high performance by overlapping the
execution of multiple instructions on one instruction pipeline. A superpipeline processor often has an
instruction pipeline with more stages than a typical instruction pipeline design.
In other words, the execution process of an instruction is broken down into even finer steps. By increasing the
number of stages in the instruction pipeline, each stage has less work to do. This allows the pipeline clock rate
to increase (cycle time decreases), since the clock rate depends on the delay found in the slowest stage of the
pipeline. Superpipelining is based on dividing the stages of a pipeline into substages, thus increasing the number of instructions supported by the pipeline at a given moment. For example, if we divide each stage into two, the clock cycle period t is reduced to half, t/2; hence, at maximum capacity, the pipeline produces a result every t/2 seconds. For a given architecture and its corresponding instruction set there is an optimal number of pipeline stages; increasing the number of stages beyond this limit reduces overall performance. A solution that further improves speed is the superscalar architecture.
Given a pipeline stage time T, it may be possible to execute at a higher rate by starting operations at intervals
of T/n. This can be accomplished in two ways:
a) Further divide each of the pipeline stages into n substages.
b) Provide n pipelines that are overlapped.
The first approach requires faster logic and the ability to subdivide the stages into segments with uniform
latency. It may also require more complex inter-stage interlocking and stall-restart logic. The second approach
could be viewed in a sense as staggered superscalar operation, and has associated with it all of the same
requirements except that instructions and data can be fetched with a slight offset in time. In addition, inter-
pipeline interlocking is more difficult to manage because of the sub-clock period differences in timing between
the pipelines. Even so, staggered clock pipelines may be necessary with superscalar designs in the future, in
order to reduce peak power and corresponding power-supply induced noise. Alternatively, designs may be
forced to shift to a balanced mode of circuit operation in which logic transitions are balanced by reverse
transitions -- a technique used in the Cray supercomputers that resulted in the computer presenting a pure DC
load to the power supply, and greatly reduced noise in the system.
Inevitably, superpipelining is limited by the speed of logic, and the frequency of unpredictable branches. Stage
time cannot productively grow shorter than the interstage latch time, and so this is a limit for the number of
stages.
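A rough model of this limit (the values of T and L below are hypothetical): subdividing a stage of delay T into n substages gives a cycle time of T/n plus the fixed interstage latch time L, so throughput gains diminish as T/n approaches L.

```python
# Cycle time and throughput as a stage of delay T (ns) is divided
# into n substages, each followed by a latch of delay L (ns).
T, L = 10.0, 1.0
for n in [1, 2, 4, 8, 16]:
    cycle = T / n + L
    print(n, round(cycle, 2), round(1 / cycle, 3))  # substages, ns, results/ns
```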
The MIPS R4000 is sometimes called a superpipelined machine, although its 8 stages really only split the I-
fetch and D-fetch stages of the pipe and add a Tag Check stage. Nonetheless, the extra stages enable it to
operate with higher throughput. The UltraSPARC's 9-stage pipe definitely qualifies it as a superpipelined
machine, and in fact it is a Super-Super design because of its superscalar issue. The Pentium 4 splits the
pipeline into 20 stages to enable increased clock rate. The benefit of such extensive pipelining is really only
gained for very regular applications such as graphics. On more irregular applications, there is little
performance advantage.
Static Arithmetic Pipeline
Some functions of the arithmetic logic unit of a processor can be pipelined to maximize performance. An
arithmetic pipeline is used for implementing complex arithmetic functions like floating-point addition,
multiplication, and division. These functions can be decomposed into consecutive subfunctions. For example, the Figure presents a pipelined architecture for floating-point addition, which can be divided into three stages: mantissa alignment, mantissa addition, and result normalization.
In the first stage, the mantissas M1 and M2 are aligned based on the difference in the exponents E1 and E2. If
| E1 - E2 | = k > 0, then the mantissa with the smaller exponent is right shifted by k digit positions. In the
second stage, the mantissas are added (or subtracted). In the third stage, the result is normalized so that the
final mantissa has a nonzero digit immediately after the fraction point. When necessary, this normalization is done by shifting the result mantissa and adjusting the exponent accordingly.
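The three stages can be mirrored in software. The following sketch (using decimal, four-digit mantissas for readability; real hardware operates on binary fields) performs align, add, and normalize in sequence:

```python
# Add two floating-point numbers of the form 0.m * 10**e, where m is
# an integer mantissa of `digits` digits.

def fp_add(e1, m1, e2, m2, digits=4):
    # Stage 1: align - right-shift the mantissa with the smaller exponent.
    if e1 < e2:
        e1, m1, e2, m2 = e2, m2, e1, m1
    m2 //= 10 ** (e1 - e2)
    # Stage 2: add the aligned mantissas.
    e, m = e1, m1 + m2
    # Stage 3: normalize so the mantissa has exactly `digits` digits.
    while m >= 10 ** digits:
        m //= 10
        e += 1
    while 0 < m < 10 ** (digits - 1):
        m *= 10
        e -= 1
    return e, m

# 0.9504 * 10**3 + 0.8200 * 10**2 = 950.4 + 82.0 = 1032.4
print(fp_add(3, 9504, 2, 8200))   # (4, 1032), i.e. 0.1032 * 10**4
```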
Multifunctional arithmetic pipeline:
An arithmetic pipeline is similar to an assembly line in a factory. Data enters a stage of the pipeline, which performs some arithmetic operation on it. The results are then passed to the next stage, which performs its own operation, and so on until the final computation has been performed. Each stage performs only its specific function; it does not have to be capable of performing the task of any other stage. An individual stage might be an adder, a multiplier, or other hardware that performs some arithmetic function.
Variations on the arithmetic pipeline include:
a) Fixed arithmetic pipeline: This is not very useful; unless the exact function performed by the pipeline is required, the CPU cannot use it.
b) Configurable arithmetic pipeline: This is better suited for general use, as it has multiplexers at its inputs. The control unit of the CPU sets the select signals of the multiplexers to control the flow of data (i.e., the pipeline is configurable).
c) Vectored arithmetic unit: A CPU may include a vectored arithmetic unit, which contains multiple functional units (for addition, multiplication, shifting, division, etc.) that perform different arithmetic operations in parallel. It is used to implement floating-point operations, multiplication of fixed-point numbers, and similar computations encountered in scientific applications.
Although arithmetic pipelines can perform many iterations of the same operation in parallel, they
cannot perform different operations simultaneously.
The figure presents a pipelined architecture for multiplying two unsigned 4-bit numbers using carry-save adders. The first stage generates the partial products M1, M2, M3, and M4. The figure shows how M1 is generated; the rest of the partial products are generated in the same way. M1, M2, M3, and M4 are then added together through two stages of carry-save adders and a final stage of carry-lookahead addition.
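A behavioral sketch of this structure (pure software, standing in for the hardware stages; the final carry-lookahead adder is modeled by ordinary addition):

```python
# Multiply two unsigned 4-bit numbers by generating four partial
# products and reducing them with carry-save adders (CSA).

def csa(a, b, c):
    # A carry-save adder compresses three operands into a sum word
    # and a carry word without propagating carries.
    s = a ^ b ^ c
    carry = ((a & b) | (b & c) | (a & c)) << 1
    return s, carry

def multiply4(x, y):
    # Stage 1: partial products M1..M4 (Mi = x times bit i of y, shifted).
    m = [(x if (y >> i) & 1 else 0) << i for i in range(4)]
    # Stage 2: first CSA reduces M1, M2, M3 to two words.
    s1, c1 = csa(m[0], m[1], m[2])
    # Stage 3: second CSA folds in M4.
    s2, c2 = csa(s1, c1, m[3])
    # Final stage: carry-lookahead (here: ordinary) addition.
    return s2 + c2

print(multiply4(13, 11), 13 * 11)   # 143 143
```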