
BMC@DC

Unit 8: Pipelining

Concept of Pipelining

Parallel Processing:
Parallel processing is a method of simultaneously breaking up and running program tasks on multiple
microprocessors, thereby reducing processing time. Instead of processing each instruction sequentially
as in a conventional computer, a parallel processing system is able to perform concurrent data
processing to achieve faster execution time. For example, while an instruction is being executed in the
ALU, the next instruction can be read from memory. The system may have two or more ALUs and be
able to execute two or more instructions at the same time. Furthermore, the system may have two or
more processors operating concurrently.

The purpose of parallel processing is to speed up the computer processing capability and increase its
throughput, that is, the amount of processing that can be accomplished during a given interval of time.
The amount of hardware increases with parallel processing, and with it, the cost of the system increases.
However, technological developments have reduced hardware costs to the point where parallel
processing techniques are economically feasible.

Parallel processing is established by distributing the data among the multiple functional units. For
example, the arithmetic, logic, and shift operations can be separated into three units and the operands
diverted to each unit under the supervision of a control unit.

The figure below shows one possible way of separating the execution unit into eight functional units
operating in parallel. The operands in the registers are applied to one of the units depending on the
operation specified by the instruction associated with the operands. The operation performed in each
functional unit is indicated in each block of the diagram. The adder and integer multiplier perform the
arithmetic operations with integer numbers. The floating-point operations are separated into three
circuits operating in parallel. The logic, shift, and increment operations can be performed concurrently
on different data. All units are independent of each other, so one number can be shifted while another
number is being incremented. A multifunctional organization is usually associated with a complex
control unit to coordinate all the activities among the various components.

BIT 2nd Sem |Microprocessor and Computer Architecture [BIT151] | Unit 8| Page: 1

Figure: Processor with multiple functional units.

There are a variety of ways that parallel processing can be classified. It can be considered from the
internal organization of the processors, from the interconnection structure between processors, or from
the flow of information through the system.

Instruction stream: The sequence of instructions read from memory constitutes an instruction stream.

Data stream: The operations performed on the data in the processor constitutes a data stream.

Parallel processing may occur in the instruction stream, in the data stream, or in both.

Flynn's classification divides computers into four major groups as follows:

1. Single instruction stream, single data stream (SISD)
2. Single instruction stream, multiple data stream (SIMD)
3. Multiple instruction stream, single data stream (MISD)
4. Multiple instruction stream, multiple data stream (MIMD)
1. Single instruction stream, single data stream (SISD)

An SISD computing system is a uniprocessor machine capable of executing a single instruction operating on a single data stream. In SISD, machine instructions are processed sequentially, and computers adopting this model are popularly called sequential computers. Most conventional computers have SISD architecture. All the instructions and data to be processed have to be stored in primary memory. The speed of the processing element in the SISD model is limited by the rate at which the computer can transfer information internally. Representative SISD systems are the IBM PC and workstations.

2. Single instruction stream, multiple data stream (SIMD)

An SIMD system is a multiprocessor machine capable of executing the same instruction on all the CPUs but operating on different data streams. Machines based on an SIMD model are well suited to scientific computing, since such workloads involve many vector and matrix operations. The data elements of a vector can be divided into multiple sets (N sets for an N-PE system) so that the information can be distributed to all the processing elements (PEs), with each PE processing one data set.
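The set partitioning described above can be sketched in a few lines of Python. This is an illustrative model only; the function names and the elementwise add are assumptions, not features of any particular machine:

```python
# Hypothetical sketch: divide a vector into N sets so that each of N
# processing elements (PEs) can apply the same instruction to its own set.
def partition(data, n_pe):
    """Divide the data elements into n_pe roughly equal sets."""
    size = (len(data) + n_pe - 1) // n_pe  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def simd_add(a_sets, b_sets):
    """Every PE executes the same add instruction on its own data set."""
    return [[x + y for x, y in zip(a, b)] for a, b in zip(a_sets, b_sets)]

a = partition([1, 2, 3, 4, 5, 6, 7, 8], 4)          # 4 PEs -> 4 sets of 2
b = partition([10, 20, 30, 40, 50, 60, 70, 80], 4)
print(simd_add(a, b))  # [[11, 22], [33, 44], [55, 66], [77, 88]]
```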

3. Multiple instruction stream, single data stream (MISD)

An MISD computing system is a multiprocessor machine capable of executing different instructions on different PEs, with all of them operating on the same data set.

4. Multiple instruction stream, multiple data stream (MIMD)

An MIMD system is a multiprocessor machine capable of executing multiple instructions on multiple data sets. Each PE in the MIMD model has separate instruction and data streams; therefore, machines built using this model can handle any kind of application. Unlike SIMD and MISD machines, the PEs in MIMD machines work asynchronously.


Pipelining:
Pipelining is a technique of decomposing a sequential process into suboperations, with each suboperation being executed in a special dedicated segment that operates concurrently with all other segments. A pipeline can be visualized as a collection of processing segments through which binary information flows.

Each segment performs partial processing dictated by the way the task is partitioned. The result
obtained from the computation in each segment is transferred to the next segment in the
pipeline. The final result is obtained after the data have passed through all segments. The name
"pipeline" implies a flow of information analogous to an industrial assembly line. It is
characteristic of pipelines that several computations can be in progress in distinct segments at
the same time. The overlapping of computation is made possible by associating a register with
each segment in the pipeline. The registers provide isolation between each segment so that
each can operate on distinct data simultaneously.

Example of pipeline

Suppose that we want to perform the combined multiply and add operations with a stream of numbers.
Ai * Bi + Ci for i = 1, 2, 3, ..., 7

Each suboperation is to be implemented in a segment within a pipeline. Each segment has one or two registers and a combinational circuit, as shown in the figure below. R1 through R5 are registers that receive new data with every clock pulse. The multiplier and adder are combinational circuits. The suboperations performed in each segment of the pipeline are as follows:

Segment 1: R1 ← Ai, R2 ← Bi (input Ai and Bi)
Segment 2: R3 ← R1 * R2, R4 ← Ci (multiply and input Ci)
Segment 3: R5 ← R3 + R4 (add Ci to the product)

The five registers are loaded with new data on every clock pulse. The effect of each clock pulse is shown in the table below. The first clock pulse transfers A1 and B1 into R1 and R2. The second clock pulse transfers the product of R1 and R2 into R3 and C1 into R4. The same clock pulse transfers A2 and B2 into R1 and R2.

The third clock pulse operates on all three segments simultaneously. It places A3 and B3 into R1 and R2, transfers the product of R1 and R2 into R3, transfers C2 into R4, and places the sum of R3 and R4 into R5. It takes three clock pulses to fill up the pipe and retrieve the first output from R5. From there on, each clock produces a new output and moves the data one step down the pipeline. This happens as long as new input data flow into the system. When no more input data are available, the clock must continue until the last output emerges out of the pipeline.
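The clock-by-clock behavior described above can be checked with a small Python simulation. The register names follow the text; the sample data values are made up for illustration:

```python
# Simulation of the three-segment multiply-add pipeline: R1 and R2 receive
# Ai and Bi, segment 2 forms the product in R3 and loads Ci into R4, and
# segment 3 places the sum Ai*Bi + Ci in R5.
A = [1, 2, 3, 4, 5, 6, 7]
B = [2] * 7
C = [10] * 7

n = len(A)
R1 = R2 = R3 = R4 = R5 = None
outputs = []
for t in range(1, n + 3):                    # n inputs + 2 clocks to drain
    # All transfers occur on the same clock edge: compute every new
    # register value from the old ones, then commit them together.
    new_R5 = R3 + R4 if R3 is not None else None            # segment 3
    new_R3 = R1 * R2 if R1 is not None else None            # segment 2
    new_R4 = C[t - 2] if 2 <= t <= n + 1 else None          # segment 2
    new_R1 = A[t - 1] if t <= n else None                   # segment 1
    new_R2 = B[t - 1] if t <= n else None                   # segment 1
    R1, R2, R3, R4, R5 = new_R1, new_R2, new_R3, new_R4, new_R5
    if R5 is not None:
        outputs.append(R5)                   # first output on clock 3

print(outputs)  # [12, 14, 16, 18, 20, 22, 24], i.e. Ai*Bi + Ci for each i
```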

Time space diagram:


The behavior of a pipeline can be illustrated with a space-time diagram. This is a diagram that shows the
segment utilization as a function of time. The space-time diagram of a four-segment pipeline is
demonstrated in the figure below. The horizontal axis displays the time in clock cycles and the vertical
axis gives the segment number. The diagram shows six tasks T1 through T6 executed in four segments.
Initially, task T1 is handled by segment 1. After the first clock, segment 2 is busy with T1, while segment 1 is busy with task T2. Continuing in this manner, the first task T1 is completed after the fourth clock cycle.
From then on, the pipe completes a task every clock cycle. No matter how many segments there are in
the system, once the pipeline is full, it takes only one clock period to obtain an output.


Figure: Space time diagram for pipeline
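The diagram's arithmetic generalizes: a k-segment pipeline needs k clock cycles to complete its first task and one cycle for each task after that, so n tasks take k + (n - 1) cycles. The sketch below illustrates this (the function names are illustrative):

```python
def pipeline_cycles(k, n):
    """Clock cycles for n tasks in a k-segment pipeline: k cycles to fill
    the pipe, then one new result per clock for the remaining n-1 tasks."""
    return k + (n - 1)

def speedup(k, n):
    """Ratio of non-pipelined time (n*k cycles, assuming each task alone
    would need k cycles) to pipelined time; approaches k as n grows."""
    return (n * k) / pipeline_cycles(k, n)

print(pipeline_cycles(4, 6))    # 9 clock cycles, matching the diagram
print(round(speedup(4, 6), 2))  # 2.67, i.e. 24/9
```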

Arithmetic Pipeline
Arithmetic Pipelines are mostly used in high-speed computers. They are used to implement
floating-point operations, multiplication of fixed-point numbers, and similar computations
encountered in scientific problems.
To understand the concepts of arithmetic pipeline in a more convenient way, let us consider an
example of a pipeline unit for floating-point addition and subtraction.
The inputs to the floating-point adder pipeline are two normalized floating-point binary numbers defined as:

X = A * 2^a
Y = B * 2^b

where A and B are two fractions that represent the mantissas and a and b are the exponents. For simplicity, decimal numbers are used in the example:

X = 0.9504 * 10^3
Y = 0.8200 * 10^2
The combined operation of floating-point addition and subtraction is divided into four
segments. Each segment contains the corresponding suboperation to be performed in
the given pipeline. The suboperations that are shown in the four segments are:
1. Compare the exponents by subtraction.
2. Align the mantissas.
3. Add or subtract the mantissas.
4. Normalize the result.
The following block diagram represents the suboperations performed in each segment
of the pipeline.


1. Compare exponents by subtraction:
The exponents are compared by subtracting them to determine their difference. The
larger exponent is chosen as the exponent of the result.
The difference of the exponents, i.e., 3 - 2 = 1 determines how many times the mantissa
associated with the smaller exponent must be shifted to the right.
2. Align the mantissas:
The mantissa associated with the smaller exponent is shifted according to the difference
of exponents determined in segment one.
X = 0.9504 * 10^3
Y = 0.08200 * 10^3
3. Add mantissas:
The two mantissas are added in segment three.
Z = X + Y = 1.0324 * 10^3
4. Normalize the result:
After normalization, the result is written as:
Z = 0.10324 * 10^4
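The four suboperations can also be traced in code. Note that adding the aligned mantissas gives 1.0324 * 10^3, which normalizes to 0.10324 * 10^4. The sketch below is a simplified model working on decimal (mantissa, exponent) pairs as in the example; it is not a real binary floating-point adder:

```python
# Sketch of the four pipeline suboperations on decimal (mantissa, exponent)
# pairs with normalized mantissas in the range 0.1 <= m < 1.
def fp_add(a_man, a_exp, b_man, b_exp):
    # 1. Compare the exponents by subtraction; keep the larger one.
    diff = a_exp - b_exp
    exp = max(a_exp, b_exp)
    # 2. Align: shift the mantissa of the smaller exponent to the right.
    if diff > 0:
        b_man /= 10 ** diff
    elif diff < 0:
        a_man /= 10 ** (-diff)
    # 3. Add the mantissas.
    man = a_man + b_man
    # 4. Normalize: bring the mantissa back into [0.1, 1).
    while man >= 1:
        man /= 10
        exp += 1
    while 0 < man < 0.1:
        man *= 10
        exp -= 1
    return round(man, 5), exp

print(fp_add(0.9504, 3, 0.8200, 2))  # (0.10324, 4), i.e. 0.10324 * 10^4
```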

Instruction Pipeline
Pipeline processing can occur not only in the data stream but in the instruction stream as well. Most digital computers with complex instructions require an instruction pipeline to carry out operations such as fetching, decoding, and executing instructions.
In general, the computer needs to process each instruction with the following sequence
of steps.
1. Fetch instruction from memory.
2. Decode the instruction.
3. Calculate the effective address.
4. Fetch the operands from memory.
5. Execute the instruction.
6. Store the result in the proper place.
Each step is executed in a particular segment, and there are times when different
segments may take different times to operate on the incoming information. Moreover,
there are times when two or more segments may require memory access at the same
time, causing one segment to wait until another is finished with the memory.
The organization of an instruction pipeline will be more efficient if the instruction cycle
is divided into segments of equal duration. One of the most common examples of this
type of organization is a Four-segment instruction pipeline.

A four-segment instruction pipeline combines two or more of these steps into single segments. For instance, the decoding of the instruction can be combined with the calculation of the effective address into one segment.

The following block diagram shows a typical example of a four-segment instruction
pipeline. The instruction cycle is completed in four segments.

The figure above shows the operation of the instruction pipeline. The time axis is divided into steps of equal duration. The four segments are represented in the diagram with abbreviated symbols.
1. FI is the segment that fetches an instruction.

2. DA is the segment that decodes the instruction and calculates the effective address.
3. FO is the segment that fetches the operand.
4. EX is the segment that executes the instruction.

It is assumed that the processor has separate instruction and data memories so that the operations in FI and FO can proceed at the same time. In the absence of a branch instruction, each segment operates on a different instruction. Thus, in step 4, instruction 1 is being executed in segment EX; the operand for instruction 2 is being fetched in segment FO; instruction 3 is being decoded in segment DA; and instruction 4 is being fetched from memory in segment FI.
Assume now that instruction 3 is a branch instruction. As soon as this instruction is decoded in
segment DA in step 4, the transfer from FI to DA of the other instructions is halted until the
branch instruction is executed in step 6. If the branch is taken, a new instruction is fetched in
step 7. If the branch is not taken, the instruction fetched previously in step 4 can be used. The
pipeline then continues until a new branch instruction is encountered.

Fig: Timing of instruction pipeline
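The timing in the figure can be reproduced with a small table-building sketch. The segment names follow the text; the absence of branches and memory conflicts is assumed:

```python
# Which instruction occupies each segment at every step of a four-segment
# instruction pipeline, assuming no branches and separate instruction and
# data memories (so FI and FO never conflict).
SEGMENTS = ["FI", "DA", "FO", "EX"]

def pipeline_table(n_instr):
    steps = n_instr + len(SEGMENTS) - 1      # fill time + drain time
    table = []
    for step in range(1, steps + 1):
        row = {}
        for seg_idx, seg in enumerate(SEGMENTS):
            instr = step - seg_idx           # instruction i enters FI at step i
            if 1 <= instr <= n_instr:
                row[seg] = instr
        table.append(row)
    return table

for step, row in enumerate(pipeline_table(4), start=1):
    print(step, row)
# At step 4 the row is {'FI': 4, 'DA': 3, 'FO': 2, 'EX': 1}: instruction 1
# executes while 2, 3 and 4 occupy the earlier segments, as described above.
```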

In general, there are three major difficulties that cause the instruction pipeline to deviate from its
normal operation.
1. Resource conflicts caused by access to memory by two segments at the same time. Most of these
conflicts can be resolved by using separate instruction and data memories.
2. Data dependency conflicts arise when an instruction depends on the result of a previous instruction,
but this result is not yet available.
3. Branch difficulties arise from branch and other instructions that change the value of the program counter (PC).

Data Dependency
A difficulty that may cause a degradation of performance in an instruction pipeline is the possible collision of data or addresses. A collision occurs when an instruction cannot proceed because previous instructions did not complete certain operations. A data dependency occurs when an instruction needs data that are not yet available.

For example, an instruction in the FO segment may need to fetch an operand that is being generated at the same time by the previous instruction in segment EX. Therefore, the second instruction must wait until the first instruction makes the data available.

Similarly, an address dependency may occur when an operand address cannot be calculated because the information needed by the addressing mode is not available. For example, an instruction with register indirect mode cannot proceed to fetch the operand if the previous instruction is loading the address into the register. Therefore, the operand access to memory must be delayed until the required address is available. Pipelined computers deal with such data-dependency conflicts in a variety of ways.

Hardware interlocks: The most straightforward method is to insert hardware interlocks. An interlock is a circuit that detects instructions whose source operands are destinations of instructions farther up in the pipeline. Detection of this situation causes the instruction whose source is not available to be delayed by enough clock cycles to resolve the conflict. This approach maintains the program sequence by using hardware to insert the required delays.
Operand forwarding: Another technique, called operand forwarding, uses special hardware to detect a conflict and then avoid it by routing the data through special paths between pipeline segments. For example, instead of transferring an ALU result into a destination register, the hardware checks the destination operand, and if it is needed as a source in the next instruction, it passes the result directly into the ALU input, bypassing the register file. This method requires additional hardware paths through multiplexers as well as the circuit that detects the conflict.
Delayed load: A procedure employed in some computers is to give the responsibility for solving data conflicts to the compiler that translates the high-level programming language into a machine language program. The compiler for such computers is designed to detect a data conflict and reorder the instructions as necessary to delay the loading of the conflicting data by inserting no-operation instructions. This method is referred to as delayed load.
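A minimal sketch of such a compiler pass follows, assuming a toy instruction format of (opcode, destination, sources). All names here are hypothetical, not taken from any real instruction set:

```python
# Hypothetical delayed-load pass: insert a NOP after a load whose
# destination register is used by the very next instruction
# (a load-use conflict).
def insert_delayed_load_nops(program):
    out = []
    for i, (op, dest, srcs) in enumerate(program):
        out.append((op, dest, srcs))
        if op == "LOAD" and i + 1 < len(program):
            _, _, next_srcs = program[i + 1]
            if dest in next_srcs:
                out.append(("NOP", None, ()))  # delay the conflicting use
    return out

prog = [("LOAD", "R1", ("A",)),
        ("ADD",  "R2", ("R1", "R3"))]          # R1 is not ready yet
print(insert_delayed_load_nops(prog))          # NOP inserted between them
```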

Handling of Branch Instructions
One of the major problems in operating an instruction pipeline is the occurrence of branch
instructions. A branch instruction can be conditional or unconditional. An unconditional branch
always alters the sequential program flow by loading the program counter with the target
address. In a conditional branch, the control selects the target instruction if the condition is
satisfied or the next sequential instruction if the condition is not satisfied. As mentioned
previously, the branch instruction breaks the normal sequence of the instruction stream,
causing difficulties in the operation of the instruction pipeline. Pipelined computers employ
various hardware techniques to minimize the performance degradation caused by instruction
branching.

Prefetch target instruction: One way of handling a conditional branch is to prefetch the target instruction in addition to the instruction following the branch. Both are saved until the branch is executed. If the branch condition is successful, the pipeline continues from the branch target instruction. An extension of this procedure is to continue fetching instructions from both places until the branch decision is made; at that time, control chooses the instruction stream of the correct program flow.

Branch target buffer: Another possibility is the use of a branch target buffer (BTB), an associative memory included in the fetch segment of the pipeline. Each entry in the BTB consists of the address of a previously executed branch instruction and the target instruction for that branch. It also stores the next few instructions after the branch target instruction. When the pipeline decodes a branch instruction, it searches the BTB for the address of the instruction. If it is in the BTB, the target instruction is available directly and prefetch continues from the new path. If the instruction is not in the BTB, the pipeline shifts to a new instruction stream and stores the target instruction in the BTB. The advantage of this scheme is that branch instructions that have occurred previously are readily available in the pipeline without interruption.

Loop buffer: A variation of the BTB is the loop buffer, a small, very high speed register file maintained by the instruction fetch segment of the pipeline. When a program loop is detected, it is stored in the loop buffer in its entirety, including all branches. The program loop can then be executed directly without having to access memory until the loop mode is removed by the final branch out of the loop.

Branch prediction: Another procedure that some computers use is branch prediction. A pipeline with branch prediction uses additional logic to guess the outcome of a conditional branch instruction before it is executed. The pipeline then begins prefetching the instruction stream from the predicted path. A correct prediction eliminates the wasted time caused by branch penalties.

Delayed branch: A procedure employed in most RISC processors is the delayed branch. In this procedure, the compiler detects branch instructions and rearranges the machine language code sequence by inserting useful instructions that keep the pipeline operating without interruptions. An example of delayed branch is the insertion of a no-operation instruction after a branch instruction. This causes the computer to fetch the target instruction during the execution of the no-operation instruction, allowing a continuous flow of the pipeline.
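The simplest form of the technique, inserting a no-operation instruction into the slot after every branch, can be sketched as follows (the opcode names are made-up placeholders; a real compiler would try to move a useful instruction into the slot instead):

```python
# Hypothetical delayed-branch pass: fill the slot after each branch.
# Here a NOP is always inserted, which keeps the pipeline correct even
# when no useful instruction can be moved into the delay slot.
def fill_branch_delay_slots(program, branch_ops=("BR", "BEQ", "BNE")):
    out = []
    for instr in program:
        out.append(instr)
        if instr[0] in branch_ops:
            out.append(("NOP",))   # executed while the target is fetched
    return out

prog = [("ADD", "R1"), ("BEQ", "label"), ("SUB", "R2")]
print(fill_branch_delay_slots(prog))
# [('ADD', 'R1'), ('BEQ', 'label'), ('NOP',), ('SUB', 'R2')]
```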

- End of unit 8 -
