
Unit-5 Pipeline and Multiprocessors

5.1 Parallel Processing


Parallel processing is a method of computing in which separate parts of an overall
complex task are broken up and run simultaneously on multiple CPUs, thereby
reducing the overall processing time.
Any system that has more than one CPU can perform parallel processing, as can the
multi-core processors commonly found in computers today. Multi-core processors
are IC chips that contain two or more processors, giving better performance,
reduced power consumption and more efficient processing of multiple tasks.
 Pipelining
Pipelining is the process of feeding instructions to the processor through a
pipeline. It allows instructions to be stored and executed in an orderly fashion,
and is also known as pipeline processing.
Pipelining is a technique in which the execution of multiple instructions is
overlapped. The pipeline is divided into stages, and these stages are connected
with one another to form a pipe-like structure. Instructions enter at one end and
exit at the other.

Fig: Pipelining System

In a pipeline system, each segment consists of an input register followed by a
combinational circuit. The register holds the data and the combinational circuit
performs operations on it. The output of the combinational circuit is applied to
the input register of the next segment.
There are two types of pipeline: the arithmetic pipeline and the instruction pipeline.
 Arithmetic Pipeline
An arithmetic pipeline divides an arithmetic problem into sub-problems that are
executed in different pipeline segments. It is used for floating-point operations,
multiplication and various other computations. The flowchart of an arithmetic
pipeline for floating-point addition is shown in the diagram.

Floating-point addition using the arithmetic pipeline (sub-operations)


1. Compare the exponents.
2. Align the mantissas.
3. Add or subtract the mantissas.
4. Normalize the result
First, the two exponents are compared and the larger of the two is chosen as the
exponent of the result. The difference between the exponents determines how many
places the mantissa of the number with the smaller exponent must be shifted to the
right. After this shift, the two mantissas are aligned. Finally, the mantissas are
added, and the result is normalized in the last segment.

Example: Consider the two numbers

X = 0.3214 * 10^3 and Y = 0.4500 * 10^2

Explanation: First, the two exponents are subtracted to give 3 - 2 = 1. Thus 3
becomes the exponent of the result, and the mantissa of the number with the
smaller exponent is shifted 1 place to the right to give

Y = 0.0450 * 10^3

Finally, the two numbers are added to produce

Z = 0.3664 * 10^3

As the result is already normalized, it remains unchanged.
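
The four sub-operations map naturally onto code. Below is a minimal Python sketch
(an illustration, not part of the original text) that runs the example above; the
(mantissa, exponent) pair representation and the name fp_add are assumptions of
this sketch.

def fp_add(x_man, x_exp, y_man, y_exp):
    # Segment 1: compare the exponents; the larger becomes the result exponent.
    diff = x_exp - y_exp
    exp = max(x_exp, y_exp)
    # Segment 2: align the mantissas by right-shifting the mantissa of the
    # number with the smaller exponent (one decimal place per shift).
    if diff > 0:
        y_man /= 10 ** diff
    else:
        x_man /= 10 ** (-diff)
    # Segment 3: add the aligned mantissas.
    man = x_man + y_man
    # Segment 4: normalize so the mantissa is a fraction less than 1.
    while abs(man) >= 1.0:
        man /= 10
        exp += 1
    return man, exp

print(fp_add(0.3214, 3, 0.4500, 2))  # approximately (0.3664, 3), i.e. Z = 0.3664 * 10^3

In a real arithmetic pipeline these four segments run concurrently on successive
operand pairs; the function above simply makes each segment's work explicit.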

 Instruction pipeline
Here a stream of instructions is executed by overlapping the fetch, decode and
execute phases of the instruction cycle. This technique is used to increase the
throughput of the computer system. An instruction pipeline reads instructions from
memory while previous instructions are being executed in other segments of the
pipeline, so multiple instructions can be in execution simultaneously. The pipeline
is more efficient if the instruction cycle is divided into segments of equal
duration.

In the most general case, the computer needs to process each instruction in the
following sequence of steps:

1) Fetch the instruction from memory (FI)
2) Decode the instruction (DA)
3) Calculate the effective address
4) Fetch the operands from memory (FO)
5) Execute the instruction (EX)
6) Store the result in the proper place

The flowchart for instruction pipeline is shown below.


Example:
The instruction is fetched in the first clock cycle in segment 1. It is decoded in
the next clock cycle, then the operands are fetched, and finally the instruction is
executed. The fetch and decode phases overlap due to pipelining: while the first
instruction is being decoded, the next instruction is fetched by the pipeline.

The third instruction in this example is a branch instruction. While it is being
decoded, the 4th instruction is fetched simultaneously. But since it is a branch,
it may transfer control to some other instruction once it is decoded. The fourth
instruction is therefore held until the branch instruction has executed; then the
fourth instruction resumes and the remaining phases continue as usual.
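
The overlap and the branch stall can be visualized with a short simulation. The
following Python sketch is illustrative only; it assumes each phase takes one clock
cycle and, as a simplification, holds the next instruction back until the branch's
execute phase finishes.

PHASES = ["FI", "DA", "FO", "EX"]  # fetch, decode, fetch operand, execute

def timing(n_instr, branch_at=None):
    # Compute the cycle in which each instruction enters the pipeline.
    start, rows = 1, []
    for i in range(1, n_instr + 1):
        rows.append((i, start))
        # After a branch, the next instruction waits until the branch executes.
        start += len(PHASES) if i == branch_at else 1
    width = max(s for _, s in rows) + len(PHASES) - 1
    for i, s in rows:
        row = ["--"] * width
        for k, phase in enumerate(PHASES):
            row[s - 1 + k] = phase
        print(f"instr {i}: " + " ".join(row))

timing(5, branch_at=3)  # instruction 3 is a branch; instruction 4 is held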
5.2 Pipeline Example
 Four Segment Instruction Pipeline

Fig: Four Segment Instruction Pipeline


In an instruction pipeline, each step is executed in a particular segment, and
different segments may take different times to operate on the incoming
information. Moreover, two or more segments may require memory access at the
same time, causing one segment to wait until another is finished with the memory.

The organization of an instruction pipeline is more efficient if the instruction
cycle is divided into segments of equal duration. One of the most common examples
of this type of organization is the four-segment instruction pipeline.

A four-segment instruction pipeline combines two or more of the steps listed above
into a single segment. For instance, the decoding of the instruction can be
combined with the calculation of the effective address into one segment.

The following block diagram shows a typical example of a four-segment instruction
pipeline. The instruction cycle is completed in four segments.

 Segment 1: The instruction fetch segment can be implemented using a
first-in, first-out (FIFO) buffer.
 Segment 2: The instruction fetched from memory is decoded in the
second segment, and the effective address is calculated in a separate
arithmetic circuit.
 Segment 3: An operand from memory is fetched in the third segment.
 Segment 4: The instructions are finally executed in the last segment of
the pipeline organization.
 Data Dependency
A data dependency is a situation in which an instruction depends on the result of
a sequentially earlier instruction and cannot complete its execution until that
result is available. In high-performance processors that use pipelining or
superscalar techniques, a data dependency can stall the flow of the processor
pipeline or prevent the parallel issue of instructions in a superscalar processor.

Consider two instructions ik and ii of the same program, where ik precedes ii. If
ik and ii have a common register or memory operand, they are data-dependent on
each other, except when the common operand is used in both instructions as a
source operand.

An example is when ii uses the result of ik as a source operand. In sequential
execution, data dependencies do not cause any problem, because instructions are
executed strictly in the stated sequence.
Data dependency can appear either in ‘straight-line code’ between subsequent
instructions or in a loop between instructions belonging to subsequent iterations
of a loop as shown in the figure.

Here, ‘straight-line code’ means any code sequence, even the instructions of a
loop body, that does not involve instructions from subsequent loop iterations.
Straight-line code can include three different types of dependencies, known as RAW
(Read After Write), WAR (Write After Read), and WAW (Write After Write)
dependencies.
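
The three dependency types can be stated compactly in code. Here is a minimal
Python sketch (the names and the instruction format are assumptions of this
illustration) that classifies the dependency of a later instruction i2 on an
earlier instruction i1, given each instruction's destination and source registers:

def classify(i1_dest, i1_srcs, i2_dest, i2_srcs):
    deps = []
    if i1_dest in i2_srcs:
        deps.append("RAW")  # i2 reads a value that i1 writes
    if i2_dest in i1_srcs:
        deps.append("WAR")  # i2 overwrites a value that i1 reads
    if i1_dest == i2_dest:
        deps.append("WAW")  # both instructions write the same register
    return deps or ["none"]  # shared source operands alone are harmless

# i1: ADD R3 <- R1, R2   followed by   i2: SUB R4 <- R3, R1
print(classify("R3", ["R1", "R2"], "R4", ["R3", "R1"]))  # ['RAW']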

5.3 RISC Pipeline


RISC stands for Reduced Instruction Set Computer. It was introduced to execute
instructions at a rate approaching one instruction per clock cycle. The RISC
pipeline helps to simplify the design of the computer architecture.
 Three-Segment Instruction Pipeline
The three-segment instruction pipeline consists of the following segments:
 I: Instruction fetch
 A: ALU Operation
 E: Execute Instruction
Consider now the operation of the following four instructions:

1. LOAD: Load M[address1] to R1
2. LOAD: Load M[address2] to R2
3. ADD: Add R1 and R2; the sum goes to R3
4. STORE: Store R3 to M[address3]

The pipeline timing with a data conflict is:

Clock cycle      1   2   3   4   5   6
1. Load R1       I   A   E
2. Load R2           I   A   E
3. Add R1+R2             I   A   E
4. Store R3                  I   A   E

(The ADD needs R2 in cycle 4, but the second LOAD only delivers R2 in its E phase
during that same cycle; this is the data conflict.)

The pipeline timing with delayed load is:

Clock cycle      1   2   3   4   5   6   7
1. Load R1       I   A   E
2. Load R2           I   A   E
3. No-operation          I   A   E
4. Add R1+R2                 I   A   E
5. Store R3                      I   A   E

(The inserted no-op delays the ADD by one cycle, so R2 has already been loaded by
the time the ADD uses it.)

 Delayed Load
A similar sort of tactic, called the delayed load, can be used on LOAD
instructions. On a LOAD instruction, the register that is to be the target of the
load is locked by the processor. The processor then continues executing the
instruction stream until it reaches an instruction that requires that register, at
which point it idles until the load is complete. Alternatively, the compiler can
rearrange instructions so that useful work is done while the load is in the
pipeline.
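
What a compiler (or a hardware interlock) does for a delayed load can be sketched
as a rewriting pass. The tuple format (opcode, destination, sources) and the
function name below are assumptions of this illustration, not part of the original
text:

def insert_delayed_load_nops(program):
    out = []
    for op, dest, srcs in program:
        if out:
            prev_op, prev_dest, _ = out[-1]
            # A use of the register targeted by the preceding LOAD falls
            # into the load delay slot, so pad with a no-op.
            if prev_op == "LOAD" and prev_dest in srcs:
                out.append(("NOP", None, []))
        out.append((op, dest, srcs))
    return out

prog = [("LOAD", "R1", []), ("LOAD", "R2", []),
        ("ADD", "R3", ["R1", "R2"]), ("STORE", None, ["R3"])]
for ins in insert_delayed_load_nops(prog):
    print(ins)  # a NOP appears between the second LOAD and the ADD

This reproduces the delayed-load timing table shown earlier: the no-op between the
second LOAD and the ADD removes the data conflict.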

 Delayed Branch
When branches are processed naively by a pipeline, at least one cycle remains
unutilized after each taken branch. This is a consequence of the assembly-line
nature of pipelining. Instruction slots following branches are known as branch
delay slots. Delay slots can also appear following load instructions; these are
called load delay slots. Branch delay slots are wasted during traditional
execution. However, when delayed branching is employed, these slots can be at
least partly used.

In the example below, the add instruction that originally preceded the branch is
moved into the branch delay slot. With delayed branching, the processor executes
the add instruction first, but the branch takes effect only afterwards. Thus, in
this example, delayed branching preserves the intended execution sequence:

add r1, r2, r3;
b anywhere;
anywhere: sub

This example uses an unconditional branch. Conditional branches cause the same or
longer delays in a simple pipelined execution, because of the additional operation
of checking the branch condition.

Accordingly, the instruction placed in the delay slot is executed whether or not
the branch is taken. Branching to the target instruction (sub) is completed with
one pipeline cycle of delay, and this cycle is used to execute the instruction in
the delay slot (add). Thus delayed branching results in the following execution
sequence:

1. add
2. b
3. sub
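
Filling the delay slot is a compile-time rewrite. The following Python sketch is a
toy illustration; its independence check is deliberately naive, simply assuming
the instruction just before the branch is safe to move:

def fill_branch_delay_slot(program):
    out = list(program)
    for i, ins in enumerate(out):
        # Move the independent instruction before a branch into its delay slot.
        if ins.startswith("b ") and i > 0 and not out[i - 1].startswith("b "):
            out[i - 1], out[i] = out[i], out[i - 1]
    return out

print(fill_branch_delay_slot(["add r1, r2, r3", "b anywhere"]))
# ['b anywhere', 'add r1, r2, r3'] (add executes in the branch delay slot)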
5.4 Multiprocessors

A multiprocessor is a computer system in which two or more central processing
units (CPUs) share full access to a common RAM. The main objective of using a
multiprocessor is to boost the system’s execution speed; other objectives are
fault tolerance and application matching.

 Characteristics of Multiprocessors
A multiprocessor is a single computer that has multiple processors. The processors
in a multiprocessor system can communicate and cooperate at various levels in
solving a given problem. They communicate either by sending messages from one
processor to another or by sharing a common memory.


 Parallel Computing: This involves the simultaneous use of multiple
processors. These processors are built around a single architecture to
execute a common task. In general, the processors are identical, and they work
together in such a way that each user has the impression of being the only
user of the system. In reality, however, many users may be accessing the
system at a given time.
 Distributed Computing: This involves the use of a network of processors.
Each processor in the network can be considered a computer in its own right,
with the capability to solve a problem. These processors are heterogeneous,
and generally one task is allocated to a single processor.
 Supercomputing: This involves the use of the fastest machines to solve
big and computationally complex problems. In the past, supercomputing
machines were vector computers; at present, most supercomputers combine
vector and parallel processing.
 Pipelining: This is a method wherein a specific task is divided into several
subtasks that must be performed in a sequence. The functional units help in
performing each subtask. The units are attached serially and all the units work
simultaneously.
 Vector Computing: It involves the usage of vector processors, wherein
operations such as ‘multiplication’ are divided into many steps and are then
applied to a stream of operands (“vectors”).
 Systolic: This is similar to pipelining, but the units are not arranged in a
linear order. The steps are normally smaller and more numerous, and they are
performed in lockstep. Systolic arrays are most frequently applied in special-
purpose hardware such as image or signal processors.

 Interconnection Structures:
In a multiprocessor system, the processors must be able to share a set of main
memory modules and I/O devices. This sharing capability is provided through
interconnection structures. The commonly used interconnection structures are as
follows:

1) Time-Shared Common Bus

A common-bus multiprocessor system consists of a number of processors connected
through a common path to a memory unit. The figure shows a time-shared common bus
for five processors. Only one processor can communicate with the memory or with
another processor at any given time. Transfer operations are conducted by the
processor that is currently in control of the bus. Any other processor wishing to
initiate a transfer must first determine the availability status of the bus, and
only after the bus becomes available can the processor address the destination
unit to initiate the transfer.
A single common-bus system is restricted to one transfer at a time. This means that
when one processor is communicating with the memory, all other processors are
either busy with internal operations or must be idle waiting for the bus. As a
consequence, the overall transfer rate within the system is limited by the speed
of the single path. The processors in the system can be kept busy more often through
the implementation of two or more independent buses to permit multiple
simultaneous bus transfers. However, this increases the system cost and complexity.
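
The one-transfer-at-a-time property can be modeled as mutual exclusion on a shared
resource. The following Python toy is an analogy, not a hardware description: it
serializes five "processors" on one bus lock.

import threading

bus = threading.Lock()   # the single time-shared path
memory = {}

def transfer(cpu_id, addr, value):
    with bus:            # any other processor must wait here for the bus
        memory[addr] = value
        print(f"CPU{cpu_id} wrote {value} to {addr}")

threads = [threading.Thread(target=transfer, args=(i, f"A{i}", i * 10))
           for i in range(1, 6)]  # five processors, as above
for t in threads: t.start()
for t in threads: t.join()

However many threads run, only one transfer proceeds at a time, which is exactly
why the single bus limits the overall transfer rate.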

2) Multiport Memory
A multiport memory system employs separate buses between each memory module
and each CPU. This is shown in the figure below for four CPUs and four memory
modules (MMs). Each processor bus is connected to each memory module. A
processor bus consists of the address, data, and control lines required to
communicate with memory. The memory module is said to have four ports and each
port accommodates one of the buses. The module must have internal control logic
to determine which port will have access to memory at any given time. Memory
access conflicts are resolved by assigning fixed priorities to each memory port. The
priority for memory access associated with each processor may be established by the
physical port position that its bus occupies in each module. Thus CPU1 will have
priority over CPU2, CPU2 will have priority over CPU3, and CPU4 will have the
lowest priority. The advantage of the multiport memory organization is the high
transfer rate that can be achieved because of the multiple paths between processors
and memory. The disadvantage is that it requires expensive memory control logic
and a large number of cables and connectors. As a consequence, this interconnection
structure is usually appropriate for systems with a small number of processors.
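
Fixed-priority resolution is simple enough to state directly. Below is a minimal
Python sketch (list order standing in for physical port position) of how a module
grants one of several simultaneous requests:

PRIORITY = ["CPU1", "CPU2", "CPU3", "CPU4"]  # port position fixes priority

def grant(requests):
    # Grant the highest-priority CPU among those requesting this module.
    for cpu in PRIORITY:
        if cpu in requests:
            return cpu

print(grant({"CPU3", "CPU2"}))  # CPU2 wins; CPU3 waits for the next cycle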
3) Crossbar Switch
The crossbar switch organization consists of a number of cross points that are placed
at intersections between processor buses and memory module paths. Figure below
shows a crossbar switch interconnection between four CPUs and four memory
modules. The small square in each cross point is a switch that determines the path
from a processor to a memory module. Each switch point has control logic to set up
the transfer path between a processor and memory. It examines the address that is
placed in the bus to determine whether its particular module is being addressed. It
also resolves multiple requests for access to the same memory module on a
predetermined priority basis.

4) Multistage Switching Network

 The 2×2 crossbar switch is used in the multistage network. It has 2 inputs
(A & B) and 2 outputs (0 & 1). The control inputs CA & CB establish the
connection between the input and output terminals.

 The input is connected to output 0 if its control input is 0, and to
output 1 if its control input is 1. The switch can arbitrate between
conflicting requests: if both A & B require the same output terminal, only
one of them is connected and the other is blocked or rejected.
 We can construct a multistage network using 2×2 switches in order to control
the communication between a number of sources & destinations. Creating a
binary tree of crossbar switches provides the connections that link an
input to one of the 8 possible destinations.
Fig: 2×2 Crossbar Switch

Fig: 1-to-8-way switch built from 2×2 switches

 In the above diagram, PA & PB are two processors connected through
switches to 8 memory modules, numbered in binary from 000 (0) to 111 (7).
There are three levels of switches between a source and a destination, and
one bit of the destination number selects the output at each level: the 1st
bit determines the switch output in the 1st level, the 2nd bit in the 2nd
level & the 3rd bit in the 3rd level.
 Example: If the source is PB & the destination is memory module 011 (as in
the figure), a path is formed from PB to output 0 in the 1st level, output 1
in the 2nd level & output 1 in the 3rd level (this path is traced by the
sketch after this list).
 Usually, in a tightly coupled system the processor acts as the source and a
memory module acts as the destination. In a loosely coupled system,
processing units act as both source and destination.
 Many patterns can be made using 2×2 switches, such as the Omega network,
the Butterfly network, etc.
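
The level-by-level bit steering can be traced with a few lines of Python. This
sketch is illustrative (the function name and printed path format are assumptions
of this note); it reproduces the PB-to-011 example above:

def route(source, destination, levels=3):
    # Each destination bit sets one switch: 0 -> upper output, 1 -> lower output.
    path = [source]
    for level, bit in enumerate(f"{destination:0{levels}b}", start=1):
        path.append(f"level {level}: output {bit}")
    path.append(f"memory module {destination:0{levels}b}")
    return path

for step in route("PB", 0b011):
    print(step)
# PB -> output 0 (level 1) -> output 1 (level 2) -> output 1 (level 3) -> module 011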

5) Hypercube Interconnection
The hypercube (or binary n-cube) multiprocessor structure is a loosely coupled
system made up of N = 2^n processors interconnected in an n-dimensional binary
cube. Each processor forms a node of the cube. Therefore, it is customary to refer
to each node as containing a processor; in effect it contains not only a CPU but
also local memory and an I/O interface. Each processor has direct communication
paths to n neighbor processors, and these paths correspond to the edges of the
cube.

There are 2^n distinct n-bit binary addresses that can be assigned to the
processors. Each processor’s address differs from that of each of its n neighbors
in exactly one bit position.

 Hypercube structures for n = 1, 2 and 3:

 A one-cube structure has n = 1 and 2^1 = 2 nodes.
 It has two processors interconnected by a single path.
 A two-cube structure has n = 2 and 2^2 = 4 nodes.
 It has four nodes interconnected as a square.
 An n-cube structure has 2^n nodes with a processor residing in each
node.

Each node is assigned a binary address in such a manner that the addresses of two
neighbors differ in exactly one bit position. For example, in a three-cube
structure the three neighbors of the node with address 100 are 000, 110, and 101.
Each of these binary numbers differs from address 100 in one bit.
Routing messages through an n-cube structure may take from one to n links from a
source node to a destination node.

Example:
In a three-cube structure, node 000 may communicate with node 011 either via 010
(000 to 010 to 011) or via 001 (000 to 001 to 011). To communicate from node 000
to node 111, at least three links must be crossed. A routing procedure can be
designed by computing the exclusive-OR of the source node address with the
destination node address. The resulting binary value has 1 bits in the positions
corresponding to the axes on which the two nodes differ. The message is then
transmitted along any one of those axes.

For example, a message at node 010 going to node 001 produces an exclusive-OR
of the two addresses equal to 011 in a three-cube structure. The message can be
transmitted along the second axis to node 000 and then through the third axis to node
001.
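
The whole routing procedure fits in a few lines. Below is a minimal Python sketch
(the function name is an assumption; higher axes are resolved first so the output
matches the example in the text):

def hypercube_route(src, dst, n=3):
    path = [src]
    diff = src ^ dst  # 1 bits mark the axes on which the two nodes differ
    for axis in reversed(range(n)):
        if diff & (1 << axis):
            src ^= (1 << axis)  # cross one cube edge along this axis
            path.append(src)
    return [f"{node:0{n}b}" for node in path]

print(hypercube_route(0b010, 0b001))  # ['010', '000', '001'], as in the text

Resolving the 1 bits in any other order gives an equally short route, since each
differing axis must be crossed exactly once.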
