5. Pipeline and Multiprocessors
Y = 0.0450 × 10^3
Z = 0.3664 × 10^3
Since the mantissa of the result Z is already normalized, the result remains the same.
Instruction Pipeline
In an instruction pipeline, a stream of instructions is executed by overlapping the fetch, decode, and execute phases of the instruction cycle. This technique is used to increase the throughput of the computer system. An instruction pipeline fetches new instructions from memory while previous instructions are being executed in other segments of the pipeline, so multiple instructions are in execution simultaneously. The pipeline is most efficient when the instruction cycle is divided into segments of equal duration.
In the most general case, the computer needs to process each instruction with the following sequence of steps:
1. Fetch the instruction from memory.
2. Decode the instruction.
3. Calculate the effective address.
4. Fetch the operands from memory.
5. Execute the instruction.
6. Store the result in the proper place.
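To make the benefit of overlapping concrete, here is a minimal Python sketch (an illustration, not part of the original text) comparing the cycle counts of sequential and pipelined execution; the standard results are n*k cycles without pipelining and k + (n - 1) cycles with it, for n instructions and k equal-duration segments:

    def sequential_cycles(n_instructions: int, k_segments: int) -> int:
        # Each instruction must pass through all k segments before
        # the next one can start.
        return n_instructions * k_segments

    def pipelined_cycles(n_instructions: int, k_segments: int) -> int:
        # The first instruction takes k cycles to fill the pipeline;
        # after that, one instruction completes every cycle.
        return k_segments + (n_instructions - 1)

    if __name__ == "__main__":
        n, k = 100, 6   # 100 instructions, 6 steps as listed above
        print("sequential:", sequential_cycles(n, k))  # 600 cycles
        print("pipelined: ", pipelined_cycles(n, k))   # 105 cycles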
Consider two instructions i_k and i_l of the same program, where i_k precedes i_l. If i_k and i_l have a common register or memory operand, they are data-dependent on each other, except when the common operand is used in both instructions as a source operand.
Therefore, ‘straight-line code’ denotes any code sequence, even the instructions of a loop body, that does not involve instructions from subsequent loop iterations. Straight-line code can include three different types of dependencies, known as RAW (Read After Write), WAR (Write After Read), and WAW (Write After Write) dependencies.
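As a small illustration (the helper and register names below are ours, not from the text), the following Python sketch classifies the dependency between two instructions given their destination and source registers:

    def classify(first, second):
        # Each instruction is a pair (destination, set of sources).
        d1, s1 = first
        d2, s2 = second
        kinds = []
        if d1 in s2:
            kinds.append("RAW")  # second reads what the first wrote
        if d2 in s1:
            kinds.append("WAR")  # second overwrites what the first read
        if d1 == d2:
            kinds.append("WAW")  # both write the same register
        return kinds or ["none"]

    # ADD R3 <- R1 + R2 followed by SUB R4 <- R3 - R1: RAW on R3
    print(classify(("R3", {"R1", "R2"}), ("R4", {"R3", "R1"})))
    # ADD R3 <- R1 + R2 followed by MOV R1 <- R5: WAR on R1
    print(classify(("R3", {"R1", "R2"}), ("R1", {"R5"})))
    # ADD R3 <- R1 + R2 followed by MOV R3 <- R5: WAW on R3
    print(classify(("R3", {"R1", "R2"}), ("R3", {"R5"})))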
Consider the four-instruction sequence below, where I, A, and E denote the instruction fetch, ALU operation, and execute segments. The add instruction needs R2 in its A segment during clock cycle 4, but the second load does not transfer the operand into R2 until its E segment completes in that same cycle, so a data conflict occurs:

Clock cycle        1   2   3   4   5   6
1. Load R1         I   A   E
2. Load R2             I   A   E
3. Add R1+R2               I   A   E
4. Store R3                    I   A   E
Inserting a no-operation instruction after the second load removes the conflict at the cost of one extra clock cycle:

Clock cycle        1   2   3   4   5   6   7
1. Load R1         I   A   E
2. Load R2             I   A   E
3. No-operation            I   A   E
4. Add R1+R2                   I   A   E
5. Store R3                        I   A   E
Delayed Load
A similar tactic, called the delayed load, can be used on LOAD instructions. On a LOAD instruction, the register that is to be the target of the load is locked by the processor. The processor then continues executing the instruction stream until it reaches an instruction that requires that register, at which point it idles until the load is complete. If the compiler can rearrange instructions so that useful work is done while the load is in the pipeline, no cycles are wasted.
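As a sketch of what this compiler support might look like (the function and instruction encoding are illustrative assumptions, not the text's method), the following Python code inserts a no-op after a LOAD whenever the next instruction reads the register being loaded, reproducing the second timing table above:

    def insert_load_nops(program):
        # program: list of (opcode, destination, set of source registers)
        out = []
        for idx, (op, dest, srcs) in enumerate(program):
            out.append((op, dest, srcs))
            if op == "LOAD" and idx + 1 < len(program):
                next_srcs = program[idx + 1][2]
                if dest in next_srcs:   # next instruction needs the value
                    out.append(("NOP", None, set()))  # still being loaded
        return out

    prog = [
        ("LOAD",  "R1", set()),
        ("LOAD",  "R2", set()),
        ("ADD",   "R3", {"R1", "R2"}),
        ("STORE", None, {"R3"}),
    ]
    for ins in insert_load_nops(prog):
        print(ins)  # a NOP appears between the second LOAD and the ADD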
Delayed Branch
When branches are processed by a pipeline in the straightforward way, at least one cycle remains unused after each taken branch. This is a consequence of the assembly-line nature of pipelining. The instruction slots following branches are known as branch delay slots. Delay slots can also appear following load instructions; these are called load delay slots. In traditional execution, branch delay slots are wasted; with delayed branching, however, these slots can be at least partly utilized.
In the figure, the add instruction of our program segment, which originally preceded the branch, is moved into the branch delay slot. With delayed branching, the processor executes the add instruction first, but the branch takes effect only afterwards. Thus, in this example, delayed branching preserves the original execution sequence.
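A rough sketch of the idea (the instruction strings and function below are illustrative, not the text's notation): on a delayed-branch machine the instruction in the slot right after a branch always completes before the branch takes effect, so moving the add into the slot saves the cycle that a no-op would otherwise waste:

    before = ["ADD R1, R2", "BRANCH loop", "NOP"]  # slot wasted on a NOP
    after  = ["BRANCH loop", "ADD R1, R2"]         # slot does useful work

    def executed_order(program):
        # Order in which a delayed-branch pipeline completes instructions:
        # the delay-slot instruction finishes before the branch is taken.
        order = []
        i = 0
        while i < len(program):
            ins = program[i]
            if ins.startswith("BRANCH") and i + 1 < len(program):
                order.append(program[i + 1])  # delay-slot instruction first
                order.append(ins)             # then the branch takes effect
                i += 2
            else:
                order.append(ins)
                i += 1
        return order

    print(executed_order(before))  # ['ADD R1, R2', 'NOP', 'BRANCH loop']
    print(executed_order(after))   # ['ADD R1, R2', 'BRANCH loop']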
Characteristics of Multiprocessors
A multiprocessor is a single computer that contains multiple processors. The processors in a multiprocessor system can communicate and cooperate at various levels in solving a given problem. Communication between the processors takes place either by sending messages from one processor to another or by sharing a common memory.
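The two communication styles can be illustrated with Python's multiprocessing module standing in for the interprocessor hardware (a sketch for intuition only, not a description of real multiprocessor buses):

    from multiprocessing import Process, Queue, Value

    def via_message(q):
        q.put("result from worker")   # message passing between processes

    def via_shared(v):
        with v.get_lock():
            v.value += 42             # update through shared memory

    if __name__ == "__main__":
        q = Queue()
        v = Value("i", 0)             # shared integer, initially 0
        p1 = Process(target=via_message, args=(q,))
        p2 = Process(target=via_shared, args=(v,))
        p1.start(); p2.start()
        p1.join(); p2.join()
        print(q.get())   # 'result from worker'
        print(v.value)   # 42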
Interconnection Structure:
The processors must be able to share a set of main memory modules and I/O devices in a multiprocessor system. This sharing capability is provided through interconnection structures. The interconnection structures that are commonly used are the following:
1) Time-shared common bus
2) Multiport memory
3) Crossbar switch
4) Multistage switching network
5) Hypercube system
2) Multiport Memory
A multiport memory system employs separate buses between each memory module and each CPU. This is shown in the figure below for four CPUs and four memory
modules (MMs). Each processor bus is connected to each memory module. A
processor bus consists of the address, data, and control lines required to
communicate with memory. The memory module is said to have four ports and each
port accommodates one of the buses. The module must have internal control logic
to determine which port will have access to memory at any given time. Memory
access conflicts are resolved by assigning fixed priorities to each memory port. The
priority for memory access associated with each processor may be established by the
physical port position that its bus occupies in each module. Thus CPU1 will have
priority over CPU2, CPU2 will have priority over CPU3, and CPU4 will have the
lowest priority. The advantage of the multiport memory organization is the high
transfer rate that can be achieved because of the multiple paths between processors
and memory. The disadvantage is that it requires expensive memory control logic
and a large number of cables and connectors. As a consequence, this interconnection
structure is usually appropriate for systems with a small number of processors.
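A minimal sketch of the fixed-priority rule (the CPU names and the function are illustrative): at each memory module, the requesting CPU whose bus occupies the highest-priority port wins the cycle and the others wait:

    PRIORITY = ["CPU1", "CPU2", "CPU3", "CPU4"]  # fixed physical port order

    def grant(requests):
        # requests: set of CPUs addressing this memory module this cycle
        for cpu in PRIORITY:
            if cpu in requests:
                return cpu
        return None

    print(grant({"CPU3", "CPU2"}))  # CPU2 -- higher port priority wins
    print(grant({"CPU4"}))          # CPU4 -- granted when uncontested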
3) Crossbar Switch
The crossbar switch organization consists of a number of cross points that are placed
at intersections between processor buses and memory module paths. The figure below shows a crossbar switch interconnection between four CPUs and four memory
modules. The small square in each cross point is a switch that determines the path
from a processor to a memory module. Each switch point has control logic to set up
the transfer path between a processor and memory. It examines the address that is
placed in the bus to determine whether its particular module is being addressed. It
also resolves multiple requests for access to the same memory module on a
predetermined priority basis.
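The following Python sketch (a toy model under our own assumptions, not a hardware description) shows one cycle of a 4×4 crossbar in which each module's switch point grants its path to the highest-priority requesting CPU:

    def crossbar_cycle(requests):
        # requests: dict mapping cpu -> memory module it addresses.
        # Lower CPU index means higher predetermined priority here.
        granted = {}
        for cpu in sorted(requests):
            module = requests[cpu]
            if module not in granted:     # path to this module still free
                granted[module] = cpu
        return granted

    # CPU0 and CPU2 both address MM1; CPU0 wins, CPU2 is deferred.
    print(crossbar_cycle({"CPU0": "MM1", "CPU1": "MM3", "CPU2": "MM1"}))
    # {'MM1': 'CPU0', 'MM3': 'CPU1'}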
4) Multistage Switching Network
The basic component of a multistage network is the 2×2 crossbar switch, which has two inputs (A and B) and two outputs (0 and 1). Control inputs CA and CB are associated with the inputs to establish the connection between the input and output terminals. An input is connected to output 0 if its control input is 0, and to output 1 if its control input is 1. The switch can also arbitrate between conflicting requests: if inputs A and B both require the same output terminal, only one of them is connected and the other is blocked or rejected.
Using the 2×2 switch as a building block, we can construct a multistage network to control the communication between a number of sources and destinations. Arranging the switches in a binary tree provides the connections needed to route an input to any one of eight possible destinations.
Fig: 2×2 Crossbar Switch
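The routing rule through such a binary tree can be sketched in Python (an illustration; we assume the standard wiring in which each level of the tree is steered by one bit of the destination address):

    def route(dest, levels=3):
        # Output taken at each switch level for a 3-bit destination;
        # the most significant bit controls the first level.
        path = []
        for level in range(levels - 1, -1, -1):
            path.append((dest >> level) & 1)
        return path

    print(route(0b101))  # [1, 0, 1] -- switch outputs toward destination 5
    print(route(0b010))  # [0, 1, 0] -- toward destination 2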
5) Hypercube Interconnection
The hypercube (or binary n-cube) multiprocessor structure is a loosely coupled system composed of N = 2^n processors interconnected in an n-dimensional binary cube. Each processor forms a node of the cube. Although it is customary to refer to each node as having a processor, each node in effect contains not only a CPU but also local memory and an I/O interface. Each processor has direct communication paths to n neighbor processors; these paths correspond to the edges of the cube.
There are 2^n distinct n-bit binary addresses that can be assigned to the processors. Each node is assigned a binary address in such a way that the addresses of two neighbors differ in exactly one bit position. For example, in a three-cube structure the three neighbors of the node with address 100 are 000, 110, and 101; each of these addresses differs from 100 in exactly one bit position.
Routing messages through an n-cube structure may take from one to n links from a
source node to a destination node.
Example:
In a three-cube structure, node 000 can communicate with node 011 (from 000 to 010 to 011, or from 000 to 001 to 011). A message from node 000 to node 111 must traverse at least three links. A routing procedure can be developed by computing the exclusive-OR of the source node address with the destination node address. The resulting binary value has 1 bits in the positions corresponding to the axes on which the two nodes differ. The message is then transmitted along any one of these axes.
For example, in a three-cube structure, a message at node 010 going to node 001 produces an exclusive-OR of the two addresses equal to 011. The message can be transmitted along the second axis to node 000 and then through the third axis to node 001.
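This exclusive-OR procedure is easy to express in Python (a sketch; resolving the differing bits from the highest axis downward is one arbitrary choice among the valid paths):

    def hypercube_route(src, dst, n=3):
        # Sequence of node addresses visited from src to dst in an n-cube.
        path = [src]
        node = src
        diff = src ^ dst              # 1 bits mark the axes that differ
        for bit in range(n - 1, -1, -1):
            if diff & (1 << bit):
                node ^= (1 << bit)    # cross the cube edge along this axis
                path.append(node)
        return [format(p, "0%db" % n) for p in path]

    print(hypercube_route(0b010, 0b001))  # ['010', '000', '001']
    print(hypercube_route(0b000, 0b111))  # ['000', '100', '110', '111']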