
ch.4 and 5

The document discusses interconnection bus architecture and various bus structures used in parallel processing systems, including time-shared common buses, multiport memory, and crossbar switches. It explains the advantages and disadvantages of each structure, highlighting their impact on system performance and communication efficiency. Additionally, it introduces array processors as a type of SIMD system, detailing their operation, organization, and applications in enhancing instruction processing speed.

Uploaded by

alitheengineer02

Parallel Processing

4th Year
Term2

Ch4. Interconnection Bus Architecture

Dr. Fatemah Al Assfor


Interconnection Bus Structures
Bus: a set of communication lines connecting two or more devices (or components).
• It is a shared transmission medium.
• A bus consists of multiple lines.
Each line is capable of transmitting a single binary value, "0" or "1".
A bus that connects major components (such as: CPU, Memory, I/O) is called System Bus (or
internal system bus).

[Figure: system bus with address, data, and control lines]

- The number of lines in the address bus (n bits) determines the size of the main memory (RAM): memory size = $2^n$ locations.
- The number of lines in the data bus determines how many bits can be stored in a main memory location.
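
As a quick check of the two relations above, here is a minimal Python sketch. The function names and the 16-line/8-line example are illustrative assumptions, not part of the slides.

```python
def addressable_locations(address_lines: int) -> int:
    """Number of distinct memory locations selectable with n address lines (2**n)."""
    return 2 ** address_lines


def memory_capacity_bits(address_lines: int, data_lines: int) -> int:
    """Total capacity in bits: 2**n locations, each data_lines bits wide."""
    return addressable_locations(address_lines) * data_lines


# Hypothetical example: 16 address lines and 8 data lines.
print(addressable_locations(16))      # 65536 locations
print(memory_capacity_bits(16, 8))    # 524288 bits
```

Adding one address line doubles the number of locations, while widening the data bus only changes how many bits each location holds.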
Interconnection Bus Structures (cont.)
Control signals (control bus): control lines used to control the operation of the system memory, I/O devices, instruction execution, the ALU, etc.
Some of the control signals are:
- Memory read (RD)
- Memory write (WR)
- I/O read
- I/O write
- Clock
- Reset
- Interrupt
- Request
Several physical (hardware) techniques are available for establishing an interconnection network. Some of these schemes are presented in this section:
1. Time-shared common bus
2. Multiport memory
3. Crossbar switch
4. Multistage switching network
5. Hypercube system
1) Time-Shared Common Bus
A common-bus multiprocessor system consists of several processors and I/O devices
connected through a common path (bus) to main memory unit.
- Example: A time-shared common bus for three processors and two I/O devices is shown in
Fig. 4.1.

Fig. 4.1: Time-shared common bus
- Any processor that wants to transfer information must first check the availability of the bus. When the bus is available, that processor addresses the destination unit to start the transfer, and the receiving unit responds to the control signals from that processor.
1) Time-Shared Common Bus (cont.)
Only one processor can communicate with the memory or another processor at any given time.
This means that all other processors or units are either busy with internal operations or must be
idle waiting for the bus.

Disadvantages
- Only one processor can communicate with the memory or another processor at any given
time.
- Consequently, the overall transfer rate (bandwidth) within the system is limited by the speed of the single path.
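
The serialization described above can be sketched in a few lines of Python. This is only an illustration: the round-robin granting order and the name `common_bus_timeline` are assumptions made for the sketch, since the slides do not specify an arbitration policy.

```python
from collections import deque


def common_bus_timeline(pending):
    """pending[p] is the number of transfers processor p still has to make.

    Returns the bus owner for each cycle. The single shared bus carries one
    transfer per cycle; round-robin granting is an assumption of this sketch.
    """
    queue = deque(p for p, count in enumerate(pending) if count > 0)
    remaining = list(pending)
    timeline = []
    while queue:
        p = queue.popleft()
        timeline.append(p)          # processor p owns the bus for this cycle
        remaining[p] -= 1
        if remaining[p] > 0:
            queue.append(p)         # p must wait for its next turn
    return timeline


timeline = common_bus_timeline([2, 1, 3])
print(timeline)        # [0, 1, 2, 0, 2, 2]
print(len(timeline))   # 6
```

Whatever the granting order, the number of cycles equals the total number of transfers, which is the bandwidth limit of the single path.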
1) Time-Shared Common Bus (cont.)
The performance of the system can be increased if two or more independent buses can be used to
transfer information. However, this increases the bus cost and complexity.
o Example: A more economical implementation is the dual bus structure shown in Fig. 4.2.
o Part of the local memory may be designed as a cache memory attached to the CPU.

- As shown in Fig. 4.2, each system bus controller links the local bus to the common system bus.
- A common shared memory is connected to the common system bus. This memory is shared by all CPUs. In this case, only one CPU can communicate with this shared memory at any given time.

Fig. 4.2: Time-shared common bus organization
2) Multiport Memory
A multiport memory system employs separate buses between each memory module and each CPU. This is shown in Fig. 4.3 for four CPUs and four memory modules.
- Each processor bus is connected to each memory module.
- A processor bus consists of the address, data, and control lines required to communicate with memory.

Fig. 4.3: Multiport memory organization
2) Multiport Memory (cont.)
- As shown in Fig. 4.3, each memory module has four ports, and each port accommodates one of the buses.
- Each memory module must have internal control logic to determine which port will have access to that memory at any given time.
- Memory access conflicts are resolved by assigning fixed priorities to each memory port. Thus, CPU 1 has priority over CPU 2, CPU 2 has priority over CPU 3, and CPU 4 has the lowest priority.
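
The fixed-priority rule above can be sketched as follows. The function name `grant_port` and the list-of-booleans request format are illustrative assumptions; the real control logic is hardware inside each memory module.

```python
def grant_port(requests):
    """Fixed-priority arbitration for one memory module.

    requests[i] is True when CPU i+1 is requesting this module.
    The lowest-numbered requesting CPU wins. Returns the 1-based CPU
    number that is granted access, or None if nobody is requesting.
    """
    for cpu_index, requesting in enumerate(requests):
        if requesting:
            return cpu_index + 1
    return None


print(grant_port([False, True, False, True]))     # 2
print(grant_port([False, False, False, False]))   # None
```

In the first call CPU 2 and CPU 4 conflict, and CPU 2 is served because it has the higher fixed priority.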
Advantage
The advantage of multi-port memory organization is the high transfer rate (bandwidth) that
can be achieved because of the multiple paths between processors and memory.

Disadvantage
The disadvantage is that it requires expensive memory control logic and many cables and
connectors. Consequently, this interconnection structure is usually appropriate for systems
with a small number of processors.
3) Crossbar Switch (also called switching Network)
- Consists of a set of cross-points that are placed at intersections between processor buses and
memory module paths.
- A crossbar can be defined as a switching network with N inputs and M outputs, which
allows up to min{N, M} one-to-one interconnections without contention.
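
To make the "one-to-one interconnections without contention" condition concrete, here is a minimal Python sketch. The name `contention_free` and the (input, output) pair representation are assumptions made for illustration.

```python
def contention_free(connections, n_inputs, m_outputs):
    """Check whether a set of requested crossbar connections can be made at once.

    connections is a list of (input, output) pairs with 0-based indices.
    The set is contention-free when every input and every output appears
    at most once, i.e. the connections are one-to-one.
    """
    inputs = [i for i, _ in connections]
    outputs = [o for _, o in connections]
    in_range = (all(0 <= i < n_inputs for i in inputs)
                and all(0 <= o < m_outputs for o in outputs))
    one_to_one = (len(set(inputs)) == len(inputs)
                  and len(set(outputs)) == len(outputs))
    return in_range and one_to_one


print(contention_free([(0, 2), (1, 0), (2, 1)], 3, 3))   # True
print(contention_free([(0, 1), (2, 1)], 3, 3))           # False
```

Because no input or output may repeat, a contention-free set can never contain more than min{N, M} connections.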
Types of Crossbar Switch:
a) Uni-directional crossbar
b) Bidirectional crossbar

Advantages: The major advantages of the crossbar switch are:
o Supports simultaneous transfers from all memory modules.
o High speed: in one clock cycle, a connection can be made between source and destination.

Disadvantage:
o The hardware required to implement the switch can become quite large and complex.
Fig. (4.4) shows the functional design of a crossbar switch connected to one memory
module.
3) Crossbar Switch (cont.)
- The small square at each crosspoint in Fig. 4.4 is a switch that determines the path from a processor to a memory module.
- It allows any processor in the system to connect to any other processor or memory unit, so that many processors can communicate simultaneously without contention.
- Most common applications are:
• Designing high-performance small-scale multiprocessors.
• Designing routers for direct networks.

Fig. 4.4: Crossbar switch interconnection
3) Crossbar Switch (cont.)
a) Uni-directional Crossbar Switch
- Each switch (crosspoint) consists of a 2-input AND gate and a 2-input OR gate (direction: PE → Mem).
- To construct an (N×N) crossbar switch network between N processors and N memory modules, one must use control signals or enable signals. The signal cij enables the switch in the ith row and jth column.
- cij is the control signal that determines which crosspoint gets "activated".

Fig. 4.5: Uni-directional crossbar switch (processors P1 ... Pn on the rows, memory modules M1 ... Mn on the columns, crosspoints C11 ... Cnn)
3) Crossbar Switch (cont.)
a) Uni-directional Crossbar Switch (cont.)
Notes: For an (n×m) crossbar switch network:
- The last row is free of OR gates.
- Total No. of AND gates = n*m
- Total No. of OR gates = (n-1)*m

H.W: Design a (4×3) crossbar network. Estimate the total number of AND and OR gates needed.
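
The gate counts for the uni-directional switch can be checked with a small Python sketch that models each memory input as the OR of the enabled processor outputs. The function name, the bit-level data model, and the 3×2 example are assumptions made for illustration; they are not the author's design.

```python
def route_unidirectional(p_bits, c):
    """Model an n x m uni-directional crossbar (PE -> Mem) at the gate level.

    p_bits[i] is the bit driven by processor i; c[i][j] is the control signal
    of crosspoint (i, j). Each memory input j is the OR over all rows of
    (p_bits[i] AND c[i][j]). Returns (memory bits, AND gates used, OR gates used).
    """
    n, m = len(p_bits), len(c[0])
    and_gates = or_gates = 0
    mem_bits = []
    for j in range(m):
        acc = None
        for i in range(n):
            term = p_bits[i] & c[i][j]   # one 2-input AND per crosspoint
            and_gates += 1
            if acc is None:
                acc = term               # one row per column needs no OR gate
            else:
                acc = acc | term         # one 2-input OR for every other row
                or_gates += 1
        mem_bits.append(acc)
    return mem_bits, and_gates, or_gates


# Hypothetical 3 x 2 example: P1 -> M2 and P2 -> M1 are enabled.
print(route_unidirectional([1, 0, 1], [[0, 1], [1, 0], [0, 0]]))
# ([0, 1], 6, 4)
```

For n = 3 and m = 2 the sketch uses n*m = 6 AND gates and (n-1)*m = 4 OR gates, in agreement with the notes above.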


b) Bi-directional Crossbar Switch
- Each switch consists of two AND gates and two OR gates (PE → Mem and/or Mem → PE).
- To construct an (N×M) crossbar switch network between N processors and M memory modules, use the cij control signal to enable or activate the switch in the ith row and jth column.

Notes:
- Total No. of AND gates = 2(n*m)
- Total No. of OR gates = 2(n*m) - (m+n)

H.W: Design a (4×4) bidirectional crossbar network. Estimate the total number of AND and OR gates needed.

Fig. 4.6: Bi-directional crossbar switch
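
One way to read the bi-directional formulas is as the sum of two one-way networks, PE → Mem and Mem → PE. The Python sketch below follows that reading; the decomposition, the function name, and the 3×2 example are assumptions made for illustration.

```python
def bidirectional_gate_counts(n, m):
    """Gate counts for an n x m bi-directional crossbar (n processors, m memories).

    Assumed decomposition: the PE -> Mem direction ORs n terms per memory
    column, and the Mem -> PE direction ORs m terms per processor row.
    """
    and_gates = 2 * n * m               # one AND per crosspoint per direction
    or_pe_to_mem = (n - 1) * m          # n terms per column need n-1 OR gates
    or_mem_to_pe = (m - 1) * n          # m terms per row need m-1 OR gates
    or_gates = or_pe_to_mem + or_mem_to_pe
    assert or_gates == 2 * n * m - (m + n)   # same as the formula in the notes
    return and_gates, or_gates


print(bidirectional_gate_counts(3, 2))   # (12, 7)
```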
Chapter 5
SIMD Machines:
Array, Systolic Array, and Wavefront Systems
1) Array Processor
• An array processor is a type of SIMD system.
• An array processor is a single dedicated computer containing a set of identical processing elements (called Pi's) that operate in parallel, in a synchronous way, under the control of a master controller (MC).
• Each Pi has its own local memory Mi and includes an ALU and registers.

• All processors Pi execute the same instruction simultaneously (for vector processing), thus providing a single instruction stream with multiple data streams (SIMD operation).

Master controller (MC):


- The master controller (MC) controls all the operations of the computer system, as well as the processing elements Pi.
- It also decodes the instructions and determines how each instruction is to be executed.
The MC consists of two parts:
a) MCU (Master Control Unit): the CPU of the master controller. It includes an ALU and a set of registers.

b) MCM (Master Control Memory): holds the instructions and common data.
Array Processor (cont.)

Fig. 5.1: Array processor internal organization
Array Processor Operation
- Each instruction in the program is executed sequentially under the supervision of the MCU.
- The MCU fetches the next instruction. Its execution takes place in one of the following ways:
a) Regular instructions: If the fetched instruction is a scalar or a branch instruction, it is executed by the MC itself.
b) Array or vector instructions: If the fetched instruction is a vector instruction, such as vector add or vector multiply, the MCU broadcasts the same instruction to each Pi of the processor array, allowing all Pi's to execute it simultaneously (assuming that the required data is already in each Pi's private memory).
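
The two execution paths above can be sketched in Python. The instruction names `scalar_add` and `vector_add`, the tuple format, and the function `run_program` are hypothetical and exist only for this illustration; a list comprehension stands in for the PEs working simultaneously.

```python
def run_program(program, pe_memory):
    """Toy model of MCU dispatch in an array processor.

    program is a list of (opcode, operand) tuples. pe_memory[i] is the value
    held in the local memory of processing element Pi.
    """
    mc_accumulator = 0
    for opcode, operand in program:
        if opcode == "scalar_add":
            # Regular instruction: executed by the master controller itself.
            mc_accumulator += operand
        elif opcode == "vector_add":
            # Vector instruction: the same operation is broadcast to every PE,
            # and each PE applies it to its own local data.
            pe_memory = [value + operand for value in pe_memory]
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return mc_accumulator, pe_memory


program = [("scalar_add", 5), ("vector_add", 10), ("scalar_add", 1)]
print(run_program(program, [1, 2, 3, 4]))   # (6, [11, 12, 13, 14])
```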
Array Processor Operation (cont.)
- The data used in the execution of an array instruction is routed into the local memories before the instruction is executed, in one of two ways:
a) All the data values can be transferred to the local memories from an external source via the system data bus,
or
b) The MCU can broadcast the data values to the local memories via the control bus.
Array Processor (cont.)
Notes:
- In an array processor, it may be necessary to disable some processing elements during a vector operation. This can be achieved by using a mask register M inside the MCU, with one bit mi for each processor Pi:
M-register: mn mn-1 ... m1 m0

- If mi = 1, Pi responds; otherwise, Pi is disabled.
Example: M-register = 0 1 1 0 ... 0 1 1 0, so only the four processors whose mask bits are 1 take part in the operation.

- Data is exchanged between the scratchpad registers and the local memories of the Pi's. This exchange takes place through a path provided by the Inter-Processor Communication Network (IPCN).
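
The masking rule in the notes above (mi = 1 enables Pi) can be sketched as follows. The function name `masked_vector_add` and the four-element example are assumptions made for illustration.

```python
def masked_vector_add(local_values, operand, mask):
    """Apply a broadcast 'add operand' only on the enabled processing elements.

    local_values[i] is the value held by Pi and mask[i] is its mask bit mi:
    1 means Pi responds to the instruction, 0 means Pi is disabled.
    """
    return [value + operand if bit == 1 else value
            for value, bit in zip(local_values, mask)]


print(masked_vector_add([10, 20, 30, 40], 5, [0, 1, 1, 0]))   # [10, 25, 35, 40]
```

Only the two enabled elements change; the disabled elements keep their old values.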
Example: Consider the following recurrence equation:
$z_i = z_{i-1} + a_i$ for $i = 0, \dots, 3$, with $z_{-1} = 0$
Using an array processor, calculate the result of the equation and draw the array processor graph that indicates how the recurrence equation is calculated. Determine the number of steps needed to complete the evaluation of the equation.

Solution
First, expand the recurrence equation $z_i = z_{i-1} + a_i$ for $i = 0, \dots, 3$, with $z_{-1} = 0$:
$z_0 = z_{-1} + a_0 = 0 + a_0 = a_0$ (i = 0)
$z_1 = z_0 + a_1 = a_0 + a_1$ (i = 1)
$z_2 = z_1 + a_2 = a_0 + a_1 + a_2$ (i = 2)
$z_3 = z_2 + a_3 = a_0 + a_1 + a_2 + a_3$ (i = 3)

- To evaluate the recurrence equation on an array processing system, we need four processing elements (4 PEs).
- We assume that each PE (or Pi) is initialized with the data $a_i$. The following graph shows how the values of $z_i$ are calculated.
                  P0         P1         P2             P3
Initialization    a0         a1         a2             a3
Step 1 (mask)     Disable    Enable     Enable         Enable
Step 1 (values)   a0         a0+a1      a1+a2          a2+a3
Step 2 (mask)     Disable    Disable    Enable         Enable
Step 2 (values)   a0         a0+a1      a0+a1+a2       a0+a1+a2+a3

$z_0$ is available in P0 after initialization, $z_1$ in P1 after step 1, and $z_2$ and $z_3$ in P2 and P3 after step 2, so the evaluation is completed in two steps.


Notes:
- In general, for an array processor system with N processing elements (where N is a power of 2), it is possible to evaluate the N values $z_0, z_1, \dots, z_{N-1}$ in $\log_2 N$ steps.
- Also, we need to disable $2^{k-1}$ processing elements during step k.
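
The stepping and masking pattern of this example can be reproduced with a short Python sketch. The function name `prefix_sums` is an assumption, and the list comprehension stands in for the enabled PEs adding simultaneously; it is a sketch of the scheme in the graph, not a definitive implementation.

```python
def prefix_sums(a):
    """Evaluate z_i = z_{i-1} + a_i on a simulated array of len(a) PEs.

    In step k the stride is 2**(k-1): the first 2**(k-1) PEs are disabled and
    every enabled Pi adds the value currently held by P(i - stride).
    Returns (final values, number of steps). len(a) is assumed to be a power of 2.
    """
    n = len(a)
    values = list(a)          # initialization: Pi holds a_i
    stride, steps = 1, 0
    while stride < n:
        mask = [0 if i < stride else 1 for i in range(n)]
        previous = list(values)   # all PEs read before any PE writes
        values = [previous[i] + previous[i - stride] if mask[i] else previous[i]
                  for i in range(n)]
        stride *= 2
        steps += 1
    return values, steps


print(prefix_sums([1, 2, 3, 4]))   # ([1, 3, 6, 10], 2)
```

With a = [1, 2, 3, 4] the intermediate state after step 1 is [1, 3, 5, 7], matching the pattern a0, a0+a1, a1+a2, a2+a3 of the example.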
Usage of Array Processors
• Array processors enhance the total speed of instruction processing.
• Most array processors are designed to optimize performance for repetitive arithmetic operations, making them faster at vector arithmetic than the host CPU.
• Since most array processors run asynchronously from the host CPU, the system's overall capacity is improved.
• Array processors have their own local memory, providing additional memory to the system. This is an essential consideration for systems with limited physical memory or address space.
Applications:
Array processing is used in various areas, including:
• Radar systems
• Sonar systems
• Medical applications
• Speech enhancement
• Astronomy applications
• Seismic exploration
2) Systolic Arrays
- Systolic arrays are another kind of SIMD system.
- A systolic array comprises a set of simple processing elements (PEs) with regular, local connections, which take external inputs and process them in a predetermined, pipelined fashion.
- It is a synchronous network.

What are the functions of each cell in a Systolic System?

- A systolic array system consists of an array of PEs (processing elements) called cells; each cell is connected to a small number of nearest-neighbor PEs.
- Generally, the operations are the same in each cell.
- Each cell performs an operation, or a small number of operations, on a data item and then passes it to its neighbor.
Regular Interconnections of Systolic Arrays
What are typical structures of a Systolic Architecture?

1) Bidirectional two-dimensional network
2) Planar array: This configuration allows I/O only through its boundary cells.
3) Focal plane: This configuration allows I/O to each systolic cell.
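Example: Consider the following systolic array cell and provide a step-by-step block diagram approach for a (2×2) matrix multiplication Z = X*Y.

[Figure: systolic array cell. Before: the cell holds C, with X1, X2, ... waiting on one input and Y1, Y2, ... on the other. After: X1 has passed through the cell, and the cell outputs Y1 + X1·C.]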
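Solution: Matrix multiplication Z = X*Y

$$\begin{bmatrix} Z_{11} & Z_{12} \\ Z_{21} & Z_{22} \end{bmatrix} = \begin{bmatrix} X_{11} & X_{12} \\ X_{21} & X_{22} \end{bmatrix} \begin{bmatrix} Y_{11} & Y_{12} \\ Y_{21} & Y_{22} \end{bmatrix}$$

$Z_{11} = X_{11}Y_{11} + X_{12}Y_{21}$
$Z_{12} = X_{11}Y_{12} + X_{12}Y_{22}$
$Z_{21} = X_{21}Y_{11} + X_{22}Y_{21}$
$Z_{22} = X_{21}Y_{12} + X_{22}Y_{22}$

Initialization: cell (i, j) holds $X_{ij}$. The first column receives $Y_{11}$, then $Y_{12}$, from the top; the second column receives 0, then $Y_{21}$, then $Y_{22}$. Zeros enter each row from the left.

Clock 1:
- cell (1,1): $0 + X_{11}Y_{11}$
- cell (1,2): $0 + X_{12} \cdot 0$

Clock 2:
- cell (1,1): $0 + X_{11}Y_{12}$
- cell (1,2): $X_{11}Y_{11} + X_{12}Y_{21} = Z_{11}$
- cell (2,1): $0 + X_{21}Y_{11}$

Clock 3:
- cell (1,1): $0 + X_{11} \cdot 0$
- cell (1,2): $X_{11}Y_{12} + X_{12}Y_{22} = Z_{12}$
- cell (2,1): $0 + X_{21}Y_{12}$
- cell (2,2): $X_{21}Y_{11} + X_{22}Y_{21} = Z_{21}$

Clock 4:
- cell (2,2): $X_{21}Y_{12} + X_{22}Y_{22} = Z_{22}$
- the remaining cells process zeros ($0 + 0 \cdot 0$)
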
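
The clock-by-clock schedule of this solution can be simulated with the Python sketch below. It assumes the arrangement read from the figures: cell (i, j) keeps $X_{ij}$, the Y values move down one row per clock, and the partial sums move one cell to the right per clock. The function name and the clock count 3n - 2 are derived for this sketch and are not stated in the slides.

```python
def systolic_matmul(X, Y):
    """Simulate an n x n systolic array computing Z = X * Y (0-based indices).

    y_reg[i][j] is the Y value latched in cell (i, j) during the last clock;
    p_reg[i][j] is the partial sum that cell produced during the last clock.
    """
    n = len(X)
    y_reg = [[0] * n for _ in range(n)]
    p_reg = [[0] * n for _ in range(n)]
    Z = [[0] * n for _ in range(n)]
    clocks = 3 * n - 2                        # last result leaves at this clock
    for t in range(1, clocks + 1):
        new_y = [[0] * n for _ in range(n)]
        new_p = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == 0:
                    k = t - 1 - j             # column j is delayed by j clocks
                    y_in = Y[j][k] if 0 <= k < n else 0
                else:
                    y_in = y_reg[i - 1][j]    # Y value from the cell above
                p_in = p_reg[i][j - 1] if j > 0 else 0   # zero enters on the left
                new_y[i][j] = y_in
                new_p[i][j] = p_in + X[i][j] * y_in
        y_reg, p_reg = new_y, new_p
        for i in range(n):                    # results leave at the right edge
            k = t - n - i
            if 0 <= k < n:
                Z[i][k] = p_reg[i][n - 1]
    return Z, clocks


print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# ([[19, 22], [43, 50]], 4)
```

For the 2×2 case the sketch needs four clocks, and $Z_{11}$, then $Z_{12}$ and $Z_{21}$, then $Z_{22}$ appear at clocks 2, 3, and 4, as in the solution.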

Homework: Using focal plane systolic array architecture, provide a step- by- step block diagram
approach for a (3*3) matrix multiplication.
3) Wavefront Array Processor
- Wavefront arrays are another kind of SIMD system.
- A wavefront array is very similar to a systolic array, since it comprises a set of simple processing elements (PEs) with regular, local connections, which take external inputs and process them in a predetermined, pipelined fashion.
- However, it is an asynchronous network.

Example: Consider the following wavefront array cell and provide a step-by-step block diagram approach for a (2×2) matrix multiplication Z = X*Y.
[Figure: wavefront array cell with inputs A and B and an internal data value]

Solution: Matrix multiplication Z = X*Y

$$\begin{bmatrix} Z_{11} & Z_{12} \\ Z_{21} & Z_{22} \end{bmatrix} = \begin{bmatrix} X_{11} & X_{12} \\ X_{21} & X_{22} \end{bmatrix} \begin{bmatrix} Y_{11} & Y_{12} \\ Y_{21} & Y_{22} \end{bmatrix}$$

$Z_{11} = X_{11}Y_{11} + X_{12}Y_{21}$
$Z_{12} = X_{11}Y_{12} + X_{12}Y_{22}$
$Z_{21} = X_{21}Y_{11} + X_{22}Y_{21}$
$Z_{22} = X_{21}Y_{12} + X_{22}Y_{22}$

Initialization: every cell holds 0. Row 1 of X ($X_{11}$, then $X_{12}$) enters from the left, and row 2 ($X_{21}$, then $X_{22}$) follows one step later. Column 1 of Y ($Y_{11}$, then $Y_{21}$) enters from the top, and column 2 ($Y_{12}$, then $Y_{22}$) follows one step later. Cell (i, j) accumulates $Z_{ij}$.

Step 1:
- cell (1,1): $0 + X_{11}Y_{11}$

Step 2:
- cell (1,1): $X_{11}Y_{11} + X_{12}Y_{21} = Z_{11}$
- cell (1,2): $0 + X_{11}Y_{12}$
- cell (2,1): $0 + X_{21}Y_{11}$

Step 3:
- cell (1,2): $X_{11}Y_{12} + X_{12}Y_{22} = Z_{12}$
- cell (2,1): $X_{21}Y_{11} + X_{22}Y_{21} = Z_{21}$
- cell (2,2): $0 + X_{21}Y_{12}$

Step 4:
- cell (2,2): $X_{21}Y_{12} + X_{22}Y_{22} = Z_{22}$
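
Because a wavefront array is asynchronous, a cell can be modeled as firing whenever both of its operands have arrived, with no global clock. The Python sketch below uses input queues for this; the function name, the queue model, and the sweep order used to pick the next cell are assumptions made for illustration.

```python
from collections import deque


def wavefront_matmul(X, Y):
    """Data-driven simulation of an n x n wavefront array computing Z = X * Y.

    a_queue[i][j] holds operands that reached cell (i, j) from the left,
    b_queue[i][j] holds operands that reached it from the top. A cell fires
    when both queues are non-empty: it accumulates a*b, then forwards a to
    the right neighbor and b to the neighbor below.
    """
    n = len(X)
    a_queue = [[deque() for _ in range(n)] for _ in range(n)]
    b_queue = [[deque() for _ in range(n)] for _ in range(n)]
    for i in range(n):
        a_queue[i][0].extend(X[i])                        # row i of X, from the left
    for j in range(n):
        b_queue[0][j].extend(Y[k][j] for k in range(n))   # column j of Y, from the top
    Z = [[0] * n for _ in range(n)]
    fired = True
    while fired:                     # keep sweeping until no cell can fire
        fired = False
        for i in range(n):
            for j in range(n):
                if a_queue[i][j] and b_queue[i][j]:
                    a = a_queue[i][j].popleft()
                    b = b_queue[i][j].popleft()
                    Z[i][j] += a * b
                    if j + 1 < n:
                        a_queue[i][j + 1].append(a)
                    if i + 1 < n:
                        b_queue[i + 1][j].append(b)
                    fired = True
    return Z


print(wavefront_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # [[19, 22], [43, 50]]
```

The result does not depend on the order in which ready cells are served, which is the point of the data-driven model: no zero padding or clock alignment is needed.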
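Exercise: Consider the following wavefront array cell and provide a step-by-step block diagram approach for a (3×3) matrix multiplication Z = X*Y.

[Figure: array cell. Before: the cell holds C, with X1, X2, ... waiting on one input and Y1, Y2, ... on the other. After: X1 has passed through the cell, and the cell outputs Y1 + X1·C.]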
