
Module #1

Introduction to Parallel Processing


Professor Mostafa Abd-El-Barr

Fall Term 2024-2025



Outline
❑ Introduction to Parallelism
❑ Multiprocessor Interconnection Networks
❑ Introduction to Parallel Processing



Introduction to Parallelism
❑ Definition “Parallelism”
The ability to execute different parts of a program concurrently on
different processors.
❑ Goal: to shorten the overall execution time.

Measures of Performance

• To computer scientists: speedup, execution time.


• To applications people: size of problem, accuracy of solution,
etc.
Speedup of Algorithm
✓ Speedup of an algorithm = sequential execution time / execution time on p processors
(for the same data set).

(Figure: speedup plotted against the number of processors p.)
What Speedups Can You Get?
✓ Linear speedup
– implicitly means a 1-to-1 speedup per processor.
– (almost always) as good as you can do.
✓ Sub-linear speedup: this is the more typical case, due to the overheads of
startup, synchronization, communication, etc.

(Figure: speedup versus p, showing the linear bound and the actual, sub-linear curve.)
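
For example (illustrative numbers): a program that takes 100 s sequentially and 30 s on 4
processors achieves a speedup of 100/30 ≈ 3.3, which is sub-linear, since linear speedup on
4 processors would be exactly 4.
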
Scalability
✓ Roughly speaking, a program is said to scale to a certain number of
processors p, if going from p-1 to p processors results in some
acceptable improvement in speedup (for instance, an increase of 0.5).
Amdahl’s Law
✓ If 1/s of the program is sequential, then you can never get a
speedup better than s.
– (Normalized) sequential execution time = 1/s + (1- 1/s) = 1
– Best parallel execution time on p processors = 1/s + (1 - 1/s) /p
– When p goes to infinity, parallel execution =1/s
– Speedup = s.
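
A quick worked example (illustrative numbers): if 10% of a program is sequential
(1/s = 0.1, so s = 10), then on p = 100 processors the best normalized parallel time is
0.1 + 0.9/100 = 0.109, giving a speedup of 1/0.109 ≈ 9.2. That is already close to the
limit of s = 10, no matter how many more processors are added.
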

Why keep something sequential?

✓ Some parts of the program are not parallelizable (because of dependences).
✓ Some parts may be parallelizable, but the overhead outweighs the gain in speedup.
Fundamental Assumption
• Processors execute independently: no control
over order of execution between processors

When can 2 statements execute in parallel?

• Possibility 1
Processor 1: statement1;
Processor 2: statement2;
• Possibility 2
Processor 1: statement2;
Processor 2: statement1;
When can 2 statements execute in parallel?
• Their order of execution must not matter!
• In other words,
statement1; statement2;
must be equivalent to
statement2; statement1;
Example 1:
a = 1;
b = a;
• Statements cannot be executed in parallel (statement 2 reads the value statement 1 writes).
Example 2:
a = 1;
a = 2;
• Statements cannot be executed in parallel (both write a, so the final value depends on their order).
Types of Dependence
Suppose we have two statements, S1 and S2.
1. True dependence

S2 has a true dependence on S1 iff S2 reads a value written by S1

2. Anti-dependence

S2 has an anti-dependence on S1 iff S2 writes a value read by S1.

3. Output Dependence

S2 has an output dependence on S1 iff S2 writes a variable written by S1.
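
A minimal C sketch of the three dependence types (the statement labels and variable
names are illustrative, not from the slides):

#include <stdio.h>

int main(void) {
    int a, b;

    /* True dependence: S2 reads the value S1 wrote. */
    a = 1;              /* S1 */
    b = a;              /* S2: must run after S1 */

    /* Anti-dependence: S2 overwrites a value S1 read. */
    b = a + 1;          /* S1: reads a */
    a = 2;              /* S2: must not run before S1 */

    /* Output dependence: S2 writes a variable S1 also wrote. */
    a = 3;              /* S1 */
    a = 4;              /* S2: the final value must be S2's */

    printf("a = %d, b = %d\n", a, b);
    return 0;
}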


When can 2 statements execute in parallel?
S1 and S2 can execute in parallel iff
there are no dependences between S1 and S2
– true dependences
– anti-dependences
– output dependences
Examples:

for(i=0; i<100; i++)
  a[i] = i;
→ No dependences. Iterations can be executed in parallel.

for(i=0; i<100; i++)
  a[i] = a[i] + 100;
→ No dependences. Loop is still parallelizable.

for(i=0; i<100; i++) {
  a[i] = i;
  b[i] = 2*i;
}
→ No dependences. Iterations and statements can be executed in parallel.

for(i=0; i<100; i++)
  a[i] = f(a[i-1]);
→ Dependence between a[i] and a[i-1]. Loop iterations are not parallelizable.
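
As an illustration, the dependence-free loops above could run their iterations
concurrently; a minimal sketch using OpenMP (one possible mechanism, not covered in
these slides):

#include <omp.h>

void add100(int *a, int n) {
    /* No iteration reads or writes another iteration's a[i],
       so the iterations may execute in parallel. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = a[i] + 100;
}

The last loop above, a[i] = f(a[i-1]), must not be annotated this way: iteration i reads
the value iteration i-1 writes, a true dependence across iterations.
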
Interconnection Networks
✓ Interconnection Networks Taxonomy (by topology):
o Static networks: one-dimensional (1D), two-dimensional (2D), or hypercube (HC).
o Dynamic networks:
▪ Bus-based: single bus or multiple buses.
▪ Switch-based: single-stage (SS), multi-stage (MS), or crossbar.
Interconnection Networks
✓ Multiprocessor interconnection networks (INs) can be classified based on a number of
criteria:
(1) Mode of operation (synchronous versus asynchronous),
(2) Control Strategy (centralized versus decentralized),
(3) Switching Techniques (Circuit versus packet), and
(4) Topology (static versus dynamic).

✓ An interconnection network could be either static or dynamic:


1. Dynamic Networks
o Dynamic network connections are established on the fly as needed.
o Dynamic networks can be classified based on interconnection scheme as bus-based versus
switch-based.
▪ Bus-based networks can further be classified as single bus or multiple buses.
▪ Switch-based dynamic networks can be classified as single-stage (SS), multi-stage (MS), or
crossbar networks.
2. Static Networks
o Static network connections are fixed links.
o Static networks can be classified as
▪ one-dimensional (1D),
▪ two-dimensional (2D), or
▪ hypercube (HC).
Interconnection Networks
❑ Topology-wise, shared memory systems can be designed using the following:
✓ Single Bus Systems
o A single bus is considered the simplest way to connect multiprocessor systems.
o The figure shows an illustration of a single bus system.
(Figure: a single bus connecting processors P1, P2, ..., PN to the shared memory and I/O.)

✓ Consists of N processors, each having its own cache, connected by a shared bus.
✓ The use of local caches reduces the processor-memory traffic.
✓ All processors communicate with a single shared memory.
✓ The typical size of such a system varies between 2 and 50 processors.
✓ The actual size is determined by the traffic per processor and the bus bandwidth (defined as the maximum
rate at which the bus can propagate data once transmission has started).
✓ The single bus network complexity, measured in terms of the number of buses used, is O(1), while the time
complexity, measured in terms of the input-to-output delay, is O(N).
Machine Name        | Max # processors | Processor    | Clock rate | Max memory | Bandwidth
HP 9000 K640        | 4                | PA-8000      | 180 MHz    | 4,096 MB   | 960 MB/sec
IBM RS/6000 R40     | 8                | PowerPC 604  | 112 MHz    | 2,048 MB   | 1,800 MB/sec
Sun Enterprise 6000 | 30               | UltraSPARC 1 | 167 MHz    | 30,720 MB  | 2,600 MB/sec
Interconnection Networks
✓ Multiple Bus Systems
o A multiple-bus multiprocessor system uses several parallel buses to interconnect multiple
processors and multiple memory modules.
o Among the possibilities are
▪ multiple-bus with full bus-memory connection (MBFBMC),
▪ multiple-bus with single bus-memory connection (MBSBMC),
▪ multiple-bus with partial bus-memory connection (MBPBMC), and
▪ multiple-bus with class-based memory connection (MBCBMC).
o An illustration of the multiple-bus organization is shown below.

(Figure: N processors and M memory modules M1 ... Mj ... MM interconnected by several parallel buses.)

Connection Type | # Connections | Load on bus i
MBFBMC          | B(N + M)      | N + M
MBSBMC          | BN + M        | N + M_i (M_i = memory modules attached to bus i)
MBPBMC          | B(N + M/g)    | N + M/g
MBCBMC          | BN + ...      | N + ...
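
For instance (illustrative numbers): an MBFBMC configuration with B = 4 buses, N = 16
processors, and M = 8 memory modules requires B(N + M) = 4 × 24 = 96 connections, and
each bus carries a load of N + M = 24.
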
Interconnection Networks
✓ Single-Stage (SS) Networks
o A 2 × 2 SE supports four settings: straight, exchange, upper broadcast, and lower broadcast.
o The Shuffle (S) and Exchange (E) operations on an m-bit address are defined as:

$S(p_{m-1} p_{m-2} \ldots p_1 p_0) = p_{m-2} p_{m-3} \ldots p_1 p_0 p_{m-1}$

$E(p_{m-1} p_{m-2} \ldots p_1 p_0) = p_{m-1} p_{m-2} \ldots p_1 \overline{p_0}$

o Example
In an 8-input single-stage Shuffle-Exchange network, if the source is 0 (000) and the destination
is 6 (110), then the following is the required sequence of Shuffle/Exchange operations
and circulations of data:

E (000) → 1(001) → S (001) → 2(010) → E (010) → 3(011) → S (011) → 6(110)

The network complexity of the single stage interconnection network is O(N) and the time complexity is O(N).
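
A small C sketch of the S and E operations, re-tracing the example above (the function
names are mine, not from the slides):

#include <stdio.h>

/* Shuffle: cyclic left shift of an m-bit address. */
unsigned shuffle(unsigned p, int m) {
    unsigned msb = (p >> (m - 1)) & 1u;
    return ((p << 1) | msb) & ((1u << m) - 1u);
}

/* Exchange: complement the least significant bit. */
unsigned exchange(unsigned p) {
    return p ^ 1u;
}

int main(void) {
    int m = 3;            /* 8-input network */
    unsigned x = 0;       /* source 0 (000), destination 6 (110) */
    x = exchange(x);   printf("E -> %u\n", x);   /* 001 = 1 */
    x = shuffle(x, m); printf("S -> %u\n", x);   /* 010 = 2 */
    x = exchange(x);   printf("E -> %u\n", x);   /* 011 = 3 */
    x = shuffle(x, m); printf("S -> %u\n", x);   /* 110 = 6 */
    return 0;
}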



Interconnection Networks
✓ Multiple-Stage (MS) Networks
o Example: The Shuffle-Exchange Network (SEN)
(Figure: an 8 × 8 shuffle-exchange MIN; inputs 000 through 111 on the left are connected to
outputs 000 through 111 on the right through stages of 2 × 2 SEs.)
✓ The figure shows an example of an 8 × 8 MIN that uses the 2 × 2 SEs described before.
✓ This network is known in the literature as the Shuffle-exchange network (SEN).
✓ The settings of the SEs in the figure illustrate how a number of paths (but NOT all) can be established
simultaneously in the network.
✓ Example:
o The figure shows how three simultaneous paths connecting the three pairs of input/output can be
established.
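
One standard way to pick the SE settings for a given input/output pair in a multistage
shuffle-exchange network is destination-tag routing, which the slides do not spell out; a
minimal C sketch under the usual Omega-network formulation:

#include <stdio.h>

int main(void) {
    int n = 3;                    /* log2(8) stages in an 8 x 8 network */
    unsigned src = 0, dst = 6;    /* route 000 -> 110 */
    unsigned x = src, mask = (1u << n) - 1u;
    for (int k = n - 1; k >= 0; k--) {
        /* Inter-stage shuffle: cyclic left shift of the address. */
        x = ((x << 1) | ((x >> (n - 1)) & 1u)) & mask;
        /* SE setting: straight or exchange, so that the low bit
           matches the next destination bit (the "tag"). */
        x = (x & ~1u) | ((dst >> k) & 1u);
        printf("after stage %d: %u\n", n - 1 - k, x);
    }
    return 0;                     /* x now equals dst */
}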
Interconnection Networks
o Example: The Banyan Network

(Figure: an 8 × 8 Banyan network; inputs 000 through 111 reach outputs 000 through 111
through three stages of 2 × 2 switches, numbered 1 through 12, four per stage.)



Interconnection Networks
✓ The Crossbar Switch
o Network complexity, measured in terms of the number of switching points, is O(N²).
o Time complexity, measured in terms of the input-to-output delay, is O(1).

(Figure: an N × N crossbar; each processor-cache pair (P-C) on a row can reach any memory
module M on a column through its own switching point.)

Interconnection Networks
❑ Topology-wise, message passing INs can be divided into static, dynamic, or random.
✓ Static networks form all connections when the system is designed rather than when the
connection is needed.
✓ In a static network, messages must be routed along established links.
✓ Dynamic INs establish a connection between two or more nodes on the fly as messages are
routed along the links.
✓ In either static or dynamic networks, a single message may have to hop through intermediate
processors on its way to its destination.
✓ Therefore, the ultimate performance of an interconnection network is greatly influenced by
the number of hops taken to traverse the network.
✓ The random network is the most general and widespread, because it is the interconnection
network of the Internet.
✓ There is no regularity in the topology, hence the name "random"; connections are added and
dropped as needed.
✓ The number of hops in a path from source to destination node is equal to the number of point-
to-point links a message must traverse to reach its destination.



Interconnection Networks
✓ Cube Networks
o The interconnection pattern used in the cube network complements the ith address bit and is defined as follows:

$C_i(p_{m-1} p_{m-2} \ldots p_{i+1} p_i p_{i-1} \ldots p_1 p_0) = p_{m-1} p_{m-2} \ldots p_{i+1} \overline{p_i} p_{i-1} \ldots p_1 p_0$

o Consider a 3-bit address (N = 8). Then C0 connects each node to the node whose address
differs in bit 0 (0-1, 2-3, 4-5, 6-7), while C1 connects nodes whose addresses differ in
bit 1 (0-2, 1-3, 4-6, 5-7).
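
A one-line C sketch of the cube function (the helper name is mine):

/* Cube connection C_i: complement bit i of the node address. */
unsigned cube(unsigned p, int i) {
    return p ^ (1u << i);
}

For example, cube(0, 0) == 1 and cube(0, 1) == 2, matching the C0 and C1 patterns above.
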

Interconnection Networks
✓ The table shows a performance comparison among a number of different dynamic INs.
✓ In this table, m represents the number of buses, while N represents the number
of processors (memory modules) or inputs/outputs of the network.

Network      | Delay    | Cost (Complexity)
Bus          | O(N)     | O(1)
Multiple-bus | O(mN)    | O(m)
MINs         | O(log N) | O(N log N)

❑ Consider the following popular static topologies: (a) linear array, (b) ring, (c) mesh, (d) tree,
(e) hypercube.

❑ The following definitions are used in this connection:


✓ The degree of a network is defined as the maximum number of links (channels) connected to any node in
the network.
✓ The diameter of a network is defined as the maximum, over all pairs of nodes, of the length of
the shortest path between them.
✓ The Degree of a node, d, is defined as the number of channels incident on the node.
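
As worked examples (standard results, not listed on this slide): a linear array of N nodes
has degree 2 and diameter N - 1; a ring has degree 2 and diameter ⌊N/2⌋; a √N × √N mesh
has degree 4 and diameter 2(√N - 1); an n-cube has degree n and diameter n = log2 N.
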
Interconnection Networks

o The hypercube is referred to as a logarithmic architecture.


o This is because the maximum number of links a message has to traverse in order to reach its
destination in an n-cube containing N = 2^n nodes is log2 N = n links.
o One of the desirable features of hypercube networks is the recursive nature of their
construction: an n-cube can be constructed from two (n-1)-cubes, i.e., two sub-cubes each of degree (n-1).
o The 4-cube shown in the Figure is constructed from two sub-cubes each of degree three.
o The construction of the 4-cube out of the two 3-cubes requires an increase in the degree of each node.
o It is worth mentioning that the Intel iPSC is an example of hypercube-based commercially
available multiprocessor systems.
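
A minimal C sketch of routing in an n-cube by fixing the differing address bits one
dimension at a time (often called e-cube routing; the node addresses are illustrative):

#include <stdio.h>

int main(void) {
    int n = 4;                     /* 4-cube: N = 2^4 = 16 nodes */
    unsigned src = 0x0, dst = 0xB, cur = src;
    /* Complement, one at a time, each bit where cur and dst differ;
       the hop count equals the Hamming distance, at most n = log2 N. */
    for (int i = 0; i < n; i++) {
        if (((cur ^ dst) >> i) & 1u) {
            cur ^= (1u << i);      /* traverse the dimension-i link */
            printf("hop on dimension %d -> node %u\n", i, cur);
        }
    }
    return 0;                      /* cur == dst after at most n hops */
}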
Interconnection Networks
✓ Mesh-connected Networks
o An n-dimensional mesh can be defined as an interconnection structure that has
$K_0 \times K_1 \times \cdots \times K_{n-1}$ nodes, where n is the number of dimensions of
the network and $K_i$ is the radix of dimension i.
o The figure shows an example of a 3 × 3 × 2 mesh network.


o An advantage of mesh interconnection networks is that they are scalable.
o Larger meshes can be obtained from smaller ones without changing the node degree
o Examples include MPP from Goodyear Aerospace, Paragon from Intel, and J-Machine from MIT.

Introduction to Parallel Processing
I. Flynn’s Taxonomy of Computer Architecture
1. The most popular taxonomy of computer architecture was defined by Flynn in 1966.
2. Flynn’s classification scheme is based on the notion of a stream of information.
3. Two types of information flow into a processor: instructions and data.
4. Instruction stream: is defined as the sequence of instructions performed by the processing unit.
5. Data stream: is defined as the data traffic exchanged between the memory and the processing unit.
6. Flynn’s taxonomy comprises the following four distinct categories:
▪ Single-Instruction Single-Data streams (SISD).
▪ Single-Instruction Multiple-Data streams (SIMD).
▪ Multiple-Instruction Single-Data streams (MISD).
▪ Multiple-Instruction Multiple-Data streams (MIMD).
Introduction to Parallel Processing
SIMD Architecture
✓ There are two main configurations that have been used in SIMD machines.
1. The first scheme:
(a) each processor has its own local memory.
(b) Processors can communicate with each other through the interconnection network.
(c) If the interconnection network does not provide direct connection between a given pair of
processors, then this pair can exchange data via an intermediate processor.
(d) The ILLIAC IV used such an interconnection scheme.
(e) The interconnection network in the ILLIAC IV allowed each processor to communicate
directly with four neighboring processors in an 8 × 8 matrix pattern, such that the ith
processor can communicate directly with the (i-1)th, (i+1)th, (i-8)th, and (i+8)th processors.
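
A small C sketch of this neighbor pattern; treating the indices modulo 64 (wraparound)
is an assumption here, since the slides do not say what happens at the edges of the array:

/* Neighbors of processor i in an 8 x 8 ILLIAC-IV-style array:
   (i-1), (i+1), (i-8), and (i+8), taken modulo 64 (assumed). */
void neighbors(int i, int out[4]) {
    const int N = 64;
    out[0] = (i - 1 + N) % N;
    out[1] = (i + 1) % N;
    out[2] = (i - 8 + N) % N;
    out[3] = (i + 8) % N;
}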
Introduction to Parallel Processing
2. The second SIMD scheme:
(a) processors and memory modules communicate with each other via the interconnection network.
(b) Two processors can transfer data between each other via intermediate memory module(s) or
possibly via intermediate processor(s).
(c) The BSP (Burroughs’ Scientific Processor) used the second SIMD scheme.

(Figure: the second SIMD scheme; the control unit drives processors P1 ... Pn, which access
memory modules M1 ... Mn through the interconnection network.)



Introduction to Parallel Processing
MIMD Architecture
✓ MIMD parallel architectures are made of multiple processors and multiple memory modules
connected via some interconnection network.
✓ They fall into two broad categories:
1. Shared Memory System
✓ Typically accomplishes inter-processor coordination through a global memory shared by all
processors.
✓ These are typically server systems that communicate through a bus and cache memory controller.
✓ The bus/cache architecture alleviates the need for expensive multiported memories and interface
circuitry as well as the need to adopt a message-passing paradigm when developing application
software.
✓ In shared memory systems, memory access is balanced, i.e., each processor has equal opportunity to
read/write to memory.
✓ Therefore, these systems are also called SMP (Symmetric Multiprocessor) systems.
✓ Commercial examples of SMPs are Sun Microsystems multiprocessor servers, and Silicon Graphics
Inc. multiprocessor servers.
Introduction to Parallel Processing
2. Message Passing System (distributed memory)
✓ typically combines local memory and processor at each node of the interconnection network.
✓ There is no global memory so it is necessary to move data from one local memory to another
by means of message passing.
✓ This is typically done by a Send/Receive pair of commands, which must be written into the
application software by a programmer.
✓ Programmers must learn the message-passing paradigm, which involves data copying and
dealing with consistency issues.
✓ Commercial examples include the nCUBE, iPSC/2, and various Transputer-based systems.
✓ These systems eventually gave way to Internet-connected systems whereby the
processor/memory nodes were either Internet servers or clients on individuals’ desktops.
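
As an illustration of the Send/Receive style, here is a minimal sketch using MPI (a later
standard, chosen for the example; it is not mentioned in the slides):

#include <mpi.h>
#include <stdio.h>

/* Two-process message passing: rank 0 sends an integer to rank 1.
   There is no global memory; the data moves only as a message. */
int main(int argc, char **argv) {
    int rank, value = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}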



References
▪ Textbook Chapters 1 & 2.

