
Lecture 7: Parallel Processing

- Introduction and motivation
- Architecture classification
- Performance of Parallel Architectures
- Interconnection Network

Performance Improvement

- Reduction of instruction execution time:
  - Increased clock frequency through fast circuit technology.
  - Simplified instructions (RISC).
- Parallelism within the processor:
  - Pipelining.
  - Parallel execution of instructions (ILP):
    • Superscalar architectures.
    • VLIW architectures.
- Parallel processing.

Why Parallel Processing?

- Traditional computers are often not able to meet the performance needs of many applications:
  - Simulation of large, complex systems in physics, economics, biology, etc.
  - Distributed databases with search functions.
  - Computer-aided design.
  - Visualization and multimedia.
- Such applications are characterized by a very large amount of numerical computation and/or a high quantity of input data.
- In order to deliver sufficient performance for such applications, we can place many processors in a single computer.
- Parallel processing also has the potential of being more reliable: if one processor fails, the system continues to work, at a slightly lower performance.

Parallel Computer

- Parallel computers are architectures in which many CPUs run in parallel to implement a certain application or a set of applications.
- Such computers can be organized in very different ways, depending on several key parameters:
  - number and complexity of the individual CPUs;
  - availability of common (shared) memory;
  - interconnection technology and topology;
  - performance of the interconnection network;
  - I/O devices;
  - etc.

Parallel Program

- In order to solve a problem on a parallel computer, one must decompose the problem into sub-problems which can be solved in parallel.
- The results of the sub-problems may have to be combined to get the final result of the main problem.
- Due to data dependencies among the sub-problems, it is not easy to decompose some problems so as to get a large degree of parallelism.
- Due to data dependencies, the processors may also have to communicate with each other.
- The time taken for communication is usually very high compared with the processing time.
- The communication mechanism must therefore be very well designed in order to get good performance.

Parallel Program Example (1)

- Matrix addition:

  C = A + B =
  \begin{bmatrix}
  A_{11}+B_{11} & A_{12}+B_{12} & A_{13}+B_{13} & \cdots & A_{1M}+B_{1M} \\
  A_{21}+B_{21} & A_{22}+B_{22} & A_{23}+B_{23} & \cdots & A_{2M}+B_{2M} \\
  A_{31}+B_{31} & A_{32}+B_{32} & A_{33}+B_{33} & \cdots & A_{3M}+B_{3M} \\
  \vdots        & \vdots        & \vdots        & \ddots & \vdots        \\
  A_{N1}+B_{N1} & A_{N2}+B_{N2} & A_{N3}+B_{N3} & \cdots & A_{NM}+B_{NM}
  \end{bmatrix}

- Each row addition is a vector computation on vectors of m elements; the n row operations are independent of each other:

  for i := 1 to n do
    C[i, 1:m] := A[i, 1:m] + B[i, 1:m];
  end for;
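
A minimal sketch (not from the slides) of this row-wise decomposition, using Python's standard multiprocessing module; the helper name add_row and the matrix sizes are illustrative assumptions:

  from multiprocessing import Pool

  N, M = 4, 5                                    # assumed matrix dimensions
  A = [[i * M + j for j in range(M)] for i in range(N)]
  B = [[1] * M for _ in range(N)]

  def add_row(rows):
      # One sub-problem: element-wise addition of a single row pair.
      row_a, row_b = rows
      return [a + b for a, b in zip(row_a, row_b)]

  if __name__ == "__main__":
      with Pool() as pool:
          # The N row additions run in parallel, one per worker process.
          C = pool.map(add_row, list(zip(A, B)))
      print(C)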

Parallel Program Example (2)

- Parallel sorting:

  [Figure: an unsorted array is split into four parts (Unsorted-1 ... Unsorted-4); the four Sorting steps run in parallel, producing Sorted-1 ... Sorted-4, which a sequential Merge step combines into the final SORTED array.]

- Sorting of 1000 integers:

  cobegin
    sort(1, 250) |
    sort(251, 500) |
    sort(501, 750) |
    sort(751, 1000)
  coend;
  merge;
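
A minimal sketch (not from the slides) of the same scheme in Python: the four chunk sorts are the parallel part, the final merge the sequential part; chunk boundaries mirror the pseudocode above:

  import heapq
  import random
  from multiprocessing import Pool

  if __name__ == "__main__":
      data = [random.randint(0, 10_000) for _ in range(1000)]
      chunks = [data[i:i + 250] for i in range(0, 1000, 250)]

      with Pool(processes=4) as pool:
          sorted_chunks = pool.map(sorted, chunks)   # parallel part

      result = list(heapq.merge(*sorted_chunks))     # sequential merge
      assert result == sorted(data)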

Flynn’s Classification of Architectures

- Flynn’s classification (1966) is based on the nature of the instruction flow executed by the computer and that of the data flow on which the instructions operate.
- The multiplicity of instruction streams and data streams gives us four different classes:
  - Single instruction, single data stream - SISD
  - Single instruction, multiple data stream - SIMD
  - Multiple instruction, single data stream - MISD
  - Multiple instruction, multiple data stream - MIMD

Single Instruction, Single Data - SISD

- A single processor
- A single instruction stream
- Data stored in a single memory

[Figure: SISD system - within the CPU, the control unit feeds a single instruction stream to the processing unit, which exchanges a single data stream with memory.]

Single Instruction, Multiple Data - SIMD

- A single machine instruction stream
- Simultaneous execution on different sets of data
- A large number of processing elements
- Lockstep synchronization among the processing elements
- The processing elements can:
  - have their own private data memory; or
  - share a common memory via an interconnection network.
- Array and vector processors are the most common SIMD machines.
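
As an illustrative analogy (not from the slides), a NumPy vectorized operation is SIMD-like: a single add is applied to many data elements at once, and NumPy will typically map it onto the CPU's SIMD units:

  import numpy as np

  a = np.arange(8, dtype=np.float32)       # [0, 1, ..., 7]
  b = np.full(8, 10.0, dtype=np.float32)   # [10, 10, ..., 10]

  c = a + b    # one operation, eight data elements processed in lockstep
  print(c)     # [10. 11. 12. ... 17.]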

SIMD with Shared Memory

[Figure: one control unit broadcasts a single instruction stream (IS) to Processing Unit_1 ... Processing Unit_n; each processing unit exchanges its own data stream (DS1 ... DSn) with a shared memory through an interconnection network.]

Multiple Instruction, Single Data - MISD

- A single sequence of data
- Transmitted to a set of processors
- Each processor executes a different instruction sequence
- Never been commercially implemented!

[Figure: a single data stream is fed to processing elements PE1, PE2, ..., PEn, each executing its own instruction sequence.]

Multiple Instruction, Multiple Data - MIMD

- A set of processors
- Simultaneously execute different instruction sequences
- Different sets of data
- The MIMD class can be further divided:
  - Shared memory (tightly coupled):
    • Symmetric multiprocessor (SMP)
    • Non-uniform memory access (NUMA)
  - Distributed memory (loosely coupled) = clusters

MIMD with Shared Memory

[Figure: CPU_1 ... CPU_n, each containing its own control unit (instruction stream IS1 ... ISn), processing unit (data stream DS1 ... DSn), and local memory LM1 ... LMn; all CPUs access a shared memory through an interconnection network.]

Cautions

- Very fast development in parallel processing and related areas has blurred concept boundaries, causing a lot of terminological confusion:
  - concurrent computing,
  - multiprocessing,
  - distributed computing,
  - etc.
- There is no strict delimiter for contributors to the area of parallel processing; it includes computer architecture (CA), operating systems (OS), high-level languages (HLLs), compilation, databases, and computer networks.

Performance of Parallel Architectures

Important questions:

- How fast does a parallel computer run at its maximal potential?
- What execution speed can we expect from a parallel computer for a given application?
  - Note the increase of multi-tasking and multi-threaded computing.
- How do we correctly measure the performance of a parallel computer and the performance improvement we get by using one?

Performance Metrics

- Peak rate: the maximal computation rate that can theoretically be achieved when all processors are fully utilized.
  - The peak rate is of no practical significance for the user.
  - It is mostly used by vendor companies for marketing their computers.

- Speedup: measures the gain we get by using a certain parallel computer to run a given parallel program in order to solve a specific problem.

  S = \frac{T_s}{T_p}

  T_s: execution time needed with the sequential algorithm;
  T_p: execution time needed with the parallel algorithm.

Performance Metrics (Cont’d)

- Efficiency: this metric relates the speedup to the number of processors used; by this it provides a measure of how efficiently the processors are used.

  E = \frac{S}{P}

  S: speedup;
  P: number of processors.

- In the ideal situation, in theory, S = P, which means E = 1.
- In practice, the ideal efficiency of 1 cannot be achieved!
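
A minimal sketch (illustrative numbers, not from the slides) of how the two metrics are computed from measured execution times:

  def speedup(t_seq: float, t_par: float) -> float:
      """S = Ts / Tp."""
      return t_seq / t_par

  def efficiency(s: float, p: int) -> float:
      """E = S / P."""
      return s / p

  # Example: a job takes 120 s sequentially and 20 s on 8 processors.
  S = speedup(120.0, 20.0)   # 6.0
  E = efficiency(S, 8)       # 0.75 -- below the ideal E = 1
  print(S, E)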

Amdahl’s Law

- Let f be the ratio of computations that, according to the algorithm, have to be executed sequentially (0 ≤ f ≤ 1), and P the number of processors.

  T_p = f \cdot T_s + \frac{(1 - f) \cdot T_s}{P}

  S = \frac{T_s}{T_p} = \frac{1}{f + \frac{1 - f}{P}}

[Figure: speedup S as a function of the sequential fraction f, for a parallel computer with 10 processing elements; S falls steeply from 10 towards 1 as f grows from 0 to 1.]

Amdahl’s Law (Cont’d)

- Amdahl’s law says that even a small ratio of sequential computation imposes a limit on the speedup.
- A speedup higher than 1/f cannot be achieved, regardless of the number of processors, since

  S = \frac{1}{f + \frac{1 - f}{P}}

  For example, if there is 20% sequential computation, the speedup will be at most 5, even if you have 1 million processors.

- To efficiently exploit a high number of processors, f must be small (the algorithm has to be highly parallel), since

  E = \frac{S}{P} = \frac{1}{f \cdot (P - 1) + 1}
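
A minimal sketch (assumed helper names, not from the slides) that evaluates these two formulas:

  def amdahl_speedup(f: float, p: int) -> float:
      """S = 1 / (f + (1 - f) / P)."""
      return 1.0 / (f + (1.0 - f) / p)

  def amdahl_efficiency(f: float, p: int) -> float:
      """E = S / P = 1 / (f * (P - 1) + 1)."""
      return 1.0 / (f * (p - 1) + 1.0)

  # 20% sequential computation caps the speedup near 5, no matter how large P gets.
  for p in (10, 1_000, 1_000_000):
      print(p, round(amdahl_speedup(0.2, p), 3), round(amdahl_efficiency(0.2, p), 6))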

Other Aspects that Limit the Speedup

- Besides the intrinsic sequentiality of parts of an algorithm, there are also other factors that limit the achievable speedup:
  - communication cost;
  - load balancing of the processors;
  - costs of creating and scheduling processes; and
  - I/O operations (mostly sequential in nature).
- There are many algorithms with a high degree of parallelism:
  - the value of f is very small and can be ignored;
  - they are suited for massively parallel systems; and
  - in such algorithms, the other limiting factors, like the cost of communication, become critical.

Efficiency and Communication Cost

- Consider a highly parallel computation, where f is small and can be neglected.
- Let fc be the fractional communication overhead of a processor:
  - T_calc: the time a processor spends executing computations;
  - T_comm: the time a processor is idle because of communication.

  f_c = \frac{T_{comm}}{T_{calc}}, \qquad T_p = \frac{T_s}{P} \cdot (1 + f_c)

  S = \frac{T_s}{T_p} = \frac{P}{1 + f_c}, \qquad E = \frac{1}{1 + f_c} \approx 1 - f_c \quad (\text{if } f_c \text{ is small})

- With algorithms having a high degree of parallelism, massively parallel computers, consisting of a large number of processors, can be used efficiently if fc is small.
- The time spent by a processor on communication has to be small compared to its time spent on computation.
- In order to keep fc reasonably small, the size of the processes cannot go below a certain limit.
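
A minimal sketch (illustrative numbers, not from the slides) showing how fc affects speedup and efficiency:

  def comm_speedup(p: int, fc: float) -> float:
      """S = P / (1 + fc)."""
      return p / (1.0 + fc)

  def comm_efficiency(fc: float) -> float:
      """E = 1 / (1 + fc), roughly 1 - fc when fc is small."""
      return 1.0 / (1.0 + fc)

  # 1024 processors: a 5% communication overhead still gives E close to 0.95,
  # while fc = 1 (as much communication as computation) halves the efficiency.
  for fc in (0.05, 0.5, 1.0):
      print(fc, round(comm_speedup(1024, fc), 1), round(comm_efficiency(fc), 3))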

Interconnection Network

- The interconnection network (IN) is a key component of the architecture. It has a decisive influence on:
  - the overall performance; and
  - the total cost of the architecture.
- The traffic in the IN consists of data transfers and transfers of commands and requests (control information).
- The key parameters of the IN are:
  - total bandwidth: transferred bits/second; and
  - implementation cost.

Single Bus

[Figure: Node1, Node2, ..., Noden attached to one shared bus.]

- Single-bus networks are simple and cheap.
- Only one communication is allowed at a time; the bandwidth is shared by all nodes.
- Performance is relatively poor.
- In order to keep up a certain level of performance, the number of nodes is limited (around 16 - 20).
- Multiple buses can be used instead, if needed.

Completely Connected Network

[Figure: completely connected network with N nodes and N × (N-1)/2 wires.]

- Each node is connected to every other one.
- Communications can be performed in parallel between any pair of nodes.
- Both performance and cost are high.
- Cost increases rapidly with the number of nodes.

Crossbar Network

[Figure: crossbar network connecting Node1, Node2, ..., Noden through a grid of switches.]

- A dynamic network: the interconnection topology can be modified by configuring the switches.
- It is completely connected: any node can be directly connected to any other.
- Fewer interconnections are needed than for the static completely connected network; however, a large number of switches is needed.
- A large number of communications can be performed in parallel (even though each node can receive or send only one data item at a time).

Mesh Network

[Figure: a two-dimensional mesh, and the same mesh with wrap-around connections (torus).]

- Cheaper than completely connected networks, while giving relatively good performance.
- In order to transmit data between two nodes, routing through intermediate nodes is needed (maximum 2×(n-1) intermediates for an n×n mesh).
- It is possible to provide wrap-around connections:
  - Torus.
- Three-dimensional meshes have also been implemented.
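
A minimal sketch (assumed helper names, not from the slides) of the routing distance between two nodes of an n×n mesh, with and without wrap-around connections:

  def mesh_hops(a, b):
      """Manhattan distance between nodes a = (x1, y1) and b = (x2, y2)."""
      return abs(a[0] - b[0]) + abs(a[1] - b[1])

  def torus_hops(a, b, n):
      """Wrap-around links let each coordinate take the shorter way round."""
      dx, dy = abs(a[0] - b[0]), abs(a[1] - b[1])
      return min(dx, n - dx) + min(dy, n - dy)

  n = 8
  print(mesh_hops((0, 0), (n - 1, n - 1)))       # 14 = 2*(n-1) hops on the mesh
  print(torus_hops((0, 0), (n - 1, n - 1), n))   # 2 hops on the torus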

Hypercube Network

[Figure: hypercubes of dimension 2, 3, 4, and 5.]

- 2^n nodes are arranged in an n-dimensional cube. Each node is connected to n neighbours.
- In order to transmit data between two nodes, routing through intermediate nodes is needed (maximum n intermediates).
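
A minimal sketch (assumed helper name, not from the slides): if node labels are n-bit numbers, the routing distance between two hypercube nodes is the Hamming distance of their labels, which is at most n:

  def hypercube_distance(a: int, b: int) -> int:
      """Number of hops = number of bit positions in which the labels differ."""
      return bin(a ^ b).count("1")

  # 4-dimensional cube: 2**4 = 16 nodes
  print(hypercube_distance(0b0000, 0b1111))   # 4 -- opposite corners, n hops
  print(hypercube_distance(0b0101, 0b0111))   # 1 -- direct neighbours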

Summary

- The growing need for high performance cannot always be satisfied by computers with a single CPU.
- In parallel computers, several CPUs run concurrently in order to solve a given application.
- Parallel programs have to be available in order to make use of parallel computers.
- Computers can be classified based on the nature of the instruction flow and that of the data flow on which the instructions operate.
- A key component of a parallel architecture is also the interconnection network.
- The performance we can get with a parallel computer depends not only on the number of available processors but is also limited by the characteristics of the executed programs.
