Lec7 PDF
Introduction and
motivation
Architecture classification
Performance of Parallel
Architectures
Interconnection Network
Performance Improvement
Reduction of instruction execution time:
Increased clock frequency by fast circuit technology.
Simplify instructions (RISC)
Parallel processing.
1
Why Parallel Processing?
Traditional computers are often unable to meet the performance
needs of many applications:
Simulation of large complex systems in physics, economy, biology,
etc.
Distributed data base with search function.
Computer aided design.
Visualization and multimedia.
Parallel Computer
Parallel computers refer to architectures in which
many CPUs run in parallel to implement a certain
application or a set of applications.
Such computers can be organized in very different
ways, depending on several key parameters:
number and complexity of individual CPUs;
availability of common (shared) memory;
interconnection technology and topology;
performance of interconnection network;
I/O devices;
etc.
Parallel Program
In order to solve a problem using a parallel computer, one must
decompose the problem into sub-problems, which can be solved
in parallel.
The results of sub-problems may have to be combined to get the
final result of the main problem.
Due to data dependency among the sub-problems, it is not easy
to decompose some problems to achieve a large degree of parallelism.
Due to data dependency, the processors may also have to
communicate among each other.
The time taken for communication is usually very high when
compared with the processing time.
The communication mechanism must therefore be very well
designed in order to get a good performance.
Parallel Program Example (2)
Parallel sorting:
[Figure: an unsorted array of 1000 integers split into four parts
(Unsorted-1 … Unsorted-4), each sorted in parallel, then merged into SORTED]
cobegin
  sort(1,250) |
  sort(251,500) |
  sort(501,750) |     -- sorting of 1000 integers
  sort(751,1000)
coend;
merge;
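The cobegin/coend program above can be sketched in Python. This is a minimal illustration, not the slide's own notation: a thread pool stands in for the four parallel sort calls (in CPython the sorts will not run truly in parallel because of the global interpreter lock; a process pool would be used in practice), and all names are illustrative.

```python
# Sketch of the cobegin/coend sorting example: split 1000 integers into
# four chunks, sort the chunks concurrently, then merge sequentially.
# Chunk sizes follow the slide (4 x 250).
from concurrent.futures import ThreadPoolExecutor
from heapq import merge
from random import sample

def sort_chunk(chunk):
    # Each worker sorts its own sub-list; no data dependency between chunks.
    return sorted(chunk)

data = sample(range(10_000), 1000)                  # 1000 unsorted integers
chunks = [data[i:i + 250] for i in range(0, 1000, 250)]

with ThreadPoolExecutor(max_workers=4) as pool:     # cobegin ... coend
    sorted_chunks = list(pool.map(sort_chunk, chunks))

result = list(merge(*sorted_chunks))                # merge;
```

The final merge is sequential, which is exactly the data dependency discussed earlier: the sub-results must be combined to get the final result.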
Single Instruction, Single Data - SISD
A single processor
A single instruction stream
Data stored in a single memory
[Figure: a single CPU; the Control Unit feeds one instruction stream to the Processing Unit]
SIMD with Shared Memory
[Figure: one Control Unit broadcasts a single instruction stream (IS) to
Processing Units 1…n; each unit operates on its own data stream (DS1…DSn)
and reaches the shared memory through an interconnection network]
Multiple Instruction, Multiple Data - MIMD
A set of processors
Simultaneously execute different instruction
sequences
Different sets of data
[Figure: CPU_1…CPU_n, each with its own Control Unit, Processing Unit,
instruction stream (IS), data stream (DS), and local memory (LM); the CPUs
access a shared memory through an interconnection network]
Cautions
Very fast development in parallel processing and
related areas has blurred concept boundaries, causing
a lot of terminological confusion:
concurrent computing,
multiprocessing,
distributed computing,
etc.
There is no strict delimiter for contributors to the area
of parallel processing; it includes computer architecture (CA),
operating systems (OS), high-level languages (HLLs),
compilation, databases, and computer networks.
Performance of Parallel Architectures
Performance Metrics
Peak rate: the maximal computation rate that can be
theoretically achieved when all processors are fully utilized.
The peak rate is of no practical significance for the user.
It is mostly used by vendor companies for marketing their
computers.
Speedup: S = Ts / Tp, where Ts is the execution time on a single
processor and Tp the execution time on the parallel computer.
Performance Metrics (Cont’d)
Efficiency: this metric relates the speedup to the number of
processors used; by this it provides a measure of the efficiency
with which the processors are used.
E = S / P
S: speedup;
P: number of processors.
Amdahl’s Law
Let f be the ratio of computations that, according to the
algorithm, have to be executed sequentially (0 ≤ f ≤ 1), and P the
number of processors.

Tp = f × Ts + (1 – f) × Ts / P

S = Ts / Tp = 1 / (f + (1 – f) / P)
[Figure: speedup S as a function of the sequential fraction f, for a
parallel computer with 10 processing elements; S falls steeply from 10
towards 1 as f grows from 0 to 1.0]
Amdahl’s Law (Cont’d)
Amdahl’s law says that even a small ratio of sequential
computation imposes a limit on speedup.
A speedup higher than 1/f cannot be achieved, regardless of the
number of processors, since

S = 1 / (f + (1 – f) / P) ≤ 1 / f

If there is 20% sequential computation, the speedup will
be at most 5, even if you have 1 million processors.
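As a minimal sketch, the law can be written as a one-line function (the function name is mine; the formula is the one from the slide):

```python
def amdahl_speedup(f, p):
    """Amdahl's law: speedup S = 1 / (f + (1 - f) / p), where f is the
    sequential fraction of the computation and p the number of processors."""
    return 1.0 / (f + (1.0 - f) / p)
```

For f = 0.2 the speedup approaches but never reaches 1/f = 5, no matter how large p gets.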
Efficiency and Communication Cost
Consider a highly parallel computation: f is small and can be neglected.
Let fc be the fractional communication overhead of a processor:
Tcalc: the time that a processor executes computations;
Tcomm: the time that a processor is idle because of communication.

fc = Tcomm / Tcalc        Tp = (Ts / P) × (1 + fc)

S = Ts / Tp = P / (1 + fc)        E = 1 / (1 + fc) ≈ 1 – fc  (if fc is small)
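A small sketch of these relations as Python functions (the function names are mine; the formulas are the ones above, with f ≈ 0):

```python
def parallel_time(ts, p, fc):
    # Tp = (Ts / P) * (1 + fc): ideal per-processor time plus
    # fractional communication overhead fc.
    return (ts / p) * (1.0 + fc)

def efficiency(ts, p, fc):
    # E = S / P = 1 / (1 + fc), roughly 1 - fc for small fc.
    s = ts / parallel_time(ts, p, fc)   # S = P / (1 + fc)
    return s / p
```

With ts = 100, p = 10, and fc = 0.05, efficiency is 1/1.05 ≈ 0.952, close to the 1 – fc = 0.95 approximation.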
Interconnection Network
The interconnection network (IN) is a key component
of the architecture. It has a decisive influence on:
the overall performance; and
total cost of the architecture.
Single Bus
Node1 Node2 ... Noden
Completely Connected Network
N x (N-1)/2 wires
Crossbar Network
[Figure: Node1 … Noden connected through a crossbar switch, which can
connect disjoint node pairs simultaneously]
Mesh Network
Torus: a mesh with wraparound links connecting opposite edges.
Hypercube Network
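To compare the topologies above, here is a sketch that counts links for n nodes. The N × (N – 1)/2 figure for the completely connected network is from the slide; the mesh, torus, and hypercube counts are standard results, and the function assumes n is both a perfect square (for a k × k mesh/torus) and a power of two (for the hypercube).

```python
from math import isqrt, log2

def link_counts(n):
    """Link counts for the network topologies on the slides, assuming
    n is a perfect square (k x k mesh/torus) and a power of two (hypercube)."""
    k = isqrt(n)
    return {
        "bus": 1,                              # one shared medium
        "fully_connected": n * (n - 1) // 2,   # slide: N x (N-1)/2 wires
        "mesh_2d": 2 * k * (k - 1),            # k x k grid, no wraparound
        "torus_2d": 2 * k * k,                 # mesh plus wraparound links
        "hypercube": (n // 2) * int(log2(n)),  # (N/2) x log2(N)
    }
```

For n = 16 this gives 120 wires fully connected but only 32 for the hypercube, which is why cost, not just performance, drives the choice of topology.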
Summary
The growing need for high performance can not always be
satisfied by computers with a single CPU.
With parallel computers, several CPUs are running concurrently
in order to solve a given application.
Parallel programs have to be available in order to make use of
the parallel computers.
Computers can be classified based on the nature of the
instruction flow and that of the data flow on which the
instructions operate.
A key component of a parallel architecture is the
interconnection network.
The performance we can get with a parallel computer depends
not only on the number of available processors but is also limited
by the characteristics of the executed programs.