INTRODUCTION TO PARALLEL COMPUTING
B S RAMANJANEYULU
System Software Development Group,
CDAC, Bangalore.
Presentation Outline
• Need for Parallel Computing
• Requirements of Parallel Computing
• Parallel Computing Terminology
• Parallel computer architectures
• Designing parallel algorithms
• Architectural taxonomy (SISD, SIMD, MISD and MIMD)
• Symmetric multiprocessing (SMP)
• Clusters
• Parallel programming models
How to Run Applications Faster?
There are 3 ways to improve performance:
• Work Harder
• Work Smarter
• Get Help (multiple workers)
Computer Analogy
• Use faster hardware: e.g., reduce the time per instruction
• Use optimized algorithms and techniques
• Use multiple computers to solve the problem
Sequential vs. Parallel
(Figure: sequential execution vs. parallel execution of the same job)
Sequential vs. Parallel (Contd…)
Traditional sequential programs execute one instruction at a time, using one processor.
Parallelism means executing tasks simultaneously (on multiple processors) to complete the job faster.
Parallelism is achieved by:
− Breaking up the job into smaller tasks
− Assigning the smaller tasks to multiple workers (processors) to work on simultaneously
− Coordinating the workers (processors)
Parallel problem solving is natural. Examples: building construction; automobile manufacturing.
The Need For Faster Machines
Grand Challenge Problems:
Climate Modeling
Computational Fluid Dynamics
Combustion Systems
Human Genome
Structural Mechanics
Molecular Modeling
Astrophysical Calculations
Seismic Data Processing
Data Parallelism
In data parallelism, each CPU performs the same task on its own portion of the data.
Example:
  if (cpu == 1) then
     start = 1
     finish = 50
  else if (cpu == 2) then
     start = 51
     finish = 100
  end if
  do i = start, finish
     task on d(i)
  end do
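A minimal sketch of the same decomposition in C with MPI (an illustrative addition, not from the original slides; the array size, data values, and the task_on routine are assumptions):

  #include <mpi.h>
  #include <stdio.h>

  #define N 100

  /* Illustrative stand-in for "task on d(i)". */
  static void task_on(double *d, int i) { d[i] = 2.0 * d[i]; }

  int main(int argc, char **argv) {
      double d[N];
      int rank, size;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      for (int i = 0; i < N; i++)   /* illustrative data */
          d[i] = (double)i;

      /* Block decomposition: each rank owns a contiguous slice. */
      int chunk = N / size;
      int start = rank * chunk;
      int end   = (rank == size - 1) ? N : start + chunk;

      for (int i = start; i < end; i++)
          task_on(d, i);

      MPI_Finalize();
      return 0;
  }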
Task Parallelism
Multiple tasks executing concurrently on different CPUs is called task parallelism: each CPU executes a separate code block simultaneously.
Example:
  if (cpu == 1) then
     do "Task 1"
  else if (cpu == 2) then
     do "Task 2"
  end if
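A minimal C/OpenMP sketch of the same idea (an illustrative addition, not from the original slides; task_1 and task_2 are placeholder routines):

  #include <omp.h>
  #include <stdio.h>

  static void task_1(void) { printf("Task 1 on thread %d\n", omp_get_thread_num()); }
  static void task_2(void) { printf("Task 2 on thread %d\n", omp_get_thread_num()); }

  int main(void) {
      /* Each section is handed to a different thread, so the
         two tasks can execute simultaneously. */
      #pragma omp parallel sections
      {
          #pragma omp section
          task_1();
          #pragma omp section
          task_2();
      }
      return 0;
  }

Compile with an OpenMP-capable compiler, e.g. gcc -fopenmp.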
Definition
From the computer architecture point of view, a parallel computer is a "collection of processing elements that communicate and co-operate to solve large problems fast".
When this architecture is combined with a parallel algorithm, we get a 'parallel computing system'.
Sequential vs. Parallel Computing
(Figure: a sequential computer alternates fetch/store and compute; in a parallel computer each processor fetches/stores, computes, and also communicates with the other processors.)
Execution Time
• Sequential system
  – Execution time is a function of the input size
• Parallel system
  – Execution time is a function of the input size and the number of processors used
Terminology of Parallel Computing
Speedup: the speedup S(p) is defined as the ratio of the runtime of the best sequential algorithm for solving a problem to the time taken by the parallel algorithm to solve the same problem on p processors:
    S(p) = T(sequential) / T(parallel)
The p processors used by the parallel algorithm are assumed to be identical to the one used by the sequential algorithm.
Efficiency: the ratio of speedup to the number of processors:
    E = S(p) / p
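For example (illustrative numbers): a job that takes 100 s sequentially and 25 s on 8 processors has speedup S(8) = 100/25 = 4 and efficiency E = 4/8 = 0.5.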
Terminology of Parallel Computing (Contd…)
Throughput (in FLOPS): obtained by taking the clock rate of the given system and dividing it by the number of clock cycles a floating-point instruction requires.
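For example (illustrative numbers): a 1 GHz processor that needs 2 clock cycles per floating-point instruction delivers 10^9 / 2 = 500 MFLOPS.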
Cost: the cost of solving a problem on a parallel system is the product of the parallel runtime and the number of processors used:
    C = p × T(parallel)
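For example (illustrative numbers): a run that takes T(parallel) = 10 s on p = 4 processors has cost C = 40 processor-seconds.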
Requirements for Parallel Computing
Multiple processors
(The workers)
Network
(Link between workers)
OS support
Requirements for Parallel Computing (Contd…)
Parallel Programming Paradigms
Message Passing (MPI, PVM)
Data Parallel (Fortran 90 / High Performance Fortran)
Multi-Threading
Hybrid
Others (OpenMP, shmem)
Decomposition of the problem into pieces that multiple
workers can perform.
Issues in Parallel Computing
• Parallel computer architectures
• Efficient parallel algorithms
• Parallel programming models
• Parallel computer languages
• Methods for evaluating parallel algorithms
• Parallel programming tools
Designing Parallel Algorithms
Detect and exploit any inherent parallelism in an existing sequential algorithm
Invent a new parallel algorithm
Adapt another parallel algorithm that solves a similar problem
Decomposition Techniques
The process of splitting the computations in a problem into a set of concurrent tasks is referred to as decomposition.
Decomposing a problem effectively is of paramount importance in parallel computing:
− Without a good decomposition, we may not be able to achieve a high degree of concurrency.
− The decomposition must also ensure good load balance.
Decomposition Techniques (Contd…)
What is meant by a good decomposition?
− It should lead to a high degree of concurrency (fine granularity).
− The interaction among tasks should be as little as possible (coarse granularity).
The ratio of computation to communication is known as granularity.
Success Depends on the Combination of:
• Architecture
• Compiler
• Choice of the right algorithm
• Portability, maintainability, and efficient implementation
Architectural Taxonomy
Flynn's taxonomy uses the relationship of program instructions
to program data. The four categories are:
SISD – Single Instruction, Single Data Stream
SIMD – Single Instruction, Multiple Data Stream
MISD – Multiple Instruction, Single Data Stream (no practical examples)
MIMD – Multiple Instruction, Multiple Data Stream
SISD Model features
Not a parallel computer
Conventional serial, scalar von Neumann computer
A single instruction is issued in each clock cycle
Each instruction operates on a single (scalar) data element
Performance measured in MIPS
Examples: most PCs and single-CPU workstations
SIMD Model features
Also von Neumann architectures, but with more powerful instructions
Each instruction may operate on more than one data element
Usually an intermediate host executes the program logic and broadcasts instructions to the other processors
Examples: array processors and vector processors (used in the supercomputers of the 1970s and 80s)
MIMD Model features
Parallelism achieved by connecting multiple processors together
Each processor executes its own instruction stream, on its own data stream, independently of the other processors
Advantages
  Processors can execute multiple job streams simultaneously
  Each processor can perform any operation regardless of what the other processors are doing
Disadvantages
  Load-balancing overhead – synchronization is needed to coordinate processors at the end of a parallel structure in a single application
  Can be difficult to program
MIMD Block Diagram
MIMD Classification
Parallel Computer Architecture Memory Models
(Figure: shared-memory, distributed-memory, and hybrid memory organizations)
Symmetric Multiprocessors (SMP)
Symmetric Multiprocessors (SMP)
(Contd…)
• Uses commodity microprocessors with on-chip and off-chip caches.
• Processors are connected to a shared memory through a high-speed bus.
• Single address space.
• Easy application development.
• Difficult to scale.
• Difficult to repair/replace a faulty node (when compared to clusters).
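Because all processors share a single address space, loops can be parallelized with threads. A minimal C/OpenMP sketch (an illustrative addition, not from the original slides):

  #include <omp.h>
  #include <stdio.h>

  #define N 1000000

  int main(void) {
      static double a[N];

      /* Every thread sees the same shared array, so each can
         fill in its own chunk of iterations directly. */
      #pragma omp parallel for
      for (int i = 0; i < N; i++)
          a[i] = 2.0 * i;

      printf("a[N-1] = %f\n", a[N - 1]);
      return 0;
  }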
SMP, MPP and clusters
Competing Architectures
• Massively Parallel Processors (MPP) – proprietary systems built for specific purposes
  – high cost and a low performance/price ratio
• Symmetric Multiprocessors (SMP)
  – suffer from limited scalability
• Distributed Systems
  – difficult to extract high performance
• Clusters
  – High Performance Computing – with commodity processors
  – High Availability Computing – for critical applications
What is a Cluster?
A cluster is a type of parallel or distributed processing system consisting of a collection of interconnected stand-alone/complete computers that cooperatively work together as a single, integrated computing resource.
A typical cluster has:
• A faster, closer connection network than a typical LAN
• Low-latency communication protocols
• Looser coupling than an SMP
Motivation for using Clusters
The communications bandwidth between
workstations is increasing as new networking
technologies and protocols are implemented in
LANs and WANs.
Workstation clusters are easier to integrate into
existing networks than special parallel computers.
Cluster Computer Architecture
Components of Cluster Computers
• Multiple High Performance Computers
– PCs
– Workstations
– SMPs
• State-of-the-art Operating Systems
– Layered
– Micro-kernel based
• High Performance Networks/Switches
– Gigabit Ethernet
– PARAMNet
– Myrinet
• Network Interface Cards (NICs)
• Fast Communication Protocols and Services
– Active Messages (AM)
– Virtual Interface Architecture (VIA)
Components of Cluster Computers (Contd…)
• Parallel Programming Environments and Tools
– Compilers
– PVM [Parallel Virtual Machine]
– MPI [Message Passing Interface]
• Applications
– Sequential
– Parallel or Distributed
Parallel programming models -- MPI, PVM and OpenMP
• MPI – Message Passing Interface
• PVM – Parallel Virtual Machine
• Both MPI and PVM are based on the message-passing mechanism.
• Both MPI and PVM can be used with shared-memory and distributed-memory architectures.
• MPI
  – MPI is mainly for data-parallel problems.
  – Collective and asynchronous operations are more powerful in MPI.
• OpenMP – Open Multiprocessing
  – OpenMP is thread-based multiprocessing.
  – OpenMP is more suitable for SMP systems.
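As a minimal illustration of the message-passing model (an addition, not from the original slides), a C/MPI program in which every rank sends its id to rank 0:

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
      int rank, size;

      MPI_Init(&argc, &argv);                /* start the MPI runtime */
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id     */
      MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of ranks */

      if (rank == 0) {
          /* Rank 0 collects a message from every other rank. */
          for (int src = 1; src < size; src++) {
              int value;
              MPI_Recv(&value, 1, MPI_INT, src, 0, MPI_COMM_WORLD,
                       MPI_STATUS_IGNORE);
              printf("rank 0 received %d from rank %d\n", value, src);
          }
      } else {
          MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
      }

      MPI_Finalize();
      return 0;
  }

Compile with mpicc and launch with, e.g., mpirun -np 4 ./a.out (assuming an MPI installation such as MPICH or Open MPI).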
Features of CDAC’s PARAM Supercomputers
Distributed memory at the system level and shared memory at the node level.
Nodes connected by low-latency, high-throughput System Area Networks: PARAMNet and Fast/Gigabit Ethernet.
Standard Message Passing Interface (MPI) implementations: SUN MPI, IBM MPI, public-domain MPI, and C-DAC's own MPI (CMPI).
C-DAC's High Performance Computing and Communication Software (HPCC) for parallel program development and run-time support.
References
• http://www.llnl.gov/computing/tutorials/parallel_comp/
• Tutorials located in the Maui High Performance
Computing Center's "SP Parallel Programming
Workshop".
• Linux Parallel Processing HOWTO: http://www.tldp.org/HOWTO/Parallel-Processing-HOWTO.html
Thank you.