ABSTRACT
In this paper a survey of current trends in parallel computing is presented that covers all aspects of a parallel computing system. A large computational problem that cannot be solved by a single CPU can be divided into sufficiently small subtasks which are processed simultaneously by a parallel computer. A parallel computer consists of parallel computing hardware, a parallel computing model and software support for parallel programming. Parallel performance measurement parameters and parallel benchmarks are used to measure the performance of a parallel computing system. The hardware and the software are specially designed for parallel algorithms and programming. This paper explores all of these aspects of parallel computing and its usefulness.

Keywords
Parallel computing, parallel computing hardware, parallel model, parallel benchmarks.

1. INTRODUCTION
A processor has its own physical limit on maximum processing speed. To overcome this limitation, multiple processors are connected and co-operate with each other to solve grand challenge problems. Parallel computing refers to the processing of multiple jobs simultaneously on multiple processors. A large problem can be divided into multiple independent tasks of nearly the same size by applying an appropriate task partitioning technique, and each of the tasks is then executed on a different processor simultaneously. Numerous application problems today need more computing power than a usual sequential computer can provide. A large problem may take a long, or even indefinite, time to finish when it is processed on a single processor; the time taken may then be too long for the result to be of any use, or the result may arrive too late for real-time computing. The clear solution to this problem is parallel computing, which provides a cost-effective solution by connecting a larger number of processors through high speed communication media.

Parallel computing models are mainly of two types: shared memory and distributed computing. In a shared memory architecture a number of processors are connected to a common shared memory, and data or instructions are shared under the protection of locks and semaphores. This is easy to program but can sometimes produce misleading results. In distributed computing, independent processors having their own memory are connected through a fast communication medium, and data and information are shared through message passing. This is more difficult to implement but yields better computing performance. There is another model, known as the hybrid model, which combines concepts from both the shared memory model and the distributed model.

Assigning the tasks to the processors does not by itself solve the problem in parallel. A faster processor may sit idle while a slower processor is busy, so a parallel computing system can run slowly because of an uneven distribution of tasks. The number of tasks on a processor is called its workload. Some processors may have a higher workload than others, so some processors become overloaded and others lightly loaded, causing an imbalance in task idle time. For an efficient parallel computing system the task idle time should be as small as possible. The workload of the overloaded processors can be shared by the lightly loaded processors by invoking an appropriate load balancing algorithm. The workload should be shared equally when the computing system is homogeneous; otherwise the faster processors should process more jobs per unit time than the slower ones to keep the performance of the system high.

2. PARALLEL COMPUTING HARDWARE ARCHITECTURE
A revolutionary change has taken place in the last decade in hardware development related to computation, and several parallel computing hardware architectures now exist. Depending on the cost and the type of computational problem, parallel computing hardware can be divided mainly into two categories: common parallel computer architecture and supercomputer architecture. A classification of parallel computers is shown in Fig 1. Supercomputers are very expensive and take a long time to produce. Not every parallel application needs a dedicated supercomputer, and most organizations cannot buy one because of its high cost. Fortunately an alternative concept has emerged, known as common parallel computing, in which a system is built from a large number of cheap and easily available autonomous processors such as workstations or PCs. Compared to supercomputers, it is therefore becoming extremely popular for large computing tasks such as scientific calculations.

2.1. Common Parallel Computing Architecture
In this architecture, non-dedicated computers which are easily available are connected through a high speed communication medium to act as a parallel computer. The architecture requires very little effort and can be built at negligible cost if general purpose computers are available. The combined processing power and storage capacity easily solve many big problems in parallel that could not be solved otherwise. Common parallel computers can be divided into three categories: multiprocessor computers, shared memory architectures and distributed memory computing architectures.

In a multiprocessor architecture more than one CPU is incorporated into a single computer. The compiler is responsible for parallelizing the code automatically. This type of architecture is not very efficient, but it is better than a computer with a single CPU.

In a shared memory architecture a number of processors are connected to a common central memory. Shared memory architecture is also well known as the Symmetric Multi-Processor (SMP) [6]: it has identical PEs with equal access to one another, and the operating system kernel can run on any of them.
Since all processors share a single address space, data sharing is fast, but processes can corrupt each other's data at the same time, so semaphores and locks are used to protect the data from corruption. There is a lack of scalability between PEs and memory, which means that we cannot add as many PEs as we need to a limited memory. This problem arises mainly due to bus contention [12, 16]. Examples of SMP machines are the IBM R50, SGI Power Challenge and DEC AlphaServer 8400.

Distributed shared memory (DSM) is another type of shared memory architecture. In DSM a memory is dedicated to each processor, but the memories are connected through a bus to form a shared memory, and inter-process communication takes place through shared variables. Although the memory is distributed in DSM, the system hardware and software present it as a single address space architecture. DSM removes the problem of bus contention and provides better performance than SMP. DSM architecture machines are the Stanford DASH, SGI Origin 2000 and Cray T3D.

In a distributed memory MIMD computer, all the PEs, which are connected through a network, have their own independent local memory. Each PE is a full computer connected through the network. This architecture is also known as loosely coupled because the PEs are not as tightly integrated as in a shared memory architecture. As a PE cannot directly access the memory of the other PEs, it is called No Remote Memory Access (NORMA). PEs communicate with one another over the communication network by message passing. The network that connects the PEs may have different topologies such as bus, tree, mesh, cube etc. Clusters of workstations (COW) and PC clusters fall under this category [7, 9, 14]. A cluster is a collection of independent computers that are physically interconnected through a LAN with high performance network technology like FDDI, Fibre Channel, ATM switches etc.

2.2. Super Computing Architecture
Supercomputers have an extremely high execution rate and extremely high I/O throughput, and they need very large primary and secondary memory. Cost and time are therefore the two crucial factors in producing supercomputers. The principal supercomputing architectures are the Massively Parallel Processor (MPP) and the Parallel Vector Processor (PVP) [9, 14]. An MPP system is a collection of hundreds or thousands of commodity processors interconnected by a high speed, low latency communication network. The memory of the processors in an MPP is distributed, but the processors are synchronized by blocking message passing operations. Each process has its own physical address space and communicates with the others through message passing primitives. The Intel Paragon, Cray T3E and TFLP are examples of this category. A PVP uses a few specially designed vector processors, each with a performance of at least 1 Gflop/s, so a PVP can maintain extremely good performance for some particular applications, and PVPs are naturally more expensive than MPPs. A PVP makes use of a huge number of vector registers and an instruction buffer instead of a cache. The Cray C-90, Cray T-90 and NEC SX-4 are examples of PVP machines. A list of the top ten supercomputers [13] is shown in Table 1.

3. PARALLEL COMPUTING MODEL
The need for a parallel computing model arises when solving any problem, in order to facilitate analysis and prediction. The models are used for developing efficient problem solving tools, and thus a model is utilized to solve a particular class of problem. A good computational model makes a complicated problem easier for the program designer and developer. It also simplifies the mapping of load effectively onto real computers. Different parallel computing models are used to solve parallel problems. According to the memory model, parallel computational models can be divided into three categories: shared memory computational models, distributed computational models and hierarchical memory models. We introduce these models in brief.

3.1. Shared Memory Parallel Computing Models
In a shared memory architecture a number of processors are connected to a common central memory. Since all processors share a single address space, data sharing is fast, but processes can corrupt each other's data at the same time, so semaphores and locks are used to protect the data from corruption. There is a lack of scalability between PEs and memory, which means that we cannot add as many PEs as we need to a limited memory.

3.1.1. PRAM Model
The PRAM model is a relatively old and widely used shared memory computing model for the design and analysis of parallel algorithms; it was first developed by Fortune, Wyllie and Goldschlager. A limited number of processors share a common global pool of memory. The processors operate synchronously, are allowed to access the memory concurrently, and each access takes only one unit of time to complete. By imposing different restrictions on memory access, the PRAM model takes different forms. The CRCW PRAM model [15] permits simultaneous reads and writes to the same memory cell. The CREW PRAM [15] is another model that permits simultaneous reads of the same memory cell but permits only one processor to write to a cell at a time. Another model, which does not permit concurrent access to any given memory cell, is known as the EREW PRAM. A further model, which uses limited communication bandwidth by accounting for the maximum memory contention in each phase of an algorithm, is known as the QSM model. Though the PRAM model is easy to program against, it suffers from memory and network contention problems.

3.2. Distributed Memory Parallel Computing Models
There are a number of distributed memory parallel computing models, in which complete computers, each having their own memory, are connected through a communication network. The BSP and LogP models are the most well known models in this category. These models remove the shortcomings of the shared memory computational models. We introduce both of them in brief.

3.2.1. BSP Model
The bulk synchronous parallel (BSP) model has three components: p processor/memory pairs, supersteps with periodicity L, and the bandwidth factor g, which is defined as the ratio of computation to communication [17]. In each superstep, each processor/memory pair carries out computation on data local to it. After every L units of time a global check is made to verify whether all components have finished. If all components have not yet finished, another superstep is allowed so that all the components can finish. A bandwidth limitation exists in the BSP model: it can send at most h = L/g messages per superstep, which is known as an h-relation. This model is useful because it includes three parameters and treats communication and computation separately.
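To make the superstep idea concrete, the following is a minimal sketch, not taken from the paper, of one BSP-style superstep written with MPI (the message passing library discussed later in Section 5). Each process performs a purely local computation, and the superstep ends with a global reduction that also acts as the synchronization point; the data size N and the summation are illustrative placeholders.

```c
/* A minimal sketch of one BSP-style superstep using MPI (illustrative only).
 * Each process computes on its local data, then all processes synchronize,
 * which corresponds to the end-of-superstep global check described above.
 * Build with an MPI wrapper, e.g.: mpicc bsp_sketch.c -o bsp_sketch
 */
#include <mpi.h>
#include <stdio.h>

#define N 1000000  /* size of each process's local work (assumed value) */

int main(int argc, char **argv)
{
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* Superstep body: local computation on data private to this process. */
    double local_sum = 0.0;
    for (long i = 0; i < N; i++)
        local_sum += (double)(rank + 1) / (i + 1);

    /* End of superstep: combine the partial results and ensure every
     * process has finished before the next superstep begins. */
    double global_sum = 0.0;
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    if (rank == 0)
        printf("p = %d processes, global sum = %f\n", p, global_sum);

    MPI_Finalize();
    return 0;
}
```

In BSP terms, the cost of such a superstep is the local computation time plus the communication and synchronization overhead captured by the parameters g and L.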
3.2.2. LogP Model
The LogP model has four parameters: P (the number of computers), L (the latency of message passing), o (the overhead involved in message passing) and g (the minimum time interval between successive messages) [8]. At most L/g messages can be in transit from one processor to another at any instant. If a process has more than this number of messages to transmit, it stalls until a message can be sent without exceeding the capacity limit. The model is asynchronous in nature, and thus the message passing latency is unpredictable. In this model all the parameters need not be considered at the same time; some of them can be neglected. For example, for algorithms that do not communicate data frequently, the bandwidth and capacity limits can be ignored.

3.3. Hierarchical Memory Computational Models
Processor speed is increasingly greater than memory speed, so the cost of memory access should be considered. Since the access time of different memory locations differs, a more precise communication cost can be evaluated and the performance of the model can be predicted more accurately. This kind of model is very suitable for classes of problems in which the bulk of data movement occurs between different levels of the memory hierarchy.

The hierarchical memory model (HMM) [1] and the hierarchical memory model with block transfer (HMM with BT) [2] are two early computational models with a memory hierarchy. In the HMM model there are K levels of memory, each level k containing 2^k memory locations; access to memory location x takes f(x) time for some function f. The HMM with BT model is slightly different from HMM in that a block transfer of data ending at address a with length l has cost f(a) + l.

3.3.1. UMH Model
This model differs from the models above in that the cost function for a memory access is a function of the memory level number rather than of the data address. Another difference is that the UMH model allows simultaneous data transfers on the buses of different levels, while HMM and HMM with BT only allow one transfer at a time.

3.3.2. HPM Model
HPM is a memory-hierarchical model for general homogeneous parallel computer systems with hierarchical parallelism and hierarchical memories. It contains a hierarchy of enhanced RAMs that co-operate with each other. In the HPM model, level K is used for memory access and level K+ is used for message passing. Its organization of hierarchical memories has many features in common with the UMH and DRAM models.

4. PARALLEL PROGRAMMING DESIGN and IMPLEMENTATION
There are three main approaches to designing a parallel algorithm. The first is the parallelization of a sequential problem which contains potential parallelism; this inherent parallelism can be exploited to make it parallel. The second is to design a new problem as parallel from the start. The third is to take a well known parallel algorithm and solve the problem accordingly.

There are several ways of designing a parallel algorithm. The most widely used techniques are partitioning, divide and conquer, pipelining etc. In partitioning, a problem is divided into non-overlapping sub-problems of nearly equal size; the sub-problems are then solved concurrently. In divide and conquer the problem is first broken into sub-problems; the sub-problems are solved recursively and their results are merged at the end. Pipelining is a simple but effective technique for parallel algorithms: the problem is divided into segments, the output of one segment is the input of the next segment, and the segments produce results at the same rate.

Decomposition, assignment, mapping and scheduling are the common steps in implementing a parallel algorithm. A proper division of a large problem into a number of sub-problems facilitates the effective implementation of a parallel algorithm. There are two main techniques for the decomposition of a problem: domain decomposition and functional decomposition.

In domain decomposition the data associated with the problem is divided into small pieces of nearly equal size, and the algorithm is then divided so that each task operates on one piece; different tasks operate on different data. In domain decomposition the tasks start simultaneously.

Functional decomposition divides the algorithm itself into independent tasks which can be processed simultaneously. If the data needed by the tasks is also independent, the division is perfect; otherwise considerable communication is needed to avoid duplicating data. All tasks commence concurrently, but some of the tasks have to wait until their data becomes available.

5. SOFTWARE SUPPORT for PARALLEL PROGRAMMING
Designing a parallel program is always a challenging matter, and more and more attention is being paid to parallel program design. Two methodologies are widely used for producing parallel programs: auto-parallelizing compilers and library based software.

Auto-parallelization works in two fashions. The first is the fully automatic compiler, which finds the parallelism during compilation of the source code; this approach mainly aims to parallelize loops such as for and do loops. The second is programmer directed, where compiler directives are used to make the code parallel.

Library based parallel software embeds its library into sequential programming languages to support the parallel execution of a problem. MPI and OpenMP are the most widely accepted standards for parallel programming. MPI is a common message passing library which has primitive functions like send()/receive() by which an MPI process communicates with other processes through message passing. OpenMP is a parallel framework supporting compiler directives and library support routines. Apart from these, LINDA, CPS, P4 etc. are examples of parallel programming software.

6. PARALLEL PERFORMANCE MEASUREMENT
We will introduce some performance measurement parameters: execution time, speedup, efficiency and scalability. Each parameter has its own way of describing the characteristics of a parallel program.

6.1. Execution Time
Execution time is the time taken to execute an algorithm.
For better performance the execution time should be kept to a minimum; that is, the lower the value of the execution time, the better the performance of the system. Execution time is generally denoted by Ts and Tp, where Ts represents the execution time of the fastest sequential program for the problem and Tp represents the execution time of the parallel program on p processors. The relation between Ts and Tp appears in the other parameters below.

6.2. Speed-up
Speed-up measures how many times faster a parallel program runs than a sequential one when both programs solve the same problem. Speed-up is denoted by Sp and is the ratio of Ts to Tp:

Sp = Ts / Tp

Hence Sp measures the benefit of a parallel computer over a sequential computer. The highest value of Sp would equal the number of processors used in the parallel computer if there were no communication among the processors, which is an impractical situation in parallel computing. So, due to the communication cost, the speed-up is always less than or equal to the number of processors used in the parallel computer.

According to Amdahl's law [3], it is very difficult for a parallel system to reach the ideal value Sp = p, because of the presence of some sequential code which cannot be parallelized and must be processed sequentially by a single processor. Suppose r is the part of a program that can be parallelized and the rest, s = 1 - r, is sequential in nature. Then the speed-up becomes

Sp = 1 / (s + r/p)

When p tends to infinity, Sp tends to 1/s, which implies that the maximum speed-up that can be achieved is less than or equal to 1/s, whatever the number of processors present in the system.

6.3. Efficiency
Efficiency measures how effectively the processors are used during the parallel execution. The efficiency can be formulated as

E = Sp / p

where p is the number of processors.

7. PARALLEL BENCHMARKS
Benchmarks fall into three categories: synthetic benchmarks, kernel benchmarks and real application benchmarks. A synthetic benchmark uses artificial code to exercise the basic functions of a machine; it compares the relative efficiency of processors. Examples of this category are the Whetstone benchmark, Dhrystone, wPrime etc. In kernel benchmarks a part of a large program is extracted, namely the part that is responsible for most of the execution time of that program. Examples are LINPACK, NAS etc. In real application benchmarks the code being measured is an application program itself. This is very effective in measuring overall system performance but needs more time and resources. Examples are the Perfect Benchmarks, the SPEC benchmarks etc. We present some of the benchmarks which are most often used to evaluate performance.

7.1. Whetstone Benchmark
The Whetstone benchmark was the first international benchmark in history [10]. It was intended to measure the performance of a computer system and to simulate floating point intensive application problems. It consists of nine small loops of statements of particular types such as integer arithmetic, floating point arithmetic, 'if' statements etc. It uses global variables, and a high percentage of its execution time is spent in mathematical library functions. The result of this benchmark is reported in MWIPS (Mega Whetstone Instructions Per Second).

7.2. Dhrystone Benchmark
It was built to measure the performance of non-numeric applications. It consists of a measurement loop; each loop includes twelve procedures and ninety-four statements, and one hundred and one statements are dynamically executed during one Dhrystone [18]. The Dhrystone benchmark does not contain any floating point operations, and most of its operations involve strings. These benchmarks are widely accepted by vendors because the working set of the program fits comfortably in the cache of modern computer machines.

7.3. LINPACK
These benchmarks are used to measure the performance of a computer system when a dense system of linear equations is solved by Gaussian elimination [11]. The benchmarks involve a high percentage of floating point calculation in double precision, and most of the execution takes place in just a 15-line subroutine of the program. The result of these benchmarks is expressed in MFLOPS. These benchmarks are widely used by the TOP500 and China TOP100 lists [7].
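As a rough illustration of the kind of computation the LINPACK benchmark measures, the following is a minimal C sketch, ours rather than the actual LINPACK code, that solves a dense system Ax = b by Gaussian elimination in double precision and reports an approximate MFLOPS figure using the standard (2/3)N^3 + 2N^2 operation count. The matrix size N and the random test data are illustrative choices.

```c
/* A minimal sketch (not LINPACK itself) of dense Gaussian elimination
 * in double precision, timed and reported in MFLOPS. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 512   /* assumed problem size for illustration */

static double A[N][N], b[N], x[N];

int main(void)
{
    /* Fill A and b with arbitrary values; make A diagonally dominant so
     * that elimination without pivoting stays numerically stable. */
    for (int i = 0; i < N; i++) {
        double row_sum = 0.0;
        for (int j = 0; j < N; j++) {
            A[i][j] = (double)rand() / RAND_MAX;
            row_sum += A[i][j];
        }
        A[i][i] += row_sum;                    /* enforce dominance */
        b[i] = (double)rand() / RAND_MAX;
    }

    clock_t start = clock();

    /* Forward elimination. */
    for (int k = 0; k < N - 1; k++) {
        for (int i = k + 1; i < N; i++) {
            double m = A[i][k] / A[k][k];
            for (int j = k; j < N; j++)
                A[i][j] -= m * A[k][j];
            b[i] -= m * b[k];
        }
    }
    /* Back substitution. */
    for (int i = N - 1; i >= 0; i--) {
        double s = b[i];
        for (int j = i + 1; j < N; j++)
            s -= A[i][j] * x[j];
        x[i] = s / A[i][i];
    }

    double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
    /* Nominal LINPACK operation count: (2/3)N^3 + 2N^2 floating point ops. */
    double flops = (2.0 / 3.0) * N * N * N + 2.0 * N * N;
    printf("N = %d, time = %.3f s, approx. %.1f MFLOPS\n",
           N, seconds, flops / seconds / 1e6);
    return 0;
}
```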
Table 1. Top ten supercomputers [13] (ranks 3-10)

Rank | Site | Computer | Country | Cores | Rmax (PFlop/s) | Power (MW)
3 | Oak Ridge National Laboratory | Cray Jaguar, Cray XT5, 6C 2.6 GHz | USA | 224162 | 1.759 | 6.95
4 | National Supercomputing Center, Shenzhen | Dawning Nebulae, TC3600 Blade, Intel X5650, NVidia Tesla C2050 GPU | China | 120640 | 1.271 | 2.58
5 | GSIC, Tokyo Institute of Technology | NEC/HP TSUBAME 2.0, HP ProLiant, Xeon 6C, NVidia, Linux/Windows | Japan | 73278 | 1.192 | 1.40
6 | DOE/NNSA/LANL/SNL | Cray Cielo, Cray XE6, 8C 2.4 GHz | USA | 142272 | 1.110 | 3.98
7 | NASA/Ames Research Center/NAS | SGI Pleiades, SGI Altix ICE 8200EX/8400EX | USA | 111104 | 1.088 | 4.10
8 | DOE/SC/LBNL/NERSC | Cray Hopper, Cray XE6, 6C 2.1 GHz | USA | 153408 | 1.054 | 2.91
9 | Commissariat a l'Energie Atomique (CEA) | Bull Tera 100, Bull bullx super-node S6010/S6030 | France | 138368 | 1.050 | 4.59
10 | DOE/NNSA/LANL | IBM Roadrunner, BladeCenter QS22/LS21 | USA | 122400 | 1.042 | 2.34
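As a small worked example of the performance parameters defined in Section 6, the short C program below computes the measured speed-up Sp = Ts/Tp, the efficiency E = Sp/p, and the speed-up predicted by Amdahl's law for a program whose sequential fraction is s = 1 - r. The sample values of Ts, Tp, p and r are ours, chosen purely for illustration.

```c
/* A worked example (illustrative values) of the Section 6 formulas. */
#include <stdio.h>

int main(void)
{
    double Ts = 100.0;    /* sequential execution time in seconds (assumed) */
    double Tp = 18.0;     /* parallel execution time on p processors (assumed) */
    int    p  = 8;        /* number of processors (assumed) */
    double r  = 0.95;     /* parallelizable fraction of the program (assumed) */
    double s  = 1.0 - r;  /* sequential fraction */

    double Sp     = Ts / Tp;            /* measured speed-up */
    double E      = Sp / p;             /* efficiency */
    double amdahl = 1.0 / (s + r / p);  /* Amdahl's law prediction */
    double limit  = 1.0 / s;            /* upper bound as p grows without limit */

    printf("speed-up Sp = %.2f, efficiency E = %.2f\n", Sp, E);
    printf("Amdahl prediction for p = %d: %.2f (limit 1/s = %.1f)\n",
           p, amdahl, limit);
    return 0;
}
```

The measured speed-up (about 5.6) stays below the Amdahl prediction (about 5.9) because of communication cost, matching the discussion in Section 6.2.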
[14] Hwang, K. et al. "Scalable Parallel Computing". McGraw-Hill, 1998.
[15] Jaja, J. "An Introduction to Parallel Algorithms". Addison-Wesley, 1992.
[16] Protic, J. et al. "Distributed Shared Memory: Concepts and Systems". IEEE Parallel and Distributed Technology, Summer 1996.
[17] Valiant, L. "A Bridging Model for Parallel Computation". Communications of the ACM, vol. 33, pp. 103-111, 1990.
[18] Weicker, R.P. "Dhrystone: A Synthetic Systems Programming Benchmark". Communications of the ACM, vol. 27, no. 10, pp. 1013-1030, October 1984.