
Unit 4: Parallel Computing

What is parallel programming, and how does it work?


Parallel programming is often used interchangeably with the terms parallel processing, parallel
computing, and parallel computing solutions. Parallel programming is like running 10 burrito-rolling
stations in parallel instead of slowly making 100 burritos yourself. In computer science terms, parallel
programming is the process of splitting a problem into smaller tasks that can be executed at the
same time, in parallel, using multiple computing resources. In other words, parallel programming
lets programmers run large-scale projects that demand both speed and accuracy.
You can use parallel processing techniques on a wide range of devices, from mobile phones to laptops to
supercomputers. Different programming languages rely on different technologies to enable parallelism.
Open Multi-Processing (OpenMP) provides a cross-platform API for developing parallel applications
in C, C++, and Fortran across the cores of a single CPU.
Technologies such as the Message Passing Interface (MPI), on the other hand, enable parallel processes
to run across different computers or nodes.
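As a rough illustration, here is a minimal OpenMP sketch in C (the array contents and size are made up for the example): the same loop body runs on several cores of one CPU at once, and the reduction clause combines the per-thread partial sums.

#include <stdio.h>
#include <omp.h>

int main(void) {
    double data[1000], total = 0.0;
    for (int i = 0; i < 1000; i++)
        data[i] = i * 0.5;

    /* Each thread sums a chunk of the array in parallel;
       reduction(+:total) merges the per-thread partial sums. */
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < 1000; i++)
        total += data[i];

    printf("total = %f using up to %d threads\n", total, omp_get_max_threads());
    return 0;
}

Compiled with an OpenMP-capable compiler (for example gcc -fopenmp), the loop is split across the available cores automatically.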
Parallel programming is not limited to data parallelism, however. We can also spread code execution across
several tasks, distributing those tasks across different threads and different processors. By doing so, we
increase the resources available to the program and thus its capacity for work. In short, we get things done
faster.
Parallel Algorithm Models
A parallel algorithm model captures the strategy used to partition the data and the way in which that data
is processed. Every model therefore provides a proper structure based on two techniques. They are as follows:
1. Selection of proper partitioning and mapping techniques.
2. Proper use of a strategy to reduce interaction.
Types of Parallel Models
1. The Data-Parallel Model
The data-parallel model is the simplest of the parallel algorithm models.
In this model, the tasks that need to be carried out are identified first and then mapped to processes.
This mapping of tasks onto processes is done statically or semi-statically. Every process performs the
same task, but the data on which it operates is different.
The problem to be solved is divided into a number of tasks on the basis of data partitioning. Data
partitioning is used because the operations performed by each process are similar, so uniform
partitioning of the data followed by static mapping ensures proper load balancing.
Example: Dense Matrix Multiplication
Figure: Dense matrix multiplication

In the above example of dense matrix multiplication, the data is divided among the available processors.
Each processor computes on the data stream allocated to it and accesses the memory unit for read and
write operations. As shown in the figure, data stream 1 is allocated to processor 1; once processor 1
completes its calculation, the result is stored in the memory unit.
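A minimal sketch of the data-parallel idea in C with OpenMP (the matrix size and contents are chosen only for illustration): every thread executes the same multiply-accumulate loop, but on a different block of rows.

#include <stdio.h>
#define N 4

int main(void) {
    double a[N][N], b[N][N], c[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = i + j;          /* sample input matrix A */
            b[i][j] = (i == j);       /* B is the identity matrix */
        }

    /* Static data partitioning: the rows of C are divided evenly among
       threads, so every thread runs identical code on different data. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            c[i][j] = 0.0;
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
        }

    printf("c[0][0] = %f\n", c[0][0]);
    return 0;
}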
2. The Task Graph Model
In this model, a parallel algorithm describes the computation it performs with a task dependency graph.
The interrelationships among the tasks in this graph can then be used to reduce interaction costs.
This model is effective for solving problems in which the amount of data associated with the tasks is
large compared to the actual computation. Parallelism described by a task dependency graph in which
each task is independent is known as task parallelism. The task graph model is widely used to
implement parallel quicksort, a parallel algorithm based on divide and conquer.
Example: Finding the minimum number

Figure: Finding the minimum number

In the above example of finding the minimum number, the task graph model works in parallel to find
the minimum of the given stream. As shown in the figure, one process computes the minimum of 23 and
12 and passes it on, while at the same time another process computes the minimum of 9 and 30 and
passes it on. The partial results are then combined, so the computation takes less time and effort.
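A small sketch of this tree of tasks in C with OpenMP sections (using the numbers from the example above): two independent tasks each reduce one pair, and a final step combines their results.

#include <stdio.h>

static int min2(int a, int b) { return a < b ? a : b; }

int main(void) {
    int stream[4] = {23, 12, 9, 30};
    int left, right;

    /* The two sections are independent tasks in the task graph,
       so they can execute at the same time on different threads. */
    #pragma omp parallel sections
    {
        #pragma omp section
        left = min2(stream[0], stream[1]);    /* min(23, 12) */
        #pragma omp section
        right = min2(stream[2], stream[3]);   /* min(9, 30) */
    }

    printf("minimum = %d\n", min2(left, right));
    return 0;
}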
3. Work Pool Model
The work pool model is also known as the task pool model. It uses a dynamic
mapping approach for task assignment in order to handle load balancing. Some tasks are small and
need little processing time, while others are large and need more time; load balancing is required to
avoid the resulting inefficiency.
A pool of tasks is created, and tasks from the pool are allocated to processes that become idle at run time.
The work pool model can be used with the message-passing approach when the data associated with a
task is smaller than the computation required for the task, so tasks can be moved without causing much
interaction overhead.
Example: Parallel tree search

Figure: Parallel tree search

The parallel tree search above uses the work pool model with four processors working simultaneously:
the four sub-trees are allocated to the four processors, which carry out the search operation.
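A minimal work-pool sketch in C with OpenMP (the task costs are invented for the example): tasks of uneven size sit in a shared pool, and schedule(dynamic) hands the next task to whichever thread becomes idle, which approximates the dynamic mapping described above.

#include <stdio.h>

int main(void) {
    int cost[8] = {1, 5, 2, 7, 1, 3, 6, 2};   /* uneven task sizes (made up) */

    /* schedule(dynamic) acts as a task pool: an idle thread grabs the
       next unprocessed iteration instead of a fixed, static share. */
    #pragma omp parallel for schedule(dynamic)
    for (int t = 0; t < 8; t++) {
        double x = 0.0;
        for (long i = 0; i < cost[t] * 1000000L; i++)   /* stand-in for real work */
            x += i * 0.5;
        printf("task %d done (x = %f)\n", t, x);
    }
    return 0;
}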
4. Master-Slave Model
The master-slave model is also known as the manager-worker model. In this model, the work is divided
among the processes, and there are two kinds of processes: master processes and slave processes. One or
more processes act as masters and all the remaining processes act as slaves.
The master allocates tasks to the slave processes according to the requirements. The allocation of tasks
depends on their size: if the size of a task can be estimated in advance, the master allocates it to an
appropriate process. If the size of a task cannot be estimated in advance, the master hands out portions
of the work to every process at different times. The master-slave model works especially well when the
work has to be done in different phases, with the master assigning different slaves to perform tasks in
each phase. The master is responsible for allocating the tasks and synchronizing the activities of the
slaves. The master-slave model is generally efficient and is used with both the shared-address-space and
message-passing paradigms.
Example: Distribution of workload across multiple slave nodes by the master process

Figure: Distribution of workload across multiple slave nodes by the master process


As shown in the above example of the master-slave model, the workload is distributed across multiple
processes: one node is the master process, which allocates the workload to the other four slave
processes. In this way, the sub-computations are carried out by the slave processes in parallel.
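A minimal master-worker sketch using MPI in C (the work items, message tags, and the squaring step are invented for illustration): rank 0 acts as the master that hands out tasks and collects results, and every other rank acts as a worker. It could be launched with something like mpirun -np 5 ./master_slave.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                       /* master: allocate tasks */
        for (int w = 1; w < size; w++) {
            int task = w * 10;             /* a stand-in work item */
            MPI_Send(&task, 1, MPI_INT, w, 0, MPI_COMM_WORLD);
        }
        for (int w = 1; w < size; w++) {   /* collect the results */
            int result;
            MPI_Recv(&result, 1, MPI_INT, w, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("master got %d from worker %d\n", result, w);
        }
    } else {                               /* slave/worker: compute and reply */
        int task, result;
        MPI_Recv(&task, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        result = task * task;
        MPI_Send(&result, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}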
5. The Pipeline Model
The pipeline model is also known as the producer-consumer model. It is based on passing a stream of
data through a succession of processes. A data item is handled by the processes in sequence: as soon as
one process finishes with it, the item moves on to the next process in the chain. In this model, the
pipeline acts as a chain of producers and consumers.
This pipeline of producers and consumers can also be arranged as a directed graph rather than a linear
chain. Static mapping is used to map tasks onto the processes.
Example: Parallel LU factorization algorithm

Figure: Parallel LU factorization algorithm

As shown in the diagram, the parallel LU factorization algorithm uses the pipeline model. The producer
reads the input matrix and generates the tasks required to compute the LU factorization. It divides the
input matrix into multiple smaller blocks and places them in a shared task queue; the consumers then
retrieve these blocks and perform the LU factorization on each independent block.
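A minimal pipeline sketch in MPI in C (the item values and the +1 transformation are purely illustrative, and this is not the LU factorization itself): each rank is one pipeline stage, receiving an item from its predecessor, processing it, and forwarding it to its successor.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int item = 0; item < 4; item++) {
        int value;
        if (rank == 0)
            value = item;                       /* producer stage */
        else
            MPI_Recv(&value, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        value += 1;                             /* this stage's processing */

        if (rank < size - 1)
            MPI_Send(&value, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
        else
            printf("item %d left the pipeline as %d\n", item, value);  /* consumer */
    }
    MPI_Finalize();
    return 0;
}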
6. Hybrid Model
A hybrid model is a combination of more than one parallel model. The models can be applied
sequentially or hierarchically to the different phases of the parallel algorithm, and the model that
performs a given phase most efficiently is selected for that phase.
Example: A combination of the master-slave, work pool, and task graph models.

Figure: A combination of the master-slave, work pool, and task graph models


In the hybrid model shown above, a different model is used in each phase: the master-slave model, the
work pool model, and the task graph model. The master-slave model is used for the data transformation
task, with the master process distributing the task to multiple slave processes for parallel computation.
In the second phase, the work pool model is used for data analysis, and the task graph model is then
used for the data visualization. In this way, the operation is carried out in multiple phases, using a
different parallel algorithm model at each phase.

2.1 Message Passing Paradigm

The Beowulf cluster that you will be writing, compiling, and running your MPI programs on is called
a distributed memory system. In this system we have a master node, the computer that you log into.
Connected to the master node is a network of several other nodes. When you run your MPI program on
the master node, the master node starts the same program on each of the nodes in the cluster. This way
we have access to the processor and memory of each node. We can also transfer data between the
nodes, giving the illusion of one giant computer, as in the illustration below.

Figure: Distributed memory system (a master node connected to several other nodes over a network)

A program that runs on a node is called a process. When your program is run, a process is started on each
processor in the cluster. These processes communicate with each other using a system of message passing.
The messages are packets of data that are put into envelopes containing routing information. The
message-passing system allows us to copy data from the memory of one process to the memory of another.
Communication requires that both processes cooperate in a send and receive operation: the transfer of
data out of a process is called a send, and the acceptance of data by a process is called a receive.
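The following is a minimal MPI sketch in C of this cooperating send and receive (the payload value and tag are made up): process 0 sends one integer and process 1 receives it, typically launched with something like mpirun -np 2.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int payload = 42;
        /* The send copies data out of process 0's application buffer. */
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int payload;
        /* The receive places the data into process 1's application buffer. */
        MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d\n", payload);
    }
    MPI_Finalize();
    return 0;
}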

2.2 Sending and Receiving

There are two kinds of buffers in MPI. The application buffer is where the data of each process is held
in memory; it is the address space that holds the data to be sent and received. The system buffer is used
when messages need to be stored, and whether it is used depends on the type of communication being
performed. The system buffer is what allows us to send messages in asynchronous mode: an
asynchronous send operation is allowed to complete even though the receiving process may not yet
have received the message. In synchronous mode, a send completes only when the receiving process
acknowledges that the message has been received.

Consider sending data from process 1 to process 2. The variable in the sender's application buffer
travels through the network and is copied into the system buffer of the receiving process; the data in
that system buffer is then copied into the receiving process's application buffer. There are two methods
for sending and receiving:
Blocking – In blocking communication, a call returns only once certain events have happened. A
blocking send returns after the data in the application buffer has been copied out (for example to a
system buffer), so the buffer is safe to reuse. A blocking receive returns after the data has been
copied into the receive buffer and is ready to be used.

Non-Blocking – In non-blocking communication, a send completes without waiting for
the receiving process to finish. This allows computation to overlap communication, but
keep in mind that it is not safe to modify or reuse the application buffer immediately after a
non-blocking send. It is up to the programmer to test whether the operation has completed and
the application buffer is free for reuse.
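A small sketch in C contrasting the two methods (buffer contents and tags are invented): MPI_Isend and MPI_Irecv return immediately, each process may overlap other work with the communication, and MPI_Wait is the test that the operation has completed and the buffer is safe to reuse.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, value = 0;
    MPI_Request req;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 7;
        MPI_Isend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        /* ... computation that does NOT touch `value` can overlap here ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* only now is `value` safe to reuse */
    } else if (rank == 1) {
        MPI_Irecv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
        /* ... other work can overlap the incoming message ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* the receive is now complete */
        printf("received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}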

2.3 Communicators and Groups

MPI needs a way to identify all the different processes that will run in a parallel program. For this we
have the rank: an integer assigned to each process when it initializes. The programmer can use the rank
to specify a destination or source when sending and receiving messages. Ranks start at zero and
increase by one for every running process. A communicator is an object that MPI uses to group
collections of processes that are allowed to communicate with each other. All the processes available
when an MPI program begins are ranked and grouped into one communicator called
MPI_COMM_WORLD. MPI_COMM_WORLD is the default group when the MPI program is
initialized; we can then divide it into separate groups to work with.
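A short sketch in C of ranks and communicators (the even/odd split is just an example): each process reports its rank in MPI_COMM_WORLD and is then placed into one of two smaller communicators with MPI_Comm_split.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int world_rank, world_size, sub_rank;
    MPI_Comm sub_comm;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* Processes with the same "color" end up in the same new communicator. */
    int color = world_rank % 2;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &sub_comm);
    MPI_Comm_rank(sub_comm, &sub_rank);

    printf("world rank %d of %d -> rank %d in sub-communicator %d\n",
           world_rank, world_size, sub_rank, color);

    MPI_Comm_free(&sub_comm);
    MPI_Finalize();
    return 0;
}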

Shared memory programming


Shared memory programming is a method of allowing multiple programs to access the same
memory simultaneously. This can be useful for communication between programs and for avoiding
redundant copies of data.
Here are some examples of shared memory programming:
 Database servers

Shared memory allows database server threads and processes to share data, which can
reduce memory usage and disk I/O.

 CUDA programming

Shared memory is a CUDA memory space that all threads in a thread block can access. This
allows all threads in the block to read and write to the shared memory, and all changes are
available to all threads in the block.
 Parent and child processes
A block of memory can be mapped with the shared flag (MAP_SHARED) so that it is shared between a
parent and child process. This lets the processes communicate with each other without using signals,
pipes, or files; a minimal sketch of this case follows below.
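Below is a minimal sketch in C of that parent/child case (Linux-style anonymous mapping, error handling omitted): a block mapped with MAP_SHARED survives fork(), so a value written by the child is visible to the parent without pipes, signals, or files.

#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    /* Anonymous shared mapping, inherited across fork(). */
    int *shared = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    *shared = 0;

    if (fork() == 0) {        /* child writes into the shared page */
        *shared = 123;
        return 0;
    }
    wait(NULL);               /* parent waits, then reads the child's value */
    printf("parent sees %d\n", *shared);
    munmap(shared, sizeof(int));
    return 0;
}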
