Unit 4 parallel computing
In the above example of dense matrix multiplication, the computation is divided among the
available processors. Each processor operates on the data stream allocated to it and accesses
the memory unit for read and write operations. As shown in the above figure, data stream 1
is allocated to processor 1; once processor 1 finishes its computation, the result is stored in the
memory unit.
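Below is a minimal sketch of this data-parallel style in C, assuming OpenMP is available. The rows of the output matrix play the role of the data streams divided among processors; the function name matmul and the size N are illustrative, not part of the original example.

#include <omp.h>

#define N 512

/* Data-parallel sketch: the rows of the result matrix are divided
 * among the available processors, and each processor computes its
 * share independently, writing results back to memory. */
void matmul(const double A[N][N], const double B[N][N], double C[N][N])
{
    /* Each outer-loop iteration is an independent "data stream";
     * OpenMP assigns blocks of rows to the available threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
    }
}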
2. The Task Graph Model
Parallel algorithms use a task dependency graph to describe the computations they perform.
The interrelationships among tasks captured in this graph can therefore be exploited to reduce
interaction costs.
This model is effective for solving problems in which tasks are associated with a large amount
of data relative to the actual computation. Parallelism described by a task dependency graph in
which each task is independent is known as task parallelism. The task graph model is widely
used to implement parallel quicksort, a parallel algorithm based on divide and conquer.
Example: Finding the minimum number
In the above example of finding the minimum number, the task graph model works in parallel
to find the minimum of the given stream. As shown in the above figure, one process computes the
minimum of 23 and 12 and passes it on; at the same time, another process computes the minimum
of 9 and 30 and passes it on. This tree-structured approach reduces the number of sequential
comparison steps.
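The plain-C sketch below mirrors that task structure. It runs sequentially; the comments mark where the task dependency graph allows parallelism. The function name tree_min is hypothetical.

#include <stddef.h>

/* Pairwise (tree-structured) minimum. The two recursive calls operate
 * on disjoint halves of the array, so they are independent tasks in
 * the task dependency graph and could run on different processors;
 * only the final comparison depends on both results. */
int tree_min(const int *a, size_t n)
{
    if (n == 1)
        return a[0];
    /* These two sub-tasks have no mutual dependency. */
    int left  = tree_min(a, n / 2);
    int right = tree_min(a + n / 2, n - n / 2);
    return left < right ? left : right;
}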
3. Work Pool Model
The work pool model is also known as the task pool model. It uses a dynamic mapping
approach for task assignment in order to handle load balancing. Some processes or tasks are
small and require little time, whereas others are large and require more processing time. To
avoid the resulting inefficiency, load balancing is required.
A pool of tasks is created, and tasks are allocated to processes that are idle at runtime. The
work pool model can be used in the message-passing approach when the data associated with
a task is small compared to the computation it requires. In this model, tasks can be moved
around without causing much interaction overhead.
Example: Parallel tree search
In the above example, the parallel tree search uses the work pool model with four processors
working simultaneously. The four subtrees are allocated to the four processors, which carry out
the search operation.
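Here is a minimal work-pool sketch in C using POSIX threads, under the assumption that tasks are just integer indices. The shared counter next_task stands in for the task queue, and idle workers grab the next task under a mutex; the names worker, process, NUM_TASKS, and NUM_WORKERS are all illustrative.

#include <pthread.h>

#define NUM_TASKS   16
#define NUM_WORKERS 4

/* Shared pool of tasks; next_task is the dynamic-mapping index
 * that idle workers advance under a mutex. */
static int tasks[NUM_TASKS];
static int next_task = 0;
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

static void process(int task) { (void)task; /* placeholder for real work */ }

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&pool_lock);
        int t = next_task < NUM_TASKS ? next_task++ : -1;
        pthread_mutex_unlock(&pool_lock);
        if (t < 0)
            break;              /* pool is empty */
        process(tasks[t]);      /* idle worker picked up the next task */
    }
    return NULL;
}

int main(void)
{
    pthread_t w[NUM_WORKERS];
    for (int i = 0; i < NUM_TASKS; i++)
        tasks[i] = i;
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_create(&w[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(w[i], NULL);
    return 0;
}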
4. Master-Slave Model
The master-slave model is also known as the manager-worker model. The work is divided
among the processes. In this model there are two different types of processes: master processes
and slave processes. One or more processes act as the master, and all the remaining processes
act as slaves.
The master allocates tasks to the slave processes according to the requirements. The allocation
depends on the size of the task: if the size of a task can be determined in advance, the master
allocates it to the appropriate processes.
If the size of a task cannot be determined in advance, the master allocates some of the work to
every process at different times. The master-slave model works most efficiently when the work
must be done in different phases, with the master assigning different slaves to tasks in each
phase. The master is responsible for allocating tasks and for synchronizing the activities of the
slaves. The master-slave model is generally efficient and is used with both shared-address-space
and message-passing paradigms.
Example: Distribution of workload across multiple slave nodes by the master process
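A hedged MPI sketch of this distribution pattern follows: rank 0 plays the master, handing one integer task to each slave and collecting the results. The tags, the task payload, and the placeholder computation are all illustrative.

#include <mpi.h>
#include <stdio.h>

#define TAG_TASK 1
#define TAG_DONE 2

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* Master: hand one task to each slave, then collect results. */
        for (int dest = 1; dest < size; dest++) {
            int task = dest;    /* stand-in for real task data */
            MPI_Send(&task, 1, MPI_INT, dest, TAG_TASK, MPI_COMM_WORLD);
        }
        for (int src = 1; src < size; src++) {
            int result;
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, TAG_DONE,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("master received %d\n", result);
        }
    } else {
        /* Slave: receive a task, do the work, report back. */
        int task, result;
        MPI_Recv(&task, 1, MPI_INT, 0, TAG_TASK, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        result = task * task;   /* placeholder computation */
        MPI_Send(&result, 1, MPI_INT, 0, TAG_DONE, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}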
5. Pipeline Model
In the pipeline model, a stream of data is passed through a succession of processes, each of
which performs some task on it.
Example: Parallel LU factorization
As shown in the above diagram, the parallel LU factorization algorithm uses the pipeline model.
In this model, the producer reads the input matrix and generates the tasks required for computing
the LU factorization. The producer divides the input matrix into multiple smaller tasks and
places them in a shared task queue. The consumers then retrieve these blocks and perform the
LU factorization on each independent block.
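Below is a minimal producer-consumer sketch in C with POSIX threads, assuming a bounded shared queue: the producer enqueues block indices, and the consumer "factors" each block (a printf placeholder). The queue size, block count, and all names are illustrative.

#include <pthread.h>
#include <stdio.h>

#define QUEUE_SIZE 8
#define NUM_BLOCKS 32

/* Bounded task queue shared by the producer and consumer stages. */
static int queue[QUEUE_SIZE];
static int head, tail, count;
static int done;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;

static void *producer(void *arg)
{
    (void)arg;
    /* Producer stage: split the input into blocks and queue them. */
    for (int b = 0; b < NUM_BLOCKS; b++) {
        pthread_mutex_lock(&lock);
        while (count == QUEUE_SIZE)
            pthread_cond_wait(&not_full, &lock);
        queue[tail] = b;
        tail = (tail + 1) % QUEUE_SIZE;
        count++;
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&lock);
    }
    pthread_mutex_lock(&lock);
    done = 1;                   /* no more blocks will arrive */
    pthread_cond_broadcast(&not_empty);
    pthread_mutex_unlock(&lock);
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    /* Consumer stage: process each block as it becomes available. */
    for (;;) {
        pthread_mutex_lock(&lock);
        while (count == 0 && !done)
            pthread_cond_wait(&not_empty, &lock);
        if (count == 0 && done) {
            pthread_mutex_unlock(&lock);
            return NULL;
        }
        int b = queue[head];
        head = (head + 1) % QUEUE_SIZE;
        count--;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&lock);
        printf("factoring block %d\n", b);  /* placeholder work */
    }
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}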
6. Hybrid Model
A hybrid model is a combination of more than one parallel model. The combination can be
applied sequentially or hierarchically to the different phases of a parallel algorithm. For each
phase, the model that performs that phase's task most efficiently is selected.
Example: A combination of the master-slave, work pool, and task graph models.
The Beowulf cluster that you will be writing, compiling, and running your MPI programs on is
called a distributed memory system. In this system we have a master node, a computer that you
log into. Connected to the master node is a network of several other nodes. When you run your
MPI program on the master node, the master node runs the same program on each one of the
nodes in the cluster. This way we have access to the processor and memory of each node. We
can also transfer data between nodes, giving the illusion of one giant computer. See the
illustration below for an example.
[Figure: master node connected to the other nodes through a network]
A program that runs on a node is called a process. When your program is run, a process is run
on each processor in the cluster. These processes communicate with each other using a system
of message passing. These messages are packets of data that are put into envelopes containing
routing information. The message-passing system allows us to copy data from the memory of
one process to another. Here is an illustration:
Communication of messages requires that both processes cooperate in a send and receive operation. The
transfer of data is called a send and the receiving of data by a process is called a receive.
There are two different kinds of buffers in MPI. The application buffer is where the data for
each process is held in memory; it is the address space that holds the data to be sent and
received. The system buffer is used when messages need to be stored, and whether it is used
depends on the type of communication method being used. The system buffer allows us to send
messages in asynchronous mode: an asynchronous send operation is allowed to complete even
though the receiving process may not yet have received the message. In synchronous mode, a
send completes only when the receiving process acknowledges that the message was received.
Above is an illustration that sends data from process 1 to process 2. The variable in the
application buffer is sent through the network and copied into the system buffer on the receiving
process. The data in the receiving system buffer is then copied into the process's application
buffer. There are two methods for sending and receiving, sketched in code below:
Blocking – In blocking communication, a call returns only when it is safe to proceed. A send
returns once the data in the application buffer has been copied out (for example, to a system
buffer), so the application buffer is available for reuse. A receive returns once the data has been
copied into the receive buffer and is ready to be used.
Non-blocking – In non-blocking communication, a call returns immediately without waiting for
the data transfer to complete; the program must later test or wait for completion before reusing
the buffers.
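Here is a minimal blocking send/receive example in C MPI; the value 42 and the tag 0 are arbitrary choices. MPI_Send returns once the application buffer is reusable, and MPI_Recv returns once the data has landed in it.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Blocking send: returns once 'value' can safely be reused,
         * i.e. after the data has been copied out of the application
         * buffer (possibly into a system buffer). */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Blocking receive: returns only after the data has arrived
         * in the application buffer 'value'. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("process 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}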
MPI needs a way to identify all the different processes that run in a parallel program. To do this
we have something called a rank: an integer assigned to each process when it initializes. The
programmer can use the rank to specify a destination or source for sending and receiving
messages. The rank starts at zero and increases by one for every running process. A
communicator is an object that MPI uses to group collections of processes that are allowed to
communicate with each other. All the processes available to us when we begin our MPI program
are ranked and grouped into one single communicator called MPI_COMM_WORLD.
MPI_COMM_WORLD is the default group when the MPI program is initialized; we can then
divide it into separate groups to work with.
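As a sketch of dividing MPI_COMM_WORLD, the example below uses MPI_Comm_split to place even-ranked and odd-ranked processes into two separate communicators; the even/odd "color" is just an illustrative choice.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int world_rank, world_size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* Split MPI_COMM_WORLD into two groups: processes with even
     * world ranks and processes with odd world ranks. Ranks are
     * renumbered from zero inside each new communicator. */
    MPI_Comm subgroup;
    MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &subgroup);

    int sub_rank;
    MPI_Comm_rank(subgroup, &sub_rank);
    printf("world rank %d of %d -> subgroup rank %d\n",
           world_rank, world_size, sub_rank);

    MPI_Comm_free(&subgroup);
    MPI_Finalize();
    return 0;
}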
Shared memory allows database server threads and processes to share data, which can
reduce memory usage and disk I/O.
CUDA programming
Shared memory is a CUDA memory space that all threads in a thread block can access. Every
thread in the block can read and write it, and writes become visible to the other threads in the
block (typically after a __syncthreads() barrier).
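A small CUDA C kernel sketch follows, showing a block-level sum reduction staged through shared memory. It assumes a block size of 256 threads, and the names sum_block and buf are illustrative.

__global__ void sum_block(const float *in, float *out)
{
    /* __shared__ places this array in the block's shared memory;
     * every thread in the block can read and write it.
     * Assumes the kernel is launched with 256 threads per block. */
    __shared__ float buf[256];

    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();            /* make all writes visible to the block */

    /* Tree reduction within the block, entirely in shared memory. */
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = buf[0];   /* one result per block */
}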
Parent and child processes
The shared flag can be set when mapping a block of memory so that it is shared between a
parent and child process. This allows the processes to communicate with each other without
using signals, pipes, or files.
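A minimal C sketch on a POSIX system is shown below: mmap with MAP_SHARED (combined here with MAP_ANONYMOUS, a common extension) creates a page that survives fork(), so the child's write is visible to the parent.

#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* MAP_SHARED | MAP_ANONYMOUS: the page is shared with any child
     * created by fork(), so writes by one process are seen by the
     * other without signals, pipes, or files. */
    int *shared = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED)
        return 1;
    *shared = 0;

    if (fork() == 0) {          /* child */
        *shared = 42;           /* visible to the parent */
        _exit(0);
    }
    wait(NULL);                 /* parent waits for the child */
    printf("parent sees %d\n", *shared);    /* prints 42 */
    munmap(shared, sizeof(int));
    return 0;
}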