PDC Summers Finals Revision Notes

The document provides an overview of key concepts in parallel and distributed computing including parallel computing, distributed computing, goals of parallel computing, Moore's law and parallelism, and parallelization strategies. It also discusses communication, synchronization, granularity, and other important concepts.

PDC Intro and Overview-


Parallel Computing:

- Computation with multiple calculations or executions carried out simultaneously.
- Addresses limitations of serial computing by using multiple processors.
- Multi-processor and multi-core processor architectures.
- Communication between processors is crucial.

Distributed Computing:

- Hardware and software components on networks communicate through message passing.
- Utilizes network media for communication.

Parallel vs Distributed Computing:

- Comparison based on processors, shared vs distributed memory, and communication methods (buses vs message passing).

Goals of Parallel Computing:

- Solve large problems more quickly.
- Parallelism allows solving fixed-size problems faster.
- Granularity determines the size of units being parallelized.

Moore's Law and Parallelism:

- Observation by Gordon Moore: transistor count doubles roughly every two years.
- More transistors mean more parallelism opportunities.
- Implications for power consumption, heat dissipation, and fabrication challenges.

Multi-core Era:

- Transition from single-core processors due to physical limitations.
- Multi-core processors with multiple cores on a single chip.
- Need to rethink algorithms for parallelism.

Parallelization Strategies:

1. Problem Understanding : Identify hotspots, understand the computation.

2. Partitioning/Decomposition : Divide work into chunks, using domain or functional decomposition.

3. Assignment : Compose fine-grained computations into tasks, considering load balance and communication.

4. Orchestration & Mapping : Consider communication, synchronization, data locality, and system aspects.

Communication and Synchronization:

- Communication involves data exchange between tasks.
- Trade-offs between latency and bandwidth.
- Synchronization coordinates tasks' execution.
- Types include barrier, lock/semaphore, synchronous communication.

Collective Communications:

- Involve data sharing among multiple tasks.

Synchronization Types:

1. Barrier : All tasks wait until a predefined point, then resume.

2. Lock/Semaphore : Serialize access to shared resources.

3. Synchronous Communication : Requires acknowledgment before data transfer.


THAT WAS SUMMARY, NOW KEY CONCEPTS REVISION-

Granularity:
Granularity refers to the size of the tasks that are being parallelized. It is the
ratio of computation to communication in a parallel program. Fine-grained
granularity involves smaller tasks with relatively more communication, while
coarse-grained granularity involves larger tasks with relatively less
communication.

Communication and Computation:


- Communication refers to the exchange of data or messages between tasks or
processes in a parallel program.
- Computation is the actual processing or calculation performed by each task or
process.

Synchronization:
Synchronization involves coordinating the execution of tasks to ensure they
work correctly together. It prevents race conditions and ensures that tasks
proceed in an orderly manner.

Synchronization Overhead:
Synchronization overhead is the time and resources spent on coordinating tasks.
High synchronization overhead can hinder performance improvement.

Communication Events:
Communication events are points in a parallel program where data is exchanged
between tasks. They include sending and receiving data.

Fine-Grained Parallelism:
Fine-grained parallelism involves breaking down a problem into small tasks that
require frequent communication. It has a low computation to communication
ratio.

High Communication Overhead:


High communication overhead occurs when a significant portion of the
program's execution time is spent on communication rather than computation.

Effect of Communication Overhead on Performance:


High communication overhead can lead to reduced performance enhancement in
parallel programs, as time spent on communication outweighs the benefits of
parallel execution.

Types of Overheads:
Overheads in parallel programming include computation overhead,
communication overhead, synchronization overhead, and resource overhead.

Coarse-Grained Parallelism:
Coarse-grained parallelism involves larger tasks with less frequent
communication. It has a higher computation to communication ratio.

Load Balancing:
Load balancing ensures that tasks are distributed evenly among processors to
avoid underutilization or overloading of resources.

Load Imbalance Causes:


Load imbalance can be caused by unequal task sizes, variations in input data, or
differences in task execution times.

Domain vs Functional Decomposition:


- Domain decomposition involves dividing the problem into subdomains that
can be solved concurrently.
- Functional decomposition involves breaking down the problem based on
different functions or operations.

Data Decomposition:
Data decomposition divides the problem into chunks, where each chunk operates on a different subset of the data.

Block vs Cyclic Decomposition:
- Block decomposition assigns contiguous blocks of data to tasks.
- Cyclic decomposition assigns data in a round-robin manner to tasks.

1D and 2D Data Decomposition:


1D decomposition divides data along a single dimension. 2D decomposition
divides data along two dimensions, often used in matrices.

Cyclic Data Decomposition:


Cyclic data decomposition assigns data to tasks in a circular pattern.

Data vs Domain Decomposition:


- Data decomposition focuses on distributing data.
- Domain decomposition focuses on distributing different subproblems.

Functional Decomposition:
Functional decomposition breaks down a problem into separate functions or
tasks.

Considerations for Task Composition and Assignment:


Consider load balance, communication patterns, synchronization, data locality,
and system-related factors like NUMA architecture.

NUMA (Non-Uniform Memory Access):


NUMA is a computer architecture where multiple processors have their own
local memory, and accessing remote memory has varying latencies.

Communication Cost:
Communication cost refers to the time and resources spent on data exchange between tasks.

Inter-Task Communication Overhead:
Inter-task communication overhead arises from the time spent on sending and receiving data between tasks.

Bandwidth and Latency:


Bandwidth is the amount of data that can be transmitted in a given time, while
latency is the time delay in data transmission.

Visibility of Communication:
- Message passing (distributed memory) model: communication is explicit and highly visible, since the programmer codes the sends and receives.
- Shared memory model: communication occurs implicitly through shared variables, so it is largely invisible to the programmer.

Shared Memory Model:


In a shared memory model, multiple processors access a common memory space,
enabling communication through shared variables.


NOW SOME QS ANSWER TESTING!



---

1. Granularity:
*Heading: Understanding Granularity in Parallel Programming*

- Question : What is granularity in parallel programming, and how is it related to communication and computation ratios?

- Answer : Granularity refers to the size of tasks in a parallel program. It is the ratio of computation to communication. Fine-grained granularity involves smaller tasks with relatively more communication, while coarse-grained granularity involves larger tasks with relatively less communication.

2. Communication and Computation:

*Heading: Differentiating Communication and Computation in Parallel Programming*

- Question : What are communication and computation in parallel programming?

- Answer : Communication refers to the exchange of data or messages between tasks, while computation involves the actual processing or calculation performed by each task.

3. Synchronization:

*Heading: Understanding Synchronization in Parallel Programming*

- Question : What is synchronization in parallel programming, and how does it relate to coordinating tasks?
- Answer : Synchronization involves coordinating the execution of tasks to ensure orderly processing and prevent race conditions. It ensures tasks work together as intended.

*Subheading: Types of Synchronization:*

- Question : What are some types of synchronization mechanisms used in parallel programming?

- Answer : Types of synchronization include barriers (all tasks wait until a point), locks/semaphores (serialize access), and synchronous communication (acknowledgment before data transfer).

4. Synchronization Overhead:

*Heading: Exploring Synchronization Overhead in Parallel Programs*

- Question : What is synchronization overhead, and how can high synchronization overhead affect program performance?

- Answer : Synchronization overhead is the time and resources spent on coordination. High synchronization overhead can hinder performance improvement in parallel programs.

5. Communication Events:

*Heading: Understanding Communication Events in Parallel Programming*

- Question : What are communication events, and why are they important in
parallel programming?
- Answer : Communication events are points where data is exchanged between
tasks. They include sending and receiving data, and they play a crucial role in
maintaining data coherence.

6. Fine-Grained Parallelism:

*Heading: Exploring Fine-Grained Parallelism*

- Question : What is fine-grained parallelism, and how does it differ from coarse-grained parallelism?

- Answer : Fine-grained parallelism involves small tasks with frequent communication. It has a low computation to communication ratio, making it suitable for tasks with intensive communication needs.

*Subheading: Communication Overhead in Fine-Grained Parallelism:*

- Question : Why does fine-grained parallelism often have a low computation to communication ratio?

- Answer : Fine-grained tasks require frequent communication, leading to a lower computation to communication ratio.

7. High Communication Overhead:

*Heading: Understanding High Communication Overhead in Parallel Programming*

- Question : What does high communication overhead imply in parallel programming?
- Answer : High communication overhead indicates that a significant portion of the program's execution time is spent on communication rather than actual computation.

*Subheading: Impact on Performance Enhancement:*

- Question : How does high communication overhead affect the potential performance enhancement of parallel programs?

- Answer : High communication overhead can limit the benefits of parallel execution, as the time spent on communication offsets the gains from parallelism.

8. Types of Overheads:

*Heading: Exploring Different Types of Overheads in Parallel Programming*

- Question : What are the different types of overheads encountered in parallel programming?

- Answer : Overheads include computation overhead, communication overhead, synchronization overhead, and resource overhead.

9. Coarse-Grained Parallelism:

*Heading: Understanding Coarse-Grained Parallelism*

- Question : What is coarse-grained parallelism, and how does it differ from fine-grained parallelism?
- Answer : Coarse-grained parallelism involves larger tasks with less frequent communication. It has a higher computation to communication ratio, making it suitable for tasks with less communication.

10. Load Balancing:

*Heading: Exploring Load Balancing in Parallel Programming*

- Question : What is load balancing, and why is it important in parallel programs?

- Answer : Load balancing ensures even distribution of tasks among processors, avoiding resource underutilization or overloading.

*Subheading: Causes of Load Imbalance:*

- Question : What factors can lead to load imbalance in parallel programs?

- Answer : Load imbalance can be caused by unequal task sizes, variations in input data, or differences in task execution times.

11. Domain vs Functional Decomposition:

*Heading: Comparing Domain and Functional Decomposition*

- Question : What are domain and functional decomposition, and how do they differ?

- Answer : Domain decomposition divides the problem into subdomains, while functional decomposition breaks down tasks based on functions or operations.

*Subheading: Data Decomposition in Domain Decomposition:*

- Question : What is data decomposition within the context of domain decomposition?

- Answer : Data decomposition involves dividing the problem's data into subdomains for concurrent processing.

12. Block vs Cyclic Decomposition:

*Heading: Analyzing Block and Cyclic Decomposition*

- Question : What are block and cyclic decomposition, and what are their advantages and disadvantages?

- Answer : Block decomposition assigns contiguous data blocks, while cyclic decomposition assigns data in a round-robin fashion. Block decomposition can lead to better cache utilization, while cyclic decomposition can provide better load balancing.

13. 1D and 2D Data Decomposition:

*Heading: Understanding 1D and 2D Data Decomposition*

- Question : What are 1D and 2D data decomposition, and how do they differ?

- Answer : 1D decomposition divides data along a single dimension, while 2D decomposition divides data along two dimensions, often used in matrices.

*Subheading: Visualizing Data Decomposition:*

- Question : How does data decomposition work in 1D and 2D decompositions?

- Answer : In 1D decomposition, data is divided along a single axis. In 2D decomposition, data is divided along both row and column axes, creating a grid-like structure.

14. Cyclic Data Decomposition:

*Heading: Exploring Cyclic Data Decomposition*

- Question : What is cyclic data decomposition, and how does it distribute data
among tasks?

- Answer : Cyclic data decomposition assigns data to tasks in a circular pattern, ensuring even distribution.

15. Data vs Domain Decomposition:

*Heading: Differentiating Data and Domain Decomposition*

- Question : What are the differences between data and domain decomposition
in parallel programming?

- Answer : Data decomposition focuses on distributing data among tasks, while domain decomposition divides the problem into subdomains for concurrent processing.

16. Functional Decomposition:

*Heading: Exploring Functional Decomposition*


- Question : What is functional decomposition, and where does it fall in the spectrum of data and domain decomposition?

- Answer : Functional decomposition breaks down a problem into separate functions or tasks. It is the counterpart of domain decomposition: the focus is on the computation to be performed rather than on the data it operates on.

*Subheading: Types of Decomposition:*

- Question : Are there different types of decomposition, and can you provide subtypes and examples?

- Answer : The two main decomposition types are domain (data) decomposition and functional decomposition. Within domain decomposition, common subtypes include block and cyclic distribution.

17. Considerations for Task Composition and Assignment:

*Heading: Factors to Consider when Composing and Assigning Tasks*

- Question : What factors should be considered when composing and assigning tasks in parallel programming?

- Answer : Consider load balance, communication patterns, synchronization, data locality, and system-related aspects like NUMA architecture.

18. NUMA (Non-Uniform Memory Access):

*Heading: Understanding NUMA Architecture*

- Question : What is NUMA, and how does it impact parallel programming?

- Answer : NUMA is a computer architecture where each processor has fast access to its own local memory but higher latency when accessing remote memory. It affects memory access patterns in parallel programs.

19. Communication Cost:

*Heading: Exploring Communication Cost in Parallel Programs*

- Question : What is communication cost, and why is it important to consider in parallel programming?

- Answer : Communication cost refers to the time and resources spent on data exchange between tasks. It affects program performance.

*Subheading: Impact of Inter-Task Communication Overhead:*

- Question : How does inter-task communication overhead affect parallel program performance?

- Answer : Inter-task communication overhead can lead to delays and inefficiencies, reducing the potential speedup from parallelism.

*Subheading: Saturation of Bandwidth:*

- Question : How can communication saturate bandwidth in parallel systems?

- Answer : Excessive communication between tasks can saturate the available communication bandwidth, leading to congestion and performance degradation.

20. Bandwidth and Latency:


*Heading: Defining Bandwidth and Latency*

- Question : What are bandwidth and latency in parallel systems?

- Answer : Bandwidth is the data transmission rate, while latency is the time
delay in data transmission.

21. Visibility of Communication:

*Heading: Examining Visibility of Communication in Parallel Models*

- Question : How is communication visibility different in parallel, distributed, and shared memory models?

- Answer : Visibility depends on how communication is expressed. In message passing (distributed) models, communication is explicit and highly visible, since the programmer codes the sends and receives. In shared memory models, communication occurs implicitly through shared variables and is largely invisible to the programmer. Hybrid systems show a mix of the two.

22. Shared Memory Model:

*Heading: Understanding Shared Memory Model*

- Question : What is the shared memory model in parallel programming?

- Answer : In the shared memory model, multiple processors access a common memory space, facilitating communication through shared variables.

---

COMMUNICATION NOTES-


Communication Patterns:

- Synchronous Communication :
- Requires "handshaking" between tasks.
- Structured explicitly in code by the programmer.
- Involves blocking communications as other tasks wait until communication is
completed.
- Suitable for scenarios where tasks need to coordinate closely.

- Asynchronous Communications :
- Enables tasks to transfer data independently.
- Non-blocking communications allow interleaving computation with
communication.
- Provides flexibility and potential performance improvements.
- Particularly useful when tasks can progress without waiting for
communication to complete.

Scope of Communication:

- Knowing which tasks must communicate with each other is crucial during design:
- Point-to-point Communication :
- Involves two tasks, one sender/producer and one receiver/consumer.
- Direct communication between specific tasks.
- Data is exchanged between individual pairs of tasks.
- Collective Communication :
- More than two tasks participate, often in a common group or collective.
- Data sharing involves multiple tasks in a coordinated manner.
- Efficient for broadcasting, gathering, and other collective operations.

Synchronization:

- Synchronization ensures coordination and sequencing of parallel tasks.
- Significant impact on program performance.
- Often involves serializing program segments to maintain order.

Types of Synchronization:

1. Barrier :
- All tasks involved.
- Each task works until reaching the barrier, then blocks.
- Resumes when the last task reaches the barrier.
- Ensures synchronization point before proceeding.

2. Lock / Semaphore :
- Any number of tasks can be involved.
- Typically used to serialize access to global data or code.
- Only one task can own the lock/semaphore at a time.
- Tasks attempt to acquire the lock, waiting if it's owned by another task.
- Can be blocking or non-blocking.
- Effective for managing shared resources.

3. Synchronous Communication Operations :
- Involves tasks executing communication.
- Requires acknowledgement before communication initiation.
- Ensures that tasks are ready for communication before data transfer.


PDC Parallel Architectures:



Parallel Architectures and Flynn's Taxonomy:

- Flynn's Taxonomy categorizes parallel architectures based on instruction and data streams.
- Four classifications: SISD, SIMD, MISD, MIMD.
- SISD (Single Instruction, Single Data Stream): Single processor, single
instruction stream, deterministic execution, used in traditional computing.
- SIMD (Single Instruction, Multiple Data Stream): Parallel processor,
single instruction to all units, each processes different data, common in GPUs.
- MISD (Multiple Instruction, Single Data Stream): Sequence of data to
processors, each executing different instruction sequences, e.g., pipelined vector
processors.
- MIMD (Multiple Instruction, Multiple Data Stream): Simultaneous
execution of different instructions on different data, common in multi-core,
clusters, grids, and clouds.

Cluster Computing:
- Clusters are collections of interconnected, independent computers (nodes) used as a unified computing resource.
- High-performance alternative to SMP.
- Benefits include scalability, high availability, and redundancy.
- Cluster middleware provides a unified image, single point of entry, and single
file hierarchy.

Grid Computing:

- Grid Computing involves globally distributed heterogeneous computers providing CPU power and data storage.
- Applications executed at various locations, geographically distributed services.
- Grid architecture comprises autonomous distributed computers/clusters, user
resource broker, grid resources, and grid information service.
- Utilizes standard protocols and interfaces for resource sharing.

Cloud Computing:

- Cloud Computing offers network-based computing over the Internet.
- Provides on-demand services, elasticity, and pay-as-you-go.
- Three service models: IaaS (Infrastructure as a Service), PaaS (Platform as a
Service), SaaS (Software as a Service).
- Cloud providers offer scalable, virtualized resources.

Supercomputers:

- Supercomputers lead in processing capacity and speed, measured in FLOPs.
- The LINPACK Benchmark determines computer speed; the Top 500 Supercomputers list tracks high-end systems.
- Examples include Fugaku, Summit, Sierra, and Sunway TaihuLight.

Key Concepts and Bullet Points:

- Flynn's Taxonomy categorizes parallel architectures.
- SISD: Single processor, single instruction, deterministic execution.
- SIMD: Parallel processor, single instruction to all, different data, e.g., GPUs.
- MISD: Sequence of data to processors, different instructions, e.g., pipelined
vector processors.
- MIMD: Simultaneous execution of different instructions and data, common in
multi-cores, clusters, grids, clouds.
- Clusters: Collections of interconnected uni-processor systems, high-
performance alternative to SMP.
- Grid Computing: Globally distributed heterogeneous computers, standard
protocols and interfaces.
- Cloud Computing: Network-based computing, on-demand, IaaS, PaaS, SaaS.
- Supercomputers: Lead in processing capacity and speed, LINPACK
Benchmark, Top 500 list.

—-------------------------------------------------------------------------------------------------------

Beowulf Cluster/Basic Concepts and MPI Intro

Beowulf Cluster:

A cluster of computers (compute nodes) interconnected with a network.
Characteristics: dedicated homogeneous nodes, off-the-shelf machines, standard network, open-source software, Linux platform.
One special node called the Head or Master node.

Networking Concepts:

DHCP (Dynamic Host Configuration Protocol): Obtain IP address dynamically.
NAT (Network Address Translation): Translates local addresses to a single external IP.
Bridged Networking: VMs are full network citizens, like the host.
Internal Networking: Isolated network for VMs.

NFS (Network File System):

Share content between nodes.
Install the NFS server on the master and NFS clients on the slaves.
Create shared folders (e.g., /mirror) for data.
Configure /etc/exports on the master to share folders with specific permissions.
Mount the shared folders on the slave nodes.
SSH (Secure Shell) Setup:

Set up password-less SSH between nodes.
Generate SSH key pairs (public and private keys) for user mpiuser.
Add the master's public SSH key to authorized_keys on all nodes.
MPICH (an implementation of the Message Passing Interface):

Install MPICH2 for parallel programming.
Compile and run MPI programs across nodes.
Use mpiexec to distribute processes among nodes.
MPI Hello World Example:

Write an MPI program (mpi_hello.c) that prints processor information.
Compile the program using mpicc.
Run the program using mpiexec with a specified machinefile and number of processes.

MPI INTRO:

Introduction to MPI:

- Message Passing Interface (MPI) is a standardized communication protocol used for parallel programming.
- MPI enables applications to execute in parallel across multiple computers
connected by a network.
- Processes in an MPI program are assigned unique integers, known as ranks,
starting from 0 for the parent process.
- Each process's rank serves as its identifier within the MPI program.
- MPI provides functions that allow processes to determine their ranks and the
total number of processes in the program.

MPI Basic Functions:

- MPI_Init(&argc, &argv): Initializes the MPI environment.
- MPI_Comm_size(MPI_COMM_WORLD, &nprocs): Retrieves the total
number of processes in the program.
- MPI_Comm_rank(MPI_COMM_WORLD, &myrank): Retrieves the rank of
the calling process.
- MPI_Finalize(): Finalizes the MPI environment.

Relation between Beowulf Cluster and MPI:

- A Beowulf Cluster is a collection of interconnected computers (nodes) used for parallel computing.
- MPI provides the communication framework for processes running on these
nodes to exchange data and coordinate their execution.
- Beowulf Clusters leverage MPI to enable processes on different nodes to
collaborate in parallel computing tasks.
- The cluster environment allows MPI programs to efficiently distribute
workloads, facilitate data exchange, and synchronize computations.
- MPI enables the cluster nodes to function as a unified computing resource,
enhancing overall performance and scalability.

In summary, MPI serves as the essential communication infrastructure that allows processes within a Beowulf Cluster to collaborate and execute parallel tasks efficiently. The combination of Beowulf Clusters and MPI provides a powerful platform for high-performance parallel computing applications.

—--------------------------------------------------------------------------------------

ASSIGNMENT 1 AND 2 LEARNINGS OR PRACTICE QS–


—--------------------------------------------------------------------------------------

MPI BASICS-

Message Passing Interface (MPI)

Basics

- Distributed Systems - Definition:
- Distributed system: components on networked computers communicate by passing messages.
- Achieve a common goal through communication and coordination.

- How to Program Distributed Computers?
- Message Passing based Programming Model.
- MPI (Message Passing Interface) is a standardized message passing library.

MPI (Message Passing Interface)

- Standardized message passing library specification.
- Used for parallel computer clusters.
- Not a specific product but a specification.
- Multiple implementations: MPICH, LAM, OpenMPI, etc.
- Portable, with Fortran and C/C++ interfaces.
- Enables real parallel programming.

A Brief History - MPI

- Writing parallel applications was initially challenging.
- No single standard, various implementations.
- MPI standard defined to support same features and semantics.
- MPI-1 defined in 1994, providing a complete interface.
- Resulted in portable parallel programs.

The Message-Passing Model

- Processes communicate through messages.
- Message Passing Interface (MPI) facilitates communication.
- Communication system involves machines, processes, and a communication
network.

Types of Parallel Computing Models

- Data Parallel: Same instructions on multiple data items.
- Task Parallel: Different instructions on different data.
- SPMD (Single Program, Multiple Data): Not synchronized at individual
instruction level.
- MPI is suitable for MIMD/SPMD type of parallelism.

The Message-Passing Programming Paradigm

- Sequential Programming Paradigm: Memory, program, processor.
- Message-Passing Programming Paradigm: Communication network, distributed memory, parallel processors.
- MPI facilitates communication between processes with separate address spaces.

Data and Work Distribution

- Communicating MPI processes need identifiers: each process is identified by its rank.

MPI Fundamentals

- Communicator defines a group of processes for communication.
- Each process in a group is assigned a unique rank.
- Communicators can be predefined (e.g., MPI_COMM_WORLD) or defined explicitly.

MPI Program – A Generic Structure

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
MPI_Init(&argc, &argv);
printf("Hello PDC Class!\n");
MPI_Finalize();
return 0;
}
```

Point-to-Point Communication

- Communication between two processes: source sends, destination receives.
- Communication within a communicator (e.g., MPI_COMM_WORLD).
- Processes identified by ranks in the communicator.

MPI_Send & MPI_Recv

- MPI_Send: Send data to a destination process.
- MPI_Recv: Receive data from a source process.
- Blocking operations: MPI_Send returns once the send buffer is safe to reuse, and MPI_Recv returns once the data has been received.

Elementary MPI datatypes

- Similar to C datatypes.
- E.g., int -> MPI_INT, double -> MPI_DOUBLE, char -> MPI_CHAR.
- Complex datatypes possible, e.g., structure datatype.

Non-Blocking Communication - Overview

- MPI_Isend and MPI_Irecv initiate communication and return a request data structure.
- MPI_Wait blocks until communication is complete.
- MPI_Test checks for completion.

Non-Blocking Send and Receive

- MPI_Isend and MPI_Irecv start communication and return a request data structure.
- MPI_Wait or MPI_Test is used for completion check.
- Advantages: No deadlocks, overlap communication with computation.

MPI_Probe

- Use MPI_Probe to query message size before receiving.


- MPI_Probe returns information about the incoming message.
- MPI_Get_count retrieves the size of the incoming message.

MPI_Probe - Example

```c
// Probe for an incoming message from process 0, tag 0
MPI_Probe(0, 0, MPI_COMM_WORLD, &status);

// Get the message size from the status object
MPI_Get_count(&status, MPI_INT, &number_amount);

// Allocate a buffer just large enough for the incoming data
int *number_buf = (int *)malloc(sizeof(int) * number_amount);

// Receive the message
MPI_Recv(number_buf, number_amount, MPI_INT, 0, 0,
         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
```

Summary (Blocking Send/Recv)

- MPI_SEND and MPI_RECV behavior explained.


- Blocking nature of MPI_Send and MPI_Recv.

Non-Blocking Communication - Example

- Demonstration of non-blocking communication using MPI_Isend and MPI_Irecv.

Any Questions

OKAY BRO, NOW IT'S TIME FOR MPI ;-; LET'S FINALLY UNDERSTAND IT-

MPI BASICS
• Processes may need to communicate with everyone else

• Three Main Classes:


1. Communications: Broadcast, Gather, Scatter
2. Synchronization: Barriers
3. Reductions: sum, max, etc.

• Properties:
– Must be executed by all processes (of the communicator)
– All processes in group call same operation at (roughly) the
same time
– All collective operations are blocking operations

MPI_Scatterv

sendbuf: address of send buffer (significant only at root)
sendcounts: integer array (of length group size) specifying the number of
elements to send to each processor
displs: integer array (of length group size); entry i specifies the displacement
(relative to sendbuf) from which to take the outgoing data to process i
sendtype: data-type of send buffer elements
recvcount: number of elements in receive buffer (integer)
recvtype: data-type of receive buffer elements
root: rank of sending process (integer)
comm: communicator (handle)

MPI_Gatherv
sendbuf: address of send buffer
sendcount: number of elements in send buffer (integer)
sendtype: data-type of send buffer elements
recvbuf: address of the receive buffer (significant only at root)
recvcounts: integer array (of length group size) containing the number of
elements to be received from each process (significant only at root)
displs: integer array (of length group size); entry i specifies the displacement
relative to recvbuf at which to place data from process i (significant only at root)
recvtype: data-type of receive buffer elements (handle)
root: rank of receiving process (integer)
comm: communicator (handle)

• MPI_Allgather
• Similar to MPI_Gather, but the result is available to all
processes
• MPI_Allgatherv
• Similar to MPI_Gatherv, but the result is available to all
processes
• MPI_Alltoall
• Similar to MPI_Allgather, but each process performs a
scatter followed by a gather

• MPI_Alltoallv
• Similar to MPI_Alltoall, but messages to different
processes can have different length


—--
MPI_BARRIER

It synchronizes all processes (by blocking them) in the communicator until all
processes have called MPI_Barrier.
—--
REDUCTIONS
—--

The communicated data of the processes are combined via a specified operation, e.g. '+'.

Two different variants:


- Result is only available at the root process
- Result is available at all processes

Input values (at each process):


- Scalar variable: operation combines all values of the
processes
– Array: The elements of the arrays are combined in an
element-wise fashion. The result is an array.

1- EXPLAIN WITH EXAMPLE SCALAR REDUCTION MPI


2- ARRAY REDUCTION
**1. Scalar Reduction in MPI:**

Scalar reduction in MPI refers to combining individual values from different
processes into a single result using a specified operation, such as sum, product,
maximum, or minimum. It's like everyone at a party sharing their snacks, and
then one person calculating the total for everyone.

**Example:**
Imagine you and your friends each have a bag of candies, and you want to know
the total number of candies you all have combined. Each friend counts their
candies, and then you add up all the counts to find the total number of candies.
**Code:**

**2. Array Reduction in MPI:**

Array reduction is similar to scalar reduction, but it involves combining arrays
of values element-wise across processes using the specified operation.

**Example:**
Imagine you and your friends each have a bag of balls, and you want to find the
total count of balls for each color. You count the number of red balls, your
friend counts the blue balls, and you combine the counts for all colors.

In these examples, MPI_Allreduce combines data across all processes using a
specified operation, producing a single result that represents the collective effort
of the entire group. Just like when you and your friends combine your resources
to achieve something bigger together! 🍬🏀🤝

—-----------------------------------------------------------------------
Performance Analysis

Performance?
- The need to measure improvement in computer architectures necessitates the
comparison of alternative designs.
- A better system is characterized by better performance, but understanding the
precise meaning of performance is essential.

Performance Metrics – Sequential Systems:


- In the context of computer systems and programs, a primary performance
metric is time, often referred to as wall-clock time.
- The execution time of a program (A) can be decomposed into user CPU time,
system CPU time, and waiting time due to I/O operations and time sharing.

Computer Performance:
- Evaluating computer performance involves various metrics:
- Clock Speed: Measured in GHz, it dictates the frequency of the clock cycle and
influences processor speed.
- MIPS (Millions of Instructions per Second): Facilitates comparison, but
potential for misinterpretation exists when comparing different instruction sets.
- FLOPS (Floating Point Operations per Second): Offers a reliable measure for
floating-point performance.
- Factors influencing computer performance encompass processor speed, data bus
width, cache size, main memory amount, and interface speed.

Measuring Performance:
- Each processor features a clock that ticks consistently at a regular rate.
- The clock serves to synchronize various components of the system.
- Clock rate is measured in GHz (gigahertz); clock cycle time is its reciprocal.
- For instance, a clock rate of 200 MHz implies the clock ticks 200,000,000 times
per second (Pentium 1, 1995).

Machine Clock Rate:


- Clock Rate (CR), expressed in MHz or GHz, is the reciprocal of Clock Cycle
(CC) time (clock period).
- The relationship between clock rate and cycle time is CC = 1 / CR.

Clock Speed, MIPS, and FLOPS:


- Faster clock speeds generally result in faster processors; for example, a 3.2 GHz
processor is faster than a 1.2 GHz processor.
- MIPS (Millions of Instructions per Second) is a measure used to assess how
quickly a computer executes instructions.
- FLOPS (Floating Point Operations per Second) serves as an excellent measure,
as it remains consistent across different processors.

Measuring Performance Units:


- High Performance Computing units include various levels:
- Kilo, Mega, Giga, Tera, Peta, Exa, and Zetta.
- The units cover speed (Flop/second) and capacity (Bytes).

Benchmarks:
- Benchmarks are critical tools for evaluating and comparing different systems as
well as assessing modifications to a single system.
- Microbenchmarks focus on specific performance dimensions, such as cache and
memory bandwidth, providing insights into underlying factors.
- Macrobenchmarks evaluate overall application execution time and require an
application suite for comprehensive assessment.
- Notable benchmark suites include SPEC CPU2000 for CPU-intensive
applications, EEMBC for embedded systems, and TPC benchmarks for servers.

Amdahl's Law and Speedup Factor:


- Amdahl's Law quantifies potential program speedup based on the fraction of
code that can be parallelized.
- The speedup formula is given by: Max. speedup = 1 / (1 - P), where P represents
the fraction of code that can be parallelized (so 1 - P is the serial fraction).
- Practical examples illustrate speedup's dependency on the parallelizable
fraction (P).
Gustafson’s Law:
- Gustafson’s Law challenges Amdahl’s Law by considering scalability regarding
problem size and retaining scalability as the number of processors increases.
- The focus is on large machines and increasing both problem size and the
parallel part for improved scalability.

Scalability:
- Scalability refers to a system's ability to accommodate increasing problem sizes
or resources.
- Strong scalability maintains efficiency while increasing processors, while weak
scalability maintains efficiency while increasing both processors and problem
size.

Strong vs Weak Scalability:


- Strong scaling relates to speedup for a fixed problem size concerning the
number of processors, governed by Amdahl's law.
- Weak scaling pertains to speedup for a scaled problem size with respect to the
number of processors, governed by Gustafson's law.

Feel free to ask if you have any questions or need further clarifications on any
specific concept!

Revision of Formulas -

Sure, I'd be happy to provide the basic formulas along with definitions and tips
or tricks to remember each one:

1. Speedup (S):
- Formula: Speedup (S) = Execution Time (Single Processor) / Execution Time
(Multiple Processors)
- Definition: Speedup measures the relative performance improvement gained
by executing a task on multiple processors compared to a single processor.
- Tip: Think of speedup as a ratio. A larger speedup value indicates better
performance. To remember the formula, think of dividing the execution time on
a single processor by the execution time on multiple processors to get the
speedup factor.

2. Amdahl's Law (Maximum Speedup):


- Formula: Max Speedup = 1 / ((1 - F) + F / P)
- Definition: Amdahl's Law calculates the maximum achievable speedup for a
program with a given fraction (F) of parallelizable code when executed on P processors.
- Tip: To remember the formula, think of "1 minus F" in the denominator,
and "F divided by P" as part of the fraction. The formula highlights how the
serial portion limits the maximum speedup.

3. Efficiency (E):
- Formula: Efficiency (E) = Speedup / Number of Processors
- Definition: Efficiency quantifies how effectively multiple processors are used
to perform a task. It is the ratio of achieved speedup to the number of processors
used.
- Tip: Think of efficiency as a measure of how well resources are utilized.
Higher efficiency values indicate better utilization. Remember the formula by
dividing the speedup by the number of processors.

4. Gustafson's Law (Scaled Speedup):


- Formula: Scaled Speedup S(N) = s + (1 - s) * N
- Definition: Gustafson's Law considers scalability with respect to problem
size as the number of processors (N) increases; s is the serial fraction of the
work, and the parallel part of the problem is scaled up with the machine.
- Tip: To recall the formula, remember that the scaled speedup has a constant
serial term (s) plus a parallel term (1 - s) * N that grows linearly with the
number of processors.

These tricks and tips should help you remember the formulas and their
meanings more easily. Feel free to associate them with visual or mnemonic aids
for even better recall!

SECTION 2-

1. GFLOPS (GigaFLOPS):
- Formula: GFLOPS = 10^9 Floating Point Operations Per Second
- Explanation: A measure of computing performance, indicating the number of
billions of floating-point operations a computer can perform in one second.

2. MFLOPS (MegaFLOPS):
- Formula: MFLOPS = 10^6 Floating Point Operations Per Second
- Explanation: Similar to GFLOPS but on a smaller scale, MFLOPS measures
the number of millions of floating-point operations a computer can perform in
one second.

3. Amdahl's Law:
- Formula: Max Speedup = 1 / ((1 - F) + F / P)
- Explanation: Amdahl's Law gives the maximum potential speedup of a
program whose parallelizable fraction is F when executed on P
processors. It helps understand the impact of parallelization on overall
performance.

4. Gustafson's Law (Scaled Speedup):


- Formula: Scaled Speedup S(N) = s + (1 - s) * N
- Explanation: Gustafson's Law considers scalability in terms of the serial
fraction (s) and the number of processors (N), scaling the problem size with
the machine to achieve better performance with more processors. It challenges
Amdahl's Law by focusing on larger problems and maintaining scalability.

5. Efficiency:
- Formula: Efficiency = Speedup / Number of Processes
- Explanation: Efficiency measures how well a parallel program utilizes
available resources. It is the ratio of speedup achieved to the number of processes
used.

6. Speedup (with N CPUs or Machines):


- Formula: Speedup = 1 / (Serial Fraction + Parallel Fraction / Number of
Processors)
- Explanation: Speedup measures the performance improvement gained by
using multiple processors (CPUs) or machines. It takes into account both serial
and parallel fractions of a program.

7. Number of Computational Steps (Parallel):


- Formula: Number of Computational Steps (Parallel) = Number of
Computational Steps (Sequential) / Number of Processors
- Explanation: This formula relates the number of computational steps
required in a parallel execution to the number of processors used.

These formulas and concepts provide insights into various aspects of


performance analysis in computing systems, helping understand how different
factors impact overall performance.

Section 2.1
Here are the formulas for general speedup, Amdahl's speedup, Gustafson's
speedup, and efficiency, along with a tip to differentiate between them:

1. General Speedup (Parallel Execution):


- Formula: General Speedup = Execution Time (Sequential) / Execution Time
(Parallel)
- Tip: Think of general speedup as comparing the time it takes to complete a
task sequentially (using a single processor) to the time it takes in parallel (using
multiple processors). It's a straightforward measure of performance
improvement.

2. Amdahl's Speedup:
- Formula: Amdahl's Speedup = 1 / ((1 - Fraction Parallelizable) + Fraction
Parallelizable / Number of Processors)
- Tip: Remember Amdahl's Law by focusing on the idea that it quantifies the
potential speedup of a program considering the portion that can be parallelized
and the number of processors. It's about assessing the impact of parallelization
on overall speedup.

3. Gustafson's Speedup:
- Formula: Gustafson's Speedup = Serial Fraction + Parallelizable Fraction *
Number of Processors
- Tip: Recall Gustafson's Law by keeping in mind its emphasis on problem size
scaling with the number of processors. The serial fraction contributes a fixed
term, while the parallelizable fraction is multiplied by the number of
processors, so the speedup keeps growing as the machine grows.

4. Efficiency:
- Formula: Efficiency = Speedup / Number of Processors
- Tip: Think of efficiency as a measure of how effectively processors are being
utilized. It relates the speedup achieved to the number of processors used.
Higher efficiency indicates better utilization of resources.

Tip for Differentiation: When working with these concepts, it's helpful to
associate Amdahl's Law with the idea of limited parallelization (fraction that can
be parallelized) and Gustafson's Law with the concept of scalable problem size.
Efficiency, on the other hand, relates to how efficiently resources are used in
achieving speedup.

By keeping these associations in mind, you can differentiate between the


different formulas and concepts more easily.

The big difference between Amdahl's and Gustafson's laws: Amdahl's assumes a
fixed problem size, whereas Gustafson's assumes a scaled problem size.

SMP (Symmetric Multi-Processing) → a shared-memory parallel architecture

To achieve the required speedup (S = 7.804) on an SMP machine with 32
processors, we need to find the percentage of parallelizable code (p) using
Amdahl's law.

Step 1: Invert Amdahl's Law

S = 1 / [(1 - p) + (p / n)]
so (1 - p) + (p / n) = 1 / S
1 / 7.804 ≈ 0.12814

Step 2: Solve for p

(1 - p) + (p / 32) = 0.12814
1 - p × (1 - 1/32) = 0.12814
p × (31/32) = 0.87186
p = 0.87186 × (32 / 31)
p ≈ 0.8999 ≈ 90%

So, to achieve a speedup of 7.804 on the SMP machine with 32 processors,
approximately 90% of the code needs to be effectively parallelized, and the
rest will be executed serially.

(Sanity check: with p = 0.90 and n = 32, S = 1 / (0.10 + 0.90/32) =
1 / 0.128125 ≈ 7.804, as required. The efficiency is then E = S / n =
7.804 / 32 ≈ 0.2439, reflecting the serial bottleneck at 32 processors.)

Let's break down the scenario and determine whether it's feasible for the
company, Techideas, to execute the parallel version of their Black-Scholes
application on the cluster with a utilization of 60% or above.

Given information:
- Percentage of parallelizable code (p) = 86.375%
- Number of cores available in the cluster (n) = 8

Step 1: Calculate the Speedup (S)


The formula for speedup (S) is S = 1 / [(1 - p) + (p / n)].

Substitute the given values:


S = 1 / [(1 - 0.86375) + (0.86375 / 8)]
S = 1 / [0.13625 + 0.107969]
S = 1 / 0.244219
S ≈ 4.095

Step 2: Calculate the Efficiency (E)


The formula for efficiency (E) is E = S / n.
Substitute the values:
E = 4.095 / 8
E ≈ 0.511875

The calculated efficiency is approximately 51.19%.

Step 3: Compare Efficiency with 60%


The company, Techideas, wants to utilize 60% or above of the parallelizable
code. However, the calculated efficiency is approximately 51.19%, which is below
the target of 60%.

Conclusion:
Based on the analysis, the parallel version of the Black-Scholes application on the
cluster with 8 cores will not be able to achieve the desired utilization of 60% or
above of the parallelizable code. The efficiency of around 51.19% falls short of
the target.

Recommendation:
Mr. Salman Ahmed should suggest to Techideas that running the parallel
application on the cluster with the current resources may not be feasible to
achieve the desired utilization goal. They may need to consider acquiring more
computing resources with a higher number of cores or optimizing the code
further to increase the parallelization and efficiency.

—-------------------------------------------------------------------------------------------------------
-----
Key points performance analysis in PDC-

Key Points from "Performance Analysis" by Dr. Qaisar Shafi:

1. Performance involves comparing different computer designs to measure


improvement in computer architectures.

2. Better system performance is characterized by improved speed and efficiency,


requiring a precise understanding of the concept.

3. Performance metrics in computer systems include wall-clock time, which


measures execution time of a program.

4. Execution time can be divided into user CPU time, system CPU time, waiting
time due to I/O operations, and time sharing.

5. Computer performance is evaluated using metrics like clock speed, MIPS


(Millions of Instructions per Second), and FLOPS (Floating Point Operations
per Second).

6. Factors influencing performance include processor speed, data bus width,


cache size, main memory, and interface speed.

7. Clock cycle synchronizes system components, and clock rate is the reciprocal
of cycle time.
8. Clock speed, MIPS, and FLOPS contribute to processor performance and
efficiency.

9. Performance units include Kilo, Mega, Giga, Tera, Peta, Exa, and Zetta,
covering speed and capacity.

10. Benchmarks are critical tools for evaluating and comparing systems, with
microbenchmarks and macrobenchmarks providing insights.

11. Amdahl's Law quantifies potential speedup based on parallelizable code


fraction, while Gustafson's Law focuses on scalability.

12. Scalability refers to a system's ability to handle increasing problem sizes or


resources.

13. Strong and weak scalability relate to speedup for fixed and scaled problem
sizes, respectively.

14. General speedup, Amdahl's speedup, Gustafson's speedup, and efficiency are
important performance formulas.

15. GFLOPS and MFLOPS measure computing performance in billions and


millions of operations per second.
16. Amdahl's Law calculates maximum speedup considering non-parallelizable
code fraction.

17. Gustafson's Law emphasizes scalability with larger problem size and parallel
fraction.

18. Efficiency measures how well parallel programs use resources.

19. The discussed formulas and concepts help analyze and improve overall
system performance.

20. Analysis of SMP machine and cluster utilization demonstrates the


importance of parallelization and efficiency considerations.

—-------------------------------------------------------------------------------------------------------
-----
