PDC Summers Finals Revision Notes
A summary of the key points and core concepts from the lecture slides on Parallel
and Distributed Computing:
Topics covered:
- Parallel Computing
- Distributed Computing
- Multi-core Era
- Parallelization Strategies
- Collective Communications
- Synchronization Types
Remember that this is a concise overview, and you might need to refer back to
your detailed lecture slides for more comprehensive understanding and specific
examples.
Granularity:
Granularity refers to the size of the tasks that are being parallelized. It is the
ratio of computation to communication in a parallel program. Fine-grained
granularity involves smaller tasks with relatively more communication, while
coarse-grained granularity involves larger tasks with relatively less
communication.
Synchronization:
Synchronization involves coordinating the execution of tasks to ensure they
work correctly together. It prevents race conditions and ensures that tasks
proceed in an orderly manner.
Synchronization Overhead:
Synchronization overhead is the time and resources spent on coordinating tasks.
High synchronization overhead can hinder performance improvement.
Communication Events:
Communication events are points in a parallel program where data is exchanged
between tasks. They include sending and receiving data.
Fine-Grained Parallelism:
Fine-grained parallelism involves breaking down a problem into small tasks that
require frequent communication. It has a low computation to communication
ratio.
Types of Overheads:
Overheads in parallel programming include computation overhead,
communication overhead, synchronization overhead, and resource overhead.
Coarse-Grained Parallelism:
Coarse-grained parallelism involves larger tasks with less frequent
communication. It has a higher computation to communication ratio.
Load Balancing:
Load balancing ensures that tasks are distributed evenly among processors to
avoid underutilization or overloading of resources.
Data Decomposition:
Data decomposition divides the problem's data into chunks, with each task
operating on a different subset of the data.
Block vs Cyclic Decomposition:
- Block decomposition assigns contiguous blocks of data to tasks.
- Cyclic decomposition assigns data in a round-robin manner to tasks.
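As an illustration (not from the slides), here is a minimal C sketch of how block and cyclic decomposition might assign array elements to tasks; the element count N and task count P are made-up values.
```c
#include <stdio.h>

#define N 8   /* hypothetical number of data elements */
#define P 2   /* hypothetical number of tasks */

int main(void) {
    /* Block decomposition: each task gets a contiguous chunk of N/P elements */
    for (int task = 0; task < P; task++)
        for (int i = task * (N / P); i < (task + 1) * (N / P); i++)
            printf("block:  element %d -> task %d\n", i, task);

    /* Cyclic decomposition: elements are dealt out round-robin */
    for (int i = 0; i < N; i++)
        printf("cyclic: element %d -> task %d\n", i, i % P);

    return 0;
}
```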
Functional Decomposition:
Functional decomposition breaks down a problem into separate functions or
tasks.
Communication Cost:
Communication cost refers to the time and resources spent on data exchange
between tasks.
Inter-Task Communication Overhead:
Inter-task communication overhead arises from the time spent on sending and
receiving data between tasks.
Visibility of Communication:
- Parallel system: High visibility due to shared memory.
- Distributed system: Moderate visibility due to message passing.
- Parallel and Distributed: Moderate visibility.
- Shared memory model: High visibility due to shared memory access.
---
Self-Test Questions and Answers:
- Granularity
- Synchronization
- Synchronization Overhead
- Communication Events:
  - Question: What are communication events, and why are they important in parallel programming?
  - Answer: Communication events are points where data is exchanged between tasks. They include sending and receiving data, and they play a crucial role in maintaining data coherence.
- Fine-Grained Parallelism
- Types of Overheads
- Coarse-Grained Parallelism
- Question: What are domain and functional decomposition, and how do they differ?
- Question: What are block and cyclic decomposition, and what are their advantages and disadvantages?
- Question: What are 1D and 2D data decomposition, and how do they differ?
- Question: What is cyclic data decomposition, and how does it distribute data among tasks?
- Question: What are the differences between data and domain decomposition in parallel programming?
- Question: What is communication cost?
  - Answer: Communication cost refers to the time and resources spent on data exchange between tasks. It affects program performance.
- Question: What is the difference between bandwidth and latency?
  - Answer: Bandwidth is the data transmission rate, while latency is the time delay in data transmission.
---
COMMUNICATION NOTES-
Expanded notes on communication patterns, synchronization, and related concepts:
Communication Patterns:
- Synchronous Communication :
- Requires "handshaking" between tasks.
- Structured explicitly in code by the programmer.
- Involves blocking communications as other tasks wait until communication is
completed.
- Suitable for scenarios where tasks need to coordinate closely.
- Asynchronous Communications :
- Enables tasks to transfer data independently.
- Non-blocking communications allow interleaving computation with
communication.
- Provides flexibility and potential performance improvements.
- Particularly useful when tasks can progress without waiting for
communication to complete.
Scope of Communication:
Synchronization:
Types of Synchronization:
1. Barrier :
- All tasks involved.
- Each task works until reaching the barrier, then blocks.
- Resumes when the last task reaches the barrier.
- Ensures synchronization point before proceeding.
2. Lock / Semaphore :
- Any number of tasks can be involved.
- Typically used to serialize access to global data or code.
- Only one task can own the lock/semaphore at a time.
- Tasks attempt to acquire the lock, waiting if it's owned by another task.
- Can be blocking or non-blocking.
- Effective for managing shared resources.
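The slides describe locks and semaphores in general terms; as one concrete illustration (the POSIX threads API is an assumption here, not something named in the slides), a mutex can serialize access to a shared counter:
```c
#include <pthread.h>
#include <stdio.h>

int counter = 0;                                   /* shared global data */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* the lock */

void *worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* only one thread owns the lock at a time */
        counter++;                    /* serialized access to shared data */
        pthread_mutex_unlock(&lock);  /* release so other threads can proceed */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %d\n", counter);  /* 200000: no lost updates */
    return 0;
}
```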
Cluster Computing:
- Clusters are collections of interconnected independent uni-processor systems.
- High-performance alternative to SMP.
- Benefits include scalability, high availability, and redundancy.
- Cluster middleware provides a unified image, single point of entry, and single
file hierarchy.
Grid Computing:
Cloud Computing:
Supercomputers:
---
Beowulf Cluster:
Networking Concepts:
MPI INTRO:
Introduction to MPI:
---
MPI BASICS-
Basics
MPI Fundamentals
```c
#include <mpi.h>
#include <stdio.h>
```
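As a minimal sketch of the boilerplate these headers support (standard MPI calls, not code taken from the slides), a "hello world" program initializes MPI, queries its rank and the communicator size, and shuts MPI down:
```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);                  /* start the MPI environment */

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of processes */
    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();                          /* shut down MPI */
    return 0;
}
```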
Point-to-Point Communication
- MPI datatypes are similar to C datatypes.
- E.g., int -> MPI_INT, double -> MPI_DOUBLE, char -> MPI_CHAR.
- Complex (derived) datatypes are possible, e.g., a structure datatype.
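A minimal point-to-point sketch using these datatypes (standard MPI_Send/MPI_Recv calls; the payload value and tag are invented for illustration, and the program assumes it is run with at least two processes):
```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, number;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        number = 42;   /* arbitrary payload */
        MPI_Send(&number, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* to rank 1, tag 0 */
    } else if (rank == 1) {
        MPI_Recv(&number, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Rank 1 received %d from rank 0\n", number);
    }

    MPI_Finalize();
    return 0;
}
```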
MPI_Probe
MPI_Probe - Example
```c
// Probe for an incoming message from process 0, tag 0
MPI_Status status;
MPI_Probe(0, 0, MPI_COMM_WORLD, &status);
// Find out how many MPI_INT elements the message contains
int number_amount;
MPI_Get_count(&status, MPI_INT, &number_amount);
// Allocate a buffer of the right size for the message
int *number_buf = (int *)malloc(sizeof(int) * number_amount);
```
MPI BASICS
• Collective operations are used when processes need to communicate with everyone else in the communicator
• Properties of collective operations:
– Must be executed by all processes (of the communicator)
– All processes in group call same operation at (roughly) the
same time
– All collective operations are blocking operations
MPI_Scatterv
- Similar to MPI_Scatter, but each process can receive a different number of elements.
MPI_Gatherv
sendbuf: address of send buffer
sendcount: number of elements in send buffer (integer)
sendtype: datatype of send buffer elements
recvbuf: address of the receive buffer (significant only at root)
recvcounts: integer array (of length group size) containing the number of elements that are to be received from each process (significant only at root)
displs: integer array (of length group size); entry i specifies the displacement relative to recvbuf at which to place data from process i (significant only at root)
recvtype: datatype of receive buffer elements (handle)
root: rank of the receiving process
comm: communicator (handle)
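A hedged sketch of how these arguments fit together (the counts and data are invented for illustration): each rank sends rank+1 integers, and the root gathers them with matching recvcounts and displs.
```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank contributes (rank + 1) integers, all set to its own rank */
    int sendcount = rank + 1;
    int *sendbuf = malloc(sendcount * sizeof(int));
    for (int i = 0; i < sendcount; i++) sendbuf[i] = rank;

    int *recvcounts = NULL, *displs = NULL, *recvbuf = NULL;
    int total = 0;
    if (rank == 0) {                       /* receive arguments are significant only at root */
        recvcounts = malloc(size * sizeof(int));
        displs     = malloc(size * sizeof(int));
        for (int i = 0; i < size; i++) {
            recvcounts[i] = i + 1;         /* how many elements arrive from rank i */
            displs[i]     = total;         /* offset in recvbuf where rank i's data goes */
            total += recvcounts[i];
        }
        recvbuf = malloc(total * sizeof(int));
    }

    MPI_Gatherv(sendbuf, sendcount, MPI_INT,
                recvbuf, recvcounts, displs, MPI_INT,
                0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Root gathered %d elements in total\n", total);

    MPI_Finalize();
    return 0;
}
```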
• MPI_Allgather: similar to MPI_Gather, but the result is available to all processes.
• MPI_Allgatherv: similar to MPI_Gatherv, but the result is available to all processes.
• MPI_Alltoall: similar to MPI_Allgather; each process performs a scatter followed by a gather.
• MPI_Alltoallv: similar to MPI_Alltoall, but messages to different processes can have different lengths.
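For instance, a hedged MPI_Allgather sketch where every rank contributes one value (invented for illustration) and every rank ends up with the full array:
```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int myval = rank * 10;                   /* one made-up value per process */
    int *all = malloc(size * sizeof(int));   /* room for every rank's value */

    /* Unlike MPI_Gather, the gathered array ends up on every rank, not just root */
    MPI_Allgather(&myval, 1, MPI_INT, all, 1, MPI_INT, MPI_COMM_WORLD);

    printf("Rank %d sees: ", rank);
    for (int i = 0; i < size; i++) printf("%d ", all[i]);
    printf("\n");

    free(all);
    MPI_Finalize();
    return 0;
}
```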
ADVANCED-
MPI_BARRIER
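A brief illustrative sketch (standard MPI_Barrier usage, not code from the slides): every process blocks at the barrier until the last one arrives, then all proceed.
```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    printf("Rank %d: before the barrier\n", rank);
    MPI_Barrier(MPI_COMM_WORLD);   /* no process passes until all have arrived */
    printf("Rank %d: after the barrier\n", rank);

    MPI_Finalize();
    return 0;
}
```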
**Example:**
Imagine you and your friends each have a bag of candies, and you want to know
the total number of candies you all have combined. Each friend counts their
candies, and then you add up all the counts to find the total number of candies.
**Code:**
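The code that belongs here is not included in these notes; a minimal sketch matching the candy-counting analogy (assuming a sum reduction with MPI_Reduce is intended, and with made-up counts) could look like:
```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int my_candies = rank + 5;   /* each "friend" counts their own bag (made-up value) */
    int total = 0;

    /* Sum everyone's count; the result lands on the root (rank 0) */
    MPI_Reduce(&my_candies, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Total candies: %d\n", total);

    MPI_Finalize();
    return 0;
}
```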
**Example:**
Imagine you and your friends each have a bag of balls, and you want to find the
total count of balls for each color. You count the number of red balls, your
friend counts the blue balls, and you combine the counts for all colors.
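A hedged sketch of the ball-counting example, assuming it illustrates an element-wise reduction over an array with one slot per colour (the colour count and values are invented):
```c
#include <mpi.h>
#include <stdio.h>

#define NUM_COLOURS 3   /* hypothetical: red, blue, green */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each process counts its own balls of each colour (made-up values) */
    int mine[NUM_COLOURS]  = { rank, rank + 1, rank + 2 };
    int total[NUM_COLOURS] = { 0, 0, 0 };

    /* Element-wise sum across processes: total[i] = sum of mine[i] over all ranks */
    MPI_Reduce(mine, total, NUM_COLOURS, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Totals per colour: %d %d %d\n", total[0], total[1], total[2]);

    MPI_Finalize();
    return 0;
}
```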
---
Performance Analysis
Performance?
- Measuring improvement in computer architectures requires comparing alternative designs.
- A better system is characterized by better performance, but it is essential to pin down what "performance" precisely means.
Computer Performance:
- Evaluating computer performance involves various metrics:
- Clock Speed: Measured in GHz, it dictates the frequency of the clock cycle and
influences processor speed.
- MIPS (Millions of Instructions per Second): Facilitates comparison, but
potential for misinterpretation exists when comparing different instruction sets.
- FLOPS (Floating Point Operations per Second): Offers a reliable measure for
floating-point performance.
- Factors influencing computer performance encompass processor speed, data bus
width, cache size, main memory amount, and interface speed.
Measuring Performance:
- Each processor features a clock that ticks consistently at a regular rate.
- The clock serves to synchronize various components of the system.
- Clock rate is measured in GHz (gigahertz); clock cycle time is its reciprocal.
- For instance, a clock rate of 200 MHz implies the clock ticks 200,000,000 times
per second (Pentium 1, 1995).
Benchmarks:
- Benchmarks are critical tools for evaluating and comparing different systems as
well as assessing modifications to a single system.
- Microbenchmarks focus on specific performance dimensions, such as cache and
memory bandwidth, providing insights into underlying factors.
- Macrobenchmarks evaluate overall application execution time and require an
application suite for comprehensive assessment.
- Notable benchmark suites include SPEC CPU2000 for CPU-intensive
applications, EEMBC for embedded systems, and TPC benchmarks for servers.
Scalability:
- Scalability refers to a system's ability to accommodate increasing problem sizes
or resources.
- Strong scalability means efficiency is maintained as the number of processors increases for a fixed problem size; weak scalability means efficiency is maintained as both the number of processors and the problem size increase proportionally.
Revision of Formulas -
Basic formulas, with definitions and tips to remember each one:
1. Speedup (S):
- Formula: Speedup (S) = Execution Time (Single Processor) / Execution Time
(Multiple Processors)
- Definition: Speedup measures the relative performance improvement gained
by executing a task on multiple processors compared to a single processor.
- Tip: Think of speedup as a ratio. A larger speedup value indicates better
performance. To remember the formula, think of dividing the execution time on
a single processor by the execution time on multiple processors to get the
speedup factor.
2. Efficiency (E):
- Formula: Efficiency (E) = Speedup / Number of Processors
- Definition: Efficiency quantifies how effectively multiple processors are used
to perform a task. It is the ratio of achieved speedup to the number of processors
used.
- Tip: Think of efficiency as a measure of how well resources are utilized.
Higher efficiency values indicate better utilization. Remember the formula by
dividing the speedup by the number of processors.
These tricks and tips should help you remember the formulas and their
meanings more easily. Feel free to associate them with visual or mnemonic aids
for even better recall!
SECTION 2-
1. GFLOPS (GigaFLOPS):
- Formula: GFLOPS = 10^9 Floating Point Operations Per Second
- Explanation: A measure of computing performance, indicating the number of
billions of floating-point operations a computer can perform in one second.
2. MFLOPS (MegaFLOPS):
- Formula: MFLOPS = 10^6 Floating Point Operations Per Second
- Explanation: Similar to GFLOPS but on a smaller scale, MFLOPS measures
the number of millions of floating-point operations a computer can perform in
one second.
3. Amdahl's Law:
- Formula: Max Speedup = 1 / ((1 - F) + F / P)
- Explanation: Amdahl's Law calculates the maximum potential speedup of a
program in which a fraction (F) of the work can be parallelized, when executed on P
processors. It helps in understanding the impact of parallelization on overall
performance.
4. Efficiency:
- Formula: Efficiency = Speedup / Number of Processes
- Explanation: Efficiency measures how well a parallel program utilizes
available resources. It is the ratio of speedup achieved to the number of processes
used.
Section 2.1
Here are the formulas for general speedup, Amdahl's speedup, Gustafson's
speedup, and efficiency, along with a tip to differentiate between them:
1. General Speedup:
- Formula: Speedup = Execution Time (Single Processor) / Execution Time (Multiple Processors)
- Tip: This is the baseline ratio; Amdahl's and Gustafson's laws are specific models of how it behaves.
2. Amdahl's Speedup:
- Formula: Amdahl's Speedup = 1 / ((1 - Fraction Parallelizable) + Fraction
Parallelizable / Number of Processors)
- Tip: Remember Amdahl's Law by focusing on the idea that it quantifies the
potential speedup of a program considering the portion that can be parallelized
and the number of processors. It's about assessing the impact of parallelization
on overall speedup.
3. Gustafson's Speedup:
- Formula: Gustafson's Speedup = (1 - Fraction Parallelizable) + Fraction
Parallelizable * Number of Processors
- Tip: Recall Gustafson's Law by keeping in mind its emphasis on problem size
scaling with the number of processors. It looks at how the parallelizable fraction
of a program and the number of processors contribute to improved
performance.
4. Efficiency:
- Formula: Efficiency = Speedup / Number of Processors
- Tip: Think of efficiency as a measure of how effectively processors are being
utilized. It relates the speedup achieved to the number of processors used.
Higher efficiency indicates better utilization of resources.
Tip for Differentiation: When working with these concepts, it's helpful to
associate Amdahl's Law with the idea of limited parallelization (fraction that can
be parallelized) and Gustafson's Law with the concept of scalable problem size.
Efficiency, on the other hand, relates to how efficiently resources are used in
achieving speedup.
The key difference between Amdahl's and Gustafson's laws: Amdahl's assumes a fixed
problem size, whereas Gustafson's assumes the problem size scales with the number of
processors.
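As a quick sanity check of the formulas above (the 90% parallel fraction and 8 processors are chosen for illustration, not taken from the slides), this small program evaluates both speedups and the corresponding efficiencies:
```c
#include <stdio.h>

int main(void) {
    double f = 0.90;   /* hypothetical parallelizable fraction */
    int    p = 8;      /* hypothetical number of processors */

    double amdahl    = 1.0 / ((1.0 - f) + f / p);   /* fixed problem size  */
    double gustafson = (1.0 - f) + f * p;           /* scaled problem size */

    printf("Amdahl's speedup:    %.2f (efficiency %.1f%%)\n", amdahl, 100.0 * amdahl / p);
    printf("Gustafson's speedup: %.2f (efficiency %.1f%%)\n", gustafson, 100.0 * gustafson / p);
    return 0;
}
```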
p ≈ 0.75575 * 32
p ≈ 24.2016%
The correct percentage of parallelizable code (p) to attain the desired speedup is
approximately 24.2016%, which can be rounded to 24.20% or approximately
24%.
Let's break down the scenario and determine whether it's feasible for the
company, Techideas, to execute the parallel version of their Black-Scholes
application on the cluster with a utilization of 60% or above.
Given information:
- Percentage of parallelizable code (p) = 86.375%
- Number of cores available in the cluster (n) = 8
Calculation (Amdahl's Law):
- Speedup = 1 / ((1 - 0.86375) + 0.86375 / 8) = 1 / 0.24422 ≈ 4.09
- Efficiency = Speedup / n ≈ 4.09 / 8 ≈ 51.2%
Conclusion:
Based on the analysis, the parallel version of the Black-Scholes application will not
achieve the desired utilization of 60% or above on the 8-core cluster: the efficiency
of around 51.2% falls short of the target.
Recommendation:
Mr. Salman Ahmed should suggest to Techideas that running the parallel
application on the cluster with the current resources may not be feasible to
achieve the desired utilization goal. They may need to consider acquiring more
computing resources with a higher number of cores or optimizing the code
further to increase the parallelization and efficiency.
---
Key points on performance analysis in PDC-
- Execution time can be divided into user CPU time, system CPU time, waiting time due to I/O operations, and time sharing.
- The clock cycle synchronizes system components, and clock rate is the reciprocal of cycle time.
- Clock speed, MIPS, and FLOPS contribute to processor performance and efficiency.
- Performance units include Kilo, Mega, Giga, Tera, Peta, Exa, and Zetta, covering speed and capacity.
- Benchmarks are critical tools for evaluating and comparing systems, with microbenchmarks and macrobenchmarks providing insights.
- Strong and weak scalability relate to speedup for fixed and scaled problem sizes, respectively.
- General speedup, Amdahl's speedup, Gustafson's speedup, and efficiency are important performance formulas.
- Gustafson's Law emphasizes scalability with larger problem size and parallel fraction.
- The discussed formulas and concepts help analyze and improve overall system performance.
---