24CSPPC202 MULTICORE ARCHITECTURE AND PROGRAMMING

PART-A

1. Differentiate between Symmetric and Distributed Shared Memory architectures.


Symmetric Memory Architecture:
• All processors share a single, centralized memory with equal access time.
• Communication occurs via shared memory (no explicit message passing needed).
Distributed Shared Memory (DSM) Architecture:
• Memory is distributed across nodes, but it is accessible as a single logical address
space.
• Communication is managed by software or hardware that simulates shared memory
across distributed systems.
2. State Amdahl’s Law and its significance.
Amdahl’s Law states that the maximum speedup of a program using multiple processors
is limited by the sequential portion of the program. It is given by:
Speedup = 1 / ((1 − P) + P/N)
Where:
• P = Proportion of the program that can be parallelized
• N = Number of processors
Significance:
It highlights the diminishing returns of adding more processors, emphasizing the
importance of minimizing the sequential part of a program to achieve better parallel
performance.
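For example, if P = 0.9 (90% of the work can be parallelized) and N = 8, the speedup is 1 / (0.1 + 0.9/8) = 1 / 0.2125 ≈ 4.7, well below the ideal speedup of 8.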
3. List any two synchronization primitives used in parallel programming.
Two synchronization primitives used in parallel programming are:
1. Mutex (Mutual Exclusion) – Ensures that only one thread accesses a critical
section at a time.
2. Semaphore – A signaling mechanism used to control access to a common resource
by multiple threads.
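As a brief illustration (a minimal sketch using POSIX threads; the counter and function names are illustrative, not part of the original answer), a mutex protecting a shared counter looks like:

#include <pthread.h>

int counter = 0;                                   /* shared resource */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* mutex primitive */

void *worker(void *arg) {
    pthread_mutex_lock(&lock);    /* only one thread may enter at a time */
    counter++;                    /* critical section */
    pthread_mutex_unlock(&lock);
    return NULL;
}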
4. Mention two methods to prevent deadlocks in parallel programs.
Two methods to prevent deadlocks in parallel programs are:
1. Resource Ordering – Assign a global order to resource acquisition and ensure all
threads acquire resources in that order.
2. Avoid Holding Multiple Locks – Minimize the use of nested locks or hold locks
for the shortest time possible to reduce the chance of circular wait.
5. What are work-sharing constructs in OpenMP?
Work-sharing constructs in OpenMP are directives that divide the execution of code
among multiple threads, ensuring that work is distributed without duplicating tasks.
Examples include:
1. #pragma omp for – Distributes loop iterations across threads.
2. #pragma omp sections – Assigns different sections of code to different threads.
These constructs help in parallelizing tasks efficiently while maintaining correct program
behavior.
6. What are the different types of memory in the OpenMP memory model?
In the OpenMP memory model, the two main types of memory are:
1. Shared Memory:
o Accessible by all threads in a team.
o Used for sharing data among threads.
2. Private Memory:
o Each thread has its own copy.
o Variables declared as private are not shared and are unique to each thread.
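A minimal sketch (variable names are illustrative) showing both kinds of variables in one parallel region:

#include <omp.h>

int shared_total = 0;   /* shared: one copy, visible to every thread in the team */

void demo(void) {
    #pragma omp parallel shared(shared_total)
    {
        int local_id = omp_get_thread_num();   /* private: declared inside the region,
                                                  so each thread gets its own copy */
        #pragma omp atomic
        shared_total += local_id;              /* updates to shared data need synchronization */
    }
}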
7. Mention one real-world application of N-Body solvers.
Real-world application of N-Body solvers:
Astrophysics and Astronomy – N-Body solvers are used to simulate the gravitational
interactions between celestial bodies, such as modeling galaxy formation, star cluster
dynamics, or predicting planetary orbits.
8. List two challenges in parallelizing an N-Body problem.
Two challenges in parallelizing an N-Body problem are:
1. High Computational Complexity – Each body interacts with every other body,
leading to O(n²) computations, which is difficult to scale efficiently.
2. Load Balancing – Distributing the computational workload evenly among
processors is challenging due to the non-uniform distribution of bodies and
interactions.
9. Differentiate between Point-to-Point and Collective Communication.
Point-to-Point Communication:
• Involves data transfer between two specific processes (e.g., MPI_Send and
MPI_Recv).
• Used for direct and explicit message passing.
Collective Communication:
• Involves data exchange among a group of processes (e.g., MPI_Bcast, MPI_Reduce).
• Used for broadcasting, gathering, or reducing data across multiple processes
simultaneously.
10. What is the role of MPI_Send and MPI_Recv?
Role of MPI_Send and MPI_Recv:
• MPI_Send: Used to send a message from one process to another. It initiates point-to-
point communication by transmitting data.
• MPI_Recv: Used to receive a message sent by another process. It completes the
communication by accepting incoming data.
Together, they enable direct communication between two MPI processes in parallel
programs.
PART-B
11.B. Explain parallel program design methodologies for multicore processors.
Parallel program design for multicore processors involves dividing a program's
workload into smaller, independent chunks that can be executed concurrently by multiple
CPU cores. This approach, known as parallel processing, aims to reduce execution time
and improve overall performance.
Here's a breakdown of key methodologies:
1. Task Parallelism:
• Concept:
Dividing the program's logic into independent tasks and assigning each task to a different
core.
• Example:
A web server handling multiple client requests concurrently, with each request handled
by a separate thread or process.
• Benefits:
Good for applications with many distinct, independent operations that can be performed
at the same time.
• Tools:
Languages like Java and .NET offer built-in threading and task management features for
task parallelism.
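Although the answer above cites Java/.NET tooling, a comparable sketch in C with OpenMP sections (purely illustrative) looks like:

#include <omp.h>
#include <stdio.h>

int main(void) {
    /* two independent tasks, each executed by a different thread */
    #pragma omp parallel sections
    {
        #pragma omp section
        printf("Task A handled by thread %d\n", omp_get_thread_num());

        #pragma omp section
        printf("Task B handled by thread %d\n", omp_get_thread_num());
    }
    return 0;
}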

2. Data Parallelism:
• Concept:
Splitting large datasets into smaller chunks and assigning each chunk to a different core
for processing.
• Example:
Processing a large image by dividing it into tiles and assigning each tile to a different
core.
• Benefits:
Simple to implement and can significantly speed up computations on large datasets.
• Tools:
Libraries like OpenMP (for C/C++) and MPI (for more complex distributed systems) can
facilitate data parallelism.
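A minimal data-parallel sketch with OpenMP (the array name and size are illustrative):

#include <omp.h>

#define N 1000000

void scale(double *data) {
    /* the loop iterations (chunks of the dataset) are divided among the cores */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        data[i] *= 2.0;
    }
}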

3. Process Parallelism (or Distributed Computing):


• Concept:
Distributing tasks across multiple machines or nodes within a network.
• Example:
Using a cluster of computers to process a massive dataset or a large-scale simulation.
• Benefits:
Extends parallel processing to handle extremely large workloads that exceed the capacity
of a single machine.
• Tools:
MPI and cloud-based platforms (e.g., Amazon Web Services, Google Cloud Platform,
Microsoft Azure) are commonly used for distributed computing.
4. Hybrid Parallelism:
• Concept:
Combining different parallelism strategies, often data and task parallelism, for optimal
performance.
• Example:
Using data parallelism to process large datasets on a single machine, and then task
parallelism to distribute different stages of the data processing pipeline across multiple
machines.
• Benefits:
Allows for a more flexible and efficient use of resources, especially in complex
applications.
5. Multithreading:
• Concept:
Creating multiple threads within a single process, allowing for parallel execution of
different parts of the process on different cores.
• Example:
A web browser that downloads images in the background while displaying the main page.
• Benefits:
Can significantly improve responsiveness and efficiency in applications that perform
multiple operations simultaneously.
• Tools:
Programming languages often provide built-in threading APIs (e.g., POSIX Threads).
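A minimal POSIX Threads sketch (the thread count and function name are illustrative):

#include <pthread.h>
#include <stdio.h>

void *background_work(void *arg) {
    printf("Worker %ld running alongside the main thread\n", *(long *)arg);
    return NULL;
}

int main(void) {
    pthread_t threads[4];
    long ids[4];
    for (long i = 0; i < 4; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, background_work, &ids[i]);
    }
    for (int i = 0; i < 4; i++)
        pthread_join(threads[i], NULL);   /* wait for all workers to finish */
    return 0;
}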

Considerations for Parallel Program Design:


• Overhead:
Parallelism can introduce overhead (e.g., thread creation, synchronization) that can
negate the performance benefits.
• Data Sharing:
When multiple cores need to access and modify shared data, careful synchronization
mechanisms are needed to avoid race conditions and data corruption.
• Communication:
For distributed systems, efficient communication and data transfer between nodes is
crucial.
• Load Balancing:
Ensuring that tasks are distributed evenly across cores or nodes to avoid idle cores or
bottlenecks.
By carefully considering these methodologies and design factors, programmers can
effectively leverage the power of multicore processors to create fast and efficient parallel
applications.

12.B Discuss deadlocks and livelocks in parallel programming. Explain various techniques
to prevent them.

Deadlocks and livelocks are concurrency problems that can occur in parallel programming,
hindering progress. Deadlocks are situations where two or more processes are blocked
indefinitely, waiting for each other to release resources, while livelocks involve processes
repeatedly changing their state in response to each other, but without making any actual
progress. Several techniques can be employed to prevent these issues.

Deadlocks

Deadlocks occur when two or more processes are stuck in a circular wait, each holding a
resource that the other needs to proceed. This can happen when processes acquire locks in a non-
consistent order, hold and wait for other resources, and when resources cannot be preempted.
Techniques to Prevent Deadlocks:

• Order of Resource Acquisition: Enforce a consistent order for acquiring locks or resources to prevent circular waits (see the sketch after this list).

• Allow Preemption: Permit a resource to be preempted (taken back) if a process is unable to acquire another required resource within a certain time, breaking the no-preemption condition.

• Avoid Hold and Wait: Ensure that a process either acquires all resources it needs at once or releases all held resources before waiting for another.

• Minimize Resource Sharing: Reduce the number of shared resources to minimize the possibility of contention.

• Use Concurrent Collections: Utilize concurrent collections such as ConcurrentHashMap instead of HashMap (in Java) to avoid unnecessary locking.

• Timeout Mechanisms: Implement tryLock with a timeout to avoid infinite waiting and potential deadlocks.
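A sketch of the first technique, consistent lock ordering, using POSIX threads (lock and function names are illustrative):

#include <pthread.h>

pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

/* Every thread acquires lock_a before lock_b, so a circular wait
   (one thread holding lock_b while waiting for lock_a) cannot form. */
void transfer(void) {
    pthread_mutex_lock(&lock_a);
    pthread_mutex_lock(&lock_b);
    /* ... work on both shared resources ... */
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
}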

Livelocks

Livelocks are similar to deadlocks, but the processes are not blocked: they keep changing their
state in response to one another to avoid a conflict, yet are unable to make progress. This often
happens when processes repeatedly retry operations that will always fail.

Techniques to Prevent Livelocks:

• Random Backoff: Introduce random delays before retrying operations to avoid synchronized retry attempts (see the sketch after this list).

• Adjust Thread Priorities: Dynamically adjust thread priorities to allow some threads to make progress.

• Semaphores: Use semaphores to control concurrency and limit the number of processes accessing shared resources.

• Proper Transaction Design: Ensure that transactions are designed to handle potential failures and avoid infinite loops.

• Avoid Unnecessary Retries: Refactor code to reduce the need for frequent retries that can contribute to livelocks.
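A sketch of random backoff in C with POSIX threads (the retry loop and delay bound are illustrative):

#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

pthread_mutex_t res = PTHREAD_MUTEX_INITIALIZER;

void acquire_with_backoff(void) {
    /* keep trying, but sleep for a random interval between attempts so that
       competing threads do not keep retrying in lockstep */
    while (pthread_mutex_trylock(&res) != 0) {
        usleep(rand() % 1000);   /* back off for 0-999 microseconds */
    }
    /* ... use the resource ... */
    pthread_mutex_unlock(&res);
}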

13.A) Describe different OpenMP directives and their usage with examples.

Different OpenMP Directives and Their Usage


OpenMP (Open Multi-Processing) is a widely used API for parallel programming in shared
memory systems. It uses compiler directives (pragmas) to specify parallel regions and work-
sharing constructs.

1. parallel Directive

• Purpose: Defines a parallel region where a team of threads executes the enclosed code
block concurrently.

• Syntax:

#pragma omp parallel

{
// Code executed by multiple threads
}

Example:

#pragma omp parallel

printf("Hello from thread %d\n", omp_get_thread_num());

Usage: To start parallel execution by multiple threads.

2. for Directive
• Purpose: Distributes loop iterations among threads in a parallel region.

• Syntax:

#pragma omp for
for (int i = 0; i < N; i++) {
    // Loop body
}

Example:

#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < 100; i++) {
        array[i] = i * i;
    }
}

Usage: For parallelizing loops, improving performance by dividing iterations.


3. sections Directive

• Purpose: Splits the enclosed code into sections, each executed by a different thread.

• Syntax:

#pragma omp sections
{
    #pragma omp section
    // Section 1 code

    #pragma omp section
    // Section 2 code
}

Example:

#pragma omp parallel sections
{
    #pragma omp section
    printf("Section 1 executed by thread %d\n", omp_get_thread_num());

    #pragma omp section
    printf("Section 2 executed by thread %d\n", omp_get_thread_num());
}

• Usage: When different tasks can be executed concurrently.


4. single Directive

• Purpose: Specifies a block of code that should be executed by only one thread in a
parallel region.

• Syntax:
#pragma omp single
{
    // Code executed by one thread
}

Example:

#pragma omp parallel
{
    #pragma omp single
    {
        printf("Executed by one thread only\n");
    }
}

Usage: For initialization or I/O operations that must be done once.

5. critical Directive

• Purpose: Ensures mutual exclusion for a block of code, preventing race conditions.

• Syntax:

#pragma omp critical
{
    // Critical section
}

Example:

int sum = 0;

#pragma omp parallel for


for (int i = 0; i < N; i++) {

#pragma omp critical

sum += array[i];
}

Usage: To protect updates to shared variables.

6. barrier Directive

• Purpose: Synchronizes all threads in a team; threads wait until all have reached the
barrier.

• Syntax:

#pragma omp barrier

Example:
#pragma omp parallel
{
    // Some computation
    #pragma omp barrier
    // Code executed after all threads reach the barrier
}

Usage: To enforce synchronization points.

7. master Directive

• Purpose: Specifies a block of code to be executed only by the master thread (thread 0).
• Syntax:
#pragma omp master
{
    // Code executed by the master thread only
}

Example:

#pragma omp parallel
{
    #pragma omp master
    {
        printf("Master thread ID: %d\n", omp_get_thread_num());
    }
}

Usage: For operations specific to the master thread without synchronization overhead.

8. atomic Directive

• Purpose: Ensures a specific memory update operation is atomic, preventing race


conditions on simple updates.

• Syntax:
#pragma omp atomic

shared_var++;

Example:

int count = 0;

#pragma omp parallel for

for (int i = 0; i < N; i++) {

#pragma omp atomic

count++;

}
Usage: For efficient synchronization on simple operations like increments.
9. task Directive

• Purpose: Defines a unit of work that can be executed asynchronously by a thread.

• Syntax:

#pragma omp task
{
    // Task code
}

Example:
#pragma omp parallel
{
    #pragma omp single
    {
        #pragma omp task
        printf("Task 1 executed by thread %d\n", omp_get_thread_num());

        #pragma omp task
        printf("Task 2 executed by thread %d\n", omp_get_thread_num());
    }
}

• Usage: For dynamic task parallelism, useful in irregular computations.

14.A.) Describe different MPI constructs and libraries used for distributed memory
programming.

MPI Constructs and Libraries in Distributed Memory Programming

Introduction to MPI

MPI (Message Passing Interface) is the de facto standard for distributed memory parallel
programming. It allows processes running on different nodes (with separate memory) to
communicate by sending and receiving messages explicitly.

MPI programs typically use multiple processes (not threads) and require explicit coordination.

1. Basic MPI Constructs

a) MPI_Init and MPI_Finalize

• Purpose: Initialize and terminate the MPI environment.

• Usage:

MPI_Init(&argc, &argv);
// ... MPI calls ...

MPI_Finalize();

b) MPI_Comm_size and MPI_Comm_rank

• Purpose: Get the number of processes and the rank (ID) of each process.

• Usage:

int size, rank;

MPI_Comm_size(MPI_COMM_WORLD, &size);

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
c) MPI_Send and MPI_Recv

• Purpose: Point-to-point communication between processes.

• Usage:

MPI_Send(&data, 1, MPI_INT, dest_rank, tag, MPI_COMM_WORLD);

MPI_Recv(&data, 1, MPI_INT, source_rank, tag, MPI_COMM_WORLD, &status);

2. Collective Communication Constructs

a) MPI_Bcast

• Broadcast: One process sends data to all others.

MPI_Bcast(&data, 1, MPI_INT, root_rank, MPI_COMM_WORLD);

b) MPI_Scatter

• Distributes different portions of an array to different processes.


MPI_Scatter(sendbuf, 1, MPI_INT, &recvbuf, 1, MPI_INT, root_rank, MPI_COMM_WORLD);

c) MPI_Gather

• Collects data from all processes into one process.

MPI_Gather(&sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, root_rank, MPI_COMM_WORLD);


d) MPI_Reduce

• Applies a reduction operation (e.g., sum, max) across processes.

MPI_Reduce(&sendbuf, &result, 1, MPI_INT, MPI_SUM, root_rank, MPI_COMM_WORLD);

e) MPI_Allreduce
• Like MPI_Reduce but result is distributed to all processes.

MPI_Allreduce(&sendbuf, &result, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

3. Synchronization Constructs

a) MPI_Barrier

• Synchronizes all processes in a communicator.

MPI_Barrier(MPI_COMM_WORLD);

4. Derived Data Types

MPI allows defining custom data types for sending non-contiguous data.
a) MPI_Type_create_struct

• Create a complex structure to send/receive.

• Useful for structured data like records or arrays of structs.
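A brief sketch of building such a type for a simple record (the Particle struct is illustrative, not from the original answer):

#include <mpi.h>
#include <stddef.h>

typedef struct {
    int    id;
    double position[3];
} Particle;

void build_particle_type(MPI_Datatype *particle_type) {
    int          blocklengths[2]  = {1, 3};
    MPI_Aint     displacements[2] = {offsetof(Particle, id),
                                     offsetof(Particle, position)};
    MPI_Datatype types[2]         = {MPI_INT, MPI_DOUBLE};

    MPI_Type_create_struct(2, blocklengths, displacements, types, particle_type);
    MPI_Type_commit(particle_type);   /* the type can now be used in MPI_Send/MPI_Recv */
}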

5. Communicators and Groups

a) MPI_COMM_WORLD

• The default communicator including all processes.

b) MPI_Comm_split

• Used to divide the global communicator into subgroups for logical task grouping.

MPI_Comm_split(MPI_COMM_WORLD, color, rank, &new_comm);

6. MPI Libraries
a) MPICH

• One of the original and widely used implementations of MPI.

• Focuses on performance and portability.

b) OpenMPI
• Open-source implementation used in many HPC environments.

• Provides robust support for different network fabrics.

c) Intel MPI

• Optimized for Intel architectures, often used in HPC clusters.


7. Example: Simple MPI Program

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    int rank, size, data = 10, recv_data;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&recv_data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process 1 received data %d\n", recv_data);
    }

    MPI_Finalize();
    return 0;
}
15.B. Discuss OpenMP and MPI implementations with a case study. Compare their
advantages and limitations.

OpenMP and MPI Implementations with a Case Study: Comparison of Advantages and
Limitations
1. Introduction to OpenMP and MPI

OpenMP (Open Multi-Processing)

• A parallel programming API for shared-memory systems.

• Uses compiler directives (#pragma), simple to implement.

MPI (Message Passing Interface)

• A standard for parallel programming in distributed-memory environments.


• Requires explicit message passing between processes.

2. Case Study: Matrix Multiplication


Let’s consider dense matrix multiplication C = A × B as a case study to compare OpenMP and
MPI implementations.

3. OpenMP Implementation (Shared Memory)

Approach:

• Parallelize the outer loop using #pragma omp parallel for.

• All threads share the same address space.

Code Snippet:

int i, j, k;
#pragma omp parallel for private(j, k)
for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        for (k = 0; k < N; k++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}

Characteristics:

• No need to manage memory distribution.

• Threads work on different rows of matrix C.


4. MPI Implementation (Distributed Memory)

Approach:

• Divide matrix rows among MPI processes.

• Broadcast matrix B to all processes.

• Each process computes its part of matrix C and sends it back to root.

Code Snippet:

MPI_Scatter(A, ..., local_A, ..., 0, MPI_COMM_WORLD);
MPI_Bcast(B, ..., MPI_COMM_WORLD);

for (i = 0; i < local_N; i++) {
    for (j = 0; j < N; j++) {
        for (k = 0; k < N; k++) {
            local_C[i][j] += local_A[i][k] * B[k][j];
        }
    }
}

MPI_Gather(local_C, ..., C, ..., 0, MPI_COMM_WORLD);

Characteristics:

• Data explicitly distributed across nodes.

• Uses collective communication for performance.

5. Performance Comparison

Metric               OpenMP                                MPI
Execution Time       Faster on a single node               Scales better across multiple nodes
Scalability          Limited to one node (shared memory)   Scales across cluster nodes
Communication        Implicit (shared memory)              Explicit (message passing)
Memory Usage         Efficient due to shared memory        Requires memory duplication
Development Effort   Easier to code and debug              Complex, with explicit synchronization
Synchronization      Handled by the OpenMP runtime         Manually managed using barriers

6. Advantages and Limitations

OpenMP Advantages

• Easy to implement and debug.

• Minimal code change to parallelize loops.

• Best suited for small-to-medium scale problems on shared memory machines.

OpenMP Limitations

• Not scalable beyond the physical cores of a machine.

• Shared memory may become a bottleneck due to bandwidth or contention.

MPI Advantages

• Highly scalable on supercomputers or clusters.

• Explicit control over data locality and memory.

• Suitable for large-scale scientific simulations.

MPI Limitations

• More complex code with manual memory and message management.

• High communication overhead if not optimized.

• Debugging parallel errors (like deadlocks) is more difficult.

7. Hybrid Model (MPI + OpenMP)

Approach:

• Use MPI across nodes and OpenMP within each node.


• Achieves balance between scalability and programming effort.
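A minimal hybrid sketch (the requested thread-support level and the printed message are illustrative):

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank;
    /* request thread support so OpenMP threads can coexist with MPI */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* MPI distributes work across nodes; OpenMP parallelizes within each node */
    #pragma omp parallel
    printf("Rank %d, thread %d\n", rank, omp_get_thread_num());

    MPI_Finalize();
    return 0;
}

With a gcc-based MPI toolchain this would typically be built with something like mpicc -fopenmp and launched with mpirun.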

8. Conclusion

• OpenMP is ideal for simpler shared-memory systems with fewer cores.

• MPI is necessary for large-scale, distributed simulations.

• For maximum flexibility and scalability, hybrid models (MPI + OpenMP) are the most
effective.
