
Quiz Prep

The document discusses various parallel computer memory architectures, including Shared Memory, Distributed Memory, and Hybrid Distributed-Shared Memory architectures, detailing their characteristics, advantages, and disadvantages. It also covers types of parallelism such as Task, Data, Pipeline, and Hybrid Parallelism, along with real-world applications and performance considerations. Additionally, it introduces tools for parallel programming and compares different programming models used in parallel computing.


LECTURE: 03

Parallel Computer Memory Architectures

Parallel computing systems use different memory architectures to allow multiple processors to work
together efficiently. These architectures determine how memory is accessed and shared among
processors. The main types are:

1. Shared Memory Architecture

2. Distributed Memory Architecture

3. Hybrid Distributed-Shared Memory Architecture

1. Shared Memory Architecture

In this type, all processors share the same memory space. Any changes made by one processor to
memory are visible to all other processors.

General Characteristics:

 All processors can access memory as a global address space (common memory).

 Multiple processors operate independently but share the memory.

 If one processor updates a value in memory, all others see the updated value.

 There are two types of shared memory architectures:

o Uniform Memory Access (UMA)

o Non-Uniform Memory Access (NUMA)

Example:

Imagine a group of people working on a shared Google Docs file. When one person makes changes,
everyone sees the update immediately.

1.1 Uniform Memory Access (UMA)

 All processors have equal access time to memory.

 Each processor has the same priority for accessing memory.

 Used in Symmetric Multiprocessors (SMPs).

 A special feature is cache coherency (if one processor updates memory, others are
immediately informed).

Example of UMA:

Think of a classroom where all students have equal access to the teacher’s notes on the board. No
one gets faster or slower access.
1.2 Non-Uniform Memory Access (NUMA)

 Used when multiple SMP machines are linked together.

 One SMP can directly access the memory of another SMP.

 Some processors have faster access to certain memory regions, while others may experience
slower access times.

 If cache coherency is maintained, it is called CC-NUMA (Cache Coherent NUMA).

Example of NUMA:

Think of a university with different libraries. A student near their department’s library can get books
faster than one accessing a distant library.

2. Distributed Memory Architecture

Here, each processor has its own local memory, and processors communicate through a network.
There is no global memory space.

General Characteristics:

 Each processor has its own memory and operates independently.

 Changes made by one processor do not affect other processors.

 Processors exchange data by sending messages over a network.

 No need for cache coherence because memory is not shared.

Example of Distributed Memory:

Think of a company with different departments. Each department has its own documents. If one
department needs information from another, they must request it explicitly instead of accessing it
directly.

3. Hybrid Distributed-Shared Memory Architecture

This is a combination of shared memory and distributed memory. The fastest and most powerful
computers today use this model.

General Characteristics:

 The shared memory component consists of SMP machines or GPUs (Graphics Processing
Units).

 The distributed memory component is a network of multiple shared memory machines.

 Processors within a machine share memory, but machines must communicate over a
network.

 This architecture is expected to dominate high-performance computing in the future.

Example of Hybrid Memory:


A multinational company has offices in different countries. Employees in the same office share
resources (shared memory), but if they need data from another office, they must communicate over
the internet (distributed memory).

SHARED MEMORY

Advantages:
✔ Easy to program because all processors access the same memory.
✔ Fast data sharing as no extra communication is required.

Disadvantages:
✘ Scalability issues – Adding more processors can slow performance due to memory access congestion.
✘ Synchronization problems – Programmers must ensure that processors do not overwrite shared data.

DISTRIBUTED MEMORY

Advantages:
✔ Scalability – Adding more processors increases memory size proportionally.
✔ No interference – Each processor accesses its own memory without slowing down others.
✔ Cost-effective – Uses off-the-shelf components like standard processors and network connections.

Disadvantages:
✘ Complex programming – Programmers must manually handle data communication between processors.
✘ Data mapping challenges – It is hard to convert some existing programs (designed for shared memory) to distributed memory systems.
✘ Non-uniform memory access times – Accessing data from a remote processor is slower than accessing local memory.

HYBRID

Advantages:
✔ Combines the best of both worlds – shared memory for fast local communication, distributed memory for scalability.
✔ Highly scalable – Large systems can handle huge amounts of data.

Disadvantages:
✘ Programming complexity – Developers must manage both shared and distributed memory operations.

UNIFORM MEMORY ACCESS (UMA)

 All processors have equal access time to memory.
 Each processor has the same priority for accessing memory.
 Used in Symmetric Multiprocessors (SMPs).
 A special feature is cache coherency (if one processor updates memory, others are immediately informed).

NON-UNIFORM MEMORY ACCESS (NUMA)

 Used when multiple SMP machines are linked together.
 One SMP can directly access the memory of another SMP.
 Some processors have faster access to certain memory regions, while others may experience slower access times.
 If cache coherency is maintained, it is called CC-NUMA (Cache Coherent NUMA).

LECTURE: 04
Why Parallelism?

 Increasing demand for computational power: Modern applications (like deep learning and
simulations) require more processing power than a single processor can provide.

 Limitations of sequential processing: One processor doing tasks one after the other can be
too slow for large, complex problems.

 Performance improvements: Running tasks simultaneously can speed up the overall computation.

 Energy efficiency: Often, splitting work across multiple processors can be more energy
efficient than one processor working hard all the time.

Types of Parallelism

There are four main approaches to parallelism:

1. Task Parallelism

2. Data Parallelism

3. Pipeline Parallelism

4. Hybrid Parallelism

1. Task Parallelism
Definition:

Different tasks (or different pieces of a program) are executed concurrently, meaning that separate
threads or processors perform different operations at the same time.

Examples:

 Web Server: One thread handles client requests while another processes database queries.

 Video Processing: One thread extracts frames from a video while another thread
compresses the frames concurrently.

Programming Models:

 OpenMP: Uses directives like #pragma omp parallel sections in C/C++.

 Pthreads: In C/C++, threads are created with functions like pthread_create().

 Intel TBB: A C++ library that supports task-based parallelism.

Purpose:

Task parallelism is useful when a program consists of distinct tasks that can run independently. It
helps utilize multiple cores by splitting the work into different, concurrent operations.
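As an illustration (not from the lecture), here is a minimal OpenMP sketch of task parallelism: two placeholder functions stand in for distinct tasks such as handling requests and compressing frames, and each omp section runs as a separate task. Compile with a command such as gcc -fopenmp tasks.c (file name assumed).

#include <stdio.h>
#include <omp.h>

/* Placeholder tasks (hypothetical): in a real program these could be
   "handle client requests" and "compress video frames".              */
void handle_requests(void) { printf("handling requests on thread %d\n", omp_get_thread_num()); }
void compress_frames(void) { printf("compressing frames on thread %d\n", omp_get_thread_num()); }

int main(void) {
    /* Each section is a distinct task; OpenMP runs the sections on
       different threads when enough cores are available.             */
    #pragma omp parallel sections
    {
        #pragma omp section
        handle_requests();

        #pragma omp section
        compress_frames();
    }
    return 0;
}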

2. Data Parallelism
Definition:

The same operation is applied to different pieces of a large dataset at the same time. Instead of
dividing the program into different tasks, you divide the data among multiple processors.

Examples:

 Matrix Multiplication: Different cores compute parts of the resulting matrix simultaneously.

 Image Processing: Applying a filter to many pixels at the same time on a large image.

Programming Models:

 CUDA: For NVIDIA GPUs, which can run thousands of threads in parallel.

 OpenACC: A high-level model to express parallelism on GPUs.

 MPI (Message Passing Interface): Enables parallel processing across nodes in a cluster.

Purpose:

Data parallelism is ideal when you have large amounts of similar data (like images or numerical data)
that can be processed in parallel, which speeds up the computation significantly.
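As a hedged sketch of data parallelism (not from the lecture), the CUDA kernel below applies the same operation, element-wise vector addition, to many array elements at once; the kernel name, the block size of 256, and the device pointers d_a, d_b, d_c are illustrative assumptions.

/* Minimal CUDA data-parallel sketch (compile with nvcc): every thread
   applies the same operation to a different element of the arrays.    */
__global__ void vector_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global index of this thread   */
    if (i < n)                                       /* guard against surplus threads */
        c[i] = a[i] + b[i];
}

/* Launch (host side): enough 256-thread blocks to cover n elements,
   assuming d_a, d_b, d_c already point to arrays in GPU memory:       */
/* vector_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);             */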

3. Pipeline Parallelism
Definition:

Tasks are divided into sequential stages (like a production line). Each stage processes a part of the
data and then passes it to the next stage.

Examples:

 CPU Instruction Pipeline: Typical stages include Fetch, Decode, Execute, and Writeback.

 AI Model Training: One stage might load data, the next stage processes it, and another stage
trains the model.

Programming Models:

 Streaming Architectures: Often used in both CPUs and GPUs.

 Deep Learning Frameworks (TensorFlow, PyTorch): These can implement pipelined training
where different layers or stages are processed in a sequence.

Purpose:

Pipeline parallelism helps in organizing tasks that naturally follow a sequential process, ensuring each
stage can work concurrently on different parts of the data, thereby improving throughput.
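A minimal sketch of the pipeline idea (illustrative only, not from the lecture): two OpenMP threads form a two-stage pipeline with double buffering, so that while one thread processes chunk i, the other loads chunk i+1. The load and process functions here are placeholder stages.

#include <stdio.h>
#include <omp.h>

#define CHUNK   4
#define NCHUNKS 6

/* Stage 1 stand-in: "load" chunk c into a buffer. */
static void load(double *buf, int c) {
    for (int i = 0; i < CHUNK; i++) buf[i] = c * CHUNK + i;
}

/* Stage 2 stand-in: "process" the chunk (here just sum and print it). */
static void process(const double *buf, int c) {
    double s = 0;
    for (int i = 0; i < CHUNK; i++) s += buf[i];
    printf("chunk %d -> sum %.1f\n", c, s);
}

int main(void) {
    double buf[2][CHUNK];              /* two buffers so the stages never touch the same one */

    omp_set_num_threads(2);
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        for (int step = 0; step <= NCHUNKS; step++) {
            if (tid == 0 && step < NCHUNKS)
                load(buf[step % 2], step);                 /* stage 1 works on chunk step   */
            if (tid == 1 && step > 0)
                process(buf[(step - 1) % 2], step - 1);    /* stage 2 works on chunk step-1 */
            #pragma omp barrier                            /* hand the finished chunk over  */
        }
    }
    return 0;
}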

4. Hybrid Parallelism
Definition:

A combination of multiple parallelism techniques, often mixing data parallelism and task parallelism, or incorporating pipeline parallelism as well.

Examples:

 MPI + OpenMP: MPI distributes tasks across different nodes (data parallelism), while
OpenMP handles multi-threading within each node (task parallelism).

 Deep Learning Training: It might use data parallelism to distribute training data across
multiple GPUs and pipeline parallelism to split the model layers across GPUs.

Programming Models:

 Hybrid MPI-OpenMP: Combines the strengths of both message passing (for inter-node
communication) and shared memory multi-threading (within a node).

 Deep Learning Frameworks: Such as TensorFlow, PyTorch, or Horovod, which are designed to
handle large-scale distributed training.

Purpose:

Hybrid parallelism is used in the most complex and high-performance systems (like supercomputers)
to take advantage of the strengths of each individual approach and overcome their limitations.
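A minimal hybrid MPI + OpenMP sketch (illustrative assumption: a simple sum over an index range): MPI splits the range across processes (the distributed part), and OpenMP threads share the work inside each process (the shared-memory part). Build with something like mpicc -fopenmp hybrid.c and run with mpirun -np 4 ./a.out.

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

#define N 1000000

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each MPI process takes a contiguous slice of the index range (distributed part). */
    long lo = (long)N * rank / size;
    long hi = (long)N * (rank + 1) / size;

    /* Threads inside the process share the slice (shared-memory part). */
    double local = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (long i = lo; i < hi; i++)
        local += (double)i;

    /* Combine the per-process results over the network. */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %.0f\n", total);

    MPI_Finalize();
    return 0;
}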

Task vs. Data Parallelism

 Task Parallelism:

o What it does: Runs different tasks (each doing a unique job) at the same time.

o Example: A web server processing various requests simultaneously.

 Data Parallelism:

o What it does: Runs the same operation on different parts of a dataset simultaneously.

o Example: Processing different segments of a large image or dataset concurrently.

Key Differences:

 Communication:

o In task parallelism, tasks may communicate and share results, often requiring careful
synchronization.

o In data parallelism, each processor works on its own piece of the data, reducing the
need for constant communication.

 Application Suitability:
o Task parallelism is great for applications with different, independent operations.

o Data parallelism is ideal for applications that involve repetitive processing over large
datasets.

Real-World Applications

 Task Parallelism:

o Used in web servers and game engines where different tasks run concurrently.

 Data Parallelism:

o Common in image processing and deep learning where the same operations are
applied to large datasets.

 Pipeline Parallelism:

o Used in AI model inference and CPU instruction processing.

 Hybrid Parallelism:

o Found in supercomputers and high-end deep learning applications, where both task
and data parallelism are needed for optimal performance.

Performance Considerations

 Load Balancing: Ensuring that each processor gets an equal amount of work.

 Synchronization Overhead: The extra time spent coordinating between tasks or processors
(Minimize waiting time between tasks).

 Scalability Issues: Some approaches may not scale well when more processors or larger
datasets are involved (Choose the right model for large-scale execution).

 Choosing the Right Model: The application and available hardware will determine which
type of parallelism is most effective.

LECTURE: 05
Tools for Parallel Programming

Parallel programming tools distribute tasks across multiple cores, processors, or GPUs. These tools
can be classified into different categories:

1. Shared Memory Parallelism

 OpenMP: Uses compiler directives for parallelizing loops (#pragma omp parallel).

 Intel TBB: Task-based parallelism for multi-core CPUs.

2. Distributed Memory Parallelism


 MPI (Message Passing Interface): Used in clusters and supercomputers for explicit message
passing.

 Charm++: Object-oriented parallel framework that supports message-driven execution.

3. GPU-Based Parallelism

 CUDA (Compute Unified Device Architecture): NVIDIA’s framework for executing thousands
of GPU threads.

 OpenACC (Open Accelerators): Simplifies GPU programming using compiler directives.

Detailed Overview of Some Parallel Programming Tools

MPI (Message Passing Interface)

 Designed for: Distributed memory systems.

 Usage: Supercomputers and clusters.

 How it works: Explicit communication via message passing.

CUDA (Compute Unified Device Architecture)

 Designed for: NVIDIA GPUs.

 Usage: Executes thousands of threads in parallel.

 Example: Deep learning, image processing.

OpenACC (Open Accelerators)

 Simplifies GPU programming: Uses compiler directives.

 Similar to OpenMP but for GPUs.

Intel Threading Building Blocks (TBB)

 Supports task-based parallelism.

 Example: Multi-threaded applications.

HPX (High-Performance ParalleX)

 Asynchronous parallel programming framework.

 Based on C++ and designed for scalability.

Charm++

 Object-oriented parallel programming framework.

 Uses a message-driven execution model.

Apache Spark

 Used for big data processing and parallel computations.

 Implements Resilient Distributed Datasets (RDDs).


Performance Considerations in Parallel Computing

 Load Balancing: Ensuring all processors do an equal amount of work.

 Scalability: How well the system performs as the number of processors increases.

 Synchronization Overhead: Minimizing the time spent waiting for resources.

 Communication Overhead: Reducing delays caused by message passing.

Comparison of Parallel Programming Tools

Tool – Type – Best For

OpenMP – Shared Memory – Multi-threaded applications

MPI – Distributed Memory – Supercomputers, clusters

CUDA – GPU-Based – High-performance computing (HPC)

OpenACC – GPU-Based – Simplifying GPU programming

TBB – Shared Memory – Task-based parallelism

HPX – Distributed Memory – Asynchronous parallel computing

Charm++ – Distributed Memory – Message-driven execution

Spark – Distributed Computing – Big data processing

LECTURE: 06
1. SISD (Single Instruction, Single Data)

 What it means: A single processor executes one instruction at a time on one piece of data.

 Example: A standard desktop computer with a single-core processor.

 Where it is used: Simple applications like text editing, browsing, and basic calculations.

🔹 Real-World Example: A calculator – it performs one operation (instruction) on one number (data)
at a time.

2. SIMD (Single Instruction, Multiple Data)

 What it means: The same instruction is applied to multiple pieces of data at once.

 Example: A Graphics Processing Unit (GPU) processes multiple pixels at the same time.

 Where it is used:

o Image processing – applying a filter to an image.


o Machine learning – matrix multiplications in AI models.

🔹 Real-World Example: Traffic lights in a city – one instruction (change light color) is applied to
multiple locations (intersections) at the same time.

3. MISD (Multiple Instruction, Single Data)

 What it means: Different instructions operate on the same data.

 Example: Used in systems that require high reliability, like fault-tolerant aerospace systems.

 Where it is used:

o Missile guidance systems – multiple calculations check the same data to ensure
accuracy.

o Medical monitoring – different sensors analyze a patient’s vital signs from one input.

🔹 Real-World Example: Security screening at an airport – different scanning machines (instructions) analyze the same bag (data) in different ways (X-ray, infrared, etc.).

4. MIMD (Multiple Instruction, Multiple Data)

 What it means: Different processors execute different instructions on different pieces of data
at the same time.

 Example: A modern multi-core CPU – different cores run different programs at the same
time.

 Where it is used:

o Cloud computing – multiple users run different applications.

o AI training – different tasks (data preprocessing, training, testing) run simultaneously.

🔹 Real-World Example: A restaurant kitchen – multiple chefs (processors) cook different meals
(instructions) using different ingredients (data) at the same time.

Parallel and Distributed Computing

Parallel and distributed computing are two approaches to solving complex computational problems
by using multiple processors or computers at the same time.

 Parallel Computing: Uses multiple processors within a single system to perform tasks
simultaneously.

 Distributed Computing: Uses multiple interconnected computers that work together to solve
a problem.
Massively Parallel Processing (MPP) Architecture

What is MPP?

Massively Parallel Processing (MPP) is a computing architecture where many processors work
independently on different parts of a problem.

Where is MPP Used?

 Supercomputers: IBM Summit, Cray

 Big Data Processing: Hadoop, AI training

 Scientific Simulations

Key Characteristics of MPP

1. Large Number of Processors – Typically, thousands of processors.

2. Distributed Memory – Each processor has its own memory and communicates via a
network.

3. Parallel Execution – Tasks are divided into smaller parts and executed simultaneously.

4. High Scalability – Suitable for large-scale computations.

Example of MPP in Action

Google Search: Google uses thousands of computers in parallel to process search queries efficiently.

Components of MPP

1. Processing Nodes – Each node has its own CPU, memory, and OS.

2. Interconnection Network – A high-speed network for communication.

3. Memory Model – Uses distributed memory (each node has its own memory).

4. Parallel Software – Special software is needed (e.g., MPI, Hadoop).

Types of MPP Architectures

 Tightly Coupled MPP – Nodes communicate through high-speed interconnects (e.g., Cray
Supercomputers).

 Loosely Coupled MPP – Uses standard network protocols like TCP/IP (e.g., Hadoop clusters).

Advantages of MPP

✅ High Performance – Solves complex computations faster.


✅ Scalability – More processors can be added easily.
✅ Fault Tolerance – If one node fails, others continue processing.

Disadvantages of MPP

❌ Complex Programming – Requires special parallel algorithms.


❌ High Cost – Expensive infrastructure and maintenance.
❌ Communication Overhead – Data exchange between nodes can slow performance.
Cluster-Based Parallel Computing

What is Cluster Computing?

Cluster computing refers to a group of interconnected computers (nodes) that work together to solve
complex computational problems.

How It Works

Each computer (node) has its own CPU, memory, and storage but communicates through a high-
speed network.

Examples of Cluster Computing

 HPC (High-Performance Computing): Used in scientific research.

 Big Data & AI: Google Cloud, Amazon Web Services.

 Financial Modeling: Used in stock market predictions.

Communication in Cluster-Based Parallel Computing

1. Shared Memory Model – Nodes access a common memory space.

2. Message Passing Model – Nodes send messages to communicate (e.g., MPI, PVM).

Advantages of Clusters in Parallel Computing

✅ Cost-Effective – Uses normal hardware instead of expensive supercomputers.


✅ Scalable – More nodes can be added easily.
✅ Fault Tolerance – System continues working even if one node fails.

Categorizing Parallel Approaches (Flynn’s Taxonomy, 1966)

Flynn’s Taxonomy classifies parallel computer architectures based on how instructions and data are
processed.

Category – Instructions – Data – Example

SISD (Single Instruction, Single Data) – One instruction at a time – One data item at a time – Traditional single-core CPUs

SIMD (Single Instruction, Multiple Data) – One instruction at a time – Multiple data items at a time – GPUs, image processing

MISD (Multiple Instruction, Single Data) – Multiple instructions – One data item – Aerospace systems (fault tolerance)

MIMD (Multiple Instruction, Multiple Data) – Multiple instructions – Multiple data items – Cloud computing, AI training

SIMT (Single Instruction, Multi-Thread) – Single instruction (per warp) – Multiple data items (per thread) – GPU computing (CUDA, OpenCL)
Clusters in Parallel Computing

Clusters are collections of interconnected computers (nodes) that work together on computations.

Types of Clusters in Parallel Computing

1. High-Performance Computing (HPC) Clusters – Used for scientific simulations (e.g., IBM
Summit).

2. Load Balancing Clusters – Distributes tasks efficiently (e.g., Google Cloud).

3. High-Availability (HA) Clusters – Ensures continuous operation (e.g., Banking Systems).

4. Storage Clusters – Used for big data storage (e.g., Hadoop Distributed File System).

Key Features of Cluster Computing

 Multiple Nodes – Each node has its own CPU, memory, and storage.

 Parallel Processing – Tasks are divided among nodes.

 Message Passing – Communication using MPI or PVM.

 Fault Tolerance – If one node fails, others continue.

Parallel Computing Strategies

Parallel computing involves different techniques to divide work across multiple processors.

1. Process-Based Parallelization

o Uses separate memory spaces for each process.

o Communication via message passing (e.g., MPI).

o Example: Running multiple simulations in scientific computing.

2. Thread-Based Parallelization

o Runs multiple threads within a single process.

o Faster than process-based parallelism.

o Example: Web servers handling multiple requests.

3. Vectorization

o Operates on multiple data units using SIMD (Single Instruction, Multiple Data).

o Ideal for AI/ML workloads.

o Example: Deep learning computations.

4. Stream Processing

o Uses thousands of lightweight threads in parallel.

o Example: Video stream processing using GPUs.

SISD → Traditional single-core CPU (one task at a time).

SIMD → GPUs (same task on many data points at once).

MISD → Aerospace and security systems (multiple ways to process the same data).

MIMD → Cloud computing, AI training (different tasks on different data at the same time).

LECTURE: 07(a)
1. Understanding Parallel Languages

 Parallel programming languages help run multiple tasks at the same time using multiple
processors or cores.

 This is different from serial programming, where tasks are done one after another.

 The main benefit is speed and efficiency, making it useful in areas like scientific research,
real-time data processing, and large-scale simulations.

2. Examples of Parallel Programming Models and Frameworks

a) OpenMP (Open Multi-Processing)

 What it is: A tool for writing parallel programs in C, C++, and Fortran.

 How it works: Uses shared memory, meaning all threads share the same memory space.

 Where it is used: Good for multi-core processors on a single machine (like laptops or
desktops).

 Example: Splitting a large loop into smaller tasks that different cores run at the same time.
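A minimal sketch of that loop splitting (array size and contents are placeholders): OpenMP divides the loop iterations among the available threads, which all read and write the same shared array.

#include <stdio.h>
#include <omp.h>

#define N 8

int main(void) {
    double a[N];

    /* OpenMP splits the iterations of this loop among the threads;
       every thread works on a different part of the shared array.   */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * i;

    for (int i = 0; i < N; i++)
        printf("a[%d] = %.1f\n", i, a[i]);
    return 0;
}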

b) MPI (Message Passing Interface)

 What it is: A tool for running parallel programs across multiple machines (distributed
computing).

 How it works: Uses distributed memory, meaning each process has its own memory and
communicates by sending messages.

 Where it is used: Good for clusters and supercomputers that work on huge calculations like
weather forecasting.

 Example: A weather model running on a cluster where different machines calculate different
parts of the simulation.
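A minimal message-passing sketch (illustrative, assuming exactly two processes started with mpirun -np 2): process 0 sends a value that process 1 needs, much like neighbouring machines exchanging boundary data in a simulation.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        double x = 3.14;   /* e.g. a boundary value from this process's part of the problem */
        MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        double x;
        MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %.2f from rank 0\n", x);
    }

    MPI_Finalize();
    return 0;
}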
c) CUDA (Compute Unified Device Architecture)

 What it is: A programming model for NVIDIA GPUs.

 How it works: Uses GPU memory, separate from CPU memory. Data must be transferred
between them.

 Where it is used: Image processing, deep learning, and AI.

 Example: Training a deep learning model on a GPU, where thousands of small calculations
happen in parallel.
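A hedged host-side sketch of those CPU-to-GPU transfers (the kernel simply scales an array; names and sizes are assumptions): data is copied into GPU memory, processed by many threads, and copied back to CPU memory.

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void scale(float *x, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= f;
}

int main(void) {
    const int n = 1024;
    float h[n];                                   /* host (CPU) memory   */
    for (int i = 0; i < n; i++) h[i] = (float)i;

    float *d;                                     /* device (GPU) memory */
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);   /* CPU -> GPU */

    scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);  /* many threads run in parallel  */

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);   /* GPU -> CPU */
    cudaFree(d);

    printf("h[10] = %.1f\n", h[10]);              /* 20.0 after scaling */
    return 0;
}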

3. Difference Between OpenMP, MPI, and CUDA

Feature – OpenMP – MPI – CUDA

Memory Model – Shared memory (same memory for all threads) – Distributed memory (each process has its own memory) – GPU memory (separate from CPU memory)

Parallelism Type – Multi-threading on a single machine – Distributed computing across multiple machines – GPU-based parallelism

Best For – Multi-core processors in a single system – Large-scale simulations on clusters – AI, deep learning, graphics processing

Example – Splitting a for-loop into multiple threads – Running a weather simulation on different computers – Running AI models on NVIDIA GPUs

4. Parallelizing Compiler

A parallelizing compiler is a tool that automatically converts normal code into parallel code for
better performance.

Why do we need Parallelizing Compilers?

1. More multi-core processors → We need to use them efficiently.

2. Manual parallelization is difficult → Writing parallel code by hand is complex.

3. Faster performance → Needed for AI, big data, and scientific computing.

How a Parallelizing Compiler Works

 Dependency Analysis → Checks which parts of the code can be run in parallel.

 Loop Transformation → Converts slow loops into fast parallel loops.

 Task Scheduling → Assigns different parts of the program to different processors.

 Code Optimization → Removes unnecessary steps to improve speed.
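For example, a loop whose iterations are independent can be parallelized or vectorized by the compiler without any source changes. The sketch below is illustrative; the GCC options shown are one common way to request this, but exact flags and results vary by compiler and version.

/* saxpy.c (file name assumed) - dependency analysis finds no cross-iteration
   dependency here, so the compiler may vectorize (SIMD) and/or parallelize
   the loop automatically, e.g. with something like:
       gcc -O3 -ftree-vectorize -ftree-parallelize-loops=4 -fopt-info-vec -c saxpy.c */
void saxpy(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   /* no iteration reads what another iteration writes */
}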

5. Techniques Used in Parallelizing Compilers


1. Loop Transformations (Unrolling, Fusion, Tiling) → Makes loops run faster.

2. Automatic Vectorization → Converts code to SIMD (process multiple data at once).

3. Task Parallelism → Splits tasks into independent threads.

4. Message Passing Optimization → Reduces overhead in MPI-based systems.

5. Data Dependency Analysis → Ensures parallel execution is correct.

6. Applications of Parallelizing Compilers

 Scientific Computing → Simulations, weather forecasting.

 Big Data Processing → Hadoop, Spark optimization.

 Machine Learning & AI → Faster training of deep learning models.

 Cloud Computing → Better workload distribution.

7. Examples of Parallelizing Compilers

 LLVM Polly → Optimizes loops for parallelism.

 Intel C/C++ Compiler (ICC) → Improves threading.

 OpenMP & OpenACC Compilers → Simplify parallel programming.

 GCC with Auto-Parallelization → Converts code to use multiple processors.

 IBM XL Compiler → Used in High-Performance Computing (HPC).

Summary of Key Differences

Feature – OpenMP – MPI – CUDA – Parallelizing Compiler

Purpose – Multi-threading in shared memory – Message passing in distributed systems – GPU programming – Automatically converts sequential code to parallel code

Memory Model – Shared memory – Distributed memory – GPU memory – Works with different memory models

Best Used For – Multi-core CPUs – Cluster computing – AI & graphics – Any parallel system

Example – Parallelizing a for-loop – Running a weather model on a cluster – Training deep learning models – Optimizing AI and HPC code

LECTURE: 07(b)
Dependency Types Analysis in Parallel and Distributed Computing

Introduction

When we write programs for parallel or distributed computing, we must ensure that different tasks
can run together correctly. If some tasks depend on others, they may cause errors or delays.
Dependency analysis helps identify such relationships and ensures:

 Correctness of computations.

 Efficient use of resources.

 Prevention of race conditions and deadlocks.

 Better scheduling of tasks across multiple processors or computers.

Types of Dependencies

Dependencies can be classified into four main types:

1. Data Dependencies

2. Control Dependencies

3. Resource Dependencies

4. Communication Dependencies

1. Data Dependencies

Data dependencies occur when one instruction or task depends on the result of another instruction.
This affects how instructions are executed and whether they can be parallelized.

Types of Data Dependencies

1. True Dependency (Read-After-Write, RAW)

2. Anti-Dependency (Write-After-Read, WAR)

3. Output Dependency (Write-After-Write, WAW)

1.1 True Dependency (Read-After-Write, RAW)

 Happens when an instruction needs to read a value that a previous instruction writes.

 Also called flow dependency because data flows from one instruction to another.

Example:

A = B + C; // Instruction 1 writes A

D = A + E; // Instruction 2 reads A
Explanation:

 The second instruction depends on the first one because it needs the value of A before
execution.

 This dependency must be preserved for correct execution.

1.2 Anti-Dependency (Write-After-Read, WAR)

 Happens when an instruction needs to write to a variable that a previous instruction has
read.

 The execution order must be maintained.

Example:

D = X + Y; // Instruction 1 reads X

X = A + B; // Instruction 2 writes to X

Explanation:

 The first instruction reads X, and the second instruction writes to X.

 If instruction 2 executes first, it will change X before instruction 1 reads it, leading to
incorrect results.

1.3 Output Dependency (Write-After-Write, WAW)

 Happens when two instructions write to the same variable.

 The order of execution must be maintained to avoid overwriting.

Example:

X = Y + Z; // Instruction 1 writes to X

X = A + B; // Instruction 2 also writes to X

Explanation:

 If instruction 2 executes before instruction 1, the result of instruction 1 will be lost.

 The correct execution order must be followed.

2. Control Dependencies

 Occurs when the execution of an instruction depends on the outcome of a previous conditional statement.

 Affects the flow of instructions.

Example:
if (X > 0) {       // Condition check

    Y = A + B;     // Dependent instruction
}

Explanation:

 Instruction Y = A + B depends on whether X > 0.

 The processor cannot execute instruction 2 until it knows the result of instruction 1.

 This can cause delays in pipelined processors.

3. Resource Dependencies

 Happens when multiple instructions try to use the same hardware resource (such as
memory, registers, or execution units).

 Can cause pipeline stalls (delays in execution).

Example:

A = B + C; // Instruction 1 uses ALU

D = E - F; // Instruction 2 also needs ALU

Explanation:

 If the processor has only one ALU (Arithmetic Logic Unit), instruction 2 must wait until
instruction 1 finishes.

 This creates a structural hazard.

4. Communication Dependencies

 Occur when tasks running on different processors need to share data.

 Data transfer between processors can cause synchronization delays.

Example (Inter-Processor Dependency):

CPU1: X = 5; // Updates X

CPU2: Y = X + 3; // Reads X

Explanation:

 CPU2 must wait until CPU1 updates the value of X before reading it.

 If not handled properly, it can lead to race conditions or stale data issues.

 Solution: Use synchronization techniques like locks, semaphores, or memory barriers.
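A small illustration of such synchronization in shared memory (an assumption for this sketch: the two "CPUs" are modelled as two C++ threads): the release/acquire pair on the atomic flag acts as the memory barrier that makes CPU1's update of X visible before CPU2 reads it. Compile with something like g++ -std=c++11 -pthread.

#include <atomic>
#include <iostream>
#include <thread>

std::atomic<bool> ready{false};
int X = 0;

int main() {
    std::thread cpu1([] {
        X = 5;                                           // CPU1: updates X
        ready.store(true, std::memory_order_release);    // publish X (memory barrier)
    });
    std::thread cpu2([] {
        while (!ready.load(std::memory_order_acquire)) {}   // wait until X is published
        std::cout << "Y = " << X + 3 << "\n";            // CPU2: now safe to read X
    });
    cpu1.join();
    cpu2.join();
    return 0;
}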

Resolving Dependencies in Parallel Computing


Techniques to Remove Dependencies

1. Loop Transformations:

o Loop unrolling, loop fusion to optimize execution.

2. Speculative Execution:

o Predicts results and executes ahead of time.

3. Vectorization:

o Uses SIMD (Single Instruction Multiple Data) techniques to perform operations on multiple data points at once.
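As a sketch of the first technique, here is a manually 4-way unrolled summation loop (assuming, for simplicity, that n is a multiple of 4): the four accumulators are independent of each other, which exposes parallelism and makes vectorization easier.

/* Sum with a 4-way unrolled loop; s0..s3 do not depend on each other,
   so the hardware (or the vectorizer) can work on them in parallel.
   Assumes n is a multiple of 4 for simplicity.                        */
double sum_unrolled(const double *a, int n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;
}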

Resolving Dependencies in Distributed Computing

Techniques to Optimize Distributed Computing

1. Task Partitioning:

o Assign independent tasks to different nodes.

2. Asynchronous Execution:

o Use message queues to avoid blocking execution.

3. Fault Tolerance:

o Handle failures using checkpointing to save progress.

Case Studies & Examples

A. GPU-Based Matrix Multiplication Using CUDA (True Dependency - RAW)

 CUDA enables parallel execution of matrix multiplication on GPUs.

 True dependencies occur when computing an element of the result depends on previous computations.

Example:

C[i][j] = A[i][k] * B[k][j] + C[i][j]; // Each update of C[i][j] reads the partial sum from the previous k iteration

 The accumulation into C[i][j] depends on the value written in the previous iteration of the k loop (RAW).

 Requires techniques like tiling to minimize dependencies.
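A hedged sketch of that tiling technique (assuming square, row-major matrices whose size N is a multiple of the tile width): each thread block stages a tile of A and B in shared memory, and the __syncthreads() calls enforce the RAW and WAR dependencies on those tiles.

#define TILE 16

/* C = A * B for N x N row-major matrices; sketch assumes N % TILE == 0.
   Launch with dim3 block(TILE, TILE) and dim3 grid(N / TILE, N / TILE).  */
__global__ void matmul_tiled(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; t++) {
        /* Each thread loads one element of the current A and B tiles. */
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();   /* RAW: wait until the whole tile is loaded              */

        for (int k = 0; k < TILE; k++)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();   /* WAR: wait before the next iteration overwrites tiles  */
    }
    C[row * N + col] = acc;
}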

B. OpenMP Handling Loop-Carried Dependencies (Anti-Dependency - WAR)

 OpenMP is used for parallelizing loops in multi-core CPUs.

 Loop-carried dependencies occur when later iterations depend on earlier iterations.

Example:

#pragma omp parallel for
for (int i = 1; i < N; i++) {
    A[i] = A[i - 1] + B[i]; // Dependency on A[i-1]
}

 Since A[i] depends on A[i-1], OpenMP cannot parallelize this loop without modification.

 Solution: Use prefix sum algorithms or independent computations.
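A hedged sketch of the prefix-sum approach (two-pass block scan; names and types are assumed): each thread first scans its own block of B independently, then adds the totals of all preceding blocks. This produces A[i] = B[0] + ... + B[i]; the original A[0] term, if needed, can be added afterwards as a constant offset.

#include <stdlib.h>
#include <omp.h>

/* Two-pass parallel prefix sum: on return, A[i] = B[0] + B[1] + ... + B[i]. */
void prefix_sum(double *A, const double *B, int N) {
    int maxth = omp_get_max_threads();
    double *block = calloc(maxth + 1, sizeof *block);  /* block[t+1] = sum of thread t's block */

    #pragma omp parallel
    {
        int nth = omp_get_num_threads();
        int tid = omp_get_thread_num();
        int lo  = (int)((long long)N *  tid      / nth);
        int hi  = (int)((long long)N * (tid + 1) / nth);

        /* Pass 1: each thread scans its own block independently (no cross-thread dependency). */
        double s = 0.0;
        for (int i = lo; i < hi; i++) { s += B[i]; A[i] = s; }
        block[tid + 1] = s;

        #pragma omp barrier
        #pragma omp single
        for (int t = 1; t <= nth; t++) block[t] += block[t - 1];
        /* implicit barrier at the end of the single construct */

        /* Pass 2: add the total of all preceding blocks to this thread's block. */
        for (int i = lo; i < hi; i++) A[i] += block[tid];
    }
    free(block);
}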
