Quiz Prep
Parallel computing systems use different memory architectures to let multiple processors work
together efficiently. These architectures determine how memory is accessed and shared among
processors. The main types are shared memory, distributed memory, and hybrid memory.
1. Shared Memory
In this architecture, all processors share the same memory space. Any change one processor makes
to memory is visible to all other processors.
General Characteristics:
All processors can access memory as a global address space (common memory).
If one processor updates a value in memory, all others see the updated value.
Example:
Imagine a group of people working on a shared Google Docs file. When one person makes changes,
everyone sees the update immediately.
1.1 Uniform Memory Access (UMA)
All processors have equal access time to memory. A special feature is cache coherency: if one
processor updates memory, the others are immediately informed.
Example of UMA:
Think of a classroom where all students have equal access to the teacher’s notes on the board. No
one gets faster or slower access.
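The sketch below illustrates shared-memory visibility in C with OpenMP (a minimal example, assuming a compiler with OpenMP support, e.g., gcc -fopenmp; the variable name shared_value is illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {
    int shared_value = 0;   // one global address space: every thread sees this variable

    #pragma omp parallel
    {
        // only one thread performs the update...
        #pragma omp single
        shared_value = 42;
        // ...but after the implicit barrier at the end of "single",
        // every thread reads the updated value
        printf("thread %d sees shared_value = %d\n",
               omp_get_thread_num(), shared_value);
    }
    return 0;
}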
1.2 Non-Uniform Memory Access (NUMA)
Some processors have faster access to certain memory regions, while others may experience
slower access times.
Example of NUMA:
Think of a university with different libraries. A student near their department’s library can get books
faster than one accessing a distant library.
2. Distributed Memory
Here, each processor has its own local memory, and processors communicate through a network.
There is no global memory space.
Example of Distributed Memory:
Think of a company with different departments. Each department keeps its own documents. If one
department needs information from another, it must request it explicitly instead of accessing it
directly.
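A minimal MPI sketch of this idea (assuming an MPI installation; compile with mpicc, run with e.g. mpirun -np 2): each process has its own private copy of x, so the value must be sent explicitly over the network.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int x;                      // each process has its own private x
    if (rank == 0) {
        x = 5;                  // process 0 sets its local copy...
        MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   // ...and must send it explicitly
    } else if (rank == 1) {
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received x = %d\n", x);          // no shared address space
    }

    MPI_Finalize();
    return 0;
}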
3. Hybrid (Shared + Distributed) Memory
This is a combination of shared memory and distributed memory. The fastest and most powerful
computers today use this model.
General Characteristics:
The shared memory component consists of SMP machines or GPUs (Graphics Processing
Units).
Processors within a machine share memory, but machines must communicate over a
network.
ADVANTAGES AND DISADVANTAGES

SHARED MEMORY
Advantages:
✔ Easy to program – all processors access the same memory.
✔ Fast data sharing – no extra communication is required.
Disadvantages:
✘ Scalability issues – adding more processors can slow performance due to memory access congestion.
✘ Synchronization problems – programmers must ensure that processors do not overwrite shared data.

DISTRIBUTED MEMORY
Advantages:
✔ Scalability – adding more processors increases memory size proportionally.
✔ No interference – each processor accesses its own memory without slowing down others.
✔ Cost-effective – uses off-the-shelf components like standard processors and network connections.
Disadvantages:
✘ Complex programming – programmers must manually handle data communication between processors.
✘ Data mapping challenges – it is hard to convert some existing programs (designed for shared memory) to distributed memory systems.
✘ Non-uniform memory access times – accessing data from a remote processor is slower than accessing local memory.

HYBRID
Advantages:
✔ Combines the best of both worlds – shared memory for fast local communication, distributed memory for scalability.
✔ Highly scalable – large systems can handle huge amounts of data.
Disadvantages:
✘ Programming complexity – developers must manage both shared and distributed memory operations.
LECTURE: 04
Why Parallelism?
Increasing demand for computational power: Modern applications (like deep learning and
simulations) require more processing power than a single processor can provide.
Limitations of sequential processing: One processor doing tasks one after the other can be
too slow for large, complex problems.
Energy efficiency: Often, splitting work across multiple processors can be more energy
efficient than one processor working hard all the time.
Types of Parallelism
1. Task Parallelism
2. Data Parallelism
3. Pipeline Parallelism
4. Hybrid Parallelism
1. Task Parallelism
Definition:
Different tasks (or different pieces of a program) are executed concurrently, meaning that separate
threads or processors perform different operations at the same time.
Examples:
Web Server: One thread handles client requests while another processes database queries.
Video Processing: One thread extracts frames from a video while another thread
compresses the frames concurrently.
Programming Models:
Common choices are OpenMP (tasks and sections) and POSIX threads (pthreads), where each
thread carries out a different operation.
Purpose:
Task parallelism is useful when a program consists of distinct tasks that can run independently. It
helps utilize multiple cores by splitting the work into different, concurrent operations.
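As a minimal sketch (assuming OpenMP; the two functions are hypothetical stand-ins for the web-server example), OpenMP sections can run two distinct tasks on different threads at the same time:

#include <stdio.h>
#include <omp.h>

// hypothetical stand-ins for the two distinct tasks in the web-server example
void handle_requests(void) { printf("handling client requests...\n"); }
void process_queries(void) { printf("processing database queries...\n"); }

int main(void) {
    // each section is a different task; OpenMP assigns them to different threads
    #pragma omp parallel sections
    {
        #pragma omp section
        handle_requests();

        #pragma omp section
        process_queries();
    }
    return 0;
}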
2. Data Parallelism
Definition:
The same operation is applied to different pieces of a large dataset at the same time. Instead of
dividing the program into different tasks, you divide the data among multiple processors.
Examples:
Matrix Multiplication: Different cores compute parts of the resulting matrix simultaneously.
Image Processing: Applying a filter to many pixels at the same time on a large image.
Programming Models:
CUDA: For NVIDIA GPUs, which can run thousands of threads in parallel.
MPI (Message Passing Interface): Enables parallel processing across nodes in a cluster.
Purpose:
Data parallelism is ideal when you have large amounts of similar data (like images or numerical data)
that can be processed in parallel, which speeds up the computation significantly.
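A minimal C sketch of data parallelism with OpenMP (assuming an OpenMP-capable compiler; the filter formula is illustrative): the same operation is applied to every element, and the independent iterations are split across the available cores.

#include <omp.h>
#define N 1000000

int main(void) {
    static float pixel[N];

    // same operation on every element; the iterations are independent,
    // so OpenMP can divide the index range among all cores
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        pixel[i] = pixel[i] * 0.5f + 10.0f;   // e.g., a simple brightness filter
    }
    return 0;
}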
3. Pipeline Parallelism
Definition:
Tasks are divided into sequential stages (like a production line). Each stage processes a part of the
data and then passes it to the next stage.
Examples:
CPU Instruction Pipeline: Typical stages include Fetch, Decode, Execute, and Writeback.
AI Model Training: One stage might load data, the next stage processes it, and another stage
trains the model.
Programming Models:
Deep Learning Frameworks (TensorFlow, PyTorch): These can implement pipelined training
where different layers or stages are processed in a sequence.
Purpose:
Pipeline parallelism helps in organizing tasks that naturally follow a sequential process, ensuring each
stage can work concurrently on different parts of the data, thereby improving throughput.
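The following conceptual C sketch (no real threads; the stage count and names are illustrative) shows how pipeline stages overlap: at each step, every stage works on a different item, so several items are in flight at once and throughput improves.

#include <stdio.h>

#define ITEMS 5
#define STAGES 3   // e.g., Fetch -> Process -> Output

int main(void) {
    // at each "tick" t, stage s works on item t - s:
    // while stage 0 fetches item t, stage 1 is already processing item t-1,
    // and stage 2 is outputting item t-2 -- three items in flight at once
    for (int t = 0; t < ITEMS + STAGES - 1; t++) {
        for (int s = 0; s < STAGES; s++) {
            int item = t - s;
            if (item >= 0 && item < ITEMS)
                printf("tick %d: stage %d works on item %d\n", t, s, item);
        }
        printf("---\n");
    }
    return 0;
}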
4. Hybrid Parallelism
Definition:
Two or more forms of parallelism are combined in the same application – for example, data
parallelism across nodes together with task or pipeline parallelism within each node.
Examples:
MPI + OpenMP: MPI distributes tasks across different nodes (data parallelism), while
OpenMP handles multi-threading within each node (task parallelism).
Deep Learning Training: It might use data parallelism to distribute training data across
multiple GPUs and pipeline parallelism to split the model layers across GPUs.
Programming Models:
Hybrid MPI-OpenMP: Combines the strengths of both message passing (for inter-node
communication) and shared memory multi-threading (within a node).
Deep Learning Frameworks: Such as TensorFlow, PyTorch, or Horovod, which are designed to
handle large-scale distributed training.
Purpose:
Hybrid parallelism is used in the most complex and high-performance systems (like supercomputers)
to take advantage of the strengths of each individual approach and overcome their limitations.
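A minimal hybrid MPI + OpenMP sketch (assuming both libraries are available; the work loop is a stand-in for each node's share of a real computation): MPI splits the problem across processes/nodes, and OpenMP threads split each node's share across its cores.

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);          // distributed memory: one process per node
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local_sum = 0.0;

    // shared memory: threads inside this process split the node's range of work
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = rank * 1000; i < (rank + 1) * 1000; i++)
        local_sum += (double)i;      // stand-in for each node's portion of the computation

    // message passing combines the per-node results across the cluster
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}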
Task Parallelism vs. Data Parallelism
Task Parallelism:
o What it does: Runs different tasks (each doing a unique job) at the same time.
Data Parallelism:
o What it does: Runs the same operation on different pieces of a dataset at the same time.
Key Differences:
Communication:
o In task parallelism, tasks may communicate and share results, often requiring careful
synchronization.
o In data parallelism, each processor works on its own piece of the data, reducing the
need for constant communication.
Application Suitability:
o Task parallelism is great for applications with different, independent operations.
o Data parallelism is ideal for applications that involve repetitive processing over large
datasets.
Real-World Applications
Task Parallelism:
o Used in web servers and game engines where different tasks run concurrently.
Data Parallelism:
o Common in image processing and deep learning where the same operations are
applied to large datasets.
Pipeline Parallelism:
o Used in CPU instruction pipelines and staged workflows (e.g., data loading, processing,
and model training).
Hybrid Parallelism:
o Found in supercomputers and high-end deep learning applications, where both task
and data parallelism are needed for optimal performance.
Performance Considerations
Load Balancing: Ensuring that each processor gets an equal amount of work.
Synchronization Overhead: The extra time spent coordinating between tasks or processors
(Minimize waiting time between tasks).
Scalability Issues: Some approaches may not scale well when more processors or larger
datasets are involved (Choose the right model for large-scale execution).
Choosing the Right Model: The application and available hardware will determine which
type of parallelism is most effective.
LECTURE: 05
Tools for Parallel Programming
Parallel programming tools distribute tasks across multiple cores, processors, or GPUs. These tools
can be classified into different categories:
1. Shared-Memory Parallelism
OpenMP: Uses compiler directives for parallelizing loops (#pragma omp parallel).
2. Distributed-Memory Parallelism
MPI (Message Passing Interface): Coordinates processes across cluster nodes through explicit
messages.
3. GPU-Based Parallelism
CUDA (Compute Unified Device Architecture): NVIDIA’s framework for executing thousands
of GPU threads.
4. Distributed Frameworks
Charm++: A message-driven parallel programming model for C++.
Apache Spark: A cluster framework for large-scale parallel data processing.
A key measure for all of these tools is scalability: how well the system performs as the number of
processors increases.
LECTURE: 06
1. SISD (Single Instruction, Single Data)
What it means: A single processor executes one instruction at a time on one piece of data.
Where it is used: Simple applications like text editing, browsing, and basic calculations.
🔹 Real-World Example: A calculator – it performs one operation (instruction) on one number (data)
at a time.
2. SIMD (Single Instruction, Multiple Data)
What it means: The same instruction is applied to multiple pieces of data at once.
Example: A Graphics Processing Unit (GPU) processes multiple pixels at the same time.
Where it is used: Graphics processing (GPUs), image and video processing, and scientific
computing on large arrays.
🔹 Real-World Example: Traffic lights in a city – one instruction (change light color) is applied to
multiple locations (intersections) at the same time.
3. MISD (Multiple Instruction, Single Data)
What it means: Multiple processors execute different instructions on the same piece of data.
Example: Used in systems that require high reliability, like fault-tolerant aerospace systems.
Where it is used:
o Missile guidance systems – multiple calculations check the same data to ensure
accuracy.
o Medical monitoring – different sensors analyze a patient’s vital signs from one input.
4. MIMD (Multiple Instruction, Multiple Data)
What it means: Different processors execute different instructions on different pieces of data
at the same time.
Example: A modern multi-core CPU – different cores run different programs at the same
time.
Where it is used: Multi-core PCs, servers, cloud computing, and supercomputers.
🔹 Real-World Example: A restaurant kitchen – multiple chefs (processors) cook different meals
(instructions) using different ingredients (data) at the same time.
Parallel vs. Distributed Computing
Parallel and distributed computing are two approaches to solving complex computational problems
by using multiple processors or computers at the same time.
Parallel Computing: Uses multiple processors within a single system to perform tasks
simultaneously.
Distributed Computing: Uses multiple interconnected computers that work together to solve
a problem.
Massively Parallel Processing (MPP) Architecture
What is MPP?
Massively Parallel Processing (MPP) is a computing architecture in which many processors work
independently on different parts of a problem.
How MPP Works
1. Independent Processors – Many processors each work on a different part of the problem.
2. Distributed Memory – Each processor has its own memory and communicates via a
network.
3. Parallel Execution – Tasks are divided into smaller parts and executed simultaneously.
Examples:
Scientific Simulations: Large-scale models are split across many processors.
Google Search: Google uses thousands of computers in parallel to process search queries efficiently.
Components of MPP
1. Processing Nodes – Each node has its own CPU, memory, and OS.
2. Interconnect – A network that links the nodes together.
3. Memory Model – Uses distributed memory (each node has its own memory).
Types of MPP
Tightly Coupled MPP – Nodes communicate through high-speed interconnects (e.g., Cray
supercomputers).
Loosely Coupled MPP – Uses standard network protocols like TCP/IP (e.g., Hadoop clusters).
Advantages of MPP
Scales to very large numbers of nodes, since adding nodes adds both compute power and memory.
Disadvantages of MPP
Programming is complex: data must be communicated explicitly between nodes, and accessing
remote data is slower than accessing local memory.
Cluster Computing
Cluster computing refers to a group of interconnected computers (nodes) that work together to solve
complex computational problems.
How It Works
Each computer (node) has its own CPU, memory, and storage but communicates through a high-
speed network.
Programming model: Message Passing – Nodes send messages to communicate (e.g., MPI, PVM).
Flynn’s Taxonomy classifies parallel computer architectures based on how instructions and data are
processed.

Type | Instruction Stream | Data Stream | Example
SISD (Single Instruction, Single Data) | One instruction at a time | One data item at a time | Traditional single-core CPUs
SIMD (Single Instruction, Multiple Data) | One instruction at a time | Multiple data items at once | GPUs, vector processors
MISD (Multiple Instruction, Single Data) | Multiple instructions | One data stream | Fault-tolerant/redundant systems
MIMD (Multiple Instruction, Multiple Data) | Multiple instructions | Multiple data streams | Multi-core CPUs, clusters
Types of Clusters
Clusters are collections of interconnected computers (nodes) that work together on computations.
1. High-Performance Computing (HPC) Clusters – Used for scientific simulations (e.g., IBM
Summit).
2. High-Availability (HA) Clusters – Keep services running by failing over to another node
when one fails.
3. Load-Balancing Clusters – Spread incoming requests evenly across nodes (e.g., web server
farms).
4. Storage Clusters – Used for big data storage (e.g., Hadoop Distributed File System).
Key characteristic – Multiple Nodes: Each node has its own CPU, memory, and storage.
Parallel computing involves different techniques to divide work across multiple processors.
1. Process-Based Parallelization
o Runs multiple processes, each with its own memory space (e.g., MPI programs).
2. Thread-Based Parallelization
o Runs multiple threads that share one memory space within a process (e.g., OpenMP).
3. Vectorization
o Operates on multiple data units using SIMD (Single Instruction, Multiple Data); see the
sketch after this list.
4. Stream Processing
o Processes continuous streams of data as they arrive, often on GPUs.
Flynn’s categories in practice:
MISD → Aerospace and security systems (multiple ways to process the same data).
MIMD → Cloud computing, AI training (different tasks on different data at the same time).
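As a small illustration of vectorization (a sketch, assuming a compiler such as gcc or clang with optimization enabled): the loop below has independent iterations, so the compiler can emit SIMD instructions that add several elements per instruction.

#include <stdio.h>

void add_arrays(float *a, const float *b, const float *c, int n) {
    // independent iterations: a vectorizing compiler (e.g., gcc/clang at -O3)
    // can process several elements per SIMD instruction
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

int main(void) {
    float b[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float c[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float a[8];
    add_arrays(a, b, c, 8);
    printf("a[0] = %f\n", a[0]);   // 9.0
    return 0;
}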
LECTURE: 07(a)
1. Understanding Parallel Languages
Parallel programming languages help run multiple tasks at the same time using multiple
processors or cores.
This is different from serial programming, where tasks are done one after another.
The main benefit is speed and efficiency, making it useful in areas like scientific research,
real-time data processing, and large-scale simulations.
2. Main Parallel Programming Tools
a) OpenMP (Open Multi-Processing)
What it is: A tool for writing parallel programs in C, C++, and Fortran.
How it works: Uses shared memory, meaning all threads share the same memory space.
Where it is used: Good for multi-core processors on a single machine (like laptops or
desktops).
Example: Splitting a large loop into smaller tasks that different cores run at the same time.
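A minimal sketch of this loop-splitting idea (assuming OpenMP): the reduction clause gives each thread a private partial sum, so the cores can safely work on different parts of the loop at the same time.

#include <stdio.h>
#include <omp.h>

int main(void) {
    long long sum = 0;

    // the loop's iterations are divided among the cores; reduction(+:sum)
    // gives each thread a private partial sum and combines them at the end
    #pragma omp parallel for reduction(+:sum)
    for (long long i = 1; i <= 100000000LL; i++)
        sum += i;

    printf("sum = %lld\n", sum);
    return 0;
}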
b) MPI (Message Passing Interface)
What it is: A tool for running parallel programs across multiple machines (distributed
computing).
How it works: Uses distributed memory, meaning each process has its own memory and
communicates by sending messages.
Where it is used: Good for clusters and supercomputers that work on huge calculations like
weather forecasting.
Example: A weather model running on a cluster where different machines calculate different
parts of the simulation.
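A minimal sketch of such a decomposition (assuming MPI; simulate_region is a hypothetical stand-in for the real per-region computation): each process handles one region of the domain, and rank 0 gathers the results.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

// hypothetical stand-in for the physics computed on one region of the domain
double simulate_region(int region) { return region * 1.5; }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // each machine computes its own region of the simulation
    double local = simulate_region(rank);

    // rank 0 collects one result per region
    double *all = NULL;
    if (rank == 0)
        all = malloc(size * sizeof(double));
    MPI_Gather(&local, 1, MPI_DOUBLE, all, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int i = 0; i < size; i++)
            printf("region %d -> %f\n", i, all[i]);
        free(all);
    }

    MPI_Finalize();
    return 0;
}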
c) CUDA (Compute Unified Device Architecture)
What it is: NVIDIA’s platform for writing programs that run on GPUs, which can execute
thousands of threads in parallel.
How it works: Uses GPU memory, separate from CPU memory. Data must be transferred
between them.
Example: Training a deep learning model on a GPU, where thousands of small calculations
happen in parallel.
3. Comparing the Memory Models
OpenMP – Shared memory (the same memory for all threads).
MPI – Distributed memory (each process has its own memory).
CUDA – GPU memory (separate from CPU memory; data must be transferred between them).
4. Parallelizing Compiler
A parallelizing compiler is a tool that automatically converts normal code into parallel code for
better performance.
Why it is needed: Faster performance → needed for AI, big data, and scientific computing.
How it works: Dependency Analysis → checks which parts of the code can safely run in parallel.
5. Summary Comparison

Feature | OpenMP | MPI | CUDA | Parallelizing Compiler
Purpose | Multi-threading in shared memory | Message passing in distributed systems | GPU programming | Automatically converts sequential code to parallel code
Best Used For | Multi-core CPUs | Cluster computing | AI & Graphics | Any parallel system
LECTURE: 07(b)
Dependency Types Analysis in Parallel and Distributed Computing
Introduction
When we write programs for parallel or distributed computing, we must ensure that different tasks
can run together correctly. If some tasks depend on others, they may cause errors or delays.
Dependency analysis helps identify such relationships and ensures:
Correctness of computations.
Types of Dependencies
1. Data Dependencies
2. Control Dependencies
3. Resource Dependencies
4. Communication Dependencies
1. Data Dependencies
Data dependencies occur when one instruction or task depends on the result of another instruction.
This affects how instructions are executed and whether they can be parallelized.
a) True Dependency (Read After Write – RAW)
Happens when an instruction needs to read a value that a previous instruction writes.
Also called flow dependency, because data flows from one instruction to the next.
Example:
A = B + C; // Instruction 1 writes A
D = A + E; // Instruction 2 reads A
Explanation:
The second instruction depends on the first one because it needs the value of A before
execution.
b) Anti-Dependency (Write After Read – WAR)
Happens when an instruction needs to write to a variable that a previous instruction reads.
Example:
Y = X + Z; // Instruction 1 reads X
X = A + B; // Instruction 2 writes to X
Explanation:
If instruction 2 executes first, it will change X before instruction 1 reads it, leading to
incorrect results.
c) Output Dependency (Write After Write – WAW)
Happens when two instructions write to the same variable.
Example:
X = Y + Z; // Instruction 1 writes to X
X = A + B; // Instruction 2 also writes to X
Explanation:
The writes must complete in order; if instruction 2 finishes before instruction 1, X ends up
holding the older (wrong) value.
2. Control Dependencies
Happens when an instruction’s execution depends on the outcome of a previous branch (such
as an if condition).
Example:
if (X > 0) { // Instruction 1: condition check
Y = A + B; // Instruction 2: dependent instruction
}
Explanation:
The processor cannot execute instruction 2 until it knows the result of instruction 1.
3. Resource Dependencies
Happens when multiple instructions try to use the same hardware resource (such as
memory, registers, or execution units).
Example:
A = B + C; // Instruction 1 uses the ALU
D = E + F; // Instruction 2 also needs the ALU
Explanation:
If the processor has only one ALU (Arithmetic Logic Unit), instruction 2 must wait until
instruction 1 finishes.
4. Communication Dependencies
Happens in distributed systems when one processor needs data that another processor
produces and must wait for it to arrive over the network.
Example:
CPU1: X = 5; // Updates X
CPU2: Y = X + 3; // Reads X
Explanation:
CPU2 must wait until CPU1 updates the value of X before reading it.
If not handled properly, it can lead to race conditions or stale data issues.
Handling Dependencies in Parallel Computing
1. Loop Transformations: Restructure or reorder loops so that iterations no longer depend on
each other.
2. Speculative Execution: Execute instructions ahead of time and discard the results if a
dependency is violated.
3. Vectorization: Apply the same operation to independent data elements with SIMD
instructions.
Handling Dependencies in Distributed Computing
1. Task Partitioning: Divide the work so that tasks depend on each other as little as possible.
2. Asynchronous Execution: Let tasks proceed without blocking on every communication.
3. Fault Tolerance: Detect failures and recompute or recover so that dependent tasks still
receive correct data.
Example (loop-carried dependency in matrix multiplication):
C[i][j] = C[i][j] + A[i][k] * B[k][j]; // accumulating into C[i][j] creates a dependency across the k loop
Example:
A[i] = A[i-1] + B[i]; // each iteration needs the result of the previous one
Since A[i] depends on A[i-1], OpenMP cannot parallelize this loop without modification.
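A small C sketch of the difference (assuming OpenMP; the array size is illustrative): the first loop carries a dependency and must run in order, while the second has independent iterations that OpenMP can split across threads.

#include <omp.h>
#define N 1000

int main(void) {
    static double A[N], B[N];
    for (int i = 0; i < N; i++) B[i] = 1.0;

    // loop-carried dependency: iteration i needs A[i-1], so the
    // iterations must execute in order and cannot simply be split across threads
    A[0] = B[0];
    for (int i = 1; i < N; i++)
        A[i] = A[i - 1] + B[i];

    // independent iterations, by contrast: each one touches only its own
    // elements, so OpenMP can safely divide the loop among threads
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        A[i] = 2.0 * B[i];

    return 0;
}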