
PARALLEL COMPUTING
BASIC INFORMATION
• Course/Code: Parallel Programming/BCS702
• Course Requisites:
 Knowledge of C or C++ Programming
 Familiarity with Operating Systems Concepts
 Basic Understanding of Computer Architecture
 Foundations in Algorithms and Data Structures
 Mathematical Maturity and Problem-Solving Skills
 Basic Exposure to Unix/Linux Environment
• Text Book: “An Introduction to Parallel Programming” (2nd Edition) by Peter S. Pacheco and Matthew Malensek
CONTENTS

• Need for Parallel Programming
• Parallel Hardware and Software
• Classifications: SIMD vs. MIMD
• Interconnection Networks & Cache Coherence
• Shared vs. Distributed Memory Models
• Coordinating Processes and Threads
WHY PARALLEL PROGRAMMING?
• Traditional CPUs can’t get significantly faster (physical limits on clock speed).
• Multi-core processors are standard in modern computers.
• Huge datasets and real-time systems require faster computation.
• Cloud, AI, ML, and Big Data applications demand scalable performance.
Hardware | Description | Real-Time Example
Multi-core CPUs | A single chip with multiple cores (e.g., quad-core, octa-core) | Smartphones and laptops run multiple apps (camera, calls, background sync) in parallel
GPUs (Graphics Processing Units) | Thousands of small cores for data-parallel tasks | Gaming, image rendering, AI model training
Clusters | Group of computers connected via a network | Weather forecasting simulations in supercomputing centers
FPGAs/ASICs | Custom chips designed for specific parallel tasks | Self-driving cars use them for fast sensor data processing
Supercomputers | Thousands of CPUs/GPUs working together | Scientific simulations, space research (NASA, ISRO)
Software | Type | Real-Time Example
OpenMP | API for shared-memory multiprocessing (C/C++) | Used in scientific computing apps on multi-core servers
MPI (Message Passing Interface) | Library for distributed-memory systems | Used in large-scale simulations on supercomputers (e.g., earthquake modeling)
CUDA | NVIDIA’s API for GPU programming | Deep learning frameworks like TensorFlow/PyTorch use CUDA for model training
MapReduce / Apache Spark | Distributed data processing frameworks | Used by Google, Amazon, Flipkart for data analytics
Hadoop | Big Data processing over distributed storage (HDFS) | Used in banks and telecom companies for processing transaction logs
Java Threads / Python multiprocessing | Built-in parallel features in general-purpose languages | Used in chat apps, background processing in web apps
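As a minimal illustration of the shared-memory entry in the table above, here is a hedged C sketch of an OpenMP parallel loop; the array size, variable names, and the reduction on a running sum are illustrative choices, not part of the course material.

#include <stdio.h>
#include <omp.h>

#define N 1000000                /* illustrative problem size */

static double a[N];

int main(void) {
    double sum = 0.0;

    /* The loop iterations are split across threads; the reduction clause
       combines the per-thread partial sums safely at the end. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = i * 0.5;
        sum += a[i];
    }

    printf("sum = %f (up to %d threads)\n", sum, omp_get_max_threads());
    return 0;
}

Compile with, e.g., gcc -fopenmp so that the pragma takes effect.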
EVERYDAY EXAMPLE:
IMAGINE ONE PERSON TYPING A 10-PAGE DOCUMENT — TAKES 10 MINUTES.

TWO PEOPLE TYPING THE SAME — DONE IN 5 MINUTES.

FOUR PEOPLE TYPING THE SAME AGAIN — DONE IN 2.5 MINUTES.

JUST LIKE THAT, INSTEAD OF ONE FAST PROCESSOR, WE NOW RELY ON MANY PROCESSORS WORKING TOGETHER. AND THAT IS WHAT WE CALL:

PARALLEL PROGRAMMING
BIG DATA PROCESSING (E-COMMERCE)
SCENARIO:
Amazon processes millions of transactions, product recommendations, and user behavior logs per second.

CHALLENGE:
Sequential processing would take hours or days to analyze terabytes of data.

SOLUTION:
Use Apache Spark or Hadoop MapReduce to:
• Divide data into chunks
• Process in parallel across hundreds of machines
• Get results in minutes or seconds
Real-Time Fraud Detection (Banking/Finance)
Scenario:
Banks need to detect fraudulent transactions in
real-time (e.g., sudden login from another
country, large withdrawal).

Challenge:
Comparing current transaction with thousands of
historical ones in real-time is computationally
heavy.

Solution:
Use parallel processing on GPUs or clusters
to:
• Analyze transaction patterns instantly
PARALLEL SYSTEMS
• SIMD: Single Instruction, Multiple Data
• MIMD: Multiple Instruction, Multiple Data
• Shared-Memory Architectures
• Distributed-Memory Architectures
• Synchronization and Coordination Methods
CLASSIFICATION OF PARALLEL COMPUTERS
Classification Type | Key Feature | Example
Flynn’s SIMD | Single instruction, multiple data | GPU operations
Flynn’s MIMD | Multiple instruction, multiple data | Multicore CPUs, clusters
Shared Memory | Global memory access | OpenMP on multicore CPUs
Distributed Memory | Local memory per processor | MPI-based cluster
Hybrid Architecture | Shared and distributed memory | MPI + OpenMP hybrid
Multicore Systems | Multiple CPU cores per processor | Intel/AMD CPUs
Manycore Systems | Many lightweight cores | NVIDIA GPU, Intel Xeon Phi
Accelerators / Heterogeneous | CPU + GPU/TPU/FPGA | CUDA with CPU offloading
Cluster Computing | Networked group of computers | Beowulf clusters
Flynn’s Taxonomy – Classifications of Parallel Computers

Flynn categorized computer architectures into four types based on the number of instruction streams and data streams:

Type | Full Form | Instruction Stream | Data Stream
SISD | Single Instruction, Single Data | 1 | 1
SIMD | Single Instruction, Multiple Data | 1 | Many
MISD | Multiple Instruction, Single Data | Many | 1
MIMD | Multiple Instruction, Multiple Data | Many | Many
Interconnection Networks

Topology | Connectivity | Latency | Scalability | Cost
Linear Array | Low | High | Poor | Low
Ring | Medium | Medium | Moderate | Low
2D Mesh | Good | Low | Good | Medium
Torus | Better | Lower | Good | Higher
Hypercube | Excellent | Very Low | Excellent | High
Crossbar | Optimal | Lowest | Poor | Very High
GPU PROGRAMMING

• Utilizes the parallel processing power of GPUs.
• CUDA: NVIDIA’s platform for general-purpose GPU computing.
• Hierarchical execution model: Threads, Blocks, Grids.
• Efficient for data-parallel tasks.
• Requires careful memory and thread management.
PROGRAMMING HYBRID SYSTEMS
• Combines CPU and GPU for optimized computation.
• CPU handles control-heavy tasks; GPU handles parallel computation.
• Requires memory transfer and synchronization between host and device.
• Programming models like CUDA and OpenCL facilitate hybrid execution.
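The host/device workflow described above can be sketched as follows; this is a minimal illustration only, and the kernel name scale, the problem size, and the block size of 256 are assumptions made for the example.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Illustrative kernel: the GPU does the data-parallel work. */
__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_x = (float *)malloc(bytes);                  /* host (CPU) buffer */
    for (int i = 0; i < n; i++) h_x[i] = 1.0f;

    float *d_x;                                           /* device (GPU) buffer */
    cudaMalloc(&d_x, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);  /* host -> device */

    scale<<<(n + 255) / 256, 256>>>(d_x, n);              /* launch a grid of 256-thread blocks */
    cudaDeviceSynchronize();                              /* wait for the GPU to finish */

    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);  /* device -> host */
    printf("h_x[0] = %f\n", h_x[0]);

    cudaFree(d_x);
    free(h_x);
    return 0;
}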
AMDAHL’S LAW

• Amdahl’s Law defines the theoretical maximum speedup of a program based on the proportion of code that can be parallelized.
• It emphasizes that the serial portion limits overall performance improvement.
FORMULA
Speedup (S) = 1 / (f + (1 - f) / P)
Where:
• f = Serial portion of the program
• (1 - f) = Parallel portion
• P = Number of processors
• S = Overall speedup
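As a quick check of the formula, the following hedged C sketch simply evaluates it for a serial fraction f = 0.1 and a few illustrative processor counts (the sample values are examples, not from the slides):

#include <stdio.h>

/* Amdahl's Law: S = 1 / (f + (1 - f) / P) */
static double speedup(double f, double p) {
    return 1.0 / (f + (1.0 - f) / p);
}

int main(void) {
    double f = 0.1;                          /* 10% of the program is serial */
    int procs[] = {2, 4, 16, 1024};
    for (int i = 0; i < 4; i++)
        printf("P = %4d -> S = %.2f\n", procs[i], speedup(f, procs[i]));
    /* As P grows, S approaches 1/f = 10: the serial part caps the gain. */
    return 0;
}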
IMPLICATIONS

• If f = 0 (fully parallel), Speedup = P (ideal case)
• If f = 0.1 (10% serial), max speedup = 10 (even with infinite processors)
• Highlights diminishing returns as P increases when f > 0
SOME REAL-TIME ANALOGIES
FOR AMDAHL’S LAW
Traffic Jam on a Highway
Analogy: Imagine a superhighway with 100 lanes,
but a toll booth at the end with only one lane.
No matter how fast or wide the highway is, every
car still queues at the bottleneck.
That toll booth is like the serial portion of your code: it limits throughput, no matter how many processors you have.
CONTINUATION….

Kitchen with One Chef


Analogy: You’re cooking a big meal. You recruit ten
helpers to chop, fry, and plate—but only you can add
the final garnish.
While helpers speed up prep, everything still waits
for the “final touch.”
That task—like serial code—delays total completion.
SCALABILITY IN MIMD
SYSTEMS
• Measures how performance improves
with more processors.
• Affected by:
- Communication overhead
- Load balancing
- Memory contention
- Synchronization delays
TAKING TIMINGS OF MIMD
PROGRAMS

• Timing is crucial to evaluate performance.


• MPI_Wtime() provides accurate timing in
MPI.
• Measure both computation and
communication time.
• Identify performance bottlenecks.
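A minimal timing pattern in C, sketched under the assumption that the work being measured sits between the two MPI_Wtime() calls:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    MPI_Barrier(MPI_COMM_WORLD);            /* start all processes together */
    double start = MPI_Wtime();

    /* ... computation and communication being measured ... */

    double local = MPI_Wtime() - start;     /* elapsed time on this process */

    double longest;
    /* Report the slowest process: that is the program's wall-clock time. */
    MPI_Reduce(&local, &longest, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) printf("elapsed = %f seconds\n", longest);

    MPI_Finalize();
    return 0;
}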
GPU PERFORMANCE
• Factors affecting performance:
- Thread/block configuration
- Memory access patterns
- Warp divergence
- Shared memory utilization
• Use tools like NVIDIA Nsight for
profiling.
WHAT IS CUDA?
(COMPUTE UNIFIED DEVICE
ARCHITECTURE)
• CUDA is a parallel computing platform and
API model developed by NVIDIA.
• It allows general-purpose processing on
GPUs (GPGPU).
• Key feature: Enables massive parallelism
using thousands of threads on NVIDIA
GPUs.
ARCHITECTURE OVERVIEW

• Threads are organized in:


- Threads → Blocks → Grids

• This hierarchical model enables scalable parallel


execution across GPU cores.
CUDA PROGRAMMING MODEL
• CUDA extends C/C++ with keywords:
 - __global__: GPU kernel
 - __device__: GPU function
 - __host__: CPU function

• Example (completed as a sketch below):
__global__ void add(int *a, int *b, int *c)
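Completing that fragment as a hedged sketch (the helper function sum2, the bounds-check parameter n, and the launch configuration shown in the comment are illustrative additions, not part of the slide):

/* __device__: callable only from GPU code */
__device__ int sum2(int x, int y) { return x + y; }

/* __global__: a kernel, launched from the host and run by many GPU threads */
__global__ void add(int *a, int *b, int *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* unique global thread index */
    if (i < n)
        c[i] = sum2(a[i], b[i]);                     /* each thread handles one element */
}

/* __host__ (the default): ordinary CPU code. A launch might look like:
   add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);                        */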
CUDA MEMORY MODEL

• Types of memory:
- Global (slow, accessible by all threads)
- Shared (fast, within block)
- Local (per-thread)
- Constant & Texture (read-only)
• Efficient memory access is essential for
performance.
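A small illustrative kernel that stages data in fast per-block shared memory; the block size of 256 and the assumption that the input length is a multiple of the block size are simplifications made for this sketch.

#define BLOCK 256                           /* assumed block size */

/* Reverses the elements within each block; assumes the input length
   is a multiple of BLOCK. */
__global__ void reverse_within_block(const int *in, int *out) {
    __shared__ int tile[BLOCK];             /* shared: fast, visible to the whole block */
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = in[i];              /* stage slow global data into fast shared memory */
    __syncthreads();                        /* wait until every thread has written its slot */

    out[i] = tile[blockDim.x - 1 - threadIdx.x];   /* threads read each other's data */
}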
THREADS AND WARPS

• Threads are grouped in warps (32 threads):


- SIMT model: Single Instruction, Multiple
Threads
- Warp divergence occurs when threads
take different paths, reducing efficiency.
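A tiny illustrative kernel that provokes divergence: even and odd threads of the same warp take different branches, so the warp executes both paths one after the other.

__global__ void divergent(float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)                 /* threads of one warp split across two paths */
        x[i] = x[i] * 2.0f;
    else
        x[i] = x[i] + 1.0f;
}

/* A divergence-free variant would branch on the warp index instead, so all
   threads of a given warp follow the same path:
       if ((i / warpSize) % 2 == 0) ...                                      */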
PERFORMANCE
CONSIDERATIONS
• CUDA performance depends on:
- Grid and block configuration
- Minimizing memory latency
- Avoiding warp divergence
- Coalesced memory access
CUDA USE CASES
Widely used in:
- Machine learning (deep learning)
- Scientific computing
- Financial modeling
- Real-time video/image processing
- High-performance computing
INTRODUCTION TO MPI

• MPI (Message Passing Interface) is a


standardized and portable message-passing
system.
• Designed for use on distributed memory
systems.
• Provides point-to-point and collective
communication mechanisms.
• Suitable for parallel applications requiring
explicit process coordination.
MPI FUNCTIONS
• MPI_Init, MPI_Finalize – Initialize and
terminate MPI environment.
• MPI_Comm_rank – Get process ID.
• MPI_Comm_size – Get number of processes.
• MPI_Send, MPI_Recv – Basic point-to-point
communication.
• MPI_Bcast, MPI_Scatter, MPI_Gather –
Collective communication.
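A minimal C program tying these calls together; the integer payload and the message tag 0 are illustrative.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);                     /* start the MPI environment */

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);       /* this process's ID */
    MPI_Comm_size(MPI_COMM_WORLD, &size);       /* total number of processes */

    if (rank != 0) {
        int msg = rank * 10;                    /* illustrative payload */
        MPI_Send(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else {
        printf("Process 0 of %d started\n", size);
        for (int src = 1; src < size; src++) {
            int msg;
            MPI_Recv(&msg, 1, MPI_INT, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("Received %d from process %d\n", msg, src);
        }
    }

    MPI_Finalize();                             /* shut down the MPI environment */
    return 0;
}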
THE TRAPEZOIDAL RULE IN
MPI
• Numerical integration technique adapted for
parallelization using MPI.
• Each process computes a local integral over a
subinterval.
• Results are gathered using MPI_Reduce or
MPI_Gather.
• Demonstrates load partitioning and
communication.
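A hedged sketch of the idea in C; the integrand f(x) = x*x, the interval [0, 1], and n = 1024 trapezoids (assumed divisible by the number of processes) are illustrative choices.

#include <stdio.h>
#include <mpi.h>

static double f(double x) { return x * x; }     /* illustrative integrand */

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double a = 0.0, b = 1.0;                    /* full interval */
    int n = 1024;                               /* total trapezoids */
    double h = (b - a) / n;

    int local_n = n / size;                     /* each process gets a subinterval */
    double local_a = a + rank * local_n * h;
    double local_b = local_a + local_n * h;

    double local_sum = (f(local_a) + f(local_b)) / 2.0;
    for (int i = 1; i < local_n; i++)
        local_sum += f(local_a + i * h);
    local_sum *= h;                             /* local trapezoidal estimate */

    double total;
    MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("integral ~ %f\n", total);

    MPI_Finalize();
    return 0;
}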
DEALING WITH I/O IN MPI

• MPI I/O functions help perform parallel


I/O operations.
• File operations: MPI_File_open,
MPI_File_read, MPI_File_write.
• Coordinated I/O avoids bottlenecks and
ensures consistency.
• Collective I/O improves performance
over individual reads/writes.
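A minimal collective-write sketch; the file name output.dat and the one-integer-per-process payload are assumptions made for the example.

#include <mpi.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    int value = rank;                           /* each process writes its own rank */
    MPI_Offset offset = rank * sizeof(int);     /* non-overlapping regions of the file */

    /* Collective write: all processes participate together, which lets the
       MPI library coordinate and optimize the I/O. */
    MPI_File_write_at_all(fh, offset, &value, 1, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}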
COLLECTIVE
COMMUNICATION
MPI provides high-level operations involving
all processes:
- MPI_Bcast: Broadcast from one to all.
- MPI_Scatter: Distribute different parts of data.
- MPI_Gather: Collect data to one process.
- MPI_Reduce: Apply reduction operation (sum, max,
etc.).

Ensures synchronization and efficiency.
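A short sketch combining a broadcast, a scatter, and a reduction; the element counts are illustrative and assumed divisible by the number of processes.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int n = 8 * size;                               /* total number of elements */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* broadcast n from process 0 */

    int *data = NULL;
    if (rank == 0) {                                /* root fills the full array */
        data = malloc(n * sizeof(int));
        for (int i = 0; i < n; i++) data[i] = 1;
    }

    int chunk = n / size;
    int *local = malloc(chunk * sizeof(int));
    MPI_Scatter(data, chunk, MPI_INT, local, chunk, MPI_INT, 0, MPI_COMM_WORLD);

    int local_sum = 0, total = 0;
    for (int i = 0; i < chunk; i++) local_sum += local[i];
    MPI_Reduce(&local_sum, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("total = %d\n", total);   /* expect total == n */

    free(local);
    free(data);
    MPI_Finalize();
    return 0;
}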


MPI-DERIVED DATATYPES
• Custom datatypes for structured communication.
• Useful for sending non-contiguous or complex
data.
• Created using MPI_Type_vector, MPI_Type_struct,
etc.
• Increases code clarity and transfer efficiency.
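An illustrative use of MPI_Type_vector: sending one column of a small row-major matrix between two processes (the 4x4 matrix size, and running with at least two processes, are assumptions of this sketch).

#include <stdio.h>
#include <mpi.h>

#define ROWS 4
#define COLS 4

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One column of a row-major ROWS x COLS matrix: ROWS blocks of one int,
       with a stride of COLS ints between consecutive elements. */
    MPI_Datatype column;
    MPI_Type_vector(ROWS, 1, COLS, MPI_INT, &column);
    MPI_Type_commit(&column);

    if (rank == 0) {
        int m[ROWS][COLS];
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++) m[i][j] = 10 * i + j;
        MPI_Send(&m[0][1], 1, column, 1, 0, MPI_COMM_WORLD);    /* send column 1 */
    } else if (rank == 1) {
        int col[ROWS];
        MPI_Recv(col, ROWS, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (int i = 0; i < ROWS; i++) printf("col[%d] = %d\n", i, col[i]);
    }

    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}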
PERFORMANCE EVALUATION
OF MPI PROGRAMS

• Use MPI_Wtime for timing execution segments.


• Measure communication overhead and compute
efficiency.
• Evaluate scalability using speedup and efficiency
metrics.
• Optimize through message size tuning and load
balancing.
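The usual metrics can be computed directly from the measured times; the numbers below are placeholders that would in practice come from MPI_Wtime measurements.

#include <stdio.h>

int main(void) {
    double t_serial = 40.0, t_parallel = 6.0;   /* illustrative timings, in seconds */
    int p = 8;                                  /* number of processes used */

    double speedup = t_serial / t_parallel;     /* S = T_serial / T_parallel */
    double efficiency = speedup / p;            /* E = S / p, ideally close to 1 */

    printf("S = %.2f, E = %.2f\n", speedup, efficiency);
    return 0;
}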
A PARALLEL SORTING
ALGORITHM
• Sorting algorithms like parallel merge sort
can be implemented using MPI.
• Each process sorts a portion of the array
locally.
• Processes exchange and merge data in
parallel phases.
• Requires synchronization and data
redistribution.
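A simplified, hedged sketch of the first phases in C: every process sorts its local chunk, and the root gathers the sorted runs. For brevity the root then re-sorts the gathered array instead of merging pairwise in log p phases, as a full parallel merge sort would; the chunk size and random data are illustrative.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int chunk = 4;                                   /* illustrative local size */
    int *local = malloc(chunk * sizeof(int));
    srand(rank + 1);
    for (int i = 0; i < chunk; i++) local[i] = rand() % 100;

    qsort(local, chunk, sizeof(int), cmp_int);       /* phase 1: each process sorts locally */

    int *all = NULL;
    if (rank == 0) all = malloc(chunk * size * sizeof(int));
    MPI_Gather(local, chunk, MPI_INT, all, chunk, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        qsort(all, chunk * size, sizeof(int), cmp_int);   /* simplified final merge */
        for (int i = 0; i < chunk * size; i++) printf("%d ", all[i]);
        printf("\n");
        free(all);
    }

    free(local);
    MPI_Finalize();
    return 0;
}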
Thank You
