
PARALLEL COMPUTING
BASIC INFORMATION
• Course/Code: Parallel Programming/BCS702
• Course Requisites:
 Knowledge of C or C++ Programming
 Familiarity with Operating Systems Concepts
 Basic Understanding of Computer Architecture
 Foundations in Algorithms and Data Structures
 Mathematical Maturity and Problem-Solving Skills
 Basic Exposure to Unix/Linux Environment
• Text Book: “An Introduction to Parallel Programming” (2nd Edition) by Peter S. Pacheco and Matthew Malensek
CONTENTS

• Need for Parallel Programming
• Parallel Hardware and Software
• Classifications: SIMD vs. MIMD
• Interconnection Networks & Cache Coherence
• Shared vs. Distributed Memory Models
• Coordinating Processes and Threads
WHY PARALLEL PROGRAMMING?
• Traditional CPUs can’t get significantly faster (physical limits on clock speed).
• Multi-core processors are standard in modern computers.
• Huge datasets and real-time systems require faster computation.
• Cloud, AI, ML, and Big Data applications demand scalable performance.
Hardware | Description | Real-Time Example
Multi-core CPUs | A single chip with multiple cores (e.g., quad-core, octa-core) | Smartphones and laptops run multiple apps (camera, calls, background sync) in parallel
GPUs (Graphics Processing Units) | Thousands of small cores for data-parallel tasks | Gaming, image rendering, AI model training
Clusters | Group of computers connected via a network | Weather forecasting simulations in supercomputing centers
FPGAs/ASICs | Custom chips designed for specific parallel tasks | Self-driving cars use them for fast sensor data processing
Supercomputers | Thousands of CPUs/GPUs working together | Scientific simulations, space research (NASA, ISRO)
Software | Type | Real-Time Example
OpenMP | API for shared-memory multiprocessing (C/C++) | Used in scientific computing apps on multi-core servers
MPI (Message Passing Interface) | Library for distributed-memory systems | Used in large-scale simulations on supercomputers (e.g., earthquake modeling)
CUDA | NVIDIA’s API for GPU programming | Deep learning frameworks like TensorFlow/PyTorch use CUDA for model training
MapReduce / Apache Spark | Distributed data processing frameworks | Used by Google, Amazon, Flipkart for data analytics
Hadoop | Big Data processing over distributed storage (HDFS) | Used in banks and telecom companies for processing transaction logs
Java Threads / Python multiprocessing | Built-in parallel features in general-purpose languages | Used in chat apps, background processing in web apps
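As a minimal illustration of the shared-memory entry in the table above, here is a hedged C sketch of an OpenMP parallel loop; the array size, variable names, and the reduction on a running sum are illustrative choices, not part of the course material.

#include <stdio.h>
#include <omp.h>

#define N 1000000                /* illustrative problem size */

static double a[N];

int main(void) {
    double sum = 0.0;

    /* The loop iterations are split across threads; the reduction clause
       combines the per-thread partial sums safely at the end. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = i * 0.5;
        sum += a[i];
    }

    printf("sum = %f (up to %d threads)\n", sum, omp_get_max_threads());
    return 0;
}

Compile with, e.g., gcc -fopenmp so that the pragma takes effect.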
EVERYDAY EXAMPLE:
IMAGINE ONE PERSON TYPING A 10-PAGE DOCUMENT — TAKES 10 MINUTES.

TWO PEOPLE TYPING THE SAME — DONE IN 5 MINUTES.

FOUR PEOPLE TYPING THE SAME AGAIN — DONE IN 2.5 MINUTES.

JUST LIKE THAT, INSTEAD OF ONE FAST PROCESSOR, WE NOW RELY ON MANY PROCESSORS WORKING TOGETHER. AND THAT IS WHAT WE CALL:

PARALLEL PROGRAMMING
BIG DATA PROCESSING (E-COMMERCE)
SCENARIO:
Amazon processes millions of transactions, product recommendations, and user behavior logs per second.

CHALLENGE:
Sequential processing would take hours or days to analyze terabytes of data.

SOLUTION:
Use Apache Spark or Hadoop MapReduce to:
• Divide data into chunks
• Process in parallel across hundreds of machines
• Get results in minutes or seconds
Real-Time Fraud Detection (Banking/Finance)
Scenario:
Banks need to detect fraudulent transactions in
real-time (e.g., sudden login from another
country, large withdrawal).

Challenge:
Comparing current transaction with thousands of
historical ones in real-time is computationally
heavy.

Solution:
Use parallel processing on GPUs or clusters
to:
• Analyze transaction patterns instantly
PARALLEL SYSTEMS
• SIMD: Single Instruction, Multiple Data
• MIMD: Multiple Instruction, Multiple Data
• Shared-Memory Architectures
• Distributed-Memory Architectures
• Synchronization and Coordination Methods
CLASSIFICATION OF PARALLEL COMPUTERS
Classification Type | Key Feature | Example
Flynn’s SIMD | Single instruction, multiple data | GPU operations
Flynn’s MIMD | Multiple instruction, multiple data | Multicore CPUs, clusters
Shared Memory | Global memory access | OpenMP on multicore CPUs
Distributed Memory | Local memory per processor | MPI-based cluster
Hybrid Architecture | Shared and distributed memory | MPI + OpenMP hybrid
Multicore Systems | Multiple CPU cores per processor | Intel/AMD CPUs
Manycore Systems | Many lightweight cores | NVIDIA GPU, Intel Xeon Phi
Accelerators / Heterogeneous | CPU + GPU/TPU/FPGA | CUDA with CPU offloading
Cluster Computing | Networked group of computers | Beowulf clusters
Flynn’s Taxonomy – Classifications of Parallel Computers

Flynn categorized computer architectures into four types based on the number of instruction streams and data streams:

Type | Full Form | Instruction Stream | Data Stream
SISD | Single Instruction, Single Data | 1 | 1
SIMD | Single Instruction, Multiple Data | 1 | Many
MISD | Multiple Instruction, Single Data | Many | 1
MIMD | Multiple Instruction, Multiple Data | Many | Many
Interconnection Networks

Topology | Connectivity | Latency | Scalability | Cost
Linear Array | Low | High | Poor | Low
Ring | Medium | Medium | Moderate | Low
2D Mesh | Good | Low | Good | Medium
Torus | Better | Lower | Good | Higher
Hypercube | Excellent | Very Low | Excellent | High
Crossbar | Optimal | Lowest | Poor | Very High
GPU PROGRAMMING

• Utilizes the parallel processing power of GPUs.
• CUDA: NVIDIA’s platform for general-purpose GPU computing.
• Hierarchical execution model: Threads, Blocks, Grids.
• Efficient for data-parallel tasks.
• Requires careful memory and thread management.
PROGRAMMING HYBRID SYSTEMS
• Combines CPU and GPU for optimized computation.
• CPU handles control-heavy tasks; GPU handles parallel computation.
• Requires memory transfer and synchronization between host and device.
• Programming models like CUDA and OpenCL facilitate hybrid execution.
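The host/device workflow described above can be sketched as follows; this is a minimal illustration only, and the kernel name scale, the problem size, and the block size of 256 are assumptions made for the example.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Illustrative kernel: the GPU does the data-parallel work. */
__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_x = (float *)malloc(bytes);                  /* host (CPU) buffer */
    for (int i = 0; i < n; i++) h_x[i] = 1.0f;

    float *d_x;                                           /* device (GPU) buffer */
    cudaMalloc(&d_x, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);  /* host -> device */

    scale<<<(n + 255) / 256, 256>>>(d_x, n);              /* launch a grid of 256-thread blocks */
    cudaDeviceSynchronize();                              /* wait for the GPU to finish */

    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);  /* device -> host */
    printf("h_x[0] = %f\n", h_x[0]);

    cudaFree(d_x);
    free(h_x);
    return 0;
}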
AMDAHL’S LAW

• Amdahl’s Law defines the theoretical maximum speedup of a program based on the proportion of code that can be parallelized.
• It emphasizes that the serial portion limits overall performance improvement.
FORMULA
Speedup (S) = 1 / (f + (1 - f) / P)
Where:
• f = Serial portion of the program
• (1 - f) = Parallel portion
• P = Number of processors
• S = Overall speedup
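As a quick check of the formula, the following hedged C sketch simply evaluates it for a serial fraction f = 0.1 and a few illustrative processor counts (the sample values are examples, not from the slides):

#include <stdio.h>

/* Amdahl's Law: S = 1 / (f + (1 - f) / P) */
static double speedup(double f, double p) {
    return 1.0 / (f + (1.0 - f) / p);
}

int main(void) {
    double f = 0.1;                          /* 10% of the program is serial */
    int procs[] = {2, 4, 16, 1024};
    for (int i = 0; i < 4; i++)
        printf("P = %4d -> S = %.2f\n", procs[i], speedup(f, procs[i]));
    /* As P grows, S approaches 1/f = 10: the serial part caps the gain. */
    return 0;
}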
IMPLICATIONS

• If f = 0 (fully parallel), Speedup = P (ideal case)
• If f = 0.1 (10% serial), max speedup = 10 (even with infinite processors)
• Highlights diminishing returns as P increases when f > 0
SOME REAL-TIME ANALOGIES
FOR AMDAHL’S LAW
Traffic Jam on a Highway
Analogy: Imagine a superhighway with 100 lanes,
but a toll booth at the end with only one lane.
No matter how fast or wide the highway is, every
car still queues at the bottleneck.
That toll booth is like the serial portion of your code: it limits throughput, no matter how many processors you have.
CONTINUATION….

Kitchen with One Chef


Analogy: You’re cooking a big meal. You recruit ten
helpers to chop, fry, and plate—but only you can add
the final garnish.
While helpers speed up prep, everything still waits
for the “final touch.”
That task—like serial code—delays total completion.
SCALABILITY IN MIMD
SYSTEMS
• Measures how performance improves
with more processors.
• Affected by:
- Communication overhead
- Load balancing
- Memory contention
- Synchronization delays
TAKING TIMINGS OF MIMD
PROGRAMS

• Timing is crucial to evaluate performance.


• MPI_Wtime() provides accurate timing in
MPI.
• Measure both computation and
communication time.
• Identify performance bottlenecks.
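A minimal timing pattern in C, sketched under the assumption that the work being measured sits between the two MPI_Wtime() calls:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    MPI_Barrier(MPI_COMM_WORLD);            /* start all processes together */
    double start = MPI_Wtime();

    /* ... computation and communication being measured ... */

    double local = MPI_Wtime() - start;     /* elapsed time on this process */

    double longest;
    /* Report the slowest process: that is the program's wall-clock time. */
    MPI_Reduce(&local, &longest, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) printf("elapsed = %f seconds\n", longest);

    MPI_Finalize();
    return 0;
}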
GPU PERFORMANCE
• Factors affecting performance:
- Thread/block configuration
- Memory access patterns
- Warp divergence
- Shared memory utilization
• Use tools like NVIDIA Nsight for
profiling.
WHAT IS CUDA?
(COMPUTE UNIFIED DEVICE
ARCHITECTURE)
• CUDA is a parallel computing platform and
API model developed by NVIDIA.
• It allows general-purpose processing on
GPUs (GPGPU).
• Key feature: Enables massive parallelism
using thousands of threads on NVIDIA
GPUs.
ARCHITECTURE OVERVIEW

• Threads are organized in:


- Threads → Blocks → Grids

• This hierarchical model enables scalable parallel


execution across GPU cores.
CUDA PROGRAMMING MODEL
• CUDA extends C/C++ with keywords:
 - __global__: GPU kernel
 - __device__: GPU function
 - __host__: CPU function

• Example (completed as a sketch below):
__global__ void add(int *a, int *b, int *c)
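Completing that fragment as a hedged sketch (the helper function sum2, the bounds-check parameter n, and the launch configuration shown in the comment are illustrative additions, not part of the slide):

/* __device__: callable only from GPU code */
__device__ int sum2(int x, int y) { return x + y; }

/* __global__: a kernel, launched from the host and run by many GPU threads */
__global__ void add(int *a, int *b, int *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* unique global thread index */
    if (i < n)
        c[i] = sum2(a[i], b[i]);                     /* each thread handles one element */
}

/* __host__ (the default): ordinary CPU code. A launch might look like:
   add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);                        */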
CUDA MEMORY MODEL

• Types of memory:
- Global (slow, accessible by all threads)
- Shared (fast, within block)
- Local (per-thread)
- Constant & Texture (read-only)
• Efficient memory access is essential for
performance.
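A small illustrative kernel that stages data in fast per-block shared memory; the block size of 256 and the assumption that the input length is a multiple of the block size are simplifications made for this sketch.

#define BLOCK 256                           /* assumed block size */

/* Reverses the elements within each block; assumes the input length
   is a multiple of BLOCK. */
__global__ void reverse_within_block(const int *in, int *out) {
    __shared__ int tile[BLOCK];             /* shared: fast, visible to the whole block */
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = in[i];              /* stage slow global data into fast shared memory */
    __syncthreads();                        /* wait until every thread has written its slot */

    out[i] = tile[blockDim.x - 1 - threadIdx.x];   /* threads read each other's data */
}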
THREADS AND WARPS

• Threads are grouped in warps (32 threads):


- SIMT model: Single Instruction, Multiple
Threads
- Warp divergence occurs when threads
take different paths, reducing efficiency.
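A tiny illustrative kernel that provokes divergence: even and odd threads of the same warp take different branches, so the warp executes both paths one after the other.

__global__ void divergent(float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)                 /* threads of one warp split across two paths */
        x[i] = x[i] * 2.0f;
    else
        x[i] = x[i] + 1.0f;
}

/* A divergence-free variant would branch on the warp index instead, so all
   threads of a given warp follow the same path:
       if ((i / warpSize) % 2 == 0) ...                                      */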
PERFORMANCE
CONSIDERATIONS
• CUDA performance depends on:
- Grid and block configuration
- Minimizing memory latency
- Avoiding warp divergence
- Coalesced memory access
CUDA USE CASES
Widely used in:
- Machine learning (deep learning)
- Scientific computing
- Financial modeling
- Real-time video/image processing
- High-performance computing
INTRODUCTION TO MPI

• MPI (Message Passing Interface) is a


standardized and portable message-passing
system.
• Designed for use on distributed memory
systems.
• Provides point-to-point and collective
communication mechanisms.
• Suitable for parallel applications requiring
explicit process coordination.
MPI FUNCTIONS
• MPI_Init, MPI_Finalize – Initialize and
terminate MPI environment.
• MPI_Comm_rank – Get process ID.
• MPI_Comm_size – Get number of processes.
• MPI_Send, MPI_Recv – Basic point-to-point
communication.
• MPI_Bcast, MPI_Scatter, MPI_Gather –
Collective communication.
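A minimal C program tying these calls together; the integer payload and the message tag 0 are illustrative.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);                     /* start the MPI environment */

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);       /* this process's ID */
    MPI_Comm_size(MPI_COMM_WORLD, &size);       /* total number of processes */

    if (rank != 0) {
        int msg = rank * 10;                    /* illustrative payload */
        MPI_Send(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else {
        printf("Process 0 of %d started\n", size);
        for (int src = 1; src < size; src++) {
            int msg;
            MPI_Recv(&msg, 1, MPI_INT, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("Received %d from process %d\n", msg, src);
        }
    }

    MPI_Finalize();                             /* shut down the MPI environment */
    return 0;
}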
THE TRAPEZOIDAL RULE IN
MPI
• Numerical integration technique adapted for
parallelization using MPI.
• Each process computes a local integral over a
subinterval.
• Results are gathered using MPI_Reduce or
MPI_Gather.
• Demonstrates load partitioning and
communication.
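A hedged sketch of the idea in C; the integrand f(x) = x*x, the interval [0, 1], and n = 1024 trapezoids (assumed divisible by the number of processes) are illustrative choices.

#include <stdio.h>
#include <mpi.h>

static double f(double x) { return x * x; }     /* illustrative integrand */

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double a = 0.0, b = 1.0;                    /* full interval */
    int n = 1024;                               /* total trapezoids */
    double h = (b - a) / n;

    int local_n = n / size;                     /* each process gets a subinterval */
    double local_a = a + rank * local_n * h;
    double local_b = local_a + local_n * h;

    double local_sum = (f(local_a) + f(local_b)) / 2.0;
    for (int i = 1; i < local_n; i++)
        local_sum += f(local_a + i * h);
    local_sum *= h;                             /* local trapezoidal estimate */

    double total;
    MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("integral ~ %f\n", total);

    MPI_Finalize();
    return 0;
}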
DEALING WITH I/O IN MPI

• MPI I/O functions help perform parallel


I/O operations.
• File operations: MPI_File_open,
MPI_File_read, MPI_File_write.
• Coordinated I/O avoids bottlenecks and
ensures consistency.
• Collective I/O improves performance
over individual reads/writes.
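A minimal collective-write sketch; the file name output.dat and the one-integer-per-process payload are assumptions made for the example.

#include <mpi.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    int value = rank;                           /* each process writes its own rank */
    MPI_Offset offset = rank * sizeof(int);     /* non-overlapping regions of the file */

    /* Collective write: all processes participate together, which lets the
       MPI library coordinate and optimize the I/O. */
    MPI_File_write_at_all(fh, offset, &value, 1, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}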
COLLECTIVE
COMMUNICATION
MPI provides high-level operations involving
all processes:
- MPI_Bcast: Broadcast from one to all.
- MPI_Scatter: Distribute different parts of data.
- MPI_Gather: Collect data to one process.
- MPI_Reduce: Apply reduction operation (sum, max,
etc.).

Ensures synchronization and efficiency.
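A short sketch combining a broadcast, a scatter, and a reduction; the element counts are illustrative and assumed divisible by the number of processes.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int n = 8 * size;                               /* total number of elements */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* broadcast n from process 0 */

    int *data = NULL;
    if (rank == 0) {                                /* root fills the full array */
        data = malloc(n * sizeof(int));
        for (int i = 0; i < n; i++) data[i] = 1;
    }

    int chunk = n / size;
    int *local = malloc(chunk * sizeof(int));
    MPI_Scatter(data, chunk, MPI_INT, local, chunk, MPI_INT, 0, MPI_COMM_WORLD);

    int local_sum = 0, total = 0;
    for (int i = 0; i < chunk; i++) local_sum += local[i];
    MPI_Reduce(&local_sum, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("total = %d\n", total);   /* expect total == n */

    free(local);
    free(data);
    MPI_Finalize();
    return 0;
}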


MPI-DERIVED DATATYPES
• Custom datatypes for structured communication.
• Useful for sending non-contiguous or complex
data.
• Created using MPI_Type_vector, MPI_Type_struct,
etc.
• Increases code clarity and transfer efficiency.
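An illustrative use of MPI_Type_vector: sending one column of a small row-major matrix between two processes (the 4x4 matrix size, and running with at least two processes, are assumptions of this sketch).

#include <stdio.h>
#include <mpi.h>

#define ROWS 4
#define COLS 4

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One column of a row-major ROWS x COLS matrix: ROWS blocks of one int,
       with a stride of COLS ints between consecutive elements. */
    MPI_Datatype column;
    MPI_Type_vector(ROWS, 1, COLS, MPI_INT, &column);
    MPI_Type_commit(&column);

    if (rank == 0) {
        int m[ROWS][COLS];
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++) m[i][j] = 10 * i + j;
        MPI_Send(&m[0][1], 1, column, 1, 0, MPI_COMM_WORLD);    /* send column 1 */
    } else if (rank == 1) {
        int col[ROWS];
        MPI_Recv(col, ROWS, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (int i = 0; i < ROWS; i++) printf("col[%d] = %d\n", i, col[i]);
    }

    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}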
PERFORMANCE EVALUATION
OF MPI PROGRAMS

• Use MPI_Wtime for timing execution segments.


• Measure communication overhead and compute
efficiency.
• Evaluate scalability using speedup and efficiency
metrics.
• Optimize through message size tuning and load
balancing.
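The usual metrics can be computed directly from the measured times; the numbers below are placeholders that would in practice come from MPI_Wtime measurements.

#include <stdio.h>

int main(void) {
    double t_serial = 40.0, t_parallel = 6.0;   /* illustrative timings, in seconds */
    int p = 8;                                  /* number of processes used */

    double speedup = t_serial / t_parallel;     /* S = T_serial / T_parallel */
    double efficiency = speedup / p;            /* E = S / p, ideally close to 1 */

    printf("S = %.2f, E = %.2f\n", speedup, efficiency);
    return 0;
}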
A PARALLEL SORTING
ALGORITHM
• Sorting algorithms like parallel merge sort
can be implemented using MPI.
• Each process sorts a portion of the array
locally.
• Processes exchange and merge data in
parallel phases.
• Requires synchronization and data
redistribution.
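A simplified, hedged sketch of the first phases in C: every process sorts its local chunk, and the root gathers the sorted runs. For brevity the root then re-sorts the gathered array instead of merging pairwise in log p phases, as a full parallel merge sort would; the chunk size and random data are illustrative.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int chunk = 4;                                   /* illustrative local size */
    int *local = malloc(chunk * sizeof(int));
    srand(rank + 1);
    for (int i = 0; i < chunk; i++) local[i] = rand() % 100;

    qsort(local, chunk, sizeof(int), cmp_int);       /* phase 1: each process sorts locally */

    int *all = NULL;
    if (rank == 0) all = malloc(chunk * size * sizeof(int));
    MPI_Gather(local, chunk, MPI_INT, all, chunk, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        qsort(all, chunk * size, sizeof(int), cmp_int);   /* simplified final merge */
        for (int i = 0; i < chunk * size; i++) printf("%d ", all[i]);
        printf("\n");
        free(all);
    }

    free(local);
    MPI_Finalize();
    return 0;
}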
Thank You
