3. Introduction to Parallelism
Introduction to Parallelization
August 7, 2023
Logistics
• Office hours: 5 – 6 PM
– Chetna (RM) chetna@cse
– Muzafar (KD-213) muzafarwan@cse
– Vishal Singh (RM-505) vshlsng@cse
• Group formation
– Email by August 9
– Include names, roll numbers, email-ids
– vdeka@cse, chetna@cse
2
Parallelism Everywhere
Source: https://gilmour.com/
Why Parallel?
Task: Find the average age of Indians
Time (1 CPU): 10 s
Time (2 CPUs): 5 s
Time (4 CPUs): 3 s
Parallelism
A parallel computer is a collection of processing
elements that communicate and cooperate to solve
large problems fast.
5
Why Fast?
6
Basic Computing Unit
[Figure: a processing unit (CPU/core) and memory; Intel Core i7 pictured (Courtesy: www.intel.com)]
System – Simplified View
[Figure: simplified view of a system with memory and disk]
8
Multicore Era
[Figure: Intel 4004 (1971), a single core on a single chip]
9
Moore’s Law
Number of CPU cores per node increased
Gordon Moore
[Source: Wikipedia]
Parallel Computing
Supercomputer/Cluster/Data Center
Network is the backbone for data communication
Parallel Computer
Compute nodes
Domain Decomposition
Compute nodes
Discretization
15
Data Bottleneck
Congestion
Read/write
Storage
17
Parallel Computer
18
Parallel Average
One process per core, each summing its own chunk of the array (ranks 0–3):
Process 0: for i = 0 to N/P: sum += a[i]
Process 1: for i = N/P to 2N/P: sum += a[i]
Process 2: for i = 2N/P to 3N/P: sum += a[i]
Process 3: for i = 3N/P to N: sum += a[i]
19
Parallel Code Example
20
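The slide's code is not reproduced here; as a stand-in, a minimal C sketch (an assumption, using POSIX threads) in which each thread sums its own chunk and the main thread combines the partial sums to form the average:

/* Each of P threads sums its own chunk of a[]; main combines the partial sums. */
#include <pthread.h>
#include <stdio.h>

#define N 1000000
#define P 4

static double a[N];
static double partial[P];

static void *chunk_sum(void *arg) {
    long rank = (long)arg;
    long lo = rank * (N / P);
    long hi = (rank == P - 1) ? N : lo + N / P;   /* last thread takes any remainder */
    double s = 0.0;
    for (long i = lo; i < hi; i++)
        s += a[i];
    partial[rank] = s;
    return NULL;
}

int main(void) {
    for (long i = 0; i < N; i++) a[i] = 1.0;      /* sample data */

    pthread_t tid[P];
    for (long r = 0; r < P; r++)
        pthread_create(&tid[r], NULL, chunk_sum, (void *)r);
    for (long r = 0; r < P; r++)
        pthread_join(tid[r], NULL);

    double sum = 0.0;
    for (int r = 0; r < P; r++)
        sum += partial[r];                        /* combine the partial sums */

    printf("average = %f\n", sum / N);
    return 0;
}

Compile with gcc -pthread; with four threads this mirrors the four per-core loops on the previous slide.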
Performance Measure
• Speedup
  S_P = Time(1 processor) / Time(P processors)
• Efficiency
  E_P = S_P / P
21
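For example, with the earlier numbers (10 s on 1 CPU, 3 s on 4 CPUs): S_4 = 10/3 ≈ 3.3 and E_4 = 3.3/4 ≈ 0.83, i.e. the speedup falls short of the ideal value of 4.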
Parallel Performance (Parallel Sum)
Parallel efficiency of summing 10^7 doubles
22
Ideal Speedup
[Plot: speedup vs. number of processors, showing linear (ideal), superlinear, and sublinear curves]
23
Scalability Bottleneck
25
Parallel Programming Models
• Libraries: MPI, TBB, Pthreads, OpenMP, …
• New languages: Haskell, X10, Chapel, …
• Extensions: Coarray Fortran, UPC, Cilk, OpenCL, …
• Shared memory
– OpenMP, Pthreads, CUDA, …
• Distributed memory
– MPI, UPC, …
• Hybrid
– MPI + OpenMP, MPI + CUDA
26
Sharing Data
27
Parallel Programming Models
Shared memory programming – OpenMP, Pthreads
Distributed memory programming – MPI
[Figure: cores with private caches, one process/thread per core]
28
Shared Memory Programming
• Shared address space
• Time to access some memory words is longer than others (NUMA)
• Programming paradigms – Pthreads, OpenMP
• Need to worry about concurrent access to shared data (see the sketch below)
[Figure: CPU cores sharing a single memory]
29
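To illustrate the concurrent-access bullet, a minimal OpenMP sketch (not from the slides): without the atomic directive the threads race on the shared counter and the result is generally wrong.

#include <stdio.h>
#include <omp.h>

int main(void) {
    long counter = 0;
    #pragma omp parallel for
    for (long i = 0; i < 1000000; i++) {
        #pragma omp atomic
        counter++;                        /* protected update to shared data */
    }
    printf("counter = %ld\n", counter);   /* 1000000 with the atomic; unpredictable without it */
    return 0;
}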
Threads
31
OpenMP Example
• Thread-based
• Fork-join model: a team of threads is forked at a parallel region and joined at its end (see the sketch below)
32
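A minimal fork-join sketch in C (an assumed stand-in; the slide's code is shown as an image and is not reproduced here):

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel                  /* fork: a team of threads starts here */
    {
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }                                     /* join: implicit barrier, back to one thread */
    return 0;
}

Compile with gcc -fopenmp; set the team size with, e.g., OMP_NUM_THREADS=4.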
OpenMP
34
OpenMP
35
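The OpenMP code on these slides is not reproduced in the text; as an assumed stand-in, a work-sharing sketch in which OpenMP divides the loop iterations among the threads:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int a[16];
    #pragma omp parallel for              /* iterations are split across the team */
    for (int i = 0; i < 16; i++) {
        a[i] = i * i;
        printf("iteration %d done by thread %d\n", i, omp_get_thread_num());
    }
    return 0;
}

The interleaved prints (cf. the Output slide) show which thread handled which iterations.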
Output
36
OpenMP – Parallel Sum
37
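A sketch of the parallel sum in OpenMP (an assumption; the slide's exact code is not shown), using a reduction so each thread accumulates a private copy and the copies are combined at the join:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
    const long N = 10000000;                    /* 10^7 doubles, as on the performance slide */
    double *a = malloc(N * sizeof(double));
    for (long i = 0; i < N; i++) a[i] = 1.0;

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)   /* private partial sums, combined at the end */
    for (long i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f\n", sum);
    free(a);
    return 0;
}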
OpenMP Timing
38
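A sketch of timing with omp_get_wtime() (assumed; the slide's code is not shown):

#include <stdio.h>
#include <omp.h>

int main(void) {
    double t0 = omp_get_wtime();            /* wall-clock time in seconds */

    #pragma omp parallel
    {
        /* work to be timed */
    }

    double t1 = omp_get_wtime();
    printf("elapsed: %f s\n", t1 - t0);
    return 0;
}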
Multiple Systems
[Figure: multiple systems, each with cores, caches, and processes]
39
Distributed Memory Systems
• Networked systems
• Distributed memory
  – Local memory
  – Remote memory
• Parallel file system
[Figure: a cluster of compute nodes, 64 – 192 GB RAM per node]
40
MPI (Message Passing Interface)
• Standard for message passing in a distributed
memory environment (most widely used
programming model in supercomputers)
• Efforts began in 1991 by Jack Dongarra, Tony
Hey, and David W. Walker
• MPI Forum formed in 1993
– Version 1.0: 1994
– Version 4.0: 2021
41
Process - Distinct Address Space
[Figure: each process runs on a core and has its own memory (distinct address space)]
42
Multiple Processes on a Single Node
[Figure: processes placed on Node 1 and Node 2]
44
Communication using Messages
One process per core, each with its own memory, summing its chunk; the partial results are exchanged via messages:
Process 0: for i = 0 to N/P: sum += a[i]
Process 1: for i = N/P to 2N/P: sum += a[i]
Process 2: for i = 2N/P to 3N/P: sum += a[i]
Process 3: for i = 3N/P to N: sum += a[i]
45
Communication using Messages
[Figure: processes exchanging messages between their private memories]
46
Message Passing
[Figure: timeline of a message sent from Process 0 to Process 1]
47
MPI Programming
48
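A minimal MPI skeleton (a sketch of the usual structure; the slide's code is not reproduced):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);                   /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* this process's id within the communicator */
    MPI_Comm_size(MPI_COMM_WORLD, &size);     /* total number of processes */
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();                           /* shut down the MPI runtime */
    return 0;
}

Compile with mpicc and launch with, e.g., mpirun -np 4 ./a.out.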
MPI Programming
Each process holds its local data in its own memory and sums its chunk:
Process 0: for i = 0 to N/P: sum += a[i]
Process 1: for i = N/P to 2N/P: sum += a[i]
Process 2: for i = 2N/P to 3N/P: sum += a[i]
Process 3: for i = 3N/P to N: sum += a[i]
51
Simplest Communication Primitives
• MPI_Send
• MPI_Recv
52
MPI Programming
SENDER:
int MPI_Send (const void *buf, int count, MPI_Datatype datatype, int dest,
              int tag, MPI_Comm comm)
RECEIVER:
int MPI_Recv (void *buf, int count, MPI_Datatype datatype, int source,
              int tag, MPI_Comm comm, MPI_Status *status)
53
MPI Programming
MPI_Comm_rank (MPI_COMM_WORLD, &myrank);
// Sender process
if (myrank == 0) /* code for process 0 */
{
strcpy (message,"Hello, there");
MPI_Send (message, strlen(message)+1, MPI_CHAR, 1, 99,
MPI_COMM_WORLD);
}
// Receiver process
else if (myrank == 1) /* code for process 1 */
{
MPI_Recv (message, 20, MPI_CHAR, 0, 99, MPI_COMM_WORLD,
&status);
printf ("received :%s\n", message);
}
54
MPI – Parallel Sum
Assume the data array resides in the memory of process 0 initially
MPI_Comm_rank (MPI_COMM_WORLD, &myrank);
MPI_Comm_size (MPI_COMM_WORLD, &size);
// Sender process
if (myrank == 0) /* code for process 0 */
{
  for (int rank = 1; rank < size; rank++) {
    start = rank * (N / size);   // offset in elements; pointer arithmetic scales by sizeof(int)
    MPI_Send (data + start, N/size, MPI_INT, rank, 99, MPI_COMM_WORLD);
  }
}
else /* code for processes 1 … size-1 */
{
  // datatype must match the sender's MPI_INT
  MPI_Recv (data, N/size, MPI_INT, 0, 99, MPI_COMM_WORLD, &status);
}
55
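The slide stops at distributing the chunks; one way to finish the sum (an assumption, not necessarily the lecture's approach, reusing data, N, size, and myrank from the snippet above) is for every process to add up its local chunk and combine the partial sums with MPI_Reduce:

long long local = 0, total = 0;
for (int i = 0; i < N/size; i++)
    local += data[i];                  /* each rank, including 0, sums its own chunk */

MPI_Reduce (&local, &total, 1, MPI_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

if (myrank == 0)
    printf ("sum = %lld\n", total);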
MPI Timing
double stime = MPI_Wtime();   // wall-clock time in seconds
…
…
…
double etime = MPI_Wtime();
// elapsed time: etime - stime
56
Parallelization
57
Interpolation
58
Interpolation
59
Range/Value Query
Input: File
Output: File
60
Query on a Million Processes
Compute nodes
Unstructured Mesh
Source: COMSOL
62
Unstructured Mesh
64