3. Introduction to Parallelism

The lecture introduces the concept of parallelization, emphasizing its importance in computing efficiency and speed, particularly in processing large datasets. It covers various programming models for parallel computing, including shared and distributed memory systems, and highlights tools such as OpenMP and MPI. The session also discusses performance measures like speedup and efficiency, along with practical examples of parallel code implementation.


Lecture 3

Introduction to Parallelization

August 7, 2023
Logistics
• Office hours: 5 – 6 PM
– Chetna (RM) chetna@cse
– Muzafar (KD-213) muzafarwan@cse
– Vishal Singh (RM-505) vshlsng@cse
• Group formation
– Email by August 9
– Include names, roll numbers, email-ids
– vdeka@cse, chetna@cse

2
Parallelism Everywhere

Source: https://gilmour.com/
Why Parallel?
Task: Find the average age of Indians

India's 2020 population is estimated at 1,380,004,385 people at mid-year, according to UN data.

Time (1 human): > 40 years

Time (1 CPU): 10 s

Time (2 CPUs): 5 s

Time (4 CPUs): 3 s
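Note the trend: 2 CPUs halve the time (10 s to 5 s), but 4 CPUs take 3 s rather than the ideal 2.5 s; the gap between ideal and achieved speedup is quantified later with the speedup and efficiency measures.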
Parallelism
A parallel computer is a collection of processing
elements that communicate and cooperate to solve
large problems fast.

– Almasi and Gottlieb (1989)

5
Why Fast?

6
Basic Computing Unit

[Diagram: a CPU/core connected to memory; photo of an Intel Core i7 processing unit (Courtesy: www.intel.com)]
System – Simplified View

[Diagram: CPU cores, fast memory, main memory, and disk]
8
Multicore Era
• Intel 4004 (1971): single core, single chip
• Cray X-MP (1982): single core, multiple chips
• Hydra (1996): multiple cores, single chip
• IBM POWER4 (2001): multiple cores, multiple chips
9
Moore’s Law
The number of CPU cores per node has increased.

[Photo: Gordon Moore]
[Source: Wikipedia]
Parallel Computing

[Diagram: interconnected compute nodes, each with a disk]
Supercomputer/Cluster/Data Center
Network is the backbone for data communication
Parallel Computer

Compute nodes
Domain Decomposition

Compute nodes
Discretization

Gridded mesh for a global model [Credit: Tompkins, ICTP]

15
Data Bottleneck
[Diagram: compute nodes issue read/write requests to storage; congestion arises at the storage link]

Example: 'Age' is data
Average – Serial vs. Parallel
Serial:
    for i = 1 to N
        sum += a[i]
    avg = sum/N

Parallel (each of P processes):
    for i = 1 to N/P
        sum += a[i]
    collect sums and compute average
17
Parallel Computer

A parallel computer is a collection of processing elements that communicate and cooperate to solve large problems fast.

– Almasi and Gottlieb (1989)

18
Parallel Average
[Diagram: four processes (ranks 0–3), each on its own core with its own memory]

Process 0: for i = 0 to N/P:      sum += a[i]
Process 1: for i = N/P to 2N/P:   sum += a[i]
Process 2: for i = 2N/P to 3N/P:  sum += a[i]
Process 3: for i = 3N/P to N:     sum += a[i]
19
Parallel Code Example

// local computation at every process/thread
for (i = N/P * id ; i < N/P * (id+1) ; i++)
    localsum += a[i]

// collect localsum, add up in one of the ranks, and compute average

20
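A minimal sketch of this pattern as a plain C function (the function name block_sum is illustrative; the worker count P and worker id come from the runtime, and the final collection step is left to the OpenMP and MPI versions shown later):

// partial sum of the block owned by worker 'id' out of 'P' workers
double block_sum (const double *a, int N, int P, int id)
{
    double localsum = 0.0;
    // assumes N is divisible by P; otherwise the last worker also takes the remainder
    for (int i = (N / P) * id; i < (N / P) * (id + 1); i++)
        localsum += a[i];
    return localsum;   // partial results are then combined by one worker
}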
Performance Measure
• Speedup: S_P = Time(1 processor) / Time(P processors)

• Efficiency: E_P = S_P / P

21
Parallel Performance (Parallel Sum)
Parallel efficiency of summing 10^7 doubles

#Processes    Time (sec)    Speedup    Efficiency
1             0.025         1.0        1.00
2             0.013         1.9        0.95
4             0.010         2.5        0.63
8             0.009         2.8        0.35
12            0.007         3.6        0.30

22
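For example, reading the 4-process row of the table: S_4 = Time(1)/Time(4) = 0.025/0.010 = 2.5, and E_4 = S_4/4 = 2.5/4 ≈ 0.63, i.e. only about 63% of the ideal 4x speedup is realised.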
Ideal Speedup
[Plot: speedup vs. number of processors, showing linear, superlinear, and sublinear speedup curves]
23
Scalability Bottleneck

Performance of weather simulation application


24
Programming

25
Parallel Programming Models
Libraries: MPI, TBB, Pthreads, OpenMP, …
New languages: Haskell, X10, Chapel, …
Extensions: Coarray Fortran, UPC, Cilk, OpenCL, …

• Shared memory
– OpenMP, Pthreads, CUDA, …
• Distributed memory
– MPI, UPC, …
• Hybrid
– MPI + OpenMP, MPI + CUDA
26
Sharing Data

27
Parallel Programming Models
Shared memory programming – OpenMP, Pthreads
Distributed memory programming – MPI

Cache
Core
Process/thread
28
Shared Memory Programming
• Shared address space
• Time taken to access certain memory words is longer than others (NUMA: non-uniform memory access)
• Programming paradigms – Pthreads, OpenMP
• Need to worry about concurrent access to shared data (see the sketch after this slide)

Memory
CPU cores

29
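A minimal sketch (not from the slides) of why concurrent access matters: without the atomic directive below, threads could read and modify count at the same time and increments would be lost.

#include <omp.h>
#include <stdio.h>

int main (void)
{
    int count = 0;

    #pragma omp parallel
    {
        // without this directive, count++ is a data race
        #pragma omp atomic
        count++;
    }

    printf ("count = %d (one increment per thread)\n", count);
    return 0;
}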
Threads

From Tim Mattson’s slides 30


OpenMP (Open Multiprocessing)
• Standard for shared memory programming
– Compiler directives
– Runtime routines
– Environment variables
• OpenMP Architecture Review Board
• First released in Nov’97
• Current version 5.1 (Nov’20)

31
OpenMP Example
• Thread-based
• Fork-join model

#pragma omp parallel   // fork: spawn a default number of threads
{

}                      // join

32
OpenMP

$ gcc -fopenmp -o foo foo.c


33
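The source of foo.c is not reproduced in this extract; a minimal sketch of the kind of program compiled above, exercising the fork-join model from the previous slide, might be:

#include <omp.h>
#include <stdio.h>

int main (void)
{
    #pragma omp parallel   // fork: spawn a default number of threads
    {
        printf ("Hello from thread %d of %d\n",
                omp_get_thread_num (), omp_get_num_threads ());
    }                      // join: threads synchronise and only the master continues
    return 0;
}

The environment variable OMP_NUM_THREADS (e.g. OMP_NUM_THREADS=4 ./foo) controls how many threads the parallel region spawns.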
OpenMP

34
OpenMP

35
Output

36
OpenMP – Parallel Sum

Work on distinct data concurrently

37
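The code for this slide is not included in the extract; a minimal sketch of an OpenMP parallel sum, with the array size and placeholder data assumed here, could be:

#include <omp.h>
#include <stdio.h>

#define N 10000000   // 10^7 doubles, as in the performance table earlier

int main (void)
{
    static double a[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++) a[i] = 1.0;   // placeholder data

    // each thread sums a distinct chunk of a[]; the reduction clause gives
    // every thread a private partial sum and combines them at the join
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf ("sum = %f, avg = %f\n", sum, sum / N);
    return 0;
}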
OpenMP Timing

double stime = omp_get_wtime();


#pragma omp parallel
{

}
double etime = omp_get_wtime();

38
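omp_get_wtime returns wall-clock time in seconds as a double, so the elapsed time of the parallel region is just the difference of the two readings; reporting it (printf assumed) looks like:

printf ("parallel region took %f seconds\n", etime - stime);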
Multiple Systems

Cache
Core
Process

39
Distributed Memory Systems
64 – 192 GB RAM/node

• Networked systems
• Distributed memory
  • Local memory
  • Remote memory
• Parallel file system

[Diagram: compute nodes networked into a cluster]
40
MPI (Message Passing Interface)
• Standard for message passing in a distributed
memory environment (most widely used
programming model in supercomputers)
• Efforts began in 1991 by Jack Dongarra, Tony
Hey, and David W. Walker
• MPI Forum formed in 1993
– Version 1.0: 1994
– Version 4.0: 2021

41
Process - Distinct Address Space
[Diagram: four processes, each on its own core with its own memory and local data]

42
Multiple Processes on a Single Node

From N. Karanjkar’s slides


43
Multiple Processes on Multiple Nodes

Node 1

Node 2
44
Communication using Messages
[Diagram: four processes, each with its own core, memory, and local data]

Process 0: for i = 0 to N/P:      sum += a[i]
Process 1: for i = N/P to 2N/P:   sum += a[i]
Process 2: for i = 2N/P to 3N/P:  sum += a[i]
Process 3: for i = 3N/P to N:     sum += a[i]

45
Communication using Messages
[Diagram: four processes, each with its own core, memory, and local data]

Each process executes the same instruction stream (Instruction 1, Instruction 2, …) on its own local data: SIMD

46
Message Passing
[Diagram: timeline of a message sent from Process 0 to Process 1]
47
MPI Programming

48
MPI Programming

mpicc -o program.x program.c


49
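The MPI source shown on these slides is not in the extract; a minimal program.c sketch (assumed) that initialises MPI and reports each process's rank would be:

#include <mpi.h>
#include <stdio.h>

int main (int argc, char **argv)
{
    int rank, size;

    MPI_Init (&argc, &argv);                  // start the MPI runtime
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);    // this process's id within the communicator
    MPI_Comm_size (MPI_COMM_WORLD, &size);    // total number of processes

    printf ("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize ();
    return 0;
}

After compiling with mpicc as above, a typical launch is mpirun -np 4 ./program.x.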
Communication using Messages

[Diagram: as on slide 45, four processes each sum their own chunk of the data]

// local computation at every process
for (i = N/P * rank ; i < N/P * (rank+1) ; i++)
    localsum += a[i]
Collect localsum, add up at one of the ranks
50
Communication using Messages

[Diagram: processes on separate cores with separate memories exchanging messages]

51
Simplest Communication Primitives

• MPI_Send
• MPI_Recv

52
MPI Programming

// SENDER
int MPI_Send (const void *buf, int count, MPI_Datatype datatype,
              int dest, int tag, MPI_Comm comm)

// RECEIVER
int MPI_Recv (void *buf, int count, MPI_Datatype datatype,
              int source, int tag, MPI_Comm comm, MPI_Status *status)
53
MPI Programming
MPI_Comm_rank (MPI_COMM_WORLD, &myrank);

// Sender process
if (myrank == 0) /* code for process 0 */
{
strcpy (message,"Hello, there");
MPI_Send (message, strlen(message)+1, MPI_CHAR, 1, 99,
MPI_COMM_WORLD);
}

// Receiver process
else if (myrank == 1) /* code for process 1 */
{
MPI_Recv (message, 20, MPI_CHAR, 0, 99, MPI_COMM_WORLD,
&status);
printf ("received :%s\n", message);
} 54
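The fragment above assumes a few declarations and MPI setup that are not on the slide; a sketch of the surrounding skeleton would be:

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main (int argc, char **argv)
{
    char message[20];
    int myrank;
    MPI_Status status;

    MPI_Init (&argc, &argv);

    /* ... rank query and send/receive code from the slide above ... */

    MPI_Finalize ();
    return 0;
}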
MPI – Parallel Sum
Assume the data array resides in the memory of process 0 initially
MPI_Comm_rank (MPI_COMM_WORLD, &myrank);
MPI_Comm_size (MPI_COMM_WORLD, &size);

// Sender process
if (myrank == 0) /* code for process 0 */
{
    for (int rank = 1; rank < size; rank++) {
        start = rank * (N / size);   // offset in elements, not bytes
        MPI_Send (data + start, N/size, MPI_INT, rank, 99, MPI_COMM_WORLD);
    }
}
else /* code for processes 1 … size-1 */
{
    MPI_Recv (data, N/size, MPI_INT, 0, 99, MPI_COMM_WORLD, &status);
}
55
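The slide only distributes the chunks; a sketch of the remaining steps ("collect localsum, add up at one of the ranks"), using just MPI_Send and MPI_Recv with an assumed tag of 100 and a long accumulator:

// every rank (including 0, which keeps the first chunk) sums its local chunk
long localsum = 0;
for (int i = 0; i < N/size; i++)
    localsum += data[i];

if (myrank == 0) {
    long total = localsum, partial;
    for (int rank = 1; rank < size; rank++) {
        MPI_Recv (&partial, 1, MPI_LONG, rank, 100, MPI_COMM_WORLD, &status);
        total += partial;
    }
    printf ("average = %f\n", (double) total / N);
} else {
    MPI_Send (&localsum, 1, MPI_LONG, 0, 100, MPI_COMM_WORLD);
}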
MPI Timing
double stime = MPI_Wtime();



double etime = MPI_Wtime();

56
Parallelization

57
Interpolation

58
Interpolation

59
Range/Value Query

Input: File
Output: File

60
Query on a Million Processes

Compute nodes
Unstructured Mesh

Source: COMSOL
62
Unstructured Mesh

Obayashi et al., Multi-objective Design Exploration Using Efficient Global Optimization


63
Thank You

64
