
Introduction to Parallel Computing (CMSC416 / CMSC616)

Course Overview
Abhinav Bhatele, Alan Sussman
About the instructor — Dr. Bhatele
• Ph.D. from the University of Illinois at Urbana-Champaign

• Spent eight years at Lawrence Livermore National Laboratory

• Started at the University of Maryland in 2019

• Research areas:

• High performance computing

• Distributed AI

Abhinav Bhatele, Alan Sussman (CMSC416 / CMSC616) 2


About the instructors — Dr. Sussman

• Professor at UMD for >20 years, research scientist before that


• Research is in high performance parallel and distributed computing

• Currently Associate Chair for Undergraduate Education

• Recently returned from a rotation at the National Science Foundation as a program director in the Office of Advanced Cyberinfrastructure

Abhinav Bhatele, Alan Sussman (CMSC416 / CMSC616) 3


Introductions

• Name

• Junior/Senior/Graduate student

• Something interesting/unique about yourself

• (optional) Why this course?

Abhinav Bhatele, Alan Sussman (CMSC416 / CMSC616) 4


This course is
• An introduction to parallel computing

• 416: Upper Level CS Coursework / General Track / Area 1: Systems

• 616: Qualifying course for MS/PhD: Computer Systems

• Work expected:
• Five to six programming assignments

• Three to four quizzes (no advance notice)

• Midterm exam: in class on October 24 (tentative)

• Final exam: TBA

Abhinav Bhatele, Alan Sussman (CMSC416 / CMSC616) 5


Course topics
• Introduction to parallel computing (1 week)

• Parallel algorithm design (2 weeks)

• Distributed memory parallel programming (3 weeks)

• Performance analysis and tools (2 weeks)

• Shared-memory parallel programming (1 week)

• GPU programming (1 week)

• Parallel architectures and networks (1 week)

• Parallel simulation codes (2 weeks)


Abhinav Bhatele, Alan Sussman (CMSC416 / CMSC616) 6
Tools we will use for the class
• Syllabus, lecture slides, programming assignment descriptions on course website:
• https://www.cs.umd.edu/class/fall2024/cmsc416

• Programming assignment submissions on Gradescope

• Quizzes on ELMS

• Discussions: Piazza
• https://piazza.com/umd/fall2024/cmsc416cmsc616

• If you want to contact the course staff outside of Piazza, send an email to:
[email protected]

Abhinav Bhatele, Alan Sussman (CMSC416 / CMSC616) 7


Zaratan accounts

• Zaratan is the UMD DIT cluster we’ll use for the programming assignments

• You should receive an email when your account is ready for use

• Helpful resources:
• https://hpcc.umd.edu/hpcc/help/usage.html

• https://missing.csail.mit.edu

• https://www.cs.umd.edu/~mmarsh/books/cmdline/cmdline.html

• https://www.cs.umd.edu/~mmarsh/books/tools/tools.html

Abhinav Bhatele, Alan Sussman (CMSC416 / CMSC616) 8


Programming assignments

• You can write and debug most of your assignment locally:


• Use a virtual Linux box

• Docker containers

• macOS: use MacPorts or Homebrew

• On Zaratan:
• You can use VS Code

Abhinav Bhatele, Alan Sussman (CMSC416 / CMSC616) 9


Excused absence
Any student who needs to be excused for an absence from a single lecture, due to a medically
necessitated absence shall make a reasonable attempt to inform the instructor of his/her illness prior to
the class. Upon returning to the class, present the instructor with a self-signed note attesting to the date
of their illness. Each note must contain an acknowledgment by the student that the information provided
is true and correct. Providing false information to University officials is prohibited under Part 9(i) of the
Code of Student Conduct (V-1.00(B) University of Maryland Code of Student Conduct) and may result in
disciplinary action.

Self-documentation may not be used for Major Scheduled Grading Events (midterm and final exams) and
it may only be used for one class meeting during the semester. Any student who needs to be excused for
a prolonged absence (two or more consecutive class meetings), or for a Major Scheduled Grading Event,
must provide written documentation of the illness from the Health Center or from an outside health
care provider. This documentation must verify dates of treatment and indicate the timeframe that the
student was unable to meet academic responsibilities. In addition, it must contain the name and phone
number of the medical service provider to be used if verification is needed. No diagnostic information
will ever be requested.

Abhinav Bhatele, Alan Sussman (CMSC416 / CMSC616) 10


Use of LLMs
You can use LLMs such as ChatGPT as you would use Google for research. However, you cannot
generate your solutions using only ChatGPT. You must demonstrate independent thought and effort.

If you use ChatGPT for anything class related, you must mention that in your answer/report. Please note
that LLMs provide unreliable information, regardless of how convincingly they do so. If you are going to
use an LLM as a research tool in your submission, you must ensure that the information is correct and
addresses the actual question asked.

Abhinav Bhatele, Alan Sussman (CMSC416 / CMSC616) 11


What is parallel computing?

• Serial or sequential computing: doing a task in sequence on a single processor

• Parallel computing: breaking up a task into sub-tasks and doing them in parallel
(concurrently) on a set of processors (often connected by a network)

• Some tasks do not need any communication: embarrassingly parallel

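As a concrete illustration (a minimal sketch, not taken from the course materials), here is an embarrassingly parallel loop in C with OpenMP: every iteration is independent, so the work can be split across threads with no communication at all.

/* Embarrassingly parallel sketch: iterations are independent, so OpenMP can
   divide them among threads without any communication. Compile with an
   OpenMP-capable compiler, e.g. gcc -fopenmp. */
#include <stdio.h>

#define N 1000000

int main(void) {
    static double result[N];

    #pragma omp parallel for            /* each thread computes a disjoint chunk of i */
    for (int i = 0; i < N; i++) {
        result[i] = (double)i * (double)i;   /* no iteration depends on another */
    }

    printf("result[N-1] = %.1f\n", result[N - 1]);
    return 0;
}

Without -fopenmp the pragma is simply ignored and the same code runs serially, which is part of why this style of parallelism is so easy to exploit.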
Abhinav Bhatele, Alan Sussman (CMSC416 / CMSC616) 12


What is parallel computing?
• Does it include:
• Grid computing: processors are dispersed geographically

• Distributed computing: processors connected by a network

• Cloud computing: on-demand availability, typically pay-as-you-go model

• Does it include:
• Superscalar processors

• Vector processors

• Accelerators (GPUs, FPGAs)

Abhinav Bhatele, Alan Sussman (CMSC416 / CMSC616) 13


The need for parallel computing or HPC
HPC stands for High Performance Computing

Drug discovery
https://www.nature.com/articles/nature21414

Weather forecasting
https://www.ncl.ucar.edu/Applications/wrf.shtml

Study of the universe
https://www.nas.nasa.gov/SC14/demos/demo27.html

Abhinav Bhatele, Alan Sussman (CMSC416 / CMSC616) 14


Why do we need parallelism?

• Make some science simulations feasible in the lifetime of humans

• Typical constraints are speed or memory requirements


• Either it would take too long to do the simulations

• Or the simulation data would not fit in the memory of a single processor

• Made possible by using more than one core/processor

• Provide answers in real time or near real time

Abhinav Bhatele, Alan Sussman (CMSC416 / CMSC616) 15


Large supercomputers
• Top500 list: https://top500.org/lists/top500/2024/06/

https://www.olcf.ornl.gov/frontier

Abhinav Bhatele, Alan Sussman (CMSC416 / CMSC616) 16


Parallel architecture
• A set of nodes or processing elements connected by a network.

https://computing.llnl.gov/tutorials/parallel_comp

Abhinav Bhatele, Alan Sussman (CMSC416 / CMSC616) 17


Interconnection networks
• Different topologies for connecting nodes together

• Used in the past: torus, hypercube

• More popular currently: fat-tree, dragonfly

[Figures: torus, fat-tree, and two-level dragonfly topologies. The dragonfly example is a Cray Cascade (XC30) installation with four groups and 96 Aries routers per group: within a group, routers are connected by column all-to-all (black) links and row all-to-all (green) links, so a message is routed in at most two hops; between groups, inter-group (blue) links are used, giving a shortest path of at most five hops. In a fat-tree, the top-level switches have only downward connections, so n leaf-level switches need only n/2 top-level switches; traffic is usually forwarded with a static routing algorithm, so all messages between a given pair of nodes take the same (shortest) path every time.]

Abhinav Bhatele, Alan Sussman (CMSC416 / CMSC616) 18
I/O sub-system / Parallel file system

• Home directories and scratch space on clusters are typically on a parallel file system

• Compute nodes do not have local disks

• Parallel filesystem is mounted on all login and compute nodes

http://wiki.lustre.org/Introduction_to_Lustre

Abhinav Bhatele, Alan Sussman (CMSC416 / CMSC616) 19
System software: models and runtimes
• Parallel programming model
• Parallelism is achieved through language constructs or by making calls to a library, and the execution model depends on the model used

• Parallel runtime [system]:
• Implements the parallel execution model

• Shared memory/address-space
• Pthreads, OpenMP, Chapel

• Distributed memory
• MPI, Charm++ (see the minimal MPI sketch after this slide)

[Figure: software stack, with user code on top of a parallel runtime, which sits on a communication library and the operating system]

Abhinav Bhatele, Alan Sussman (CMSC416 / CMSC616) 20

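As a minimal illustration of the distributed-memory model listed above (a sketch, not the course's assignment code), here is an MPI "hello world" in C; every process runs the same executable and reports its rank.

/* Minimal MPI sketch: each process prints its rank and the total process count. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}

This is typically compiled with mpicc and launched with something like mpirun -np 4 ./hello; the exact launcher and flags vary by system.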

Introduction to Parallel Computing (CMSC416 / CMSC616)

Terminology and Definitions


Abhinav Bhatele, Department of Computer Science
Announcements

• Quiz 0 has been posted on ELMS

• Zaratan accounts have been created for everyone

Abhinav Bhatele, Alan Sussman (CMSC416 / CMSC616) 22


Getting started with Zaratan

• Over 360 nodes with AMD Milan processors (128 cores/node, 512 GB memory/node)

• 20 nodes with four NVIDIA A100 GPUs (40 GB per GPU)

ssh [email protected]

Abhinav Bhatele, Alan Sussman (CMSC416 / CMSC616) 23


Cores, sockets, nodes
• Core: a single execution unit that has a private L1 cache and can execute instructions independently

• Processor: several cores on a single Integrated Circuit (IC) or chip are called a multi-core processor

• Socket: physical connector into which an IC/chip or processor is inserted

• Node: a packaging of sockets — a motherboard or printed circuit board (PCB) that has multiple sockets

https://hpc-wiki.info/hpc/HPC-Dictionary

Abhinav Bhatele, Alan Sussman (CMSC416 / CMSC616) 24


Rackmount servers

Abhinav Bhatele, Alan Sussman (CMSC416 / CMSC616) 25


Rackmount server motherboard

https://www.anandtech.com/show/15924/chenbro-announces-rb13804-dual-socket-1u-xeon-4-bay-hpc-barebones-server
https://www.anandtech.com/show/7003/the-haswell-review-intel-core-i74770k-i54560k-tested

Abhinav Bhatele, Alan Sussman (CMSC416 / CMSC616) 26


Job scheduling
• HPC systems use job or batch scheduling

• Each user submits their parallel programs for execution to a “job” scheduler (see the sample batch script below)

• The scheduler decides:
• what job to schedule next (based on an algorithm: FCFS, priority-based, …)

• what resources (compute nodes) to allocate to the ready job

• Compute nodes: dedicated to each job

• Network, filesystem: shared by all jobs

Job Queue
Job    #Nodes Requested    Time Requested
1      128                 30 mins
2      64                  24 hours
3      56                  6 hours
4      192                 12 hours
5      …                   …
6      …                   …

Abhinav Bhatele, Alan Sussman (CMSC416 / CMSC616) 27

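To make this concrete, below is a minimal sketch of a batch script for a Slurm-style scheduler; the job name, module, and program names are placeholders, and the exact directives to use on Zaratan are described in the cluster documentation.

#!/bin/bash
#SBATCH --job-name=my-job            # name shown in the job queue
#SBATCH --nodes=2                    # number of compute nodes requested
#SBATCH --ntasks-per-node=128        # MPI processes per node
#SBATCH --time=00:30:00              # requested wall-clock time (30 mins)

module load openmpi                  # placeholder module name
srun ./my_parallel_program           # launch the program on the allocated nodes

The script is submitted with sbatch, and squeue shows where the job sits in the queue until the scheduler allocates nodes to it.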

Compute nodes vs. login nodes

• Compute nodes: dedicated nodes for running jobs


• Can only be accessed when they have been allocated to a user by the job scheduler

• Login nodes: nodes shared by all users to compile their programs, submit jobs etc.

• Service/management nodes: I/O nodes, etc.

Abhinav Bhatele, Alan Sussman (CMSC416 / CMSC616) 28


Supercomputers vs. commodity clusters

• Supercomputer refers to a large, expensive installation, typically using custom hardware
• High-speed interconnect

• IBM Blue Gene, Cray XT, Cray XC

• Cluster refers to a cluster of nodes, typically put together using commodity (off-the-shelf) hardware

Abhinav Bhatele, Alan Sussman (CMSC416 / CMSC616) 29


Serial vs. parallel code

• Thread: a path of execution managed by the operating system (OS)


• Threads share the same memory address space

• Process: heavy-weight; processes do not share resources such as memory, file descriptors, etc.

• Serial or sequential code: can only run on a single thread or process

• Parallel code: can be run on one or more threads or processes

Abhinav Bhatele, Alan Sussman (CMSC416 / CMSC616) 30

Scaling and scalable

• Scaling: the action of running a parallel program on 1 to n processes
• 1, 2, 3, …, n

• 1, 2, 4, 8, …, n

• Scalable: A program is scalable if its performance improves when using more resources

[Figure: execution time (minutes) on a log scale versus number of cores (1 to 16K), with an “Actual” curve and an “Extrapolation” curve]

Abhinav Bhatele, Alan Sussman (CMSC416 / CMSC616) 31


Weak versus strong scaling

• Strong scaling: Fixed total problem size as we run on more resources (processes or
threads)
• Sorting n numbers on 1 process, 2 processes, 4 processes, …

• Weak scaling: Fixed problem size per process but increasing total problem size as we
run on more resources
• Sorting n numbers on 1 process

• 2n numbers on 2 processes

• 4n numbers on 4 processes

Abhinav Bhatele, Alan Sussman (CMSC416 / CMSC616) 32


Speedup and efficiency

• (Parallel) Speedup: Ratio of execution time on one process to that on p processes

Speedup = t1 / tp

• (Parallel) efficiency: Speedup per process

Efficiency = t1 / (tp × p)

Abhinav Bhatele, Alan Sussman (CMSC416 / CMSC616) 33
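For example (numbers are illustrative, not from the slides): if a program takes t1 = 100 seconds on one process and tp = 25 seconds on p = 8 processes, the speedup is 100 / 25 = 4 and the efficiency is 4 / 8 = 0.5, i.e., each process is only half as productive as in the serial run.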
Amdahl’s law

• Speedup is limited by the serial portion of the code

• Often referred to as the serial “bottleneck” — the portion that cannot be parallelized

• Let’s say only a fraction f of the program (in terms of execution time) can be parallelized on p processes

Speedup = 1 / ((1 − f) + f/p)

Abhinav Bhatele, Alan Sussman (CMSC416 / CMSC616) 34
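Note that as p grows, f/p goes to zero, so the speedup approaches 1 / (1 − f) no matter how many processes are used; for instance, if only f = 0.6 of the program can be parallelized, the speedup can never exceed 1 / 0.4 = 2.5 (illustrative arithmetic, not from the slides).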


Amdahl’s law

Speedup = 1 / ((1 − f) + f/p)

Total time on 1 process = 100s
Serial portion = 40s
Portion that can be parallelized = 60s

f = 60 / 100 = 0.6

Speedup = 1 / ((1 − 0.6) + 0.6/p)

Code excerpt shown on the slide (from an MPI program that approximates pi); the loop is the portion that is parallelized across processes:

fprintf(stdout, "Process %d of %d is on %s\n",
        myid, numprocs, processor_name);
fflush(stdout);
n = 10000;                      /* default # of rectangles */
if (myid == 0)
    startwtime = MPI_Wtime();

MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
h = 1.0 / (double) n;
sum = 0.0;
/* A slightly better approach starts from large i and works back */
for (i = myid + 1; i <= n; i += numprocs)
{
    x = h * ((double)i - 0.5);
    sum += f(x);
}
mypi = h * sum;

MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

Abhinav Bhatele, Alan Sussman (CMSC416 / CMSC616) 35


Communication and synchronization

• Each process may execute serial code independently for a while

• When data is needed from other (remote) processes, messaging is required


• Referred to as communication or synchronization (or MPI messages); see the minimal MPI example below

• Intra-node communication: among cores within a node

• Inter-node communication: among cores on different nodes connected by a network

• Bulk synchronous programs: All processes compute simultaneously, then synchronize


(communicate) together

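For instance, here is a minimal sketch (not from the slides) of point-to-point communication in MPI, where rank 0 sends one integer to rank 1; it must be run with at least two processes.

/* Point-to-point communication sketch: rank 0 sends a value, rank 1 receives it. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = 0;
    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* send to rank 1, tag 0 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                           /* receive from rank 0 */
        printf("Rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}

Whether this counts as intra-node or inter-node communication depends on where the scheduler places the two ranks.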
Abhinav Bhatele, Alan Sussman (CMSC416 / CMSC616) 36


Different models of parallel computation
Flynn’s Taxonomy

• SISD: Single Instruction Single Data

• SIMD: Single Instruction Multiple Data (example: vector / array processors)

https://en.wikipedia.org/wiki/Flynn's_taxonomy

Abhinav Bhatele, Alan Sussman (CMSC416 / CMSC616) 37


Different models of parallel computation
• MIMD: Multiple Instruction Multiple Data

• Two other variations:
• SIMT: Single Instruction Multiple Threads
• Threads execute in lock-step

• Example: GPUs

• SPMD: Single Program Multiple Data
• All processes execute the same program but act on different data (see the sketch below)

• Enables MIMD parallelization

Abhinav Bhatele, Alan Sussman (CMSC416 / CMSC616) 38
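As a small illustration of SPMD (a sketch, not from the slides): all processes run the same program, but each one branches on its rank to work on different data or take a different role.

/* SPMD sketch: one program, many processes; behavior is selected by rank. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        printf("Rank 0: acting as coordinator for %d processes\n", size);   /* coordinator role */
    } else {
        printf("Rank %d: working on my portion of the data\n", rank);       /* worker role */
    }

    MPI_Finalize();
    return 0;
}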
