01 cmsc416 Intro
Course Overview
Abhinav Bhatele, Alan Sussman
About the instructor — Dr. Bhatele
• Ph.D. from the University of Illinois at Urbana-Champaign
• Research areas:
• Distributed AI
• Name
• Junior/Senior/Graduate student
• Work expected:
• Five to six programming assignments
• Quizzes on ELMS
• Discussions: Piazza
• https://piazza.com/umd/fall2024/cmsc416cmsc616
• If you want to contact the course staff outside of Piazza, send an email to: [email protected]
• Zaratan is the UMD DIT cluster we’ll use for the programming assignments
• You should receive an email when your account is ready for use
• Helpful resources:
• https://hpcc.umd.edu/hpcc/help/usage.html
• https://missing.csail.mit.edu
• https://www.cs.umd.edu/~mmarsh/books/cmdline/cmdline.html
• https://www.cs.umd.edu/~mmarsh/books/tools/tools.html
• Docker containers
• On Zaratan:
• You can use VSCode
Self-documentation may not be used for Major Scheduled Grading Events (midterm and final exams) and
it may only be used for one class meeting during the semester. Any student who needs to be excused for
a prolonged absence (two or more consecutive class meetings), or for a Major Scheduled Grading Event,
must provide written documentation of the illness from the Health Center or from an outside health
care provider. This documentation must verify dates of treatment and indicate the timeframe that the
student was unable to meet academic responsibilities. In addition, it must contain the name and phone
number of the medical service provider to be used if verification is needed. No diagnostic information
will ever be requested.
If you use ChatGPT for anything class related, you must mention that in your answer/report. Please note
that LLMs provide unreliable information, regardless of how convincingly they do so. If you are going to
use an LLM as a research tool in your submission, you must ensure that the information is correct and
addresses the actual question asked.
• Parallel computing: breaking up a task into sub-tasks and doing them in parallel
(concurrently) on a set of processors (often connected by a network)
• Does it include:
• Superscalar processors
• Vector processors
Drug discovery
https://www.nature.com/articles/nature21414
https://www.ncl.ucar.edu/Applications/wrf.shtml
https://www.nas.nasa.gov/SC14/demos/demo27.html
https://www.olcf.ornl.gov/frontier
https://computing.llnl.gov/tutorials/parallel_comp
http://wiki.lustre.org/Introduction_to_Lustre
• Shared memory/address-space
• Pthreads, OpenMP, Chapel
• Communication library (e.g., MPI)
• Over 360 nodes with AMD Milan processors (128 cores/node, 512 GB memory/node)
https://www.anandtech.com/show/15924/chenbro-announces-rb13804-dual-socket-1u-xeon-4-bay-hpc-barebones-server
https://www.anandtech.com/show/7003/the-haswell-review-intel-core-i74770k-i54560k-tested
• Each user submits their parallel programs for execution to a “job” scheduler
Job Queue

Job   #Nodes Requested   Time Requested
1     128                30 mins
2     64                 24 hours
3     56                 6 hours
4     192                12 hours
5     …                  …
6     …                  …
• Login nodes: nodes shared by all users to compile their programs, submit jobs etc.
• Cluster: a set of nodes, typically put together using commodity (off-the-shelf) hardware
Scaling and scalable
• Running a program on increasing numbers of processes: 1, 2, 4, 8, …, n (or 1, 2, 3, …, n)
[Figure: log-log plot of performance vs. number of processes, x-axis from 1 to 16K]
• Scalable: A program is scalable if its performance improves when using more resources
• Strong scaling: Fixed total problem size as we run on more resources (processes or
threads)
• Sorting n numbers on 1 process, 2 processes, 4 processes, …
• Weak scaling: Fixed problem size per process but increasing total problem size as we
run on more resources
• Sorting n numbers on 1 process
• 2n numbers on 2 processes
• 4n numbers on 4 processes
Speedup = t1 / tp

Efficiency = t1 / (tp × p)
• Let's say only a fraction f of the program (in terms of execution time) can be parallelized on p processes

Speedup = 1 / ((1 − f) + f/p)
Example: total time on 1 process = 100s, of which the serial portion = 40s and the portion that can be parallelized = 60s. Then

f = 60/100 = 0.6

Speedup = 1 / ((1 − 0.6) + 0.6/p)

Fragment of the MPI pi-calculation example:

    fprintf(stdout, "Process %d of %d is on %s\n",
            myid, numprocs, processor_name);
    fflush(stdout);
    n = 10000;                     /* default # of rectangles */
    if (myid == 0)
        startwtime = MPI_Wtime();
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
    h = 1.0 / (double) n;
    sum = 0.0;
    /* A slightly better approach starts from large i and works back */
    for (i = myid + 1; i <= n; i += numprocs)
    {
        x = h * ((double)i - 0.5);
        sum += f(x);
    }
    mypi = h * sum;
Flynn’s Taxonomy
https://en.wikipedia.org/wiki/Flynn's_taxonomy
• Example: Vector / array processors
• Example: GPUs