HPC Overview
High Performance
Computing
Timothy H. Kaiser, PH.D.
[email protected]
http://geco.mines.edu/workshop
1
This tutorial will cover all three time slots.
In the first session we will discuss the importance of parallel computing to
high performance computing. We will show, by example, the basic concepts
of parallel computing. The advantages and disadvantages of parallel
computing will be discussed. We will present an overview of current and
future trends in HPC hardware.
The second session will provide an introduction to MPI, the most common
package used to write parallel programs for HPC platforms. As tradition
dictates, we will show how to write "Hello World" in MPI. Attendees will be
shown how to build and run relatively simple examples on a consortium
resource, and will be given the opportunity to do so themselves.
The third session will briefly discuss other important HPC topics. This will
include a discussion of OpenMP and of hybrid programming, which combines MPI
and OpenMP. Some computational libraries available for HPC will be highlighted.
We will briefly mention parallel computing using graphic processing units
(GPUs).
2
Today’s Overview
• Scripting
• OpenMP
3
Introduction
• Why go parallel?
• Some terminology
4
What is Parallelism?
6
Why do parallel computing?
• Available memory
• Performance
7
Why do parallel computing?
• We can run…
• Larger problems
• Faster
• More cases
8
Weather Forecasting
9
Modeling Motion of Astronomical Bodies
(brute force)
• Each body is attracted to every other body by gravitational forces.
• The movement of each body can be predicted by calculating the total force experienced by the
body (a brute-force sketch follows below).
10
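A minimal Python sketch (not from the slides) of this brute-force idea; the masses, positions, velocities, and time step below are made-up example values:

from math import sqrt

G = 6.674e-11   # gravitational constant

def nbody_step(masses, pos, vel, dt):
    # brute force: every body feels the pull of every other body (O(N*N) work)
    n = len(masses)
    acc = [[0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            r = sqrt(dx * dx + dy * dy)
            a = G * masses[j] / (r * r)       # acceleration of body i due to body j
            acc[i][0] += a * dx / r
            acc[i][1] += a * dy / r
    for i in range(n):                         # update velocities and positions
        vel[i][0] += acc[i][0] * dt
        vel[i][1] += acc[i][1] * dt
        pos[i][0] += vel[i][0] * dt
        pos[i][1] += vel[i][1] * dt

# two example bodies (made-up numbers)
masses = [1.0e24, 1.0e22]
pos = [[0.0, 0.0], [1.0e7, 0.0]]
vel = [[0.0, 0.0], [0.0, 1.0e3]]
nbody_step(masses, pos, vel, 10.0)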
Types of parallelism: two extremes
• Data parallel
• Task parallel
Example of data parallelism: a starting partial differential equation is replaced by a
finite difference approximation, and the x-y grid is divided among processing elements
PE #0 through PE #7 (see the sketch below).
12
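A minimal serial Python sketch (not from the slides) of the data-parallel idea: the same finite-difference update is applied at every grid point, and the grid rows are divided into blocks, one block per processing element. The grid size and PE count are made-up values; in a real data-parallel code each PE would run concurrently on its own block.

nx, ny, npes = 8, 8, 4                      # made-up grid size and PE count
grid = [[float(i + j) for j in range(ny)] for i in range(nx)]
new  = [[0.0] * ny for _ in range(nx)]

rows_per_pe = nx // npes
for pe in range(npes):                      # each PE owns a block of rows
    start = pe * rows_per_pe
    end = start + rows_per_pe
    for i in range(max(start, 1), min(end, nx - 1)):
        for j in range(1, ny - 1):
            # five-point finite-difference stencil: average of the four neighbors
            new[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                                grid[i][j-1] + grid[i][j+1])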
Typical Task Parallel Application
• Signal processing
• This is a pipeline: each stage performs a different task on the data as it streams through (see the sketch below)
13
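A small Python sketch (not from the slides) of the pipeline idea using generators: each stage performs a different task, and in a parallel signal-processing code each stage could run on a different processor. The stage names and the fake signal are made up for illustration.

from math import sin

def source(n):                  # stage 1: produce raw samples
    for i in range(n):
        yield sin(0.1 * i)

def smooth(samples):            # stage 2: filter the samples
    prev = 0.0
    for s in samples:
        yield 0.5 * (s + prev)
        prev = s

def sink(samples):              # stage 3: consume the filtered stream
    return sum(samples)

result = sink(smooth(source(100)))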
Parallel Program Structure
Communicate &
Repeat
14
Parallel Problems
Communicate &
Repeat
15
A “Real” example
#!/usr/bin/env python
from sys import argv
from os.path import isfile
from time import sleep
from math import sin,cos
#
fname="message"
my_id=int(argv[1])
print my_id, "starting program"
#
if (my_id == 1):
    # process 1 computes a value and writes it to the message file
    sleep(2)
    myval=cos(10.0)
    mf=open(fname,"w")
    mf.write(str(myval))
    mf.close()
else:
    # the other process computes its own value, waits for the message
    # file to appear, reads it, and combines the two values
    myval=sin(10.0)
    notready=True
    while notready :
        if isfile(fname) :
            notready=False
            sleep(3)
            mf=open(fname,"r")
            message=float(mf.readline())
            mf.close()
            total=myval**2+message**2
        else:
            sleep(5)
print my_id, "done with program"
16
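To run this two-process example by hand (assuming the script is saved as, say, coms.py; the slide does not give a file name), one could start "python coms.py 1" in one terminal and "python coms.py 0" in another. Process 1 writes its value to the file "message"; the other process polls for that file, reads the value, and combines it with its own. Note that the example is written for Python 2 (print statements).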
Theoretical upper limits
• All parallel programs contain:
• Parallel sections
• Serial sections
t_p = (f_p/N + f_s) t_s
• Effect of multiple processors on speedup
S = t_s/t_p = 1/(f_p/N + f_s)
• Where
• N = number of processors
• f_p = parallel fraction of the code, f_s = serial fraction (f_p + f_s = 1)
18
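A quick Python sketch (not from the slides) that evaluates the speedup formula above; the parallel fraction and processor counts are example values:

fp = 0.99                    # parallel fraction (example value)
fs = 1.0 - fp                # serial fraction
for N in (1, 2, 4, 16, 64, 256):
    S = 1.0 / (fp / N + fs)
    print("N = %4d   speedup = %6.2f" % (N, S))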
Illustration of Amdahl's Law
It takes only a small fraction of serial content in a code to degrade the parallel performance.
19
Amdahl’s Law Vs. Reality
• Amdahl's Law provides a theoretical upper limit on parallel speedup, assuming that there are no costs for communications.
• In reality, communications will result in a further degradation of performance.
[Plot: speedup vs. number of processors (0 to 250) for fp = 0.99, comparing Amdahl's Law with reality]
20
Sometimes you don’t get what you
expect!
21
Some other considerations
• Writing effective parallel applications is difficult
22
Parallelism Carries a Price Tag
• Parallel programming
• Is effort-intensive
• May not yield the results you want, even if you invest a lot of
time
• Superlinear Speedup
• Efficiency
• Cost
• Scalability
• Problem Size
• Gustafson’s Law
24
Superlinear Speedup
25
Efficiency
Efficiency = execution time using one processor / (execution time using N processors × N)
E = t_s / (t_p × N) = S(N) / N
26
Cost
The processor-time product or cost (or work) of a computation is defined as
Cost = (execution time) × (total number of processors used)
The cost of a sequential computation is simply its execution time, t_s. The cost of a
parallel computation is t_p × N. The parallel execution time, t_p, is given by t_s/S(N).
27
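A short Python sketch (not from the slides) that applies the speedup, efficiency, and cost definitions above; the timings and processor count are made-up example values:

ts = 100.0                          # serial execution time (seconds, example value)
tp = 7.0                            # parallel execution time on N processors
N = 16
speedup = ts / tp                   # S(N) = ts / tp
efficiency = ts / (tp * N)          # equivalently speedup / N
cost = tp * N                       # processor-time product
print("speedup = %.2f   efficiency = %.2f   cost = %.1f" % (speedup, efficiency, cost))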
Scalability
28
Problem size
Problem size: the number of basic steps in the best sequential
algorithm for a given problem and data set size
• Weak Scaling (a sketch follows below)
30
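A brief Python sketch (not from the slides) of weak scaling in the Gustafson's Law sense: the problem size grows with the machine, so the scaled speedup is S(N) = N - fs*(N - 1). The serial fraction fs and the processor counts are example values.

fs = 0.01                         # serial fraction of the scaled run (example value)
for N in (1, 4, 16, 64, 256):
    S = N - fs * (N - 1)
    print("N = %4d   scaled speedup = %7.2f" % (N, S))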
Some Classes of machines
Distributed Memory
Processors only have access to their local memory and "talk" to other processors over a network.
31
Some Classes of machines
Uniform Shared Memory (UMA)
All processors have equal access to memory and can "talk" via memory.
[Figure: processors connected to common shared memory]
32
Some Classes of machines
Hybrid
Shared memory nodes
connected by a network
33
Some Classes of machines
More common today
Each node has a collection
of multicore chips
Hybrid Machines
35
Network Topology
36
Network Terminology
• Latency
• Bandwidth
37
Ring
38
Grid
Wrapping
produces torus
39
Tree
Fat tree: the lines get wider as you go up
40
Hypercube
[Figure: eight nodes labeled 000 through 111 connected as a cube]
3 dimensional hypercube
41
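In a d-dimensional hypercube each of the 2^d nodes is labeled with d bits and is connected to the d nodes whose labels differ from its own in exactly one bit. A short Python sketch (not from the slides) that lists the neighbors for the 3-dimensional case shown above:

d = 3
for node in range(2 ** d):
    # flip each bit of the label to find the d neighbors
    neighbors = [node ^ (1 << bit) for bit in range(d)]
    print("%s -> %s" % (format(node, "0%db" % d),
                        [format(n, "0%db" % d) for n in neighbors]))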
4D Hypercube
[Figure: part of a 4-dimensional hypercube with nodes labeled 0000, 0001, 0010, 0011]
4X DDR=> 16Gbit/sec
44
Measured Bandwidth
45