
Overview of

High Performance
Computing
Timothy H. Kaiser, PH.D.
[email protected]

http://geco.mines.edu/workshop

1
This tutorial will cover all three time slots.
 
In the first session we will discuss the importance of parallel computing to
high performance computing. We will show, by example, the basic concepts
of parallel computing. The advantages and disadvantages of parallel
computing will be discussed. We will present an overview of current and
future trends in HPC hardware.
 
The second session will provide an introduction to MPI, the most common
package used to write parallel programs for HPC platforms. As tradition
dictates, we will show how to write "Hello World" in MPI. Attendees will be
shown how to build and run relatively simple examples on a consortium
resource.
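
A minimal "Hello World" sketch, for reference. This uses the mpi4py Python
bindings, which are an assumption here; the workshop examples themselves may
well be in C or Fortran.

#!/usr/bin/env python
# Minimal MPI "Hello World" sketch using mpi4py (assumed available).
# Run with something like:  mpiexec -n 4 python hello_mpi.py
from mpi4py import MPI

comm = MPI.COMM_WORLD        # communicator containing every started process
rank = comm.Get_rank()       # this process's id, 0 .. size-1
size = comm.Get_size()       # total number of processes
print("Hello World from rank", rank, "of", size)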
 
The third session will briefly discuss other important HPC topics. This will
include a discussion of OpenMP and hybrid programming, which combines MPI and
OpenMP. Some computational libraries available for HPC will be highlighted.
We will briefly mention parallel computing using graphics processing units
(GPUs).
 

2
Today’s Overview

• HPC in a nutshell?

• Basic MPI - run an example

• A few additional MPI features

• A “Real” MPI example

• Scripting

• OpenMP

• Libraries and other stuff

3
Introduction

• What is parallel computing?

• Why go parallel?

• When do you go parallel?

• What are some limits of parallel computing?

• Types of parallel computers

• Some terminology

4
What is Parallelism?

• Consider your favorite computational application

• One processor can give me results in N hours

• Why not use N processors -- and get the results in just one hour?

The concept is simple:


Parallelism = applying multiple processors to a single problem
5
Parallel computing is
computing by committee

• Parallel computing: the use of multiple computers or processors working together on a common task.

• Each processor works on its section of the problem

• Processors are allowed to exchange information with other processors

[Figure: a grid of a problem to be solved, split into four regions; Process 0, Process 1, Process 2, and Process 3 each does work for its own region]

6
Why do parallel computing?

• Limits of single CPU computing

• Available memory

• Performance

• Parallel computing allows us to:

• Solve problems that don’t fit on a single CPU

• Solve problems that can’t be solved in a reasonable time

7
Why do parallel computing?

• We can run…

• Larger problems

• Faster

• More cases

• Run simulations at finer resolutions

• Model physical phenomena more realistically

8
Weather Forecasting

•Atmosphere is modeled by dividing it into three-dimensional regions or cells
•1 mile x 1 mile x 1 mile (10 cells high)
•about 500 x 10^6 cells
•The calculations of each cell are repeated many times to model the passage of time
•About 200 floating point operations per cell per time step, or 10^11 floating point operations per time step
•10 day forecast with 10 minute resolution => 1.5x10^14 flop
•100 Mflops would take about 17 days
•1.7 Tflops would take 2 minutes
•17 Tflops would take 12 seconds
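
As a sanity check, these numbers can be reproduced with a few lines of Python (the cell count, operations per cell, and flop rates are taken from the slide; the rest is plain arithmetic):

# Rough check of the weather-forecast arithmetic on this slide.
cells = 500e6                # ~500 x 10^6 cells
flop_per_cell = 200          # floating point operations per cell per time step
flop_per_step = cells * flop_per_cell        # ~1e11 flop per time step
steps = 10 * 24 * 60 / 10    # 10-day forecast, 10-minute time steps = 1440 steps
total_flop = flop_per_step * steps           # ~1.5e14 flop
for rate, label in [(100e6, "100 Mflops"), (1.7e12, "1.7 Tflops"), (17e12, "17 Tflops")]:
    print(label, "->", total_flop / rate, "seconds")
# prints on the order of 17 days, a couple of minutes, and about ten seconds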

9
Modeling Motion of Astronomical
bodies
(brute force)
• Each body is attracted to each other body by gravitational forces.

• Movement of each body can be predicted by calculating the total force experienced by the body.

• For N bodies, N - 1 forces per body yields N^2 calculations each time step

• A galaxy has ~10^11 stars => 10^9 years for one iteration

• Using an efficient N log N approximate algorithm => about a year

• NOTE: This is closely related to another hot topic: Protein Folding
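
The gap between the two estimates follows from the operation counts (the 10^11 star count and the 10^9-year figure are from the slide; the rest is arithmetic):

# Brute-force N^2 versus an N log N approximation for the galaxy example.
import math
N = 1e11                              # stars in the galaxy (from the slide)
brute_force_ops = N * N               # ~1e22 pairwise interactions per step
tree_code_ops = N * math.log2(N)      # ~3.7e12 operations per step
ratio = brute_force_ops / tree_code_ops
print("speedup factor:", ratio)       # ~2.7e9
print("years per step:", 1e9 / ratio) # 10^9 years shrinks to a fraction of a year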

10
Types of parallelism: two extremes
• Data parallel

• Each processor performs the same task on different data

• Example - grid problems

• Bag of Tasks or Embarrassingly Parallel is a special case

• Task parallel

• Each processor performs a different task

• Example - signal processing such as encoding multitrack data

• Pipeline is a special case


11
Simple data parallel program
• Example: integrate 2-D propagation problem

[Figure: the starting partial differential equation and its finite difference approximation (equations not reproduced in this extraction), with the x-y grid split among PE #0 through PE #7 in a 2 x 4 block decomposition]

12
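
A minimal sketch of this kind of decomposition, assuming a NumPy array and the 2 x 4 block layout in the figure (the grid size is illustrative, not from the slide):

# Sketch: split a 2-D grid into blocks, one per PE, for a data-parallel update.
import numpy as np
grid = np.zeros((400, 800))    # the full 2-D domain (illustrative size)
rows, cols = 2, 4              # 2 x 4 decomposition -> 8 PEs, as in the figure
blocks = [np.hsplit(r, cols) for r in np.vsplit(grid, rows)]
# blocks[i][j] is the sub-domain PE number i*cols + j would update; in a real
# code each PE also exchanges "ghost" boundary rows/columns with its neighbors
# every time step before applying the finite-difference stencil.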
Typical Task Parallel Application

[Pipeline diagram: DATA -> FFT Task -> Multiply Task -> Inverse FFT Task -> Normalize Task]

• Signal processing

• Use one processor for each task

• Can use more processors if one is overloaded

• This is a pipeline

13
Parallel Program Structure

[Diagram: the program begins serially, starts a parallel section in which each process runs its own chain of work (work 1a-1d, work 2a-2d, ..., work (N)a-(N)d), the processes communicate and repeat, and then the parallel section ends and the program ends]

14
Parallel Problems
[Diagram: the same structure as the previous slide, annotated with two common problems: subtasks don’t finish together, and serial sections (no parallel work) between the parallel sections mean not all processors are being used]

15
A “Real” example
#!/usr/bin/env python
# Two copies of this script, started with different ids, exchange a value
# through a file named "message" -- file-based message passing.
from sys import argv
from os.path import isfile
from time import sleep
from math import sin,cos
#
fname="message"
my_id=int(argv[1])
print my_id, "starting program"
#
if (my_id == 1):
    # process 1 computes a value and "sends" it by writing the file
    sleep(2)
    myval=cos(10.0)
    mf=open(fname,"w")
    mf.write(str(myval))
    mf.close()
else:
    # the other process computes its own value, then polls until the
    # file appears and "receives" the message by reading it
    myval=sin(10.0)
    notready=True
    while notready :
        if isfile(fname) :
            notready=False
            sleep(3)
            mf=open(fname,"r")
            message=float(mf.readline())
            mf.close()
            total=myval**2+message**2
        else:
            sleep(5)

print my_id, "done with program"
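
To try it, the script can be started twice by hand, for example "python example.py 2" in one terminal and then "python example.py 1" in another (the file name example.py is only illustrative). The process given id 1 writes the message file and every other process polls until the file appears, which mimics a send on one side and a blocking receive on the other.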

16
Theoretical upper limits
• All parallel programs contain:

• Parallel sections

• Serial sections

• Serial sections are when work is being duplicated or no useful work is being done (waiting for others)

• Serial sections limit the parallel effectiveness

• If you have a lot of serial computation then you will not get good speedup

• No serial work “allows” perfect speedup

• Amdahl’s Law states this formally


17
Amdahl’s Law
• Amdahl’s Law places a strict limit on the speedup that can be realized by using multiple processors.

• Effect of multiple processors on run time:

  t_p = (f_p/N + f_s) t_s

• Effect of multiple processors on speedup:

  S = t_s/t_p = 1/(f_p/N + f_s)

• Where

• f_s = serial fraction of code

• f_p = parallel fraction of code

• N = number of processors

• Perfect speedup: t = t_1/N, or S(N) = N
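
As an illustration, a few lines of Python evaluating the formula above (the 10% serial fraction is just an example value):

# Amdahl's Law: speedup for a fixed serial fraction as processors are added.
def amdahl_speedup(n, fs):
    fp = 1.0 - fs                 # parallel fraction of the code
    return 1.0 / (fp / n + fs)

fs = 0.10                         # example: 10% of the work is serial
for n in (1, 2, 4, 8, 16, 64, 1024):
    s = amdahl_speedup(n, fs)
    print(n, "processors: speedup", round(s, 2), "efficiency", round(s / n, 2))
# even with 1024 processors the speedup is capped near 1/fs = 10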

18
Illustration of Amdahl's Law
It takes only a small
fraction of serial content
in a code to
degrade the parallel
performance.

19
Amdahl’s Law Vs. Reality
•Amdahl’s Law provides a theoretical upper limit on
parallel speedup assuming that there are no costs for
communications.
•In reality, communications will result in a further
degradation of performance
[Plot: speedup versus number of processors (0 to 250) for f_p = 0.99, comparing Amdahl's Law with reality]

20
Sometimes you don’t get what you
expect!

21
Some other considerations
• Writing effective parallel application is difficult

• Communication can limit parallel efficiency

• Serial time can dominate

• Load balance is important

• Is it worth your time to rewrite your application?

• Do the CPU requirements justify parallelization?

• Will the code be used just once?

22
Parallelism Carries a Price Tag
• Parallel programming

• Involves a steep learning curve

• Is effort-intensive

• Parallel computing environments are unstable and unpredictable

• Don’t respond to many serial debugging and tuning techniques

• May not yield the results you want, even if you invest a lot of
time

Will the investment of your time be worth it?


23
Terms related to algorithms

• Amdahl’s Law (talked about this already)

• Superlinear Speedup

• Efficiency

• Cost

• Scalability

• Problem Size

• Gustafson’s Law

24
Superlinear Speedup

S(n) > n may be seen on occasion, but usually this is due to using a suboptimal sequential algorithm or some unique feature of the architecture that favors the parallel formation.

One common reason for superlinear speedup is the extra cache in the multiprocessor system, which can hold more of the problem data at any instant; this leads to less traffic to relatively slow main memory.

25
Efficiency
Efficiency = (execution time using one processor) / (execution time using N processors x N)

It’s just the speedup divided by the number of processors

26
Cost
The processor-time product or cost (or work) of a computation is defined as
Cost = (execution time) x (total number of processors used)

The cost of a sequential computation is simply its execution time, t_s. The cost of a parallel computation is t_p x n. The parallel execution time, t_p, is given by t_s/S(n).

Hence, the cost of a parallel computation is given by
Cost = t_p x n = t_s x n / S(n)

Cost-Optimal Parallel Algorithm

One in which the cost to solve a problem on a multiprocessor is proportional to the cost (i.e., execution time) on a single processor system.

27
Scalability

Used to indicate a hardware design that allows the system to be increased in size and in doing so to obtain increased performance - could be described as architecture or hardware scalability.

Scalability is also used to indicate that a parallel algorithm can accommodate increased data items with a low and bounded increase in computational steps - could be described as algorithmic scalability.

28
Problem size
Problem size: the number of basic steps in the best sequential
algorithm for a given problem and data set size

•Intuitively, we would think of the number of data elements being processed in the algorithm as a measure of size.

•However, doubling the data set size would not necessarily double the number of computational steps. It will depend upon the problem.

•For example, adding two matrices has this effect, but multiplying matrices quadruples operations.

Note: Bad sequential algorithms tend to scale well

29
Other names for Scaling

• Strong Scaling (Engineering)

• For a fixed problem size, how does the time to solution vary with the number of processors?

• Weak Scaling

• How the time to solution varies with processor count, with a fixed problem size per processor

30
Some Classes of machines
Network

Processor Processor Processor Processor

Memory Memory Memory Memory

Distributed Memory
Processors only have access to their local memory and “talk” to other processors over a network
31
Some Classes of machines
Uniform Shared Memory (UMA)

All processors have equal access to memory and can “talk” via memory

[Diagram: several processors all connected to one shared memory]
32
Some Classes of machines
Hybrid
Shared memory nodes
connected by a network

...

33
Some Classes of machines
More common today
Each node has a collection of multicore chips

Ra has 268 nodes:
•256 quad core, dual socket
•12 dual core, quad socket
34
Some Classes of machines

Hybrid Machines

•Add special purpose processors to normal processors
•Not a new concept, but regaining traction
•Example: our Tesla Nvidia node, cuda1

[Diagram: a "Normal" CPU paired with a Special Purpose Processor - FPGA, GPU, Vector, Cell...]

35
Network Topology

• For ultimate performance you may be concerned with how your nodes are connected.

• Avoid communications between distant nodes

• For some machines it might be difficult to control or know the placement of applications

36
Network Terminology
• Latency

• How long it takes to get between nodes in the network.

• Bandwidth

• How much data can be moved per unit time.

• Bandwidth is limited by the number of wires, the rate at which each wire can accept data, and choke points

37
Ring

38
Grid

Wrapping
produces torus

39
Tree
Fat tree: the lines get wider as you go up

40
Hypercube
[Diagram: a 3-dimensional hypercube with nodes labeled by 3-bit addresses 000 through 111; nodes whose labels differ in one bit are connected]

3 dimensional hypercube
41
4D Hypercube

[Diagram: a 4-dimensional hypercube with nodes labeled by 4-bit addresses 0000 through 1111]

Some communications algorithms are hypercube based


How big would a 7d hypercube be?
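
One way to answer: a d-dimensional hypercube has 2^d nodes, each with d links (these formulas are standard hypercube facts, not stated on the slide):

# Size of a d-dimensional hypercube: 2^d nodes, each connected to d neighbors.
def hypercube(d):
    nodes = 2 ** d
    links = d * nodes // 2        # each link is shared by two nodes
    return nodes, links

print(hypercube(3))               # (8, 12)   -- the 3-D cube shown earlier
print(hypercube(7))               # (128, 448) -- a 7-D hypercube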
42
Star

Quality depends on what is in the center


43
Example: An Infiniband Switch

Infiniband, DDR, Cisco 7024 IB Server Switch - 48 Port

Adaptors: each compute node has one DDR 1-Port HCA

4X DDR => 16 Gbit/sec

140 nanosecond hardware latency

1.26 microsecond latency at software level

44
Measured Bandwidth

45
