High Performance
Computing
LECTURE #2
1
Agenda
o What is parallel computing?
o Why Parallel Computers?
o Motivation
o Inevitability of parallel computing
o Application demands
o Technology and architecture trends
o Terminologies
o How is parallelism expressed in a program
o Challenges
2
What is parallel computing?
Multiple processors cooperating concurrently to solve one problem.
3
What is parallel computing?
“A parallel computer is a collection of processing elements that can
communicate and cooperate to solve large problems fast”
Almasi/Gottlieb
“communicate and cooperate”
• Nodes and interconnect architecture
• Problem partitioning (Co-ordination of events in a process)
“large problems fast”
• Programming model
• Match of model and architecture
4
What is parallel computing?
Some broad issues:
• Resource Allocation:
– How large a collection?
– How powerful are the elements?
• Data access, Communication and Synchronization
– How are data transmitted between processors?
– How do the elements cooperate and communicate?
– What are the abstractions and primitives for cooperation?
• Performance and Scalability
– How does it all translate into performance?
– How does it scale? (A service is said to be scalable when it continues to perform well as load and resources grow.)
5
Why Parallel Computers?
• Tremendous advances in microprocessor technology,
e.g. clock rates of processors increased from 40 MHz (MIPS R3000, 1988)
to 2.0 GHz (Pentium 4, 2002)
to 8.429 GHz nowadays (AMD's Bulldozer-based FX chips, 2012)
• Processors are now capable of executing multiple instructions in the same cycle
◦ The fundamental sequence of steps that a CPU performs. Also known as the "fetch-execute
cycle," it is the time in which a single instruction is fetched from memory, decoded and
executed.
◦ The first half of the cycle transfers the instruction from memory to the instruction register
and decodes it. The second half executes the instruction
6
Why parallel computing?
• The ability of the memory system to feed data to the processor at the required rate
has also increased
• In addition, significant innovations in architecture and software have
mitigated the bottlenecks posed by the datapath and memory
◦ Hence the multiplicity of data paths to increase access to storage elements (memory & disk)
7
Motivation
• Sequential architectures are reaching their physical limitations.
• Uniprocessor architectures will not be able to sustain the rate of performance
increments in the future.
• Computation requirements are ever increasing
-- visualization, distributed databases,
-- simulations, scientific prediction (earthquake), etc.
• Accelerating applications.
8
Inevitability of parallel computing
Application demand for performance
• Scientific: weather forecasting, pharmaceutical design, genomics
• Commercial: OLTP, search engine, decision support, data mining
• Scalable web servers
Technology and architecture trends
• limits to sequential CPU, memory, storage performance
• parallelism is an effective way of utilizing growing number of transistors.
• low incremental cost of supporting parallelism
9
Application Demand: Inevitability of parallel Computing
Engineering
• Earthquake and structural modeling
• Design and simulation of micro- and nano-scale systems
• Optimizing performance of modern automobiles

Computational Sciences
• Bioinformatics: functional and structural characterization of genes and proteins
• Astrophysics: exploring the evolution of galaxies
• Weather modeling, flood/tornado prediction

Commercial
• Data mining and analysis for optimizing business and marketing decisions
• Database and Web servers for online transaction processing

Computers
• Embedded systems increasingly rely on distributed control algorithms
• Network intrusion detection, cryptography, etc.
• Networks, mail servers, search engines
• Visualization architectures & entertainment
• Simulation

Traditional scientific and engineering paradigm:
1) Do theory or paper design.
2) Perform experiments or build system.
Limitations:
– Too difficult -- build large wind tunnels.
– Too expensive -- build a throw-away passenger jet.
– Too slow -- wait for climate or galactic evolution.
– Too dangerous -- weapons, drug design, climate experimentation.
10
Technology and architecture
1- Processor Capacity
11
2- Transistor Count
40% more functions can be performed by a CPU per year
Fundamentally, the use of more transistors improves performance in two ways:
◦ Parallelism: multiple operations done at once (less processing time)
◦ Locality: data references performed close to the processor (less memory latency)
12
3- Clock Rate
30% per year ---> today’s PC is yesterday’s Supercomputer
13
4- Similar Story for Memory and Disk
❖ Divergence between memory capacity and speed
o Capacity increased by 1000X from 1980-95, speed only 2X
o Larger memories are slower, while processors get faster “memory wall”
- Need to transfer more data in parallel
- Need deeper cache hierarchies
- Parallelism helps hide memory latency
❖Parallelism within memory systems too
o New designs fetch many bits within memory chip, follow with fast pipelined
transfer across narrower interface
14
5- Role of Architecture
Greatest trend in VLSI is an increase in the exploited parallelism
• Up to 1985: bit level parallelism:
– 4-bit -> 8 bit -> 16-bit – slows after 32 bit
• Mid 80s to mid 90s: Instruction Level Parallelism (ILP)
– pipelining and simple instruction sets (RISC)
– on-chip caches and functional units => superscalar execution
– Greater sophistication: out of order execution, speculation
• Nowadays:
– Hyper-threading
– Multi-core
15
❖ Definition
High-performance computing (HPC) is the use of parallel processing for
running advanced application programs efficiently, reliably and quickly.
17
Conclusions
• The hardware evolution, driven by Moore’s law, was geared toward two
things:
– exploiting parallelism
– Dealing with memory (latency, capacity)
18
Terminologies
❑ Core: a single computing unit with its own independent control
❑ Multicore is a processor having several cores that can access the same memory
concurrently
❑ A computation is decomposed into several parts called Tasks that can be computed
in parallel
❑Finding enough parallelism is (one of the) critical steps for high performance
(Amdahl’s law).
19
Performance Metrics
❑ Execution time:
The time elapsed between the beginning and the end of its execution.
❑Speedup:
The ratio between serial and parallel execution time.
Speedup= Ts/Tp
❑ Efficiency:
Ratio of speedup to the number of processors.
Efficiency= Speedup/P
20
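As a small Python sketch of these two metrics (the timing numbers Ts, Tp and the processor count P are made up purely for illustration):

```python
# Hypothetical timings: serial run Ts and parallel run Tp on P processors.
Ts = 12.0   # serial execution time in seconds (assumed)
Tp = 2.0    # parallel execution time on P processors (assumed)
P = 8

speedup = Ts / Tp          # Speedup = Ts / Tp
efficiency = speedup / P   # Efficiency = Speedup / P

print(speedup)     # 6.0
print(efficiency)  # 0.75
```

An efficiency below 1.0 reflects the overheads (communication, idling) discussed later in this lecture.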
❑ Amdahl’s Law
Used to predict maximum speedup using multiple processors.
• Let f = fraction of work performed sequentially.
• (1 - f) = fraction of work that is parallelizable.
• P = number of processors
On 1 cpu: T1 = f + (1 – f ) = 1.
On P processors: Tp = f + (1 − f)/p
• Speedup = T1/Tp = 1/(f + (1 − f)/p) < 1/f
Speedup limited by sequential part
21
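The bound above can be checked numerically with a short Python sketch (the value f = 0.1 is an arbitrary example of a 10% sequential fraction):

```python
def amdahl_speedup(f, p):
    """Speedup on p processors when fraction f of the work is sequential."""
    return 1.0 / (f + (1.0 - f) / p)

# With f = 0.1, even arbitrarily many processors cannot exceed 1/f = 10x.
print(amdahl_speedup(0.1, 8))      # ~4.71 on 8 processors
print(amdahl_speedup(0.1, 10**9))  # approaches, but never reaches, 10.0
```

Note how quickly the returns diminish: going from 8 processors to a billion buys barely a factor of two, because the sequential part dominates.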
How is parallelism expressed in a
program
IMPLICITLY
❑ Define tasks only, rest implied; or define tasks and work decomposition, rest implied.
❑ OpenMP is a high-level parallel programming model, which is mostly an implicit model.

EXPLICITLY
❑ Define tasks, work decomposition, data decomposition, communication, synchronization.
❑ MPI is a library for fully explicit parallelization.
23
1- IMPLICITLY
❑ It is a characteristic of a programming language that allows a compiler or interpreter to
automatically exploit the parallelism inherent to the computations expressed by some of the
language's constructs.
❑ A pure implicitly parallel language does not need special directives, operators or functions
to enable parallel execution.
❑ Programming languages with implicit parallelism include Axum, HPF, Id, LabVIEW, and
MATLAB M-code.
❑ Example: when taking the sine or logarithm of a group of numbers, a language that provides
implicit parallelism might allow the programmer to write the instruction over the whole group at once.
24
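As a sketch of this style in Python, NumPy's whole-array expressions are a commonly cited example: one instruction covers the whole group of numbers, and any parallel evaluation is left to the implementation (the array contents here are arbitrary):

```python
import numpy as np

# One whole-array expression: no directives, no task division in the
# source code -- the runtime is free to evaluate element-wise in parallel.
a = np.array([0.0, 1.0, 2.0, 3.0])
b = np.sin(a)   # a single high-level instruction over the whole group
print(b)
```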
Advantages
❑ A programmer does not need to worry about task division or process communication,
❑ focusing instead on the problem that his or her program is intended to solve.
❑ It generally facilitates the design of parallel programs.

Disadvantages
❑ It reduces the control that the programmer has over the parallel execution of the program,
❑ sometimes resulting in less-than-optimal parallel efficiency.
❑ Sometimes debugging is difficult.
25
2- EXPLICITLY
How is parallelism expressed in a program
❑ It is the representation of concurrent computations by means of primitives, in the form
of special-purpose directives or function calls.
❑Most parallel primitives are related to process synchronization, communication or task
partitioning.
Advantages
❑ The programmer has absolute control over the parallel execution.
❑ A skilled parallel programmer takes advantage of explicit parallelism to produce very
efficient code.

Disadvantages
❑ Programming with explicit parallelism is often difficult, especially for non-computing
specialists,
❑ because of the extra work involved in planning the task division and synchronization of
concurrent tasks.
26
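A minimal Python sketch of explicit parallelism (the 4-way decomposition and the shared total are illustrative choices, not a prescribed pattern): the programmer spells out the work distribution and the synchronization by hand.

```python
import threading

# Explicit parallelism: the programmer decides the work decomposition
# (one slice of the data per thread) and the synchronization (a lock
# guarding the shared total) -- nothing is inferred automatically.
data = list(range(1, 101))
total = 0
lock = threading.Lock()

def partial_sum(chunk):
    global total
    s = sum(chunk)          # each thread's private computation
    with lock:              # explicit synchronization on shared state
        total += s

threads = [threading.Thread(target=partial_sum, args=(data[i::4],))
           for i in range(4)]        # explicit 4-way work distribution
for t in threads:
    t.start()
for t in threads:
    t.join()                # explicit wait for all workers

print(total)  # 5050
```

In a library such as MPI the same decisions (who owns which data, when to communicate, when to wait) are made just as explicitly, only across processes rather than threads.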
Think Different
How many people doing the work → (Degree of Parallelism)
What is needed to begin the work → (Initialization)
Who does what → (Work distribution)
Access to work part → (Data/IO access)
Whether they need info from each other to finish their own job → (Communication)
When are they all done → (Synchronization)
What needs to be done to collate the result
27
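The checklist above can be mapped onto a tiny Python map-reduce sketch (the worker count and data are arbitrary illustrations):

```python
from concurrent.futures import ThreadPoolExecutor

# Degree of parallelism: 4 workers.  Initialization: building the chunks.
# Work distribution: executor.map.  Communication: none between workers.
# Synchronization + collation: gathering and summing the partial results.
def work(chunk):
    return sum(x * x for x in chunk)   # each worker's independent share

data = list(range(10))
chunks = [data[i::4] for i in range(4)]        # work distribution

with ThreadPoolExecutor(max_workers=4) as ex:  # degree of parallelism
    partials = list(ex.map(work, chunks))      # implicit barrier on exit

result = sum(partials)                         # collate the result
print(result)  # 285
```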
Challenges
All parallel programs contain:
❑ Parallel sections
❑ Serial sections
❑ Serial sections are where work is being duplicated or no useful work is being done (e.g.
waiting for others)
Building efficient algorithms means avoiding:
❑ Communication delay
❑ Idling
❑ Synchronization
28
Sources of overhead in parallel programs
❑ Inter process interaction:
The time spent communicating data between processing elements is
usually the most significant source of parallel processing overhead.
❑ Idling:
Processes may become idle due to many reasons such as load
imbalance, synchronization, and presence of serial components in a
program.
❑ Excess Computation:
The fastest known sequential algorithm for a problem may be difficult
or impossible to parallelize, forcing us to use a parallel algorithm
based on a poorer but easily parallelizable sequential algorithm.
29