AA Part1
MOUNA BAKLOUTI
[email protected]
Course Outline
1. Fundamentals
2. HPC systems and applications
3. HPC programming
1. Fundamentals
Fundamental concepts
Serial Computing: Traditionally, software has been written for serial computation:
A problem is broken into instructions (arithmetic, memory read and write, control, …)
Instructions are executed sequentially one after another, only one at any moment in time
Executed on a single processor (CPU)
Fundamental concepts
Serial Computing:
The execution time of a program with N instructions on a processor that is able to execute F instructions
per second is: T = N ÷ F
One could execute the program faster (i.e. reduce T) by increasing the value of F, and this has been
the trend during more than 30 years of technology and computer architecture evolution.
The sequential execution became faster ("Instruction Level Parallelism", "Pipelining", Higher Frequencies)
More and smaller transistors = more performance
Programmers simply waited for the next processor generation
Today:
The frequency of processors does not increase significantly any more (heat dissipation problems)
The instruction level parallelism does not increase significantly any more
The execution speed is dominated by memory access times (but caches still become larger and faster)
Fundamental concepts
Parallel Computing: Parallel computing is the simultaneous use of multiple compute resources to
solve a computational problem:
A problem is broken into parts (called tasks) that can be solved concurrently
Each part is further broken down into instructions
Instructions from each part are executed in parallel on different processors (CPUs)
Multicore:
Use transistors for more compute cores
Parallelism in the software
Programmers have to write parallel programs to
benefit from new hardware
And data ?
Fundamental concepts
Parallel Computing
Shared memory architecture and memory address space
Fundamental concepts
Parallel Computing
Distributed-memory architecture and memory address space
Fundamental concepts
Parallel Computing
Ideally, each processor could receive 1/P of the program, reducing its execution time by P (number of
processors):
T = (N ÷ P) ÷ F
Need to manage and coordinate the execution of tasks, ensuring correct access to shared resources
Expressing Tasks
Motivation
Suppose you have a database with information about the cars available for purchase:
Assume that we want to count how many Green cars are available to sell.
Expressing Tasks
Motivation
A first approach: one could traverse all the records X[0] . . . X[n-1] in the database X and check if the
Color field matches the required value Green, storing the number of matches in variable count
Its computation time on a single processor (T1) would be proportional to the number of records n in the database: T1 ∝ n
It is linear as a function of the number of records
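A minimal sketch of this first, sequential approach in C (the record layout, the GREEN constant and the sample data are hypothetical, chosen only for illustration):

```c
#include <stdio.h>

#define N 8          /* number of records, hypothetical */
#define GREEN 1      /* hypothetical encoding of the colour "Green" */

/* Hypothetical record: only the field needed by the query */
typedef struct { int color; } record_t;

int main(void) {
    record_t X[N] = { {GREEN}, {0}, {GREEN}, {2}, {0}, {GREEN}, {2}, {0} };
    int count = 0;

    /* Traverse all records X[0] .. X[N-1]; T1 is proportional to N */
    for (int i = 0; i < N; i++)
        if (X[i].color == GREEN)
            count++;

    printf("Green cars available: %d\n", count);
    return 0;
}
```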
Expressing Tasks
Tasks
A second approach: one could divide the traversal into P groups (tasks), for example for P=4:
Checking the Color field for a subset of n/P consecutive records, and counting on a per-task « private »
copy of variable count
Expressing Tasks
Tasks
However, we still need to « globally » count the number of records found that match the condition by
combining the individual « private » counts into the original count variable
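A sketch of this second approach, again with hypothetical record layout and data: the traversal is split into P chunks, each chunk accumulates into a private count, and the private counts are then combined into the original count variable. Here the chunks are written as an ordinary loop; in a real parallel execution each chunk would be run by a different task/worker.

```c
#include <stdio.h>

#define N 8          /* number of records, hypothetical (assumed multiple of P) */
#define P 4          /* number of tasks */
#define GREEN 1      /* hypothetical encoding of the colour "Green" */

typedef struct { int color; } record_t;

int main(void) {
    record_t X[N] = { {GREEN}, {0}, {GREEN}, {2}, {0}, {GREEN}, {2}, {0} };
    int private_count[P] = {0};
    int count = 0;

    /* Each "task" p checks a subset of N/P consecutive records and counts
       on its private copy of the counter.  In a parallel execution the body
       of this outer loop runs as P concurrent tasks. */
    for (int p = 0; p < P; p++)
        for (int i = p * (N / P); i < (p + 1) * (N / P); i++)
            if (X[i].color == GREEN)
                private_count[p]++;

    /* Combine the private counts into the original count variable */
    for (int p = 0; p < P; p++)
        count += private_count[p];

    printf("Green cars available: %d\n", count);
    return 0;
}
```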
Expressing Tasks
Tasks
Up to this point you could anticipate that the computation time would be divided by the number of
tasks P if P workers are used to do the computation: TP ≈ T1 / P
where T1 is the serial execution time
Tasks and Dependencies
Motivation
Consider now that we want to execute the following query:
Tasks and Dependencies
Dependences impose task execution ordering constraints that need to be fulfilled in order to guarantee
correct results
Task Dependency Graph
Task dependence graph abstraction: a Directed Acyclic Graph (DAG)
Each of the edges has a direction and there is no cycle in this graph
Task Dependency Graph
Critical path T∞
Critical path: the path in the task graph with the highest accumulated work; T∞ is the accumulated work (execution time) along this path
Task Dependency Graph
Wrap Up
Task Dependency Graph
Solution
Granularity and Parallelism
Example: a task could be in charge of checking a number of consecutive elements m of the database:
However, there is a tradeoff between potential parallelism and the overheads related to its exploitation
(e.g. creation of tasks, synchronization, exchange of data, …)
Task definition
Can the computation be divided into parts?
Task decomposition: based on the processing to do (e.g. functions, loop iterations…)
Data decomposition: based on the data to be processed (e.g. elements of a vector,
rows of a matrix) (implies task decomposition)
There may be (data or control) dependencies between tasks
Metrics to understand how our task/data decomposition can potentially
behave
Factors: granularity and overheads
Parallelism = T1 / T∞
Pmin is the minimum number of processors necessary to achieve this Parallelism
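A small worked example with hypothetical numbers, just to show how the two metrics are used together:

```latex
% Hypothetical TDG metrics, chosen only for illustration
T_1 = 100 \ \text{time units (total work)}, \qquad
T_\infty = 25 \ \text{time units (work on the critical path)}
\\
\text{Parallelism} = \frac{T_1}{T_\infty} = \frac{100}{25} = 4,
\qquad
P_{\min} \ge \left\lceil \frac{T_1}{T_\infty} \right\rceil = 4
\ \text{(the exact value depends on the shape of the TDG)}
```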
Example: Vector Sum
Compute the sum of elements X[0] . . . X[n-1] of a vector X
How can we design an algorithm which leads to a TDG with more parallelism?
Example: Vector Sum
Task definition: each invocation of recursive_sum
TDG (with input data):
Metrics:
Same problem can be expressed with different algorithms/implementations
leading to different metrics
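The recursive_sum name comes from the slide; its exact signature is not shown, so the following divide-and-conquer sketch in C is only an assumption of its shape:

```c
#include <stdio.h>

/* Divide-and-conquer sum of X[low .. high-1].  Each invocation is a task;
   the two recursive calls on the halves can run concurrently, so the TDG
   is a binary tree and T_infinity grows as O(log n) instead of O(n). */
int recursive_sum(const int *X, int low, int high) {
    if (high - low == 1)            /* base case: a single element */
        return X[low];
    int mid = low + (high - low) / 2;
    int left  = recursive_sum(X, low, mid);   /* potential task */
    int right = recursive_sum(X, mid, high);  /* potential task */
    return left + right;                      /* join: depends on both halves */
}

int main(void) {
    int X[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    printf("sum = %d\n", recursive_sum(X, 0, 8));
    return 0;
}
```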
Advanced granularity
Given a sequential program, the number of tasks that one can generate and
the size of the tasks (what is called granularity) are related to each other.
Fine-grained tasks vs. coarse-grained tasks
The parallelism increases as the decomposition becomes finer in granularity (small
tasks) and vice versa
Advanced granularity
Fine-grained Decomposition
Example: matrix-vector product (n by n matrix):
A task could be each individual × and + in the dot product that contributes to the computation of an
element of y
Advanced granularity
Coarse-grained Decomposition
A task could be in charge of computing a number of consecutive elements of y (e.g. three elements)
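A sketch in C of the matrix-vector product y = A·b with this coarse-grained decomposition: each task computes a block of BS consecutive elements of y (BS = 3 matches the "three elements" of the slide; the matrix size and the initialisation are hypothetical). The fine-grained version would instead make every single y[i] += A[i][j]*b[j] its own task.

```c
#include <stdio.h>

#define N  6   /* matrix dimension, hypothetical */
#define BS 3   /* consecutive elements of y computed per task (coarse grain) */

int main(void) {
    double A[N][N], b[N], y[N];

    /* Hypothetical initialisation: A[i][j] = 1, b[j] = j */
    for (int i = 0; i < N; i++) {
        b[i] = i;
        for (int j = 0; j < N; j++) A[i][j] = 1.0;
    }

    /* One task per block of BS consecutive elements of y.  In a parallel
       execution, each iteration of the outer loop is an independent task. */
    for (int ii = 0; ii < N; ii += BS)
        for (int i = ii; i < ii + BS; i++) {
            y[i] = 0.0;
            for (int j = 0; j < N; j++)
                y[i] += A[i][j] * b[j];
        }

    for (int i = 0; i < N; i++) printf("y[%d] = %g\n", i, y[i]);
    return 0;
}
```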
Advanced granularity
So…
It would appear that the parallel time can be made arbitrarily small by making the decomposition finer in granularity, but…
There is a tradeoff between the granularity of a decomposition and the associated overheads (sources of overhead: creation of tasks, task synchronization, exchange of data between tasks, …)
The granularity may determine performance bounds
Note: in this example the finest granularity is one task per operation y[i] += A[i,j]*b[j]; it gives the maximum parallelism, with n² tasks (n being the number of rows and columns of matrix A). With that many tasks, remember the tradeoff between the granularity of the decomposition and the associated overheads: because of this overhead, the finest grain is not always the best choice.
Example: stencil computation using Jacobi solver
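The slide only shows a figure; a minimal 2D Jacobi sketch in C, under the usual assumptions (5-point stencil, two grids u and unew, a fixed number of sweeps, hypothetical boundary condition), might look like this:

```c
#include <stdio.h>

#define N      8     /* grid size per dimension, including boundary rows/cols */
#define SWEEPS 100   /* fixed number of Jacobi iterations, hypothetical */

int main(void) {
    static double u[N][N] = {0}, unew[N][N] = {0};

    /* Hypothetical boundary condition: top row held at 1.0 */
    for (int j = 0; j < N; j++) u[0][j] = unew[0][j] = 1.0;

    for (int s = 0; s < SWEEPS; s++) {
        /* 5-point stencil: every interior point is updated from the OLD
           grid only, so all (N-2)*(N-2) updates of one sweep are
           independent tasks; sweeps must run one after the other. */
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                unew[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] +
                                     u[i][j-1] + u[i][j+1]);
        /* Copy unew back into u for the next sweep */
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                u[i][j] = unew[i][j];
    }

    printf("u[N/2][N/2] = %f\n", u[N/2][N/2]);
    return 0;
}
```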
Speedup and Efficiency
Execution Time in P processors
Tp = execution time on P processors
Task scheduling: how are tasks assigned to processors? For example, consider 2 processors:
Here is the split of tasks depending on the dependencies:
for P=2
Speedup and Efficiency
Scalability and Efficiency
Scalability: how the speed-up Sp = T1 / Tp evolves when the number of processors P is increased
Efficiency: Ep = Sp / P; the greater the efficiency, the better the parallelism.
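A small worked example with hypothetical measurements (the definitions Sp = T1/Tp and Ep = Sp/P are the standard ones implied by the slide):

```latex
% Hypothetical measurements on P = 4 processors
T_1 = 100, \qquad T_4 = 40
\\
S_4 = \frac{T_1}{T_4} = \frac{100}{40} = 2.5,
\qquad
E_4 = \frac{S_4}{P} = \frac{2.5}{4} = 0.625 \ (62.5\%)
```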
Fundamental concepts
Amdahl’s Law
How much faster will the program run? Speed-up(n) = T(1) / T(n)
where
T(1) : Time to run the program on one processor
T(n) : Time to run the program on n processors
Fundamental concepts
Amdahl’s Law
Performance improvement is limited by the fraction of time the program does not run in fully
parallel mode
Fundamental concepts
Amdahl’s Law
An oversimplified model, but it tells us that there are limits to the scalability of parallelism
Fundamental concepts
Amdahl’s Law
Assume the following simplified case, where the parallel fraction φ is the fraction of the total
execution time that can be parallelized:
Fundamental concepts
Amdahl’s Law
From this we can compute the speed-up Sp that can be achieved as:
Sp = T(1) / T(P) = 1 / ((1 − φ) + φ / P)
where T(1) is the fully sequential time and (1 − φ) is the sequential fraction
Two particular cases: φ = 0 (we cannot parallelize any part of the program), which gives Sp = 1, and φ = 1 (fully parallel program), which gives Sp = P
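A worked instance of the formula with a hypothetical parallel fraction, just to show the limit it imposes:

```latex
% Hypothetical: 90% of the execution time can be parallelised
\varphi = 0.9,\ P = 4:\qquad
S_4 = \frac{1}{(1-\varphi) + \varphi/P}
    = \frac{1}{0.1 + 0.225} \approx 3.08
\\
P \to \infty:\qquad
S_\infty = \frac{1}{1-\varphi} = \frac{1}{0.1} = 10
```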
Fundamental concepts
Overhead sources
Parallel computing is not for free; we should account for overheads (i.e. any cost that gets added to a
sequential computation so as to enable it to run in parallel):
Overhead of task creation
Overhead to ensure that shared objects and variables are accessed with mutual exclusion, avoiding data corruption (and adding extra memory latency)
Task synchronization overhead to ensure that dependencies between tasks are satisfied
Final synchronization cost paid to wait for all the tasks to finish their execution in parallel
Fundamental concepts
Common overheads
Data sharing: can be explicit via messages, or implicit via a memory hierarchy (caches)
Idleness: thread cannot find any useful work to execute (e.g. dependences, load imbalance, poor
communication and computation overlap or hiding of memory latencies, …)
Computation: extra work added to obtain a parallel algorithm (e.g. replication)
Memory: extra memory used to obtain a parallel algorithm (e.g. impact on memory hierarchy, …)
Problem 1
Assume we want to execute two different applications on our parallel machine with 4 processors:
application App1 is sequential; App2 is parallelised defining 4 tasks, each task executing one fourth of
the total application. The sequential times for the applications are 8 and 40 time units, respectively.
Assuming that App1 can start its execution at time 4 and App2 starts at time 0, draw a time line
showing how they will be executed if the operating system:
a) Does not allow multiprogramming, i.e. only one application can be executed at the same time in the system.
b) Allows multiprogramming so that the system tries to have both applications running concurrently, each
application exactly making use of the number of processors that is able to use.
c) The same as in the second case, but now App2 is parallelised defining 3 tasks, each task executing one third of
the total application.
Solution
a)
b)
Solution
c)
Problem 2
We give the following TDG:
a) Calculate T1
b) Critical Path
c) Calculate T∞
d) Parallelism and minimum number of processors (Pmin)
Solution
Problem 3
We give the following code written in C:
Problem 3
Assuming that:
1. In the initialization loop, the execution of each iteration of the internal loop lasts 10 cycles;
2. In the computation loop, the execution of each iteration of the internal loop lasts 100 cycles; and
3. The execution of the foo function does not cause any kind of dependence (i.e. no overhead). We ask:
a) Draw the Task Dependence Graph (TDG), indicating for each node its cost in terms of execution time,
b) Calculate the values for T1 and T∞ as well as the potential Parallelism.
c) Calculate which is the best value for the « speed-up » on 4 processors (S4), indicating which would be the proper
task mapping (assignment) to processors to achieve it.
Solution
a)
b)
Solution
c)
Problem 4
Given the following code:
Problem 4
Assuming that:
1. The execution of the modify_d routine takes 10 time units and the execution of the modify_nd routine
takes 5 time units;
2. Each internal iteration of the computation loop (i.e. each internal iteration (for k) of the for_compute
task) takes 5 time units; and
3. The execution of the output task takes 100 time units, we ask:
a) Draw the Task Dependence Graph (TDG), indicating for each node its cost in terms of execution time (in time
units),
b) Compute the values for T1 and T∞ and the potential Parallelism, as well as the parallel fraction (phi).
c) Indicate which would be the most appropriate task assignment on 2 processors in order to obtain the best
possible « speed-up ». For that assignment, calculate T2 and S2.
Solution
a)
Solution
b)
Solution
Problem 5
The following figure shows an incomplete time diagram for the execution of a parallel application on 4
processors:
Problem 5
The figure has a set of rectangles, each rectangle represents the execution of a task with its associated
cost in time units. In the timeline there are two regions (1 and 2) with 4 parallel tasks each. The
execution cost for tasks in region1 is unknown (x time units each); the cost for each task in region2 is 4
time units. The computation starts with a sequential task (with cost 5), then all tasks in region1, running
in parallel, followed by another sequential task (with cost 2), then all tasks in region2 running in parallel
followed by a final sequential task (with cost 1).
Knowing that an ideal speed-up of 9 could be achieved if the application could make use of infinite
processors (S∞ = 9) and assuming that the two parallel regions can be decomposed ideally, with as
many tasks as processors, each with the appropriate fraction of the original cost, we ask:
a. What is the parallel fraction (φ) for the application represented in the time diagram above ?
b. Which is the « speed-up » that is achieved in the execution with 4 processors (S4)?
c. Which is the value x in region1 ?
Solution
a)
b)
c)
Fundamental concepts
Flynn’s Classical Taxonomy
There are different ways to classify parallel computers
One of the more widely used classifications, in use since 1966, is called Flynn’s Taxonomy
Flynn’s taxonomy distinguishes multi-processor computer architectures according to how they
can be classified along the two independent dimensions of Instruction Stream and Data Stream.
Each can have only one of two possible states: Single or Multiple
Four possible classifications according to Flynn: SISD (Single Instruction, Single Data), SIMD (Single Instruction, Multiple Data), MISD (Multiple Instruction, Single Data) and MIMD (Multiple Instruction, Multiple Data)
Single Instruction Multiple Data (SIMD)
A type of parallel computer
Single Instruction: All processing units execute the same
instruction at any given clock cycle
Multiple Data: Each processing unit can operate on a different
data element
Best suited for specialized problems characterized by a high
degree of regularity, such as graphics/image processing
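A minimal illustration in C of the SIMD idea, using x86 SSE intrinsics (one possible SIMD instruction set, mentioned later in these slides; requires an SSE-capable CPU and compiler):

```c
#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    /* Single instruction, multiple data: one SIMD add processes
       four float elements at once. */
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);
    _mm_storeu_ps(c, vc);

    for (int i = 0; i < 4; i++) printf("c[%d] = %g\n", i, c[i]);
    return 0;
}
```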
Multiple Instruction, Multiple Data
(MIMD)
A type of parallel computer
Multiple Instruction: Every processor may be executing a different instruction stream
Multiple Data: Every processor may be working with a different data stream
The growth in complexity has always roughly translated to an equivalent growth in compute
performance.
Increasing chip transistor counts and clock speeds ⇒ More advanced techniques to improve
performance.
Moore’s law
Moore’s Law & Performance Improvement
Pipelined functional units
  Subdivide complex operations into simple components executed on different functional units on the CPU
  Instruction Level Parallelism (ILP)
Superscalar architecture
  More than one instruction / cycle
  Multiple / identical functional units which can operate concurrently
Data parallelism through SIMD instructions
  Identical op. on an array of int or FP operands
  Intel (SSE), AMD (3dNow!), Power/PowerPC (AltiVec)
Out-of-order execution
  Execute instructions in an unordered fashion
  Hundreds of instructions in flight
Larger caches
  Hold copies of data to be used soon
  Solution to DRAM gap
Simplified instruction set
  General move from CISC to RISC paradigm
  X86-based processors execute CISC machine code
  On-the-fly translation into RISC µ-ops
Memory Architecture
Shared memory: General characteristics
All processors can access all memory as global address space
Multiple processors can operate independently but share the same memory resources
Changes in a memory location performed by one processor are visible to all other processors
Memory Architecture
Shared memory: Uniform Memory Access (UMA)
Most commonly represented today by Symmetric Multiprocessor (SMP) machines
Identical processors
Equal access and access times to memory
Memory Architecture
Shared memory: Non-Uniform Memory Access (NUMA)
Often made by physically linking two or more SMPs
One SMP can directly access memory of another SMP
Not all processors have equal access time to all memories
Memory access across link is slower
Memory Architecture
Shared memory
Advantages
Global address space provides user-friendly programming with regard to memory
Data sharing between tasks is both fast and uniform due to the proximity of memory to CPUs
Disadvantages
The lack of scalability between memory and CPUs: adding more CPUs can geometrically increase traffic
on the shared memory-CPU path
Programmer responsibility for synchronization to ensure ”correct” access to global memory
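A minimal shared-memory sketch in C using OpenMP (one common shared-memory programming model, used here only as an assumption; compile with -fopenmp). It shows the programmer's responsibility mentioned above: without the atomic directive, the concurrent updates of the shared counter would be a data race.

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    const int n = 1000000;
    long count = 0;              /* shared variable in the global address space */

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        if (i % 2 == 0) {
            #pragma omp atomic   /* synchronization: protects the shared update */
            count++;
        }
    }

    printf("count = %ld (max threads: %d)\n", count, omp_get_max_threads());
    return 0;
}
```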
Memory Architecture
Distributed memory: General characteristics
Distributed memory systems require a communication network to connect inter-processor
memory
Processors have their own local memory. Memory addresses in one processor do not map to
another processor, so there is no concept of global address space across all processors
The programmer has to explicitly define how and when data should be communicated as well
as when synchronization between tasks should be performed
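A minimal distributed-memory sketch in C using MPI message passing (MPI is one common way to program such systems, assumed here for illustration; compile with mpicc, run with mpirun -np 2). The explicit MPI_Send/MPI_Recv pair is the programmer-defined communication mentioned above.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;              /* lives only in rank 0's local memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```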
Memory Architecture
Distributed memory: General characteristics
The network ”fabric” used for data transfer varies widely.
Memory Architecture
Distributed memory
Advantages
Memory is scalable with the number of processors
Each processor can rapidly access its own memory without interference and without the overhead
incurred with trying to maintain global cache coherency
Disadvantages
The programmer is responsible for many of the details associated with data communication between
processors
It may be difficult to map existing data structures, based on global memory, to this memory organization
Memory Architecture
Hybrid Distributed-Shared memory: General characteristics
The largest and fastest computers in the world today employ both shared and distributed
memory architectures
The shared memory component can be a shared memory machine and/or graphics processing
units (GPU)
Processors on a compute node share same memory space
Requires communication to exchange data between compute nodes
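A minimal hybrid sketch in C, assuming the common MPI + OpenMP combination (MPI ranks across compute nodes, OpenMP threads sharing memory inside each node; compile with mpicc -fopenmp):

```c
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    int rank, provided;
    /* Request an MPI threading level that allows OpenMP threads per process */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Threads of one MPI process share that node's memory ... */
    #pragma omp parallel
    {
        printf("rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }
    /* ... while data between ranks would be exchanged with MPI messages. */

    MPI_Finalize();
    return 0;
}
```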