AA Part1

The document outlines the fundamentals of advanced architectures, focusing on serial and parallel computing concepts, including task decomposition and dependencies. It discusses the importance of granularity in task definitions and the trade-offs between fine and coarse-grained tasks, as well as the implications of Amdahl's Law on speedup and efficiency in parallel processing. The course aims to provide a comprehensive understanding of high-performance computing (HPC) systems and programming techniques.


Advanced Architectures

MOUNA BAKLOUTI
[email protected]

Course Outline
1. Fundamentals
2. HPC systems and applications
3. HPC programming

1
1. Fundamentals

Fundamental concepts
 Serial Computing: Traditionally, software has been written for serial computation:
 A problem is broken into instructions (arithmetic, memory read and write, control, …)
 Instructions are executed sequentially one after another, only one at any moment in time
 Executed on a single processor (CPU)

2
Fundamental concepts
 Serial Computing:
 The execution time of a program with N instructions on a processor that is able to execute F instructions
per second is: T = N ÷ F

 One could execute the program faster (i.e. reduce T) by increasing the value of F, and this has been
the trend during more than 30 years of technology and computer architecture evolution.

 The sequential execution became faster ("Instruction Level Parallelism", "Pipelining", Higher Frequencies)
 More and smaller transistors = more performance
 Programmers simply waited for the next processor generation
 Today:
 The frequency of processors does not increase significantly any more (heat dissipation problems)
 The instruction level parallelism does not increase significantly any more
 The execution speed is dominated by memory access times (but caches still become larger and faster)

Fundamental concepts
 Parallel Computing: Parallel computing is the simultaneous use of multiple compute resources to
solve a computational problem:
 A problem is broken into parts (called tasks) that can be solved concurrently
 Each part is further broken down into instructions
 Instructions from each part are executed in parallel on different processors (CPUs)

Multicore:
 Use transistors for more compute cores
 Parallelism in the software
 Programmers have to write parallel programs to
benefit from new hardware

 And data ?

3
Fundamental concepts
 Parallel Computing
 Shared memory architecture and memory address space

 Hardware support for coherent data sharing and tight synchronization

Fundamental concepts
 Parallel Computing
 Distributed-memory architecture and memory address space

 Hardware support for remote data accesses and communication

4
Fundamental concepts
 Parallel Computing
 Ideally, each processor could receive 1/P of the program, reducing its execution time by a factor of P
(the number of processors):
T = (N ÷ P) ÷ F

 Need to manage and coordinate the execution of tasks, ensuring correct access to shared resources

Expressing Tasks
 Motivation
 Suppose you have a database with information about the cars available for purchase:

 Assume that we want to count how many Green cars are available to sell.

10

5
Expressing Tasks
 Motivation
 A first approach: one could traverse all the records X[0] . . . X[n-1] in the database X and check if the
Color field matches the required value Green, storing the number of matches in variable count

This is a sequential approach since we need to traverse all the records

 A possible sequential program could be:
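The slide's listing is not reproduced here; below is a minimal C sketch of such a sequential scan (the record_t layout and field names are assumptions for illustration):

    #include <string.h>

    /* Hypothetical record layout; only the Color field matters here. */
    typedef struct { char color[16]; /* other fields omitted */ } record_t;

    int count_green(const record_t X[], int n) {
        int count = 0;
        for (int i = 0; i < n; i++)                  /* traverse all n records */
            if (strcmp(X[i].color, "Green") == 0)    /* check the Color field */
                count++;
        return count;
    }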

 Whose computation time on a single processor (T1) would be proportional to the number of records n in the database: T1 ∝ n
 It is linear as a function of the number of records

11

Expressing Tasks
 Tasks
 A second approach: one could divide the traversal in P groups (tasks), for example for P=4:

 Checking the Color field for a subset of n/P consecutive records, and counting on a per-task « private »
copy of variable count

12

6
Expressing Tasks
 Tasks
 However, we still need to « globally » count the number of records found that match the condition by
combining the individual « private » counts into the original count variable
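A minimal C sketch of this decomposition, reusing the same hypothetical record_t as before: each of the P chunks keeps a private counter, and a final reduction combines them into count (the chunking and names are illustrative, not from the slides):

    #include <string.h>

    #define P 4   /* number of tasks (assumed) */

    typedef struct { char color[16]; /* other fields omitted */ } record_t;

    int count_green_partitioned(const record_t X[], int n) {
        int count_private[P] = {0};
        int chunk = (n + P - 1) / P;          /* about n/P records per task */
        for (int t = 0; t < P; t++) {         /* each value of t is a candidate task */
            int lo = t * chunk;
            int hi = (lo + chunk < n) ? lo + chunk : n;
            for (int i = lo; i < hi; i++)
                if (strcmp(X[i].color, "Green") == 0)
                    count_private[t]++;       /* per-task "private" count */
        }
        int count = 0;
        for (int t = 0; t < P; t++)           /* global reduction */
            count += count_private[t];
        return count;
    }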

13

Expressing Tasks
 Tasks
 Up to this point you could anticipate that the computation time would be divided by the number of
tasks P if P workers are used to do the computation: TP ≈ T1 / P, where T1 is the serial execution time

 With an additional overhead to perform this global reduction, probably proportional to the number of
workers (processors) P

14

7
Tasks and Dependencies
 Motivation
 Consider now that we want to execute the following query:

on our car dealer database:

15

Tasks and Dependencies


 Tasks
 A possible query plan could be:

 Each of these operations in the query plan could be a task,


each computing an intermediate table of entries that
satisfy particular conditions.

 Are they independent?

 Some of them are: for example, tasks « Civic », « 2001 », « Green » and « White » are mutually independent
 Others are not: for example, task « Civic_AND_2001 » depends on the two previous ones and cannot
start its execution until both tasks « Civic » and « 2001 » complete

16

8
Tasks and Dependencies
 Dependences impose task execution ordering constraints that need to be fulfilled in order to guarantee
correct results

 Task dependence graph: graphical representation of the task decomposition

17

Tasks and Dependencies


 Other query plans are possible, for example:

… with different task dependence graphs and potential to execute tasks in parallel

18

9
Task Dependency Graph
 Task dependence graph abstraction
 Directed Acyclic Graph (DAG): each of the edges has a direction and there is no cycle in the graph

 Node = task; its weight represents the amount of work to be done

 Edge = dependence, i.e. the successor node can only execute after the predecessor node has completed

19

Task Dependency Graph


 Total work T1: the sum of the work (weights) of all nodes in the task graph
 Parallel machine abstraction
 P identical processors
 Each processor executes a node at a time

20

10
Task Dependency Graph
 Critical path T∞

 Critical path: path in the task graph with the highest accumulated work

 Assuming sufficient processors, T∞ (the accumulated work along the critical path) is the minimum possible parallel execution time

21

Task Dependency Graph


 Parallelism and Pmin

 Parallelism = T1 / T∞: the maximum speed-up that could be achieved if sufficient processors were available

 Pmin is the minimum number of processors necessary to achieve that Parallelism

22

11
Task Dependency Graph
 Wrap Up

23

Task Dependency Graph


 Question
 Consider the task dependency graphs for the two database queries, assuming
work_node is proportional to the number of inputs to be processed. What are T1, T∞
and the Parallelism in each case?

24

12
Task Dependency Graph
 Solution

25

Granularity and Parallelism


 The granularity of the task decomposition is determined by the computational size of the nodes (tasks)
in the task graph
 Example: counting matches in our car dealer database

26

13
Granularity and Parallelism
 Example: a task could be in charge of checking a number of consecutive elements m of the database:

 With a potential parallelism = n/m


n is the number of records in the database
or basically the number of times that we
have to execute the loop

27

Granularity and Parallelism


 It would appear that the parallelism is higher when going to fine-grain task decompositions

 However, there is a tradeoff between potential parallelism and the overheads related to its exploitation
(e.g. creation of tasks, synchronization, exchange of data, …)

28

14
Task definition
 Can the computation be divided in parts ?
 Task decomposition: based on the processing to do (e.g. functions, loop iterations…)
 Data decomposition: based on the data to be processed (e.g. elements of a vector,
rows of a matrix) (implies task decomposition)
 There may be (data or control) dependencies between tasks
 Metrics to understand how our task/data decomposition can potentially
behave
 Factors: granularity and overheads

29

Task definition: Recap


 TDG: directed acyclic graph to represent tasks and dependencies between
them
 Metrics: T1 (total work), T∞ (critical path)

 Parallelism = T1 / T∞
 Pmin is the minimum number of processors necessary to achieve Parallelism

 Task granularity vs. number of tasks

30

15
Example: Vector Sum
 Compute the sum of elements X[0] . . . X[n-1] of a vector X
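A minimal sequential C version (a sketch; the slide's exact listing is not reproduced):

    /* Sequential vector sum; each iteration of the i loop is a candidate task. */
    double vector_sum(const double X[], int n) {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum = sum + X[i];   /* iteration i needs the partial sum from iteration i-1 */
        return sum;
    }

The loop-carried dependence on sum chains the tasks one after another, which is why this decomposition exposes little parallelism.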

 Task definition: each iteration of the i loop is a task


 TDG (with input data):  Metrics

 How can we design an algorithm which leads to a TDG with more parallelism?
31

Example: Vector Sum


 Writing a recursive version of the sequential program to compute the sum of
elements X[0]…X[n-1] of a vector X, following a divide-and-conquer strategy:
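A minimal C sketch of such a recursive_sum (the exact signature and base case are assumptions, not the slide's listing):

    /* Divide-and-conquer sum of X[lo..hi-1]; each invocation is a task. */
    double recursive_sum(const double X[], int lo, int hi) {
        if (hi - lo == 1)                  /* simple enough: a single element */
            return X[lo];
        int mid = lo + (hi - lo) / 2;      /* divide into two halves */
        double left  = recursive_sum(X, lo, mid);
        double right = recursive_sum(X, mid, hi);
        return left + right;               /* combine the two partial sums */
    }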
Assume we have 8 elements: we apply a divide-and-conquer strategy, i.e. a recursive version of the
sequential program. Remember that the divide-and-conquer strategy divides a problem into subproblems
of the same type until they become simple enough to be solved directly. At the end we have 8 subvectors
of one element each, which is the total number of elements in the vector.

32

16
Example: Vector Sum
 Task definition: each invocation to recursive_sum
 TDG (with input data):

 Metrics: T1 is still proportional to n, while T∞ is now proportional to log2(n), so the Parallelism grows as n / log2(n)
 Same problem can be expressed with different algorithms/implementations
leading to different metrics

33

Advanced granularity
 Given a sequential program, the number of tasks that one can generate and
the size of the tasks (what is called granularity) are related to one another.
 Fine-grained tasks vs. coarse-grained tasks
 The parallelism increases as the decomposition becomes finer in granularity (small
tasks) and vice versa

34

17
Advanced granularity
Fine-grained Decomposition
 Example: matrix-vector product (n by n matrix):
 A task could be each individual x and + in the dot product that contributes to the computation of an
element of y

 A task could also be each complete dot product to compute an element of y


For fine-grained parallelism, one task could compute each element y[i], so the number of tasks equals
the number of elements of y (the number of rows of the matrix)
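A plain C sketch of the product (illustrative; row-major storage is an assumption): each outer iteration, i.e. each complete dot product producing y[i], is a candidate fine-grained task, and an even finer choice would be each individual multiply-add inside the inner loop.

    /* y = A * b for an n-by-n matrix A stored row-major. */
    void matvec(int n, const double A[], const double b[], double y[]) {
        for (int i = 0; i < n; i++) {        /* candidate task: one dot product */
            double dot = 0.0;
            for (int j = 0; j < n; j++)
                dot += A[i * n + j] * b[j];  /* finest grain: one multiply-add */
            y[i] = dot;
        }
    }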

35

Advanced granularity
Coarse-grained Decomposition

 A task could be in charge of computing a number of consecutive elements of y (e.g. three elements)

Here, one task is in charge of computing a number of consecutive elements of y, i.e. a block of rows of
the matrix (in this example, a task computes 3 elements of y).

 A task could also be in charge of computing the whole vector y. This is a very coarse grain with no
parallelism at all: the granularity is maximal because a single task does all the work, so parallelism = 1.

36

18
Advanced granularity
So…
 It would appear that the parallel time can be made arbitrarily small by making the decomposition finer
in granularity, but …
 There is a tradeoff between the granularity of a decomposition and the associated overheads (sources of
overhead: creation of tasks, task synchronization, exchange of data between tasks, …)
 The granularity may determine performance bounds

In this example, the finest granularity is one task per update y[i] += A[i,j]*b[j]; it gives the maximum
parallelism but requires n² tasks, where n is the number of rows and columns of matrix A. Because of the
associated overheads, the finest grain is not always the best choice.

37

Example: stencil computation using Jacobi solver


 Stencil algorithm that computes each element of matrix utmp using 4
neighbor elements of matrix u, both matrices with nxn elements (updates elements
in a multi-dimensional array based on neighboring values using a fixed pattern)
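A minimal C sketch of one such sweep (the boundary handling and the 1/4 averaging factor are assumptions for illustration):

    /* One Jacobi sweep: each interior element of utmp is computed from the
       4 neighbours of the corresponding element of u (n-by-n, row-major). */
    void jacobi_sweep(int n, const double u[], double utmp[]) {
        for (int i = 1; i < n - 1; i++)           /* skip boundary rows */
            for (int j = 1; j < n - 1; j++)       /* skip boundary columns */
                utmp[i * n + j] = 0.25 * (u[(i - 1) * n + j] + u[(i + 1) * n + j]
                                        + u[i * n + j - 1] + u[i * n + j + 1]);
    }

Tasks can then be defined per element, per row, or per block of rows, giving the different granularities discussed next.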

38

19
Example: stencil computation using Jacobi solver

Different task decompositions of the same problem:

 Finer-grain task decomposition → higher parallelism, but …


39

Example: stencil computation using Jacobi solver


… more overhead

The finer the granularity, the more tasks we need (because each task does less work) and the more
task-creation overhead we incur; there may also be overhead from data exchange and synchronization.

 Trade-off between task granularity and task creation overhead

40

20
Speedup and Efficiency
Execution time on P processors
 Tp = execution time on P processors
 Task scheduling: how are tasks assigned to processors? For example, consider 2 processors; here is how
the tasks are split among them, depending on the dependences:

41

Speedup and Efficiency


Speedup
 Speedup Sp: relative reduction of the sequential execution time when using P processors: Sp = T1 / Tp

The greater the speedup, the better the parallelism

for P=2

42

21
Speedup and Efficiency
Scalability and Efficiency
 Scalability: how the speed-up evolves when the number of processors is increased
 Efficiency: Ep = Sp / P; the greater the efficiency, the better the parallelism

The maximum parallelism is when speedup = P, which is the ideal speedup, and efficiency = 100%; in real
cases we cannot reach this ideal because of problems like load imbalance, dependencies between tasks,
and the overheads of task creation and synchronization... (there are some special cases where it can happen)

43

Speedup and Efficiency


Strong vs Weak Scalability
 Two usual scenarios to evaluate the scalability of an application:
 Increase the number of processors P with constant problem size (strong scaling → reduce the execution
time). Since the problem size stays the same, as we increase P the granularity per processor becomes smaller.
 Increase the number of processors P with problem size proportional to P (weak scaling → solve a larger
problem). As we increase P we also increase the problem size, so the granularity per processor stays constant.

44

22
Fundamental concepts
Amdahl’s Law
 How much faster will the program run? Speedup(n) = T(1) / T(n)

 where
 T(1) : time to run the program on one processor
 T(n) : time to run the program on n processors

 It tells how efficiently you parallelize your code

 If we consider that the program has a serial fraction S and a parallel fraction P (with S + P = 1), then
T(n) = T(1) · (S + P/n), so Speedup(n) = 1 / (S + P/n), which is bounded by 1/S

45

Fundamental concepts
Amdahl’s Law
 Performance improvement is limited by the fraction of time the program does not run in fully
parallel mode

46

23
Fundamental concepts
Amdahl’s Law
 Over-simplified model, but it tells us that there are limits to the scalability of parallelism

47

Fundamental concepts
Amdahl’s Law
 Assume the following simplified case, where the parallel fraction φ is the fraction of the total
execution time during which the program can run in parallel, and (1 − φ) is the sequential fraction.

The execution time of the application using P processors is then:

TP = (1 − φ) · T1 + (φ · T1) / P

48

24
Fundamental concepts
Amdahl’s Law
 From this we can compute the speed-up Sp that can be achieved as

Sp = T1 / TP = 1 / ((1 − φ) + φ/P)    (T1 is the fully sequential time, (1 − φ) the sequential fraction)

 Two particular cases: φ = 0, when we cannot parallelize any part of the program (Sp = 1), and φ = 1, when
the whole program runs in parallel (Sp = P). With an unlimited number of processors the speed-up is
bounded by S∞ = 1 / (1 − φ).
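As a small worked illustration (the numbers are assumed, not from the slides): with φ = 0.9 and P = 4,
S4 = 1 / (0.1 + 0.9/4) ≈ 3.1, and even with infinite processors the speed-up cannot exceed S∞ = 1 / 0.1 = 10.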

49

Fundamental concepts
Overhead sources
 Parallel computing is not for free; we should account for overheads (i.e. any cost that gets added to a
sequential computation so as to enable it to run in parallel)

On the left, the overhead of task creation; in the middle, the overhead needed to ensure that shared objects and variables are accessed with mutual
exclusion (avoiding data corruption and extra memory latency), together with the task-synchronization overhead that ensures dependencies between
tasks are satisfied; on the right, the synchronization cost paid to wait for all the tasks to finish their parallel execution.
50

25
Fundamental concepts
Common overheads
 Data sharing: can be explicit via messages, or implicit via a memory hierarchy (caches)
 Idleness: thread cannot find any useful work to execute (e.g. dependences, load imbalance, poor
communication and computation overlap or hiding of memory latencies, …)
 Computation: extra work added to obtain a parallel algorithm (e.g. replication)
 Memory: extra memory used to obtain a parallel algorithm (e.g. impact on memory hierarchy, …)

51

Problem 1
1. Assume we want to execute two different applications in our parallel machine with 4 processors:
application App1 is sequential; App2 is parallelised defining 4 tasks, each task executing one fourth of
the total application. The sequential time for the applications is 8 and 40 time units, respectively.
Assuming that App1 can start its execution at time 4 and App2 starts at time 0, draw a time line
showing how they will be executed if the operating system:
a) Does not allow multiprogramming, i.e. only one application can be executed at the same time in the system.
b) Allows multiprogramming so that the system tries to have both applications running concurrently, each
application exactly making use of the number of processors that is able to use.
c) The same as in the second case, but now App2 is parallelised defining 3 tasks, each task executing one third of
the total application.

52

26
Solution
a)

b)

53

Solution
c)

54

27
Problem 2
1. We give the following TDG:
a) Calculate T1
b) Critical Path
c) Calculate T∞
d) Parallelism and minimum number of processors (Pmin)

55

Solution

56

28
Problem 3
We give the following code written in C:

57

Problem 3
Assuming that:
1. In the initialization loop, the execution of each iteration of the internal loop lasts 10 cycles;
2. In the computation loop, the execution of each iteration of the internal loop lasts 100 cycles; and
3. The execution of the foo function does not cause any kind of dependence (i.e. no overhead). We ask:
a) Draw the Task Dependence Graph (TDG), indicating for each node its cost in terms of execution time,
b) Calculate the values for T1 and T∞ as well as the potential Parallelism.
c) Calculate which is the best value for the « speed-up » on 4 processors (S4), indicating which would be the proper
task mapping (assignment) to processors to achieve it.

58

29
Solution
a)

b)

59

Solution
c)

60

30
Problem 4
Given the following code:

61

Problem 4
Assuming that:
1. The execution of the modify_d routine takes 10 time units and the execution of the modify_nd routine
takes 5 time units;
2. Each internal iteration of the computation loop (i.e. each internal iteration (for k) of the for_compute
task) takes 5 time units; and
3. The execution of the output task takes 100 time units, we ask:
a) Draw the Task Dependence Graph (TDG), indicating for each node its cost in terms of execution time (in time
units),
b) Compute the values for T1 and T∞ and the potential Parallelism, as well as the parallel fraction (phi).
c) Indicate which would be the most appropriate task assignment on 2 processors in order to obtain the best
possible « speed-up ». For that assignment, calculate T2 and S2.

62

31
Solution
a)

63

Solution
b)

64

32
Solution

65

Problem 5
The following figure shows an incomplete time diagram for the execution of a parallel application on 4
processors:

66

33
Problem 5
The figure has a set of rectangles, each rectangle represents the execution of a task with its associated
cost in time units. In the timeline there are two regions (1 and 2) with 4 parallel tasks each. The
execution cost for tasks in region1 is unknown (x time units each); the cost for each task in region2 is 4
time units. The computation starts with a sequential task (with cost 5), then all tasks in region1, running
in parallel, followed by another sequential task (with cost 2), then all tasks in region2 running in parallel
followed by a final sequential task (with cost 1).
Knowing that an ideal speed-up of 9 could be achieved if the application could make use of infinite
processors (S∞ = 9) and assuming that the two parallel regions can be decomposed ideally, with as
many tasks as processors with the appropriate fraction of the original cost, we ask:
a. What is the parallel fraction (φ) for the application represented in the time diagram above ?
b. Which is the « speed-up » that is achieved in the execution with 4 processors (S4)?
c. Which is the value x in region1 ?

67

Solution
a)

b)

c)

68

34
Fundamental concepts
Flynn’s Classical Taxonomy
 There are different ways to classify parallel computers
 One of the more widely used classifications, in use since 1966, is called Flynn’s Taxonomy
 Flynn’s taxonomy distinguishes multi-processor computer architectures according to how they
can be classified along the two independent dimensions of Instruction Stream and Data Stream.
Each can have only one of two possible states: Single or Multiple
 Four possible classifications according to Flynn:

69

Single Instruction Single Data (SISD)


 A serial (non-parallel) computer
 Single Instruction: Only one instruction stream is performed
on the CPU during any one clock cycle
 Single Data: Only one data stream is being used as input
during any one clock cycle
 Deterministic execution
 This is the oldest type of computer

70

35
Single Instruction Multiple Data (SIMD)
 A type of parallel computer
 Single Instruction: All processing units execute the same
instruction at any given clock cycle
 Multiple Data: Each processing unit can operate on a different
data element
 Best suited for specialized problems characterized by a high
degree of regularity, such as graphics/image processing

71

Multiple Instruction Single Data (MISD)


 A type of parallel computer
 Multiple Instruction: Each processing unit operates on the data independently via
separate instruction streams
 Single Data: A single data stream is fed into multiple processing
units

 Some conceivable uses might be:


 multiple frequency filters operating on a single signal stream
 multiple cryptography algorithms attempting to crack a single coded message

72

36
Multiple Instruction, Multiple Data
(MIMD)
 A type of parallel computer
 Multiple Instruction: Every processor may be executing a different instruction stream
 Multiple Data: Every processor may be working with a different data stream

 Execution can be synchronous or asynchronous, deterministic or non-deterministic


 Most modern supercomputers fall into this category
 Many MIMD architectures also include SIMD execution sub-components

73

Modern CPUs performance


Transistor Galore: Moore’s Law
 Gordon Moore, co-founder of Intel Corp, 1965
“ The number of transistors on a chip that are required to hit the « sweet spot » of minimal
manufacturing cost per component would double about every 24 months.”

 The growth in complexity has always roughly translated to an equivalent growth in compute
performance.
 Increasing chip transistor counts and clock speeds ⇒ More advanced techniques to improve
performance.

74

37
Moore’s law

The number of transistors on a


microchip will double every two
years, leading to exponential
growth in computing power.
This law has been driving
technological advancements for
over 50 years and has had a
profound impact on artificial
intelligence (AI).

Data source: https://www.unite.ai/moores-law/, 2024

75

Growth in processor performance

Data source: https://semiengineering.com/ai-accelerator-architectures-poised-for-big-changes/, 2024

76

38
Moore’s Law & Performance Improvement
Pipelined functional units
 Subdivide complex operations into simple components executed on different functional units on the CPU
 Instruction Level Parallelism (ILP)

Superscalar architecture
 More than one instruction per cycle
 Multiple / identical functional units which can operate concurrently

Data parallelism through SIMD instructions
 Identical operation on an array of int or FP operands
 Intel (SSE), AMD (3dNow!), Power/PowerPC (AltiVec)

Out-of-order execution
 Execute instructions in an unordered fashion
 Hundreds of instructions in flight

Larger caches
 Hold copies of data to be used soon
 Solution to the DRAM gap

Simplified instruction set
 General move from CISC to RISC paradigm
 x86-based processors execute CISC machine code, with on-the-fly translation into RISC µ-ops
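As an illustration of SIMD data parallelism (a sketch, not from the slides), the following C fragment uses x86 SSE intrinsics to add two float arrays four elements per instruction; it assumes n is a multiple of 4:

    #include <immintrin.h>   /* x86 SSE intrinsics */

    void add_simd(int n, const float *x, const float *y, float *z) {
        for (int i = 0; i < n; i += 4) {
            __m128 a = _mm_loadu_ps(&x[i]);          /* load 4 floats */
            __m128 b = _mm_loadu_ps(&y[i]);
            _mm_storeu_ps(&z[i], _mm_add_ps(a, b));  /* 4 additions at once */
        }
    }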

77

Memory Architecture
Shared memory: General characteristics
 All processors can access all memory as global address space
 Multiple processors can operate independently but share the same memory resources
 Changes in a memory location performed by one processor are visible to all other processors

78

39
Memory Architecture
Shared memory: Uniform Memory Access (UMA)
 Most commonly represented today by Symmetric Multiprocessor (SMP) machines
 Identical processors
 Equal access and access times to memory

79

Memory Architecture
Shared memory: Non-Uniform Memory Access (NUMA)
 Often made by physically linking two or more SMPs
 One SMP can directly access memory of another SMP
 Not all processors have equal access time to all memories
 Memory access across link is slower

80

40
Memory Architecture
Shared memory
 Advantages
 Global address space provides user-friendly programming with regard to memory
 Data sharing between tasks is both fast and uniform due to the proximity of memory to CPUs

 Disadvantages
 The lack of scalability between memory and CPUs: adding more CPUs can geometrically increase traffic
on the shared memory-CPU path
 Programmer responsibility for synchronization to ensure ”correct” access to global memory
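A minimal sketch of that responsibility (an assumed example, not from the slides): several threads update one shared counter, so the update must be protected, here with a POSIX mutex.

    #include <pthread.h>

    long shared_count = 0;
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 1000; i++) {
            pthread_mutex_lock(&lock);     /* without this, concurrent updates can be lost */
            shared_count++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }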

81

Memory Architecture
Distributed memory: General characteristics
 Distributed memory systems require a communication network to connect inter-processor
memory
 Processors have their own local memory. Memory addresses in one processor do not map to
another processor, so there is no concept of global address space across all processors
 The programmer has to explicitly define how and when data should be communicated as well
as when synchronization between tasks should be performed
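A minimal MPI sketch of such explicit communication (illustrative, not from the slides): process 0 sends one integer to process 1.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, value = 42;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* explicit send */
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                           /* explicit receive */
            printf("rank 1 received %d from rank 0\n", value);
        }
        MPI_Finalize();
        return 0;
    }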

82

41
Memory Architecture
Distributed memory: General characteristics
 The network ”fabric” used for data transfer varies widely.

83

Memory Architecture
Distributed memory
 Advantages
 Memory is scalable with the number of processors
 Each processor can rapidly access its own memory without interference and without the overhead
incurred with trying to maintain global cache coherency

 Disadvantages
 The programmer is responsible for many of the details associated with data communication between
processors
 It may be difficult to map existing data structures, based on global memory, to this memory organization

84

42
Memory Architecture
Hybrid Distributed-Shared memory: General characteristics
 The largest and fastest computers in the world today employ both shared and distributed
memory architectures
 The shared memory component can be a shared memory machine and/or graphics processing
units (GPU)
 Processors on a compute node share same memory space
 Requires communication to exchange data between compute nodes

85

Useful Metrics and Relative Benchmarks


FLOPS [Gflop/s] – High Performance LINPACK (HPL)
64-bit floating-point operations per second. It measures computer performance and is useful for
scientific workloads; it is more relevant than the number of instructions per second in this case.
AI-Flops – HPL - Artificial Intelligence (HPL-AI) (NEW)
Mixed-precision floating-point operations per second (from FP64 down to FP16). It measures computer
performance and is useful for AI workloads.
Bandwidth (or throughput) [Gbyte/s] - STREAM
It’s the maximum rate of data transfer across a given path. Bandwidth may be characterized
as network bandwidth or data bandwidth.
Latency [ms] - PingPong
It’s the amount of time it takes a data packet to travel from point A to point B. Together,
bandwidth and latency define the speed and capacity of a network.

86

43
