AA Part1
MOUNA BAKLOUTI
[email protected]
Course Outline
1. Fundamentals
2. HPC systems and applications
3. HPC programming
1. Fundamentals
Fundamental concepts
Serial Computing: Traditionally, software has been written for serial computation:
A problem is broken into instructions (arithmetic, memory read and write, control, …)
Instructions are executed sequentially one after another, only one at any moment in time
Executed on a single processor (CPU)
Fundamental concepts
Serial Computing:
The execution time of a program with N instructions on a processor that is able to execute F instructions
per second is: T = N ÷ F
One could execute the program faster (i.e. reduce T) by increasing the value of F, and this has been
the trend during more than 30 years of technology and computer architecture evolution.
The sequential execution became faster ("Instruction Level Parallelism", "Pipelining", Higher Frequencies)
More and smaller transistors = more performance
Programmers simply waited for the next processor generation
Today:
The frequency of processors does not increase significantly any more (heat dissipation problems)
The instruction level parallelism does not increase significantly any more
The execution speed is dominated by memory access times (but caches still become larger and faster)
Fundamental concepts
Parallel Computing: Parallel computing is the simultaneous use of multiple compute resources to
solve a computational problem:
A problem is broken into parts (called tasks) that can be solved concurrently
Each part is further broken down into instructions
Instructions from each part are executed in parallel on different processors (CPUs)
Multicore:
Use transistors for more compute cores
Parallelism in the software
Programmers have to write parallel programs to
benefit from new hardware
And data ?
Fundamental concepts
Parallel Computing
Shared memory architecture and memory address space
Fundamental concepts
Parallel Computing
Distributed-memory architecture and memory address space
Fundamental concepts
Parallel Computing
Ideally, each processor could receive 1/P of the program, reducing its execution time by P (number of
processors):
T = (N ÷ P) ÷ F
Need to manage and coordinate the execution of tasks, ensuring correct access to shared resources
Expressing Tasks
Motivation
Suppose you have a database with information about the cars available for purchase:
Assume that we want to count how many Green cars are available to sell.
Expressing Tasks
Motivation
A first approach: one could traverse all the records X[0] . . . X[n-1] in the database X and check if the
Color field matches the required value Green, storing the number of matches in variable count
Its computation time on a single processor (T1) would be proportional to the number of records n in the database: T1 ∝ n
It is linear as a function of the number of records
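A minimal sketch of this first, sequential approach in C (the record layout, the GREEN constant and the sample data are hypothetical, chosen only for illustration):

```c
#include <stdio.h>

#define N 8          /* number of records, hypothetical */
#define GREEN 1      /* hypothetical encoding of the colour "Green" */

/* Hypothetical record: only the field needed by the query */
typedef struct { int color; } record_t;

int main(void) {
    record_t X[N] = { {GREEN}, {0}, {GREEN}, {2}, {0}, {GREEN}, {2}, {0} };
    int count = 0;

    /* Traverse all records X[0] .. X[N-1]; T1 is proportional to N */
    for (int i = 0; i < N; i++)
        if (X[i].color == GREEN)
            count++;

    printf("Green cars available: %d\n", count);
    return 0;
}
```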
Expressing Tasks
Tasks
A second approach: one could divide the traversal into P groups (tasks), for example for P=4:
Checking the Color field for a subset of n/P consecutive records, and counting on a per-task « private »
copy of variable count
Expressing Tasks
Tasks
However, we still need to « globally » count the number of records found that match the condition by
combining the individual « private » counts into the original count variable
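A sketch of this second approach, again with hypothetical record layout and data: the traversal is split into P chunks, each chunk accumulates into a private count, and the private counts are then combined into the original count variable. Here the chunks are written as an ordinary loop; in a real parallel execution each chunk would be run by a different task/worker.

```c
#include <stdio.h>

#define N 8          /* number of records, hypothetical (assumed multiple of P) */
#define P 4          /* number of tasks */
#define GREEN 1      /* hypothetical encoding of the colour "Green" */

typedef struct { int color; } record_t;

int main(void) {
    record_t X[N] = { {GREEN}, {0}, {GREEN}, {2}, {0}, {GREEN}, {2}, {0} };
    int private_count[P] = {0};
    int count = 0;

    /* Each "task" p checks a subset of N/P consecutive records and counts
       on its private copy of the counter.  In a parallel execution the body
       of this outer loop runs as P concurrent tasks. */
    for (int p = 0; p < P; p++)
        for (int i = p * (N / P); i < (p + 1) * (N / P); i++)
            if (X[i].color == GREEN)
                private_count[p]++;

    /* Combine the private counts into the original count variable */
    for (int p = 0; p < P; p++)
        count += private_count[p];

    printf("Green cars available: %d\n", count);
    return 0;
}
```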
Expressing Tasks
Tasks
Up to this point you could anticipate that the computation time would be divided by the number of
tasks P if P workers are used to do the computation: TP ≈ T1 / P
where T1 is the serial execution time
Tasks and Dependencies
Motivation
Consider now that we want to execute the following query:
Tasks and Dependencies
Dependences impose task execution ordering constraints that need to be fulfilled in order to guarantee
correct results
Task Dependency Graph
Task dependence graph abstraction: a Directed Acyclic Graph (DAG)
Each of the edges has a direction and there is no cycle in this graph
Task Dependency Graph
Critical path T∞
Critical path: the path in the task graph with the highest accumulated work; T∞ is the accumulated work (execution time) along this path
Task Dependency Graph
Wrap Up
Task Dependency Graph
Solution
Granularity and Parallelism
Example: a task could be in charge of checking a number of consecutive elements m of the database:
However, there is a tradeoff between potential parallelism and the overheads related to its exploitation
(e.g. creation of tasks, synchronization, exchange of data, …)
Task definition
Can the computation be divided into parts?
Task decomposition: based on the processing to do (e.g. functions, loop iterations…)
Data decomposition: based on the data to be processed (e.g. elements of a vector,
rows of a matrix) (implies task decomposition)
There may be (data or control) dependencies between tasks
Metrics to understand how our task/data decomposition can potentially
behave
Factors: granularity and overheads
Parallelism = T1 / T∞
Pmin is the minimum number of processors necessary to achieve this Parallelism
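A small worked example with hypothetical numbers, just to show how the two metrics are used together:

```latex
% Hypothetical TDG metrics, chosen only for illustration
T_1 = 100 \ \text{time units (total work)}, \qquad
T_\infty = 25 \ \text{time units (work on the critical path)}
\\
\text{Parallelism} = \frac{T_1}{T_\infty} = \frac{100}{25} = 4,
\qquad
P_{\min} \ge \left\lceil \frac{T_1}{T_\infty} \right\rceil = 4
\ \text{(the exact value depends on the shape of the TDG)}
```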
Example: Vector Sum
Compute the sum of elements X[0] . . . X[n-1] of a vector X
How can we design an algorithm which leads to a TDG with more parallelism?
Example: Vector Sum
Task definition: each invocation of recursive_sum
TDG (with input data):
Metrics:
Same problem can be expressed with different algorithms/implementations
leading to different metrics
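The recursive_sum name comes from the slide; its exact signature is not shown, so the following divide-and-conquer sketch in C is only an assumption of its shape:

```c
#include <stdio.h>

/* Divide-and-conquer sum of X[low .. high-1].  Each invocation is a task;
   the two recursive calls on the halves can run concurrently, so the TDG
   is a binary tree and T_infinity grows as O(log n) instead of O(n). */
int recursive_sum(const int *X, int low, int high) {
    if (high - low == 1)            /* base case: a single element */
        return X[low];
    int mid = low + (high - low) / 2;
    int left  = recursive_sum(X, low, mid);   /* potential task */
    int right = recursive_sum(X, mid, high);  /* potential task */
    return left + right;                      /* join: depends on both halves */
}

int main(void) {
    int X[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    printf("sum = %d\n", recursive_sum(X, 0, 8));
    return 0;
}
```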
Advanced granularity
Given a sequential program, the number of tasks that one can generate and
the size of the tasks (what is called granularity) are related to each other.
Fine-grained tasks vs. coarse-grained tasks
The parallelism increases as the decomposition becomes finer in granularity (small
tasks) and vice versa
Advanced granularity
Fine-grained Decomposition
Example: matrix-vector product (n by n matrix):
A task could be each individual × and + in the dot product that contributes to the computation of an
element of y
Advanced granularity
Coarse-grained Decomposition
A task could be in charge of computing a number of consecutive elements of y (e.g. three elements)
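A sketch in C of the matrix-vector product y = A·b with this coarse-grained decomposition: each task computes a block of BS consecutive elements of y (BS = 3 matches the "three elements" of the slide; the matrix size and the initialisation are hypothetical). The fine-grained version would instead make every single y[i] += A[i][j]*b[j] its own task.

```c
#include <stdio.h>

#define N  6   /* matrix dimension, hypothetical */
#define BS 3   /* consecutive elements of y computed per task (coarse grain) */

int main(void) {
    double A[N][N], b[N], y[N];

    /* Hypothetical initialisation: A[i][j] = 1, b[j] = j */
    for (int i = 0; i < N; i++) {
        b[i] = i;
        for (int j = 0; j < N; j++) A[i][j] = 1.0;
    }

    /* One task per block of BS consecutive elements of y.  In a parallel
       execution, each iteration of the outer loop is an independent task. */
    for (int ii = 0; ii < N; ii += BS)
        for (int i = ii; i < ii + BS; i++) {
            y[i] = 0.0;
            for (int j = 0; j < N; j++)
                y[i] += A[i][j] * b[j];
        }

    for (int i = 0; i < N; i++) printf("y[%d] = %g\n", i, y[i]);
    return 0;
}
```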
Advanced granularity
So…
It would appear that the parallel time can be made arbitrarily small by making the decomposition finer in granularity, but…
There is a tradeoff between the granularity of a decomposition and the associated overheads (sources of overhead: creation of tasks, task synchronization, exchange of data between tasks, …)
The granularity may determine performance bounds
Note: in this example the finest granularity is one task per operation y[i] += A[i,j]*b[j]; it gives the maximum parallelism, with n² tasks (n being the number of rows and columns of matrix A). With that many tasks, remember the tradeoff between the granularity of the decomposition and the associated overheads: because of this overhead, the finest grain is not always the best choice.
Example: stencil computation using Jacobi solver
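The slide only shows a figure; a minimal 2D Jacobi sketch in C, under the usual assumptions (5-point stencil, two grids u and unew, a fixed number of sweeps, hypothetical boundary condition), might look like this:

```c
#include <stdio.h>

#define N      8     /* grid size per dimension, including boundary rows/cols */
#define SWEEPS 100   /* fixed number of Jacobi iterations, hypothetical */

int main(void) {
    static double u[N][N] = {0}, unew[N][N] = {0};

    /* Hypothetical boundary condition: top row held at 1.0 */
    for (int j = 0; j < N; j++) u[0][j] = unew[0][j] = 1.0;

    for (int s = 0; s < SWEEPS; s++) {
        /* 5-point stencil: every interior point is updated from the OLD
           grid only, so all (N-2)*(N-2) updates of one sweep are
           independent tasks; sweeps must run one after the other. */
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                unew[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] +
                                     u[i][j-1] + u[i][j+1]);
        /* Copy unew back into u for the next sweep */
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                u[i][j] = unew[i][j];
    }

    printf("u[N/2][N/2] = %f\n", u[N/2][N/2]);
    return 0;
}
```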
Speedup and Efficiency
Execution Time in P processors
Tp = execution time on P processors
Task scheduling: how are tasks assigned to processors? For example, consider 2 processors:
Here is the split of tasks depending on the dependencies:
for P=2
Speedup and Efficiency
Scalability and Efficiency
Scalability: how the speed-up Sp = T1 / Tp evolves when the number of processors P is increased
Efficiency: Ep = Sp / P; the greater the efficiency, the better the parallelism.
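A small worked example with hypothetical measurements (the definitions Sp = T1/Tp and Ep = Sp/P are the standard ones implied by the slide):

```latex
% Hypothetical measurements on P = 4 processors
T_1 = 100, \qquad T_4 = 40
\\
S_4 = \frac{T_1}{T_4} = \frac{100}{40} = 2.5,
\qquad
E_4 = \frac{S_4}{P} = \frac{2.5}{4} = 0.625 \ (62.5\%)
```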
Fundamental concepts
Amdahl’s Law
How much faster will the program run? Speed-up(n) = T(1) / T(n)
where
T(1) : Time to run the program on one processor
T(n) : Time to run the program on n processors
Fundamental concepts
Amdahl’s Law
Performance improvement is limited by the fraction of time the program does not run in fully
parallel mode
Fundamental concepts
Amdahl’s Law
An oversimplified model, but it tells us that there are limits to the scalability of parallelism
Fundamental concepts
Amdahl’s Law
Assume the following simplified case, where the parallel fraction φ is the fraction of the total
execution time that can be parallelized:
Fundamental concepts
Amdahl’s Law
From this we can compute the speed-up Sp that can be achieved as:
Sp = T(1) / T(P) = 1 / ((1 − φ) + φ / P)
where T(1) is the fully sequential time and (1 − φ) is the sequential fraction
Two particular cases: φ = 0 (we cannot parallelize any part of the program), which gives Sp = 1, and φ = 1 (fully parallel program), which gives Sp = P
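A worked instance of the formula with a hypothetical parallel fraction, just to show the limit it imposes:

```latex
% Hypothetical: 90% of the execution time can be parallelised
\varphi = 0.9,\ P = 4:\qquad
S_4 = \frac{1}{(1-\varphi) + \varphi/P}
    = \frac{1}{0.1 + 0.225} \approx 3.08
\\
P \to \infty:\qquad
S_\infty = \frac{1}{1-\varphi} = \frac{1}{0.1} = 10
```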
Fundamental concepts
Overhead sources
Parallel computing is not for free; we should account for overheads (i.e. any cost that gets added to a
sequential computation so as to enable it to run in parallel):
Overhead of task creation
Overhead to ensure that shared objects and variables are accessed with mutual exclusion, avoiding data corruption (and adding extra memory latency)
Task synchronization overhead to ensure that dependencies between tasks are satisfied
Final synchronization cost paid to wait for all the tasks to finish their execution in parallel
Fundamental concepts
Common overheads
Data sharing: can be explicit via messages, or implicit via a memory hierarchy (caches)
Idleness: thread cannot find any useful work to execute (e.g. dependences, load imbalance, poor
communication and computation overlap or hiding of memory latencies, …)
Computation: extra work added to obtain a parallel algorithm (e.g. replication)
Memory: extra memory used to obtain a parallel algorithm (e.g. impact on memory hierarchy, …)
Problem 1
Assume we want to execute two different applications on our parallel machine with 4 processors:
application App1 is sequential; App2 is parallelised defining 4 tasks, each task executing one fourth of
the total application. The sequential times for the applications are 8 and 40 time units, respectively.
Assuming that App1 can start its execution at time 4 and App2 starts at time 0, draw a time line
showing how they will be executed if the operating system:
a) Does not allow multiprogramming, i.e. only one application can be executed at the same time in the system.
b) Allows multiprogramming so that the system tries to have both applications running concurrently, each
application exactly making use of the number of processors that is able to use.
c) The same as in the second case, but now App2 is parallelised defining 3 tasks, each task executing one third of
the total application.
Solution
a)
b)
Solution
c)
Problem 2
We give the following TDG:
a) Calculate T1
b) Critical Path
c) Calculate T∞
d) Parallelism and minimum number of processors (Pmin)
Solution
Problem 3
We give the following code written in C:
Problem 3
Assuming that:
1. In the initialization loop, the execution of each iteration of the internal loop lasts 10 cycles;
2. In the computation loop, the execution of each iteration of the internal loop lasts 100 cycles; and
3. The execution of the foo function does not cause any kind of dependence (i.e. no overhead). We ask:
a) Draw the Task Dependence Graph (TDG), indicating for each node its cost in terms of execution time,
b) Calculate the values for T1 and T∞ as well as the potential Parallelism.
c) Calculate which is the best value for the « speed-up » on 4 processors (S4), indicating which would be the proper
task mapping (assignment) to processors to achieve it.
Solution
a)
b)
Solution
c)
Problem 4
Given the following code:
Problem 4
Assuming that:
1. The execution of the modify_d routine takes 10 time units and the execution of the modify_nd routine
takes 5 time units;
2. Each internal iteration of the computation loop (i.e. each internal iteration (for k) of the for_compute
task) takes 5 time units; and
3. The execution of the output task takes 100 time units, we ask:
a) Draw the Task Dependence Graph (TDG), indicating for each node its cost in terms of execution time (in time
units),
b) Compute the values for T1 and T∞ and the potential Parallelism, as well as the parallel fraction (phi).
c) Indicate which would be the most appropriate task assignment on 2 processors in order to obtain the best
possible « speed-up ». For that assignment, calculate T2 and S2.
Solution
a)
Solution
b)
Solution
Problem 5
The following figure shows an incomplete time diagram for the execution of a parallel application on 4
processors:
Problem 5
The figure has a set of rectangles, each rectangle represents the execution of a task with its associated
cost in time units. In the timeline there are two regions (1 and 2) with 4 parallel tasks each. The
execution cost for tasks in region1 is unknown (x time units each); the cost for each task in region2 is 4
time units. The computation starts with a sequential task (with cost 5), then all tasks in region1, running
in parallel, followed by another sequential task (with cost 2), then all tasks in region2 running in parallel
followed by a final sequential task (with cost 1).
Knowing that an ideal speed-up of 9 could be achieved if the application could make use of infinite
processors (S∞ = 9) and assuming that the two parallel regions can be decomposed ideally, with as
many tasks as processors, each with the appropriate fraction of the original cost, we ask:
a. What is the parallel fraction (φ) for the application represented in the time diagram above ?
b. Which is the « speed-up » that is achieved in the execution with 4 processors (S4)?
c. Which is the value x in region1 ?
Solution
a)
b)
c)
Fundamental concepts
Flynn’s Classical Taxonomy
There are different ways to classify parallel computers
One of the more widely used classifications, in use since 1966, is called Flynn’s Taxonomy
Flynn’s taxonomy distinguishes multi-processor computer architectures according to how they
can be classified along the two independent dimensions of Instruction Stream and Data Stream.
Each can have only one of two possible states: Single or Multiple
Four possible classifications according to Flynn: SISD (Single Instruction, Single Data), SIMD (Single Instruction, Multiple Data), MISD (Multiple Instruction, Single Data) and MIMD (Multiple Instruction, Multiple Data)
Single Instruction Multiple Data (SIMD)
A type of parallel computer
Single Instruction: All processing units execute the same
instruction at any given clock cycle
Multiple Data: Each processing unit can operate on a different
data element
Best suited for specialized problems characterized by a high
degree of regularity, such as graphics/image processing
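A minimal illustration in C of the SIMD idea, using x86 SSE intrinsics (one possible SIMD instruction set, mentioned later in these slides; requires an SSE-capable CPU and compiler):

```c
#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    /* Single instruction, multiple data: one SIMD add processes
       four float elements at once. */
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);
    _mm_storeu_ps(c, vc);

    for (int i = 0; i < 4; i++) printf("c[%d] = %g\n", i, c[i]);
    return 0;
}
```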
Multiple Instruction, Multiple Data
(MIMD)
A type of parallel computer
Multiple Instruction: Every processor may be executing a different instruction stream
Multiple Data: Every processor may be working with a different data stream
The growth in complexity has always roughly translated to an equivalent growth in compute
performance.
Increasing chip transistor counts and clock speeds ⇒ More advanced techniques to improve
performance.
Moore’s law
Moore’s Law & Performance Improvement
Pipelined functional units
  Subdivide complex operations into simple components executed on different functional units on the CPU
  Instruction Level Parallelism (ILP)
Superscalar architecture
  More than one instruction / cycle
  Multiple / identical functional units which can operate concurrently
Data parallelism through SIMD instructions
  Identical op. on an array of int or FP operands
  Intel (SSE), AMD (3dNow!), Power/PowerPC (AltiVec)
Out-of-order execution
  Execute instructions in an unordered fashion
  Hundreds of instructions in flight
Larger caches
  Hold copies of data to be used soon
  Solution to DRAM gap
Simplified instruction set
  General move from CISC to RISC paradigm
  X86-based processors execute CISC machine code
  On-the-fly translation into RISC µ-ops
Memory Architecture
Shared memory: General characteristics
All processors can access all memory as global address space
Multiple processors can operate independently but share the same memory resources
Changes in a memory location performed by one processor are visible to all other processors
Memory Architecture
Shared memory: Uniform Memory Access (UMA)
Most commonly represented today by Symmetric Multiprocessor (SMP) machines
Identical processors
Equal access and access times to memory
Memory Architecture
Shared memory: Non-Uniform Memory Access (NUMA)
Often made by physically linking two or more SMPs
One SMP can directly access memory of another SMP
Not all processors have equal access time to all memories
Memory access across link is slower
Memory Architecture
Shared memory
Advantages
Global address space provides user-friendly programming with regard to memory
Data sharing between tasks is both fast and uniform due to the proximity of memory to CPUs
Disadvantages
The lack of scalability between memory and CPUs: adding more CPUs can geometrically increase traffic
on the shared memory-CPU path
Programmer responsibility for synchronization to ensure ”correct” access to global memory
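A minimal shared-memory sketch in C using OpenMP (one common shared-memory programming model, used here only as an assumption; compile with -fopenmp). It shows the programmer's responsibility mentioned above: without the atomic directive, the concurrent updates of the shared counter would be a data race.

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    const int n = 1000000;
    long count = 0;              /* shared variable in the global address space */

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        if (i % 2 == 0) {
            #pragma omp atomic   /* synchronization: protects the shared update */
            count++;
        }
    }

    printf("count = %ld (max threads: %d)\n", count, omp_get_max_threads());
    return 0;
}
```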
Memory Architecture
Distributed memory: General characteristics
Distributed memory systems require a communication network to connect inter-processor
memory
Processors have their own local memory. Memory addresses in one processor do not map to
another processor, so there is no concept of global address space across all processors
The programmer has to explicitly define how and when data should be communicated as well
as when synchronization between tasks should be performed
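A minimal distributed-memory sketch in C using MPI message passing (MPI is one common way to program such systems, assumed here for illustration; compile with mpicc, run with mpirun -np 2). The explicit MPI_Send/MPI_Recv pair is the programmer-defined communication mentioned above.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;              /* lives only in rank 0's local memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```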
Memory Architecture
Distributed memory: General characteristics
The network ”fabric” used for data transfer varies widely.
Memory Architecture
Distributed memory
Advantages
Memory is scalable with the number of processors
Each processor can rapidly access its own memory without interference and without the overhead
incurred with trying to maintain global cache coherency
Disadvantages
The programmer is responsible for many of the details associated with data communication between
processors
It may be difficult to map existing data structures, based on global memory, to this memory organization
Memory Architecture
Hybrid Distributed-Shared memory: General characteristics
The largest and fastest computers in the world today employ both shared and distributed
memory architectures
The shared memory component can be a shared memory machine and/or graphics processing
units (GPU)
Processors on a compute node share same memory space
Requires communication to exchange data between compute nodes
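A minimal hybrid sketch in C, assuming the common MPI + OpenMP combination (MPI ranks across compute nodes, OpenMP threads sharing memory inside each node; compile with mpicc -fopenmp):

```c
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    int rank, provided;
    /* Request an MPI threading level that allows OpenMP threads per process */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Threads of one MPI process share that node's memory ... */
    #pragma omp parallel
    {
        printf("rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }
    /* ... while data between ranks would be exchanged with MPI messages. */

    MPI_Finalize();
    return 0;
}
```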