INTRODUCTION TO PARALLEL COMPUTING
PARALLEL COMPUTING: RESOURCES
• The compute resources can include:
• A single computer with multiple processors;
• A single computer with one or more processors plus specialized compute resources (GPU, …);
• An arbitrary number of computers connected by a network.
Applications requiring high availability rely on parallel and distributed platforms for redundancy.
Many of today's applications, such as weather prediction, aerodynamics, and artificial intelligence, are very computationally intensive and require vast amounts of processing power.
So, it appears that the only way forward is to use PARALLELISM. The idea is that if several operations can be performed simultaneously, the total computation time is reduced.
Parallel programming: the human process of developing programs that express what computations should be executed in parallel.
ADVANTAGES
Time reduction:
With the help of parallel processing, a number of computations can be performed at once, bringing down the time required to complete a project.
Complexity:
Parallel processing is particularly useful in projects that require complex computations, such as weather modeling and digital special effects.
Greater reliability:
A parallel computer can continue to work even if a processor fails (fault tolerance).
CONCEPTS AND TERMINOLOGY
BASIC DESIGN
• Memory is used to store both program instructions and data
• Program instructions are coded data which tell the computer to do something
• Data is simply information to be used by the program
• A central processing unit (CPU) gets instructions and/or data from memory, decodes the instructions and then sequentially performs them.
FLYNN MATRIX
Flynn's taxonomy classifies computer architectures along two independent dimensions, instruction streams and data streams, giving four classes: SISD, SIMD, MISD and MIMD.
• CPU
Contemporary CPUs consist of one or more cores - a distinct execution unit with its own instruction stream. Cores within a CPU may be organized into one or more sockets - each socket with its own distinct memory. When a CPU consists of two or more sockets, the hardware infrastructure usually supports memory sharing across sockets.
• Node
A standalone "computer in a box," usually composed of multiple CPUs/processors/cores, memory, network interfaces, etc. Nodes are networked together to form a supercomputer.
• Task
A logically discrete section of computational work. A task is typically a program or program-like set of instructions that is executed by a processor.
• Parallel Task
A task that can be executed by multiple processors safely (yields correct results).
• Serial Execution
Execution of a program sequentially, one statement at a time. In the simplest sense, this is what happens on a one-processor machine. However, virtually all parallel tasks will have sections of a parallel program that must be executed serially.
• Parallel Execution
• Execution of a program by more than one task, with each task being able to
execute the same or different statement at the same moment in time.
• Shared Memory
• From a strictly hardware point of view, describes a computer architecture
where all processors have direct (usually bus based) access to common
physical memory. In a programming sense, it describes a model where parallel
tasks all have the same "picture" of memory and can directly address and
access the same logical memory locations regardless of where the physical
memory actually exists.
• Distributed Memory
• In hardware, refers to network based memory access for physical memory
that is not common. As a programming model, tasks can only logically "see"
local machine memory and must use communications to access memory on
other machines where other tasks are executing.
• Communications
• Parallel tasks typically need to exchange data. There are several ways this can be accomplished, such as through a shared memory bus or over a network; however, the actual event of data exchange is commonly referred to as communications, regardless of the method employed.
• Synchronization
• The coordination of parallel tasks in real time, very often associated with
communications. Often implemented by establishing a synchronization point
within an application where a task may not proceed further until another
task(s) reaches the same or logically equivalent point.
• Synchronization usually involves waiting by at least one task, and can
therefore cause a parallel application's wall clock execution time to increase.
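A minimal sketch of a synchronization point in C with OpenMP (the filename, messages, and build command are illustrative; it assumes a compiler with OpenMP support, e.g. gcc with -fopenmp):

```c
/* sync_point.c - each thread does some "work", then waits at a barrier
 * before any thread proceeds past the synchronization point.
 * Build (typical): gcc -fopenmp sync_point.c -o sync_point
 */
#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        printf("thread %d: finished its local work\n", id);

        /* Synchronization point: no thread continues until all arrive,
         * so at least one thread usually ends up waiting here. */
        #pragma omp barrier

        printf("thread %d: all threads reached the barrier\n", id);
    }
    return 0;
}
```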
• Granularity
• In parallel computing, granularity (or grain size) of a task is a measure of the amount of work (or computation) performed by that task.
• Granularity is a quantitative or qualitative measure of the ratio of computation time to communication time (see the expression below).
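The ratio just described can be written compactly as follows (the symbols T_comp and T_comm are my own notation, not from the slides):

```latex
G = \frac{T_{\text{comp}}}{T_{\text{comm}}},
\qquad \text{coarse-grained: } G \gg 1, \quad \text{fine-grained: } G \ll 1
```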
• Observed Speedup
• Observed speedup of a code which has been parallelized, defined as:
  speedup = (wall-clock time of serial execution) / (wall-clock time of parallel execution)
• One of the simplest and most widely used indicators for a parallel program's performance.
• Parallel Overhead
• The amount of time required to coordinate parallel tasks, as opposed to doing
useful work. Parallel overhead can include factors such as:
• Task start-up time
• Synchronizations
• Data communications
• Software overhead imposed by parallel compilers, libraries, tools, operating system, etc.
• Task termination time
• Scalability
• Refers to a parallel system's (hardware and/or software) ability to demonstrate a
proportionate increase in parallel speedup with the addition of more processors. Factors that
contribute to scalability include:
• Hardware - particularly memory-CPU bandwidths and network communications
• Application algorithm
• Parallel overhead
• Characteristics of your specific application and coding
PARALLEL COMPUTER MEMORY ARCHITECTURES
MEMORY ARCHITECTURES
• Shared Memory
• Distributed Memory
• Hybrid Distributed-Shared Memory
SHARED MEMORY
• Shared memory parallel computers vary widely, but generally have in common the
ability for all processors to access all memory as global address space.
• Multiple processors can operate independently but share the same memory
resources.
• Changes in a memory location effected by one processor are visible to all other
processors.
• Shared memory machines can be divided into two main classes based upon memory
access times: UMA and NUMA.
• Advantages
• Global address space provides a user-friendly programming perspective to memory
• Data sharing between tasks is both fast and uniform due to the proximity of memory to CPUs
• Disadvantages:
• Primary disadvantage is the lack of scalability between memory and CPUs. Adding more CPUs can geometrically increase traffic on the shared memory-CPU path and, for cache coherent systems, geometrically increase traffic associated with cache/memory management.
• Programmer responsibility for synchronization constructs that ensure "correct" access of global memory (see the sketch after this list).
• Expense: it becomes increasingly difficult and expensive to design and produce shared memory machines with ever increasing numbers of processors.
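A minimal sketch of the shared-memory model and the programmer's synchronization responsibility, using POSIX threads in C (the counter, thread count, and iteration count are illustrative choices, not from the slides):

```c
/* shared_counter.c - threads share one address space; a mutex ensures
 * "correct" access to the shared variable.
 * Build (typical): gcc shared_counter.c -o shared_counter -lpthread
 */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITER    100000

static long counter = 0;                                 /* shared memory */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER; /* sync construct */

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < NITER; i++) {
        pthread_mutex_lock(&lock);    /* programmer-supplied synchronization */
        counter++;                    /* safe update of shared data */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (int t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, NULL);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    printf("counter = %ld (expected %d)\n", counter, NTHREADS * NITER);
    return 0;
}
```

Without the mutex, all threads would still see the same memory location, but concurrent updates could be lost, which is exactly the "correct access" issue the disadvantage above refers to.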
DISTRIBUTED MEMORY
• Like shared memory systems, distributed memory systems vary widely but share a common characteristic: distributed memory systems require a communication network to connect inter-processor memory.
• Processors have their own local memory. Memory addresses in one processor do not
map to another processor, so there is no concept of global address space across all
processors.
• Because each processor has its own local memory, it operates independently. Changes it
makes to its local memory have no effect on the memory of other processors. Hence,
the concept of cache coherency does not apply.
• When a processor needs access to data in another processor, it is usually the task of the
programmer to explicitly define how and when data is communicated. Synchronization
between tasks is likewise the programmer's responsibility.
• Advantages
• Memory is scalable with number of processors. Increase the number of processors
and the size of memory increases proportionately.
• Each processor can rapidly access its own memory without interference and
without the overhead incurred with trying to maintain cache coherency.
• Cost effectiveness: can use commodity, off-the-shelf processors and networking.
• Disadvantages
• The programmer is responsible for many of the details associated with data communication between processors (see the sketch after this list).
• It may be difficult to map existing data structures, based on global memory, to this
memory organization.
• Non-uniform memory access (NUMA) times
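As an illustrative sketch of explicit, programmer-defined communication on a distributed-memory system, here is a message-passing example in C with MPI (the message value, ranks, and build/run commands are hypothetical choices):

```c
/* ping.c - rank 0 sends a value to rank 1; data moves only via messages,
 * since each process sees only its own local memory.
 * Build/run (typical): mpicc ping.c -o ping && mpirun -np 2 ./ping
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                                   /* lives in rank 0's local memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```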
PARALLEL PROGRAMMING MODELS
OVERVIEW
THREADS MODEL
• OpenMP
• Jointly defined and endorsed by a group of major computer hardware and software vendors.
• Portable / multi-platform, including Unix and Windows NT platforms
• Available in C/C++ implementations
• Can be very easy and simple to use - provides for "incremental parallelism" (see the sketch below)
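A minimal sketch of OpenMP's "incremental parallelism": a single directive parallelizes an existing serial loop (the array size, contents, and build command are illustrative):

```c
/* omp_sum.c - a serial reduction loop parallelized with one pragma.
 * Build (typical): gcc -fopenmp omp_sum.c -o omp_sum
 */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++)            /* serial initialization */
        a[i] = 0.5 * i;

    /* Adding this one directive turns the serial loop into a parallel one:
     * iterations are divided among threads and partial sums are combined. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f (threads available: %d)\n", sum, omp_get_max_threads());
    return 0;
}
```

Removing the pragma leaves a correct serial program, which is what "incremental parallelism" means in practice: parallelism can be added one loop at a time.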
OTHER MODELS
HYBRID
• A multi-core processor is a single computing component with two or more independent actual CPUs (called "cores").
• These independent CPUs are capable of executing instructions at the same time, hence increasing the overall speed at which programs can be executed.
• Manufacturers typically integrate the cores onto a single integrated circuit die.
• Processors were originally developed with only one core.
• A dual-core processor has two cores (e.g. Intel Core Duo)
• A quad-core processor contains four cores (e.g. Intel's quad-core i3, i5, and i7 processors)
• A hexa-core processor contains six cores (e.g. Intel Core i7 Extreme Edition 980X)
• An octo-core (or octa-core) processor contains eight cores (e.g. Intel Xeon E7-2820)
• A deca-core processor contains ten cores (e.g. Intel Xeon E7-2850)
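If it helps to see the core count from software, here is a small POSIX-specific sketch (the call reports logical processors, which may exceed the number of physical cores when hyper-threading is enabled; filename and build command are illustrative):

```c
/* ncores.c - query the number of online logical processors (POSIX).
 * Build (typical): gcc ncores.c -o ncores
 */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    if (n < 1) {
        perror("sysconf");
        return 1;
    }
    printf("online logical processors: %ld\n", n);
    return 0;
}
```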
PERFORMANCE
Two key goals to be achieved with the design of parallel applications are:
• Performance – the capacity to reduce the time needed to solve a problem as
the computing resources increase
• Scalability – the capacity to increase performance as the size of the problem
increases
The main factors limiting the performance and the scalability of an application can
be divided into:
• Architectural limitations
• Algorithmic limitations
SPEEDUP
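Consistent with the earlier definition of observed speedup, speedup on p processing units can be written as follows (the notation T(1), T(p) is mine, and the numeric example uses hypothetical timings):

```latex
S(p) = \frac{T(1)}{T(p)},
\qquad \text{e.g. } T(1) = 120\ \text{s},\; T(p) = 20\ \text{s} \;\Rightarrow\; S(p) = 6
```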
EFFICIENCY
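The slide body is not reproduced here; a standard definition, assuming p processing units, is the speedup per unit, so ideal (linear) speedup gives an efficiency of 1:

```latex
E(p) = \frac{S(p)}{p} = \frac{T(1)}{p\,T(p)}
```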
REDUNDANCY
Redundancy measures the increase in the required computation when using more
processing units. It measures the ratio between the number of operations performed
by the parallel execution and by the sequential execution.
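Writing that ratio down, with O(p) the total number of operations performed by the parallel execution on p units and O(1) by the sequential execution (notation mine):

```latex
R(p) = \frac{O(p)}{O(1)}, \qquad R(p) \ge 1
```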
UTILIZATION
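The slide body is not reproduced here; in the usual textbook formulation, utilization measures the fraction of the available capacity that performs useful operations and is expressed as the product of redundancy and efficiency:

```latex
U(p) = R(p)\,E(p)
```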
PIPELINING
PIPELINING IS NATURAL!
• Laundry example
• Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
• Washer takes 30 minutes, dryer takes 40 minutes, folding takes 20 minutes
SEQUENTIAL LAUNDRY
[Figure: task-order timeline from 6 PM to midnight; each load runs wash (30 min), dry (40 min), fold (20 min) back to back, one load after another]
• Sequential laundry takes 6 hours for 4 loads
PIPELINED LAUNDRY
[Figure: task-order timeline from 6 PM onward; the loads overlap, with the washer, dryer, and folder each busy on a different load at the same time]
• Pipelined laundry takes 3.5 hours for 4 loads
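The two totals on these slides follow directly from the stage times, for n = 4 loads with stage times of 30, 40 and 20 minutes (the pipelined formula assumes the pipeline is limited by its slowest stage, the 40-minute dryer):

```latex
T_{\text{sequential}} = n\,(30 + 40 + 20) = 4 \times 90 = 360\ \text{min} = 6\ \text{h}
\qquad
T_{\text{pipelined}} = (30 + 40 + 20) + (n-1)\times 40 = 90 + 120 = 210\ \text{min} = 3.5\ \text{h}
```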
PIPELINING LESSONS
• Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously using different resources
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to "fill" the pipeline and time to "drain" it reduces speedup
PIPELINE PERFORMANCE
• Pipelining and parallel processing are both techniques used in computer architecture to improve the performance
of processors.
• Pipelining involves breaking down the execution of instructions into a series of stages. Each stage performs a
different part of the instruction, and multiple instructions can be processed simultaneously, each at a different
stage of execution. This allows for more efficient use of the processor's resources and can lead to faster overall
execution of instructions.
• Parallel processing, on the other hand, involves the simultaneous execution of multiple instructions or tasks. This
can be done using multiple processors or processor cores working together to handle different parts of a task at
the same time. Parallel processing can significantly speed up the execution of tasks that can be divided into
independent sub-tasks.
• In summary, pipelining improves the efficiency of processing individual instructions, while parallel processing
improves the overall throughput by executing multiple instructions or tasks simultaneously. Both techniques are
used to enhance the performance of modern computer systems.