Lecture 2: Introduction to Computer Architecture and Parallel Computing
Computer Architecture Course 2024 (9/8/2024)
http://cacs.usc.edu/education/cs653-lecture.html
Serial Computing
Parallel Computing
The computational problem should be able to:
Be broken apart into discrete pieces of work that can be
solved simultaneously;
Execute multiple program instructions at any moment in
time;
Be solved in less time with multiple compute resources
than with a single compute resource.
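To make these criteria concrete, here is a minimal C sketch (an illustration, not from the slides; the array size and chunk count are arbitrary assumptions). A sum over an array is broken apart into discrete, independent chunks whose partial sums could be computed simultaneously and then combined.

```c
#include <stdio.h>

#define N      16
#define CHUNKS 4            /* discrete pieces of work (assumed count) */

/* One independent piece of work: sum a contiguous chunk of the array. */
static double sum_chunk(const double *a, int lo, int hi) {
    double s = 0.0;
    for (int i = lo; i < hi; i++)
        s += a[i];
    return s;
}

int main(void) {
    double a[N], partial[CHUNKS], total = 0.0;
    for (int i = 0; i < N; i++)
        a[i] = i;

    /* Each chunk touches disjoint data, so the CHUNKS calls below are
       independent and could execute simultaneously on separate resources. */
    for (int c = 0; c < CHUNKS; c++)
        partial[c] = sum_chunk(a, c * (N / CHUNKS), (c + 1) * (N / CHUNKS));

    for (int c = 0; c < CHUNKS; c++)
        total += partial[c];

    printf("total = %f\n", total);
    return 0;
}
```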
Parallel Computers:
CPU
GPU
TPU
DPU
QPU vs GPU
Main Reasons
PROVIDE CONCURRENCY:
A single compute resource can only do one thing at a time.
Multiple compute resources can do many things
simultaneously.
Example: Collaborative Networks provide a global venue
where people from around the world can meet and conduct
work "virtually".
MAKE BETTER USE OF UNDERLYING PARALLEL HARDWARE:
Modern computers, even laptops, are parallel in architecture
with multiple processors/cores.
Parallel software is specifically intended for parallel
hardware with multiple cores, threads, etc.
In most cases, serial programs running on modern computers "waste" potential computing power.
The Future:
During the past 20+ years, the trends indicated by ever faster
networks, distributed systems, and multi-processor computer
architectures (even at the desktop level) clearly show that
parallelism is the future of computing.
In this same time period, there has been a greater than
500,000x increase in supercomputer performance, with no
end currently in sight.
The race is already on for Exascale Computing!
Exaflop = 10^18 calculations per second
CUDA (Compute Unified Device Architecture)
Supercomputing for the Masses
What is CUDA?
Why CUDA?
Pipelining is the process of feeding instructions from the processor through a pipeline so that the stages of successive instructions overlap. It allows instructions to be stored and executed in an orderly, overlapping sequence and is also known as pipeline processing.
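As an illustrative sketch (not from the slides), the following C program simulates a hypothetical three-stage fetch/decode/execute pipeline and prints which instruction occupies each stage in every cycle; the stage names and instruction count are assumptions chosen for the example.

```c
#include <stdio.h>

#define NUM_INSTR  5   /* instructions pushed through the pipeline (assumed) */
#define NUM_STAGES 3   /* fetch, decode, execute */

int main(void) {
    const char *stage_name[NUM_STAGES] = {"FETCH", "DECODE", "EXECUTE"};

    /* With an ideal pipeline, instruction i occupies stage s during cycle i + s,
       so the total is NUM_INSTR + NUM_STAGES - 1 cycles instead of
       NUM_INSTR * NUM_STAGES cycles. */
    for (int cycle = 0; cycle < NUM_INSTR + NUM_STAGES - 1; cycle++) {
        printf("cycle %d:", cycle + 1);
        for (int s = 0; s < NUM_STAGES; s++) {
            int instr = cycle - s;               /* instruction in stage s now */
            if (instr >= 0 && instr < NUM_INSTR)
                printf("  %s=I%d", stage_name[s], instr + 1);
        }
        printf("\n");
    }
    return 0;
}
```

With 5 instructions and 3 stages, the overlapped execution finishes in 7 cycles rather than 15, which is the point of pipelining.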
https://computing.llnl.gov/tutorials/parallel_comp/#Concepts
Complexity:
In general, parallel applications are much more complex than
corresponding serial applications, perhaps an order of magnitude. Not
only do you have multiple instruction streams executing at the same
time, but you also have data flowing between them.
The costs of complexity are measured in programmer time in virtually
every aspect of the software development cycle:
Design
Coding
Debugging
Tuning
Maintenance
API = Application Programming Interface
Portability:
Thanks to standardization in several APIs, such as MPI, POSIX threads, and
OpenMP, portability issues with parallel programs are not as serious as in years
past.
However...
All of the usual portability issues associated with serial programs apply to
parallel programs. For example, if you use vendor "enhancements" to Fortran,
C or C++, portability will be a problem.
Even though standards exist for several APIs, implementations will differ in a
number of details, sometimes to the point of requiring code modifications in
order to effect portability.
Operating systems can play a key role in code portability issues.
Hardware architectures are characteristically highly variable and can affect
portability.
POSIX stands for Portable Operating System Interface
Resource Requirements:
The primary intent of parallel programming is to decrease execution wall clock time; however, in order to accomplish this, more CPU time
is required. For example, a parallel code that runs in 1 hour on 8
processors actually uses 8 hours of CPU time.
The amount of memory required can be greater for parallel codes than
serial codes, due to the need to replicate data and for overheads
associated with parallel support libraries and subsystems.
For short running parallel programs, there can actually be a decrease in
performance compared to a similar serial implementation. The overhead
costs associated with setting up the parallel environment, task creation,
communications and task termination can comprise a significant portion
of the total execution time for short runs.
Scalability:
1) Strong scaling:
The total problem size stays fixed as more processors are added.
Goal is to solve the same problem faster.
Perfect scaling means the problem is solved in 1/P of the single-processor time.
2) Weak scaling:
The problem size per processor stays fixed as more processors are added, so the total problem size is proportional to the number of processors used.
Goal is to run a larger problem in the same amount of time.
Perfect scaling means a problem Px runs in the same time as the single-processor run.
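As a supplementary note (standard definitions, not stated on the slide), scaling efficiency is often quantified using the runtime T_1 on one processor and T_N on N processors:

```latex
% Strong scaling efficiency: total problem size held fixed
E_{\mathrm{strong}}(N) = \frac{T_1}{N \, T_N}

% Weak scaling efficiency: problem size per processor held fixed
E_{\mathrm{weak}}(N) = \frac{T_1}{T_N}
```

Perfect scaling corresponds to an efficiency of 1 in both cases.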
Scalability:
The ability of a parallel program's performance to scale is a result of a
number of interrelated factors. Simply adding more processors is rarely the
answer.
The algorithm may have inherent limits to scalability. At some point, adding
more resources causes performance to decrease. This is a common situation
with many parallel applications.
Hardware factors play a significant role in scalability. Examples:
Memory-CPU bus bandwidth on an SMP machine
Communications network bandwidth
Amount of memory available on any given machine or set of machines
Processor clock speed
Parallel support libraries and subsystems software can limit scalability
independent of your application.
Shared Memory
General Characteristics:
Shared memory parallel computers vary widely,
but generally have in common the ability for all
processors to access all memory as global
address space.
Multiple processors can operate independently
but share the same memory resources.
Changes in a memory location effected by one
processor are visible to all other processors.
Historically, shared memory machines have
been classified as UMA and NUMA, based upon
memory access times.
Uniform Memory Access (UMA)
Shared Memory
Advantages:
Global address space provides a user-friendly programming perspective to
memory
Data sharing between tasks is both fast and uniform due to the proximity of
memory to CPUs
Disadvantages:
Primary disadvantage is the lack of scalability between memory and CPUs.
Adding more CPUs can geometrically increase traffic on the shared memory-CPU path and, for cache coherent systems, geometrically increase traffic associated with cache/memory management.
Programmer responsibility for synchronization constructs that ensure
"correct" access of global memory.
Distributed Memory
Processors have their own local memory. Memory addresses in one processor do
not map to another processor, so there is no concept of global address space across
all processors.
Because each processor has its own local memory, it operates independently.
Changes it makes to its local memory have no effect on the memory of other
processors. Hence, the concept of cache coherency does not apply.
When a processor needs access to data in another processor, it is usually the task
of the programmer to explicitly define how and when data is communicated.
Synchronization between tasks is likewise the programmer's responsibility.
The network "fabric" used for data transfer varies widely, though it can be as
simple as Ethernet.
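As a hedged illustration of programmer-defined communication (not part of the slides), this minimal MPI sketch in C sends one value from rank 0 to rank 1 with explicit MPI_Send/MPI_Recv calls; the message tag and contents are arbitrary choices for the example.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = 0;
    if (rank == 0) {
        value = 42;   /* data that exists only in rank 0's local memory */
        MPI_Send(&value, 1, MPI_INT, 1, /*tag=*/0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* The programmer explicitly defines how and when the data arrives. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

Run with at least two processes, e.g. mpirun -np 2 ./a.out.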
Advantages:
Memory is scalable with the number of processors. Increase the number of
processors and the size of memory increases proportionately.
Each processor can rapidly access its own memory without interference and without
the overhead incurred with trying to maintain global cache coherency.
Cost effectiveness: can use commodity, off-the-shelf processors and networking.
Disadvantages:
The programmer is responsible for many of the details
associated with data communication between processors.
It may be difficult to map existing data structures, based on
global memory, to this memory organization.
Non-uniform memory access times - data residing on a
remote node takes longer to access than node local data.
Hybrid Distributed-Shared Memory
General Characteristics:
The largest and fastest computers in the world today employ both shared and distributed
memory architectures.
Threads Model
This programming model is a type of shared memory programming.
In the threads model of parallel programming, a single "heavy weight" process can
have multiple "light weight", concurrent execution paths.
For example:
a.out loads and acquires all of the necessary system and user resources to run.
This is the "heavy weight" process.
a.out performs some serial work, and then creates a number of tasks (threads)
that can be scheduled and run by the operating system concurrently.
Each thread has local data, but also shares the entire resources of a.out. This
saves the overhead associated with replicating a program's resources for each
thread ("light weight"). Each thread also benefits from a global memory view
because it shares the memory space of a.out.
A thread's work may best be described as a subroutine within the main program.
Any thread can execute any subroutine at the same time as other threads.
Threads communicate with each other through global memory (updating
address locations). This requires synchronization constructs to ensure that more
than one thread is not updating the same global address at any time.
Threads can come and go, but a.out remains present to provide the necessary
shared resources until the application has completed.
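The following minimal POSIX Threads sketch in C (illustrative, not from the slides) mirrors the model described above: the a.out process creates several light-weight threads that each run a subroutine, share the process's global data under a mutex, and are joined before the program exits. Compile with -pthread.

```c
#include <stdio.h>
#include <pthread.h>

#define NUM_THREADS 4

long shared_total = 0;                    /* global data shared by all threads   */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* The subroutine each light-weight thread executes concurrently. */
void *worker(void *arg) {
    long id = (long)arg;                  /* thread-local data                   */
    pthread_mutex_lock(&lock);            /* synchronize updates to shared state */
    shared_total += id;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];

    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);

    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);   /* a.out remains until threads finish  */

    printf("shared_total = %ld\n", shared_total);
    return 0;
}
```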
Implementations:
From a programming perspective, threads implementations commonly
comprise: A library of subroutines that are called from within parallel source
code
A set of compiler directives embedded in either serial or parallel source code
In both cases, the programmer is responsible for determining the parallelism
(although compilers can sometimes help).
Threaded implementations are not new in computing. Historically, hardware
vendors have implemented their own proprietary versions of threads. These
implementations differed substantially from each other making it difficult for
programmers to develop portable threaded applications.
Unrelated standardization efforts have resulted in two very different
implementations of threads: POSIX Threads and OpenMP.
o POSIX Threads tutorial: computing.llnl.gov/tutorials/pthreads
o OpenMP tutorial: computing.llnl.gov/tutorials/openMP
On shared memory architectures, all tasks may have access to the data structure through global memory.
On distributed memory architectures, the global data structure can be split up logically and/or physically across tasks.
Hybrid Model
A hybrid model combines more than one of the previously described programming
models.
Currently, a common example of a hybrid model is the combination of the message
passing model (MPI) with the threads model (OpenMP):
Threads perform computationally intensive kernels using local, on-node data.
Communications between processes on different nodes occur over the network using MPI.
This hybrid model lends itself well to the most popular hardware environment of
clustered multi/many-core machines.
Another similar and increasingly popular example of a hybrid model is using MPI
with CPU-GPU (Graphics Processing Unit) programming.
MPI tasks run on CPUs using local memory and communicating with each other
over a network.
Computationally intensive kernels are off-loaded to GPUs on-node.
Data exchange between node-local memory and GPUs uses CUDA (or
something equivalent).
Other hybrid models are common:
MPI with Pthreads
MPI with non-GPU accelerators
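A minimal hybrid MPI + OpenMP sketch in C (illustrative only; the problem size and the work done are assumptions): each MPI process computes a node-local partial sum with OpenMP threads, and MPI combines the partial results across processes. Build with an MPI wrapper compiler and the OpenMP flag, e.g. mpicc -fopenmp.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each MPI process owns a slice of the iteration space (local, on-node data). */
    const long n = 1000000;
    long lo = rank * (n / nprocs);
    long hi = (rank == nprocs - 1) ? n : lo + n / nprocs;

    /* OpenMP threads perform the computationally intensive kernel on-node. */
    double local = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (long i = lo; i < hi; i++)
        local += (double)i;

    /* MPI handles the communication between processes/nodes. */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %.0f\n", total);

    MPI_Finalize();
    return 0;
}
```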
SPMD is actually a "high level" programming model that can be built upon any
combination of the previously mentioned parallel programming models.
SINGLE PROGRAM: All tasks execute their copy of the same program
simultaneously. This program can be threads, message passing, data parallel or
hybrid.
MULTIPLE DATA: All tasks may use different data
SPMD programs usually have the necessary logic programmed into them to
allow different tasks to branch or conditionally execute only those parts of the
program they are designed to execute. That is, tasks do not necessarily have to
execute the entire program - perhaps only a portion of it.
The SPMD model, using message passing or hybrid programming, is probably
the most commonly used parallel programming model for multi-node clusters.
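A short MPI sketch in C of the SPMD idea (illustrative, not from the slides): every task runs the same executable, and rank-based branching decides which portion of the program each task actually executes.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Single program, multiple data: all ranks run this same a.out, but
       conditional logic lets each task execute only its own portion. */
    if (rank == 0) {
        printf("rank 0: handling I/O and coordination\n");
    } else {
        printf("rank %d: doing its share of the computation\n", rank);
    }

    MPI_Finalize();
    return 0;
}
```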
Programmer Directed
Using "compiler directives" or possibly compiler flags, the
programmer explicitly tells the compiler how to parallelize
the code.
May be able to be used in conjunction with some degree of
automatic parallelization also.
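For instance (a sketch, not from the slides), with OpenMP the programmer explicitly tells the compiler to parallelize a specific loop by inserting a compiler directive and building with a flag such as -fopenmp; without the flag the directive is ignored and the loop runs serially.

```c
#include <stdio.h>

#define N 1000

int main(void) {
    double a[N];

    /* The directive below is the programmer explicitly telling the compiler
       to split this loop's iterations across threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * i;

    printf("a[%d] = %f\n", N - 1, a[N - 1]);
    return 0;
}
```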
Functional Decomposition:
In this approach, the focus is
on the computation that is to
be performed rather than on
the data manipulated by the
computation. The problem is
decomposed according to the
work that must be done. Each
task then performs a portion
of the overall work.
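A minimal OpenMP sections sketch in C (illustrative; the two routines are hypothetical stand-ins): the problem is decomposed according to the work to be done, and each concurrent section performs a different function rather than a different slice of the data.

```c
#include <stdio.h>

/* Hypothetical stand-ins for two different kinds of work. */
static double analyze_part_a(void) { return 1.0; }
static double analyze_part_b(void) { return 2.0; }

int main(void) {
    double ra = 0.0, rb = 0.0;

    /* Functional decomposition: each section is a distinct piece of work,
       and the sections may run concurrently on different threads. */
    #pragma omp parallel sections
    {
        #pragma omp section
        ra = analyze_part_a();

        #pragma omp section
        rb = analyze_part_b();
    }

    printf("results: %f %f\n", ra, rb);
    return 0;
}
```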
Functional Decomposition:
Communications
Synchronization
Data Dependencies
Load Balancing
Granularity