Chapter Three: Parallel Computing
-Introduction
-Parallel Architectures
-Shared memory systems and cache coherence
-Distributed-memory systems
-Hybrid memory systems
-Interconnection networks and routing protocols
In parallel computing, interconnection networks and routing protocols facilitate communication and data transfer between processors and memory, enabling efficient computation.
The code inside a #pragma omp parallel region is executed by multiple threads. omp_get_thread_num() returns the unique ID of the calling thread, and printf writes each thread's message to the console. In the example below, #pragma omp parallel for distributes the loop iterations among the threads.
#include <stdio.h>
#include <omp.h>

int main() {
    int i;
    /* Distribute the loop iterations among the available threads. */
    #pragma omp parallel for
    for (i = 0; i < 10; i++) {
        printf("Thread %d processes iteration %d\n", omp_get_thread_num(), i);
    }
    return 0;
}
The iterations of the loop are distributed among available threads.
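To build and run this example with GCC, one would typically compile with the -fopenmp flag (e.g., gcc -fopenmp example.c && ./a.out); the number of threads can be controlled with the OMP_NUM_THREADS environment variable.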
Registers
General-purpose registers: Registers directly accessible by the CPU cores for storing
temporary data and intermediate computation results.
Special-purpose registers: Registers dedicated to specific functions, such as program
counter, stack pointer, and status flags, essential for CPU operations and control flow.
Cache Memory
L1 Cache: Level 1 cache located closest to the CPU cores, offering fast access to
frequently accessed data and instructions.
L2 Cache: Level 2 cache situated between L1 cache and main memory, providing larger
storage capacity and slightly slower access speeds.
L3 Cache: Level 3 cache shared among multiple CPU cores, offering a larger cache size
and serving as a shared resource for improving data locality and reducing memory access
latency.
Components of Parallel Computing Architecture
Buses
System Bus: Connects the CPU, memory, and other internal
components within a computer system. It facilitates
communication and data transfer between these components.
Memory Bus: Dedicated bus for transferring data between the CPU
and main memory (RAM). It ensures fast and efficient access to
memory resources.
Switches
Crossbar Switches: High-performance switches that provide multiple paths for data transmission
between input and output ports. They enable simultaneous communication between multiple pairs of
devices, improving bandwidth and reducing latency.
Packet Switches: Switches that forward data in discrete packets based on destination addresses.
They efficiently manage network traffic by dynamically allocating bandwidth and prioritizing
packets based on quality of service (QoS) parameters.
Networks
Ethernet: A widely used networking technology for local area networks (LANs) and wide area
networks (WANs). It employs Ethernet cables and switches to transmit data packets between
devices within a network.
InfiniBand: A high-speed interconnect technology commonly used in high-performance computing
(HPC) environments. It offers low-latency, high-bandwidth communication between compute nodes
in clustered systems.
Fibre Channel: A storage area network (SAN) technology that enables high-speed data transfer
between servers and storage devices over fiber optic cables. It provides reliable and scalable
connectivity for enterprise storage solutions.
One definition of parallel architecture
A parallel computer is a collection of processing elements that cooperate to
solve large problems fast.
Key issues:
Resource Allocation
how large a collection?
how powerful are the elements?
how much memory?
A further key issue is the historical tight coupling between hardware and software, which gave rise to the most common parallel architectures.
◼ When we say parallel architectures are tightly coupled to specific programming models, we mean that the hardware design is optimized for a particular way of expressing parallelism.
◼ The introduction of high-level programming models has been a critical step in decoupling
parallel hardware from specific software paradigms.
◼ Programming models like OpenMP and CUDA offer higher-level abstractions, allowing
developers to focus on parallelism without worrying about the specific hardware being
used.
◼ These models provide a consistent interface to parallelize code, whether you're running on
a multi-core CPU, a GPU, or even a distributed system.
What is the solution for decoupling?
◼ OpenMP: A directive-based parallel programming model that enables multi-
threading within shared-memory systems. By using compiler directives,
OpenMP allows for parallel execution without needing to worry about the
underlying architecture.
◼ CUDA: A parallel computing platform and application programming interface
(API) that allows developers to write software that can run on GPUs,
abstracting away the hardware-specific intricacies.
◼ MPI (Message Passing Interface): For distributed-memory systems, MPI standardized how processes communicate in a parallel environment.
◼ By abstracting the communication between processors, MPI allowed for a more
portable way of writing parallel programs that could run on different distributed
architectures, without being tightly coupled to the underlying hardware.
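As a minimal sketch of this portability (assuming an MPI implementation such as Open MPI or MPICH and launching with mpirun), the following C program queries its rank and the total number of processes; the same source runs unchanged on different distributed architectures:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);               /* Start the MPI runtime. */

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* This process's ID within the communicator. */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* Total number of processes. */

    printf("Process %d of %d\n", rank, size);

    MPI_Finalize();                       /* Shut down the MPI runtime. */
    return 0;
}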
Parallel Architectures
◼ Parallel architecture extends traditional computer architecture by introducing
mechanisms for communication and cooperation among processing elements.
◦ OLD: Instruction Set Architecture--classical computer architecture focuses
on instruction execution
◦ NEW: Communication Architecture--emphasizes how multiple processing
units interact to solve large-scale problems efficiently. Communication
architecture is the foundation of parallel and distributed computing
systems.
◼ Communication architecture defines
◦ Critical abstractions, boundaries, and primitives (interfaces)
◦ Organizational structures that implement interfaces (hw or sw)
◼ Abstractions (logical models for interaction)
◼ Boundaries (separation of responsibilities)
◼ Primitives (basic operations for communication)
◼ Organizational Structures (how interfaces are implemented in HW/SW)
Shared Memory Architecture
Shared memory parallel computers vary widely, but they generally have in common the ability for all processors to access all memory as a global address space.
Multiple processors can operate independently but share the same memory resources.
Changes in a memory location effected by one processor are visible to all other
processors.
Shared Memory Architecture
Shared memory machines can be divided into two main classes based upon memory
access times: UMA and NUMA.
Multiple processors share access to a common memory space.
This architecture simplifies communication and data sharing among processors but
requires mechanisms for synchronization and mutual exclusion to prevent data
hazards.
Uniform Memory Access (UMA):
All processors in a UMA system have equal access times to any memory location.
Sometimes called CC-UMA - Cache Coherent UMA.
Challenges in shared-memory systems: Multiple processors may have cached copies of the same memory location, leading to inconsistency if one of them modifies it. Solution: cache coherence protocols such as MESI.
MESI protocol
Modified (M): Only one cache has the modified data; must write back before others read.
Exclusive (E): Only one cache has the data, but it’s unmodified (clean).
Shared (S): Multiple caches can read but not write.
Invalid (I): The cache line is invalid and must be fetched.
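As a rough software sketch only (real hardware implements this in the cache controller; the enum and function names here are hypothetical), the following C fragment models the MESI states and the state a line moves to when the local core writes it:

#include <stdio.h>

/* The four MESI states of a cache line. */
typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_state;

/* Hypothetical transition when the local core writes to a line. */
mesi_state on_local_write(mesi_state s) {
    switch (s) {
    case MODIFIED:  return MODIFIED;   /* Already dirty; keep writing. */
    case EXCLUSIVE: return MODIFIED;   /* Clean and private: mark dirty. */
    case SHARED:    return MODIFIED;   /* Other copies must be invalidated first. */
    case INVALID:   return MODIFIED;   /* Fetch the line for ownership, then mark dirty. */
    }
    return INVALID;
}

int main(void) {
    printf("SHARED line after local write -> state %d (MODIFIED)\n",
           on_local_write(SHARED));
    return 0;
}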
Shared Memory: UMA vs. NUMA
• Cache coherent means if one processor updates a location in shared memory,
all the other processors know about the update.
• Cache coherence ensures that all processors or cores in a multi-core system
see a consistent view of shared data, preventing data inconsistencies and
ensuring reliable program execution when multiple caches store copies of
the same data.
• The Problem: In a multi-core system, each core has its own cache memory
to store frequently accessed data for faster retrieval. If multiple cores share
the same data, and one core modifies its cached copy, other cores might still
have outdated versions of that data, leading to inconsistencies and errors.
• The Solution: Cache coherence protocols are mechanisms that ensure all
caches maintain a consistent view of shared data. These protocols detect
when a shared data block is modified by one core and propagate the changes
to other caches, ensuring that all cores see the updated data.
Shared Memory: UMA vs. NUMA
Non-Uniform Memory Access (NUMA):
• Each processor has its own local memory, and access to that local memory is faster than
accessing memory on another processor's board (remote memory).
• NUMA is often used in systems with multiple SMPs linked together.
• Suitable for real-time and time-critical applications where faster access to local data is
crucial.
• Not all processors have equal access time to all memories
• If cache coherency is maintained, then may also be called CC-NUMA - Cache Coherent
NUMA
Shared Memory Architecture
Any processor can directly reference any memory location
Communication occurs implicitly as a result of loads and stores (see the sketch after this list)
Convenient:
Location transparency (don’t need to worry about physical placement of
data)
Similar programming model to time-sharing on uniprocessors
Except processes run on different processors
Good throughput on multiprogrammed workloads
Naturally provided on wide range of platforms
History dates at least to precursors of mainframes in early 60s
Wide range of scale: few to hundreds of processors
Popularly known as shared-memory machines / model
Ambiguous: memory may be physically distributed among processors
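A minimal OpenMP sketch of such implicit communication, assuming at most 64 threads (the array size and variable names are illustrative): each thread stores into a shared array with an ordinary store, and after the parallel region the program reads the results with ordinary loads, with no explicit messages:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int partial[64] = {0};          /* Shared array: one slot per thread. */
    int nthreads = 0;

    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        #pragma omp single
        nthreads = omp_get_num_threads();
        partial[id] = id * id;      /* Plain store into shared memory. */
    }                               /* Implicit barrier: all stores are now visible. */

    int total = 0;
    for (int i = 0; i < nthreads; i++)
        total += partial[i];        /* Plain loads read the other threads' data. */

    printf("Sum of squares of thread IDs = %d\n", total);
    return 0;
}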
Shared Memory: Pro and Con
Advantages
• Global address space provides a user-friendly programming perspective to
memory
• Data sharing between tasks is both fast and uniform due to the proximity of
memory to CPUs
Disadvantages:
• The primary disadvantage is the lack of scalability between memory and CPUs. Adding more CPUs can geometrically increase traffic on the shared memory-CPU path and, for cache coherent systems, geometrically increase traffic associated with cache/memory management.
• Programmer responsibility for synchronization constructs that ensure "correct" access of global memory.
• Expense: it becomes increasingly difficult and expensive to design and
produce shared memory machines with ever increasing numbers of processors.
Distributed Memory
Like shared memory systems, distributed memory systems vary widely but share a
common characteristic.
Distributed memory systems require a communication network to connect inter-
processor memory.
Distributed-memory architecture comprises multiple independent processing units,
each with its own memory space.
Communication between processors is achieved through message passing over a
network.
This architecture offers scalability and fault tolerance but requires explicit data
distribution and communication protocols.
Processors have their own local memory. Memory addresses in one processor do not
map to another processor, so there is no concept of global address space across all
processors.
Because each processor has its own local memory, it operates independently.
Changes it makes to its local memory have no effect on the memory of other
processors. Hence, the concept of cache coherency does not apply.
Distributed Memory
When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly define how and when data is communicated (a sketch follows below). Synchronization between tasks is likewise the programmer's responsibility.
The network "fabric" used for data transfer varies widely, though it can be as simple as Ethernet.
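As an illustrative sketch of such explicit communication in MPI (the buffer name and tag value 0 are arbitrary; run with at least two processes, e.g. mpirun -np 2), rank 0 sends one integer to rank 1:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int value = 42;
        /* Explicitly send the data to rank 1; nothing is shared implicitly. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int value = 0;
        /* Explicitly receive the data from rank 0. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}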
Distributed Memory: Pro and Con
Advantages
• Memory is scalable with number of processors. Increase the number of
processors and the size of memory increases proportionately.
• Each processor can rapidly access its own memory without interference and
without the overhead incurred with trying to maintain cache coherency.
• Cost effectiveness: can use commodity, off-the-shelf processors and networking.
Disadvantages
• The programmer is responsible for many of the details associated with data
communication between processors.
• It may be difficult to map existing data structures, based on global memory, to
this memory organization.
• Non-uniform memory access (NUMA) times
Hybrid Distributed-Shared Memory
Comparison of Shared and Distributed Memory Architectures: software availability ranges from many thousands of ISVs for shared-memory (CC-UMA and CC-NUMA) systems to hundreds of ISVs for distributed-memory systems.
Hybrid Distributed-Shared Memory
The largest and fastest computers in the world today employ both
shared and distributed memory architectures.
Routing Protocols
Routing protocols determine the path a message takes through the network to
reach its destination.
Dimension-Ordered Routing: A common technique for meshes and hypercubes,
where messages follow a specific dimension order.
XY Routing: A specific dimension-ordered routing technique for two-dimensional meshes (sketched in code below).
E-cube Routing: A specific dimension-ordered routing technique for hypercubes.
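A rough illustration of XY routing (the coordinates and function name are hypothetical): the packet first moves along the X dimension until its column matches the destination, then along the Y dimension:

#include <stdio.h>

/* Print the hops an XY-routed packet takes from (sx, sy) to (dx, dy)
   in a 2D mesh: correct the X coordinate first, then Y. */
void xy_route(int sx, int sy, int dx, int dy) {
    int x = sx, y = sy;
    while (x != dx) {                 /* Phase 1: move along X. */
        x += (dx > x) ? 1 : -1;
        printf("hop to (%d, %d)\n", x, y);
    }
    while (y != dy) {                 /* Phase 2: move along Y. */
        y += (dy > y) ? 1 : -1;
        printf("hop to (%d, %d)\n", x, y);
    }
}

int main(void) {
    xy_route(0, 0, 2, 3);             /* Route from node (0,0) to node (2,3). */
    return 0;
}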
Interconnection networks and routing protocols
Routing Decisions:
Switches/Routers: Nodes in the network that make routing decisions, selecting the
appropriate link to forward a message.
Hops: The number of steps a message takes to reach its destination, through intermediate
nodes.
Performance Metrics
Latency: The time it takes for a message to travel from one processor
to another. Network latency can stall parallel computations.
Bandwidth: The amount of data that can be transferred per unit of time; congestion at shared links (e.g., a bus or the root of a tree network) reduces the effective bandwidth. A simple cost model combining latency and bandwidth is sketched after this list.
Scalability: The ability of the network to maintain performance as the
number of processors increases.
Cost: The hardware and software costs associated with the
interconnection network and routing protocol.
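A simple first-order cost model that ties latency and bandwidth together (an assumed approximation, not a property of any particular network) estimates message transfer time as startup latency plus message size divided by bandwidth:

#include <stdio.h>

/* First-order model: transfer time = latency + bytes / bandwidth. */
double transfer_time(double latency_s, double bytes, double bandwidth_Bps) {
    return latency_s + bytes / bandwidth_Bps;
}

int main(void) {
    /* Example: 1 microsecond latency, 1 MiB message, 10 GB/s link. */
    double t = transfer_time(1e-6, 1024.0 * 1024.0, 10e9);
    printf("Estimated transfer time: %.6f ms\n", t * 1e3);
    return 0;
}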
Thank you…!!!