
PARALLEL COMPUTING

Chapter Three: Parallel Computer Architectures

-Introduction
-Parallel Architectures
-Shared memory systems and cache coherence
-Distributed-memory systems
-Hybrid memory systems
-Interconnection networks and routing protocols
-In parallel computing, interconnection networks and routing protocols facilitate
communication and data transfer between processors and memory, enabling efficient
computation.
-Interconnection networks, which can be static or dynamic, define the physical links,
while routing protocols determine how data packets travel through the network.
Introduction
 Parallel computer architecture refers to the design and organization of
computer systems that use multiple processors or cores to perform
computations simultaneously, thereby improving performance and
efficiency.
 Parallel architectures refer to the hardware design where multiple processors
or cores execute instructions simultaneously to increase computational
speed.
 Parallel computer architecture is designed to exploit the power of multiple
processors working together to solve problems more efficiently than a single
processor could.
 It's a foundational concept behind high-performance computing (HPC),
supercomputing, and the modern multi-core processors found in most
computers today.
Introduction
 Parallel computing architecture involves designing systems where multiple
processors or cores work concurrently to execute tasks and solve problems,
significantly enhancing speed and efficiency, especially for complex and
data-intensive applications.

 In parallel computing, the architecture comprises essential components such as
processors, memory hierarchy, interconnects, and software stack.
 These components work together to facilitate efficient communication, data
processing, and task coordination across multiple processing units.

 Understanding the roles and interactions of these components is crucial for
designing and optimizing parallel computing systems.
Parallel C program with OpenMP


 #include <stdio.h>
 #include <omp.h>
 int main() {
 printf("Hello, world:"); // prints the message "Hello, world:" to the console.
 #pragma omp parallel // This directive tells the compiler to execute the following block of code in parallel.
 printf(" %d", omp_get_thread_num()); // Inside the parallel region, omp_get_thread_num() returns the unique thread ID of the thread currently executing this line.
 printf("\n"); // prints a newline after all threads have printed
 return 0; // ends the program
 }
Parallel C program with OpenMP


 OpenMP directives used in parallel programming, followed by examples
 #pragma omp parallel:
 This is the primary directive to specify a parallel region. It indicates that the following
block of code should be executed by multiple threads in parallel.
 #include <stdio.h>
 #include <omp.h>
 int main() {
 #pragma omp parallel
 {
 printf("Hello from thread %d\n", omp_get_thread_num());
 }
 return 0;
 }

 The code inside the #pragma omp parallel block is executed by multiple
threads. omp_get_thread_num() retrieves the unique ID of each thread.
Parallel C program with OpenMP


 #pragma omp for:
 This directive is used to parallelize a loop. It splits iterations of the loop across multiple
threads.
 #include <stdio.h>
 #include <omp.h>

 int main() {
 int i;
 #pragma omp parallel for
 for (i = 0; i < 10; i++) {
 printf("Thread %d processes iteration %d\n", omp_get_thread_num(), i);
 }
 return 0;
 }
 The iterations of the loop are distributed among the available threads. (The example uses the combined #pragma omp parallel for directive, which creates the parallel region and splits the loop in a single step.)
Parallel C program with OpenMP


 #pragma omp critical:
 This directive specifies that a block of code should be executed by only one thread at a
time, ensuring that no two threads enter the critical section simultaneously. This is useful
for protecting shared resources.

 #pragma omp barrier:


 This directive creates a barrier in the code where all threads must synchronize before any
thread can continue. It ensures that all threads reach this point before proceeding.

 #pragma omp atomic:


 This directive ensures that a specific update to a variable is done atomically, preventing
race conditions without the overhead of a critical section.
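 As an illustrative sketch (not part of the original slides), the following C program combines the three directives: #pragma omp atomic protects the shared sum, #pragma omp barrier makes every thread wait until the loop is finished (redundant here, since #pragma omp for already ends with an implicit barrier, but shown for clarity), and #pragma omp critical serializes the printing.
 #include <stdio.h>
 #include <omp.h>
 int main() {
 int sum = 0; // shared accumulator
 #pragma omp parallel
 {
 #pragma omp for // split the loop iterations among the threads
 for (int i = 1; i <= 100; i++) {
 #pragma omp atomic // atomic update: no data race on sum
 sum += i;
 } // implicit barrier at the end of the omp for loop
 #pragma omp barrier // explicit barrier: every thread waits here
 #pragma omp critical // only one thread prints at a time
 printf("Thread %d sees sum = %d\n", omp_get_thread_num(), sum);
 }
 return 0;
 }
 Every thread should print the same final value (5050), because the barrier guarantees that all atomic updates have completed before any thread reads sum.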
Components of Parallel Computing Architecture

 Processors are the central processing units responsible for executing instructions
and performing computations in parallel computing systems.
 Different types of processors, such as CPUs, GPUs, and APUs, offer varying
degrees of parallelism and computational capabilities.
 Central Processing Units (CPU)
Multi-core CPUs: These CPUs feature multiple processing cores
integrated onto a single chip, allowing parallel execution of tasks.
Each core can independently execute instructions, enabling higher
performance and efficiency in multi-threaded applications.
Multi-threaded CPUs: Multi-threaded CPUs support the simultaneous
execution of multiple threads within each core. This feature enhances
throughput and responsiveness by overlapping the execution of
multiple tasks, particularly in applications with parallelizable
workloads.
Components of Parallel Computing Architecture

 Graphics Processing Units (GPU)


 Stream processors: GPUs consist of numerous stream processors,
also known as shader cores, responsible for executing computational
tasks in parallel. These processors are optimized for data-parallel
operations and are particularly well-suited for graphics rendering,
scientific computing, and machine learning tasks.

CUDA cores: CUDA (Compute Unified Device Architecture) cores are specialized
processing units found in NVIDIA GPUs. These cores are designed to execute parallel
computing tasks programmed using the CUDA parallel computing platform and
application programming interface (API). CUDA cores offer high throughput and
efficiency for parallel processing workloads.
Components of Parallel Computing Architecture

 Accelerated Processing Units (APU)


CPU cores: Accelerated Processing Units (APUs) integrate
both CPU and GPU cores on a single chip. The CPU cores
within APUs are responsible for general-purpose computing
tasks, such as executing application code, handling system
operations, and managing memory.

GPU cores: Alongside CPU cores, APUs also include GPU cores optimized for parallel
computation and graphics processing. These GPU cores provide accelerated performance
for tasks such as image rendering, video decoding, and parallel computing workloads.
Components of Parallel Computing Architecture

 Registers
 General-purpose registers: Registers directly accessible by the CPU cores for storing
temporary data and intermediate computation results.
 Special-purpose registers: Registers dedicated to specific functions, such as program
counter, stack pointer, and status flags, essential for CPU operations and control flow.
 Cache Memory
 L1 Cache: Level 1 cache located closest to the CPU cores, offering fast access to
frequently accessed data and instructions.
 L2 Cache: Level 2 cache situated between L1 cache and main memory, providing larger
storage capacity and slightly slower access speeds.
 L3 Cache: Level 3 cache shared among multiple CPU cores, offering a larger cache size
and serving as a shared resource for improving data locality and reducing memory access
latency.
Components of Parallel Computing Architecture

 Main Memory (RAM)


 Dynamic RAM (DRAM): Main memory modules composed of dynamic
random-access memory cells, used for storing program instructions and
data during program execution.
 Static RAM (SRAM): Caches and buffer memory within the memory hierarchy, offering
faster access speeds and lower latency than DRAM because SRAM stores data in
flip-flops, whereas DRAM stores data in capacitor-and-transistor cells that must be
charged and discharged (refreshed), which increases latency. SRAM is also located
closer to the CPU than DRAM.
 Video RAM (VRAM): Dedicated memory on GPUs used for storing
textures, framebuffers, and other graphical data required for rendering
images and videos. VRAM enables high-speed access to graphics data
and enhances the performance of GPU-accelerated applications.
Components of Parallel Computing Architecture
Secondary Storage (Disk)
 Hard Disk Drives (HDDs): Magnetic storage devices used for long-term data storage and
retrieval in parallel computing systems. HDDs provide high-capacity storage but slower access
speeds compared to main memory.
 HDDs have slower read and write speeds and higher latency than SSDs, making them less
suitable for parallel computations that demand speed.
 HDDs have slower random-access times compared to SSDs, which can hinder performance in
parallel computing scenarios.
 SSDs tend to be more expensive per gigabyte than HDDs.
 Solid State Drives (SSDs): Flash-based storage devices offer faster access speeds and lower
latency than HDDs. SSDs are commonly used as secondary storage in parallel computing
systems to improve I/O performance and reduce data access latency.
 SSDs offer much faster read and write speeds compared to HDDs, leading to quicker data
access and processing times.
 SSDs have lower latency (the time it takes to access data), which is crucial for parallel
computations where rapid data retrieval is essential.
 SSDs excel at random access, meaning they can quickly access data from any location on the
drive, a key advantage for parallel processing that often involves accessing data from various
locations.
Components of Parallel Computing Architecture

 Buses
 System Bus: Connects the CPU, memory, and other internal
components within a computer system. It facilitates
communication and data transfer between these components.

 Memory Bus: Dedicated bus for transferring data between the CPU
and main memory (RAM). It ensures fast and efficient access to
memory resources.

 I/O Bus: Input/Output bus connects peripheral devices, such as storage devices,
network interfaces, and accelerators, to the CPU and memory in a parallel computing
system.
Components of Parallel Computing Architecture

 Switches
 Crossbar Switches: High-performance switches that provide multiple paths for data transmission
between input and output ports. They enable simultaneous communication between multiple pairs of
devices, improving bandwidth and reducing latency.
 Packet Switches: Switches that forward data in discrete packets based on destination addresses.
They efficiently manage network traffic by dynamically allocating bandwidth and prioritizing
packets based on quality of service (QoS) parameters.

 Networks
 Ethernet: A widely used networking technology for local area networks (LANs) and wide area
networks (WANs). It employs Ethernet cables and switches to transmit data packets between
devices within a network.
 InfiniBand: A high-speed interconnect technology commonly used in high-performance computing
(HPC) environments. It offers low-latency, high-bandwidth communication between compute nodes
in clustered systems.
 Fibre Channel: A storage area network (SAN) technology that enables high-speed data transfer
between servers and storage devices over fiber optic cables. It provides reliable and scalable
connectivity for enterprise storage solutions.
One definition of parallel architecture
 A parallel computer is a collection of processing elements that cooperate to
solve large problems fast.
 Key issues:
 Resource Allocation
 how large a collection?
 how powerful are the elements?
 how much memory?

 Data access, Communication and Synchronization


 how do the elements cooperate and communicate?
 how are data transmitted between processors?
 what are the abstractions and primitives for cooperation?

 Performance and Scalability


 how does it all translate into performance?
 how does it scale?
Why study parallel arch & programming models?
 The Answer 15 or more years ago:
 Because it allows you to achieve performance beyond what we get with CPU clock frequency
scaling

 The Answer Today:


 Because it seems to be the best available way to achieve higher performance in the foreseeable
future
 CPU clock rates are no longer increasing! The higher the clock speed, the more heat is
generated, and we have now hit a stage where it is no longer efficient to increase
processor speed due to the amount of energy that goes into cooling it.
 Instruction-level-parallelism is not increasing either!
 Improving performance further on sequential code becomes very complicated +
diminishing returns
 Without explicit parallelism or architectural specialization, performance becomes a zero-sum game.
 Specialization is more disruptive than parallel programming (and is mostly about parallelism
anyway)
Parallel Architectures
 The common parallel architectures:
 Shared Memory
 Distributed Memory
 Hybrid Memory
 Historically, parallel architectures were tightly coupled to specific programming models,
leading to divergent architectures with no clear, and predictable growth pattern.
 The development of parallel computer architectures and the programming models used to
write software for them were closely intertwined.
 As a result of this tight coupling, different parallel architectures emerged, each with
unique characteristics and programming models, leading to a landscape of diverse and
sometimes incompatible systems.
 The lack of a unified approach to parallel computing and the proliferation of different
architectures and programming models resulted in a pattern of growth that was not easily
predictable or scalable.
 Each new architecture often required a new set of programming tools and techniques,
hindering widespread adoption and standardization.
What is the solution for decoupling?
◼ What is the solution to the problem of tightly coupled parallel architectures and
programming models, which led to divergent and sometimes incompatible systems?
◼ Focus on standardization, abstraction, and portability to break away from the
historical tight coupling between hardware and software that produced the common
parallel architectures listed above.
◼ "Tightly coupled to a specific programming model" means the hardware design is
optimized for one particular way of expressing parallelism.
◼ The introduction of high-level programming models has been a critical step in decoupling
parallel hardware from specific software paradigms.
◼ Programming models like OpenMP and CUDA offer higher-level abstractions, allowing
developers to focus on parallelism without worrying about the specific hardware being
used.
◼ These models provide a consistent interface to parallelize code, whether you're running on
a multi-core CPU, a GPU, or even a distributed system.
What is the solution for decoupling?
◼ OpenMP: A directive-based parallel programming model that enables multi-
threading within shared-memory systems. By using compiler directives,
OpenMP allows for parallel execution without needing to worry about the
underlying architecture.
◼ CUDA: A parallel computing platform and application programming interface
(API) that allows developers to write software that can run on GPUs,
abstracting away the hardware-specific intricacies.
◼ Message Passing Interface (MPI): For distributed-memory systems, MPI standardized
how processes communicate in a parallel environment.
◼ By abstracting the communication between processors, MPI allowed for a more
portable way of writing parallel programs that could run on different distributed
architectures, without being tightly coupled to the underlying hardware.
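◼ As a minimal, hedged sketch of the message-passing style that MPI standardizes (this example is not from the slides), each process below simply reports its rank and the total number of processes:
 #include <stdio.h>
 #include <mpi.h>
 int main(int argc, char *argv[]) {
 int rank, size;
 MPI_Init(&argc, &argv); // start the MPI runtime
 MPI_Comm_rank(MPI_COMM_WORLD, &rank); // unique ID (rank) of this process
 MPI_Comm_size(MPI_COMM_WORLD, &size); // total number of processes
 printf("Hello from process %d of %d\n", rank, size);
 MPI_Finalize(); // shut down the MPI runtime
 return 0;
 }
◼ Compiled with an MPI wrapper compiler (e.g. mpicc) and launched with mpirun or mpiexec, the same source runs unchanged on a laptop or on a cluster, which is exactly the portability described above.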
Parallel Architectures
◼ Parallel architecture extends traditional computer architecture by introducing
mechanisms for communication and cooperation among processing elements.
◦ OLD: Instruction Set Architecture--classical computer architecture focuses
on instruction execution
◦ NEW: Communication Architecture--emphasizes how multiple processing
units interact to solve large-scale problems efficiently. Communication
architecture is the foundation of parallel and distributed computing
systems.
◼ Communication architecture defines
◦ Critical abstractions, boundaries, and primitives (interfaces)
◦ Organizational structures that implement interfaces (hw or sw)
◼ Abstractions (logical models for interaction)
◼ Boundaries (separation of responsibilities)
◼ Primitives (basic operations for communication)
◼ Organizational Structures (how interfaces are implemented in HW/SW)
Shared Memory Architecture
 Shared memory parallel computers vary widely, but generally have in common the
ability for all processors to access all memory as global address space.

 Multiple processors can operate independently but share the same memory resources.
 Changes in a memory location effected by one processor are visible to all other
processors.
Shared Memory Architecture
 Shared memory machines can be divided into two main classes based upon memory
access times: UMA and NUMA.
 Multiple processors share access to a common memory space.
 This architecture simplifies communication and data sharing among processors but
requires mechanisms for synchronization and mutual exclusion to prevent data
hazards.
Uniform Memory Access (UMA):
 All processors in a UMA system have equal access times to any memory location.
 Sometimes called CC-UMA - Cache Coherent UMA.
 Challenge in shared memory systems: Multiple processors may have cached copies of the
same memory location, leading to inconsistency if one modifies it. Solution: cache
coherence protocols such as MESI.
 MESI protocol
 Modified (M): Only one cache has the modified data; must write back before others read.
 Exclusive (E): Only one cache has the data, but it’s unmodified (clean).
 Shared (S): Multiple caches can read but not write.
 Invalid (I): The cache line is invalid and must be fetched.
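 As an illustrative sketch only (the real protocol is implemented in the cache hardware and also involves bus transactions not modeled here), the following C fragment encodes the four MESI states and two simplified transitions:
 #include <stdio.h>
 typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_state;
 /* New state of a cache line after the LOCAL core writes to it (a real controller also issues the matching bus transactions). */
 mesi_state on_local_write(mesi_state s) {
 switch (s) {
 case MODIFIED: return MODIFIED; // already dirty and exclusively owned
 case EXCLUSIVE: return MODIFIED; // clean and exclusive: upgrade silently
 case SHARED: return MODIFIED; // other caches' copies are invalidated first
 case INVALID: return MODIFIED; // line is fetched from memory, then owned dirty
 }
 return INVALID;
 }
 /* New state of this cache's copy when ANOTHER core writes the same line. */
 mesi_state on_remote_write(mesi_state s) {
 (void)s;
 return INVALID; // any local copy becomes stale
 }
 int main(void) {
 const char *name[] = { "Modified", "Exclusive", "Shared", "Invalid" };
 printf("Shared + local write -> %s\n", name[on_local_write(SHARED)]);
 printf("Exclusive + remote write -> %s\n", name[on_remote_write(EXCLUSIVE)]);
 return 0;
 }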
Shared Memory: UMA vs. NUMA
• Cache coherent means if one processor updates a location in shared memory,
all the other processors know about the update.
• Cache coherence ensures that all processors or cores in a multi-core system
see a consistent view of shared data, preventing data inconsistencies and
ensuring reliable program execution when multiple caches store copies of
the same data.

• The Problem: In a multi-core system, each core has its own cache memory
to store frequently accessed data for faster retrieval. If multiple cores share
the same data, and one core modifies its cached copy, other cores might still
have outdated versions of that data, leading to inconsistencies and errors.

• The Solution: Cache coherence protocols are mechanisms that ensure all
caches maintain a consistent view of shared data. These protocols detect
when a shared data block is modified by one core and propagate the changes
to other caches, ensuring that all cores see the updated data.
Shared Memory: UMA vs. NUMA
 Non-Uniform Memory Access (NUMA):
• Each processor has its own local memory, and access to that local memory is faster than
accessing memory on another processor's board (remote memory).
• NUMA is often used in systems with multiple SMPs linked together.
• Suitable for real-time and time-critical applications where faster access to local data is
crucial.
• Not all processors have equal access time to all memories
• If cache coherency is maintained, then may also be called CC-NUMA - Cache Coherent
NUMA
Shared Memory Architecture
 Any processor can directly reference any memory location
 Communication occurs implicitly as result of loads and stores
 Convenient:
 Location transparency (don’t need to worry about physical placement of
data)
 Similar programming model to time-sharing on uniprocessors
 Except processes run on different processors
 Good throughput on multiprogrammed workloads
 Naturally provided on wide range of platforms
 History dates at least to precursors of mainframes in early 60s
 Wide range of scale: few to hundreds of processors
 Popularly known as shared-memory machines / model
 Ambiguous: memory may be physically distributed among processors
Shared Memory: Pro and Con
 Advantages
• Global address space provides a user-friendly programming perspective to
memory
• Data sharing between tasks is both fast and uniform due to the proximity of
memory to CPUs

 Disadvantages:
• Primary disadvantage is the lack of scalability between memory and CPUs.
Adding more CPUs can geometrically increase traffic on the shared memory-CPU
path and, for cache coherent systems, geometrically increase traffic associated
with cache/memory management.
• Programmer responsibility for synchronization constructs that ensure "correct"
access of global memory.
• Expense: it becomes increasingly difficult and expensive to design and
produce shared memory machines with ever increasing numbers of processors.
Distributed Memory
 Like shared memory systems, distributed memory systems vary widely but share a
common characteristic.
 Distributed memory systems require a communication network to connect inter-
processor memory.
 Distributed-memory architecture comprises multiple independent processing units,
each with its own memory space.
 Communication between processors is achieved through message passing over a
network.
 This architecture offers scalability and fault tolerance but requires explicit data
distribution and communication protocols.
 Processors have their own local memory. Memory addresses in one processor do not
map to another processor, so there is no concept of global address space across all
processors.
 Because each processor has its own local memory, it operates independently.
 Changes it makes to its local memory have no effect on the memory of other
processors. Hence, the concept of cache coherency does not apply.
Distributed Memory
 When a processor needs access to data in another processor, it is
usually the task of the programmer to explicitly define how and when
data is communicated. Synchronisation between tasks is likewise the
programmer's responsibility.

 The network "fabric" used for data transfer varies widely, though it can be as
simple as Ethernet.
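 To make this explicit communication concrete, here is a hedged MPI sketch (not from the slides, and assuming the program is launched with exactly two processes): process 0 owns an array in its local memory, sends half of it to process 1, and each process then sums only its own half.
 #include <stdio.h>
 #include <mpi.h>
 #define N 8
 int main(int argc, char *argv[]) {
 int rank, half[N / 2], partial = 0;
 MPI_Init(&argc, &argv);
 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
 if (rank == 0) {
 int data[N] = {1, 2, 3, 4, 5, 6, 7, 8}; // exists only in rank 0's local memory
 MPI_Send(&data[N / 2], N / 2, MPI_INT, 1, 0, MPI_COMM_WORLD); // ship the 2nd half
 for (int i = 0; i < N / 2; i++) half[i] = data[i]; // keep the 1st half
 } else if (rank == 1) {
 MPI_Recv(half, N / 2, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
 } else {
 MPI_Finalize(); // extra ranks (if any) do nothing in this sketch
 return 0;
 }
 for (int i = 0; i < N / 2; i++) partial += half[i]; // purely local computation
 printf("Rank %d computed partial sum %d\n", rank, partial);
 MPI_Finalize();
 return 0;
 }
 Nothing is shared: if rank 1 needs rank 0's data, the programmer must send it explicitly, exactly as stated above.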
Distributed Memory: Pro and Con
 Advantages
• Memory is scalable with number of processors. Increase the number of
processors and the size of memory increases proportionately.
• Each processor can rapidly access its own memory without interference and
without the overhead incurred with trying to maintain cache coherency.
• Cost effectiveness: can use commodity, off-the-shelf processors and networking.

 Disadvantages
• The programmer is responsible for many of the details associated with data
communication between processors.
• It may be difficult to map existing data structures, based on global memory, to
this memory organization.
• Non-uniform memory access (NUMA) times
Hybrid Distributed-Shared Memory
Comparison of Shared and Distributed Memory Architectures

Architectures compared: CC-UMA, CC-NUMA, and Distributed memory.

Examples:
  CC-UMA:      SMPs, Sun Vexx, DEC/Compaq, SGI Challenge, IBM POWER3
  CC-NUMA:     Bull NovaScale, SGI Origin, Sequent, HP Exemplar, DEC/Compaq, IBM POWER4 (MCM)
  Distributed: Cray T3E, Maspar, IBM SP2, IBM BlueGene

Communications:
  CC-UMA:      MPI, Threads, OpenMP, shmem
  CC-NUMA:     MPI, Threads, OpenMP, shmem
  Distributed: MPI

Scalability:
  CC-UMA:      to 10s of processors
  CC-NUMA:     to 100s of processors
  Distributed: to 1000s of processors

Drawbacks:
  CC-UMA:      Memory-CPU bandwidth
  CC-NUMA:     Memory-CPU bandwidth; non-uniform access times
  Distributed: System administration; programming is hard to develop and maintain

Software Availability:
  CC-UMA:      many 1000s of ISVs
  CC-NUMA:     many 1000s of ISVs
  Distributed: 100s of ISVs
Hybrid Distributed-Shared Memory
 The largest and fastest computers in the world today employ both
shared and distributed memory architectures.

 The shared memory component is usually a cache coherent SMP machine. Processors on
a given SMP can address that machine's memory as global.
 Hybrid architectures combine elements of both shared-memory and
distributed-memory systems.
 These architectures leverage the benefits of shared-memory parallelism
within individual nodes and distributed-memory scalability across multiple
nodes, making them suitable for a wide range of applications.
Hybrid Distributed-Shared Memory
 The distributed memory component is the networking of multiple
SMPs. SMPs know only about their own memory - not the memory
on another SMP.
 Therefore, network communications are required to move data from
one SMP to another.
 Current trends seem to indicate that this type of memory architecture
will continue to prevail and increase at the high end of computing for
the foreseeable future.
 Advantages and Disadvantages: whatever is common to both shared
and distributed memory architectures.
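 A common way to program such machines (a hedged sketch, not from the slides) is to combine MPI across nodes with OpenMP threads inside each node:
 #include <stdio.h>
 #include <mpi.h>
 #include <omp.h>
 int main(int argc, char *argv[]) {
 int rank, provided;
 MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided); // only the main thread will call MPI
 MPI_Comm_rank(MPI_COMM_WORLD, &rank); // distributed-memory part: one MPI process per node
 #pragma omp parallel // shared-memory part: threads within the node
 printf("MPI rank %d, OpenMP thread %d\n", rank, omp_get_thread_num());
 MPI_Finalize();
 return 0;
 }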
Interconnection networks and routing protocols
 Interconnection networks and routing protocols are the backbone of parallel and
distributed systems, determining how efficiently processors, memory, and
accelerators communicate.
 Interconnection networks connect processors and memory in a parallel computer,
facilitating communication and data transfer.
 Static (Direct) Networks: Use point-to-point communication links, where each node is
directly connected to others.
 Dynamic (Indirect) Networks: Employ switches to connect nodes dynamically, allowing for
more flexible communication paths.
Interconnection Network Topologies
 Buses: A simple topology where all processors share a common bus for data exchange.
 Crossbar: Provides a direct connection between any pair of nodes, but can be expensive and
power-intensive.
 Multistage: Uses multiple stages of switches to connect nodes, offering scalability and
flexibility.
Interconnection networks and routing protocols
 Meshes: Nodes are arranged in a grid-like structure, allowing for efficient
communication between neighboring nodes.
 Hypercubes: Nodes are connected in a way that each node has a neighbor for each
bit position in its address.

Routing Protocols
 Routing protocols determine the path a message takes through the network to
reach its destination.
 Dimension-Ordered Routing: A common technique for meshes and hypercubes,
where messages follow a specific dimension order.
 XY Routing: A specific dimension-ordered routing technique for two-dimensional
meshes.
 E-cube Routing: A specific dimension-ordered routing technique for hypercubes.
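 As a small illustrative sketch (not from the slides), XY routing on a 2D mesh can be expressed as: first correct the X coordinate, then the Y coordinate. The function below prints the sequence of intermediate nodes (hops) a message would visit:
 #include <stdio.h>
 #include <stdlib.h>
 void xy_route(int sx, int sy, int dx, int dy) {
 int x = sx, y = sy;
 printf("(%d,%d)", x, y);
 while (x != dx) { // phase 1: travel along the X dimension
 x += (dx > x) ? 1 : -1;
 printf(" -> (%d,%d)", x, y);
 }
 while (y != dy) { // phase 2: travel along the Y dimension
 y += (dy > y) ? 1 : -1;
 printf(" -> (%d,%d)", x, y);
 }
 printf("   [%d hops]\n", abs(dx - sx) + abs(dy - sy));
 }
 int main(void) {
 xy_route(0, 0, 3, 2); // route from node (0,0) to node (3,2)
 return 0;
 }
 E-cube routing applies the same idea to a hypercube: the message corrects its address one bit position (dimension) at a time, in a fixed order.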
Interconnection networks and routing protocols

 Routing Decisions:
 Switches/Routers: Nodes in the network that make routing decisions, selecting the
appropriate link to forward a message.
 Hops: The number of steps a message takes to reach its destination, through intermediate
nodes.

Network and parallel algorithm design directly impacts:


• Latency (time for data to travel)
• Bandwidth (data transfer rate)
• Scalability (how well the system grows with added nodes)
• Fault Tolerance (handling link/node failures)
Implications of Interconnection networks and routing
protocols for parallel computing

Performance Metrics
 Latency: The time it takes for a message to travel from one processor
to another. Network latency can stall parallel computations.
 Bandwidth: The amount of data that can be transferred per unit of time. Congestion
at shared links (e.g., a bus or the root link of a tree) reduces effective bandwidth.
 Scalability: The ability of the network to maintain performance as the
number of processors increases.
 Cost: The hardware and software costs associated with the
interconnection network and routing protocol.
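 A common first-order way to combine the latency and bandwidth metrics (an assumption of this sketch, not stated on the slide) is the model t = latency + message_size / bandwidth, shown below with hypothetical numbers:
 #include <stdio.h>
 /* Estimated time to move a message under the simple latency + size/bandwidth model. */
 double transfer_time(double latency_s, double bandwidth_Bps, double bytes) {
 return latency_s + bytes / bandwidth_Bps;
 }
 int main(void) {
 double lat = 1e-6, bw = 10e9; // hypothetical: 1 microsecond latency, 10 GB/s link
 printf("1 KB message:   %.2e s\n", transfer_time(lat, bw, 1e3)); // latency-dominated
 printf("100 MB message: %.2e s\n", transfer_time(lat, bw, 1e8)); // bandwidth-dominated
 return 0;
 }
 Small messages are dominated by latency and large messages by bandwidth, which is why both metrics matter when evaluating an interconnect.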
Thank you…!!!
