
PARALLEL COMPUTING

Chapter Three: Parallel Computer Architectures

-Introduction
-Parallel Architectures
-Shared memory systems and cache coherence
-Distributed-memory systems
-Hybrid memory systems
-Interconnection networks and routing protocols
-In parallel computing, interconnection networks and routing protocols facilitate
communication and data transfer between processors and memory, enabling efficient
computation.
-Interconnection networks, which can be static or dynamic, define the physical links,
while routing protocols determine how data packets travel through the network.
Introduction
 Parallel computer architecture refers to the design and organization of
computer systems that use multiple processors or cores to perform
computations simultaneously, thereby improving performance and
efficiency.
 Parallel architectures refer to the hardware design where multiple processors
or cores execute instructions simultaneously to increase computational
speed.
 Parallel computer architecture is designed to exploit the power of multiple
processors working together to solve problems more efficiently than a single
processor could.
 It's a foundational concept behind high-performance computing (HPC),
supercomputing, and the modern multi-core processors found in most
computers today.
Introduction
 Parallel computing architecture involves designing systems where multiple
processors or cores work concurrently to execute tasks and solve problems,
significantly enhancing speed and efficiency, especially for complex and
data-intensive applications.

 In parallel computing, the architecture comprises essential components such as
processors, memory hierarchy, interconnects, and software stack.
 These components work together to facilitate efficient communication, data
processing, and task coordination across multiple processing units.

 Understanding the roles and interactions of these components is crucial for
designing and optimizing parallel computing systems.
Parallel C program with OpenMP


 #include <stdio.h>
 #include <omp.h>
 int main() {
 printf("Hello, world:"); // prints the message "Hello, world:" to the console.
 #pragma omp parallel // This directive tells the compiler to execute the following block of code in parallel.
 printf(" %d", omp_get_thread_num()); // Inside the parallel region, omp_get_thread_num() returns the unique thread ID of the thread currently executing this line.
 printf("\n"); // prints a newline after all threads have printed
 return 0; // ends the program
 }
Parallel C program with OpenMP


 OpenMP directives used in parallel programming, followed by examples
 #pragma omp parallel:
 This is the primary directive to specify a parallel region. It indicates that the following
block of code should be executed by multiple threads in parallel.
 #include <stdio.h>
 #include <omp.h>
 int main() {
 #pragma omp parallel
 {
 printf("Hello from thread %d\n", omp_get_thread_num());
 }
 return 0;
 }

 The code inside the #pragma omp parallel block is executed by multiple
threads. omp_get_thread_num() retrieves the unique ID of each thread.
Parallel C program with OpenMP


 #pragma omp for:
 This directive is used to parallelize a loop. It splits iterations of the loop across multiple
threads.
 #include <stdio.h>
 #include <omp.h>

 int main() {
 int i;
 #pragma omp parallel for
 for (i = 0; i < 10; i++) {
 printf("Thread %d processes iteration %d\n", omp_get_thread_num(), i);
 }
 return 0;
 }
 The iterations of the loop are distributed among the available threads. (The example uses the combined #pragma omp parallel for directive, which creates the parallel region and splits the loop in a single step.)
Parallel C program with OpenMP


 #pragma omp critical:
 This directive specifies that a block of code should be executed by only one thread at a
time, ensuring that no two threads enter the critical section simultaneously. This is useful
for protecting shared resources.

 #pragma omp barrier:


 This directive creates a barrier in the code where all threads must synchronize before any
thread can continue. It ensures that all threads reach this point before proceeding.

 #pragma omp atomic:


 This directive ensures that a specific update to a variable is done atomically, preventing
race conditions without the overhead of a critical section.
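 As an illustrative sketch (not part of the original slides), the following C program combines the three directives: #pragma omp atomic protects the shared sum, #pragma omp barrier makes every thread wait until the loop is finished (redundant here, since #pragma omp for already ends with an implicit barrier, but shown for clarity), and #pragma omp critical serializes the printing.
 #include <stdio.h>
 #include <omp.h>
 int main() {
 int sum = 0; // shared accumulator
 #pragma omp parallel
 {
 #pragma omp for // split the loop iterations among the threads
 for (int i = 1; i <= 100; i++) {
 #pragma omp atomic // atomic update: no data race on sum
 sum += i;
 } // implicit barrier at the end of the omp for loop
 #pragma omp barrier // explicit barrier: every thread waits here
 #pragma omp critical // only one thread prints at a time
 printf("Thread %d sees sum = %d\n", omp_get_thread_num(), sum);
 }
 return 0;
 }
 Every thread should print the same final value (5050), because the barrier guarantees that all atomic updates have completed before any thread reads sum.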
Components of Parallel Computing Architecture

 Processors are the central processing units responsible for executing instructions
and performing computations in parallel computing systems.
 Different types of processors, such as CPUs, GPUs, and APUs, offer varying
degrees of parallelism and computational capabilities.
 Central Processing Units (CPU)
Multi-core CPUs: These CPUs feature multiple processing cores
integrated onto a single chip, allowing parallel execution of tasks.
Each core can independently execute instructions, enabling higher
performance and efficiency in multi-threaded applications.
Multi-threaded CPUs: Multi-threaded CPUs support the simultaneous
execution of multiple threads within each core. This feature enhances
throughput and responsiveness by overlapping the execution of
multiple tasks, particularly in applications with parallelizable
workloads.
Components of Parallel Computing Architecture

 Graphics Processing Units (GPU)


 Stream processors: GPUs consist of numerous stream processors,
also known as shader cores, responsible for executing computational
tasks in parallel. These processors are optimized for data-parallel
operations and are particularly well-suited for graphics rendering,
scientific computing, and machine learning tasks.

CUDA cores: CUDA (Compute Unified Device Architecture) cores are specialized
processing units found in NVIDIA GPUs. These cores are designed to execute parallel
computing tasks programmed using the CUDA parallel computing platform and
application programming interface (API). CUDA cores offer high throughput and
efficiency for parallel processing workloads.
Components of Parallel Computing Architecture

 Accelerated Processing Units (APU)


CPU cores: Accelerated Processing Units (APUs) integrate
both CPU and GPU cores on a single chip. The CPU cores
within APUs are responsible for general-purpose computing
tasks, such as executing application code, handling system
operations, and managing memory.

GPU cores: Alongside CPU cores, APUs also include GPU cores optimized for parallel
computation and graphics processing. These GPU cores provide accelerated performance
for tasks such as image rendering, video decoding, and parallel computing workloads.
Components of Parallel Computing Architecture

 Registers
 General-purpose registers: Registers directly accessible by the CPU cores for storing
temporary data and intermediate computation results.
 Special-purpose registers: Registers dedicated to specific functions, such as program
counter, stack pointer, and status flags, essential for CPU operations and control flow.
 Cache Memory
 L1 Cache: Level 1 cache located closest to the CPU cores, offering fast access to
frequently accessed data and instructions.
 L2 Cache: Level 2 cache situated between L1 cache and main memory, providing larger
storage capacity and slightly slower access speeds.
 L3 Cache: Level 3 cache shared among multiple CPU cores, offering a larger cache size
and serving as a shared resource for improving data locality and reducing memory access
latency.
Components of Parallel Computing Architecture

 Main Memory (RAM)


 Dynamic RAM (DRAM): Main memory modules composed of dynamic
random-access memory cells, used for storing program instructions and
data during program execution.
 Static RAM (SRAM): Caches and buffer memory within the memory hierarchy, offering
faster access speeds and lower latency than DRAM because SRAM stores data in
flip-flops, whereas DRAM stores data in capacitor-and-transistor cells that must be
charged and discharged (refreshed), which increases latency. SRAM is also located
closer to the CPU than DRAM.
 Video RAM (VRAM): Dedicated memory on GPUs used for storing
textures, framebuffers, and other graphical data required for rendering
images and videos. VRAM enables high-speed access to graphics data
and enhances the performance of GPU-accelerated applications.
Components of Parallel Computing Architecture
Secondary Storage (Disk)
 Hard Disk Drives (HDDs): Magnetic storage devices used for long-term data storage and
retrieval in parallel computing systems. HDDs provide high-capacity storage but slower access
speeds compared to main memory.
 HDDs have slower read and write speeds and higher latency than SSDs, making them less
suitable for parallel computations that demand speed.
 HDDs have slower random-access times compared to SSDs, which can hinder performance in
parallel computing scenarios.
 SSDs tend to be more expensive per gigabyte than HDDs.
 Solid State Drives (SSDs): Flash-based storage devices offer faster access speeds and lower
latency than HDDs. SSDs are commonly used as secondary storage in parallel computing
systems to improve I/O performance and reduce data access latency.
 SSDs offer much faster read and write speeds compared to HDDs, leading to quicker data
access and processing times.
 SSDs have lower latency (the time it takes to access data), which is crucial for parallel
computations where rapid data retrieval is essential.
 SSDs excel at random access, meaning they can quickly access data from any location on the
drive, a key advantage for parallel processing that often involves accessing data from various
locations.
Components of Parallel Computing Architecture

 Buses
 System Bus: Connects the CPU, memory, and other internal
components within a computer system. It facilitates
communication and data transfer between these components.

 Memory Bus: Dedicated bus for transferring data between the CPU
and main memory (RAM). It ensures fast and efficient access to
memory resources.

 I/O Bus: Input/Output bus connects peripheral devices, such as storage devices,
network interfaces, and accelerators, to the CPU and memory in a parallel computing
system.
Components of Parallel Computing Architecture

 Switches
 Crossbar Switches: High-performance switches that provide multiple paths for data transmission
between input and output ports. They enable simultaneous communication between multiple pairs of
devices, improving bandwidth and reducing latency.
 Packet Switches: Switches that forward data in discrete packets based on destination addresses.
They efficiently manage network traffic by dynamically allocating bandwidth and prioritizing
packets based on quality of service (QoS) parameters.

 Networks
 Ethernet: A widely used networking technology for local area networks (LANs) and wide area
networks (WANs). It employs Ethernet cables and switches to transmit data packets between
devices within a network.
 InfiniBand: A high-speed interconnect technology commonly used in high-performance computing
(HPC) environments. It offers low-latency, high-bandwidth communication between compute nodes
in clustered systems.
 Fibre Channel: A storage area network (SAN) technology that enables high-speed data transfer
between servers and storage devices over fiber optic cables. It provides reliable and scalable
connectivity for enterprise storage solutions.
One definition of parallel architecture
 A parallel computer is a collection of processing elements that cooperate to
solve large problems fast.
 Key issues:
 Resource Allocation
 how large a collection?
 how powerful are the elements?
 how much memory?

 Data access, Communication and Synchronization


 how do the elements cooperate and communicate?
 how are data transmitted between processors?
 what are the abstractions and primitives for cooperation?

 Performance and Scalability


 how does it all translate into performance?
 how does it scale?
Why study parallel arch & programming models?
 The Answer 15 or more years ago:
 Because it allows you to achieve performance beyond what we get with CPU clock frequency
scaling

 The Answer Today:


 Because it seems to be the best available way to achieve higher performance in the foreseeable
future
 CPU clock rates are no longer increasing! The higher the clock speed, the more heat is
generated, and we have now hit a stage where it is no longer efficient to increase
processor speed due to the amount of energy that goes into cooling it.
 Instruction-level-parallelism is not increasing either!
 Improving performance further on sequential code becomes very complicated +
diminishing returns
 Without explicit parallelism or architectural specialization, performance becomes a zero-sum game.
 Specialization is more disruptive than parallel programming (and is mostly about parallelism
anyway)
Parallel Architectures
 The common parallel architectures:
 Shared Memory
 Distributed Memory
 Hybrid Memory
 Historically, parallel architectures were tightly coupled to specific programming models,
leading to divergent architectures with no clear, and predictable growth pattern.
 The development of parallel computer architectures and the programming models used to
write software for them were closely intertwined.
 As a result of this tight coupling, different parallel architectures emerged, each with
unique characteristics and programming models, leading to a landscape of diverse and
sometimes incompatible systems.
 The lack of a unified approach to parallel computing and the proliferation of different
architectures and programming models resulted in a pattern of growth that was not easily
predictable or scalable.
 Each new architecture often required a new set of programming tools and techniques,
hindering widespread adoption and standardization.
What is the solution for decoupling?
◼ What is the solution to the problem of tightly coupled parallel architectures and
programming models, which led to divergent and sometimes incompatible systems?
◼ Focus on standardization, abstraction, and portability to break away from the
historical tight coupling between hardware and software that produced the common
parallel architectures listed above.
◼ "Tightly coupled to a specific programming model" means the hardware design is
optimized for one particular way of expressing parallelism.
◼ The introduction of high-level programming models has been a critical step in decoupling
parallel hardware from specific software paradigms.
◼ Programming models like OpenMP and CUDA offer higher-level abstractions, allowing
developers to focus on parallelism without worrying about the specific hardware being
used.
◼ These models provide a consistent interface to parallelize code, whether you're running on
a multi-core CPU, a GPU, or even a distributed system.
What is the solution for decoupling?
◼ OpenMP: A directive-based parallel programming model that enables multi-
threading within shared-memory systems. By using compiler directives,
OpenMP allows for parallel execution without needing to worry about the
underlying architecture.
◼ CUDA: A parallel computing platform and application programming interface
(API) that allows developers to write software that can run on GPUs,
abstracting away the hardware-specific intricacies.
◼ Message Passing Interface (MPI): For distributed-memory systems, MPI standardized
how processes communicate in a parallel environment.
◼ By abstracting the communication between processors, MPI allowed for a more
portable way of writing parallel programs that could run on different distributed
architectures, without being tightly coupled to the underlying hardware.
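◼ As a minimal, hedged sketch of the message-passing style that MPI standardizes (this example is not from the slides), each process below simply reports its rank and the total number of processes:
 #include <stdio.h>
 #include <mpi.h>
 int main(int argc, char *argv[]) {
 int rank, size;
 MPI_Init(&argc, &argv); // start the MPI runtime
 MPI_Comm_rank(MPI_COMM_WORLD, &rank); // unique ID (rank) of this process
 MPI_Comm_size(MPI_COMM_WORLD, &size); // total number of processes
 printf("Hello from process %d of %d\n", rank, size);
 MPI_Finalize(); // shut down the MPI runtime
 return 0;
 }
◼ Compiled with an MPI wrapper compiler (e.g. mpicc) and launched with mpirun or mpiexec, the same source runs unchanged on a laptop or on a cluster, which is exactly the portability described above.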
Parallel Architectures
◼ Parallel architecture extends traditional computer architecture by introducing
mechanisms for communication and cooperation among processing elements.
◦ OLD: Instruction Set Architecture--classical computer architecture focuses
on instruction execution
◦ NEW: Communication Architecture--emphasizes how multiple processing
units interact to solve large-scale problems efficiently. Communication
architecture is the foundation of parallel and distributed computing
systems.
◼ Communication architecture defines
◦ Critical abstractions, boundaries, and primitives (interfaces)
◦ Organizational structures that implement interfaces (hw or sw)
◼ Abstractions (logical models for interaction)
◼ Boundaries (separation of responsibilities)
◼ Primitives (basic operations for communication)
◼ Organizational Structures (how interfaces are implemented in HW/SW)
Shared Memory Architecture
 Shared memory parallel computers vary widely, but generally have in common the
ability for all processors to access all memory as global address space.

 Multiple processors can operate independently but share the same memory resources.
 Changes in a memory location effected by one processor are visible to all other
processors.
Shared Memory Architecture
 Shared memory machines can be divided into two main classes based upon memory
access times: UMA and NUMA.
 Multiple processors share access to a common memory space.
 This architecture simplifies communication and data sharing among processors but
requires mechanisms for synchronization and mutual exclusion to prevent data
hazards.
Uniform Memory Access (UMA):
 All processors in a UMA system have equal access times to any memory location.
 Sometimes called CC-UMA - Cache Coherent UMA.
 Challenge in shared memory systems: Multiple processors may have cached copies of the
same memory location, leading to inconsistency if one modifies it. Solution: cache
coherence protocols such as MESI.
 MESI protocol
 Modified (M): Only one cache has the modified data; must write back before others read.
 Exclusive (E): Only one cache has the data, but it’s unmodified (clean).
 Shared (S): Multiple caches can read but not write.
 Invalid (I): The cache line is invalid and must be fetched.
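 As an illustrative sketch only (the real protocol is implemented in the cache hardware and also involves bus transactions not modeled here), the following C fragment encodes the four MESI states and two simplified transitions:
 #include <stdio.h>
 typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_state;
 /* New state of a cache line after the LOCAL core writes to it (a real controller also issues the matching bus transactions). */
 mesi_state on_local_write(mesi_state s) {
 switch (s) {
 case MODIFIED: return MODIFIED; // already dirty and exclusively owned
 case EXCLUSIVE: return MODIFIED; // clean and exclusive: upgrade silently
 case SHARED: return MODIFIED; // other caches' copies are invalidated first
 case INVALID: return MODIFIED; // line is fetched from memory, then owned dirty
 }
 return INVALID;
 }
 /* New state of this cache's copy when ANOTHER core writes the same line. */
 mesi_state on_remote_write(mesi_state s) {
 (void)s;
 return INVALID; // any local copy becomes stale
 }
 int main(void) {
 const char *name[] = { "Modified", "Exclusive", "Shared", "Invalid" };
 printf("Shared + local write -> %s\n", name[on_local_write(SHARED)]);
 printf("Exclusive + remote write -> %s\n", name[on_remote_write(EXCLUSIVE)]);
 return 0;
 }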
Shared Memory: UMA vs. NUMA
• Cache coherent means if one processor updates a location in shared memory,
all the other processors know about the update.
• Cache coherence ensures that all processors or cores in a multi-core system
see a consistent view of shared data, preventing data inconsistencies and
ensuring reliable program execution when multiple caches store copies of
the same data.

• The Problem: In a multi-core system, each core has its own cache memory
to store frequently accessed data for faster retrieval. If multiple cores share
the same data, and one core modifies its cached copy, other cores might still
have outdated versions of that data, leading to inconsistencies and errors.

• The Solution: Cache coherence protocols are mechanisms that ensure all
caches maintain a consistent view of shared data. These protocols detect
when a shared data block is modified by one core and propagate the changes
to other caches, ensuring that all cores see the updated data.
Shared Memory: UMA vs. NUMA
 Non-Uniform Memory Access (NUMA):
• Each processor has its own local memory, and access to that local memory is faster than
accessing memory on another processor's board (remote memory).
• NUMA is often used in systems with multiple SMPs linked together.
• Suitable for real-time and time-critical applications where faster access to local data is
crucial.
• Not all processors have equal access time to all memories
• If cache coherency is maintained, then may also be called CC-NUMA - Cache Coherent
NUMA
Shared Memory Architecture
 Any processor can directly reference any memory location
 Communication occurs implicitly as result of loads and stores
 Convenient:
 Location transparency (don’t need to worry about physical placement of
data)
 Similar programming model to time-sharing on uniprocessors
 Except processes run on different processors
 Good throughput on multiprogrammed workloads
 Naturally provided on wide range of platforms
 History dates at least to precursors of mainframes in early 60s
 Wide range of scale: few to hundreds of processors
 Popularly known as shared-memory machines / model
 Ambiguous: memory may be physically distributed among processors
Shared Memory: Pro and Con
 Advantages
• Global address space provides a user-friendly programming perspective to
memory
• Data sharing between tasks is both fast and uniform due to the proximity of
memory to CPUs

 Disadvantages:
• Primary disadvantage is the lack of scalability between memory and CPUs.
Adding more CPUs can geometrically increase traffic on the shared memory-CPU
path and, for cache coherent systems, geometrically increase traffic associated
with cache/memory management.
• Programmer responsibility for synchronization constructs that ensure "correct"
access of global memory.
• Expense: it becomes increasingly difficult and expensive to design and
produce shared memory machines with ever increasing numbers of processors.
Distributed Memory
 Like shared memory systems, distributed memory systems vary widely but share a
common characteristic.
 Distributed memory systems require a communication network to connect inter-
processor memory.
 Distributed-memory architecture comprises multiple independent processing units,
each with its own memory space.
 Communication between processors is achieved through message passing over a
network.
 This architecture offers scalability and fault tolerance but requires explicit data
distribution and communication protocols.
 Processors have their own local memory. Memory addresses in one processor do not
map to another processor, so there is no concept of global address space across all
processors.
 Because each processor has its own local memory, it operates independently.
 Changes it makes to its local memory have no effect on the memory of other
processors. Hence, the concept of cache coherency does not apply.
Distributed Memory
 When a processor needs access to data in another processor, it is
usually the task of the programmer to explicitly define how and when
data is communicated. Synchronisation between tasks is likewise the
programmer's responsibility.

 The network "fabric" used for data transfer varies widely, though it can be as
simple as Ethernet.
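 To make this explicit communication concrete, here is a hedged MPI sketch (not from the slides, and assuming the program is launched with exactly two processes): process 0 owns an array in its local memory, sends half of it to process 1, and each process then sums only its own half.
 #include <stdio.h>
 #include <mpi.h>
 #define N 8
 int main(int argc, char *argv[]) {
 int rank, half[N / 2], partial = 0;
 MPI_Init(&argc, &argv);
 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
 if (rank == 0) {
 int data[N] = {1, 2, 3, 4, 5, 6, 7, 8}; // exists only in rank 0's local memory
 MPI_Send(&data[N / 2], N / 2, MPI_INT, 1, 0, MPI_COMM_WORLD); // ship the 2nd half
 for (int i = 0; i < N / 2; i++) half[i] = data[i]; // keep the 1st half
 } else if (rank == 1) {
 MPI_Recv(half, N / 2, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
 } else {
 MPI_Finalize(); // extra ranks (if any) do nothing in this sketch
 return 0;
 }
 for (int i = 0; i < N / 2; i++) partial += half[i]; // purely local computation
 printf("Rank %d computed partial sum %d\n", rank, partial);
 MPI_Finalize();
 return 0;
 }
 Nothing is shared: if rank 1 needs rank 0's data, the programmer must send it explicitly, exactly as stated above.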
Distributed Memory: Pro and Con
 Advantages
• Memory is scalable with number of processors. Increase the number of
processors and the size of memory increases proportionately.
• Each processor can rapidly access its own memory without interference and
without the overhead incurred with trying to maintain cache coherency.
• Cost effectiveness: can use commodity, off-the-shelf processors and networking.

 Disadvantages
• The programmer is responsible for many of the details associated with data
communication between processors.
• It may be difficult to map existing data structures, based on global memory, to
this memory organization.
• Non-uniform memory access (NUMA) times
Hybrid Distributed-Shared Memory
Comparison of Shared and Distributed Memory Architectures

Architectures compared: CC-UMA, CC-NUMA, and Distributed memory.

Examples:
  CC-UMA:      SMPs, Sun Vexx, DEC/Compaq, SGI Challenge, IBM POWER3
  CC-NUMA:     Bull NovaScale, SGI Origin, Sequent, HP Exemplar, DEC/Compaq, IBM POWER4 (MCM)
  Distributed: Cray T3E, Maspar, IBM SP2, IBM BlueGene

Communications:
  CC-UMA:      MPI, Threads, OpenMP, shmem
  CC-NUMA:     MPI, Threads, OpenMP, shmem
  Distributed: MPI

Scalability:
  CC-UMA:      to 10s of processors
  CC-NUMA:     to 100s of processors
  Distributed: to 1000s of processors

Drawbacks:
  CC-UMA:      Memory-CPU bandwidth
  CC-NUMA:     Memory-CPU bandwidth; non-uniform access times
  Distributed: System administration; programming is hard to develop and maintain

Software Availability:
  CC-UMA:      many 1000s of ISVs
  CC-NUMA:     many 1000s of ISVs
  Distributed: 100s of ISVs
Hybrid Distributed-Shared Memory
 The largest and fastest computers in the world today employ both
shared and distributed memory architectures.

 The shared memory component is usually a cache coherent SMP machine. Processors on
a given SMP can address that machine's memory as global.
 Hybrid architectures combine elements of both shared-memory and
distributed-memory systems.
 These architectures leverage the benefits of shared-memory parallelism
within individual nodes and distributed-memory scalability across multiple
nodes, making them suitable for a wide range of applications.
Hybrid Distributed-Shared Memory
 The distributed memory component is the networking of multiple
SMPs. SMPs know only about their own memory - not the memory
on another SMP.
 Therefore, network communications are required to move data from
one SMP to another.
 Current trends seem to indicate that this type of memory architecture
will continue to prevail and increase at the high end of computing for
the foreseeable future.
 Advantages and Disadvantages: whatever is common to both shared
and distributed memory architectures.
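 A common way to program such machines (a hedged sketch, not from the slides) is to combine MPI across nodes with OpenMP threads inside each node:
 #include <stdio.h>
 #include <mpi.h>
 #include <omp.h>
 int main(int argc, char *argv[]) {
 int rank, provided;
 MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided); // only the main thread will call MPI
 MPI_Comm_rank(MPI_COMM_WORLD, &rank); // distributed-memory part: one MPI process per node
 #pragma omp parallel // shared-memory part: threads within the node
 printf("MPI rank %d, OpenMP thread %d\n", rank, omp_get_thread_num());
 MPI_Finalize();
 return 0;
 }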
Interconnection networks and routing protocols
 Interconnection networks and routing protocols are the backbone of parallel and
distributed systems, determining how efficiently processors, memory, and
accelerators communicate.
 Interconnection networks connect processors and memory in a parallel computer,
facilitating communication and data transfer.
 Static (Direct) Networks: Use point-to-point communication links, where each node is
directly connected to others.
 Dynamic (Indirect) Networks: Employ switches to connect nodes dynamically, allowing for
more flexible communication paths.
Interconnection Network Topologies
 Buses: A simple topology where all processors share a common bus for data exchange.
 Crossbar: Provides a direct connection between any pair of nodes, but can be expensive and
power-intensive.
 Multistage: Uses multiple stages of switches to connect nodes, offering scalability and
flexibility.
Interconnection networks and routing protocols
 Meshes: Nodes are arranged in a grid-like structure, allowing for efficient
communication between neighboring nodes.
 Hypercubes: Nodes are connected in a way that each node has a neighbor for each
bit position in its address.

Routing Protocols
 Routing protocols determine the path a message takes through the network to
reach its destination.
 Dimension-Ordered Routing: A common technique for meshes and hypercubes,
where messages follow a specific dimension order.
 XY Routing: A specific dimension-ordered routing technique for two-dimensional
meshes.
 E-cube Routing: A specific dimension-ordered routing technique for hypercubes.
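 As a small illustrative sketch (not from the slides), XY routing on a 2D mesh can be expressed as: first correct the X coordinate, then the Y coordinate. The function below prints the sequence of intermediate nodes (hops) a message would visit:
 #include <stdio.h>
 #include <stdlib.h>
 void xy_route(int sx, int sy, int dx, int dy) {
 int x = sx, y = sy;
 printf("(%d,%d)", x, y);
 while (x != dx) { // phase 1: travel along the X dimension
 x += (dx > x) ? 1 : -1;
 printf(" -> (%d,%d)", x, y);
 }
 while (y != dy) { // phase 2: travel along the Y dimension
 y += (dy > y) ? 1 : -1;
 printf(" -> (%d,%d)", x, y);
 }
 printf("   [%d hops]\n", abs(dx - sx) + abs(dy - sy));
 }
 int main(void) {
 xy_route(0, 0, 3, 2); // route from node (0,0) to node (3,2)
 return 0;
 }
 E-cube routing applies the same idea to a hypercube: the message corrects its address one bit position (dimension) at a time, in a fixed order.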
Interconnection networks and routing protocols

 Routing Decisions:
 Switches/Routers: Nodes in the network that make routing decisions, selecting the
appropriate link to forward a message.
 Hops: The number of steps a message takes to reach its destination, through intermediate
nodes.

Network and parallel algorithm design directly impacts:


• Latency (time for data to travel)
• Bandwidth (data transfer rate)
• Scalability (how well the system grows with added nodes)
• Fault Tolerance (handling link/node failures)
Implications of Interconnection networks and routing
protocols for parallel computing

Performance Metrics
 Latency: The time it takes for a message to travel from one processor
to another. Network latency can stall parallel computations.
 Bandwidth: The amount of data that can be transferred per unit of time. Congestion
at shared links (e.g., a bus or the root link of a tree) reduces effective bandwidth.
 Scalability: The ability of the network to maintain performance as the
number of processors increases.
 Cost: The hardware and software costs associated with the
interconnection network and routing protocol.
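 A common first-order way to combine the latency and bandwidth metrics (an assumption of this sketch, not stated on the slide) is the model t = latency + message_size / bandwidth, shown below with hypothetical numbers:
 #include <stdio.h>
 /* Estimated time to move a message under the simple latency + size/bandwidth model. */
 double transfer_time(double latency_s, double bandwidth_Bps, double bytes) {
 return latency_s + bytes / bandwidth_Bps;
 }
 int main(void) {
 double lat = 1e-6, bw = 10e9; // hypothetical: 1 microsecond latency, 10 GB/s link
 printf("1 KB message:   %.2e s\n", transfer_time(lat, bw, 1e3)); // latency-dominated
 printf("100 MB message: %.2e s\n", transfer_time(lat, bw, 1e8)); // bandwidth-dominated
 return 0;
 }
 Small messages are dominated by latency and large messages by bandwidth, which is why both metrics matter when evaluating an interconnect.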
Thank you…!!!
