
Parallel Programming

Domain Specific Core

B.Sc(H)-VI Sem
What is Parallel Computing?

● Parallel computing is a type of computation in which many calculations are performed at the same time.
● Basic principle: computation can be divided into smaller subproblems, each of which can be solved simultaneously.
● Assumption: we have parallel hardware at our disposal, which is capable of executing these computations in parallel.
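
As a minimal sketch of the "smaller subproblems" principle (the function and array names are illustrative, not from the slides), the sum of an array splits into two independent partial sums that could run on different processors:

/* Each call touches a disjoint range of x, so the two calls below are
 * independent subproblems that can be solved simultaneously. */
double sum_range(const double *x, int lo, int hi) {
    double s = 0.0;
    for (int i = lo; i < hi; i++)
        s += x[i];
    return s;
}

/* total = sum_range(x, 0, n/2) + sum_range(x, n/2, n);
   the two partial sums can be computed in parallel and then combined. */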
Why Parallel Computing?

Parallel programming is much harder than sequential programming.

► Separating sequential computations into parallel subcomputations can be challenging, or even impossible.
► Ensuring program correctness is more difficult, due to new types of errors.

Speedup is the only reason why we bother paying for this complexity.
Parallel Programming vs. Concurrent Programming

Parallelism and concurrency are closely related concepts.

● A parallel program uses parallel hardware to execute computation more quickly. Efficiency is its main concern.
● A concurrent program may or may not execute multiple computations at the same time. Its concerns are modularity, responsiveness and maintainability.

PARALLEL (SPEEDUP)
- division into subproblems
- optimal use of parallel hardware

CONCURRENT (CONVENIENCE)
- when can an execution start
- how can information exchange occur
- how to manage access to shared resources
Advantages of Parallel Computing

Advantages of Parallel Computing over Serial Computing are as follows:

1. It saves time and money, as many resources working together reduce the time and cut potential costs.
2. It can be impractical to solve larger problems on Serial Computing.
3. It can take advantage of non-local resources when the local resources are finite.
4. Serial Computing 'wastes' the potential computing power, whereas Parallel Computing makes better use of the hardware.
Types of Parallelism:

1. Bit-level parallelism –
It is the form of parallel computing based on increasing the processor's word size. It reduces the number of instructions that the system must execute in order to perform a task on large-sized data.
Example: Consider a scenario where an 8-bit processor must compute the sum of two 16-bit integers. It must first sum up the 8 lower-order bits, then add the 8 higher-order bits, thus requiring two instructions to perform the operation. A 16-bit processor can perform the operation with just one instruction.
2. Instruction-level parallelism –
A processor can issue only a limited number of instructions in each clock cycle. Instructions can be re-ordered and grouped so that they are later executed side by side without affecting the result of the program. This is called instruction-level parallelism.
3. Task Parallelism –
Task parallelism employs the decomposition of a task into subtasks and then allocating
each of the subtasks for execution. The processors perform the execution of sub-tasks
side by side.
4. Data-level parallelism –

Instructions from a single stream operate side-by-side on several data – Limited by


non-regular data manipulation patterns and by memory bandwidth
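
A minimal C sketch of data-level parallelism (function and parameter names are illustrative assumptions): a single instruction stream applies the same operation to many data elements, which is exactly the pattern SIMD hardware and vectorizing compilers accelerate.

#include <stddef.h>

/* Same operation, different data: every iteration is independent, so a
 * SIMD unit can process several elements per instruction. */
void vector_add(const float *a, const float *b, float *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}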
Need of Parallel Computing

● The real world is dynamic in nature: many things happen at the same time in different places. The resulting data is extremely large and hard to manage.
● Real-world data needs more dynamic simulation and modeling, and for achieving this, parallel computing is the key.
● Parallel computing provides concurrency and saves time and money.
● Complex, large datasets and their management can be handled effectively only with a parallel computing approach.
● It ensures the effective utilization of resources. The hardware is used effectively, whereas in serial computation only part of the hardware is used and the rest sits idle.
● Also, it is impractical to implement real-time systems using serial computing.
Applications of Parallel Computing

● Databases and Data mining.


● Real-time simulation of systems.
● Science and Engineering.
● Advanced graphics, augmented reality, and virtual reality.
Limitations of Parallel Computing:

● Communication and synchronization between multiple sub-tasks and processes are difficult to achieve.
● The algorithms must be structured in such a way that they can be handled in a parallel mechanism.
● The algorithms or programs must have low coupling and high cohesion, but it is difficult to create such programs.
● Writing a parallelism-based program well requires more technically skilled and expert programmers.
Traditional Computer Architecture

A sequential computer system consists of three key components:

● Processor: Executes instructions


● Memory: Stores data and instructions
● Datapath: Transfers data between memory and processor

Problem: These components create bottlenecks that slow down overall system performance.

Solution: Over the years, several architectural innovations have been introduced to overcome
these bottlenecks.
The Role of Multiplicity in Architecture
Multiplicity (Parallelism) is a key innovation that improves performance.

● Processor Multiplicity: Processor multiplicity refers to having multiple processors or cores to execute instructions in parallel. This is the foundation of multiprocessing and multi-threading.
● Examples:
➔ Multi-Core CPUs (e.g., Intel Core i9, AMD Ryzen) → Each core can execute separate instructions
in parallel.
➔ Cluster Computing (e.g., Google’s data centers) → Multiple physical machines (nodes) work
together to process tasks.
➔ Distributed Computing (e.g., Apache Hadoop) → Tasks are distributed across multiple independent
computers in a network.
➔ Supercomputers (e.g., IBM Summit) → Thousands of processors work in parallel for high-performance computing.
● Datapath Multiplicity: Datapath multiplicity refers to having multiple execution units within a processor to enhance instruction throughput. This is primarily used in instruction-level parallelism (ILP) and enables faster data transfer.
● Examples:
➔ Superscalar Processors (e.g., Intel i7, ARM Cortex-A76) → Multiple execution units (ALU,
FPU) execute instructions in parallel.
➔ SIMD (Single Instruction Multiple Data) Processors (e.g., AVX instructions in Intel, NVIDIA
CUDA cores in GPUs) → The same instruction operates on multiple data elements
simultaneously.
➔ VLIW (Very Long Instruction Word) Processors (e.g., Texas Instruments DSPs) → Each
instruction contains multiple operations for parallel execution.
➔ Pipelined Processors (e.g., RISC architecture like ARM Cortex-A72) → Different stages of
instruction execution are handled concurrently.
● Memory Multiplicity: Memory multiplicity refers to the presence of multiple memory units or
access paths to allow concurrent memory operations. This improves memory bandwidth and
reduces bottlenecks.
● Examples:
➔ Multi-Level Caches (L1, L2, L3) (e.g., Intel & AMD CPUs) → Each core has its own cache for
faster access to frequently used data.
➔ Memory Interleaving (e.g., IBM POWER architecture) → Different memory banks are accessed
in parallel to improve read/write speed.
➔ NUMA (Non-Uniform Memory Access) Architecture (e.g., AMD EPYC, Intel Xeon) → Each
processor has its own memory, reducing memory contention.
➔ Dual-Channel and Quad-Channel Memory (e.g., DDR4, DDR5) → Multiple memory channels
allow simultaneous data access.
➔ HBM (High Bandwidth Memory) (e.g., NVIDIA H100, AMD Instinct MI300) → Stacked memory
with wide buses for high-speed access.
Types of Parallelism:

1. Implicit Parallelism: The system automatically handles parallel execution without the
programmer's involvement. Hidden from the programmer, handled automatically by the system.
2. Explicit Parallelism: The programmer is responsible for designing and implementing parallel
execution. Programmer has control over parallel execution.

Why is this important?

● More units working simultaneously = higher processing power


● Better resource utilization = reduced bottlenecks
Implicit Parallelism: Trends in Microprocessor Architectures

● Microprocessor technology has improved significantly over the years, particularly in clock speeds (processing speed).
● These increments in clock speed are severely diluted by the limitations of memory
technology.
● At the same time, higher levels of device integration have also resulted in a very large
transistor count, raising the obvious issue of how best to utilize them.
● Consequently, techniques that enable execution of multiple instructions in a single clock
cycle have become popular.
● Indeed, this trend is evident in the current generation of microprocessors such as the
Itanium, Sparc Ultra, MIPS, and Power4 .
Example:
● The Pentium 4 (2.0 GHz) uses a 20-stage pipeline for faster execution, meaning each instruction passes through 20 sequential processing stages before completion.
● Note that the speed of a single pipeline is ultimately limited by the largest atomic task in the
pipeline.
● Furthermore, in typical instruction traces, every fifth to sixth instruction is a branch instruction.
● Long instruction pipelines therefore need effective techniques for predicting branch destinations
so that pipelines can be speculatively filled.
● The penalty of a misprediction increases as the pipelines become deeper since a larger number
of instructions need to be flushed.
● These factors place limitations on the depth of a processor pipeline and the resulting
performance gains.
Deeper pipelines = Higher penalty for mispredictions.
An obvious way to improve instruction execution rate beyond this level is to use multiple
pipelines. During each clock cycle, multiple instructions are piped into the processor in parallel.
Scope of Parallelism

● Different applications utilize different aspects of parallelism - e.g., data-intensive applications utilize high aggregate throughput, server applications utilize high aggregate network bandwidth, and scientific applications typically utilize high processing and memory system performance.
● It is important to understand each of these performance bottlenecks and their
interacting effect.
Implicit Parallelism: Trends in Microprocessor Architectures
● Current processors use these resources in multiple functional units, and each instruction passes through several stages,
e.g.: Add R1, R2 → (i) Instruction Fetch, (ii) Instruction Decode, (iii) Instruction Execute
● Consider these two instructions:
Add R1, R2
Add R2, R3
The precise manner in which these instructions are selected and executed provides impressive diversity in architectures, with different performance characteristics suited to different purposes. (Each architecture has its own merits and pitfalls.)
Pipelining
• Pipelining overlaps various stages of instruction execution to achieve performance.
• In Pipelining instead of executing one instruction at a time, the processor overlaps
different stages of multiple instructions:
● Fetch
● Decode
● Execute
● Store
• At a high level of abstraction, an instruction can be executed while the next one is being
decoded and the next one is being fetched.
• This increases instruction throughput, allowing multiple instructions to be processed at
once.
• This is akin to an assembly line, e.g., for manufacture of cars.
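• A simple timing sketch of this overlap, assuming each of the four stages above takes one clock cycle (an idealization):

Cycle:     1      2      3      4      5      6      7
Instr 1:   Fetch  Decode Exec   Store
Instr 2:          Fetch  Decode Exec   Store
Instr 3:                 Fetch  Decode Exec   Store
Instr 4:                        Fetch  Decode Exec   Store

From cycle 4 onwards, one instruction completes in every cycle.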
Pipelining
• Pipelining, however, has several limitations.
• The speed of a pipeline is eventually limited by the slowest stage.

[Figure: an assembly-line pipeline with 20-second stages, illustrating that throughput is limited to 1 unit per slowest-stage time, regardless of the total end-to-end time.]

• Conventional processors rely on very deep pipelines (20-stage pipelines in state-of-the-art Pentium processors).
• However, in typical program traces, every 5th to 6th instruction is a conditional jump
(such as in if-else, switch-case)!
• This requires very accurate branch prediction.
• The penalty of a mis-prediction grows with the depth of the pipeline, since a larger
number of instructions will have to be flushed.
Superscalar Execution

● The penalty of a misprediction increases as the pipelines become deeper since a larger
number of instructions need to be flushed.
● These factors place limitations on the depth of a processor pipeline and the resulting
performance gains.
● An obvious way to improve instruction execution rate beyond this level is to use multiple
pipelines.
● During each clock cycle, multiple instructions are piped into the processor in parallel.
● These instructions are executed on multiple functional units.
Superscalar Execution

Example of a two-way superscalar execution of instructions.


Superscalar Execution: An Example
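
The code fragments for adding four numbers that are discussed below (and on the following slides) come from a figure that is not reproduced in these notes. A reconstruction consistent with the instructions named in the text is (treat it as a sketch, not the original figure):

(i)   load R1, @1000 ; load R2, @1008 ; add R1, @1004 ; add R2, @100C ; add R1, R2 ; store R1, @2000
(ii)  load R1, @1000 ; add R1, @1004 ; add R1, @1008 ; add R1, @100C ; store R1, @2000
(iii) load R1, @1000 ; add R1, @1004 ; load R2, @1008 ; add R2, @100C ; add R1, R2 ; store R1, @2000

Fragment (i) can issue its two loads together; fragment (ii) is a pure accumulate chain with no simultaneous issue; fragment (iii) contains the same instructions as (i) in a different order.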
● Consider the execution of the first code fragment for adding four numbers. The first and
second instructions are independent and therefore can be issued concurrently.
● This is illustrated in the simultaneous issue of the instructions load R1, @1000 and load
R2, @1008 at t = 0. The instructions are fetched, decoded, and the operands are
fetched.
● The next two instructions, add R1, @1004 and add R2,@100C are also mutually
independent, although they must be executed after the first two instructions.
● Consequently, they can be issued concurrently at t = 1 since the processors are
pipelined. These instructions terminate at t = 5. The next two instructions, add R1, R2
and store R1, @2000 cannot be executed concurrently since the result of the former
(contents of register R1) is used by the latter.
● Therefore, only the add instruction is issued at t = 2 and the store instruction at t = 3.
Note that the instruction add R1, R2 can be executed only after the previous two
instructions have been executed.
Superscalar Execution: An Example

• In the above example, there is some wastage of resources due to data dependencies.
• The example also illustrates that different instruction mixes with identical semantics can take significantly different execution time.
• Note that fragments (i), (ii) and (iii) all produce the same answer in @2000.
Dependency in Superscalar Execution

• Superscalar execution is subject to multiple kinds of dependencies:
– True Data Dependency
– Resource Dependency
– Branch Dependency
True Data Dependency:

● The results of an instruction may be required for subsequent instructions. This is referred
to as true data dependency.
● For instance, consider the second code fragment (ii) for adding four numbers. There is a true data dependency between load R1, @1000 and add R1, @1004, and similarly between subsequent instructions.
● Dependencies of this type must be resolved before simultaneous issue of instructions.
● This has two implications. First, since the resolution is done at runtime, it must be
supported in hardware. The complexity of this hardware can be high.
● Second, the amount of instruction level parallelism in a program is often limited and is a
function of coding technique.
● In the second code fragment, there can be no simultaneous issue, leading to poor
resource utilization.
Resource Dependency

● Another source of dependency between instructions results from the finite resources
shared by various pipelines.
● As an example, consider the co-scheduling of two floating point operations on a dual
issue machine with a single floating point unit.
● Although there might be no data dependencies between the instructions, they cannot be
scheduled together since both need the floating point unit.
● This form of dependency in which two instructions compete for a single processor
resource is referred to as resource dependency.
Branch Dependency

● The flow of control through a program enforces a third form of dependency between
instructions.
● Consider the execution of a conditional branch instruction.
● Since the branch destination is known only at the point of execution, scheduling
instructions a priori across branches may lead to errors.
● These dependencies are referred to as branch dependencies or procedural
dependencies and are typically handled by speculatively scheduling across branches
and rolling back in case of errors.
● Studies of typical traces have shown that on average, a branch instruction is
encountered between every five to six instructions. Therefore, just as in populating
instruction pipelines, accurate branch prediction is critical for efficient superscalar
execution.
● The ability of a processor to detect and schedule concurrent instructions is critical to superscalar performance. For instance, consider the third code fragment (iii), which also computes the sum of four numbers.
● In this case, there is a data dependency between the first two instructions – load R1, @1000 and add R1, @1004. Therefore, these instructions cannot be issued simultaneously.
● However, if the processor had the ability to look ahead, it would realize that it is possible
to schedule the third instruction – load R2, @1008 –
● with the first instruction. In the next issue cycle, instructions two and four can be
scheduled, and so on.
● In this way, the same execution schedule can be derived for the first and third code
fragments.
● However, the processor needs the ability to issue instructions out-of-order to
accomplish desired reordering. The parallelism available in in-order issue of
instructions can be highly limited as illustrated by this example.
● Most current microprocessors are capable of out-of-order issue and completion.
● This model, also referred to as dynamic instruction issue, exploits maximum
instruction level parallelism.
● The processor uses a window of instructions from which it selects instructions for
simultaneous issue. This window corresponds to the look-ahead of the scheduler.
Superscalar Execution: Efficiency Considerations
● The performance of superscalar architectures is limited by the available instruction level
parallelism.
● Consider again fragment (i). Cycles in which the execution units have no work to do are essentially wasted from the point of view of the execution unit. If, during a particular cycle, no instructions are issued on the execution units, it is referred to as vertical waste;
● if only part of the execution units are used during a cycle, it is termed horizontal waste.
Problems in Superscalar That VLIW Solves

Complex Hardware Scheduling

● Superscalar processors need complex control units to dynamically reorder instructions and resolve
dependencies.
● VLIW moves this responsibility to the compiler, simplifying processor design.

Branch Prediction and Dependency Handling Overhead

● Superscalar processors rely on branch prediction and speculative execution to keep pipelines full.
● VLIW avoids this overhead by statically scheduling instructions at compile time.

Energy and Power Consumption

● Superscalar processors require extra energy for out-of-order execution, instruction issue logic, and
branch prediction.
● VLIW reduces power consumption by eliminating complex hardware scheduling.
Very Long Instruction Word (VLIW) Processors

• The hardware cost and complexity of the superscalar scheduler is a major consideration in
processor design.
• To address this issue, VLIW processors rely on compile time analysis to identify and bundle together instructions that can be executed concurrently.
• These instructions are packed and dispatched together, and thus the name very long
instruction word is used.

Add R1, R2 | Sub R4, R3 | Mul R2, R5 | Div R1, R7    ← a single 4-way VLIW instruction word
How to bundle (pack) the instructions in VLIW?

add R1, R2 // add R2 to R1
sub R2, R1 // subtract R1 from R2
add R3, R4
sub R4, R3
Solution:

(1) Bundle 1: add R1, R2   sub R2, R1
    Bundle 2: add R3, R4   sub R4, R3

or

(2) Bundle 1: add R1, R2   add R3, R4
    Bundle 2: sub R2, R1   sub R4, R3
Very Long Instruction Word (VLIW) Processors: Considerations

• Hardware aspect is simpler.
• Compiler has a bigger context from which to select co-scheduled instructions. (More work for the compiler.)
• Branch and memory prediction is more difficult.
• VLIW performance is highly dependent on the compiler. A number of techniques such as
loop unrolling, speculative execution, branch prediction are critical.
• Typical VLIW processors are limited to 4-way to 8-way parallelism.
Limitations of Memory System Performance
The effective performance of a program on a computer relies not just on the speed of the processor but also on the ability of the memory system to feed data to the processor.
• Memory system performance is largely captured by two parameters, latency and bandwidth.
• Latency is the time from the issue of a memory request to the time the data is available at the processor (the waiting time until the first data item is received).
• Consider the example of a fire-hose. If the water comes out of the hose two seconds after the hydrant is turned on, the latency of the system is two seconds. If you want immediate response from the hydrant, it is important to reduce latency.
• Bandwidth is the rate at which data can be pumped to the processor by the memory system.
• Once the water starts flowing, if the hydrant delivers water at the rate of 5 gallons/second, the bandwidth of the system is 5 gallons/second. If you want to fight big fires, you need a high bandwidth of water.
1. What is a FLOP?

A FLOP is a single floating-point arithmetic operation (an addition, subtraction, multiplication, or division on decimal numbers). FLOPS (Floating Point Operations Per Second) measures how many such operations a computer can perform in one second.

2. What are MFLOPS and GFLOPS?

● MFLOPS (MegaFLOPS) = 1 million (10⁶) floating-point operations per second
● GFLOPS (GigaFLOPS) = 1 billion (10⁹) floating-point operations per second

3. Why is FLOPS important?

FLOPS is a measure of a computer's performance, especially in tasks requiring heavy mathematical computations like machine learning, simulations, graphics rendering, and scientific computing.
Memory Latency: An Example

○ Consider a processor operating at 1 GHz (1 ns clock) connected to a DRAM with a latency of 100 ns (no
caches). Assume that the processor has two multiply-add units and is capable of executing four
instructions in each cycle of 1 ns.
○ Since the memory latency is equal to 100 cycles and block size is one word, every time a memory
request is made, the processor must wait 100 cycles before it can start to process the data. This is a
serious drawback.

The processor can perform 4 floating-point operations per clock cycle, and it runs at 1 GHz (1 billion cycles per second).

● Since it can execute 4 operations per cycle, the peak performance is:

4 × 1,000,000,000 = 4,000,000,000 FLOPS

= 4 GFLOPS
Real Performance vs. Peak Performance

● Ideally, the processor should run at 4 GFLOPS.


● However, due to slow memory access, it can only perform one operation per 100 ns.
● That results in:

1 operation every 100 ns = 1 FLOP / 100 ns
                         = 1 FLOP / (100 × 10⁻⁹ s)
                         = 1 FLOP / 10⁻⁷ s
                         = 10⁷ FLOPS
                         = 10 MFLOPS
Seriousness of Memory Latency

● It is easy to see that the peak speed of this computation is limited to one floating point
operation every 100 ns, or a speed of 10 MFLOPS, a very small fraction of the peak
processor rating.
● This example highlights the need for effective memory system performance in achieving
high computation rates.
Improving Effective Memory Latency Using Caches

● Handling the mismatch in processor and DRAM speeds has motivated a number of
architectural innovations in memory system design.
● One such innovation addresses the speed mismatch by placing a smaller and faster memory
between the processor and the DRAM.
● This memory, referred to as the cache, acts as a low-latency high-bandwidth storage. The
data needed by the processor is first fetched into the cache.
● All subsequent accesses to data items residing in the cache are serviced by the cache. Thus,
in principle, if a piece of data is repeatedly used, the effective latency of this memory system
can be reduced by the cache.
● The fraction of data references satisfied by the cache is called the cache hit ratio of the
computation on the system.
● The effective computation rate of many applications is bounded not by the processing rate of
the CPU, but by the rate at which data can be pumped into the CPU.
● Such computations are referred to as being memory bound. The performance of memory
bound programs is critically impacted by the cache hit ratio.
Impact of Caches: Example

As in the previous example, consider a 1 GHz processor with a 100 ns latency DRAM. In this
case, we introduce a cache of size 32 KB with a latency of 1 ns or one cycle (typically on the
processor itself). We use this setup to multiply two matrices A and B of dimensions 32 x 32.

● Processor: 1 GHz (1 ns per cycle), capable of executing 4 FLOPs per cycle
● DRAM latency: 100 ns (= 100 cycles)
● Cache: 32 KB, latency 1 ns (one cycle)
● Task: multiplication of two 32 x 32 matrices
● Number of operations in matrix multiplication: 2n³
Total Execution Time = Time taken to fetch matrices A and B + Time taken for computation

Time taken to fetch matrices A and B:

Each matrix has 32 x 32 = 1024 elements = 1K words

2 matrices = 2K words to be loaded

Therefore, fetch time = 2K × 100 ns = 200 μs

Time taken for computation: 2n³

= 64K FLOPs (since n = 32)

Since the processor can execute 4 FLOPs per cycle,

total cycles = 64K / 4 = 16K cycles

Since 1 cycle = 1 ns (data served from the cache),

computation time = 16K × 1 ns = 16 μs

Total Execution Time = 200 μs + 16 μs = 216 μs


Real Performance

Performance = Operations / Total Execution Time

= 64K FLOPs / 216 μs

≈ 303 MFLOPS

Note that this is a thirty-fold improvement over the previous example (10 MFLOPS).
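
For reference, a minimal C sketch of the 32 x 32 multiplication being analysed (array names are illustrative): the innermost loop performs one multiply and one add per iteration, giving 2n³ = 64K floating-point operations in total.

#define N 32

/* Naive matrix multiplication: N*N output elements, each needing N
 * multiply-add pairs, i.e. 2*N*N*N = 2n^3 = 64K FLOPs for N = 32. */
void matmul(const float A[N][N], const float B[N][N], float C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];   /* one multiply + one add */
            C[i][j] = sum;
        }
}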


Impact of Memory Bandwidth

• Memory bandwidth is determined by the bandwidth (no. of bytes per second) of the memory
bus as well as the memory units.
• Memory bandwidth can be improved by increasing the size of memory blocks. This will increase
the size of the bus.
• It is important to note that increasing the block size does not change the latency of the system.
• In practice, wide data and address buses are expensive to construct.
• In a more practical system, consecutive words are sent on the memory bus on subsequent bus
cycles after the first word is retrieved. This reduces latency by half.
Alternate Approaches for Hiding Memory Latency

• Consider the problem of browsing the web on a very slow network connection. We deal with
the problem in one of two possible ways:
– we anticipate which pages we are going to browse ahead of time and issue requests for them
in advance;
– we open multiple browsers and access different pages in each browser, thus while we are
waiting for one page to load, we could be reading others; or

• The first approach is called prefetching, the second multithreading.


Multithreading for Latency Hiding
A thread is a single stream of control in the flow of a program. Let's illustrate threads with a simple example:
for (i = 0; i < n; i++)
c[i] = dot_product(get_row(a, i), b);

Each dot-product is independent of the other, and therefore represents a concurrent unit of
execution. We can safely rewrite the above code segment as:
for (i = 0; i < n; i++)
c[i] = create_thread(dot_product,get_row(a, i), b);

● Each dot product runs in a separate thread.


● While one thread waits for data, another thread starts its computation!
● Result: Useful work happens every cycle — hiding memory latency.
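
The create_thread call above is pseudocode from the slide; a hedged sketch of the same idea using POSIX threads (the names NROWS, NCOLS, row_job and dot_product_thread are illustrative assumptions) could look like this:

#include <pthread.h>
#include <stdio.h>

#define NROWS 4
#define NCOLS 8

/* Illustrative data: a is an NROWS x NCOLS matrix, b a vector of length NCOLS. */
static double a[NROWS][NCOLS], b[NCOLS], c[NROWS];

typedef struct { int row; } row_job;

/* One dot product per thread; each row is independent of the others. */
static void *dot_product_thread(void *arg) {
    int i = ((row_job *)arg)->row;
    double sum = 0.0;
    for (int k = 0; k < NCOLS; k++)
        sum += a[i][k] * b[k];
    c[i] = sum;
    return NULL;
}

int main(void) {
    pthread_t tid[NROWS];
    row_job jobs[NROWS];
    for (int i = 0; i < NROWS; i++) {      /* the "create_thread" of the slide */
        jobs[i].row = i;
        pthread_create(&tid[i], NULL, dot_product_thread, &jobs[i]);
    }
    for (int i = 0; i < NROWS; i++)        /* wait for every dot product */
        pthread_join(tid[i], NULL);
    printf("c[0] = %f\n", c[0]);
    return 0;
}

(Compile with -lpthread. Whether latency is actually hidden depends on the hardware being able to keep multiple memory requests outstanding, as discussed on the next slides.)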
Multithreading for Latency Hiding: Example

• In the code, the first instance of this function accesses a pair of vector elements and waits for
them.
• In the meantime, the second instance of this function can access another pair of vector elements in a second thread, and so on. Multiple threads are created.

• After l units of time, where l is the latency of the memory system, the first function instance
gets the requested data from memory and can perform the required computation.
• In the next cycle, the data items for the next function instance arrive, and so on. In this way,
in every clock cycle, we can perform a computation. This is how the memory latency is hidden.
• The execution schedule in the previous example is predicated upon two assumptions: the
memory system is capable of servicing multiple outstanding requests, and the processor is
capable of switching threads at every cycle.
• It also requires the program to have an explicit specification of concurrency in the form of
threads.
Prefetching for Latency Hiding

• Prefetching: Load data before it’s needed, so it’s ready by the time the processor uses it.
• The idea is to advance load operations and overlap memory access with computation.
Consider the problem of adding two vectors a and b using a single for loop. In the first iteration of the
loop, the processor requests a[0] and b[0]. Since these are not in the cache, the processor must pay
the memory latency. Assume that each request is generated in one cycle (1 ns) and memory
requests are satisfied in 100 ns.
for (i = 0; i < n; i++)
c[i] = a[i] + b[i];

● First iteration: Load a[0] and b[0] → Cache miss → 100-cycle stall.
● Processor idles until data arrives.
● Prefetching logic: Request a[1] and b[1] immediately after a[0] and b[0].
● Request generation: 1 cycle
● Memory latency: 100 cycles
● After 100 requests, data returns every cycle.

Result: One addition happens per cycle — no wasted CPU cycles!
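
A hedged sketch of software prefetching for this loop, assuming the GCC/Clang __builtin_prefetch intrinsic and an illustrative prefetch distance of 16 elements:

#define DIST 16   /* assumed prefetch distance, tuned per machine */

void add_with_prefetch(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++) {
        if (i + DIST < n) {
            __builtin_prefetch(&a[i + DIST]);   /* request future operands early      */
            __builtin_prefetch(&b[i + DIST]);
        }
        c[i] = a[i] + b[i];                     /* operate on data already in cache   */
    }
}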


Tradeoffs of Multithreading and Prefetching
• Bandwidth Bottleneck:
The system is no longer limited by latency but by bandwidth. Even if multithreading hides
latency, the sheer volume of memory traffic overwhelms DRAM bandwidth.
• Impact of Smaller Cache Per Thread:
Smaller cache slices for each thread reduce the hit ratio, forcing more memory accesses.
The more threads you add, the more you fragment the cache, and the worse the problem
gets.
• Prefetching and Hardware Constraints:
Prefetching reduces latency but increases bandwidth pressure.
Example: If the system prefetches 10 loads, those 10 registers must stay free. If reused, the
data must be fetched again, doubling the bandwidth requirement.
Dichotomy of Parallel Computing Platforms

● The dichotomy is based on the logical and physical organization of parallel platforms.
● The logical organization refers to a programmer's view of the platform while the physical
organization refers to the actual hardware organization of the platform.
● The two critical components of parallel computing from a programmer's perspective are ways
of expressing parallel tasks and mechanisms for specifying interaction between these tasks.
● The former is sometimes also referred to as the control structure and the latter as the
communication model.
Control Structure of Parallel Programs
• Parallel tasks can be specified at various levels of granularity.
• At one extreme, each program in a set of programs can be viewed as one parallel task.
• At the other extreme, individual instructions within a program can be viewed as parallel tasks.
• Between these extremes lie a range of models for specifying the control structure of programs and
the corresponding architectural support for them.
• Processing units in parallel computers either operate under the centralized control of a single
control unit or work independently.
• If there is a single control unit that dispatches the same instruction to various processors (that work
on different data), the model is referred to as single instruction stream, multiple data stream (SIMD).
• If each processor has its own control unit, each processor can execute different instructions on
different data items. This model is called multiple instruction stream, multiple data stream (MIMD).
SIMD and MIMD Processors

(a) A typical SIMD architecture, and (b) a typical MIMD architecture.
Conditional Execution in SIMD Processors

Executing a conditional statement on an SIMD computer with four processors:

(a) the conditional statement;
(b) the execution of the statement in two steps.
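
The figure itself is not reproduced here; the following is a hedged sketch of the two-step idea, using an assumed conditional (not necessarily the one in the original figure). On an SIMD machine all processors receive the same instruction, so the "then" and "else" branches run one after the other, with only the relevant processors active in each step.

/* Assumed conditional: if (a[i] > 0) c[i] = a[i]; else c[i] = -a[i];
 * Each index i stands for one SIMD processor. */
void simd_conditional(const int *a, int *c, int n) {
    /* Step 1: processors whose condition is TRUE are active, the rest idle. */
    for (int i = 0; i < n; i++)
        if (a[i] > 0) c[i] = a[i];
    /* Step 2: processors whose condition is FALSE are active, the rest idle. */
    for (int i = 0; i < n; i++)
        if (!(a[i] > 0)) c[i] = -a[i];
}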
MIMD Processors

• In contrast to SIMD processors, MIMD processors can execute different programs on different
processors.
• A variant of this, called single program multiple data streams (SPMD) executes the same
program on different processors.
• It is easy to see that SPMD and MIMD are closely related in terms of programming flexibility and
underlying architectural support.
Communication Model of Parallel Platforms

There are two primary forms of data exchange between parallel tasks – accessing a shared data
space and exchanging messages.

● Data Exchange in Parallel Tasks:


○ Accessing a shared data space (shared-address-space)
○ Message Passing
Shared-Address-Space Memory Types

● The "shared-address-space" view of a parallel platform supports a common data space that
is accessible to all processors. Processors interact by modifying data objects stored in this
shared-address-space.
● Memory in shared-address-space platforms can be local (exclusive to a processor) or global
(common to all processors).
● Support a common data space accessible to all processors
● Processors modify shared data objects
● Shared-address-space platforms supporting SPMD programming are also referred to as multiprocessors.
Shared-Address-Space Memory Types

● Uniform Memory Access (UMA):


○ If the time taken by a processor to access any memory word in the system (global or
local) is identical, the platform is classified as a uniform memory access (UMA)
multicomputer
○ Equal access time to any memory location
● Non-Uniform Memory Access (NUMA):
○ if the time taken to access certain memory words is longer than others, the platform is
called a non- uniform memory access (NUMA) multicomputer.
○ Access time varies based on memory location
Uniform Memory Access (UMA):

● Same access time for all memory locations


● Example systems: SGI Origin 2000, Sun Ultra HPC
● It has two variants:

(a) UMA Shared-Address-Space (only Global Space)

(b) UMA with Caches: (Global +Local)


UMA Shared-Address-Space:

Structure:

● Multiple processors (P) are connected to an interconnection network.


● The network links the processors to multiple memory modules (M).

Key Characteristics:

● All processors have equal access latency to any memory module.


● Memory is physically shared, and processors interact through a common address space.

Pros: Easier to program, as memory access time is predictable.


Cons: Can become a bottleneck as the number of processors grows, due to contention on the
interconnection network.
UMA with Caches
● Structure:
○ Each processor (P) has a private cache (C) between itself and the interconnection network.
○ Processors still access shared memory via the network, but they can first check their caches.
● Key Characteristics:
○ Faster access for frequently used data (due to caching).
○ Cache coherence problem arises: If one processor updates data, the caches of other
processors may become outdated.
● Pros: Faster average memory access due to caching.
● Cons: Requires hardware support for cache coherence protocols (like MESI: Modified, Exclusive, Shared, Invalid) to avoid inconsistent data views.
Cache Coherence Challenges
● Cache Coherence Problem:
○ Multiple processors may hold different copies of the same memory word
○ Changes in one copy may not reflect in others
Non-Uniform Memory Access (NUMA)
● Structure:
○ Each processor (P) has a local memory (M) and a cache (C).
○ The interconnection network links all the processors and their local memory modules.
● Key Characteristics:
○ Processors access their local memory faster than remote memory.
○ Memory access time is non-uniform: Local access is quick, but accessing another
processor’s memory is slower.
● Pros: Scales better than UMA, as memory is distributed.
● Cons: Programmers must optimize data locality to avoid performance penalties for remote
memory access
Message-Passing Platforms

○ Definition: A parallel computing platform where each processing node has its own exclusive
address space, and nodes communicate by exchanging messages.
○ Examples: Clustered workstations, non-shared-address-space multicomputers.
○ Key Feature: No direct memory access between nodes — all interactions happen through
message exchanges.
Logical View of Message-Passing Systems

● p Processing Nodes: Each with independent memory.


● Inter-node Communication: Only via messages.
● Typical Platforms: IBM SP, SGI Origin 2000, workstation clusters.
● Real-World Example: A set of computers connected over a network working on a common
task.
Core Operations in Message Passing

● Send: Transmit data to another process.


● Receive: Accept data from another process.
● whoami: Return the ID of the calling process.
● numprocs: Return the total number of processes.

Programming Interfaces for Message Passing

● MPI (Message Passing Interface)


○ Widely used, standardized API.
○ Supports basic and complex communication patterns.
● PVM (Parallel Virtual Machine)
○ Enables a collection of heterogeneous computers to work as a single parallel system.
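
A minimal MPI sketch of the four core operations listed above (assumes at least two processes, e.g. run with mpirun -np 2 ./a.out):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* "whoami"   */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* "numprocs" */

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);              /* send    */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                                     /* receive */
        printf("Process %d of %d received %d\n", rank, size, value);
    }
    MPI_Finalize();
    return 0;
}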
How Message Passing Works

1. Process A wants data from Process B:


○ A sends a message to B.
○ B processes the request and sends back the result.
2. Synchronization: Ensures processes are in the right state to exchange data.
3. Explicit Communication: No implicit sharing of variables or memory.

Emulating Message Passing on Shared Memory Systems

● Divide shared memory into p partitions: Each processor gets a unique section.
● Send/Receive via memory writes/reads: One processor writes to another’s section.
● Synchronization needed: Use locks or barriers to manage access.

Example: Two processors write and read from pre-allocated memory slots to exchange
data.
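
A hedged sketch of this emulation using POSIX threads as the "processors" (the mailbox type and the mbox_send/mbox_recv names are illustrative assumptions): one pre-allocated slot per destination, protected by a lock.

#include <pthread.h>
#include <stdio.h>

/* One shared-memory partition acting as a mailbox for one receiver. */
typedef struct {
    pthread_mutex_t lock;
    int full;      /* 1 if a message is waiting */
    int data;      /* the message itself        */
} mailbox;

static mailbox box = { PTHREAD_MUTEX_INITIALIZER, 0, 0 };

static void mbox_send(mailbox *m, int value) {
    for (;;) {                               /* spin until the slot is empty */
        pthread_mutex_lock(&m->lock);
        if (!m->full) {
            m->data = value; m->full = 1;
            pthread_mutex_unlock(&m->lock);
            return;
        }
        pthread_mutex_unlock(&m->lock);
    }
}

static int mbox_recv(mailbox *m) {
    for (;;) {                               /* spin until a message arrives */
        pthread_mutex_lock(&m->lock);
        if (m->full) {
            int v = m->data; m->full = 0;
            pthread_mutex_unlock(&m->lock);
            return v;
        }
        pthread_mutex_unlock(&m->lock);
    }
}

static void *sender(void *arg)   { (void)arg; mbox_send(&box, 7); return NULL; }
static void *receiver(void *arg) { (void)arg; printf("received %d\n", mbox_recv(&box)); return NULL; }

int main(void) {
    pthread_t s, r;
    pthread_create(&r, NULL, receiver, NULL);
    pthread_create(&s, NULL, sender, NULL);
    pthread_join(s, NULL);
    pthread_join(r, NULL);
    return 0;
}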
Pros and Cons of Message Passing

● Advantages:
○ Scales well to large distributed systems.
○ Explicit control over communication.
● Disadvantages:
○ More programming effort.
○ Communication overhead.

Real-World Applications

● High-Performance Computing (HPC)


● Scientific Simulations
● Distributed Databases
● Cloud and Fog Computing
Physical Organization of Parallel Platforms

An ideal parallel computer extends the concept of a Random Access Machine (RAM) to multiple processors; this is called a Parallel Random Access Machine (PRAM).

Key Characteristics of PRAM:

● Processors (p): Multiple processors working simultaneously.


● Global Memory (m): A single, large, shared memory accessible to all processors.
● Single Address Space: All processors access the same memory locations.
● Common Clock: All processors operate in sync, but they may execute different instructions
in each cycle.

This setup is simple and powerful in theory, but things get tricky when processors try to access
memory at the same time.
PRAM Subclasses
PRAM models differ based on how they handle concurrent read and write operations to the same
memory location:
Exclusive-read, exclusive-write (EREW) PRAM. In this class, access to a memory location is
exclusive. No concurrent read or write operations are allowed. This is the weakest PRAM model,
affording minimum concurrency in memory access.
Concurrent-read, exclusive-write (CREW) PRAM. In this class, multiple read accesses to a
memory location are allowed. However, multiple write accesses to a memory location are
serialized.
Exclusive-read, concurrent-write (ERCW) PRAM. Multiple parallel write accesses are allowed to
a memory location, but multiple read accesses are serialized.
Concurrent-read, concurrent-write (CRCW) PRAM. This class allows multiple read and write
accesses to a common memory location. This is the most powerful PRAM model.
Allowing concurrent read access does not create any semantic discrepancies in the program.
However, concurrent write access to a memory location requires arbitration. Several protocols are
used to resolve concurrent writes.
The most frequently used protocols are as follows:
● Common, in which the concurrent write is allowed if all the values that the processors are
attempting to write are identical.
● Arbitrary, in which an arbitrary processor is allowed to proceed with the write operation and
the rest fail.
● Priority, in which all processors are organized into a predefined prioritized list, and the
processor with the highest priority succeeds and the rest fail.
● Sum, in which the sum of all the quantities is written (the sum-based write conflict resolution
model can be extended to any associative operator defined on the quantities being written).
Architectural Complexity of the Ideal Model

● Memory Access via Switches: Processors access memory through switches that connect
them to memory words.
● Switching Complexity: In an EREW PRAM, to allow each processor to access any memory
word (as long as no two access the same word simultaneously), the number of switches
required is proportional to:
O(m×p)
Where:
○ m = number of memory words
○ p = number of processors
● Cost of Hardware: For realistic memory sizes and processor counts, building this many
switches becomes extremely expensive and impractical.
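
For a sense of scale (illustrative numbers, not from the slides): with p = 1,024 = 2¹⁰ processors and m = 2³⁰ memory words (about one billion), an EREW PRAM would need on the order of 2¹⁰ × 2³⁰ = 2⁴⁰, i.e. roughly a trillion switches, which is why the ideal model is not built directly in hardware.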
