Parallel Programming - Unit 1
B.Sc(H)-VI Sem
What is Parallel Computing?
Parallel computing is the simultaneous use of multiple processing elements to solve a problem: the problem is broken into parts that can be executed concurrently. Parallelism can be exploited at several levels:
1. Bit-level parallelism –
It is the form of parallel computing based on increasing the processor's word size. It reduces the number of instructions that the system must execute in order to perform an operation on data larger than the word size.
Example: Consider a scenario where an 8-bit processor must compute the sum
of two 16-bit integers. It must first sum up the 8 lower-order bits, then add the 8
higher-order bits, thus requiring two instructions to perform the operation. A 16-
bit processor can perform the operation with just one instruction.
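A minimal C sketch of the scenario above (the function name add16_with_8bit_ops and the sample values are illustrative, not from the notes): it performs the 16-bit addition using only 8-bit operations, mirroring the two instructions an 8-bit processor would need.
#include <stdint.h>
#include <stdio.h>

/* Add two 16-bit integers using only 8-bit arithmetic: low-order bytes first,
   then high-order bytes plus the carry. A 16-bit processor needs a single add. */
uint16_t add16_with_8bit_ops(uint16_t a, uint16_t b) {
    uint8_t a_lo = a & 0xFF, b_lo = b & 0xFF;
    uint8_t a_hi = a >> 8,   b_hi = b >> 8;
    uint8_t lo = (uint8_t)(a_lo + b_lo);          /* instruction 1: low-order bytes */
    uint8_t carry = (lo < a_lo) ? 1 : 0;          /* carry out of the low-order add */
    uint8_t hi = (uint8_t)(a_hi + b_hi + carry);  /* instruction 2: high-order bytes */
    return ((uint16_t)hi << 8) | lo;
}

int main(void) {
    printf("%d\n", add16_with_8bit_ops(1000, 2000));  /* prints 3000 */
    return 0;
}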
2. Instruction-level parallelism –
A processor can issue only a limited number of instructions in each clock cycle. Independent instructions can be re-ordered and grouped and then executed side by side without affecting the result of the program. This is called instruction-level parallelism.
3. Task Parallelism –
Task parallelism decomposes a task into subtasks and allocates each subtask to a processor for execution. The processors then execute the subtasks side by side.
4. Data-level parallelism –
The same operation is performed concurrently on different elements of a data set (for example, on different portions of a large array).
Why Parallel Computing?
● The real world is dynamic in nature: many things happen at the same time but at different places concurrently. The data describing it is extremely large to manage.
● Real-world data needs more dynamic simulation and modeling, and for achieving the
same, parallel computing is the key.
● Parallel computing provides concurrency and saves time and money.
● Complex, large datasets and their management can be handled effectively only with a parallel computing approach.
● It ensures effective utilization of resources: the hardware is used effectively, whereas in serial computation only part of the hardware is used and the rest stays idle.
● Also, it is impractical to implement real-time systems using serial computing.
Applications of Parallel Computing
Problem: Components such as the processor, the memory, and the datapath between them create bottlenecks that slow down overall system performance.
Solution: Over the years, several architectural innovations have been introduced to overcome
these bottlenecks.
The Role of Multiplicity in Architecture
Multiplicity (Parallelism) is a key innovation that improves performance.
1. Implicit Parallelism: The system handles parallel execution automatically, without the programmer's involvement; the parallelism is hidden from the programmer.
2. Explicit Parallelism: The programmer is responsible for designing and implementing the parallel execution, and therefore has direct control over it.
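As an illustration of explicit parallelism, a minimal C sketch (assuming a compiler with OpenMP support, compiled with -fopenmp; the array names are arbitrary): the programmer states explicitly that the loop iterations may run in parallel, whereas with implicit parallelism the hardware and compiler would have to discover this on their own.
/* Explicit parallelism: the pragma is the programmer's instruction to run
   the iterations of this loop in parallel across threads. */
void add_vectors(int n, const double *a, const double *b, double *c) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}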
Pipelining
[Figure: pipelined execution example – stages of 20 seconds each, throughput of "1 unit per ?? seconds", total time 130 seconds.]
• Conventional processors rely on very deep pipelines (20 stage pipelines in state-
of-the-art Pentium processors).
• However, in typical program traces, every 5th to 6th instruction is a conditional jump
(such as in if-else, switch-case)!
• This requires very accurate branch prediction.
• The penalty of a mis-prediction grows with the depth of the pipeline, since a larger
number of instructions will have to be flushed.
Superscalar Execution
● The penalty of a misprediction increases as the pipelines become deeper since a larger
number of instructions need to be flushed.
● These factors place limitations on the depth of a processor pipeline and the resulting
performance gains.
● An obvious way to improve instruction execution rate beyond this level is to use multiple
pipelines.
● During each clock cycle, multiple instructions are piped into the processor in parallel.
● These instructions are executed on multiple functional units.
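The discussion below refers to three code fragments, (i), (ii) and (iii), for adding four numbers stored at memory locations @1000, @1004, @1008 and @100C, with the result stored at @2000. The fragments themselves do not appear in these notes; a likely reconstruction, consistent with the instructions quoted in the text that follows (this material tracks the standard textbook example), is:
(i)
load R1, @1000
load R2, @1008
add R1, @1004
add R2, @100C
add R1, R2
store R1, @2000

(ii)
load R1, @1000
add R1, @1004
add R1, @1008
add R1, @100C
store R1, @2000

(iii)
load R1, @1000
add R1, @1004
load R2, @1008
add R2, @100C
add R1, R2
store R1, @2000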
Superscalar Execution
• The example also illustrates that different instruction mixes with identical semantics can
take significantly different execution time.
Fragments (i), (ii) and (iii) all compute the same result and store it at memory location @2000.
Dependency in Superscalar Execution
● The results of an instruction may be required for subsequent instructions. This is referred
to as true data dependency.
● For instance, consider the second code fragment, (ii), for adding four numbers. There is a true data dependency between load R1, @1000 and add R1, @1004, and similarly between subsequent instructions.
● Dependencies of this type must be resolved before simultaneous issue of instructions.
● This has two implications. First, since the resolution is done at runtime, it must be
supported in hardware. The complexity of this hardware can be high.
● Second, the amount of instruction level parallelism in a program is often limited and is a
function of coding technique.
● In the second code fragment, there can be no simultaneous issue, leading to poor
resource utilization.
Resource Dependency
● Another source of dependency between instructions results from the finite resources
shared by various pipelines.
● As an example, consider the co-scheduling of two floating point operations on a dual
issue machine with a single floating point unit.
● Although there might be no data dependencies between the instructions, they cannot be
scheduled together since both need the floating point unit.
● This form of dependency in which two instructions compete for a single processor
resource is referred to as resource dependency.
Branch Dependency
● The flow of control through a program enforces a third form of dependency between
instructions.
● Consider the execution of a conditional branch instruction.
● Since the branch destination is known only at the point of execution, scheduling
instructions a priori across branches may lead to errors.
● These dependencies are referred to as branch dependencies or procedural
dependencies and are typically handled by speculatively scheduling across branches
and rolling back in case of errors.
● Studies of typical traces have shown that on average, a branch instruction is
encountered between every five to six instructions. Therefore, just as in populating
instruction pipelines, accurate branch prediction is critical for efficient superscalar
execution.
● The ability of a processor to detect and schedule concurrent instructions is critical to superscalar performance. For instance, consider the third code fragment, (iii), which also computes the sum of four numbers.
● In this case, there is a data dependency between the first two instructions – load R1,
@1000 and add R1, @1004. Therefore, these instructions cannot be issued
simultaneously.
● However, if the processor had the ability to look ahead, it would realize that it is possible
to schedule the third instruction – load R2, @1008 –
● with the first instruction. In the next issue cycle, instructions two and four can be
scheduled, and so on.
● In this way, the same execution schedule can be derived for the first and third code
fragments.
● However, the processor needs the ability to issue instructions out-of-order to
accomplish desired reordering. The parallelism available in in-order issue of
instructions can be highly limited as illustrated by this example.
● Most current microprocessors are capable of out- of-order issue and completion.
● This model, also referred to as dynamic instruction issue, exploits maximum
instruction level parallelism.
● The processor uses a window of instructions from which it selects instructions for
simultaneous issue. This window corresponds to the look-ahead of the scheduler.
Superscalar Execution: Efficiency Considerations
● The performance of superscalar architectures is limited by the available instruction level
parallelism.
● Consider the execution of fragment (i): because of data dependencies, some issue slots and even whole cycles go unused. These are essentially wasted cycles from the point of view of the execution units. If, during a particular cycle, no instructions are issued on the execution units, it is referred to as vertical waste;
● if only part of the execution units are used during a cycle, it is termed horizontal waste.
Problems in Superscalar That VLIW Solves
● Superscalar processors need complex control units to dynamically reorder instructions and resolve
dependencies.
● VLIW moves this responsibility to the compiler, simplifying processor design.
● Superscalar processors rely on branch prediction and speculative execution to keep pipelines full.
● VLIW avoids this overhead by statically scheduling instructions at compile time.
● Superscalar processors require extra energy for out-of-order execution, instruction issue logic, and
branch prediction.
● VLIW reduces power consumption by eliminating complex hardware scheduling.
Very Long Instruction Word (VLIW) Processors
• The hardware cost and complexity of the superscalar scheduler is a major consideration in
processor design.
• To address these issues, VLIW processors rely on compile-time analysis to identify and bundle together instructions that can be executed concurrently.
• These instructions are packed and dispatched together, and thus the name very long
instruction word is used.
4-Way VLIW
How to bundle (pack) the instructions in VLIW?
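A rough sketch (not from the slides): the compiler performs the dependence analysis at compile time and packs independent operations into each long instruction word, filling slots it cannot use with no-ops. For fragment (i) above, a 4-way VLIW compiler might emit bundles such as:
{ load R1, @1000  |  load R2, @1008  |  nop  |  nop }
{ add R1, @1004   |  add R2, @100C   |  nop  |  nop }
{ add R1, R2      |  nop             |  nop  |  nop }
{ store R1, @2000 |  nop             |  nop  |  nop }
The unfilled nop slots correspond to the horizontal waste discussed above; keeping them few is the compiler's job in a VLIW design.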
FLOPS stands for Floating-Point Operations Per Second. It measures how many floating-point arithmetic operations (such as addition, subtraction, multiplication, and division on real numbers) a computer can perform in one second.
○ Consider a processor operating at 1 GHz (1 ns clock) connected to a DRAM with a latency of 100 ns (no
caches). Assume that the processor has two multiply-add units and is capable of executing four
instructions in each cycle of 1 ns.
○ Since the memory latency is equal to 100 cycles and block size is one word, every time a memory
request is made, the processor must wait 100 cycles before it can start to process the data. This is a
serious drawback.
The processor can perform 4 floating-point operations per clock cycle and runs at 1 GHz (10^9 cycles per second).
● Since it can execute 4 operations per cycle, the peak performance is:
4 FLOPs/cycle × 10^9 cycles/s = 4 GFLOPS
Real Performance vs. Peak Performance
With a 100 ns (10^-7 s) memory latency and one word fetched per access, the processor completes roughly one floating-point operation every 100 ns:
1 FLOP / 10^-7 s = 10^7 FLOPS = 10 MFLOPS
Seriousness of Memory Latency
● It is easy to see that the peak speed of this computation is limited to one floating point
operation every 100 ns, or a speed of 10 MFLOPS, a very small fraction of the peak
processor rating.
● This example highlights the need for effective memory system performance in achieving
high computation rates.
Improving Memory Latency Using Caches
● Handling the mismatch in processor and DRAM speeds has motivated a number of
architectural innovations in memory system design.
● One such innovation addresses the speed mismatch by placing a smaller and faster memory
between the processor and the DRAM.
● This memory, referred to as the cache, acts as a low-latency high-bandwidth storage. The
data needed by the processor is first fetched into the cache.
● All subsequent accesses to data items residing in the cache are serviced by the cache. Thus,
in principle, if a piece of data is repeatedly used, the effective latency of this memory system
can be reduced by the cache.
● The fraction of data references satisfied by the cache is called the cache hit ratio of the
computation on the system.
● The effective computation rate of many applications is bounded not by the processing rate of
the CPU, but by the rate at which data can be pumped into the CPU.
● Such computations are referred to as being memory bound. The performance of memory
bound programs is critically impacted by the cache hit ratio.
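A common first-order estimate (not from the slides) relates the cache hit ratio h to the effective memory access time:
t_effective = h × t_cache + (1 − h) × t_DRAM
For the platform above (t_cache = 1 ns, t_DRAM = 100 ns), a hit ratio of h = 0.9 gives t_effective = 0.9 × 1 + 0.1 × 100 = 10.9 ns, which shows why memory-bound programs are so sensitive to the hit ratio.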
Improving Latency Using Caches: Example
As in the previous example, consider a 1 GHz processor with a 100 ns latency DRAM. In this case, we introduce a cache of size 32 KB with a latency of 1 ns, or one cycle (the cache typically sits on the processor itself). We use this setup to multiply two matrices A and B of dimensions 32 × 32.
Fetching the two matrices into the cache means fetching 2K words, which takes approximately 200 µs. Multiplying two 32 × 32 matrices requires 2 × 32^3 = 64K operations, which can be performed in 16K cycles (16 µs) at four instructions per cycle. The total time is therefore approximately 200 + 16 = 216 µs, giving a computation rate of
64K operations / 216 µs ≈ 303 MFLOPS
Impact of Memory Bandwidth
• Memory bandwidth is determined by the bandwidth (number of bytes per second) of the memory bus as well as of the memory units.
• Memory bandwidth can be improved by increasing the size of memory blocks. This will increase
the size of the bus.
• It is important to note that increasing block size does not change latency of the system.
• In practice, wide data and address buses are expensive to construct.
• In a more practical system, consecutive words are sent on the memory bus on subsequent bus
cycles after the first word is retrieved. This reduces latency by half.
Alternate Approaches for Hiding Memory Latency
• Consider the problem of browsing the web on a very slow network connection. We deal with
the problem in one of three possible ways:
– we anticipate which pages we are going to browse ahead of time and issue requests for them
in advance;
– we open multiple browsers and access different pages in each browser, so that while we are waiting for one page to load, we can be reading others (multithreading); or
– we access a whole bunch of pages in one go, amortizing the latency across the various accesses (higher bandwidth utilization).
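Multithreading for Latency Hiding
The second option above corresponds to multithreading. The discussion below rewrites the dot-product loop of a matrix–vector multiplication; a likely form of that original (serial) code segment, following the textbook example this material is based on, is:
for (i = 0; i < n; i++)
    c[i] = dot_product(get_row(a, i), b);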
Each dot-product is independent of the other, and therefore represents a concurrent unit of
execution. We can safely rewrite the above code segment as:
for (i = 0; i < n; i++)
    c[i] = create_thread(dot_product, get_row(a, i), b);
• In this code, the first instance of the dot_product function accesses a pair of vector elements and waits for them.
• In the meantime, the second instance of this function can access two other vector elements.
• After l units of time, where l is the latency of the memory system, the first function instance
gets the requested data from memory and can perform the required computation.
• In the next cycle, the data items for the next function instance arrive, and so on. In this way, in every clock cycle, we can perform a computation. This is how the memory latency is hidden.
• The execution schedule in the previous example is predicated upon two assumptions: the
memory system is capable of servicing multiple outstanding requests, and the processor is
capable of switching threads at every cycle.
• It also requires the program to have an explicit specification of concurrency in the form of
threads.
Prefetching for Latency Hiding
• Prefetching: Load data before it’s needed, so it’s ready by the time the processor uses it.
• The idea is to advance load operations and overlap memory access with computation.
Consider the problem of adding two vectors a and b using a single for loop. In the first iteration of the loop, the processor requests a[0] and b[0]. Since these are not in the cache, the processor must pay the memory latency. Assuming that each request is generated in one cycle (1 ns) and memory requests are satisfied in 100 ns, after 100 such requests the first pair of values arrives, and thereafter one pair is returned in every subsequent cycle.
for (i = 0; i < n; i++)
c[i] = a[i] + b[i];
● First iteration: Load a[0] and b[0] → Cache miss → 100-cycle stall.
● Processor idles until data arrives.
● Prefetching logic: Request a[1] and b[1] immediately after a[0] and b[0].
● Request generation: 1 cycle
● Memory latency: 100 cycles
● After 100 requests, data returns every cycle.
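In hardware, dedicated prefetch logic issues these advance loads; a compiler or programmer can also request them explicitly. A minimal C sketch (assuming GCC or Clang, whose __builtin_prefetch hint is used here; the look-ahead distance AHEAD is an arbitrary illustrative value):
#include <stddef.h>

#define AHEAD 16  /* illustrative prefetch distance, in elements */

void add_vectors_prefetch(size_t n, const double *a, const double *b, double *c) {
    for (size_t i = 0; i < n; i++) {
        if (i + AHEAD < n) {
            __builtin_prefetch(&a[i + AHEAD], 0, 1);  /* hint: read access, low temporal locality */
            __builtin_prefetch(&b[i + AHEAD], 0, 1);
        }
        c[i] = a[i] + b[i];  /* by now a[i] and b[i] should already be on their way to the cache */
    }
}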
Dichotomy of Parallel Computing Platforms
● Parallel platforms can be organized along a dichotomy based on their logical and physical organization.
● The logical organization refers to a programmer's view of the platform, while the physical organization refers to the actual hardware organization of the platform.
● The two critical components of parallel computing from a programmer's perspective are ways
of expressing parallel tasks and mechanisms for specifying interaction between these tasks.
● The former is sometimes also referred to as the control structure and the latter as the
communication model.
Control Structure of Parallel Programs
• Parallel tasks can be specified at various levels of granularity.
• At one extreme, each program in a set of programs can be viewed as one parallel task.
• At the other extreme, individual instructions within a program can be viewed as parallel tasks.
• Between these extremes lie a range of models for specifying the control structure of programs and
the corresponding architectural support for them.
• Processing units in parallel computers either operate under the centralized control of a single
control unit or work independently.
• If there is a single control unit that dispatches the same instruction to various processors (that work
on different data), the model is referred to as single instruction stream, multiple data stream (SIMD).
• If each processor has its own control unit, each processor can execute different instructions on
different data items. This model is called multiple instruction stream, multiple data stream (MIMD).
SIMD and MIMD Processors
[Figure: (a) a typical SIMD architecture, and (b) a typical MIMD architecture.]
Conditional Execution in SIMD Processors
[Figure: executing a conditional statement on a SIMD computer with four processors.]
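The figure is not reproduced here. The standard illustration (from the textbook this material follows) executes a statement of the form:
if (B == 0)
    C = A;
else
    C = A / B;
On a SIMD machine each of the four processors holds its own A, B and C. In a first step, the processors whose local B equals 0 execute C = A while the others are idled; in a second step, the remaining processors execute C = A / B while the first group idles. The conditional therefore costs the time of both branches, which is a source of inefficiency in SIMD execution.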
• In contrast to SIMD processors, MIMD processors can execute different programs on different
processors.
• A variant of this, called single program multiple data streams (SPMD) executes the same
program on different processors.
• It is easy to see that SPMD and MIMD are closely related in terms of programming flexibility and
underlying architectural support.
Communication Model of Parallel Platforms
There are two primary forms of data exchange between parallel tasks – accessing a shared data
space and exchanging messages.
● The "shared-address-space" view of a parallel platform supports a common data space that
is accessible to all processors. Processors interact by modifying data objects stored in this
shared-address-space.
● Memory in shared-address-space platforms can be local (exclusive to a processor) or global
(common to all processors).
● Support a common data space accessible to all processors
● Processors modify shared data objects
● Shared-address-space platforms supporting SPMD programming are also referred to as multiprocessors.
Shared-Address-Space Memory Types
[Slide figure summarizing the structure and key characteristics of shared-address-space memory (local versus global) is not reproduced here.]
Message-Passing Platforms
○ Definition: A parallel computing platform where each processing node has its own exclusive
address space, and nodes communicate by exchanging messages.
○ Examples: Clustered workstations, non-shared-address-space multicomputers.
○ Key Feature: No direct memory access between nodes — all interactions happen through
message exchanges.
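A minimal message-passing sketch in C (assuming an MPI installation; the tag and the value sent are arbitrary illustrations): node 0 sends an integer to node 1, and all interaction happens through explicit send/receive calls rather than through shared memory.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);       /* each node learns its own id */
    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* explicit message to node 1 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("node 1 received %d from node 0\n", value);
    }
    MPI_Finalize();
    return 0;
}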
Logical View of Message-Passing Systems
● Divide shared memory into p partitions: Each processor gets a unique section.
● Send/Receive via memory writes/reads: One processor writes to another’s section.
● Synchronization needed: Use locks or barriers to manage access.
Example: Two processors write and read from pre-allocated memory slots to exchange
data.
Pros and Cons of Message Passing
● Advantages:
○ Scales well to large distributed systems.
○ Explicit control over communication.
● Disadvantages:
○ More programming effort.
○ Communication overhead.
Real-World Applications
The Ideal Parallel Computer (PRAM)
An ideal parallel computer extends the concept of a Random Access Machine (RAM) to multiple processors; this is called a Parallel Random Access Machine (PRAM).
Key characteristics of the PRAM (following the standard textbook model): it consists of p processors and a global memory of unbounded size that is uniformly accessible to all processors; the processors share a common clock but may execute different instructions in each cycle; and every memory access takes the same (unit) time.
This setup is simple and powerful in theory, but things get tricky when processors try to access the same memory location at the same time.
PRAM Subclasses-
PRAM models differ based on how they handle concurrent read and write operations to the same
memory location:
Exclusive-read, exclusive-write (EREW) PRAM. In this class, access to a memory location is
exclusive. No concurrent read or write operations are allowed. This is the weakest PRAM model,
affording minimum concurrency in memory access.
Concurrent-read, exclusive-write (CREW) PRAM. In this class, multiple read accesses to a
memory location are allowed. However, multiple write accesses to a memory location are
serialized.
Exclusive-read, concurrent-write (ERCW) PRAM. Multiple parallel write accesses are allowed to
a memory location, but multiple read accesses are serialized.
Concurrent-read, concurrent-write (CRCW) PRAM. This class allows multiple read and write
accesses to a common memory location. This is the most powerful PRAM model.
Allowing concurrent read access does not create any semantic discrepancies in the program.
However, concurrent write access to a memory location requires arbitration. Several protocols are
used to resolve concurrent writes.
The most frequently used protocols are as follows:
● Common, in which the concurrent write is allowed if all the values that the processors are
attempting to write are identical.
● Arbitrary, in which an arbitrary processor is allowed to proceed with the write operation and
the rest fail.
● Priority, in which all processors are organized into a predefined prioritized list, and the
processor with the highest priority succeeds and the rest fail.
● Sum, in which the sum of all the quantities is written (the sum-based write conflict resolution
model can be extended to any associative operator defined on the quantities being written).
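A small C sketch (not from the slides; the function and variable names are illustrative) of how these protocols resolve p values written simultaneously to a single location:
/* vals[i] is the value processor i attempts to write; prio[i] is its priority
   (smaller number = higher priority). */

int resolve_common(const int *vals, int p, int *out) {
    for (int i = 1; i < p; i++)
        if (vals[i] != vals[0]) return 0;   /* the write fails unless all values agree */
    *out = vals[0];
    return 1;
}

int resolve_arbitrary(const int *vals, int p) {
    (void)p;                                /* any one processor may win; pick the first here */
    return vals[0];
}

int resolve_priority(const int *vals, const int *prio, int p) {
    int best = 0;
    for (int i = 1; i < p; i++)
        if (prio[i] < prio[best]) best = i; /* highest-priority processor succeeds */
    return vals[best];
}

int resolve_sum(const int *vals, int p) {
    int sum = 0;
    for (int i = 0; i < p; i++) sum += vals[i];  /* the sum of all attempted values is written */
    return sum;
}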
Architectural Complexity of the Ideal Model
● Memory Access via Switches: Processors access memory through switches that connect
them to memory words.
● Switching Complexity: In an EREW PRAM, to allow each processor to access any memory
word (as long as no two access the same word simultaneously), the number of switches
required is proportional to:
O(m×p)
Where:
○ m = number of memory words
○ p = number of processors
● Cost of Hardware: For realistic memory sizes and processor counts, building this many
switches becomes extremely expensive and impractical.
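For illustration (hypothetical but typical numbers): with p = 32 processors and m = 2^30 memory words (a 4 GB memory of 4-byte words), the switching network would need on the order of 32 × 2^30 ≈ 3.4 × 10^10 switching elements, which is why an EREW PRAM of realistic size is not built directly in hardware.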