HPC Unit 1 Solution
UNIT 1
Introduction to Parallel Computing
Q.1 What are the applications of Parallel Computing? [April 2023, 6 Marks]
Traditionally, software has been written for serial computation: to be run on a single computer having a single Central Processing Unit (CPU). A problem is broken into a discrete series of instructions, the instructions are executed one after another, and only one instruction may execute at any moment in time. In the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem: it is run using multiple CPUs, the problem is broken into discrete parts that can be solved concurrently, each part is further broken down into a series of instructions, and the instructions from each part execute simultaneously on different CPUs.
Motivating Parallelism:
• Development of parallel software has traditionally been thought of as time and
effort intensive.
1. The Computational Power Argument
In 1965, Gordon Moore made the following simple observation: "The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years. That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000." This trend in transistor counts (later revised to a doubling roughly every 18 months) keeps increasing the raw computational power available on a chip, and parallelism is the natural way to convert the growing number of transistors into usable performance.
2. The Memory/Disk Speed Argument
The overall speed of computation is determined not just by the speed of the processor, but
also by the ability of the memory system to feed data to it. While clock rates of high-end
processors have increased at roughly 40% per year over the past decade, DRAM access
times have only improved at the rate of roughly 10% per year over this interval.
• The overall performance of the memory system is determined by the fraction of the
total memory requests that can be satisfied from the cache (the cache hit ratio).
3. The Data Communication Argument
In many applications there are constraints on the location of data and/or resources across
the Internet. An example of such an application is mining of large commercial datasets
distributed over a relatively low bandwidth network. In such applications, even if the
computing power is available to accomplish the required task without resorting to parallel
computing, it is infeasible to collect the data at a central location. In these cases, the
motivation for parallelism comes not just from the need for computing resources but also
from the infeasibility or undesirability of alternate (centralized) approaches.
Q.2 Explain with suitable diagram SIMD architecture. [April 2023, 4 Marks]
SIMD stands for 'Single Instruction and Multiple Data Stream'. It represents an
organization that includes many processing units under the supervision of a
common control unit. All processors receive the same instruction from the control
unit but operate on different items of data. The shared memory unit must contain
multiple modules so that it can communicate with all the processors simultaneously.
SIMD is short for Single Instruction/Multiple Data, while the term SIMD
operations refers to a computing method that enables processing of multiple data
with a single instruction. In contrast, the conventional sequential approach using
one instruction to process each individual data is called scalar operations.
The current era of SIMD processors grew out of the desktop-computer market
rather than the supercomputer market. As desktop processors became powerful
enough to support real-time gaming and audio/video processing during the 1990s,
demand grew for this particular type of computing power, and microprocessor
vendors turned to SIMD to meet the demand.
Some of the earliest parallel computers, such as the Illiac IV, MPP, DAP, CM-2, and
MasPar MP-1 belonged to this class of machines.
Variants of this concept have found use in co-processing units such as the MMX
units in Intel processors and DSP chips such as the Sharc. SIMD relies on the
regular structure of computations (such as those in image processing). It is often
necessary to selectively turn off operations on certain data items. For this reason,
most SIMD programming paradigms allow for an “activity mask”, which
determines if a processor should participate in a computation or not.
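To make the activity-mask idea concrete, here is a minimal sketch in C (illustrative only; the array contents and mask values are assumed, not taken from the original answer). Every lane conceptually executes the same "add" instruction in lockstep, but only the lanes whose mask bit is set write their result back:

#include <stdio.h>

#define N 8

int main(void) {
    int data[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    int mask[N] = {1, 0, 1, 0, 1, 0, 1, 0};  /* activity mask: 1 = lane participates */

    /* All lanes execute the same instruction (add 10); masked-off lanes discard the result. */
    for (int i = 0; i < N; i++) {
        int result = data[i] + 10;           /* single instruction, multiple data */
        if (mask[i])
            data[i] = result;                /* write-back suppressed where mask is 0 */
    }

    for (int i = 0; i < N; i++)
        printf("%d ", data[i]);
    printf("\n");
    return 0;
}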
MIMD, short for Multiple Instruction and Multiple Data Stream, is the most general and widely used class of parallel processor. In this organization, all processors of a parallel computer can execute different instructions and operate on different data at the same time: each processor has its own program, and each program generates its own instruction stream. Both the shared-memory programming paradigm and the distributed-memory (message-passing) programming model are used on MIMD architectures, and each model has its own set of benefits and drawbacks.
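A minimal sketch of the MIMD idea using two POSIX threads (illustrative only; the task functions and data are assumed). Each thread executes a different instruction stream on its own data at the same time; compile with gcc -pthread:

#include <stdio.h>
#include <pthread.h>

/* First instruction stream: sum its own data. */
void *sum_task(void *arg) {
    int *v = (int *)arg, s = 0;
    for (int i = 0; i < 4; i++) s += v[i];
    printf("sum = %d\n", s);
    return NULL;
}

/* Second instruction stream: find the maximum of different data. */
void *max_task(void *arg) {
    int *v = (int *)arg, m = v[0];
    for (int i = 1; i < 4; i++) if (v[i] > m) m = v[i];
    printf("max = %d\n", m);
    return NULL;
}

int main(void) {
    int a[4] = {1, 2, 3, 4}, b[4] = {7, 5, 9, 2};
    pthread_t t1, t2;
    pthread_create(&t1, NULL, sum_task, a);  /* different program ... */
    pthread_create(&t2, NULL, max_task, b);  /* ... on different data */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}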
• SIMD-MIMD Comparison
• SIMD computers require less hardware than MIMD computers (single control
unit).
Q.4 Explain the impact of Memory Latency and Memory Bandwidth on system performance. [April 2023, 5 Marks]
It is very important to understand the difference between latency and bandwidth.
Consider the example of a fire-hose. If the water comes out of the hose two seconds
after the hydrant is turned on, the latency of the system is two seconds. Once the
water starts flowing, if the hydrant delivers water at the rate of 5 gallons/second, the
bandwidth of the system is 5 gallons/second.
Consider, for example, a processor operating at 1 GHz (1 ns clock), capable of executing four instructions per cycle, connected to a DRAM with a latency of 100 ns and a block size of one word. The peak processor rating is 4 GFLOPS. Since the memory latency is equal to 100 cycles and the block size is one word, every time a memory request is made the processor must wait 100 cycles before it can process the data. On this architecture, consider the problem of computing a dot-product of two vectors.
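Working this through (using the figures above): a dot-product performs one multiply-add on each pair of vector elements, so every floating-point operation requires one data fetch from memory. The computation is therefore limited to roughly one floating-point operation every 100 ns, i.e. about 10 MFLOPS, which is only a small fraction of the 4 GFLOPS peak rating. In other words, memory latency, rather than processor speed, bounds the achievable performance.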
Caches are small and fast memory elements between the processor and DRAM.
This memory acts as low-latency high-bandwidth storage. If a piece of data is
repeatedly used, the effective latency of this memory system can be reduced by the
cache. The fraction of data references satisfied by the cache is called the cache hit
ratio of the computation on the system. The cache hit ratio achieved by a code on a memory system often determines its overall performance.
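As a rough illustration of why the hit ratio matters (the latency figures here are assumed for illustration, not taken from the original answer): with a 1 ns cache, a 100 ns DRAM, and a hit ratio h, the average access latency is approximately h × 1 ns + (1 − h) × 100 ns. At h = 0.9 this is about 10.9 ns, while at h = 0.5 it is about 50.5 ns, so even a modest drop in hit ratio degrades effective memory performance several-fold.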
Q.5 Explain Message Passing Costs in Parallel Computers. [April 2023, 6 Marks]
These platforms comprise a set of processors, each with its own (exclusive) memory. Instances of such a view come naturally from clustered workstations and non-shared-address-space multicomputers. These platforms are programmed using (variants of) send and receive primitives. Shared-address-space platforms can easily
emulate message passing. The reverse is more difficult to do (in an efficient
manner).
The time taken to communicate a message between two nodes in a network is the
sum of the time to prepare a message for transmission and the time taken by the
message to traverse the network to its destination.
1. Startup time (ts): The startup time is the time required to handle a message
at the sending and receiving nodes. This includes the time to prepare the
message (adding header, trailer, and error correction information), the time
to execute the routing algorithm, and the time to establish an interface
between the local node and the router. This delay is incurred only once for a
single message transfer.
2. Per-hop time (th): After a message leaves a node, it takes a finite amount of
time to reach the next node in its path. The time taken by the header of a
message to travel between two directly-connected nodes in the network is
called the per-hop time. It is also known as node latency. The per-hop time
is directly related to the latency within the routing switch for determining
which output buffer or channel the message should be forwarded to.
3. Per-word transfer time (tw): If the channel bandwidth is r words per second,
then each word takes time tw = 1/r to traverse the link. This time is called the
per-word transfer time. This time includes network as well as buffering
overheads.
Store-and-Forward Routing
In store-and-forward routing, when a message is traversing a path with multiple
links, each intermediate node on the path forwards the message to the next node
after it has received and stored the entire message.
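Putting the three terms together (the standard cost model, stated here in the notation ts, th, and tw defined above): for a message of m words traversing l links with store-and-forward routing, each of the l hops costs th + m*tw, so the total communication time is
tcomm = ts + (m*tw + th) * l.
Since th is usually small compared with m*tw, this is often simplified to tcomm = ts + l*m*tw.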
Packet Routing
In packet routing, a message is broken into packets of fixed size that are pipelined through the network. Because different packets may follow different paths, each packet must carry its own routing, sequencing, and error-checking information, which adds per-packet overhead but makes better use of communication links than store-and-forward routing of the whole message.
Cut-Through Routing
Cut-through routing takes this idea further: the message is divided into fixed-size units called flow control digits (flits), all flits follow the same path established by a tracer, and an intermediate node forwards each flit to the next node as soon as it arrives, without buffering the entire message. Because all flits take the same route, routing and sequencing information need not be carried in every unit, which lowers the overhead.
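Under the same notation (a sketch of the resulting cost model): for a message of m words traversing l links with cut-through routing, the header pays the per-hop cost on every link while the remaining flits are pipelined behind it, giving
tcomm = ts + l*th + m*tw.
Because the m*tw term is no longer multiplied by the number of links l, cut-through routing is considerably cheaper than store-and-forward routing for long paths and large messages.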