HPC Unit 1 Solution

SRES’s

SHREE RAMCHANDRA COLLEGE OF ENGINEERING


Department of Computer Engineering
Lonikand, Pune – 412216

Ref. №: SRCOE/COMP/2023-24    Date: 28/01/24
Class: BE    Subject: High Performance Computing
Academic Year: 2023-2024    Semester: II

UNIT 1
Introduction to Parallel Computing
Q.1 What are the applications of Parallel Computing? [April 2023, 6 Marks]

Traditionally, software has been written for serial computation: it is run on a single computer having a single Central Processing Unit (CPU). A problem is broken into a discrete series of instructions, the instructions are executed one after another, and only one instruction may execute at any moment in time. In the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem. The computation is run using multiple CPUs: the problem is broken into discrete parts that can be solved concurrently, each part is further broken down into a series of instructions, and instructions from each part execute simultaneously on different CPUs.

Motivating Parallelism:
• Development of parallel software has traditionally been thought of as time- and effort-intensive.
• This can be largely attributed to the inherent complexity of specifying and coordinating concurrent tasks, and to the lack of portable algorithms, standardized environments, and software development toolkits.

1. The Computational Power Argument – from Transistors to FLOPS

In 1965, Gordon Moore made the following simple observation: "The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years. That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000."
2. The Memory/Disk Speed Argument

The overall speed of computation is determined not just by the speed of the processor, but
also by the ability of the memory system to feed data to it. While clock rates of high-end
processors have increased at roughly 40% per year over the past decade, DRAM access
times have only improved at the rate of roughly 10% per year over this interval.
• The overall performance of the memory system is determined by the fraction of the total memory requests that can be satisfied from the cache.

3. The Data Communication Argument

In many applications there are constraints on the location of data and/or resources across
the Internet. An example of such an application is mining of large commercial datasets
distributed over a relatively low bandwidth network. In such applications, even if the
computing power is available to accomplish the required task without resorting to parallel
computing, it is infeasible to collect the data at a central location. In these cases, the
motivation for parallelism comes not just from the need for computing resources but also
from the infeasibility or undesirability of alternate (centralized) approaches.
Q.2 Explain with suitable diagram SIMD architecture. [April 2023, 4 Marks]

SIMD stands for 'Single Instruction and Multiple Data Stream'. It represents an organization that includes many processing units under the supervision of a common control unit. All processors receive the same instruction from the control unit but operate on different items of data. The shared memory unit must contain multiple modules so that it can communicate with all the processors simultaneously.

SIMD is short for Single Instruction/Multiple Data, while the term SIMD operations refers to a computing method that enables processing of multiple data items with a single instruction. In contrast, the conventional sequential approach of using one instruction to process each individual data item is called scalar operation.

The current era of SIMD processors grew out of the desktop-computer market rather than the supercomputer market. As desktop processors became powerful enough to support real-time gaming and audio/video processing during the 1990s, demand grew for this particular type of computing power, and microprocessor vendors turned to SIMD to meet the demand.

Some of the earliest parallel computers, such as the Illiac IV, MPP, DAP, CM-2, and MasPar MP-1, belonged to this class of machines. Variants of this concept have found use in co-processing units such as the MMX units in Intel processors and DSP chips such as the Sharc. SIMD relies on the regular structure of computations (such as those in image processing). It is often necessary to selectively turn off operations on certain data items. For this reason, most SIMD programming paradigms allow for an "activity mask", which determines whether a processor should participate in a computation or not.
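As a brief illustration of the scalar-versus-SIMD distinction described above, the following C sketch adds two float arrays first with ordinary scalar code and then with x86 SSE intrinsics, where a single _mm_add_ps instruction operates on four data elements at once. This example is an addition to the original answer; it assumes an x86 processor with SSE support, and the function names and array size are invented for illustration.

```c
#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */

#define N 8              /* small example size, a multiple of 4 */

/* Scalar operation: one instruction processes one data element per iteration. */
static void add_scalar(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* SIMD operation: one _mm_add_ps instruction adds four packed floats at once. */
static void add_simd(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));
    }
}

int main(void) {
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    add_scalar(a, b, c, N);
    add_simd(a, b, c, N);    /* same result, fewer instructions executed */

    for (int i = 0; i < N; i++)
        printf("%.1f ", c[i]);
    printf("\n");
    return 0;
}
```

Compiled with a standard C compiler on x86-64, both versions produce the same result; the SIMD version simply issues one instruction per four elements instead of one per element.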

Q.3 Explain with suitable diagram MIMD architecture. [April 2023, 4 Marks]

MIMD stands for 'Multiple Instruction and Multiple Data Stream'.

In computing, multiple instruction, multiple data (MIMD) is a technique employed to achieve parallelism, and it is the most fundamental and well-known type of parallel processor. Machines using MIMD have a number of processors. In this organization, all processors in a parallel computer can execute different instructions and operate on different data at the same time: each processor has its own program, and an instruction stream is generated from each program. Both the shared memory programming paradigm and the distributed memory (message passing) programming model are used on MIMD architectures, and each model has its own set of benefits and drawbacks.

MIMD parallel architectures are made of multiple processors and multiple memory modules linked via some interconnection network. They fall into two broad types: shared memory and message passing. A shared memory system generally accomplishes interprocessor coordination through a global memory shared by all processors. These are frequently server systems that communicate through a bus and cache memory controller. The bus/cache architecture alleviates the need for expensive multi-ported memories and interface circuitry, as well as the need to adopt a message-passing paradigm when developing application software. Because access to shared memory is balanced, these systems are also called SMP (symmetric multiprocessor) systems: each processor has an equal opportunity to read/write to memory, including equal access speed.

• SIMD-MIMD Comparison
• SIMD computers require less hardware than MIMD computers (a single control unit).
• However, since SIMD processors are specially designed, they tend to be expensive and have long design cycles.
• Not all applications are naturally suited to SIMD processors.
• In contrast, platforms supporting the SPMD paradigm can be built from inexpensive off-the-shelf components with relatively little effort in a short amount of time.
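As a small added sketch (not part of the original answer), the following C program uses POSIX threads to illustrate the MIMD idea: two threads execute two different functions, i.e., two different instruction streams, on different data at the same time. The task names and data values are invented for the example.

```c
#include <stdio.h>
#include <pthread.h>

/* First instruction stream: sums an integer array. */
static void *sum_task(void *arg) {
    int *data = (int *)arg;
    long sum = 0;
    for (int i = 0; i < 4; i++)
        sum += data[i];
    printf("sum task: %ld\n", sum);
    return NULL;
}

/* Second instruction stream: scales a float array. */
static void *scale_task(void *arg) {
    float *data = (float *)arg;
    for (int i = 0; i < 4; i++)
        data[i] *= 2.0f;
    printf("scale task: %.1f %.1f %.1f %.1f\n",
           data[0], data[1], data[2], data[3]);
    return NULL;
}

int main(void) {
    int ints[4] = {1, 2, 3, 4};
    float floats[4] = {0.5f, 1.5f, 2.5f, 3.5f};
    pthread_t t1, t2;

    /* MIMD: each thread runs a different program on different data. */
    pthread_create(&t1, NULL, sum_task, ints);
    pthread_create(&t2, NULL, scale_task, floats);

    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```

Compile with, e.g., cc mimd.c -lpthread. An SPMD program, by contrast, would run the same program text on every processor and branch on the process or thread id.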

Q.4 Explain the impact of Memory Latency and Memory Bandwidth on system performance. [April 2023, 5 Marks]
It is very important to understand the difference between latency and bandwidth. Consider the example of a fire hose. If the water comes out of the hose two seconds after the hydrant is turned on, the latency of the system is two seconds. Once the water starts flowing, if the hydrant delivers water at the rate of 5 gallons/second, the bandwidth of the system is 5 gallons/second.

• If you want immediate response from the hydrant, it is important to reduce latency. If you want to fight big fires, you want high bandwidth.

• Memory Latency: An Example

Consider a processor operating at 1 GHz (1 ns clock) connected to a DRAM with a latency of 100 ns (no caches). Assume that the processor has two multiply-add units and is capable of executing four instructions in each cycle of 1 ns. The following observations follow: the peak processor rating is 4 GFLOPS. Since the memory latency is equal to 100 cycles and the block size is one word, every time a memory request is made, the processor must wait 100 cycles before it can process the data.

On the above architecture, consider the problem of computing a dot-product of two vectors. A dot-product computation performs one multiply-add on a single pair of vector elements, i.e., each floating point operation requires one data fetch. It follows that the peak speed of this computation is limited to one floating point operation every 100 ns, or a speed of 10 MFLOPS, a very small fraction of the peak processor rating.

• Improving Effective Memory Latency Using Caches

Caches are small and fast memory elements between the processor and DRAM. This memory acts as low-latency, high-bandwidth storage. If a piece of data is repeatedly used, the effective latency of the memory system can be reduced by the cache. The fraction of data references satisfied by the cache is called the cache hit ratio of the computation on the system. The cache hit ratio achieved by a code on a memory system often determines its performance.
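The role of the hit ratio can be made concrete with a small back-of-the-envelope calculation (an added sketch, not part of the original answer). Assuming a cache latency of 1 ns and the 100 ns DRAM latency from the example above, the average access time for hit ratio h is h × 1 ns + (1 − h) × 100 ns:

```c
#include <stdio.h>

/* Average memory access time for a simple two-level model:
 * hit_ratio of references are served by the cache, the rest by DRAM. */
static double avg_access_ns(double hit_ratio,
                            double cache_ns, double dram_ns) {
    return hit_ratio * cache_ns + (1.0 - hit_ratio) * dram_ns;
}

int main(void) {
    double cache_ns = 1.0;    /* assumed cache latency */
    double dram_ns  = 100.0;  /* DRAM latency from the example above */

    for (double h = 0.0; h <= 1.0; h += 0.25)
        printf("hit ratio %.2f -> average latency %6.2f ns\n",
               h, avg_access_ns(h, cache_ns, dram_ns));
    return 0;
}
```

A hit ratio of 0.75, for instance, already cuts the average latency from 100 ns to about 25.75 ns.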

• Impact of Memory Bandwidth

Memory bandwidth is determined by the bandwidth of the memory bus as well as the memory units. Memory bandwidth can be improved by increasing the size of memory blocks. The underlying system takes l time units (where l is the latency of the system) to deliver b units of data (where b is the block size).
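To make the effect of block size concrete (an added sketch, not part of the original answer), suppose each memory request costs one latency period l and returns a block of b words. Fetching n words then takes roughly ceil(n/b) × l time units, so larger blocks amortize the latency over more data, provided the program actually uses all b words of each block:

```c
#include <stdio.h>

/* Time to fetch n words when each request of block size b
 * costs one latency period l (in ns) and delivers b words. */
static double fetch_time_ns(long n, long b, double l) {
    long requests = (n + b - 1) / b;   /* ceiling of n / b */
    return requests * l;
}

int main(void) {
    long n = 1000;        /* words to fetch (example value) */
    double l = 100.0;     /* latency from the example above: 100 ns */

    for (long b = 1; b <= 8; b *= 2)
        printf("block size %ld word(s): %.0f ns, effective bandwidth "
               "%.3f words/ns\n", b, fetch_time_ns(n, b, l),
               n / fetch_time_ns(n, b, l));
    return 0;
}
```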

Q.5 Explain Message Passing Costs in parallel computers. [April 2023, 6 Marks]

Message-passing platforms comprise a set of processors, each with its own (exclusive) memory. Instances of such a view come naturally from clustered workstations and non-shared-address-space multicomputers. These platforms are programmed using (variants of) send and receive primitives. Shared-address-space platforms can easily emulate message passing; the reverse is more difficult to do (in an efficient manner).

The time taken to communicate a message between two nodes in a network is the
sum of the time to prepare a message for transmission and the time taken by the
message to traverse the network to its destination.

1. Startup time (ts): The startup time is the time required to handle a message
at the sending and receiving nodes. This includes the time to prepare the
message (adding header, trailer, and error correction information), the time
to execute the routing algorithm, and the time to establish an interface
between the local node and the router. This delay is incurred only once for a
single message transfer.
2. Per-hop time (th): After a message leaves a node, it takes a finite amount of
time to reach the next node in its path. The time taken by the header of a
message to travel between two directly-connected nodes in the network is
called the per-hop time. It is also known as node latency. The per-hop time
is directly related to the latency within the routing switch for determining
which output buffer or channel the message should be forwarded to.
3. Per-word transfer time (tw): If the channel bandwidth is r words per second,
then each word takes time tw = 1/r to traverse the link. This time is called the
per-word transfer time. This time includes network as well as buffering
overheads.

Store-and-Forward Routing

In store-and-forward routing, when a message is traversing a path with multiple links, each intermediate node on the path forwards the message to the next node only after it has received and stored the entire message.

Packet Routing

Store-and-forward routing makes poor use of communication resources: a message is sent from one node to the next only after the entire message has been received. In packet routing, the original message is broken into parts (for example, two equal-sized parts) before it is sent. In this case, an intermediate node waits for only half of the original message to arrive before passing it on, which increases the utilization of communication resources and reduces communication time. Breaking the message into four (or more) parts takes this idea a step further.

Cut-Through Routing

In interconnection networks for parallel computers, additional restrictions can be imposed on message transfers to further reduce the overheads associated with packet switching. By forcing all packets to take the same path, we can eliminate the overhead of transmitting routing information with each packet. By forcing in-sequence delivery, sequencing information can be eliminated. By associating error information at the message level rather than the packet level, the overhead associated with error detection and correction can be reduced. Finally, since error rates in interconnection networks for parallel machines are extremely low, lean error detection mechanisms can be used instead of expensive error correction schemes.
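The parameters ts, th, and tw defined earlier combine into the usual textbook cost models, which are not written out explicitly in the answer above: store-and-forward routing of an m-word message over l links costs approximately t_comm = ts + l (m tw + th), while cut-through routing costs t_comm = ts + l th + m tw, because only the header pays the per-hop cost. A small C sketch comparing the two under arbitrarily chosen parameter values:

```c
#include <stdio.h>

/* Store-and-forward: the whole m-word message is stored and forwarded
 * at each of the l hops, so the per-hop cost (th + m*tw) is paid l times. */
static double t_store_forward(double ts, double th, double tw,
                              double m, double l) {
    return ts + l * (m * tw + th);
}

/* Cut-through: only the header pays the per-hop cost; the message body
 * is pipelined through the network, so m*tw is paid only once. */
static double t_cut_through(double ts, double th, double tw,
                            double m, double l) {
    return ts + l * th + m * tw;
}

int main(void) {
    /* Example parameter values (arbitrary; times in microseconds). */
    double ts = 10.0, th = 1.0, tw = 0.5;
    double m = 1000.0;   /* message size in words */
    double l = 8.0;      /* number of links on the path */

    printf("store-and-forward: %.1f us\n", t_store_forward(ts, th, tw, m, l));
    printf("cut-through      : %.1f us\n", t_cut_through(ts, th, tw, m, l));
    return 0;
}
```

For long messages the store-and-forward cost grows like l·m·tw, whereas the cut-through cost grows like m·tw, which is why cut-through (and packet) routing makes much better use of the network.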

Q.6 Describe Uniform-Memory-Access and Non-Uniform-Memory-Access with diagrammatic representation. [April 2023, 6 Marks]

Typical shared-address-space architectures: (a) uniform-memory-access shared-address-space computer; (b) uniform-memory-access shared-address-space computer with caches and memories; (c) non-uniform-memory-access shared-address-space computer with local memory only.

Part of the memory is accessible to all processors. Processors interact by modifying data objects stored in this shared address space. If the time taken by a processor to access any memory word in the system (global or local) is identical, the platform is classified as a uniform memory access (UMA) machine; otherwise it is a non-uniform memory access (NUMA) machine. This distinction between NUMA and UMA platforms is important from the point of view of algorithm design: NUMA machines require locality from the underlying algorithms for performance. Programming these platforms is easier since reads and writes are implicitly visible to other processors; however, read/write access to shared data must be coordinated (this is discussed in greater detail in the context of threads programming).
• Caches in such machines require coordinated access to multiple copies. This leads to the cache coherence problem. A weaker model of these machines provides an address map, but not coordinated access. These models are called non-cache-coherent shared-address-space machines.
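As a small added sketch (not from the original answer) of the statement that read/write access to shared data must be coordinated, the following C/OpenMP fragment increments a counter that lives in the shared address space from all threads; the atomic directive provides the coordination, and removing it would let the concurrent read-modify-write operations produce incorrect results:

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    long counter = 0;               /* shared data object in the address space */
    const long per_thread = 100000;

    /* Every thread reads and writes the same shared variable, so the
     * update must be coordinated; 'atomic' serializes each increment. */
    #pragma omp parallel
    {
        for (long i = 0; i < per_thread; i++) {
            #pragma omp atomic
            counter++;
        }
    }

    printf("threads: %d, counter: %ld\n", omp_get_max_threads(), counter);
    return 0;
}
```

Compile with OpenMP enabled (e.g., gcc -fopenmp); the final value should equal the number of threads times 100000.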
