Parallel Processing Assignment 1
INSTITUTE OF TECHNOLOGY
FACULTY OF INFORMAT
COMPUTER SCIENCE DEPARTMENT
Introduction
1. What is Parallel Computing?
Parallel computing is a type of computation in which many calculations or the execution of processes are
carried out simultaneously. Large problems can often be divided into smaller ones, which can then be
solved at the same time. There are several different forms of parallel computing: bit-level, instruction-
level, data, and task parallelism. Parallelism has long been employed in high-performance computing, but
it's gaining broader interest due to the physical constraints preventing frequency scaling. As power
consumption (and consequently heat generation) by computers has become a concern in recent years,
parallel computing has become the dominant paradigm in computer architecture, mainly in the form of
multi-core processors.
Parallel computing is closely related to concurrent computing—they are frequently used together, and
often conflated, though the two are distinct: it is possible to have parallelism without concurrency (such
as bit-level parallelism), and concurrency without parallelism (such as multitasking by time-sharing on a
single-core CPU). In parallel computing, a computational task is typically broken down into several, often
many, very similar sub-tasks that can be processed independently and whose results are combined
afterwards, upon completion. In contrast, in concurrent computing, the various processes often do not
address related tasks; when they do, as is typical in distributed computing, the separate tasks may have a
varied nature and often require some inter-process communication during execution.
Parallel computers can be roughly classified according to the level at which the hardware supports
parallelism, with multi-core and multi-processor computers having multiple processing elements within a
single machine, while clusters, MPPs, and grids use multiple computers to work on the same task.
Specialized parallel computer architectures are sometimes used alongside traditional processors, for
accelerating specific tasks.
Today, parallel computing is becoming mainstream based on multi-core processors. Most desktop and
laptop systems now ship with dual-core microprocessors, with quad-core processors readily available.
Chip manufacturers have begun to increase overall processing performance by adding additional CPU
cores. The reason is that increasing performance through parallel processing can be far more energy-
efficient than increasing microprocessor clock frequencies. In a world which is increasingly mobile and
energy conscious, this has become essential. Fortunately, the continued transistor scaling predicted by
Moore’s Law will allow for a transition from a few cores to many.
Parallel Software
The software world has been a very active part of the evolution of parallel computing. Parallel programs
have been harder to write than sequential ones. A program that is divided into multiple concurrent tasks is
more difficult to write, due to the necessary synchronization and communication that needs to take place
between those tasks. Some standards have emerged. For MPPs and clusters, a number of application
programming interfaces converged to a single standard called MPI by the mid-1990s. For shared
memory multiprocessor computing, a similar process unfolded with convergence around two standards by
the mid to late 1990s: pthreads and OpenMP. In addition to these, a multitude of competing parallel
programming models and languages have emerged over the years. Some of these models and languages
may provide a better solution to the parallel programming problem than the above “standards”, all of
which are modifications to conventional, non-parallel languages like C.
As multi-core processors bring parallel computing to mainstream customers, the key challenge in
computing today is to transition the software industry to parallel programming. The long history of
parallel software has not revealed any “silver bullets,” and indicates that there will not likely be any
single technology that will make parallel software ubiquitous. Doing so will require broad collaborations
across industry and academia to create families of technologies that work together to bring the power of
parallel computing to future mainstream applications. The changes needed will affect the entire industry,
from consumers to hardware manufacturers and from the entire software development infrastructure to
application developers who rely upon it.
Future capabilities such as photorealistic graphics, computational perception, and machine learning rely
heavily on highly parallel algorithms. Enabling these capabilities will advance a new generation of
experiences that expand the scope and efficiency of what users can accomplish in their digital lifestyles
and work place. These experiences include more natural, immersive, and increasingly multi-sensory
interactions that offer multi-dimensional richness and context awareness. The future for parallel
computing is bright, but with new opportunities come new challenges.
Performance Metrics for Parallel Systems
It is important to study the performance of parallel programs with a view to determining the best algorithm,
evaluating hardware platforms, and examining the benefits from parallelism. A number of metrics have been used
based on the desired outcome of performance analysis.
1 Execution Time
The serial runtime of a program is the time elapsed between the beginning and the end of its execution on a
sequential computer. The parallel runtime is the time that elapses from the moment a parallel computation starts to
the moment the last processing element finishes execution. We denote the serial runtime by TS and the parallel
runtime by TP.
2 Total Parallel Overhead
The total time spent in solving a problem, summed over all processing elements, is pTP. Of this, TS units are spent
performing useful work, and the remainder is overhead. Therefore, the overhead function (To) is given by

To = pTP - TS    (Equation 1)
3 Speedup
When evaluating a parallel system, we are often interested in knowing how much performance
gain is achieved by parallelizing a given application over a sequential implementation. Speedup
is a measure that captures the relative benefit of solving a problem in parallel. It is defined as the
ratio of the time taken to solve a problem on a single processing element to the time required to
solve the same problem on a parallel computer with p identical processing elements. We denote
speedup by the symbol S.
4 Efficiency
Only an ideal parallel system containing p processing elements can deliver a speedup equal to p.
In practice, ideal behavior is not achieved because while executing a parallel algorithm, the
processing elements cannot devote 100% of their time to the computations of the algorithm. For
example, when computing the sum of n numbers in parallel, part of the time required by the
processing elements is spent idling (and communicating in real systems). Efficiency is a measure of the fraction of time for which a
processing element is usefully employed; it is defined as the ratio of speedup to the number of
processing elements. In an ideal parallel system, speedup is equal to p and efficiency is equal to
one. In practice, speedup is less than p and efficiency is between zero and one, depending on the
effectiveness with which the processing elements are utilized. We denote efficiency by the
symbol E.
5 Cost
We define the cost of solving a problem on a parallel system as the product of parallel runtime and the number of
processing elements used. Cost reflects the sum of the time that each processing element spends solving the
problem. Efficiency can also be expressed as the ratio of the execution time of the fastest known sequential
algorithm for solving a problem to the cost of solving the same problem on p processing elements.
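The metrics defined above can be tied together in a minimal numeric sketch, assuming illustrative runtimes TS = 100, TP = 30 and p = 4 processing elements (values invented for the example, not taken from the text):

```python
# Minimal sketch of the performance metrics defined above.
# Ts = serial runtime, Tp = parallel runtime, p = processing elements.

def overhead(ts, tp, p):
    # Overhead function: To = p*Tp - Ts (Equation 1)
    return p * tp - ts

def speedup(ts, tp):
    # Speedup: S = Ts / Tp
    return ts / tp

def efficiency(ts, tp, p):
    # Efficiency: E = S / p, between zero and one in practice
    return speedup(ts, tp) / p

def cost(tp, p):
    # Cost (processor-time product): p * Tp
    return p * tp

ts, tp, p = 100.0, 30.0, 4          # illustrative values only
print(overhead(ts, tp, p))          # 20.0
print(speedup(ts, tp))              # approx. 3.33
print(efficiency(ts, tp, p))        # approx. 0.83
print(cost(tp, p))                  # 120.0
```

Note that cost (120) exceeds TS (100) by exactly the overhead (20), and efficiency equals TS divided by cost, consistent with the definitions above.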
The cost of solving a problem on a single processing element is the execution time of the fastest known sequential
algorithm. A parallel system is said to be cost-optimal if the cost of solving a problem on a parallel computer has the
same asymptotic growth (in Θ terms) as a function of the input size as the fastest-known sequential algorithm on a
single processing element. Since efficiency is the ratio of sequential cost to parallel cost, a cost-optimal parallel
system has an efficiency of Θ(1).
Cost is sometimes referred to as work or processor-time product, and a cost-optimal system is also known as a pTP -
optimal system.
Much like the graphical user interfaces on modern computer systems, there's a lot more going on
"under the hood" than the simplicity of the interface would imply. In my article
on multithreading, super threading and hyper-threading, I talked about the various ways in which
the OS and processor collaborate to fool the user into thinking that he or she is executing
multiple programs at once. There's a similar sort of trickery that goes on beneath the
programming model in a modern microprocessor, but it's intended to fool the programmer into
thinking that there's only one thing going on at a time, when really there are multiple things
happening simultaneously. Let me explain.
Back in the days when you could fit only a few transistors on a single die, many of the parts of
the programming model actually fit on separate chips attached to a single circuit board. For
instance, one chip would contain the ALU, another the control unit, another the registers, etc.
Such computers were obviously quite slow, and the fact that they were made of multiple chips
made them expensive. Each chip had its own manufacturing and packaging costs, and then there
was the cost and complexity of putting them all together on a single circuit board, so the fewer
chips you put on a board, the cheaper the overall system was. (Note that this is still true today.
The cost of producing systems and components can be drastically reduced by packing the
functionality of multiple chips into a single chip.)
With the advent of the Intel 4004 in 1971, all of that changed. The 4004 was the world's first
microprocessor on a chip. Designed to be the brains of a calculator manufactured by a now-
defunct company named Busicom, the 4004 had sixteen 4-bit registers, an ALU, and decoding and
control logic, all packed onto a single 2,300-transistor chip. The 4004 was quite a feat for its day,
and it paved the way for the PC revolution. However, it wasn't until Intel released the 8080 four
years later that the world saw the first true general-purpose CPU.
During the decades following the 4004, transistor densities increased at a stunning pace. As CPU
designers had more and more transistors to work with when designing new chips, they began to
think up novel ways for using those transistors to increase computing performance on application
code. One of the first things that occurred to designers was that they could put more than one
ALU on a chip, and have both ALUs working in parallel to process code faster. Since these designs
could do more than one scalar (or integer, for our purposes) operation at once, they were
called superscalar computers. The RS6000 from IBM was released in 1990 and was the world's
first superscalar CPU. Intel followed in 1993 with the Pentium, which with its two ALUs
brought the x86 world into the superscalar era.
Definition of SIMD
In a SIMD (Single Instruction, Multiple Data) machine, all execution units accept the same
instruction from the control unit but operate on separate elements of data. The shared memory
unit is divided into modules so that it can interact with all the processors simultaneously.
Definition of MIMD
In a MIMD (Multiple Instruction, Multiple Data) machine, each processor executes its own
instruction stream on its own data, independently of the other processors.
Difference between SIMD and MIMD
1. SIMD stands for Single Instruction, Multiple Data, while MIMD stands for Multiple Instruction, Multiple Data.
2. SIMD requires less memory, while MIMD requires more memory.
3. SIMD costs less than MIMD.
Difference between UMA and NUMA
Definition of UMA
UMA (Uniform Memory Access) is a shared-memory architecture for multiprocessors. In this
model, a single memory is used and accessed by all the processors present in the multiprocessor
system with the help of an interconnection network. Each processor has equal memory access
time (latency) and access speed. The interconnection can employ a single bus, multiple buses, or
a crossbar switch. As it provides balanced shared-memory access, such a system is also known as
an SMP (Symmetric Multiprocessor) system.
Definition of NUMA
NUMA (Non-Uniform Memory Access) is also a multiprocessor model, in which each processor
is connected to its own dedicated memory. However, these small parts of memory combine to
form a single address space. The key point is that, unlike UMA, the memory access time depends
on the distance between the processor and the memory being accessed, which means memory
access time varies. A processor can access any memory location by using its physical address.
1. Definition: UMA stands for Uniform Memory Access. NUMA stands for Non-Uniform Memory Access.
2. Memory controller: UMA has a single memory controller. NUMA has multiple memory controllers.
3. Memory access: UMA memory access is slower. NUMA memory access is faster than UMA memory access.
4. Bandwidth: UMA has limited bandwidth. NUMA has more bandwidth than UMA.
5. Suitability: UMA is used in general-purpose and time-sharing applications. NUMA is used in real-time and time-critical applications.
6. Memory access time: UMA has equal memory access time. NUMA has varying memory access time.
7. Bus types: UMA supports three bus types: single, multiple, and crossbar. NUMA supports two bus types: tree and hierarchical.
Difference between Static and Dynamic Networks in parallel systems
Static interconnection networks
Static interconnection networks for elements of parallel systems (ex. processors, memories) are
based on fixed connections that cannot be modified without a physical re-designing of a system.
Static interconnection networks can have many structures such as a linear structure (pipeline), a
matrix, a ring, a torus, a complete connection structure, a tree, a star, a hyper-cube.
Definition of DRAM
DRAM (Dynamic Random Access Memory) is also a type of RAM, constructed using capacitors
and a few transistors. The capacitor is used for storing the data, where a bit value of 1 signifies
that the capacitor is charged and a bit value of 0 means that the capacitor is discharged.
Comparison of SRAM and DRAM
1. Cost: SRAM is expensive. DRAM is cheap.
2. Used in: SRAM is used in cache memory. DRAM is used in main memory.
3. Construction: SRAM is complex and uses transistors and latches. DRAM is simple and uses capacitors and very few transistors.
4. Charge leakage: Not present in SRAM. Present in DRAM: the capacitor tends to discharge, which results in leaking of charges, so DRAM requires power-refresh circuitry.
RISC Architecture
The term RISC stands for "Reduced Instruction Set Computer". It is a CPU design approach
based on simple instructions that execute quickly.
A RISC machine has a small, reduced set of instructions, and each instruction is expected to
accomplish a very small job. The instruction sets are modest and simple, and more complex
operations are composed from them. Each instruction is about the same length, so instructions
can be strung together to get compound tasks done. Most instructions complete in one machine
cycle, which allows pipelining, a crucial technique used to speed up RISC machines.
CISC Architecture
The term CISC stands for "Complex Instruction Set Computer". It is a CPU design approach
based on single instructions that are capable of executing multi-step operations.
CISC computers have small programs but a huge number of complex instructions, which take a
long time to perform. Here, a single instruction is executed in several steps, and an instruction
set may have more than 300 separate instructions. Most instructions are completed in two to ten
machine cycles. In CISC, instruction pipelining is not easily implemented.
Assignment 2
Problem 2.1: A 40-MHz processor was used to execute a benchmark program with the following
instruction mix and clock cycle counts (type names were not preserved in the extracted table;
counts and cycles are as used in the solution below):

Instruction type    Instruction count    Clock cycle count
Type 1              45,000               1
Type 2              32,000               2
Type 3              15,000               2
Type 4              8,000                2

Determine the effective CPI, MIPS rate, and execution time for this program.
Solution
CPI

CPI = ( Σ i=1..n (CPIi × Ii) ) / IC

Here,
CPI = cycles per instruction
IC = instruction count

CPI = ((1 × 45,000) + (2 × 32,000) + (2 × 15,000) + (2 × 8,000)) / 100,000
    = (45,000 + 64,000 + 30,000 + 16,000) / 100,000
    = 155,000 / 100,000
CPI = 1.55 (ans.)
MIPS RATE

MIPS = IC / (T × 10^6)
     = IC / (IC × CPI × (1/f) × 10^6)
     = f / (CPI × 10^6)

MIPS = (40 × 10^6) / (1.55 × 10^6)
     = 25.8064 (ans.)
EXECUTION TIME

T = IC × CPI × (1/f)
  = 100,000 × 1.55 × 1/(40 × 10^6)
  = 0.003875 s (ans.)
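The arithmetic in the solution above can be re-checked with a short script, using the instruction counts and per-type cycle counts from the solution:

```python
# Re-check of Problem 2.1 using the instruction mix from the solution.
counts = [45000, 32000, 15000, 8000]  # instruction count per type
cycles = [1, 2, 2, 2]                 # clock cycles per instruction type
f = 40e6                              # 40-MHz clock

ic = sum(counts)                                    # total instruction count
total_cycles = sum(n * c for n, c in zip(counts, cycles))
cpi = total_cycles / ic                             # effective CPI
mips = f / (cpi * 1e6)                              # MIPS rate
t = ic * cpi / f                                    # execution time in seconds

print(ic)    # 100000
print(cpi)   # 1.55
print(mips)  # approx. 25.806
print(t)     # 0.003875
```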
Problem 2.2: What is pipelining? Describe the speed up gain due to pipelining.
What are the various factors that affect Throughput of an Instruction Pipeline?
What is a Pipeline (computing)?
In computing, a pipeline, also known as a data pipeline, is a set of data processing elements
connected in series, where the output of one element is the input of the next one. The elements of
a pipeline are often executed in parallel or in time-sliced fashion. Some amount of buffer storage
is often inserted between elements.
Computer-related pipelines include:
Instruction pipelines, such as the classic RISC pipeline, which are used in central processing
units (CPUs) and other microprocessors to allow overlapping execution of multiple instructions
with the same circuitry. The circuitry is usually divided up into stages and each stage processes a
specific part of one instruction at a time, passing the partial results to the next stage. Examples of
stages are instruction decode, arithmetic/logic and register fetch. They are related to the
technologies of superscalar execution, operand forwarding, speculative execution and out-of-
order execution.
Graphics pipelines, found in most graphics processing units (GPUs), which consist of
multiple arithmetic units, or complete CPUs, that implement the various stages of
common rendering operations (perspective projection, window clipping, color and light
calculation, rendering, etc.).
Software pipelines, which consist of a sequence of computing processes (commands,
program runs, tasks, threads, procedures, etc.), conceptually executed in parallel, with the
output stream of one process being automatically fed as the input stream of the next one.
The Unix system call pipe is a classic example of this concept.
HTTP pipelining, the technique of issuing multiple HTTP requests through the same TCP
connection, without waiting for the previous one to finish before issuing a new one.
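On the speedup gain the problem asks about: a standard textbook estimate (assumed here, since the formula is not given in the text above) is that n instructions on a k-stage pipeline take k + (n - 1) cycles instead of n × k unpipelined cycles, so the speedup approaches k as n grows. A minimal sketch:

```python
# Classic k-stage pipeline speedup estimate (standard textbook model,
# assumed here): n instructions take k + (n - 1) cycles pipelined
# versus n * k cycles unpipelined.

def pipeline_speedup(n, k):
    return (n * k) / (k + n - 1)

print(pipeline_speedup(1, 5))     # 1.0 (a single instruction gains nothing)
print(pipeline_speedup(10, 5))    # approx. 3.57
print(pipeline_speedup(1000, 5))  # approx. 4.98, approaching k = 5
```

In practice, throughput falls short of this ideal because of the factors the question mentions: stalls from data and control hazards, memory access delays, and unequal stage latencies.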
Problem 2.3: What is cache mapping? How does the choice of mapping affect the speed of
parallel task execution?
Cache Mapping
Cache mapping defines how a block from the main memory is mapped to the cache
memory in case of a cache miss.
OR
Cache mapping is a technique by which the contents of main memory are brought into
the cache memory.
NOTES
1. Main memory is divided into equal-size partitions called blocks or frames.
2. Cache memory is divided into partitions of the same size as the blocks, called lines.
3. During cache mapping, a block of main memory is simply copied to the cache; the block is
not actually removed from the main memory.
Cache Mapping Techniques
Cache mapping is performed using the following three techniques:
1. Direct Mapping
2. Fully Associative Mapping
3. K-way Set Associative Mapping
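As an illustrative sketch of technique 1 (direct mapping), assuming a made-up 8-line cache: each main-memory block maps to exactly one cache line (block number mod number of lines), and a tag records which block currently occupies that line.

```python
# Illustrative direct-mapping sketch (the cache size is a made-up example).
NUM_LINES = 8  # hypothetical number of cache lines

def cache_line(block_number, num_lines=NUM_LINES):
    # In direct mapping, each block can live in exactly one line.
    return block_number % num_lines

def tag(block_number, num_lines=NUM_LINES):
    # The tag identifies which of the blocks sharing this line is cached.
    return block_number // num_lines

# Blocks 3, 11 and 19 all compete for the same line:
print([cache_line(b) for b in (3, 11, 19)])  # [3, 3, 3]
print([tag(b) for b in (3, 11, 19)])         # [0, 1, 2]
```

This is also where the choice of mapping affects speed: direct mapping gives the fastest hit check (one line, one tag comparison) but suffers conflict misses when blocks collide, while fully associative and set-associative mappings reduce those misses at the cost of comparing more tags per access.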